Adapting natural language processing for technical text

Despite recent dramatic successes, natural language processing (NLP) is not ready to address a variety of real-world problems. Its reliance on large standard corpora, a training and evaluation paradigm that favors the learning of shallow heuristics, and large computational resource requirements make domain-specific application of even the most successful NLP techniques difficult. This paper proposes technical language processing (TLP), which brings engineering principles and practices to NLP specifically for the purpose of extracting actionable information from language generated by experts in the course of their technical tasks, systems, and processes. TLP envisages NLP as a socio-technical system rather than as an algorithmic pipeline. We describe how the TLP approach to meaning and generalization differs from that of NLP, how data quantity and quality can be addressed in engineering technical domains, and the potential risks of not adapting NLP for technical use cases. Engineering problems can benefit immensely from the knowledge held in unstructured data, which is currently inaccessible due to the limitations of out-of-the-box NLP packages. We illustrate the TLP approach using maintenance in industrial organizations as a case study.

For engineers and technical analysts wishing to use NLP as part of their analyses of technical processes, there is less reason to be optimistic. Despite impressive results with standard challenge data sets, an open question remains as to what state-of-the-art (SOTA) models are actually learning. 6 In particular, claims that NLP systems understand language or the meaning of text are overblown as evidenced by the failure of SOTA models to generalize learned knowledge in a human-like manner. 1,3,6,7 There is also concern that the current NLP training and evaluation paradigm naturally favors models for which large amounts of data are available. 3 This may not be an issue for academic or research NLP systems: they are often successful when trained on "standard" text that comes from, for example, English news wire and other literature. 8,9 However, text encountered in technical applications, such as in industrial operations, differs significantly from these benchmarks, causing performance of deployed NLP systems to drop, 8,10 often to unacceptable levels.
Despite the volume of text data in industrial engineering, it is in many ways a low-resource domain from the NLP perspective. The traditional machine learning response to such domains is transfer learning, in which models generated from annotated data in resource-rich domains are adapted for the low-resource domain. 11-13 However, these approaches often assume that the differences between two domains are constrained in particular ways. For example, the lexical, grammatical, and terminological differences between "standard" English and that found in industrial maintenance logs have spawned a whole set of domain-specific NLP adaptations that are largely outside of mainstream NLP. 9,14 The classical NLP goal of having computers attain human-like language abilities 3 may also bias NLP toward impressive but complex and resource-intensive technologies, while ignoring those that are more in line with practical engineering needs. 15,16 With all this in mind, we sought an approach that helps bridge the gap between the promise of NLP and the realities confronted in many technical domains.
Technical language processing (TLP) is our proposed human-in-the-loop, iterative approach to tailoring NLP tools to technical data that explicitly considers industrial engineering use cases as inputs along with the raw text (Figure 1). Our intention is to address the perceived shortcomings of applying standard NLP to technical text data. As an engineering discipline, TLP includes explicit notions of process and can catalog and disseminate successful patterns of application. The TLP process builds specialized resources from existing components, including NLP techniques such as tokenizers and embeddings. Some of the burden on domain experts is alleviated via computational support tools that elicit expert input when necessary. Analysts also benefit from TLP resources such as industry standards and technical dictionaries. TLP strives to improve its resources and computational support tools to reduce error and increase confidence in analyses through collaboration between analysts and domain experts. Community-driven TLP resource development is iterative and influenced by text analysis.
Our goal for this paper is to further argue for the creation of an NLP field that focuses on the technical text that appears in the computer-mediated communication used to support business processes within specialized domains. We will focus on industrial maintenance as our motivating example and consider the need for TLP when analyzing the text found in maintenance management systems. The remainder of this paper is organized as follows. We discuss maintenance, along with its records and text, and the challenges that they present in section 2. We then question in section 3 whether algorithms have the ability to generalize what they learn and show how TLP addresses this concern. Section 4 introduces two issues related to the use of large data sets and section 5 examines the problems with domain adaptation for technical text. We discuss the benefits of "fortuitous" data in section 6 and computational costs, along with TLP's strategies for mitigating them, in section 7. Using a set of ethical concerns, we present three general risks of applying existing NLP to technical text, and why we believe that TLP can help, in section 8. We close with section 9, a summary of how TLP addresses the challenges of applying NLP to technical text.

FIGURE 1 Technical language processing expands the system boundary beyond the traditional natural language processing (NLP) pipeline to include users, engineering use cases, technical language processing (TLP) resources such as dictionaries, as well as other "fortuitous" data sources (section 6) which aid in the interpretation of the primary text data

2 | WHAT IS MAINTENANCE?
The health and prosperity of a nation are built on its infrastructure: consider our roads, water and power networks, buildings, and manufacturing capacity. The assets that provide these services require maintenance, and the management of this maintenance is often an invisible process until something fails. 17 Asset maintenance involves a wide variety of stakeholders such as asset owners, operators, contractors, original equipment manufacturers, and specialist service providers. All these stakeholders keep their own records about assets. Maintenance records are created by maintenance technicians, engineers, and operators; collectively, we call this group "maintainers." Becoming a maintainer requires years of training: learning the language of engineering and maintenance and developing physical, chemical, structural, electrical, and digital knowledge of how assets and asset systems function, and how they fail. 18 Maintainers' training enables them to share information efficiently using common mental models, often codified in standards and standardized or well-known procedures. Maintainers use their expertise to describe the maintenance work they perform, usually in a free text format. Much information, especially about relationships, is implicit, and jargon and abbreviations are widely used. 19 The language of engineering and maintenance is challenging for nonmaintainers (and computers) to understand.

2.1 | Maintenance records and text
A maintenance work order (MWO) is created for every maintenance activity. It may be generated by a maintainer on noticing that an asset needs maintenance work, or by a computerized maintenance management system, in which case the original work order text would have been generated as semistructured text by a maintenance planner. 20 Examples of both are shown in Table 1. Hundreds, sometimes thousands, of MWOs are generated each month depending on the complexity of the organization. Currently, without NLP tools that are fit for purpose, all these MWO records need to be read by humans in order to be planned, scheduled, and executed. In the past, records were kept on paper; nowadays they are stored in unstructured text fields in relational database systems and spreadsheets. These MWO records are akin to medical records for an individual, 21 and are vital to efforts that estimate the reliability of the asset and potential for functional failures. However, there are a number of challenges in extracting knowledge from these texts.

2.2 | Maintenance text challenges
The text taken from maintenance management systems deviates from "standard" English in a number of ways. As shown in Table 1, the sample MWOs describe the state of an asset and/or the work that needs to be done. Work order description fields can usually be characterized as containing at least one verb such as "replace" to describe the desired action or a word such as "plugged" that describes the asset state. In general, such entities in MWO corpora are unbalanced, with a relatively small number of verbs describing maintenance work and the observed state and a large number of n-grams 22 used to describe the assets. As yet there is no widely agreed structure for named entity recognition for those seeking to create annotated data sets; a number of different named entity recognition classes are being used for MWO annotation, including item-activity-state 10 and item-problem/symptom-solution/action. 23,24

The familiar assumptions of NLP often mislead in the analysis of maintenance text. For example, although the overall number of maintenance records can be similar to the number of documents in an NLP corpus, the MWO texts tend to be much smaller (Table 2). Maintenance text itself is often more similar to shorthand notation than to standard English text. 9,10 Stop words, commonly removed in NLP, provide important context for the interpretation of MWOs. 14 As seen in Table 1, many of the words are domain-specific and most are abbreviations or acronyms, some created by specific individuals or groups of maintainers, 9,10,30 that are used inconsistently and interchangeably 14,31 and are not consistently marked with periods. 9 Words can be misspelled, omitted, or run together, and longer words are often contracted with sporadic apostrophes. 9,14 Unlike "standard" English, where each form of punctuation has a specific use, punctuation in maintenance data is typically used interchangeably to separate distinct ideas. 14

These lexical issues can lead to semantic ones. Multiple instances of parts, actions, and symptoms coexist in a single record and their correct associations must be established. 31 Many individual concepts are expressed using multiple words that must be parsed as a single unit to get the intended meaning. 14 The same concept can also be referred to in many different ways: "frontShockAbsorber," "shockAbsorbedFront," "shockFrtAbsorber," and "brakeAbsorber" all refer to the same part but are lexically inconsistent. 24 As a result of the many challenges associated with maintenance text, there have been domain adaptations, largely ad hoc, some of which we will discuss in section 5.
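To make the lexical inconsistency problem concrete, the sketch below canonicalizes raw MWO tokens with a small term dictionary. This is a minimal illustration under our own assumptions: the variants and canonical forms are invented, and a real TLP dictionary would be curated with maintainer input (see section 6) rather than hard-coded.

```python
import re

# Illustrative variant-to-canonical mappings; a production dictionary would
# be built and validated with domain experts, not hard-coded.
TERM_DICTIONARY = {
    "frontshockabsorber": "front_shock_absorber",
    "shockabsorbedfront": "front_shock_absorber",
    "shockfrtabsorber": "front_shock_absorber",
    "rplc": "replace",
}

def canonicalize(token: str) -> str:
    """Map a raw MWO token to its canonical form when one is known."""
    key = re.sub(r"[^a-z0-9]", "", token.lower())  # strip case and punctuation
    return TERM_DICTIONARY.get(key, token)

print(" ".join(canonicalize(t) for t in "rplc shockFrtAbsorber leaking".split()))
# -> replace front_shock_absorber leaking
```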

3 | DO ALGORITHMS UNDERSTAND?
The excitement of SOTA NLP is often conveyed with claims that these systems understand or capture the meaning of the text being analyzed. 1 But what does this really mean? The "symbol grounding problem," which occurs when symbols are interpreted based on other symbols in a circular fashion rather than on their meaning in the external world, is a concern when evaluating the ability of computational machines to understand the intrinsic meaning in language. 32 NLP systems operate under the distributional hypothesis: the words surrounding a word in question give clues to its meaning, and taken in aggregate, all of its contexts appear to give us what we seek. 2 This assumption may be especially unreliable in technical text, although techniques using contextual information have been developed in the automotive industry. 31

The ability to generalize, that is, for a model to behave as expected in novel situations beyond the training context, is closely related to the problem of meaning. 33 Challenges with proper generalization suggest that SOTA NLP systems are not able to meaningfully learn from their training data, as evidenced by inconsistent results when input data differs in distribution from training data and by the need for significant retraining to adapt models to new tasks. 3,34,35 Some of this semantic deficiency can be traced to the current training and evaluation paradigm, which does not encourage human-like generalization because the test data are drawn from the same distribution as the training data. 3 Under these conditions, many SOTA learning systems learn shallow heuristics that work for the training data instead of really learning the expected generalizations. 6,7,34-38 The current paradigm and shallow heuristics conspire to create models that are, in a sense, overfitted to particular data sets and lack the ability to generalize as their creators intended. 3,39 As a result, claims that these models offer a human-level capacity for real-world meaning and understanding are exaggerated. 1

TLP is an adaptation of, and firmly rooted in, NLP; nothing precludes the use of any and all useful NLP approaches. By expanding the system boundary beyond algorithms and data pipelines to include humans in the loop, we hope to overcome grounding issues.
Unlike NLP approaches that learn from text in an exclusively unsupervised fashion, 2 TLP allows and encourages iterative human intervention and supervision at every stage; we describe this aspect of TLP in detail in Reference 21. This connection to the outside world goes beyond merely interfacing with sensors, which some believe to be sufficient. 32 We see TLP as leveraging humans to provide a rich source of semantic information and meaningful action through their ability to discriminate among, manipulate, identify, describe, and respond to real-world objects, events, and states. This will allow us to inject meaning into analyses.
TLP can help tackle the problem of generalization by promoting the use and development of computational resources such as annotation tools that support hybrid datafication via artificial intelligence-assisted human tagging, where datafication refers to the process of structuring text information to facilitate the understanding of its context. 40 These NLP-based tools allow for the manual injection of real-world knowledge into the learning process by providing ontological information that can guide categorization and generalization. Two such tools, Nestor 41 and Redcoat, 42 allow for the tagging of short technical text, such as that found in MWO descriptions, with annotations that facilitate processing. Machine learning systems can then use these tags as a signal that promotes generalization by helping to mitigate the shallow heuristics and spurious correlations that could otherwise affect learning.
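As a sketch of what such human-supplied structure might look like, consider the hypothetical annotated MWO below. The tag names follow the item/activity/state classes discussed in section 2; the actual tag sets and storage formats used by Nestor and Redcoat may differ.

```python
# Hypothetical annotation for a single MWO description; the format is
# illustrative, not the native output of Nestor or Redcoat.
annotated_mwo = {
    "text": "replace leaking hyd pump",
    "tokens": ["replace", "leaking", "hyd", "pump"],
    "tags": ["ACTIVITY", "STATE", "ITEM", "ITEM"],  # one expert tag per token
    "aliases": {"hyd": "hydraulic"},                # expert-supplied expansion
}

# Downstream learners consume (token, tag) pairs as supervision, grounding
# the model in expert judgment rather than co-occurrence statistics alone.
training_example = list(zip(annotated_mwo["tokens"], annotated_mwo["tags"]))
print(training_example)
```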

4 | MORE DATA ISN'T THE ANSWER
We believe that the problem of learning shallow heuristics is further exacerbated by two issues associated with analyses of large amounts of data: spurious correlations and the low probability of regularities arising from the underlying phenomena of interest. 43 First, very large data sets will contain spurious correlations that exist solely because of the size of the data and not because of any other intrinsic property. Such correlations cannot be distinguished algorithmically from other types of correlations and can overwhelm detection of the "true correlations." Second, even though "true correlations" are the signals sought during analyses, the probability of regularities due to the underlying phenomena appearing in the data is low. The larger the data analyzed, the greater the chance that spurious correlations dominate the results and lead to erroneous conclusions.
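A toy numerical experiment (ours, not drawn from Reference 43) illustrates the first issue: with a fixed number of records, the largest correlation observed purely by chance grows with the number of random features examined.

```python
import numpy as np

rng = np.random.default_rng(0)
n_records = 200
target = rng.standard_normal(n_records)  # a target with no real predictors

def max_abs_corr(features: np.ndarray, target: np.ndarray) -> float:
    """Largest absolute Pearson correlation between any feature and the target."""
    f = (features - features.mean(1, keepdims=True)) / features.std(1, keepdims=True)
    t = (target - target.mean()) / target.std()
    return float(np.abs(f @ t / len(t)).max())

for n_features in (10, 1000, 50000):
    noise = rng.standard_normal((n_features, n_records))  # pure noise features
    print(f"{n_features:>6} features -> max |corr| = {max_abs_corr(noise, target):.2f}")

# The maximum chance correlation climbs (from roughly 0.1 toward 0.3 here)
# even though every feature is noise: scale alone manufactures "signal."
```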
The domain-specific preprocessing and data normalization steps in TLP help improve the visibility of system behaviors against their naturally noisy backdrop as well as reduce the opportunity for spurious correlations. Because NLP is focused on "standard" English, we believe that preprocessing and data normalization do not receive sufficient attention. In TLP, these issues become areas of active interest, and we expect that practitioners across different TLP domains will share and evaluate experiences and approaches. Over time, TLP will develop a systematic framework for preprocessing and normalization that can be easily adapted to new technical domains.

5 | WHAT ABOUT DOMAIN ADAPTATION?
Domain adaptation is a class of approaches that attempt to transfer learning from a task in a source domain with abundant annotated data to a similar task in a target domain with little or no annotated data. 12 An underlying assumption is that there exists a resource-rich domain that is similar enough to the low-resource domain; this is unclear for the technical domains that we are considering, such as maintenance. Adding to the uncertainty, the NLP literature sometimes equates domain adaptation with transfer learning, 13 lacks a consistent definition for the concept of a domain, and builds its notion of domain adaptation on assumptions that are unrealistic for technical text.
One such assumption is that syntactic structures and parts of speech (POS) are stable between two domains because they reflect intrinsic properties of a shared, clean natural language whose only differences are the appearance, roles, or distributions of certain domain-specific words. 12,13 Shared features can then be leveraged; for example, the POS tags of known shared words can be used to predict POS tags for unknown words.
There is then an expectation that NLP systems will work sufficiently well when trained either on annotated source domain data alone or on a small set of annotated data from the new domain combined with the annotated source domain data. 12 Normalizing the target domain's data to make it more closely resemble the data used to originally train the system also seems viable. 8 However, given the grammatical, spelling, and usage issues present in technical text, these approaches will likely not work in general, although they might be useful in some contexts. For maintenance, not only are typical NLP systems not suited, 9 but neither are standard domain adaptation techniques.
Like other technical domains, maintenance has seen a variety of bespoke domain-specific NLP adaptations appear in the literature. Out-of-the-box preprocessing pipelines require modifications. As part of their work with military aircraft maintenance, Bokinsky et al 14 and McKenzie et al 9 adapted Natural Language Toolkit functionality by introducing a token "sterilizer," which addressed the observed challenges of inconsistent punctuation, necessary punctuation, and token variants with no semantic difference through the injection of special rules: replacing all punctuation with a single special punctuation token and all tokens containing numbers with a single special code token.
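A minimal sketch in the spirit of that sterilizer appears below. The rules and token names are our own illustrations; the actual rules used in References 9 and 14 may differ.

```python
import re

PUNCT_TOKEN = "<punct>"  # stands in for all punctuation-only tokens
CODE_TOKEN = "<code>"    # stands in for tokens containing digits

def sterilize(tokens):
    """Collapse punctuation and number-bearing tokens into special tokens."""
    out = []
    for tok in tokens:
        if re.fullmatch(r"\W+", tok):   # punctuation-only token
            out.append(PUNCT_TOKEN)
        elif re.search(r"\d", tok):     # e.g. part or serial numbers
            out.append(CODE_TOKEN)
        else:
            out.append(tok)
    return out

print(sterilize(["replace", "pump", "#", "4432-A", ";", "leaking"]))
# -> ['replace', 'pump', '<punct>', '<code>', '<punct>', 'leaking']
```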
We see the presence of bespoke NLP adaptations as evidence that the lack of a well-developed notion of a domain is a central issue hampering domain adaptation for TLP. We favor the approach taken by Plank, 8 which critiques current approaches to domain adaptation as focusing on the dichotomy between the source and target domains without a real interest in their essential differences. She notes that little research addresses how text varies and how these variations affect the use of NLP, and proposes defining a domain as a region in a high-dimensional "variety space." This is an unknown high-dimensional space whose dimensions are variables, some of them latent and many of them beyond the text itself, that describe the different ways that texts and their contexts can differ. 8 Variety spaces can accommodate variables that exist outside of the text, such as gender or geographic location, the medium used, or area of domain expertise. A domain is a region in this space where texts can be said to be similar; it is a bounded cluster of points in this variety space.
In the conventional NLP conception of domains, all that can be considered formally is the text by itself. One problem with this restricted way of thinking about domains is that two texts can appear to be very similar while arising from contexts that demand different interpretations. By using the variety space definition, one can formalize the need for using two separate dictionaries to decode the terms found in such texts and process them accordingly.

6 | MAKING USE OF FORTUITOUS DATA
To enable interpretation, Plank 8 also argues for the value of "fortuitous" data: metadata and data from other sources, associated with the text, that are usually ignored during NLP analyses. In maintenance, these include data extracted from other fields in the maintenance management system, such as cost or time spent, as well as information obtained from purchase systems, weather databases, and maintenance manuals. She claims that pairing fortuitous data with learning algorithms allows for rapid adaptation to new varieties of language. In particular, she argues for rapidly gathering annotated data and for increased use of unsupervised and weakly supervised methods.
Although for many use cases a rules-based approach can handle the presence of zero, one, or multiple labels on a single MWO, this can challenge supervised learning approaches, 23,30 and performance depends on the handling of class imbalance. Seale 44 handled the challenge of 1200 different component classes by injecting additional information relevant to the physical systems into model training through "privileged information," which is a form of fortuitous data. 8 A common example of such knowledge in engineering is that components have a natural hierarchical structure, and this taxonomic information can be used to identify correct and incorrect components. Another example is using knowledge of cause and event relationships to predict the components involved in a failure or repair activity.
We believe that TLP can further develop the idea of fortuitous data by encouraging community development and use of shared computational resources. Knowledge dictionaries, typified by ConceptNet, 45 have gained traction as a way to improve the semantic processing of natural language and to provide additional assistance in preprocessing the data, managing word tagging, and applying special rules. Such dictionaries are reusable, are developed and tuned as a data preprocessing step across the data, and often draw on common NLP tools to assist in their creation. 10,23,40 Sexton et al 40 developed an importance-based vocabulary tagging system using term frequency-inverse document frequency (TF-IDF) weighting. 22 Gao et al 10 used spellcheckers (pyspeller) and string distance (fuzzywuzzy) to support the dictionary creation process for domain-specific uses. Such dictionaries have helped manage misspellings and variations of the same terms in preprocessing for word representations. 46,47 POS tagging has also been customized; examples include a modified version of the widely used Penn Treebank tag set with custom tags for domain-specific concepts 9,14 and context-relevant State-Activity-Item tags. 10 On the surface, these resources can help mitigate the lexical variations in technical text and simplify domain adaptation between similar technical domains by constraining terminological variation to the intrinsic differences found between facilities. 21 From a deeper perspective, they provide a source of standardized fortuitous data; they represent shared knowledge that can be used to understand the latent variables associated with a domain and help define the proper context for their interpretation. This knowledge can also help with the comparison of domains and further foster the sharing and adaptation of analysis approaches.
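As a sketch of the importance-based ranking idea, the snippet below scores corpus terms by their summed TF-IDF weight as candidates for a human-curated dictionary. The toy corpus and the choice to rank by summed weight are our assumptions, not the exact procedure of Sexton et al.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny invented MWO corpus; real corpora hold thousands of records.
mwos = [
    "replace leaking hyd pump",
    "hyd pump noisy, inspect",
    "chg oil filter",
    "oil leak at pump seal",
]

vec = TfidfVectorizer(token_pattern=r"[A-Za-z]+")
X = vec.fit_transform(mwos)

# Rank terms by summed TF-IDF weight across the corpus; high-ranking terms
# are candidates for expert review and inclusion in the technical dictionary.
scores = np.asarray(X.sum(axis=0)).ravel()
for term, score in sorted(zip(vec.get_feature_names_out(), scores),
                          key=lambda pair: -pair[1])[:5]:
    print(f"{term:8s} {score:.2f}")
```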

7 | DOING MORE WITH LESS
Engineering researchers have started to make wide use of SOTA NLP approaches to mine text data. 15,16 However, there is also a tendency to gloss over their high computational costs. 15 For example, the article introducing GPT-3 5 does not mention its estimated cost of 355 graphics processing unit (GPU) years, or $4.6M (USD). 48 Even smaller efforts can incur large costs; Strubell, Ganesh, and McCallum 49 examined the cost of a representative NLP research project: 27 GPU-years for training and tuning, costing in excess of $100K (USD) for cloud compute time and $9870 for electricity.
How can we ensure access to the benefits of NLP for those who cannot afford large computing clusters? Can something be done to mitigate the computational requirements?
In some domains, crowd sourcing, 50 the use of large numbers of people across a network performing an information processing task, has successfully added complex NLP 51 to an analysis with minimal computational cost. In technical domains, however, the data are often proprietary business information that cannot be shared outside of the organization. This dramatically limits the usefulness of crowd sourcing.
We will instead focus on TLP's engineering mindset, which encourages discussions about the most practical approaches for achieving real-world goals. For example, Xu et al 52 used neural word embeddings and convolutional neural networks (CNN) to perform text classification. Their CNN model took 14 hours to train. Following an approach that is congruent with TLP, Fu and Menzies 15 performed a replication study that used an optimizer to fine-tune a traditional support vector machine (SVM), achieving similar performance while decreasing training time by a factor of 84.
Subsequently, Majumder et al 16 repeated the replication study using local learning: clustering the data prior to training an SVM on each cluster. They reported a 570× speedup on a single core and a 965× speedup on eight cores relative to Xu et al while achieving F1 scores within 2%. While the classification scores did not improve from an NLP perspective, accessibility and usefulness did: the need for large computational resources was mitigated while still achieving useful results.
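A minimal sketch of that local learning recipe is shown below, under our own assumptions about data layout and with placeholder parameters; it is not the implementation of Majumder et al.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def fit_local_models(texts, labels, n_clusters=8):
    """Cluster the documents, then fit one cheap linear classifier per cluster."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(texts)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    models = {}
    for c in range(n_clusters):
        idx = [i for i, a in enumerate(km.labels_) if a == c]
        # Caveat: each cluster needs at least two label classes to train;
        # a real implementation requires a fallback for degenerate clusters.
        models[c] = LinearSVC().fit(X[idx], [labels[i] for i in idx])
    return vec, km, models

def predict(text, vec, km, models):
    """Route a new document to its cluster's local model."""
    x = vec.transform([text])
    return models[km.predict(x)[0]].predict(x)[0]
```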
These results show the value of applying an engineering perspective to an application domain text analysis problem instead of solely using the current NLP SOTA. Because the NLP literature focuses on advancements along its frontiers, applied results, particularly those which address computational costs, are relegated to the literature of disparate domain-specific communities, such as software engineering. TLP allows for the aggregation and dissemination of these patterns of usage within its community.

8 | THE RISKS OF THE STATUS QUO
There are always risks that accompany the application of technology, tools, and techniques to any domain, including maintenance text. Some of these risks are of a practical nature. One central risk is lack of trust: due to missing, incomplete, and inconsistent information, practitioners do not trust their maintenance data and, by extension, do not trust the outputs from applying NLP to these data. Another risk is that many groups will likely develop ad hoc solutions to particular issues and, in general, mistakes will be made and solutions will be reinvented many times. This reiteration and reinvention effectively represents a tax on an entire industry, one that could be greatly reduced by shared conventions and standards.
A more pernicious set of risks can be articulated using a set of ethical concerns originally intended for the broad societal use of algorithms, which apply to this more focused use as well. 53 The concerns are unjustified actions, inscrutable analyses, and systemic bias; maintenance-based examples and their consequences are shown in Table 3.
Correlations emerge from the analysis of data, and actions may be taken based on these findings. When the causal link is unknown or not determined, the action may be unjustified as well as costly and ineffective. The use of correlation to guide actions is not without pitfalls; spurious correlations in the data, coupled with the low probability of finding legitimate regularities and the tendency of SOTA NLP to learn shallow heuristics, can result in unjustified actions to address questionable concerns identified by mining text-based records. For maintenance, this is likely to be further exacerbated by the lexical noisiness due to variations in spelling, abbreviation, and punctuation found in the data.
Actions can be justified by examining the relationships between data and conclusions. Although often hard to discern, it is reasonable to expect that these relationships are available for inspection. Such scrutiny can help decide between competing conclusions drawn from different analyses of the same data, or identify ungeneralizable conclusions drawn from accidental features of the data. Responsibility is an important component of engineering ethics, and we see the ability to analyze and justify technical actions as key to being able to accept meaningful responsibility.
With the large amounts of complex data and machine learning that are used by SOTA NLP, the rationale behind analysis results can easily be obscured inside of inscrutable algorithmic black boxes that impede human understanding and criticism. The results and implied courses of action then have to be accepted at face value with a lack of confidence. With competing analyses, a final course of action must then be determined by outside means driven by the personal biases of those left to make the decision.
It is well known that analyses follow the "garbage in, garbage out" principle and that the quality of the results is heavily dependent on the quality of the data. However, analyses are also inherently biased by assumptions baked into tools and methodologies, and these biases are often propagated into the conclusions. For maintenance, any tendency of the NLP analytics pipeline to overlook certain issues results in resources being allocated to other activities instead. Because of their dependence on large amounts of annotated training data, the use of popular SOTA NLP techniques in technical domains such as maintenance may cause a type of sampling bias: machine learning systems will find the issues for which training data is readily available. Text preprocessing can also affect analysis results, 54 and due to the large lexical and grammatical variations in maintenance text, common, out-of-the-box NLP preprocessing techniques may not work well for this domain. This means that the apparent importance of certain maintenance-related issues could be systematically diminished or exaggerated because of the mismatch between the assumptions behind commonly used NLP techniques and the requirements of maintenance.
TABLE 3 Examples of risks and consequences with natural language processing (NLP)-based analyses in industrial maintenance

Risk | Example situation | Possible consequence
Unjustified actions | The failure mode causing the most unplanned downtime was not identified because of large variation in misspellings | Improvement initiatives were not focused on the highest opportunity areas
Inscrutable analyses | An analysis uses complex algorithms and large amounts of data that are hard to understand | Lack of confidence in analysis results; final course of action decided by other means
Systemic bias | A company's maintenance analytics pipeline tends to overlook certain issues | Resources routinely allocated to other areas

With its emphasis on iterative, human-in-the-loop style analyses, TLP naturally fosters human understanding throughout the pipeline. By adapting NLP to the domain under investigation, we see increased opportunity for simpler and more understandable analyses. Tailoring earlier stages of the pipeline, such as preprocessing and parsing, to the domain introduces important semantics earlier in the analysis and simplifies the algorithms used later. For example, instead of using an opaque deep neural network to classify messy text data, a simpler, interpretable classifier can be used on the preprocessed and normalized text, which facilitates understanding the rationale behind analyses, justifying the resulting actions, and finding hidden biases.
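To illustrate that final point with a minimal sketch (the texts and labels below are invented), a linear model trained on normalized text exposes its rationale directly through its per-term weights.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented, already-normalized MWO texts and component labels.
texts = ["replace hydraulic pump", "hydraulic pump leaking",
         "inspect conveyor belt", "conveyor belt torn"]
labels = ["pump", "pump", "conveyor", "conveyor"]

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# Positive weights favor clf.classes_[1]; every routing decision can be
# traced back to visible, challengeable term weights.
for term, weight in zip(vec.get_feature_names_out(), clf.coef_[0]):
    print(f"{term:10s} {weight:+.2f}")
```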

9 | SUMMARY
NLP has made significant progress in recent years toward achieving human-level performance on a variety of natural language tasks. Despite this, engineers and technical analysts seeking to use SOTA NLP for real-world tasks face concerns that it may not live up to expectations, require more annotated training data than is available, be too complex to understand the rationale behind its analyses, require excessive computational resources, and inject biases into the final results.
We have proposed TLP, a human-centered, iterative approach to NLP, to address these issues for engineering domains. By focusing on the needs of engineering text analysis and not being driven to achieve human-level language performance, TLP practitioners are free to choose the most practical techniques for the challenge at hand while achieving a thoughtful balance between raw analytical performance and the available resources. In place of the aesthetic of stringing together complex algorithmic black boxes and hoping for the desired outcome, TLP encourages human intervention to inject domain knowledge and meaning at each stage of the analysis, as detailed in our previous paper. 21 This can help mitigate the accumulation of systemic technical bias in the final analysis. By adapting NLP to focus on the challenges of engineering text, TLP can bring the promise of text analysis to industry.