Assessing changes in reliability methods over time: an unsupervised text mining approach

Reliability engineering faces many of the same challenges today that it did at its inception in the 1950s. The fundamental issue remains uncertainty in system representation, speciﬁcally related to performance model structure and parameterization. Details of a design are unavailable early in the development process and therefore performance models must either account for the range of possibilities or be wrong. Increasing system complexity has compounded this uncertainty. In this work, we seek to understand how the reliability engineering literature has shifted over time. We execute a systematic literature review of 30,543 reliability engineering papers (covering roughly a third of the reliability papers indexed by Elsevier’s Engineering village. Topic modeling was performed on the abstracts of those papers to identify 279 topics. Hierarchical topic reduction resulted in the identiﬁcation of 8 top-level method topics (prognostics, statistics, maintenance, quality control, management, physics of failure, modeling, and risk assessment) as well as 3 domain-speciﬁc topics (nuclear, infrastructure, and software). We found that topics more associated with later phases in the development process (such as prognostics, maintenance, and quality control) have increased in popularity over time relative to other topics. We propose that this is a response to the challenges posed by model uncertainty and increasing complexity.


INTRODUCTION
Product reliability weighs heavily on total product cost.Unreliable products drive warranty cost ( 1 ) while trying to attain unrealistic reliability targets on new designs adds significant cost to the development process ( 2 ).Traditionally, the strategy has been to attempt prediction of reliability early to inform the design ( 3 ).The field of reliability engineering was born with this initial purpose, though it has shifted focus over time to align with the products it addresses.
Reliability engineering faces many of the same problems in today it did in its founding in the 1950s.Most of these problems stem from the challenge of representing a system early in the development process, when there is significant uncertainty.Details of a design are unavailable early in the development process and therefore performance models must either account for a range of possibilities or choose a singular prediction of the design that is likely incomplete.Increasing system complexity has highlighted how poor our predictions can be, since complex systems compound uncertainties.Said another way, there exists a large set of possible designs that satisfy basic functional requirements.Each design will have unique properties, including reliability.
Reliability model uncertainty stems from a combination of failing to identify the entire set of possible designs and from failing to create sufficient model resolution on known designs.
As an example, consider a designing the first screwdriver.The basic functional requirements are that it must impart a certain torque on a fastener with a certain head geometry.Early in the development process, the set of possible designs would include everything from a traditional manual twist screwdriver to an electric screwdriver.The complexity and reliability of these two is vastly different.As the development process continues, requirements are refined, concepts are eliminated, and more design detail is added to the front-runner (or front-runners for set-based design).In this example, the company lands on a manual screwdriver but with a ratcheting mechanism.More complex than a two-piece manual screwdriver, the reliability can only be predicted once the ratcheting mechanism has been designed.After prototypes are made, the decision is made to use a sintering process for a component rather than machining.This process change for manufacturing scale weakens the mechanism and changes the ultimate reliability.In the end, it is extremely unlikely the topology of the model (ratcheting screwdriver) and the parameters (sintered components) would have been accounted for at the beginning of the development process.
The field of reliability engineering was born to predict performance over time of vacuum tube electronic systems ( 4 ), at the time one of the most complex products yet created.While more complex than a manual screwdriver, these systems pale in comparison to the complexity of their modern replacements.In general, the constituent parts of new systems are more reliable, but their count is greater therefore so is the number of possible interactions.
1.1 Defining reliability 5 cites Robert Lussar's 1950s definition of reliability as "the probability that a system will continue to work, for a stated period of time, given defined operating conditions."Indeed, this is the most common formal definition of reliability, specifically calling out 1. probability, 2. time, and 3. operating conditions.For example, 6 describe reliability as "the probability of performing all the functions (including safety functions) satisfactorily for a specified time and specified use conditions." 7specifically tackles the field of "reliability engineering," noting that while there is no single definition, the following casts a wide enough net to get most of them: "reliability engineering is all activities carried out to obtain the right reliability of a technical system, through the various life cycle phases of the system."This is potentially a more appealing definition to the present work since it encompasses activities other than probabilistic modeling.
The certification body American Society for Quality ( 8 ) mirrors the formal definition of reliability, stating "Reliability is defined as the probability that a product, system, or service will perform its intended function adequately for a specified period of time, or will operate in a defined environment without failure."They offer a "certified reliability engineer" credential which identifies professionals "who understands the principles of performance evaluation and prediction to improve product/ systems safety, reliability and maintainability."The full body of knowledge includes "design review and control; prediction, estimation, and apportionment methodology; failure mode effects and analysis; the planning, operation and analysis of reliability testing and field failures, including mathematical modeling; understanding human factors in reliability; and the ability to develop and administer reliability information systems for failure analysis, design and performance improvement and reliability program management over the entire product life cycle." What of seemingly related terms like durability, performance, maintenance, changeability, robustness, and quality?At a high level, reliability can be thought of as the addition of a time dimension to each of these.For example, with durability and robustness, we are interested in the resilience of a product in edge conditions.Reliability is concerned with how that resilience deteriorates over the life of the product, ultimately resulting in failure.Similarly, quality is a broad field which focuses on ensuring a product meets stakeholders' needs.Reliability is concerned with how specified properties change over time, often especially when they stop meeting those needs ( 9 ).

Why reliability is hard
At a high level, we believe that the fundamental challenge of reliability is the need to predict emergent system behavior over long timescales with limited information.Reliability prediction is most valuable early in the development process, where changing course is less costly ( 10 ).Conversely, it is much less impactful to know your product is unreliable after it's in customers' hands ( 6 ).This is at odds with the reality that little is known about the performance of a system early in the development process.The outcome is great uncertainty at a time when reliability predictions would be most valuable.
The reliability community recognizes the challenges faced. 11asserts we are still dealing with many fundamental problems in reliability while adding new complexity.Some of the key challenges discussed deal with soft failures associated with multi-state systems, network effects that are difficult to model, and software reliability.Zio proposes system health monitoring (prognostics) and other "dynamic modeling" techniques as a way to manage the increasing complexity of systems.Zio later asserts dynamic modeling as a potential solution ( 12 ). 13notes many challenges which relate to the reliability discipline as a subset of systems engineering.Specifically, emphasis on stakeholder requirements, analysis of the system against those requirements, and general engineering management pervade their list.
Uniquely, 14 focuses on the challenges associated with experimental design.Two of the fundamental challenges align with those discussed by Zio, namely that complex systems are designed to operate in a wide range of conditions and may continue to operate in a degraded state without total failure.The former significantly expands the test variable space, while the later makes clear requirements mandatory. 15demonstrates with specific examples the challenges associated with quantitative reliability prediction.They specifically target the conventional MIL-HDBK-217 prediction scheme (series-parallel models with failure rates from a standard table ) with several examples of how off it can be.To counteract this, they suggest several approaches including the use of physics of failure.They also discuss many of the same issues with testing reliability raised by Freeman. 7frames the core problems in reliability engineering as 1. understanding system reliability and 2. determining the right reliability level.The former deals with issues of system representation and uncertainty while the later comes down to requirements. 16discusses reliability in the context of big data, calling high-resolution operational information "the next generation of reliability data."Specifically, they suggest application of dynamic modeling and prognostics.This same theme is discussed by 17 , where it's given the buzzword "industry 4.0."Mentioned previously, literature on "design for reliability" advocates for perhaps the most radical changes from the status quo. 6summarizes their recommended changes nicely with the following eight paradigms.They are paraphrased and reproduced here due to their importance.

How has reliability changed over time?
This work seeks to map the focus of reliability engineering literature over time.The purpose is similar to some of the challenges/future direction review papers previously discussed, but we add rigor through the use of the systematic literature review framework as well as unsupervised topic modeling to collect and map articles over time. 18executed a very similar piece of research (unsupervised clustering to model quality and reliability topics over time).In their case, a single journal was studied with a small number (1,475) of articles.No secondary analysis or discussion was executed beyond the identification of topic trends, and the intermingling of quality and reliability topics clouds the results.In this paper, we go deeper to provide a window into how the field has changed and where it might be heading, along with proposed rationale.After identifying trends, we map them to two dimensions: timing (proactive/reactive) and practicality (theoretical/applied).This higher-level analysis suggests where the field may be going beyond which topics are growing in popularity.The structure of this paper is as follows: 1. Introduction: context and motivation for this work.

THEORY AND METHODS
While this work's primary goal is to address research questions pertaining to reliability engineering, exploration and development of a portable literature analysis procedure was a necessary secondary goal achieved along the way.This section describes this procedure as well as considerations from its development.The scripts used for analysis are included in the appendices with the intent that this process could be pointed at other similar problems and yield insight with minimal modification.

Methods overview
At a high level, our research questions (enumerated in 1.3) hinge on grouping papers.Because we intend to include a large number of papers, manual grouping is not feasible and may not be reliable.Thus, we rely on machine learning techniques.Figure 1 below schematically shows the overall analysis flow from research questions through analysis and result.Briefly, we consider the totality of academic literature to be our input.This is filtered through our query, which results in a corpus of article abstracts.From here, the corpus is analyzed via two parallel paths.
First, topic modeling identifies latent commonalities amongst abstracts and places them into groups.These groups are reduced to result in a more manageable number of aggregated topics, one of the main outputs.We also samples labeled abstracts and validate the modeling by blindly labeling the samples and calculating accuracy.We can also ascribe a timing or proactive/reactive score to the topics through a mapping to a standardized product development process, resulting in the second main output, a timing score.
The second path begins by classifying abstracts to assign a practicality score based on the type of examples mentioned (or not).As before, we sample a set of classified abstracts and perform a blind validation to assess classifier performance.In this case, we do not need an additional mapping step as the classification directly applies the practicality score.

Systematic literature review
The systematic literature review is a technique for synthesizing a body of scientific literature.It originates from medicine, where often many studies are combined to produce a more powerful secondary analysis.The field recognized the importance of rigor in this work ( 19 ), ultimately resulting in the publication and acceptance of standards like the Cochrane Handbook ( 20 ).
Fundamentally, a systematic literature review consists of 12 stages as discussed in 21 .They are:

Data sources and query
The research questions cast a wide net across all reliability publications.As a result, the primary metric for data sources is the count of applicable papers.Query authoring is inextricably linked with data source identification since data sources cannot be assessed without a pool of results.Review of previous reliability literature studies 22 performed a review of reliability allocation methods.Theirs was a traditional systematic literature review wherein papers were manually gathered and read for analysis.They pulled data from Elsevier Scopus 1 , ResearchGate 2 , and Google Scholar 3 using the base query "reliability allocation" (in quotes).Their initial document population included 1,670 papers of which 93 were included in the final analysis.They further added subtopics with Boolean AND statements to group papers by specific allocation techniques.Their research question was limited to reliability allocation, specifically identifying which methods were most used and which industries were most discussed.They found that most reliability allocation takes place in the electronics industry, closely followed by the machinery industry.They did not attempt to map trends over time. 23performed a review of resilience engineering, a field which could be considered a subset of reliability engineering.In their case, they used all available sources to them at their institution (16 journals and databases).Their query was "resilience engineering" (in quotes) and they noted this was selected as not using the quotes or using only "resilience" yielded papers outside their scope.Their document population included 637 articles, or 237 after de-duplication.They focused on categorization of papers, dividing them into two dimensions: domain and research area.They also did not attempt to map trends over time. 24performed a review of fault analysis models, specifically those that rely on machine learning.They leveraged Scopus and Web of Science 4 .Their paper lists their query as ("fault" AND "reliability") OR ("machine learning" AND "artificial intelligence).It is not clear how this query would produce the intended results, so we assume their actual query was ("fault" OR "reliability") AND ("machine learning" OR "artificial intelligence").
Their initial document population included 552 articles, of which 243 were analyzed.They performed focused on citations, not only to identify the most influential authors but also to create clusters.That is, rather than clustering based on text analysis, a network diagram was created purely from papers citing each other.They labeled their clusters and found that detection and diagnosis represented the largest fraction of papers.

Selected query
Query authoring is an iterative process.Additional iteration is also required since different data sources provide different query fields and features.For example, many database providers (e.g., Elsevier Engineering Village5 or JSTOR Constellate6 ) include some degree of labeling or tagging.This in itself could be used to query an area and identify topics without additional analysis.
More concretely, Engineering Village calls its topics "controlled vocabulary."Most applicable to this paper is the vocabulary term "reliability."Though there is a volume of papers resulting from the use of this term, subsequent execution of topic modeling shows that it casts too wide a net, including papers which mention or relate to reliability but do not focus on it.This is consistent with the fact that these vocabulary terms do not partition papers, they label them.
Ultimately, the search query of "reliability engineering" was selected (with quotes).All surveyed data sources interpret this as a literal string of "reliability" followed by a space, followed by "engineering."This query returns results which recognize reliability engineering as a discipline rather than only discussing reliability as a quality.Since this paper is focused on mapping reliability engineering as a discipline, this produces a more targeted document pool.For this study, Elsevier's Engineering Village is used as the data source via its API.

Inclusion/exclusion criteria
In addition to data sources and search queries (effectively primary and secondary filters, respectively), systematic literature reviews explicitly define inclusion and exclusion criteria to permit further refinement of the document pool.For the purpose of this analysis, we attempt to limit these tertiary filters to the extent possible, relying mostly on data source selection and search query to provide desired results.
The criteria that are applied are necessary for successful application of aggregation and analysis tools.Namely, 1. Documents must be written in English to permit topic modeling.If documents were present in multiple languages, they would appear as separate topics.
2. Documents must include a publication year.This enables mapping topic trends over time.
3. Documents must include an abstract.These abstracts comprise the corpus for subsequent text mining.

Aggregation methods
Having defined data sources, search queries, and inclusion/exclusion criteria we are now left with a set of documents.Part of the magic of systematic literature reviews is that they distill these documents, providing insight greater than the sum of their parts.
This paper uses two machine learning techniques to perform this aggregation: topic modeling and zero-shot text classification.These are described in Sections 2.3 and 2.4, respectively.

Overview of techniques
This is not the first work to employ topic modeling in the context of a systematic literature review. 21notes that some degree of clustering provides value when considering which papers to include and exclude.More to the point, 18 applies clustering to map topics in quality and reliability literature over time, though with a narrower scope and less resolution than we will achieve.In their case, they used k-means clustering and identified 8 top-level clusters (quality management, quality functional deployment, process capability, quality, reliability, ISO standards, service, and six sigma).They performed an additional analysis of the reliability cluster, breaking it into 8 subtopics (fuzzy methods, reliability systems, sampling and inspection, software, maintenance, failure, warranty/repairs, and models).

Latent Dirichlet allocation
First described in 25 , latent Dirichlet allocation (LDA) is generally considered to be the baseline of topic modeling at the time of writing.It provides generally good performance with minimal overhead (e.g., training).
At a high level, LDA works by assuming documents can be described by a mixture of latent topics.Topics in documents and words in topics are iteratively allocated using Bayesian inference.Briefly, the probability of each document belonging to each topic is calculated, given the other documents already associated with that topic.The process continues until documents are no longer reassigned to new topics.
Practically, LDA can be executed with a number of Python libraries such as Gensim (described in 26 ).Purpose-built analysis and visualization libraries such as LDAvis ( 27) provide nearly turnkey analysis.Though execution of analysis is straightforward, it still requires nontrivial pre-processing (preparing the raw corpus to a form rife for analysis) is not automatic or built-in.This includes removal of stop words (common words which won't be related to topics), lemmatization (collecting different conjugations of words such as "model", "modelling", and "modeller"), and generation of n-grams (grouping common words together that will have a different meaning, like "reliability engineering").
One of the main advantages of LDA is that standard analyses rely on principal component analysis for dimensionality reduction.Ultimately, the downfall of LDA for this study was that the number of topics must be prescribed.One could iterate through all possible numbers of topics, but the technique does not natively understand hierarchy.Practically, this means that if topics of interest appear at multiple levels (e.g., with different target numbers of topics) we cannot tell which subtopics exist for them.For example, if the corpus includes a mix of domain-agnostic articles along with a large number of articles that describe reliability tools in the context of a single domain (construction, for example), it's possible the algorithm would identify "construction" as a topic rather than grouping those domain-specific papers based on tools discussed.The importance of this is made clear in Section 3.2.3.

BERTopic (sentence-transformers, UMAP, and HDBSCAN)
One of the perceived limitations of older techniques like LDA is that they do not consider word order.They are commonly called "bag of words" embeddings (embeddings are numeric encodings of words) since the words could be jumbled up and produce the same results.At the time of writing, the modern solution for this is called the transformer.First described in 28 , this is the concept that enabled large language models such as OpenAI's GPT ( 29 , Google's PaLM ( 30 , and Meta's LLaMA ( 31 ).BERT is one of the earliest implementations of the transformer concept, first described in 32 .
At a high level, transformers (and therefore BERT) extend the bag of words concept by weighting the importance of each word in a sentence and each sentence in a paragraph, et cetera.This is accomplished by iteratively assigning importance scores between each pair of words in the corpus.This enables the type of context-awareness that humans rely on to parse text.BERTopic ( 33 ) combines the powerful embeddings of transformers with concepts from top2vec, discussed in the next section.Embeddings are generated using BERT, creating context-aware representations of each document.At this point, each document is represented by a high-dimensional vector.Dimension reduction is accomplished by UMAP ( 34 ) to aggregate less unique dimensions and clustering with HDBSCAN ( 35 ) identifies areas of high density in the population, or topics.Term frequency inverse document frequency (TF-IDF) is used to label the topics with the most unique words.
BERTopic has many qualities needed for this research, namely automatic topic identification and topic hierarchy.Ultimately it was not selected because the general purpose sentence-transformer model may not be sensitive to the word choice used in academic abstracts about reliability engineering.

Top2Vec (doc2vec, UMAP, and HDBSCAN)
Another modular topic modeling framework, top2vec, was first described in 36 .Though it initially used the doc2vec model for embeddings ( 37 ), it now supports newer alternatives including BERT.Also like BERTopic, dimension reduction is accomplished by UMAP and clustering with HDBSCAN.Also similar to BERTopic, hierarchical topic reduction is possible and the resulting hierarchy can be used to consider topics at multiple levels.Finally, unlike LDA, no preprocessing is required so the analysis process is streamlined.With all topics identified, the distance to each centroid for each document is calculated to establish relative similarity to each topic.Also like BERTopic, top terms in each topic are determined with TF-IDF and used to label the topics.
Since doc2vec is the main difference between BERTopic and top2vec, it's worth explaining the possible advantages for the present application.At the most basic level, doc2vec is an algorithm that extends the traditional bag of words technique by making it distributed.That is, rather than only calculating the probability of a word based on all of the words in a document, it calculates the probability of that word being near other words in a sentence in that document.Thus, the vector representation describes the entire document including its structure.Conversely, BERT (or transformers in general) represent a much more complex neural network architecture that can result in similar context-awareness like doc2vec, though with more limited scope and with significantly more overhead.

Topic modeling execution plan
We choose to use top2vec for this study as the doc2vec embedding should perform well with a moderate corpus of domainspecific terminology.The anticipated outputs will be a list of topics identified amongst the documents, top words unique to each topic, documents associated with each topic, and for each document the distance to every other topic.

Text classification
The topic modeling discussed previously addresses the problem of identifying topics from a group of documents and assigning documents to those topics.Classification differs from this in that the topics (or classes in this case) are defined before any analysis begins.One canonical instance is sentiment analysis, for example determining whether product reviews are negative, neutral, or positive.

Overview of text classification techniques
Text classification is a field essentially as old as natural language processing and as such we do not attempt to consider all options for this paper.Rather, we bookend the space with the oldest and newest techniques to understand the field's evolution.

Bag of words methods
The most basic technique for classification is to create a list of terms for each possible class and check which list is most represented in a given document.If this sounds familiar it's because this is the essence of TF-IDF, described in Section 2.3.1.More advanced versions of this technique would generate the lists automatically, also known as training the model.
This requirement for training is the biggest downfall, particularly when the classifications are unique.Domains with significant interest (like sentiment analysis) have robust pre-trained models.For this application, we want the flexibility of classifying documents more loosely and without extensive training.

Large language models
Large language models (LLMs) are deep learning language models trained on very large datasets with the intent of understanding and generating natural language.The significance of large language models (LLMs) continues to ripple through many domains.It should not be surprising that one domain is NLP itself, including classification.OpenAI recognized this application early on and initially provided a dedicated endpoint for classification tasks, later incorporating it into the more general fine-tuning API 7 .
The power of LLMs is that they are capable of zero-or few-shot training for classification tasks.That is, one can feed a corpus into an LLM and ask it to assign a classification with no specific training beyond the base model.If needed, one can also provide some examples to shape its output (few-shot training), or many samples to optimize the model for your application (fine-tuning).

Classification execution plan
Both fine-tuning (using the OpenAI text-davinci-003 model and zero-shot prompting (using the OpenAI gpt-3.5-turbomodel) were assessed for classification tasks.Ultimately, the fine-tuning model could not match the accuracy of the zero-shot prompt, so the latter was selected.

Additional assessment dimensions
The final research question, RQ3, does not immediately follow from topic modeling or text classification.Answering this question requires an additional layer of inference beyond those topics/classifications.

Proactive/reactive
RQ3 can be interpreted as asking how many papers are published in reactive topics versus proactive topics.One strategy would be to assign a proactive/reactive score to each topic.The primary limitation of this strategy is that it lacks grounding since scores are arbitrarily determined and scaled.
For this research, we establish an intermediate mapping from topic to traditional product development phase.Since traditional product development occurs in a linear fashion, this establishes a time dimension.Additionally, topics that occur prior to product launch can be considered increasingly proactive while those that occur subsequent to product launch are reactive.
The product development phases and their corresponding scores are shown in Figure 2 .Example topics are included for reference.Actual topics and justification for their scoring is described are Section 3.4.

FIGURE 2
Traditional product development process including proactive/reactive scores and example topics.This process is adapted from 10 .The main distinction is that we include all post-development activities (such as service) in the final category.Additionally, we note that most management activities are captured by the "Planning" phase, as described by 38 .
Note that some topics occur at different intensities throughout the development process.For example, management permeates all phases.We choose to classify each topic in the earliest phase in which it provides significant impact.We justify these classifications in Section 3.4 through illustrative examples and the literature.Validation of this dimension occurs through validation of the topic modeling, since topics are transparently mapped to development phases.

Limitations
While we believe this to be the most comprehensive survey of reliability engineering literature to date, we acknowledge several limitations of the methods described here.Due to the structure of the academic literature industry, we are not including all

TABLE 1
The reference lists from six survey papers were compared to the document population to estimate coverage.Titles were matched with a simple character ratio (number matching/total number), with values above 0.9 considered matching.Note that this result indicates a significant lack of coverage across the reliability engineering literature.
published works in our survey.That is, we include only documents to which we have access which may not be a representative sample of the whole.This is discussed more in Section 4.4.
The next impactful limitation is that we rely on paper abstracts to contain sufficient information to cluster and classify their associated documents.Implicitly then, we are relying on the documents' authors to accurately represent their contents.This is likely the case regarding topic modeling (and therefore the timing or proactive/reactive dimension), but perhaps more tenuous when considering classification based on whether an example is mentioned in the abstract.
Finally, our mapping between topics and timing score are subjective, though based on the literature.Our decision to represent topics which span multiple phases in their first phase is an arbitrary decision, one could just as easily pick the weighted average, weighting by perceived importance.

RESULTS
We now execute the methods described in the previous section and consider their output.Specifically, we describe the documents collected through the systematic literature review process, perform an analysis of those documents through topic modeling and classification, and apply of second-order scoring (e.g., timing and practicality).Interpretation and ascription of meaning to these results is reserved for the discussion.

Document population
The overall document population described in this work includes 30,543 papers.Of these, 20,634 are journal articles, 7,764 are conference papers, with the balance being book chapters and miscellaneous reports.These papers span publication years from 1955 through 2023.The most papers were published in 2022, with a total of 2,141.A plurality (11,789) of these papers come from the Reliability Engineering and System Safety journal.The next most popular source is Quality and Reliability Engineering International with 5,829 papers.Each of the other sources contribute fewer than 1,000 papers.

Validation of document population
To validate our document population, we manually select six recent review papers from the reliability engineering field.These papers were selected as they represent different specialties in the field and include robust reference lists.We then search the document pool for their references and determine what fraction of references are present in the document pool.These results are shown in Table 1 .
All of these papers cite references outside of the reliability engineering field, so we should not expect 100% coverage.Particularly in fields which are more mathematical (e.g., modeling and statistics as in the case of 22 40 , respectively), there is a high chance of referencing mathematical or computer science publications which would not be included in the reliability engineering document population.

Label
Top

Topic modeling results
With a document pool we are now able to begin the second phase of analysis, topic modeling.As discussed in Section 2.3.1, we elect to use the top2vec library for this purpose.Our corpus is the set of abstracts from the 30,543 papers.This corpus is fed into the topic modeling function, with identified topics along with how close each paper is to each topic as output.The algorithm identified 297 topics amongst these papers.These 297 topics were identified with default top2vec parameters, namely topic_merge_delta=0.1, equivalent to the epsilon parameter of HDBSCAN.The merges topics which have a cosine distance of less than 0.1.Another key parameter left default is min_count=50, which filters out infrequent words.

Hierarchical topic reduction
Following topic modeling, hierarchical topic reduction is performed to aggregate topics into larger and potentially more meaningful groups.The reduced topics and paper counts are enumerated in Table 2 and shown graphically in Figure 3 .The three terms included in each topic are the most frequent words unique to that topic (TF-IDF), while the label is an human interpretation of the documents in that topic.This list of topics addresses RQ1, "what topics (areas of common subject matter) comprise the body of reliability engineering academic literature?"While the remainder of this work will discuss documents as if they belong to a single topic, we anticipate documents are in fact a mixtures of topics.The mixture of topics in a given paper can be described by the distance of it to the centroid of each topic in the dimension-reduced vector space.We choose to label it with the topic it is closest to.For example, consider 43 's "Optimal repairable spare-parts procurement policy under total business volume discount environment."This paper is a 38% match to the Maintenance category and so it is counted amongst those ranks.However, it is also a 22 % match to the Management category and a 20 % match to the Modeling category.Indeed, the paper appears roughly between the middle of these three clusters in the point cluster plot (Figure 4 ).
We note that three of the topics in this group are domain specific: nuclear, infrastructure, and software.Since we are unable to assign a clear product development timing element to these papers (and indeed, subtopics among each of these likely have their own timing elements), we elect to perform a separate analysis on each of them, described in Section 3.2.2.For the first pass, we will only consider the eight remaining topics.Conveniently, this places us within the magic range of 7 ± 2 described by 44 .
These 8 remaining topics are prognostics, statistics, maintenance, quality control, management, physics of failure, modeling, and risk assessment.There are 22,275 documents in these topics.We can visualize the clustering of documents shaded by these topics using uniform manifold approximation and projection (UMAP) as described in 34 .This is shown in Figure 4 .
The following describes each topic in brief.

Software
This topic focuses on the related but unique field of software reliability, also including site reliability engineering.Subject matter ranges from development and testing (hence the "bug" and "developers" terms) to deployment of software.Subtopics for  3 .The most representative paper (closest to the cluster centroid) is 45 's "A Runtime Monitoring Based Fuzzing Framework for Temporal Properties."

Management
The second most populous topic included papers which describe engineering management practices which impact reliability.We note that the relatively broad term of "engineering" is represented as the most frequent term.Since this term was likely present in other areas, that it showed up here indicates it must have been truly over-represented.The most representative paper is 46 's "An introduction to quality assurance in the information processing industry."

Statistics
The authors expected this topic (with modeling) to have the largest document population, but it ranked third.This topic included papers discussing reliability from a probabilistic standpoint, focused on prediction and estimation.That the oft-used Weibull distribution appears as the most frequent term is not a surprise.The most representative paper is 47 's "Bayesian design of life testing plans under hybrid censoring scheme." Modeling Statistics are often applied to modeling, which is the next topic identified.These papers discussed representation of complex systems in an effort to model and predict reliability.(Minimal) cut sets and binary operations are frequently used in reliability modeling, so they appear in the term list.The most representative paper is 48 's "Discrete time dynamic reliability modeling for systems with multistate components."

Physics of failure
In terms of trends, the authors expected papers related to physics of failure to show the highest growth.These papers focus on descriptions of specific failure mechanisms and their effect on reliability.Because these are domain-and application-specific it is somewhat surprising they were aggregated.The top terms show that most of the papers related to electronic component physics of failure, though other papers in the topic discussed plastics and metallurgical crack propagation.The most representative paper is 49 3 Software topic sub-topics as identified with a target 5 topic hierarchical reduction.We note that these could be mapped onto a timing dimension like the top-level topics, though the dimension would be unique to software development.

Risk analysis
This topic seems closely related to the management topic, but is unique in its focus on safety and risk.The top terms include "hra" (hazard and risk assessment) and "human" indicating its focus on the role of operators rather than on hardware failure as is typically the focus of reliability engineers.The most representative paper is 50 's "Probabilities are useful to quantify expert judgments."Maintenance This topic deals with service or maintenance of equipment, related to reliability engineering by way of the fact that failures necessitate service and by the relatively new field of reliability-centered maintenance.Specifically, we note that the most common term is "preventative," so the papers are focused on increasing reliability through actions which prevent systems from failing.The most representative paper is 51 's "Optimum policies for a system with general imperfect maintenance."

Quality control
As discussed in the introduction, the line between quality and reliability can be blurry so it is expected that some papers discussing quality would also discuss reliability engineering.Those included in this topic are limited to process quality, specifically techniques which measure and track process performance.The most representative paper is 52 's "A new non-parametric CUSUM mean chart."Infrastructure This is a domain topic that primarily includes papers discussing transportation and defense infrastructure reliability.Subtopics for this topic are shown in Table 4 .The most representative paper is 53 's "Serviceability of earthquake-damaged water systems: Effects of electrical power availability and power backup systems on system vulnerability."Prognostics Prognostics and health management (PHM) is a topic that the authors expected would be small but growing, as is represented here.Papers in this topic focus on on-line assessment of systems for probability of failure as well as prediction of remaining useful life (RUL).The most representative paper is 54 's "Deep learning-based remaining useful life estimation of bearings using multi-scale feature extraction."Nuclear Another domain topic, this one focuses on reliability in nuclear power plants.Subtopics for this topic are shown in Table 5 .The most representative paper is 55 's "Probabilistic analysis of flow control as an alternative to level control for BWR ATWS." This visualization shows clear subtopic clusters within each of the eight reduced topics.This aligns well with the fact that the algorithm identified 297 topics as discussed previously.

Domain sub-topics
We can repeat the same top2vec work flow for each of the three domain topics, identifying all topics within each and subsequently aggregating them into meaningful reduced topics.These are shown in Tables 3 (software), 4 (infrastructure), and 5 (nuclear).Visualizations for these domain sub-topics are included in the Appendix.

Selected reduced topic hierarchy
As noted previously, hierarchical topic reduction requires some degree of subjectivity to assign a target number of topics., the same used to identify topics using top2vec.Each dot represents a publication.There are 22,275 publications represented in this visualization.Note the visible sub-clusters among each topic which hint at the full 297 topics.We can also see that similar topics (like risk analysis and management) tend to appear geometrically closer than those which we might expect to be more dissimilar (like quality control and physics of failure).4 Infrastructure topic sub-topics as identified with a target of 8 topic hierarchical reduction.These appear to include numerous further domain-specific topics, so assignment of a timing score would be difficult without probing at a lower level.
Constraining the topic reduction to 3 topics, the topics are "charts, estimators, multivariate", "repairable, repair, preventive", and "software, virtualization, developers".The main issue here is that the large volume of software papers has resulted in less fidelity of other topics.The first topic appears to combine statistics, quality, and modeling, while the second deals with maintenance.Clearly there is cause for more than 3 topics.
If we constrain the method to produce exactly 20 topics, the topics are: Looking at this list, there are several topics which would be readily aggregated when looking at our timing dimension.For example, "trees, tree, boolean" and "uml, language, checking" would both relate to modeling.At the same time, topics like "engineering, book, topic" and "experts, linguistic, opinions" cut across many domains and tools and are therefore impossible to place in the proactive/reactive (timing) dimension.We do see many of the same topics in our 11-topic reduction, which gives credence to the importance of those topics.

Trends in topics
With the topics identified and assigned to papers, we can answer RQ2 (How has the volume of work in these topics changed over time?) by looking at how the volume of reliability engineering literature in these topics has changed over time.This result is shown in Figure 5 .This addresses RQ2, showing that the volume of reliability papers is rapidly growing in nearly all areas.
One thing that stands out from looking at document counts over time is the significant growth seen in all topics.For the 8 topics considered, Modeling showed a 53 % average annual growth, Maintenance 42 %, Prognostics 40%, Management 36%, Risk Assessment 29%, Physics of Failure 28%, Quality Control 27%, and Statistics 19%.Overall, this is a 35% average annual growth rate.To remove this dominant factor and tease out the signal of how topic popularity is changing, we can consider the fraction of each topic represented in each year.This is visualized in Figure 6 .
From this plot, we can see clear differences in growth for each topic, represented by their differing slopes.We can directly visualize this growth by plotting the year-over-year growth in the number of publications for each topic.Since this number varies significantly and we are interested in trends, we only plot a smoothed version of this growth rate in Figure 7 .6 These are the timing or product development phase dimension assignments for each topic.Each was assigned based on discussions of best practices as described in 6 .Numeric scores were associated with each phase (1 through 6) as described in Section 2.5.1.The scores are linear and used only for calculation of mean product phase in visualizations.

Trends in proactive/reactive
In order to leverage the scale developed in Section 2.5.1, we must place the 8 identified topics on the timing dimension.Recall that a topic should be mapped to the earliest phase in which it makes a significant contribution to product reliability.Through this lens, we can assign the topics as shown in Table 6 .Finally, we can plot the trend over time as shown in Figure 8 .This addresses RQ3., indicating that reliability publications are becoming more reactive.

DISCUSSION
We can now leverage the results and analysis to identify patterns and ultimately answer our research questions.In addition, we reflect on the execution of the study and specifically challenges posed by the state of academic literature.

Analysis of topics
The 8 non-domain-specific topics identified in this study can be compared with those used elsewhere to partition the reliability engineering field.Previously mentioned, 18 is the most similar study to the present since it employs unsupervised clustering.
The key difference is that their document population included quality papers, hence their topics were skewed in that direction.As mentioned in Section 2.3.1, they found the reliability cluster had the following topics: fuzzy methods, reliability systems, sampling and inspection, software, maintenance, failure, warranty/repairs, and models.There is actually quite a lot of overlap between their findings and those of the present study.For the topics which are not clear matches, sampling/inspection maps to our quality topic and fuzzy methods maps to statistics.The remainder translate directly into our topic list.The primary gap appears to be higher-level management topics (including risk assessment).This may be because those types of papers aren't published in the studied journal or more likely because management topics were included in one of their top-level clusters.Another possible angle for checking for agreement is to look at the ASQ reliability body of knowledge 8 .There we find the following sections: management, statistics, design and development, modeling and predictions, testing, maintainability, and data collection.Again, these align well with the topics found in the present study.Testing and data collection are the most unique areas here and it would be difficult to say where they might fit into our topics.In terms of omissions, the lack of a prognostics section aligns with the idea that industry lags academia regarding the forefront of the field.
Finally, we can also look to reliability textbooks to see how the topic is introduced to students and new practitioners. 56ncludes the following chapters: 4 on probability and statistics, testing, failure modes and effects analysis, loads and capacity, maintenance, failure interaction, and safety.There is a clear bias in this text towards probability and statistics, with many topics uncovered in this study not represented.

Interpretation of trends
At a high level, the most noticeable trends are that reliability engineering literature is becoming both more reactive and more practical.As discussed in Section 3.4, the topics associated with late product development phases have been increasing in popularity since 1955 (the earliest year in our document population).In fact, other than a brief period around 1990, papers have consistently shifted towards later development phases.
Why might this be?First, we recognize that while it is important to have strong management early on and leverage modeling to assess concepts, there is a large amount of uncertainty which makes reliability prediction (and therefore distinct action) challenging early in a development project.That is, there are too many unknown unknowns early in a project to adequately predict reliability.It is challenging enough to predict a product's performance during the concept stage, so extrapolating that performance out over potentially many years in uncontrolled contexts is even more difficult.
These early practices are not necessarily considered less important, but rather they were foundational to the field and do not necessitate literature to be improved upon.As discussed in 6 , the rather recent "design for reliability" movement is not just about the design of the product, but rather doing what you can at every phase of the development process to improve reliability.
From a project risk perspective, it is clearly preferable to build confidence in the product's reliability as early in the development project as possible.Thus, we would expect that most effort in the reliability engineering field would be aimed at the early phases.At some point researchers realized that the metaphorical well was drying up and thus they moved onto the next phases to understand what could be done.This process repeated until the recent advent of reliability-centered maintenance and prognostics and health management began addressing the post-sale portion of the product life cycle.Thus, we can hypothesize that although intuition and general interest would prioritize early-stage reliability work, the field recognizes the fruitless nature of this and instead focuses on more successful late-stage methods.
It seems unlikely that in 20 years the only new reliability literature will be on prognostics and maintenance, thus we can't expect this trend to continue.We imagine the curve in Figure 8 will level off soon, and perhaps decrease as latter-phase areas of research mature.
The trend in practicality is more straightforward and we believe linked to the aforementioned trend towards later development phases.Simply stated, as products enter later phases in the development process they are more formed.This means more details about the product and execution of a reliability tool matter and therefore are likely to be shared in the literature.Said another way, it is much easier to write a robust paper on statistical techniques with concepts discussed via mathematical proof versus discussing a maintenance program without any context for the product specifics.
We can also consider growth trends for individual topics.Prognostics is of interest, showing a significant swell in the late 1990s.We believe this relates to the proliferation of the internet and greater ability for systems to report on their condition.The current upswing may be related to new capabilities afforded by deep learning (the most representative paper for this topic happens to be about deep learning, see 54 ).
Another interesting trend is that of modeling, which saw an extreme peak in the 1980s before leveling out.As with Prognostics, we believe this is due to technological enabling.Computers became much more useful for semi-complex system modeling around this time so it makes sense that reliability modeling would show growth.We do note that the large initial decline is more of an artifact of the very small number of papers in those early years and do not ascribe any specific meaning to this trend.

Answering the research questions
We can also reflect and discuss the research questions posed in Section 1.3.

RQ1. What topics comprise the body of reliability engineering academic literature?
We found that reliability engineering literature is comprised of at least 279 topics.From these granular topics, we find there to be 11 aggregated topics: Topics with an asterisk were determined to be domain-specific artifacts of the aggregation process and not associated with specific reliability tools or techniques.

RQ2. How has the volume of work in these topics changed over time?
We found that all topics show consistent positive growth over the last 20 years (35% on average).We note that this is far in excess of the overall scientific and engineering publication growth rate (approximately 4% per 57 ).Currently, the prognostics topic shows the highest level of growth (over 30%) while modeling shows the lowest (below 5%).These trends can be seen in Figure 7 .

RQ3. Are reliability engineering publications becoming more or less geared towards proactive versus reactive interventions?
Using our assignments of topics to development phases discussed in Section 3.4, we see that publications are increasingly geared towards tools and techniques that occur later in the development process.This can be considered "more reactive," though it is arguable since often the activities are planned near the beginning of the project.They only occur near the end since they require a more mature product.This overall trend is visualized in Figure 8 .

Reflection on text mining the literature
One of the main limitations and therefore disappointments of our results was the lack of coverage highlighted in Section 3.1.1.This analysis achieved between 4 and 26% coverage of references in those review papers in part due to the fact that some of those references would be outside the reliability engineering field, but also due to the state of structured data access of academic literature.The former could be addressed by "snowballing" references (adding referenced papers to the corpus), but that requires structured data access to parse those references.Thus, we see that the state of structured data in academic literature is truly the problem.
The first issue is one of open access.As noted in 58 , although large portions of the academic community are embracing open access by openly publishing their works, historical documents remain controlled by a handful of publishers.This is problematic for a study such as the present as we need full (or at least representative) access to those past articles to establish trends over time.Institutions may not subscribe to every publisher's platform or may only subscribe to certain date ranges or journals.Without complete access, coverage will be negatively impacted.
The other issue is that even when one does have access, publishers of academic literature provide different and often inconsistent access to their own collections.In the course of this research, APIs for Elsevier, JSTOR, Web of Science, ProQuest, and CrossRef were explored but none satisfied the research needs entirely.Some provided the needed fields but lacked coverage while others suffered the converse.Platforms which cross publisher boundaries, such as Elsevier Scopus, include only metadata, since the full text is seen as the publishers' core asset.
The standard unit of literature continues to be a styled PDF, a document which is not conducive to machine analysis.Publishers maintain metadata databases in parallel to the full text documents, but often this is not the case for historic documents which is an issue for studies like this one.Since those documents are controlled by those publishers, third parties cannot freely create rich metadata databases.
Some effort has been expended to address these issues and provides a possible avenue for future work as discussed in Section 5.3.Carl Malamud's Public Resource created the General Index ( 59 ) as a response to the difficulty he perceived in text mining academic literature.That it took a non-publisher exercising potentially extra-legal methods to produce the necessary database to conduct modern analysis of scientific literature demonstrates the limitations of the status quo.

Reliability outlook
We started this research with a decidedly negative view of the reliability engineering field.Outside of academia, much of the reliability engineering profession feels stagnant, clinging to the same methods developed in the 1950s.This research indicates a certain level of self-awareness within the published literature to the limitations of these methods and demonstrates a consistent move toward more effective tools.
Heavily regulated and/or prescribed sectors like defense and aerospace will likely lag the rest of the industry in adoption of new reliability tools, but commercial industries where the primary concern is meeting customers' reliability expectations can move much faster.Indeed, most of the examples in Design for Reliability ( 6 ) are from the automotive and electronics industries.

Lessons learned
The core lesson from our results is that reliability engineering is increasingly seen as an outcome measured by customer experience rather than a specific set of tools.Reliability managers should be accountable for customer experience rather than specific deliverable like system models and predictions.
An example of this change in mentality is demonstrated by the author's experience as a reliability manager at a robotics company.While business goals were stated in terms of specific metrics, the reliability strategy that was implemented focused less on those metrics and more on improving reliability throughout the development process.This manifested as a heavy focus on system-level testing, a decidedly late-stage activity.The next focus was on development of a prognostics program, including hiring of dedicated resources to build that functionality into products.

Future work
As discussed in Section 4.4, the primary limitation of this research is related to the lack of document coverage.Increasing the size of the corpus through utilization of different database(s) would be a straightforward extension to the present work.
This could be accomplished either with access to commercial databases like Elsevier Scopus or by leveraging third-party databases like the General Index ( 59 ).The latter is of particular interest since it provides n-grams for document full texts which may provide even more robust topic modeling compared to the present study which was restricted to abstracts.The trade off would be a loss of context since these n-grams would only enable bag-of-words analysis.It would therefore be appropriate to leverage traditional latent Dirichlet analysis to perform topic analysis.

FIGURE 7
Year-over-year growth rate of top-level topics in reliability engineering literature.Traces have been smoothed with locally estimated scatter plot smoothing (LOESS) to extract trends.The inset highlights trends between 2000 and 2023.Note that prognostics consistently the highest growth rate which continues to increase.Early growth rates were highly volatile due to low absolute paper counts.This volatility coupled with the smoothing algorithm produces artifacts such as the extremely negative growth rate of modeling.

FIGURE 8
Mean timing period (product development phase) associated with the main topic of reliability engineering literature over time.Topics are correlated with phases and a weighted average is used to summarize the topic mix of papers for each year.The trace has been smoothed with locally estimated scatterplot smoothing (LOESS) to extract trends.The shaded region represents a 95% confidence interval for the smoothing.Note that reliability publications increasingly focus on activities which occur later in the product development process.

1 .
Promise a minimum life, never use averages 2. Spend a lot of time on requirements 3. Measure all life cycle costs 4. Design for twice the promised life 5. Safety-critical components should be designed for four lives 6.Consider the full life cycle when making design trades 7. Design to avoid latent manufacturing flaws 8. Design for prognostics and health monitoring

2 .
Methods: definition of the research strategy and tools 3. Results: outcome and artifacts from execution of the research strategy 4. Discussion: placing those results in context 5. Conclusion: reflections on the research process In the course of this work, we will address these core research questions: RQ1.What topics (areas of common subject matter) comprise the body of reliability engineering academic literature?RQ2.How has the volume of work in these topics changed over time?RQ3.Are reliability engineering publications becoming more or less geared towards proactive versus reactive interventions?

FIGURE 1
FIGURE 1Overview of research procedure/pipeline.The process input is the entire body of academic literature while the core outputs are aggregated topics along with reactive/proactive and practicality scores.Secondary outputs are the full topic list and validation accuracy from both the topic modeling and classification processes.

FIGURE 3
FIGURE 3Top-level topic counts in reliability engineering papers (cumulative).All documents are included except for those in domain-topics.

FIGURE 4
FIGURE 4Two-dimensional representation of top-level reliability topic clustering using the UMAP algorithm (34), the same used to identify topics using top2vec.Each dot represents a publication.There are 22,275 publications represented in this visualization.Note the visible sub-clusters among each topic which hint at the full 297 topics.We can also see that similar topics (like risk analysis and management) tend to appear geometrically closer than those which we might expect to be more dissimilar (like quality control and physics of failure).

FIGURE 5
FIGURE 5 Top-level topic count in reliability engineering papers over time.Topics are stacked according to count of publications in 2022.Note the overall exponential growth.Prognostics had the largest count of publications in 2022.

FIGURE 6
FIGURE 6 Top-level topic proportion in reliability engineering papers over time.Topics are stacked according to count of publications in 2022.Note that prognostics represented the largest fraction of publications in 2022.

TABLE 2
Topic modeling results.The model automatically identified 297 topics; these 11 were generated using hierarchical topic reduction.Target reduced topic numbers between 5 and 20 were surveyed and 11 was qualitatively determined to be the optimal amount of aggregation to elucidate the maximum number of relevant top-level topics.
's "The properties of 2.7 eV cathodoluminescence from SiO2 film on Si substrate."

TABLE 5
Nuclear topic sub-topics as identified with a target of 8 topic hierarchical reduction.We note that these could be mapped onto a timing dimension like the top-level topics, though the dimension would be unique for nuclear plant development.