Machine learning for identifying relevant publications in updates of systematic reviews of diagnostic test studies

Updating systematic reviews is often a time-consuming process that involves a lot of human effort and is therefore not conducted as often as it should be. The aim of our research project was to explore the potential of machine learning methods to reduce human workload. Furthermore, we evaluated the performance of deep learning methods in comparison to more established machine learning methods. We used three available reviews of diagnostic test studies as the data set. In order to identify relevant publications, we used typical text preprocessing methods. The reference standard for the evaluation was the human-consensus-based binary classification (inclusion, exclusion). For the evaluation of the models, various scenarios were generated using a grid of combinations of data preprocessing steps. Moreover, we evaluated each machine learning approach with an approach-specific predefined grid of tuning parameters using the Brier score metric. The best performance was obtained with an ensemble method for two of the reviews and with a deep learning approach for the third. Yet, the final performance of an approach strongly depends on the data preparation. Overall, the machine learning methods provided reasonable classification performance. It therefore seems possible to reduce human workload in updating systematic reviews by using machine learning methods. However, as the influence of data preprocessing on the final performance appears to be at least as important as the choice of the specific machine learning approach, users should not blindly expect good performance simply by using approaches from a popular class, such as deep learning.


| INTRODUCTION
In patient-centered medicine, the integration of external evidence is a crucial component in deciding on the use of medical services. Clinicians, researchers, and health policy makers have to deal with a multitude of publications within their field of expertise. A systematic review summarizes the empirical evidence according to a priori defined inclusion criteria and serves as an external evidence basis for clinical decisions. 1 The increasing number of studies, as seen in the field of medicine in recent years, 2 also leads to the need for a higher frequency of systematic review updates. Updating systematic reviews is a very resource-intensive process, which creates a major barrier to up-to-date evidence syntheses. Intelligent technical support systems, for example based on machine learning techniques, are a promising approach for reducing human effort in this updating process. In particular, deep learning techniques, which have recently become quite popular, are expected to considerably advance this area. We specifically considered the setting of systematic reviews of diagnostic test studies and compared the performance of several machine learning techniques, including deep learning approaches.
In the context of systematic reviews and living reviews, that is, continuously updated reviews, various computer-assisted approaches have been introduced to increase efficiency. [3][4][5] Currently, however, living reviews are not widely used, 5 as the human effort for continuous updating within a few months is very substantial. Despite existing efforts to automate sub-processes of systematic/living reviews, these are only rarely applied in research practice. Initially, studies investigated the performance of established machine learning methods, such as Support Vector Machines or Random Forests. The results obtained with these methods have already highlighted the potential of such support systems. 3 Recently, the application of deep learning methods for natural language processing has been growing in addition to conventional machine learning methods. 6 Machine learning and in particular deep learning methods have mostly been successful in big data applications. In this study, we investigated to what extent such techniques can be useful in settings with relatively small data sizes.
To the best of our knowledge, only one study by Marshall et al. 7 compared convolutional neural networks (CNN) with other machine learning methods in the context of systematic reviews. Furthermore, current research has mainly focused on the classification of publications for which detailed reporting standards have long been available like randomized controlled trials (RCTs).
We specifically evaluated the performance of machine learning methods for identifying relevant publications in the context of updating a systematic review of diagnostic studies. Since diagnostic studies are characterized by a rather low standardization of reporting, their classification is more difficult. 8 In addition to comparing different machine learning techniques, we also focused in particular on the effects of data preprocessing and tuning parameter selection on classification performance, as these might have considerable influence.

| Data sets
We used three reviews to assess the performance of various machine learning methods in the context of systematic reviews (Table 1). These are data from the screening process of two already published reviews (R-1, 9-11 R-3 11 ) of diagnostic studies in the field of orthopedics and physical therapy. The review R-2 12 contains data of an unpublished Cochrane review. The sizes are typical for diagnostic test systematic reviews in the field of medicine. The investigated systematic reviews aimed to evaluate the validity and reliability of a physical examination.

What is already known
• Updating systematic reviews is a resource- and time-intensive process that requires a lot of human effort.
• Machine learning algorithms were successfully tested in the automatic classification of studies with a highly standardized reporting quality, such as randomized controlled trials (RCTs).

What is new
• This study shows the potential of different machine learning algorithms to reduce human workload in updating systematic reviews of studies with a low standardized reporting quality.

Figure 1 illustrates the entire study process, comprising text preprocessing, scenario preparation, training of machine learning methods, and evaluation. In order to identify relevant publications, the title and abstract of each publication were available as training data. The reference standard for the evaluation of the machine learning methods was the human-consensus-based classification (inclusion, exclusion) of the studies based on title and abstract screening. The various phases of the study are described in more detail below.

| Text preprocessing
After the export from the literature databases, the titles and abstracts differed highly in terms of format. Abstracts, for example, varied in structure, and sometimes additional unnecessary information was reported that had to be removed, as it would otherwise have distorted the classification. Therefore, in the first processing step, we preprocessed the texts with the aim of creating a consistent record format with optimized information content. We used various text cleaning procedures and removed additional content (e.g., journal or author information) by using regular expressions in an iterative process. Furthermore, we applied standardized text procedures (punctuation handling and upper/lower case, replacement of numbers and symbols, as well as removal of stop words) to all titles and abstracts. Two different types of text reduction techniques were compared to each other: (a) stemming, in which individual words are reduced to their undeclined root word so that words with the same content are identified as the same token, and (b) lemmatization, in which inflected words are grouped together into their base form.
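The analyses themselves were implemented in R (see below). Purely as an illustration, the standardized text procedures described above can be sketched in Python as follows; the toy stop-word list and cleaning rules are hypothetical and do not reproduce the exact procedures used in the study:

```python
import re

# Toy stop-word list for illustration; the study used a full standard list.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "was", "were", "is"}

def clean_text(text: str) -> list[str]:
    """Sketch of the standardized text procedures: lower-casing,
    replacement of numbers, punctuation/symbol handling, and
    stop-word removal."""
    text = text.lower()
    text = re.sub(r"\d+", "num", text)     # replace numbers with a placeholder
    text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation and symbols
    return [t for t in text.split() if t not in STOP_WORDS]

print(clean_text("The sensitivity was 0.95 (95% CI)."))
# ['sensitivity', 'num', 'num', 'num', 'ci']
```

Stemming or lemmatization would then be applied to the resulting tokens as a separate step.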

| Data scenario preparation
After text preprocessing, all titles and abstracts were prepared for model training, and various scenarios were generated using combinations of data preprocessing steps. Figure 2 outlines this procedure. Our aim was to ensure a fixed allocation of publications to training and test data and to provide various data preparation scenarios for model training in a compact way. We manipulated the sampling ratio of relevant versus irrelevant publications (inclusion/exclusion) to investigate a possible effect of the class balance on model training.
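Manipulating the sampling ratio can be illustrated with a simple undersampling sketch (Python, for illustration only; the function name and ratio are hypothetical, and the study's actual resampling procedure may differ):

```python
import random

def rebalance(included, excluded, ratio, seed=1):
    """Undersample the majority class (excluded publications) so that
    there are at most `ratio` exclusions per inclusion."""
    rng = random.Random(seed)  # fixed seed for a reproducible allocation
    k = min(len(excluded), int(ratio * len(included)))
    return included + rng.sample(excluded, k)

# toy data: 5 relevant vs. 100 irrelevant publications, target ratio 2:1
train = rebalance(list(range(5)), list(range(5, 105)), ratio=2)
print(len(train))  # 15
```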
Two different types of data representation were required for the different machine learning methods. 13 On the one hand, (a) conventional machine learning methods use sparse vectors/document-term matrices as input data. On the other hand, (b) deep learning methods with word embeddings require the input data represented as dense vectors. For both types of data representation, we varied (a) the two types of text reduction techniques and (b) the maximum number of initial tokens (original frequency, fifteen, ten as well as one thousand and five hundred). Furthermore, for the document-term matrix, we used (a) term frequency (Tf) and (b) term frequency-inverse document frequency (Tf-idf) as weighting schemes for tokens and varied the number of n-grams between one and three words/tokens. This resulted in 150 different data scenarios per review for the conventional machine learning methods and 50 data scenarios for the deep learning approaches.
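The two weighting schemes for the document-term matrix can be sketched as follows (a minimal Python illustration with a toy corpus; the study itself used R, and this simple Tf-idf variant is only one of several common definitions):

```python
import math
from collections import Counter

def build_dtm(docs, scheme="tf"):
    """Build a document-term matrix from tokenized documents.
    scheme: 'tf' (raw term frequency) or 'tfidf' (term frequency
    weighted by the inverse document frequency log(N / df))."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n = len(docs)
    df = {tok: sum(1 for doc in docs if tok in doc) for tok in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        if scheme == "tfidf":
            rows.append([counts[t] * math.log(n / df[t]) for t in vocab])
        else:
            rows.append([counts[t] for t in vocab])
    return vocab, rows

docs = [["shoulder", "test", "test"], ["shoulder", "mri"]]
vocab, dtm = build_dtm(docs)
print(vocab)  # ['mri', 'shoulder', 'test']
print(dtm)    # [[0, 1, 2], [1, 1, 0]]
```

With the "tfidf" scheme, tokens that occur in every document (such as "shoulder" here) receive a weight of zero, which is how the inverse document frequency downweights uninformative terms.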

| Model training
We trained different models on each specific data scenario ( Figure 2). Overall, we investigated six conventional machine learning methods, two deep learning methods, and finally, two ensemble methods.
Further, we investigated each method with an approach-specific predefined grid of tuning parameters (Table S2).
We evaluated two different ensemble methods that combine conventional machine learning methods: (E-1) an ensemble stack with a Random Forest algorithm as meta-learner and (E-2) a soft-voting ensemble averaging the prediction probabilities of the base learners. Both ensembles (E-1, E-2) used the scenario-specific best-trained base learners: (B-1) logistic regression with elastic net regularization, (B-2) Support Vector Machines, and (B-3) a Random Forest algorithm. In addition, we used a soft-voting ensemble (E-3) to combine the deep learning methods.
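Soft voting (as in E-2 and E-3) simply averages the predicted inclusion probabilities of the base learners; a minimal sketch (Python, with hypothetical probabilities, for illustration only):

```python
def soft_vote(prob_lists):
    """Average the predicted inclusion probabilities of several
    base learners, publication by publication."""
    n = len(prob_lists)
    return [sum(ps) / n for ps in zip(*prob_lists)]

# hypothetical probabilities from three base learners (B-1 to B-3)
# for four publications
p_glmnet = [0.9, 0.2, 0.6, 0.1]
p_svm    = [0.8, 0.3, 0.4, 0.2]
p_rf     = [0.7, 0.1, 0.5, 0.3]
print(soft_vote([p_glmnet, p_svm, p_rf]))  # approximately [0.8, 0.2, 0.5, 0.2]
```

The stacking ensemble (E-1) instead feeds these base-learner probabilities as input features into a meta-learner, here a Random Forest.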
The pseudo code in Figure 3 illustrates the entire model training process. For each model, a 10-fold cross-validation with three repetitions was used to identify the optimal tuning parameter set for each data scenario. To enable a fair comparison, it was ensured that the resampling was identical across all models during cross-validation. The optimal data scenario for each algorithm, with its specific optimal tuning parameters, was determined by averaging across the hold-out predictions during training and choosing the best-performing scenario.
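The resampling scheme can be sketched as follows (Python for illustration; the study used R's caret). Fixing the random seed makes the fold assignment identical for every model, which is what enables the fair comparison described above:

```python
import random

def repeated_kfold_indices(n, k=10, repeats=3, seed=42):
    """Generate (train, test) index splits for repeated k-fold
    cross-validation. A fixed seed yields identical resampling
    across all models."""
    rng = random.Random(seed)
    splits = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
        for j in range(k):
            test = folds[j]
            train = [i for f in folds[:j] + folds[j + 1:] for i in f]
            splits.append((train, test))
    return splits

splits = repeated_kfold_indices(100)
print(len(splits))  # 30 splits: 10 folds x 3 repetitions
```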

| Model evaluation
The evaluation of the methods was carried out step by step. As outlined in Figure 3, (a) we selected the optimal tuning parameter constellation per scenario, (b) we identified the optimal scenario per method, and then (c) we used the optimal tuning parameters per method and scenario to determine the predictive value of the respective method on independent test data.
1. The Brier score (Formula 1) was primarily used as the selection criterion for the optimal tuning parameters. This score describes the precision of predictions by determining the distance of the predicted probabilities from the reference standard. 15
2. To enable a fair comparison between scenarios with differently balanced proportions, the Brier score for each scenario was compared to the scenario-specific null model. We call this the "Brier comparison score" and used it to determine the optimal scenario (Formula 2). As the null model, we trained an intercept-only model.
3. After the models were trained with the optimal tuning parameter set on the optimal scenario, we determined the final prediction probabilities on a corresponding independent test data set. The Brier comparison score was primarily used as the evaluation metric. Other metrics, such as sensitivity, specificity, as well as positive and negative predictive values, were also calculated.
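Formulas 1 and 2 are not reproduced in this text. The following Python sketch shows one plausible reading of both scores, in which the intercept-only null model predicts the observed prevalence for every publication and positive comparison scores indicate a model better than the null model:

```python
def brier_score(probs, labels):
    """Formula 1 (as we read it): mean squared distance between
    predicted inclusion probabilities and the 0/1 reference standard."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def brier_comparison_score(probs, labels):
    """Formula 2 (as we read it): Brier score of the intercept-only
    null model minus the model's Brier score, so that positive values
    indicate a model better than the scenario-specific null model."""
    prevalence = sum(labels) / len(labels)  # intercept-only prediction
    null = brier_score([prevalence] * len(labels), labels)
    return null - brier_score(probs, labels)

# a perfect model on an imbalanced toy set beats the null model
labels = [1, 0, 0, 0]
print(brier_comparison_score([1.0, 0.0, 0.0, 0.0], labels))  # 0.1875
```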

| Sensitivity analysis
To compensate for the high class imbalance (few inclusions/many exclusions) that is inherent in all title and abstract screenings, we lowered the default threshold value (50:50) for binary classification to evaluate its influence on the prediction success. We reduced the threshold from 0.5 to 0.2 to improve the classification sensitivity of the models (i.e., to lower the false negative rate). All statistical analyses were performed using the statistical software R. 16 For training and systematic evaluation, the caret package 17 was used, which streamlines the process of creating predictive models, including the implementation of self-developed algorithms, loss functions, and evaluation metrics. This package also enables a direct and consistent comparison of models.
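The effect of lowering the classification threshold can be sketched as follows (Python, with hypothetical predicted probabilities, for illustration only):

```python
def classify(probs, threshold=0.5):
    """Binary classification of predicted inclusion probabilities."""
    return [1 if p >= threshold else 0 for p in probs]

def sensitivity(preds, labels):
    """True positive rate: share of relevant publications found."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    return tp / (tp + fn)

# hypothetical predicted inclusion probabilities and reference labels
probs  = [0.9, 0.35, 0.25, 0.1, 0.6, 0.15]
labels = [1,   1,    1,    0,   0,   0]

print(sensitivity(classify(probs, 0.5), labels))  # only 1 of 3 inclusions found
print(sensitivity(classify(probs, 0.2), labels))  # all 3 inclusions found
```

The price of the lower threshold is reduced specificity: more irrelevant publications pass the filter and must be screened manually.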

| RESULTS
Overall, three different but thematically similar systematic medical reviews were analyzed. Specifically, the titles and abstracts of all individual publications contained in these reviews were used, but no full texts. An abstract contained on average 225.2 ± 117.4 words (median = 226; min/max = 2/708). After data preprocessing (stemming/lemmatization), the maximum overall number of tokens (words) per review varied within a range of R-1: 16,807/19,162; R-2: 12,722/14,684; R-3: 5,924/6,905.
The data preprocessing grid (shown in Figure 2) resulted in 1038 different data scenarios in total (document-term matrix: R-1: 396; R-2: 342; R-3: 300; dense-vector representation: R-1: 66; R-2: 57; R-3: 50). Figure 4 gives a comparison of the models during the training step (cross-validation) for one review (R-1) as an example. Each data point represents a prediction for a specific data preparation scenario as measured by the Brier score. On the y-axis, the Brier comparison score is shown. Positive values indicate a better model performance than the scenario-specific intercept-only null model. Note that some methods tend to yield rather poor predictions with negative Brier comparison scores (e.g., K-NN, Naïve Bayes). Furthermore, the variation between models within one algorithm is higher than the variation between different algorithms. Depending on the specific data preparation procedures, a method might perform either well or rather poorly. Yet, there is no clear relationship between specific data preparation steps and the performance of the methods. The algorithms perform differently on the same data preparation scenarios, but with no distinguishable patterns. It seems that each algorithm needs its own optimal preprocessing steps, also strongly depending on the input data. Two different data preprocessing methods that were designed to reduce the dimensionality of the feature space showed an impact on model performance in many cases. One method is token reduction, which aimed to increase the signal-to-noise ratio. The other method is word embedding, which was specifically used for the deep neural networks. The latter takes the proximity of words into account and generates a lower-dimensional vector out of a high-dimensional matrix. We used the word embeddings to pre-weight the input data of our deep neural network models. The above findings remained consistent across all reviews examined.

| Evaluation of test data performance
The final evaluation of model performance (i.e., generalizability) was based on independent test data. Figure 5 shows a heatmap with the Brier comparison scores for all investigated reviews (R1-3). Cells with a light gray background (values above zero) point to a better performance of the corresponding model compared to the scenario-specific null model, while dark cells (values below zero) indicate a worse performance. The best-performing algorithm varied between the reviews examined (R1: Ensemble (Soft-Voting), R2: MLP (20 epochs), R3: Ensemble (Soft-Voting)). Across all reviews, there was no clear winner, and performance varied between reviews. Review R3 was the most challenging during test data prediction. In review R2, deep learning outperformed most conventional machine learning methods, while in reviews R1 and R3, ensembles of conventional machine learning methods were superior.
Note that model tuning, especially the use of thresholding or generalized word embeddings, has not yet been fully and systematically explored for the deep learning methods. Depending on the evaluation metric, the best-performing algorithm changed within a review. In the present case of updates of systematic reviews, there is always a strong class imbalance, with the minority class being of particular interest. Under this scenario, a stable evaluation metric incorporating this high class imbalance is needed; this would certainly improve the quality of the results. Ensemble methods may lead to better results than single algorithms here, since multi-model ensembles increased the sensitivity of the analysis purely by incorporating more than one algorithm, especially when aggregating with plurality or soft voting.

| Sensitivity analysis
For the sensitivity analysis, we lowered the classification threshold from 0.5 (equal probabilities on both sides) to 0.2, inducing a more sensitive and less specific classification. This clearly led to better results, since a more sensitive classification was superior in this task. Not missing a potentially relevant manuscript, and therefore increasing the inclusion rate (lowering the false negative rate), was of advantage here (Table S3-

| DISCUSSION
We investigated the performance of machine learning methods for systematic reviews in the challenging setting of diagnostic test studies. Besides the general level of performance, we were interested in whether deep learning approaches would have a considerable advantage over conventional machine learning methods. We found that deep neural network approaches were among the top performers but did not consistently or considerably outperform conventional machine learning methods such as the Random Forest algorithm or ensembles. This might be due to the rather low number of publications and the small amounts of text available from titles and abstracts.
Searching for the best algorithm for a task is only part of the picture. Based on our study of only three reviews, we did not intend to recommend a specific method for general use. Instead, our results suggest that the data preprocessing steps had much more influence on the final performance. With a bad choice of preprocessing steps, any approach could drop even below the performance level of the scenario-specific null model, which uses only an intercept and no words as predictor. Having encountered these difficulties in our three reviews, we expect that these issues will also arise in other reviews. On the other hand, most algorithms could perform quite well with a specific combination of data preprocessing steps. Unfortunately, it seems difficult to predict in advance which preprocessing steps are best suited for which algorithm and review. Therefore, we strongly encourage treating the data preprocessing, besides other factors, as a tuning parameter when training machine learning approaches. Yet, one consistent pattern is that techniques reducing the dimensional load of models are preferable. For example, token reduction, which increases the signal-to-noise ratio, is usually beneficial. The same trend is observed for embeddings, which reduce the dimensionality in deep neural network models. Our finding is that the choice of data preprocessing can have a dramatic effect, which can easily be overlooked when discussing different machine learning methods on a more general level. Thus, future research should, in our opinion, focus on investigating such effects also in other review question types.
In the process of updating systematic reviews, it is of utmost importance not to miss a relevant publication; thus, a high sensitivity was needed. In fact, in our medical context we aimed to achieve a sensitivity of one (zero missed studies), since all available evidence must be usable in medical decision-making. Accordingly, we required that a usable algorithm must have a sensitivity of one and a specificity that allows the number of publications to be examined manually to be reduced at least by a factor of two. By lowering the binary classification threshold from 0.5 to 0.2, we were able to achieve a sensitivity of one. At the same time, the number of publications that needed to be screened manually was reduced by more than 50% in each review. We did not use a systematic search for the optimal threshold for each review; instead, we deliberately chose a value of 0.2 for all sensitivity analyses. Cross-validation on separate test data would in principle allow finding the optimal threshold for the specific problem.

| Methodological considerations
A stable comparison metric in the setting of high class imbalance is desirable to clarify the question of which algorithm works best. In the presence of a high class imbalance, the null models tend to perform extremely well on all evaluation metrics used, including the Brier comparison score. In fact, the higher the imbalance, the better the null model performs. Another potential measure is the Net Benefit, 18 which accounts for the class imbalance by weighting the difference between the true and false positive rates. However, the advantage of the Brier score is that it incorporates the distance of the predicted probabilities, whereas the Net Benefit uses only the binary classification. Thus, the inclusion of a prevalence weighting factor in the Brier score could improve its performance in class imbalance problems.
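The Net Benefit at a threshold probability p_t is commonly defined as TP/N - FP/N * p_t/(1 - p_t); a minimal sketch (Python, with a hypothetical toy example, for illustration only):

```python
def net_benefit(preds, labels, p_t):
    """Net Benefit at threshold probability p_t:
    NB = TP/N - FP/N * (p_t / (1 - p_t)).
    The odds of the threshold weight the false positives,
    which accounts for the class imbalance."""
    n = len(labels)
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    return tp / n - fp / n * (p_t / (1 - p_t))

# toy example: 1 true positive, 1 false positive among 4 publications
print(net_benefit([1, 1, 0, 0], [1, 0, 1, 0], p_t=0.2))  # approximately 0.1875
```

A low p_t, as in our sensitivity analysis, penalizes false positives only mildly, matching the priority of not missing relevant publications.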
A recent study showed that "generic design classifiers" (distinguishing between RCTs and non-RCTs) can achieve acceptable performance for use in practice. 7 In contrast, our study focused on a specialized algorithm that simultaneously identifies the correct study design (diagnostic test studies) and topic (selection of studies investigating shoulder pathologies). Our results suggest that such a "review-specific classifier" for updating an existing review should additionally be the focus of further investigations, to explore the potential of human workload reduction with an acceptable proportion of false negative predictions. A combination of various metadata might further increase the classification performance. In addition to title and abstract, further information such as the full text, MeSH term indexing, or a hierarchical inclusion of keywords, among others, could be used. Thus, separate classifiers trained on the different input information could be combined for improved predictions, e.g., with a majority-vote ensemble.

| Strengths and weaknesses
One of the main strengths of this work is the realization and implementation of all working processes in a unified development environment, which allowed us to systematically generate and evaluate different models on different data sets with different data preprocessing steps. According to the review by O'Mara-Eves et al., 3 the Brier score has not been used as an evaluation metric in recent scientific research on machine learning methods in the context of updating systematic reviews. In favor of a balanced prevalence in the training and test data sets, we did not assess the potential effect of changes in the quality of reporting over time. The tested machine learning and deep learning methods represent only a selection of the methods potentially available for this classification task.

| CONCLUSIONS
Overall, the presented work shows that it is possible to reduce the human workload in updating systematic reviews by machine learning methods, even for studies with a low standardized reporting quality. There is no big performance difference between deep learning and other machine learning approaches. Instead, the influence of data preprocessing on the final model performance is quite strong. Optimizing the threshold selection and evaluating additional, desirably stable, metrics in high class imbalance settings seems more urgent and should be the topic of future research.