Machine learning in systematic reviews: Comparing automated text clustering with Lingo3G and human researcher categorization in a rapid review

Systematic reviews are resource‐intensive. The machine learning tools being developed mostly focus on the study identification process, but tools to assist in analysis and categorization are also needed. One possibility is to use unsupervised automatic text clustering, in which each study is automatically assigned to one or more meaningful clusters. Our main aim was to assess the usefulness of an automated clustering method, Lingo3G, in categorizing studies in a simplified rapid review, and then to compare the performance (precision and recall) of this method with that of manual categorization. We randomly assigned all 128 studies in a review to be coded by a human researcher blinded to cluster assignment (mimicking two independent researchers) or by a human researcher non‐blinded to cluster assignment (mimicking one researcher checking another's work). We compared time use, precision and recall of manual categorization versus automated clustering. Automated clustering and manual categorization organized studies by population and intervention/context. Automated clustering failed to identify two manually identified categories but identified one additional category not identified by the human researcher. We estimate that automated clustering has similar precision to both blinded and non‐blinded researchers (e.g., 88% vs. 89%), but higher recall (e.g., 89% vs. 84%). Manual categorization required 49% more time than automated clustering. Using a specific clustering algorithm, automated clustering can help categorize studies and identify patterns across them in simpler systematic reviews. We found that the clustering was sensitive enough to group studies according to linguistic differences that often corresponded to the manual categories.


Highlights
What is already known
Systematic reviews can use machine learning to drastically reduce time spent during study identification, particularly screening. There remains significant potential for automated approaches to reduce time needed for study analysis and categorization, but performance must also be measured.

What is new
Automated text clustering applied to included studies' titles and abstracts resulted in several useable thematic categories. The clustering algorithm Lingo3G was as precise as researcher categorizations, and had higher recall. Systematic reviewers without machine learning expertise can successfully implement automated text clustering.
Potential impact for RSM readers outside the authors' field
Automated text clustering can provide useable and valid categorizations of text. The time saved compared to human categorization outweighs the time needed to sort through and make sense of the automated categories.

| INTRODUCTION
Systematic review production is highly labor-intensive. A large number of studies must be identified and screened, and depending on the type of review, studies judged eligible must be read in full text, and their results extracted, synthesized, and reported. [1][2][3] As the number of published primary studies continues to increase each year, 4 current systematic review processes are scaling poorly: reviews are becoming more expensive to produce and more likely to require updates sooner as new studies are published. A decade ago, Bastian and colleagues 5 reported that 11 systematic reviews were published per day and called for innovative evidence synthesis methods, although the suggestions they gave mainly involved reducing the number of primary studies conducted and published. It has also been estimated that the average intervention review takes 1.25 years to complete, 2 and that within 2 years of publication, one of every four systematic reviews of effect within medicine and health will become outdated. 6 We need methods and tools that reduce unnecessary human labor and duplication of tasks to produce reviews at a speed that matches the needs of policymakers and new evidence production.
Computer-based automation and machine learning (ML) are of current interest for reducing costs and accelerating systematic review production. When successful, ML can reduce tasks that are resource-intensive (e.g., difficult or time-consuming) to tasks that can be performed more efficiently, quickly, and consistently via full or semi-automation. Screening, 7-9 risk of bias assessment, 10 and study design or quality classifiers 11,12 are some of the recent applications of ML to systematic reviewing. However, systematic reviewers are often cautious when adopting new review methods and are aware that the benefits and potential harms of new methods and tools should be characterized and tested before they are adopted. 13 This paper addresses the problem of categorizing studies based on study content, which has application in scoping reviews that aim to map the literature published on a particular topic, population, or context, and, in some cases, identify research questions for study in subsequent systematic reviews. There is relatively little research that has applied machine learning to this problem; Stansfield and colleagues provided an early and important case study. 14 This article presents a case study of the use of a clustering algorithm to define a new categorization system for a simple commissioned systematic review. We assess the utility of the resulting clusters, and compare precision, recall, and time use for completely manual categorizations versus researchers using automated clustering.
Our main aim was to assess the usefulness of an automated clustering method in categorizing studies in a simplified rapid review, then to compare performance (precision and recall) of this method against manual categorization. Ultimately, we wanted to determine whether we could "trust" a specific algorithm to cluster studies using its own categorization system, as much as we trust researchers to code to a researcher-created categorization system. Our intended audience is systematic reviewers who are not machine learning specialists.

| BACKGROUND
Machine learning (ML) encompasses a wide range of methods that fall under the broader terms "supervised" and "unsupervised" learning. 15 Supervised learning is a common ML approach, in which a model is first "trained" (fitted to data for which ground truth annotations are available), for the purpose of being used in a fully or semi-automated predictive mode to annotate new data (for which ground truth annotations are not available). In other words, a machine first "learns" how to do a new task and is then used to do that task at some performance level. Unsupervised learning can be used when ground truth annotations are not available, as in the problem we address in this paper. In this approach, an ML algorithm "learns" patterns from unannotated data to build a useful model of the population of studies from which the data originated.
Perhaps the best-known unsupervised ML method is clustering, 16 in which each item in a data set is assigned to one or more automatically identified clusters such that any two data items within the same cluster are similar in some useful way, and any two clusters are dissimilar in some useful way (see Figure 1). Clustering has been used extensively in information retrieval, for example to group web search results into meaningful categories (e.g. Aurora borealis results separated from Aurora the singer). In some applications, clustering is hierarchical, which means that some clusters are contained within other clusters that represent higher-level concepts.
Some clustering algorithms can also characterize each cluster in a way that is useful with respect to the domain of interest. For example, clusters can be automatically named, so that humans can understand how items within a given cluster are likely to be similar to one another. In the context of systematic reviewing, an automated text clustering system "analyses the distribution of terms (words) in a body of text (e.g., titles and abstracts) and identifies groups of documents that use similar combinations of words; clustering 'engines' often then apply a descriptive term to each cluster to aid human interpretation" (Carpentino et al. 2009, in Stansfield et al. 14 ).
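The idea that documents "use similar combinations of words" can be illustrated with a minimal sketch. This is not Lingo3G's method, which is not described here; it is a generic bag-of-words comparison, and the titles and function names below are hypothetical and invented for illustration only.

```python
from collections import Counter
import math


def bag_of_words(text):
    """Lower-case the text and count how often each word occurs."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 = no shared words)."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical titles, not taken from the review.
titles = [
    "youth in secure care",
    "secure care for youth",
    "anger management program evaluation",
]
vectors = [bag_of_words(t) for t in titles]

# The first two titles share three of their four words; the third shares none.
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
```

A clustering engine would place the first two titles together and the third elsewhere. Note that word order is discarded in this representation, which is why such vectors carry little meaning on their own.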
The utility of each cluster label may vary according to the algorithm's approach: description-centric algorithms attempt to uncover descriptive, interpretable, and unambiguous names for each cluster, and then assign text to a cluster. 17 Data-centric algorithms are focused more on grouping text than providing readable cluster labels; k-means methods are common, which vectorize text in a bag-of-words model such that the text loses any inherent meaning. There are also algorithms that fall in between data-centric and description-centric, such as suffix tree clustering, which produces cluster labels that are more informative than those of data-centric algorithms. 18
Within systematic reviews, most ML developments have addressed problems related to study identification, particularly screening. According to Marshall and Wallace, 19 "machine learning systems for abstract screening have reached maturity" (p. 5). Some researchers have gone so far as to recommend ML-based screening as best practice. 20 Automatic data extraction and analysis represent subsequent areas of development. 19,21 Weißer et al. have recently proposed using clustering to automatically categorize articles as low versus high interest when researchers are scoping the literature in order to develop specific research questions. 22
Clustering algorithms could also be used in the analysis phase of reviews. A simple form of analysis is categorizing studies based on content. This is important in scoping reviews, for example, in which reviewers aim to map the volume of literature that has been published on a particular topic, population, or context, and identify research questions or categories that might be studied in detail in subsequent systematic reviews. In scoping reviews, categories are informed by the research question, commissioner's needs, data accessibility (i.e., title and abstract or full-text), and resources. Categories can be defined a priori but are often adjusted iteratively, particularly during the pilot or early phase of the process.
Systematic reviewers are unlikely to have existing categorization schemes that can be used in new reviews, or annotated sets of primary studies that would facilitate use of supervised machine learning. Such reviewers must define a categorization scheme for each new review, either manually or, as we discuss in this article, by using ML methods such as clustering.
FIGURE 1 Automatic clustering of text in a single hierarchy
We are aware of only one pilot study that has used automated clustering in the systematic review process. Stansfield et al. 14 retrospectively applied a description-centric clustering algorithm to two large scoping reviews and assessed the face validity of the algorithm's clusters, and the performance of the algorithm's clusters compared to researchers' manually created categories. They found that automated clusters addressed eight of nine predetermined research questions. The clustering procedure was estimated to have high precision (i.e., most of the studies assigned to a given cluster were correctly assigned to that cluster). However, relatively few studies that were actually relevant to a given cluster were assigned to it. Moreover, performance varied greatly according to the cluster. In one review, clusters adequately captured broad topics of the included studies as well as the most common interventions, but not smaller interventions. The algorithm also struggled to describe qualitative studies. Stansfield et al. examined the performance of each cluster separately but did not report summary statistics of clustering at the level of an entire review.

| METHODS
This experiment was an early exploration of ML within the Cluster for Reviews and Health Technology Assessments at the Norwegian Institute of Public Health. ML activities in the cluster are coordinated by the ML implementation team, of which all authors are members. A published report 23 and strategy 24 provide more information on completed, ongoing, and planned activities and evaluations.

| Data
This exploratory study is based on a review of the use of secure institutions for children and youth, commissioned by the Norwegian Directorate of Children, Youth and Family. 25 The specific product commissioned was a "systematic literature search with categorization", a simple review product that includes only analysis of titles and abstracts. 26 It begins with a systematic literature search and screening of identified studies for relevance. The resulting product is an overview of the literature according to pre-defined topics (operationalized as categories), often with a focus on knowledge gaps, rather than an answer to a research question about effect or experience. It was therefore an ideal opportunity to trial the automated text clustering function as a potential aid in sorting, categorizing, or keywording studies.
This product aimed to identify the most recent research (published 2015-2020) on the effect of secure institutions for children and youth with behavioral problems. The intervention or phenomenon of interest included secure institutions, a specific program or approach within a secure institution, or children's experiences of the effect of secure institutions. Study designs of interest were literature reviews, studies with control or comparison groups, and qualitative studies. A systematic literature search in six databases and gray literature searches in Swedish, Norwegian, and Danish resulted in more than 13,000 references. The research team screened all studies at title/abstract level using EPPI Reviewer's "priority screening" function, a ranking algorithm that prioritizes likely relevant studies to be screened first and likely irrelevant studies to be screened last. 27 The product ultimately included six literature reviews, 25 controlled studies, 95 qualitative studies, and two mixed-methods studies, for a total of 128 publications. The categorization system of these 128 publications is described under Participants and Procedures.

| Participants and procedures
The participants comprised two researchers with PhDs and 3-9 years' experience with systematic reviews (AEM, HMRA), and one researcher with 1 year's experience with systematic reviews (PSJJ).
We compared study categorization as per our usual practice (two human researchers, and a third to reconcile conflicts) with two ML-based approaches (Figure 2). Arm 1 represented usual practice, and provided baseline values for time use and precision/recall. Three researchers (AEM, HMRA, and PSJJ) created a coding system, and each researcher applied the system to a distinct subset of all included studies; one of the researchers (AEM) then checked and reconciled all categorizations. The categorization system used in arm 1 was a two-level system created by the lead researcher (AEM) and refined through discussion with the other two researchers (HMRA and PSJJ). The categories were defined in terms of study design, context/intervention (a variable that can be applied to describe both the intervention tested in experiments or the setting and topic explored in qualitative studies), population, and country. These variables are typically delivered to this commissioner as an output of this type of review. Sub-categories were created through discussion among the three researchers and were piloted for usefulness and to reduce ambiguity. The final coding for arm 1 was agreement of two researchers.
Arm 2 used ML-based approaches. Automated clustering was triggered by one of the researchers (AEM), which automatically created a coding system and applied it to categorize all 128 included studies. The categorization system used in arms 2-1 and 2-2 was defined by applying EPPI Reviewer's 27 built-in text clustering function, which uses the Lingo3G clustering engine powered by Carrot Search. 28,29 Lingo3G is a description-centric algorithm, and its website consistently highlights the instantaneous utility and meaningfulness of cluster labels in its product description: "clearly-labeled" folders enable "instant analysis" and give the user an "instant overview" of text, and will help the user focus on "specific subject(s)". 30 Lingo3G's focus on informative labels and user understanding differs from more common data-centric clustering algorithms.
The default clustering settings displayed in EPPI Reviewer are two hierarchy depths, a minimum cluster size of 10%, and a maximum cluster size of 35%. We changed these parameters iteratively to obtain immediately sensible clusters and proceeded with nonhierarchical clustering. We retained the minimum cluster size of 10% and increased the maximum cluster size to 50%. While the automated clustering procedure provides cluster names, we found it necessary to edit some of the names suggested by the software to be more easily understood by other researchers. These names were chosen by reviewer AEM, by studying the titles and abstracts of the studies assigned to the poorly named clusters. Clusters that were not judged as useful after exploration were discarded.
We randomized studies in a 1:1 ratio across arms 2-1 and 2-2 using EPPI Reviewer's random distribution function, resulting in 64 different studies in each arm.
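The 1:1 allocation can be sketched as a shuffle followed by an even split. This is an illustrative stand-in, not EPPI Reviewer's random distribution function, whose implementation is not described here; the function name, the seed, and the study numbering are assumptions.

```python
import random


def randomize_two_arms(study_ids, seed=None):
    """Shuffle study IDs and split them 1:1 into two arms."""
    rng = random.Random(seed)  # a seed makes the allocation reproducible
    shuffled = list(study_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# 128 included studies, numbered 1..128 for illustration only.
study_ids = range(1, 129)
arm_2_1, arm_2_2 = randomize_two_arms(study_ids, seed=2020)
```

Each study lands in exactly one arm, so the two arms contain 64 different studies each.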
Arm 2-1 assessed the validity of the automatic clusters by having a researcher (HMRA) apply the automatically generated coding system to studies, blinded to how the algorithm had clustered them. Another researcher (AEM) then checked and reconciled all categorizations (as per usual practice).
Arm 2-2 assessed how a researcher who was not blinded to the clustering algorithm would categorize studies. A second researcher (PSJJ) simply checked how the clustering algorithm categorized the included studies, as she would check another researcher's data extraction.
Note that the ML-based approaches only require two researchers rather than three, as per usual practice, as the algorithm itself represented a third researcher. However, for the purpose of comparing the two approaches, it was necessary that the human tasks were performed by different people, that is, researchers HMRA and PSJJ.
Both parts of Arm 2 were completed after Arm 1, and Arm 1 was the commissioned review itself. By the completion of the review, all researchers were familiar with all studies. There was no way to avoid their knowledge, given that a manual categorization process beginning with a new categorization system requires in-depth knowledge of included studies.

| Analysis
To assess usefulness, one researcher (AEM) mapped each automatically generated cluster to a manually generated category. This provided a simple visualization of overlap and gaps between the two approaches. Our main aim was to objectively assess performance of the automated clustering method, to determine whether we could "trust" an algorithm to cluster studies using its own categorization system, as much as we trust researchers to code to a researcher-created categorization system. We therefore computed precision and recall (see Appendix 1) with respect to the final coding, 31 treating both the algorithm and human researchers as coders/researchers. Finally, we recorded and compared the time spent coding using automated versus human-generated categories. Each researcher recorded her time manually in an Excel file, for each step and task, and we calculated the total time used for each arm.
Figure 3 shows the 16 clusters identified by EPPI Reviewer's document clustering function on the left side. The right side displays the conceptual re-organization of 12 of these clusters into a two-level hierarchy (chosen for ease of comparison to the manually created categories). Of the original 16 clusters, four clusters (young people, sample, suggest, and conclusion) were judged to be irrelevant upon examination and were discarded. One cluster (no relevant categories/no abstract) corresponded to the studies that either did not have an abstract or were not assigned to any of the other 15 clusters. This cluster contained nine studies, seven of which were identified through gray literature searches, lacked abstracts, and were published in Norwegian or Swedish. These languages are among the 19 languages that Lingo3G can automatically detect and process, and therefore could have been included in the other clusters had they contained abstracts. 
Figure 4 displays the content of the automated clusters and manually created categories, with overlapping categories highlighted in yellow. Both approaches contained categories that described contexts/interventions and populations. Within the contexts/interventions category, both approaches identified when a study focused on a program or approach within a secure institution (e.g., anger management, animal therapy) rather than on the secure institution itself.
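The precision and recall comparison described under Analysis can be sketched as follows. This is a minimal illustration under the assumption that each coder's output is a set of (study, category) pairs compared against the final agreed coding; the exact definitions used are those in Appendix 1, which is not reproduced here, and the example codes are invented.

```python
def precision_recall(coder_codes, final_codes):
    """Return (precision, recall) of one coder against the final coding.

    Both arguments are sets of (study_id, category) pairs.
    Precision: share of the coder's assignments that are in the final coding.
    Recall: share of the final coding that the coder assigned.
    """
    true_positives = len(coder_codes & final_codes)
    precision = true_positives / len(coder_codes) if coder_codes else 0.0
    recall = true_positives / len(final_codes) if final_codes else 0.0
    return precision, recall


# Hypothetical codes, for illustration only.
final = {
    (1, "secure care"),
    (2, "juvenile detention"),
    (3, "secure care"),
    (3, "negative experiences"),
    (4, "positive changes"),
}
algorithm = {
    (1, "secure care"),
    (2, "juvenile detention"),
    (3, "secure care"),
    (2, "secure care"),  # not in the final coding: lowers precision
}

precision, recall = precision_recall(algorithm, final)
```

The same function can be applied to a human researcher's codes, which is what allows the algorithm and the researchers to be treated alike as coders.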

| Usefulness of automated clustering
The figure shows there were no automatically generated clusters that correspond to two of the manually created top-level categories, namely country and study design, which were of interest to the commissioners. Table 1 below shows the number of studies confirmed to be in each automated cluster or human-created category, after human coding and agreement. The lack of automated country codes might be explained by the fact that only 73 (57%) of the 128 studies specified country in the abstract and could be manually categorized. In addition, nine of 11 countries were reported by fewer than 10% of studies; our 10% minimum cluster size cut-off would therefore have prevented studies being clustered around these rarer words.
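The effect of the minimum cluster size can be made concrete with a short sketch. It assumes the 10% cut-off is applied as a share of all 128 studies; how Lingo3G applies the setting internally is not described here, and the function name is hypothetical.

```python
import math


def meets_minimum_cluster_size(n_with_term, n_total, min_fraction=0.10):
    """True if enough studies share a term to reach the minimum cluster size."""
    return n_with_term / n_total >= min_fraction


n_total = 128  # studies included in the review

# Smallest number of studies that reaches a 10% share of 128.
smallest_eligible = math.ceil(0.10 * n_total)

# A term shared by 12 studies falls short; one shared by 13 qualifies.
twelve_ok = meets_minimum_cluster_size(12, n_total)
thirteen_ok = meets_minimum_cluster_size(13, n_total)
```

Under this assumption, a country named in fewer than 13 abstracts could not form its own cluster.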
One reason for a lack of automated clusters relating to study design is that it may be challenging for this unsupervised ML method to infer important differences in study design. However, the clustering algorithm identified categories related to study topics and findings (factors related to criminality/desistance; life experiences, lived experiences; positive changes; and negative experiences), which were not part of the manual categorization scheme. The topic/finding top-level category corresponded roughly to study design. Sixty-one (62%) of the 98 qualitative studies received a topic/finding category, most often "life experiences, lived experiences" and "negative experiences". Only 12 (38%) of the 32 quantitative studies received a topic/finding category, most often "factors related to criminality/desistance".
"Population" was a top-level category in both automated clusters and manually generated categories. While researchers did not deem it useful in their manually created categories to group studies according to semantic differences such as "juvenile" or "young", it was straightforward to place studies in one of these two categories, as study authors used either one or the other phrase to describe their populations.

| Performance of automated clustering (using algorithmically-generated categories)
The leftmost column of Table 2 shows the cumulative performance of the researchers using manually created categories (Arm 1). Both precision and recall of the researchers exceeded 96%, likely because these categories were created through discussion among the three researchers, were piloted, and were intended to be unambiguous and mainly mutually exclusive. Table 2 also shows estimates of precision and recall for researchers and automated clustering in which consensus between two researchers was used for final coding. In Arm 2-1, in which automated clustering represented one independent researcher's coding, automated clustering and the actual researcher's precision rates were similar: 81%-82% of their codes identified relevant studies. In this study, recall for automated clustering was 10 percentage points greater than for the researcher. The statistical analysis suggests it is plausible that recall may be either similar across the two approaches or that the automated approach may be superior.
In Arm 2-2, in which one researcher saw and checked the clusters rather than coding blind, precision was again similar for both the algorithm and the researcher, and higher than in Arm 2-1. The clustering algorithm again had better recall than the researcher, and in Arm 2-2,
Table 3 displays the time spent in each arm. In Arm 1, manually creating codes, piloting them together, applying codes, and reconciling conflicts required 11.4 h (approximately 49% longer than the automated approaches). The majority of this time was spent in applying and reconciling coding. In Arm 2, the time needed to run the clustering algorithm, interpret the clusters, apply the clusters independently (Arm 2-1) or check the algorithm's clusters (Arm 2-2), and reconcile conflicts was 7.7 h for the 128 studies. Almost half of this time (3.45 h) was spent in making sense of the clusters, including re-naming ambiguous clusters and discarding irrelevant clusters. Coding using the algorithm's clusters took less than 40% of the time that coding using human-created categories did (4.2 h compared to 10.9 h).
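The time comparisons can be checked with a few lines of arithmetic. The sketch uses only the rounded hours reported in the text, taking the time spent making sense of the clusters as 3.45 h (the three and a half hours referred to in the Discussion); the variable names are ours.

```python
# Rounded hours as reported in the text.
manual_hours = 11.4       # Arm 1, all steps
automated_hours = 7.7     # Arm 2, all steps
sensemaking_hours = 3.45  # Arm 2, making sense of the clusters
coding_auto = 4.2         # coding with the algorithm's clusters
coding_manual = 10.9      # coding with human-created categories

# How much longer the manual arm took, relative to the automated arm.
extra_time_manual = (manual_hours - automated_hours) / automated_hours

# How much time the automated arm saved, relative to the manual arm.
time_saved = (manual_hours - automated_hours) / manual_hours

# Share of the automated arm spent on making sense of the clusters.
sensemaking_share = sensemaking_hours / automated_hours

# Coding time with automated clusters as a share of manual coding time.
coding_ratio = coding_auto / coding_manual
```

With these rounded inputs the manual arm comes out at roughly 48% longer and the automated arm at roughly 32% shorter, within about one percentage point of the reported 49% and 33%; the small gap presumably reflects rounding of the recorded hours, which is an assumption on our part.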

| DISCUSSION
In this exploratory validation study, we tested the usefulness of automated clustering in categorizing 128 studies in a simplified systematic review. We assessed the performance of both this description-centric algorithm, Lingo3G, and two researchers (one blinded and one non-blinded) against final coding decisions, and compared performance and resource use of categorizing with help of the algorithm against categorizing manually. Clustering provided useful categories for the review, but these were not exhaustive; it could not have replaced researcher-created categories. In terms of performance, the algorithm had remarkably similar precision to any one experienced systematic review researcher when assessing both against final coding, and 7%-11% better recall than any one researcher. The automated approach also used 33% less time. We therefore see exciting potential to supplement researcher categorization with automated clustering, and our study provides evidence that such methods can be as accurate as one or two researchers.
There were surprisingly helpful overlaps between the automated clusters and manual categories, as well as clear benefits to each of the approaches; see Table 4. The clustering algorithm was unable to organically cluster studies according to a pre-determined categorization system. However, it was sensitive to linguistic differences that sometimes corresponded to pre-determined categories. In social and welfare evidence synthesis we are often faced with summarizing effects of interventions and policies that lack internationally agreed upon names and definitions. In this project, when the researchers manually coded interventions/contexts and populations, they intentionally disregarded what they assessed to be country-level variation in terms, adopting a more semantic style of categorization because of variation in intervention and policy naming. For example, a study reporting on youth in "secure care" in the UK and a study reporting on youth in "juvenile detention facilities" in the United States were manually coded to Context/intervention > Secure institution itself. The exact name was not important in the manually created categories. However, the clustering algorithm homed in on these linguistic differences, and these two studies were clustered to Context/intervention > Secure care, and Context/intervention > Juvenile detention, respectively, which also corresponded to the manual country codes of the UK and USA. In a subsequent addition to the project, the commissioner requested studies divided by type of secure institution. The automated document clustering categories provided exactly those groups, saving us the time it would have taken to recode from scratch. One clear benefit of the algorithm was that it created a unique cluster, Topics/Findings, that did not have a manually created counterpart, and that proved particularly useful. After the original simplified review was delivered, the commissioner requested extensive summaries of the six identified literature reviews, with a focus on their topics and themes, and particularly whether results indicated positive or negative effects of secure institutions. Researchers were able to refer to this cluster's sub-categories as they summarized these publications. 32 The algorithm therefore proved useful as a supplement, but not a replacement.
Automated clustering required significantly less time, even accounting for the three and a half hours needed to make sense of clusters, which included re-labelling some and discarding others. In fact, making sense of the clusters was the most time-intensive step. Clusters' names tended to be the words that characterized the cluster. This was different from the usual practice of categorizing according to study content for two reasons: first, researchers often attempt to create mutually exclusive categories or at least minimally overlapping categories, while document clustering does not allow for this. Second, researchers often create categories within the same conceptual "plane" as one another: mutually exclusive types of program designs, mutually exclusive population groups, and mutually exclusive contexts, rather than a category that describes a particular population and a particular context and a particular program design.
Automated clustering is just as likely to cluster studies once into population groups and again into contexts, meaning there will not only be overlapping categories, with the same study categorized into a program design-related cluster and into a population-related cluster, but the categories may represent different "planes".
Overall, this suggests that automated clustering has limited utility, and does not save time, in assigning studies into pre-determined categories that may be hierarchically organized or with rules such as mutual exclusivity. Rather, the unsupervised nature of clustering points to its usefulness in highlighting similarities between studies. Stansfield et al. 14 also reported that Lingo3G succeeded in accurately describing a wider range of content than human categorization but could not cluster according to all pre-defined categories. A major advantage of description-centric algorithms such as Lingo3G over standard clustering algorithms that use a bag-of-words approach is their production of reasonable, immediately understandable cluster labels. Nevertheless, we spent more than 3 h making sense of them. We therefore assume that this stage would have required even more time had we used a data-centric algorithm. Time savings may have been greater had we used a different algorithm that we were able to fine-tune more, then apply to new data.
In addition to requiring less time, the clustering algorithm performed as well as any two researchers categorizing according to the algorithm's system, whether blinded or non-blinded. While few studies have explored clustering algorithms within evidence synthesis for comparison, our findings of the algorithm's precision (83%-88%) were similar to the precision reported by the algorithm's developers in their initial user study (80%-95%). 28 It is possible that the range of precision and recall could have been related to differences in the two arms' studies, although we hope that randomizing studies protects against systematic differences. We also saw no indication of confirmation bias in the arm in which a researcher was not blinded to the algorithm's assignment of studies. We interpret these results to mean that this particular clustering algorithm's "decisions" regarding how studies relate to each other can be trusted as much as when researchers themselves decide how studies relate to each other.
Performance of this clustering algorithm is based first and foremost on researcher acceptance of automated categories, and second on recall/precision of the accepted automated categories, compared to researcher classification. The act of accepting (or rejecting, or modifying) algorithmically generated clusters represents human input and intervention into ML tools. We suggest this human engagement be regarded as a necessary step in implementing ML tools in evidence synthesis, even when those tools could allow for full automation.
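To make the second performance criterion concrete, the following minimal sketch computes precision and recall for one accepted cluster against a researcher's categorization. It assumes the researcher's assignment is treated as the reference standard; the study identifiers and the function name `precision_recall` are hypothetical and are not taken from the review.

```python
def precision_recall(cluster_ids, reference_ids):
    """Precision and recall of one cluster against a reference category.

    cluster_ids:   studies the algorithm placed in the cluster
    reference_ids: studies the researcher placed in the matching category
                   (assumed here to be the reference standard)
    """
    cluster, reference = set(cluster_ids), set(reference_ids)
    true_positives = len(cluster & reference)
    # Precision: share of clustered studies that the researcher also assigned.
    precision = true_positives / len(cluster) if cluster else 0.0
    # Recall: share of researcher-assigned studies that the cluster captured.
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical study identifiers, for illustration only.
cluster_assigned = {"s01", "s02", "s03", "s04", "s05"}
researcher_assigned = {"s02", "s03", "s04", "s05", "s06", "s07"}

precision, recall = precision_recall(cluster_assigned, researcher_assigned)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

The same function could be applied with the roles reversed, treating the cluster as the reference, which is one way to compare a blinded researcher against the algorithm's own system.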

| Recommendations for evidence synthesis
In a systematic scoping review or a systematic literature search with categorization, automated clustering using the description-centric algorithm Lingo3G should be used to create initial categories, before manually creating a categorization system. All researchers involved in this process should carefully review clusters for relevance and clarity. Only the clusters that are useful should be carried forward; ambiguous clusters or those that take too long to understand should be discarded. As automated clustering may outperform human researchers with respect to recall (as our study suggests), we can probably depend on it to identify more studies than a researcher who codes blind. One researcher may then check the studies coded to these selected categories for accuracy. Although we see no evidence in this study that a researcher will be more precise than document clustering, we hypothesize that researcher precision will increase if all researchers are involved in assessing and understanding the automated clusters. It may be useful for a researcher to check the precision of document clustering categories. These hypotheses should be explored in subsequent studies.
After reviewing the automated clusters, researchers should manually create and code any supplemental categories as needed. This is similar to a best-fit framework synthesis 33,34 used in qualitative evidence syntheses, in which authors categorize data into a pre-existing framework. Any data that are left outside of the framework are then thematically analyzed to create new framework areas. The framework is then expanded to accommodate the new areas, creating a new framework that includes all of the relevant data. By using a hybrid human- and automated-categorization system, future reviews may benefit from the resources saved by automation as well as the specificity provided by manual categorization.
The automated clusters were extremely helpful in a subsequent, smaller commission. We echo Stansfield et al.'s 14 recommendation that automated clustering could help provide direction and focus in a large review, when there is a need to create a smaller dataset. Applications for automated clustering may therefore exist in larger systematic reviews with specific quantitative or qualitative research questions, or in preliminary searches for such reviews.

| Study strengths and limitations
Our findings are based on a single review, only one clustering algorithm, and three researchers. More comprehensive prospective studies that use different clustering algorithms would be required to provide rigorous comparisons of human and automated approaches. There are certainly more sophisticated algorithms to explore. Another more advanced approach could use language models that operate at a more conceptual than token level, such as the Generative Pre-trained Transformer 3 (GPT-3) model with its 175 billion parameters. 35 At the same time, this particular algorithm was user-friendly and available in a popular systematic review software package. The most cutting-edge and complex ML systems are often the least user-friendly and transparent, and both characteristics undermine uptake of ML in evidence synthesis 13 as well as more broadly. 36 Systematic reviewers are more likely to accept an ML tool if it is interpretable, as Lingo3G's clusters were. We expect more sophisticated algorithms to become more user-friendly for systematic reviewers in the future.
The research reported in this paper was carried out during a commissioned review that had a short time frame, a large number of search hits, and a large number of relevant studies. We are unsure how well automated clustering would work on a review with a limited number of included studies. The time savings in that scenario would be limited and potentially not worthwhile. Our time estimates are also likely dependent upon the clustering algorithm we used; different algorithms may require more or less time to interpret labels and discard irrelevant clusters.

| Future research agenda
We believe this is the first study in this area and hope our work is a useful contribution that can be used to help plan more rigorous randomized studies. Adoption of ML methods is gaining traction within evidence synthesis. For example, the well-known PRISMA study flow templates for systematic reviews now include specification of manual versus automated study identification, 37 and recent reviews have further tailored PRISMA figures to include neural network-based knowledge graphs. 38,39 These are still the exceptions rather than the rule, however. Research needs to explore how ML, particularly unsupervised tools with modifiable parameters, should be handled in the protocol stage: Is it better to pre-specify parameters in a study protocol, thereby protecting against human bias in modifying parameters in a particular direction, or to plan for changing parameters iteratively in order to obtain sensible clusters? Do ML methods lead to conclusions within a systematic review, and ultimately in a guideline, different from those that would have been reached had ML methods not been used? Finally, how can we best educate systematic reviewers and other users about the mechanism behind ML tools, even when the tool is user-friendly, as well as about potential consequences and trade-offs occurring when automating previously manual tasks?
One way in which the risks and benefits of clustering and other ML-based tools could be studied is with a prospective case-control study of systematic reviews and health technology assessments with the same or similar inclusion criteria. By pairing reviews and health technology assessments that did and did not use ML, it should be possible to analyze outcomes such as time-to-publication and human agreement in data extraction. Randomizing a pre-specified number of commissioned reviews to use or not use ML tools could also provide comparative data about conclusions.
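As an illustration of the randomization idea, the sketch below allocates a pre-specified set of commissioned reviews to an ML arm or a no-ML arm using simple 1:1 randomization with a fixed seed, so that the allocation can be reproduced and audited. The review identifiers, the seed value, and the function name `randomize_reviews` are hypothetical; a real trial would likely need stratification or allocation concealment, which this sketch does not attempt.

```python
import random


def randomize_reviews(review_ids, seed):
    """Split reviews into two arms of (near-)equal size by simple randomization."""
    rng = random.Random(seed)      # seeded generator makes the allocation reproducible
    shuffled = list(review_ids)    # copy so the caller's list is left unchanged
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "ml": sorted(shuffled[:half]),      # reviews that would use ML tools
        "no_ml": sorted(shuffled[half:]),   # reviews conducted without ML tools
    }


# Hypothetical identifiers for ten commissioned reviews.
reviews = [f"review_{i:02d}" for i in range(1, 11)]
arms = randomize_reviews(reviews, seed=2021)
print(arms)
```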

| CONCLUSION
This study shows that it is feasible and can be useful to use automated clustering to create, inform, or otherwise supplement study categorization systems for scoping reviews or more simplified systematic review products. We estimated that automated clustering with the description-centric Lingo3G algorithm is as precise as human researcher categorization and uses 33% less time. Coding to human-created categories took far more time than coding to clusters, but there was a sunk cost of almost 3.5 h in making sense of the clusters, even using an algorithm intended to provide descriptive and meaningful cluster labels. At the same time, the clusters identified by machine learning did not include essential categories such as country or study design. In the future, review teams could begin the categorization process by applying a clustering algorithm to the included studies. These clusters should be examined and discussed within the research team, and ambiguous or unnecessary clusters removed. The remaining clusters could then be used as the foundation for further categorization of the included studies, and researchers can trust the performance of the algorithm as much as one another's. Importantly, we suggest that this particular clustering algorithm, available in a popular systematic review software package, can be used by systematic reviewers who are not machine learning experts.
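The "33% less time" figure here and the statement in the abstract that manual categorization required 49% more time describe the same ratio from opposite directions. A short worked calculation shows this, where the symbols for total manual and automated time are notation introduced only for this illustration:

```latex
\[
T_{\text{manual}} = 1.49\,T_{\text{auto}}
\quad\Longrightarrow\quad
\frac{T_{\text{auto}}}{T_{\text{manual}}} = \frac{1}{1.49} \approx 0.67,
\qquad
1 - 0.67 = 0.33 .
\]
```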