Evaluation of literature searching tools for curation of mismatch repair gene variants in hereditary colon cancer

Abstract
Pathogenic constitutional genomic variants in the mismatch repair (MMR) genes are the drivers of Lynch syndrome; optimal variant interpretation is required for the management of suspected and confirmed cases. The International Society for Hereditary Gastrointestinal Tumours (InSiGHT) provides expert classifications for MMR variants for the US National Human Genome Research Institute's (NHGRI) ClinGen initiative and interprets variants with discordant classifications and those of uncertain significance (VUSs). Given the onerous nature of extracting information related to variants, literature searching tools which harness artificial intelligence may aid in retrieving information to allow optimum variant classification. In this study, we described the nature of discordance in a sample of 80 variants from a list of variants requiring updating by InSiGHT for ClinGen by comparing their existing InSiGHT classifications with the various submissions for each variant on the US National Center for Biotechnology Information's (NCBI) ClinVar database. To identify the potential value of a literature searching tool in extracting information related to classification, all variants were searched for using a traditional method (Google Scholar) and a literature searching tool (Mastermind) independently. Descriptive statistics were used to compare the number of articles before and after screening for relevance and the number of relevant articles unique to either method. Relevance was defined as containing the variant in question as well as data informing variant interpretation. A total of 916 articles were returned by both methods and Mastermind averaged four relevant articles per search compared to Google Scholar's three. Of relevant Mastermind articles, 193/308 (62.7%) were unique to it, compared to 87/202 (43.0%) for Google Scholar. For 24 variants, either or both methods found no information.
All 6/80 (20%) variants with pathogenic or likely pathogenic InSiGHT classifications have newer VUS assertions on ClinVar. Our study demonstrated that for a sample of variants with varying discordant interpretations, Mastermind was able to return on average, a more relevant and unique literature search. Google Scholar was able to retrieve information that Mastermind did not, which supports a conclusion that Mastermind could play a complementary role in literature searching for classification. This work will aid InSiGHT in its role of classifying MMR variants.

A key role of the InSiGHT VCEP is to reclassify variants on the ClinVar database whose genotype-phenotype relationship is unclear or not definitive as understood by submitters. These include variants of uncertain significance (VUS); in addition, many variants receive discordant pathogenicity assignments when submitted to databases such as ClinVar. Discordance is multifactorial and variant interpretation is often generated from multiple sources, leading to a "silo effect," whereby information is considered in isolation by different submitters.
The result is a lack of centralized, contemporary information pertaining to a particular variant. 3 Additionally, for the most complete picture about a particular variant to be gleaned, the submission of unpublished material and information may be encouraged. InSiGHT encourages the submission of unpublished clinical and research data by recognizing contributions through microattribution. 4 InSiGHT at its Variant Interpretation Committee/VCEP Teleconferences frequently identifies critical unpublished information to aid interpretation. This information is documented on its database.
VUSs pose a particular clinical problem: they cannot be identified as benign with reference to the human reference genome, yet on the basis of contemporary data a detrimental effect on the function of the gene is not sufficiently apparent for them to be declared pathogenic.
This leaves families carrying these variants in diagnostic limbo. 5 For discordant variants, misclassification can result in serious clinical mismanagement across and within families, especially where a variant is misclassified as benign and is later reclassified as pathogenic. 6 To classify a variant, a biocurator may face a seemingly never-ending literature search which may return many irrelevant results. Collecting information that is relevant to variant interpretation is an overwhelming manual task which will only become more onerous given the rate at which literature is now produced. However, there are now variant-oriented search systems that could improve the quality of search results and by extension improve the efficiency of the curation process. 7 These literature searching tools are able to find articles that mention specific variants using artificial intelligence and natural language processing. They have been purported to increase the yield of a literature search compared to traditional search methods. 7 PubMed search, the question as to whether these tools can be applied to a practical setting such as variant curation and interpretation remains unanswered. 8,9 For such tools to be useful, they would need to return articles that are relevant to the biocurator's task of classification. Such information includes experimental validation of variant functions, tumor and co-segregation information, family history, in silico analysis and statistical methods to determine a probability of pathogenicity.
The literature so far has largely focused on the correct identification of gene, mutation and disease within a paper by a literature searching tool. 9 Furthermore, whilst there has been discussion of open source tools applied to breast and prostate cancer variants, analysis of specific applications of literature searching tools to MMR variants is mostly limited to a study which developed a "Variation Annotation Schema" that aimed to capture important concepts and relations for human genetic variation. 10 This schema was developed in response to the needs of InSiGHT biocurators and relates to the historical curation of the InSiGHT database and annotation of MMR genes. It was hoped that it would provide a framework for future literature searching tools for MMR variants.
There now exists a range of commercially available literature searching tools. Given the onerous task of manual curation, a tool that increases the efficiency or accuracy of the initial literature search could allow the optimum classification of MMR variants and could be beneficial in resolving discordant interpretations. We therefore set out to ask whether literature searching tools could add incremental value to the initial literature search to retrieve information for the classification of MMR variants submitted with different pathogenicity assignments.

| AIMS AND HYPOTHESIS
Our first aim was to examine the nature of discordance in a set of MMR variants with different pathogenicity assignments, by compar-

| Sample
In January 2020, a list of MMR variants with discordant classifications that required reviewing and updating on the ClinGen ClinVar database was provided to the InSiGHT VCEP by ClinGen as a part of InSiGHT's role in reclassifying these variants. This list was "prioritized" by ClinGen into several categories. The first of these were the "Alert"
To describe the nature of discordance amongst this sample of variants (the first aim) and eventually use the sample to identify the value of Mastermind in an initial variant literature search (the second aim), it was important that the sample reflected a typical situation that a biocurator may face, that is, fulfilling the important role of the InSiGHT VCEP by reclassifying VUSs and resolving discordant variants. To address the two aims of this study, we used judgement sampling to identify which variants should be selected for inclusion; selection was on the basis of priority as designated by ClinGen and the number of conflicting submissions on ClinVar.
Judgement sampling refers to a sample chosen based on prior knowledge of a subject and is useful for samples where the aim is to improve process performance, which in our case is the process of literature searching for MMR variant classification. 11 We first focused our efforts on the "Alert" variants and then prioritized a selection of variants from the "Priority" group. We aimed for an arbitrary total of 80 variants, which was thought to be a sufficient sample size to pilot the feasibility of Mastermind. As this was intended to be a study that examined the feasibility of using Mastermind across a range of different discordant settings, it was not deemed necessary (and was beyond the scope of this study) to test all of the variants in every category beyond "Alert". As detailed statistics were not planned, there was no formal power calculation for sample size.
Of the 80 variants in the sample, 31 comprised all of the variants in the "Alert" category, which were selected for analysis on the basis of being of high priority (as designated by ClinGen) for InSiGHT to provide updated classifications. Further subgroups within the "Alert" category will be expanded upon in the Results section.
The remaining 49 variants were selected on the basis of multiple submissions with discordant interpretations by different submitters and were from the "Priority" category as designated by ClinGen. From the "Priority" category, we focused on two subgroups. The first was variants that did not necessarily have an InSiGHT classification but had at least one conflicting pathogenic/likely pathogenic vs VUS/likely benign/benign submission from different sources on ClinVar, this being a medically significant conflict. Here we denote variants pooled across one or more InSiGHT classifications by a diagonal slash to denote OR (eg, likely benign/benign) and conflicts of classification by "vs" (eg, pathogenic/likely pathogenic vs VUS/likely benign/benign). To further prioritize these variants, we then derived the median number of submissions to ClinVar per variant, which was four, and selected all the variants with four or more submissions for inclusion. This first subgroup, prioritized by this method, contained 39 variants.
In the second subgroup of the "Priority" category, variants did not necessarily have an InSiGHT classification, but had at least one VUS and at least one likely benign/benign non-expert panel classification on ClinVar. Since this group was deemed to be of lower priority and contained a large number (540) of variants, we selected the ten variants with the most assertions of pathogenicity on ClinVar.
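The two "Priority" selection rules described above can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the procedure actually used in the study: the variant names, the example submissions and all function names are hypothetical, submitter identity is ignored, and the second subgroup is assumed to exclude variants that also carry a pathogenic/likely pathogenic submission.

```python
from statistics import median

# Classification labels as used in the text; these groupings are illustrative.
PATHOGENIC = {"pathogenic", "likely pathogenic"}
BENIGN = {"likely benign", "benign"}
VUS = "VUS"


def is_medically_significant_conflict(labels):
    """At least one P/LP submission and at least one VUS/LB/B submission."""
    seen = set(labels)
    return bool(seen & PATHOGENIC) and bool(seen & (BENIGN | {VUS}))


def is_vus_vs_benign_conflict(labels):
    """At least one VUS and one LB/B submission, with no P/LP (assumption)."""
    seen = set(labels)
    return VUS in seen and bool(seen & BENIGN) and not (seen & PATHOGENIC)


def select_at_or_above_median(submissions):
    """First subgroup: keep conflicted variants with >= median submissions."""
    group = {v: labels for v, labels in submissions.items()
             if is_medically_significant_conflict(labels)}
    if not group:
        return []
    threshold = median(len(labels) for labels in group.values())
    return sorted(v for v, labels in group.items() if len(labels) >= threshold)


def select_most_asserted(submissions, n=10):
    """Second subgroup: the n variants with the most submissions."""
    group = [v for v, labels in submissions.items()
             if is_vus_vs_benign_conflict(labels)]
    # Sort by descending submission count, then by name for a stable order.
    return sorted(group, key=lambda v: (-len(submissions[v]), v))[:n]


# Hypothetical data: variant name -> classifications submitted for it.
EXAMPLE_SUBMISSIONS = {
    "variant_A": ["pathogenic", "VUS", "VUS", "likely pathogenic", "VUS"],
    "variant_B": ["likely pathogenic", "benign"],
    "variant_C": ["pathogenic", "VUS", "likely benign", "VUS"],
    "variant_D": ["VUS", "likely benign", "benign"],
    "variant_E": ["VUS", "benign"],
    "variant_F": ["VUS", "VUS"],
}

if __name__ == "__main__":
    print(select_at_or_above_median(EXAMPLE_SUBMISSIONS))
    print(select_most_asserted(EXAMPLE_SUBMISSIONS, n=1))
```

In this toy data the conflicted variants have five, two and four submissions, so the median threshold is four, mirroring the threshold reported in the text.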

| The literature searching tool
The Mastermind Genomic Search Engine (Mastermind) 12 was selected as the commercially available literature searching tool for comparison primarily because of its ease of use (as it does not require the use of Boolean search terms) and popularity amongst biocurators.
Mastermind uses artificial intelligence, machine learning and genomic language processing to search the literature for gene variants. Such technology is able to identify the ways in which genes and variants are described in the literature and filter out erroneous information by incorporating knowledge of biology and human genomics. Mastermind is updated on a weekly basis. To maximize applicability of any results to a general setting we used the Basic, free edition of the software that simply required registration using an email address and password. To use Mastermind, a variant is entered into the search field (much like any Internet search engine) and Mastermind then returns all the articles it can find that mention the particular variant.

| Standardizing Google Scholar search
The traditional searching method that we compared the results of
Statistical methods planned for this study were descriptive in nature and consisted of frequencies, means, medians and ranges. Further statistical analysis, including formally testing the hypothesis that Mastermind's results would be more relevant or contain more unique information than Google Scholar's, was deemed beyond the scope of a limited feasibility study that was neither randomized nor blinded.
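The descriptive comparison outlined above (frequencies, means, medians and ranges, together with the proportion of relevant articles unique to one method) could be computed along the following lines. This is a hedged sketch only: the article identifiers, the per-search counts and the function names are hypothetical and are not taken from the study data.

```python
from statistics import mean, median


def summarize(counts):
    """Descriptive statistics for the number of articles per search."""
    return {
        "n": len(counts),
        "mean": mean(counts),
        "median": median(counts),
        "range": (min(counts), max(counts)),
    }


def unique_articles(own, other):
    """Relevant articles returned by one method but not by the other."""
    return set(own) - set(other)


def proportion_unique(own, other):
    """Fraction of a method's relevant articles that only it returned."""
    own = set(own)
    if not own:
        return 0.0
    return len(own - set(other)) / len(own)


# Hypothetical relevant articles for a single variant search.
mastermind_hits = {"article_1", "article_2", "article_3", "article_4"}
scholar_hits = {"article_3", "article_4", "article_5"}

# Hypothetical counts of relevant articles across four searches.
relevant_per_search = [4, 0, 7, 1]

if __name__ == "__main__":
    print(summarize(relevant_per_search))
    print(sorted(unique_articles(mastermind_hits, scholar_hits)))
    print(f"{proportion_unique(mastermind_hits, scholar_hits):.1%}")
```

Using sets makes the uniqueness comparison independent of the order in which each tool returns its results, and it requires that the same article be given the same identifier regardless of which tool found it.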

| Ethics
Our study met the criteria for a quality assurance study in the Depart-

| Unique articles for Google Scholar and Mastermind searches
The number of relevant articles that were unique to either Google Scholar or Mastermind can be found in Table 4. Mastermind found an increased proportion of relevant articles that were unique when com-
17 In terms of future directions, one may attempt to address processes beyond the initial literature search to assess whether information found by literature searching tools was later actively used in formal classification of variants by groups such as the InSiGHT VCEP.

| Instances where Google Scholar or Mastermind returned no information
The current work will usefully inform the work of the InSiGHT VCEP as it works to reach a consensus on the pathogenicity of the discor- information that might hold important answers related to optimal classification. Because classification is an onerous task, literature searching tools that add value to the initial search process, and hence to the overall classification process, may ultimately allow discordant interpretations to be resolved and VUSs to be reclassified more efficiently and more accurately. The InSiGHT VCEP is committed to this task and to delivering on the promise of precision medicine for patients and their families, and it is hoped that literature searching tools may play a valuable role in this effort.

CONFLICT OF INTEREST
The authors have had no role in Mastermind, Genomenon or Google and have no conflicts of interest to disclose.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available in the supplementary material of this article.