TWIRLS, a knowledge‐mining technology, suggests a possible mechanism for the pathological changes in the human host after coronavirus infection via ACE2

Abstract Faced with the current large‐scale public health emergency, collecting, sorting, and analyzing biomedical information related to the “SARS‐CoV‐2” should be done as quickly as possible to gain a global perspective, which is a basic requirement for strengthening epidemic control capacity. However, for human researchers studying viruses and hosts, the vast amount of information available cannot be processed effectively and in a timely manner, particularly if our scientific understanding is also limited, which further lowers the information processing efficiency. We present TWIRLS (Topic‐wise inference engine of massive biomedical literatures), a method that can deal with various scientific problems, such as liver cancer, acute myeloid leukemia, and so forth, which can automatically acquire, organize, and classify information. Additionally, this information can be combined with independent functional data sources to build an inference system via a machine‐based approach, which can provide relevant knowledge to help human researchers quickly establish subject cognition and to make more effective decisions. Using TWIRLS, we automatically analyzed more than three million words in more than 14,000 literature articles in only 4 hr. We found that an important regulatory factor angiotensin‐converting enzyme 2 (ACE2) may be involved in host pathological changes on binding to the coronavirus after infection. On triggering functional changes in ACE2/AT2R, the cytokine homeostasis regulation axis becomes imbalanced via the Renin‐Angiotensin System and IP‐10, leading to a cytokine storm. Through a preliminary analysis of blood indices of COVID‐19 patients with a history of hypertension, we found that non‐ARB (Angiotensin II receptor blockers) users had more symptoms of severe illness than ARB users. This suggests ARBs could potentially be used to treat acute lung injury caused by coronavirus infection.

[...] that can quickly spread from person to person and in some cases lead to death. Researchers have found that both the SARS-CoV-2 and SARS coronaviruses invade human cells in target tissues in a similar manner via high-affinity binding to angiotensin-converting enzyme 2 (ACE2). In recent epidemiological investigations of the spread of SARS-CoV-2 and a preliminary study of the clinical characteristics of this disease (Chan et al., 2020; Chen et al., 2020; Pan et al., 2020; Wei et al., 2020; Zhu et al., 2020), researchers have found that patients infected with the new coronavirus have severe symptoms similar to those of SARS infection. The first clinical case reports of SARS-CoV-2 infections in China revealed "cytokine storms" in critically ill patients (Wan et al., 2020). However, the mechanism of the viral infection and the pathological changes in the immune system are still not known. The sooner this information is added to the current clinical knowledge on these viruses, the better the control and treatment of this disease.
Here, we present an automated topic-wise inference method called TWIRLS (Topic-wise inference engine of massive biomedical literatures), to help human researchers to quickly establish topic-cognition of interest and solve different scientific problems. In this study, we constructed the "coronavirus" knowledge graph using the TWIRLS system. First, TWIRLS can process and summarize the massive biomedical literature on coronaviruses, and then collect, classify, and analyze reported coronavirus studies to reveal host-related entities based on the distribution of specific genes in the text of the articles. By combining with general protein interaction data, links between certain functional cellular/physiological components can be inferred to fill in the knowledge gaps on the probable mechanisms of host pathological changes. By analyzing the coronavirus literature, TWIRLS was able to reveal that the binding of the coronavirus spike proteins to ACE2 would cause an imbalance in the Renin-Angiotensin System (RAS).
When the level of angiotensin II (Ang II) is elevated, the angiotensin-stimulated AT1R leads to increased pulmonary vascular permeability, which triggers a cytokine storm and eventually results in acute lung injury in the host (Kuba et al., 2005). Therefore, TWIRLS can guide human researchers by providing further potential therapeutic target information based on the regulation of RAS for the treatment of acute viral lung injury.

| Construction of the data interface
We used PubMed, the most widely used biological literature database, as the resource for text mining. The schematic representation of the overall study design is shown in Figure 1 and can be summarized in the following steps.

| Corpus and dictionary organization
The dataset used in this pipeline was derived from the text of articles in PubMed. First, PubMed was searched for articles containing the subject keyword "coronavirus", and the titles, abstracts, and author and affiliation information were retrieved. The search results were downloaded in txt format and compiled into structured information. The text in the subject abstract set was organized, cleaned, and assigned to a coronavirus-specific corpus (the specific corpus), which was then compiled into the subject dictionary. To improve the accuracy of the entities associated with the keyword, we also constructed a control group, a random corpus built with "public health" as the keyword.
For balancing the amount of information, we randomly selected the same amount of text as the subject abstract set from the control group before statistical analysis.
2.3 | Identification of genes precisely related to the subject "coronavirus"
Biological entity identification is a key step in the literature mining process. To validate the functionality of the extracted entities, we first compared the entities from the subject dictionary with the official human gene symbols in the HUGO Gene Nomenclature Committee (HGNC) database to generate subject candidate genes using standard nomenclature. In addition, the entities in the abstracts were converted to upper case to avoid errors in the identification process. To obtain widely used gene entities that are precisely related to the subject and to determine the significance of the gene distribution in the specific texts, we calculated the difference in the distribution proportions. We first searched for the subject candidate genes in the subject dictionary and in the randomized control dictionary. We then counted the number of abstracts containing each subject candidate gene in each abstract set. Finally, we calculated the odds ratio of each subject candidate gene and sorted the genes into a list of precisely related genes, referred to as coronavirus study-specific host genes (CSHG).
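The counting and odds-ratio step described above can be sketched as follows. This is a minimal illustration and not the TWIRLS implementation: the function name `gene_odds_ratios`, the whitespace tokenizer, and the 0.5 continuity correction for zero counts are assumptions, and the abstracts are invented toy strings.

```python
def gene_odds_ratios(subject_abstracts, control_abstracts, gene_symbols):
    """Return {gene: odds ratio} comparing abstract counts in two corpora."""
    def count_hits(abstracts, gene):
        # Upper-case every token so that 'ace2' and 'ACE2' are matched alike.
        return sum(
            gene in {tok.strip(".,;:()").upper() for tok in text.split()}
            for text in abstracts
        )

    ratios = {}
    for gene in gene_symbols:
        a = count_hits(subject_abstracts, gene)   # subject abstracts with the gene
        b = len(subject_abstracts) - a            # subject abstracts without it
        c = count_hits(control_abstracts, gene)   # control abstracts with the gene
        d = len(control_abstracts) - c            # control abstracts without it
        # Assumed 0.5 correction: avoids division by zero for rare genes.
        ratios[gene] = ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))
    return ratios


# Invented toy corpora of equal size (subject set vs. random control set).
subject = [
    "ACE2 is the receptor for the spike protein.",
    "Binding of spike to ace2 was measured.",
    "DPP4 is the MERS receptor.",
    "Vaccination policy was reviewed.",
]
control = [
    "Public health policy and vaccination.",
    "ACE2 polymorphisms in hypertension.",
    "Smoking cessation programs.",
    "Health insurance coverage.",
]

ratios = gene_odds_ratios(subject, control, ["ACE2", "DPP4"])
ranked = sorted(ratios, key=ratios.get, reverse=True)  # most subject-specific first
```

Genes whose ratio exceeds a chosen threshold would then be kept as candidates; the threshold itself is a study-level choice.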

| Identification of all entities correctly related to the subject "coronavirus"
Similar to the process of identifying CSHG, we tested whether each entity was significantly over-represented in the specific corpus; such entities are termed coronavirus study-specific entities (CSSE). We counted the texts in the specific corpus that contained CSHGs, and then counted the occurrences of each candidate entity in this corpus subset. Next, we randomly selected the same amount of text from the random control corpus and counted the occurrences of each candidate entity in this subset of the random corpus. This was repeated 100-10,000 times in the random corpus to generate, for each candidate entity, a random distribution of counts in the specified amount of text. According to the central limit theorem, the distribution of the means of repeated random samples approaches a normal distribution. Therefore, we can use the Z score to evaluate whether an entity is significant in a specific text. Here, we used a cutoff of Z score > 6.
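The resampling test described above can be sketched as below. This is an illustration under stated assumptions rather than the TWIRLS code: the function name `sampling_z_score`, the use of whole texts as the sampling unit, the simple whitespace matching, and the toy control corpus are all hypothetical.

```python
import random
import statistics


def sampling_z_score(observed_count, control_texts, entity, sample_size,
                     n_rounds=1000, seed=0):
    """Z score of an observed entity count against counts in random control samples."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_rounds):
        # Draw the same amount of text from the control corpus each round.
        sample = rng.sample(control_texts, sample_size)
        counts.append(sum(entity in text.lower().split() for text in sample))
    mean = statistics.fmean(counts)
    sd = statistics.pstdev(counts)
    if sd == 0:
        # Degenerate case: the entity count never varies in the control samples.
        return float("inf") if observed_count > mean else 0.0
    return (observed_count - mean) / sd


# Toy control corpus: 10 of 200 texts mention the entity "spike".
control_texts = ["spike protein binding"] * 10 + ["public health policy"] * 190

z_enriched = sampling_z_score(30, control_texts, "spike", sample_size=50)
z_background = sampling_z_score(2, control_texts, "spike", sample_size=50)
is_specific = z_enriched > 6  # cutoff used in the text
```

An entity seen 30 times in 50 subject texts is far above what the control corpus produces, whereas a count of 2 is within the random range.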
In addition, some entities mentioned in the abstracts appear in singular or plural noun forms, or as synonyms with multiple forms. Therefore, we automatically combined plural nouns and words sharing a root with adjective and adverb forms into the same subject-related entity and assigned them the same number. For example, synonymous entities such as coronaviral, coronavirus, and coronaviruses were grouped into one entity called coronavirus and assigned one number (see the entity number in Table S1, Sheet 1, first column). A previous method of merging synonymous entities based on a dictionary (Cook & Jensen, 2019; Hettne et al., 2010) relied on the integrity of that dictionary, and also required a long retrieval time. To solve the synonymous entity problem automatically, TWIRLS classifies similar strings based on whether there is a significant statistical association between the character blocks in a set of candidate entities that includes various synonymous entities.
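The idea of grouping word variants by shared character blocks can be illustrated with a much simpler stand-in. The sketch below is not the statistical association test used by TWIRLS, which is not specified in detail here; it swaps in character-trigram Jaccard similarity with a greedy grouping, and the function names and the 0.5 threshold are assumptions.

```python
def trigrams(word):
    """Set of overlapping three-character blocks of a word."""
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)}


def jaccard(a, b):
    """Share of character blocks that two words have in common."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def group_synonyms(entities, threshold=0.5):
    """Assign one entity number to words that resemble a group's first member."""
    representatives = []  # first member seen for each group
    entity_ids = {}
    for entity in entities:
        for group_id, rep in enumerate(representatives):
            if jaccard(entity, rep) >= threshold:
                entity_ids[entity] = group_id
                break
        else:
            # No similar group found: open a new one.
            entity_ids[entity] = len(representatives)
            representatives.append(entity)
    return entity_ids


ids = group_synonyms(
    ["coronaviral", "coronavirus", "coronaviruses", "receptor", "receptors", "protein"]
)
```

With these inputs the three coronavirus variants share one number and the two receptor forms share another, which mirrors the grouping described in the text without requiring a synonym dictionary.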

| Programming language and efficiency
Part of the algorithm was developed using the MATLAB programming environment and the Python language. Algorithm efficiency improvements and the targeted parallel acceleration module were developed in C/C++. In our analysis, the automated text analysis took about 4 hr to complete on a workstation with two Intel Xeon E5-2690 v4 CPUs (28 cores) and 128 GB of memory.
FIGURE 1 Flow chart of the knowledge-driven literature mining method. The basic steps of the literature mining include: (a) identifying genes with accurate relevance to the subject, (b) identifying entities with accurate relevance to the subject, (c) classifying entities by calculating the association strength between genes and entities, and (d) aligning with the KEGG database to establish an association matrix between pathways and entity categories

| Clinical data collection
In this study, we collected the medical records of 92 patients with COVID-19, who were classified as mild, severe, or critical according to their condition based on the partial pressure of oxygen test. All data were independently checked by more than one physician.

| Coronavirus specific entities and host genes
As of February 21, 2020, the PubMed database included 14,878 biomedical articles on coronaviruses. We obtained text data (referred to as the local samples) from all related peer-reviewed articles that contained the keyword "coronavirus", including the titles, abstracts, and author and affiliation information (3,182,687 words in total). The goal of the literature mining was to identify host genes and entities that are relevant to coronavirus research and to establish connections between them. An entity can refer to a word or phrase naming a concept, including related concepts (e.g., virus structure and chemical composition, source of infection, and virus type). The gene names were defined using the mammalian official gene symbols in the HUGO Gene Nomenclature Committee (HGNC) database. We directly retrieved 667 candidate genes from the local samples. By establishing a random distribution of a candidate gene in a control sample, the significance of this gene appearing in the local samples can be determined by testing whether its frequency is an outlier of that random distribution (see Methods for details). By calculating the odds ratio, we can further determine the specificity of the association between this gene and the local samples. In this paper, we selected an odds ratio > 6 as the threshold for this judgment, which resulted in 123 coronavirus study-specific host genes (CSHGs).
To determine the specificity of the entities, we made several choices regarding the texts in the local samples. We removed numbers, symbols, verbs, and garbled characters to obtain clean versions of the local samples. The CSSEs were then identified only in the clean texts containing CSHGs. Based on the clean selected samples, we next built a local dictionary of candidate CSSEs, which contained 49,293 words after deduplication. Before calculating the random distribution of each entity, we merged synonymous entities under the same entity number (including singular and plural words, active and passive forms, different tenses, suffixes that do not change the meaning, etc.).
After cleaning and processing, CSSEs were identified by TWIRLS using a method similar to that described above for CSHG. For the candidate CSSE dictionary, a random distribution model for each entity was built by TWIRLS using the control samples. We identified 623 CSSEs (Table S1).

| Entity categories and their labels: Human conclusions and enriched pathways
Although TWIRLS only identified 623 CSSEs after collation, the information is scattered across individual words, which limits the reconstruction of understandable mechanistic models. Accordingly, TWIRLS clusters CSSEs based on the rules defined by the CSHG distribution, as genetic-level research can accurately answer and solve physiological and pathological problems. TWIRLS first calculates the specific co-distribution between CSHGs in the local samples, then determines the distance between each pair of CSSEs and performs dichotomy clustering according to the linkage relationship between CSSEs and CSHGs. This step classified the 623 entities into 32 categories, represented as C0-C31 (see the category number in Table S1, Sheet 1, second column). In addition, for each category, TWIRLS also cited the top 10 most relevant references for human researchers (Table S2). Therefore, for any category, based on the CSSEs and the most relevant literature, we can quickly provide "Labels of conclusion-drawn-by-human-researcher" (HR labels). Such a label outlines the most relevant research directions of the current entity category. For example, for category C3, the HR label is "Neurotrophic Coronavirus Related to Immune-Mediated Demyelination". We have summarized the HR labels for the 32 entity categories in Table 1.
The relative position of any CSHG to a certain CSSE can be estimated by TWIRLS (see Table S1). As each category contains different entities, we can determine whether a certain CSHG is significantly closer to the entities in the current category based on the ranking matrix between CSHG and CSSE. For example, the average distance between ACE2 and each of the 92 entities in category C5 was first calculated, and a random distribution model of the average distance between ACE2 and any 92 entities was built (3,000-5,000 random draws). The average distance between ACE2 and the entities in category C5 was then compared with the mean of the random distribution to determine whether it was significantly smaller than, or otherwise deviated from, that mean (Z score = −5.8416). The significance of each category associated with each CSHG was then determined by TWIRLS using a score ranging between −10 and +10, with a smaller score indicating that the current CSHG is more relevant to the current category (see the Z score matrix in Table S3). For an entity category, the associated CSHGs (e.g., Ci CSHGs, where i represents the category number) can thus be selected using a Z score < −3. The Z scores describing the association between each CSHG and each category are summarized in Table S3, and the category labels of all CSHGs are provided in Table S4.
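The gene-to-category test described above can be sketched as follows. This is a minimal illustration, not the TWIRLS code: the function name `category_z_score`, the dictionary layout, and the toy distances (simple ranks for 100 invented entities) are assumptions.

```python
import random
import statistics


def category_z_score(distances, category_entities, n_rounds=3000, seed=0):
    """Z score of a gene's mean distance to one category vs. random entity sets.

    distances: {entity: distance from a single gene}.
    A negative Z score means the gene is closer to the category than expected.
    """
    rng = random.Random(seed)
    all_entities = sorted(distances)
    k = len(category_entities)
    observed = statistics.fmean(distances[e] for e in category_entities)
    # Mean distance to k randomly chosen entities, repeated n_rounds times.
    random_means = [
        statistics.fmean(distances[e] for e in rng.sample(all_entities, k))
        for _ in range(n_rounds)
    ]
    mu = statistics.fmean(random_means)
    sigma = statistics.pstdev(random_means)
    return (observed - mu) / sigma


# Toy data: entity "E0" is nearest to the gene, "E99" is farthest.
toy_distances = {f"E{i}": float(i) for i in range(100)}
toy_category = [f"E{i}" for i in range(10)]  # the ten nearest entities

z = category_z_score(toy_distances, toy_category)
is_associated = z < -3  # selection rule stated in the text
```

Repeating this for every gene and category would give a matrix of scores comparable in shape to the one described for Table S3.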
Specifically, the Spike proteins (S proteins) of different coronaviruses recognize different receptor molecules on human cells: ACE2 binds to the S proteins of the SARS and SARS-CoV-2 viruses, and DPP4 binds to the S protein of the MERS virus. In addition, the FURIN cleavage site on the Spike protein makes SARS-CoV-2 more infectious than SARS, and TMPRSS2 (Transmembrane protease serine 2) is widely reported to mediate and assist in the invasion of host cells by multiple viruses. We found that these four genes were assigned to category C5, which had the corresponding HR label of "Spike protein (S) of coronavirus".
The distribution and meaning of the data can be compared to specific expression values of CSHG under different conditions (here, the category is used as a condition). We applied a general pathway enrichment analysis, in which the most relevant genes from each entity category are taken as the input of the enrichment program (Reimand et al., 2019). Therefore, TWIRLS can recommend the most likely and least likely signaling pathways based on the distribution of the pathway signatures (Table 2). Conversely, TWIRLS can also recommend the most likely and least likely categories for each signaling pathway. As an example, Table 3 shows the signaling pathways most likely associated with category C3 and the most unlikely category.
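One common form of pathway enrichment is an over-representation test based on the hypergeometric distribution. The sketch below shows that general technique only; it is not necessarily the procedure used by the enrichment program cited above, and the function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random.

    n_background: size of the gene universe
    n_pathway:    genes in the universe annotated to the pathway
    n_selected:   genes submitted for one entity category
    n_overlap:    submitted genes that belong to the pathway
    """
    total = comb(n_background, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Invented counts: 15 category genes, 5 of which fall in a 10-gene pathway,
# against a background of 123 genes.
p_value = hypergeom_enrichment_p(123, 10, 15, 5)
```

A small p value marks the pathway as a likely match for the category, and a value near 1 marks it as an unlikely one; in practice a multiple-testing correction would be applied across pathways.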

| Entity category-associated genes involved in generalized interaction networks
The entity cloud (CSSE cloud) associated with ACE2 and DPP4 in the coronavirus knowledge graph. (c-e) The entity clouds of the three IFITM family proteins (IFITM1-3) in the coronavirus knowledge graph. (f) The gene cloud associated with the coronavirus C3 entity category
We coupled the above category information with gene interaction/regulation databases to construct a generalized protein-protein interaction network (PPI network) for 119 genes out of the 123 CSHGs.
We defined the direct interaction between two genes as one degree (1°) of interaction, and the indirect interaction connecting two genes through a third gene as two degrees (2°) of interaction. All the genes in the 1° networks mined in the PPI database are shown in Figure 4. The results after deduplication showed 2,004 pairs among the 119 CSHGs (see Table S6). As a control, 119 randomly selected genes in the database showed between 252 and 612 pairs (average 220.16, SD 35.15).
Compared to random genes, the regulatory connections between CSHGs were significantly enriched (Z score = 50.97).
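The comparison against random gene sets can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the edge-list representation, and the toy network (a six-gene clique attached to a sparse chain) are invented and do not reproduce the numbers reported above.

```python
import random
import statistics


def count_internal_pairs(genes, edges):
    """Number of interaction pairs whose two members are both in the gene set."""
    genes = set(genes)
    return sum(1 for a, b in edges if a in genes and b in genes)


def interaction_enrichment_z(genes, edges, n_rounds=1000, seed=0):
    """Compare the observed pair count with equally sized random gene sets."""
    rng = random.Random(seed)
    universe = sorted({g for edge in edges for g in edge})
    observed = count_internal_pairs(genes, edges)
    random_counts = [
        count_internal_pairs(rng.sample(universe, len(genes)), edges)
        for _ in range(n_rounds)
    ]
    mu = statistics.fmean(random_counts)
    sigma = statistics.pstdev(random_counts)
    return observed, mu, sigma, (observed - mu) / sigma


# Toy network: genes G0-G5 form a dense clique, G5-G49 form a sparse chain.
clique = [f"G{i}" for i in range(6)]
edges = [(a, b) for i, a in enumerate(clique) for b in clique[i + 1:]]
edges += [(f"G{i}", f"G{i + 1}") for i in range(5, 49)]

observed, mu, sigma, z = interaction_enrichment_z(clique, edges)
```

A large positive Z score indicates that the chosen genes are far more densely connected to each other than random genes from the same database.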
Those CSHGs associated with a certain category had much closer interactions. For example, CSHGs associated with category C3 (or with C5 or C10) were closer to each other in the 1° networks (Figure 4). Combining the category information with generalized interaction databases provides richer interactions and regulatory linkages. We extended the 119 CSHGs to their 2° networks based on the interactions with higher-likelihood connections (combined score > 800). The 2° networks expanded the number of genes from 119 host genes to 3,494 genes that may be associated with coronaviruses (see Table S8 for the list of genes excluding the 119 CSHGs (CS119); these additional genes are referred to as CSHG2). These genes are mainly involved in two types of functions: virus-related signaling pathways and immune function-related pathways. Table 4 shows a summary of the KEGG signaling pathways.
Within the entire network, we found several CSHGs in the 1° networks (32.6-35.71%) that directly interacted with the three members of the IFITM family, whereas fewer CSHGs in the 2° networks (5.21-9.46%) indirectly interacted with them. Although there was a higher proportion of directly interacting CSHGs, they were not significantly enriched in any category (see Table S9 for the enrichment scores of the 1° network nodes in different categories), whereas the indirect CSHGs were significantly enriched mainly in the C3 and C10 categories (Z score > 3) (see Table S10 for the enrichment scores of the 2° network nodes in different categories). These findings demonstrate that TWIRLS can provide new insights about hub molecules, particularly when coupled with interaction information. These new candidate IFITM genes had potential functions associated with category C3.
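The expansion from seed genes to their one-degree and two-degree neighbors can be sketched as below. This is a minimal illustration: the function name `expand_two_degrees`, the tuple layout of the scored edges, and the toy scores are assumptions, and `GENE_X` and `GENE_Y` are hypothetical placeholders.

```python
def expand_two_degrees(seed_genes, scored_edges, min_score=800):
    """Return (first-degree, second-degree) neighbor sets of the seed genes.

    scored_edges: iterable of (gene_a, gene_b, combined_score) tuples.
    Only links with combined_score > min_score are followed.
    """
    neighbors = {}
    for a, b, score in scored_edges:
        if score > min_score:  # keep only higher-likelihood connections
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)

    seeds = set(seed_genes)
    # Genes linked directly to a seed gene.
    first = set().union(*(neighbors.get(g, set()) for g in seeds)) - seeds
    # Genes reached through exactly one intermediate gene.
    second = set().union(*(neighbors.get(g, set()) for g in first)) - first - seeds
    return first, second


toy_edges = [
    ("ACE2", "AGT", 950),
    ("AGT", "REN", 900),
    ("ACE2", "GENE_X", 400),   # below the cutoff, so it is ignored
    ("REN", "GENE_Y", 990),    # three steps from ACE2, so it is not reached
]
first_degree, second_degree = expand_two_degrees(["ACE2"], toy_edges)
```

Raising or lowering `min_score` trades network size against confidence in the individual links.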
However, after adding generalized interaction information, TWIRLS also inferred possible functions of these proteins not associated with any category.
TABLE 1 Coronavirus-entity category labels and genes associated with each category. MISC indicates the label cannot be summarized
Although entities in category C5 mainly show that virus invasion is facilitated by virus-binding receptors and membrane proteases, the biological mechanism of receptor binding to viruses leading to pathological changes has been reported less frequently.
TWIRLS can also recommend new genes that interact with C5 CSHGs, and other 1° or 2° CSHGs linked to these genes might be enriched in other categories. These inferences are based on a process that finds new genes connected to different categories. The connected categories can suggest potential regulatory relationships between different biological functions or phenotypes. The genes that serve as linkers are potential targets for gain- and loss-of-function experiments to identify those systems described by the meaningful entities in these categories.
In this study, TWIRLS found that the 2° networks had connections with certain CSHGs associated with categories or with no category.
For example, TWIRLS found that CSHGs in the 2° connections of IFITM1 were mainly concentrated in category C3 (see Figure 5). Interestingly, CSHGs in the 2° connections of ACE2 and DPP4, which are associated with category C5, were also enriched in category C3, suggesting that the information summarized in category C3 probably describes the underlying mechanisms of the pathological changes after coronavirus infection. In our analysis, the signaling pathways in C3 were mainly RAS, Vitamin D and RXR activation, and Chemokine signaling, with RAS being the most significant (Table 3 shows a summary of the C3-related signaling pathways).

| Angiotensin II receptor blockers (ARBs) may be beneficial in patients with COVID-19
It has been demonstrated that the binding of the coronavirus spike proteins to ACE2 leads to ACE2 downregulation (Jia, 2016), which in turn results in unbalanced regulation of the ACE-Ang II axis and ACE2- [...] The results above also suggest that the homeostatic imbalance of RAS could be caused by viral binding to membrane ACE2 molecules, which may lead to dysregulation of inflammatory factor levels. Therefore, we evaluated the effects of AT1R antagonists (ARBs) such as losartan and telmisartan on SARS-CoV-2 infection. We analyzed the medical records of 92 patients diagnosed with COVID-19 pneumonia based on the New Coronavirus Pneumonia Prevention guidelines. More than one-half of these patients (51.1%) had one or more underlying conditions, including 31 patients (33.7%) with hypertension (Figure 6a-c).
Clinically, patients with COVID-19 are classified as mild, severe, or critical according to the partial pressure of oxygen test; in this study, we could also distinguish these classes by analyzing differences in their blood indices. Using these blood indices, we then compared hypertensive patients with and without ARB treatment, and found that these two groups could also be clearly distinguished. Here, 31 numeric blood indices in COVID-19 patients were selected as the clinical characteristics, including the functional indices of the liver, kidney, and heart (Table S11).
Patients with other medical history were also evaluated and clustered (see below).
For each index, we calculated the average of each group. As the numerical indices among patients are not always normally distributed, we built a random distribution of the group mean by randomly drawing groups of patients 10,000 times. According to the central limit theorem, this random distribution of means should be approximately normal for any index. Therefore, the Z score measures the statistical significance of each index for each patient group defined by illness severity or treatment.
FIGURE 5 The gene interaction networks centered around DPP4, ACE2, and IFITM1, respectively. The yellow nodes represent the ACE2, DPP4, and IFITM1 genes, purple nodes represent genes that have 1° of interaction with the core genes, green-circled purple nodes represent the genes connecting CSHG and C3 category-related genes, and pink nodes represent genes with 2° of interaction with the core gene. The red diamonds show the most relevant entity category symbol for CSHG
Patients with hypertension were divided into two groups, ARB users and non-ARB users. Out of the 31 hypertensive patients, eight took ARB drugs (one took telmisartan, two took candesartan, three took irbesartan, and two took valsartan) and the other 23 patients took other drugs such as calcium antagonists or diuretics before admission. After admission, all hypertensive patients were assigned a calcium antagonist for targeted treatment. We also considered 16 patients with other medical history but without hypertension as the other patient group. For each group, the blood indices were compared between ARB and non-ARB patients without any medical history to analyze differences with corresponding Z scores. We obtained Z scores of six groups representing the clinical characteristics of mild illness, severe illness, critical illness, ARB users, non-ARB users, and patients without medical history, respectively (Table 5). The cluster analysis of the Z scores showed a closer relationship between non-ARB users and severe illness (see Figure 6d), suggesting that ARB anti-hypertensive drugs may have positive effects on reducing the severity of COVID-19.
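The comparison of group Z-score profiles can be illustrated with a very small sketch. The text does not state the distance metric or linkage used for the cluster analysis, so the Euclidean distance and the nearest-profile lookup here are assumptions; the function name is hypothetical, and the Z scores are invented toy numbers for three indices, not patient data.

```python
from math import dist


def closest_group(profiles, target):
    """Name of the group whose Z-score profile is nearest to the target's profile."""
    return min(
        (name for name in profiles if name != target),
        key=lambda name: dist(profiles[name], profiles[target]),
    )


# Invented toy Z scores (one value per blood index) for the six groups.
profiles = {
    "mild": [-1.2, -0.8, -1.0],
    "severe": [1.5, 1.1, 0.9],
    "critical": [3.0, 2.6, 2.8],
    "ARB users": [-0.9, -0.5, -1.1],
    "non-ARB users": [1.3, 1.4, 0.7],
    "no medical history": [-0.3, 0.1, -0.2],
}

nearest_to_non_arb = closest_group(profiles, "non-ARB users")
nearest_to_arb = closest_group(profiles, "ARB users")
```

A full hierarchical clustering would repeat this kind of distance comparison while merging groups step by step; the lookup above only shows the pairwise comparison on which such a dendrogram rests.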

| Discussion
We used TWIRLS, a machine-based approach, to collect, summarize, and analyze about 15,000 biomedical articles related to coronavirus, with the aim of elucidating the mechanisms underlying coronavirus-induced host pathological changes. The TWIRLS system is an automated process that can be used to summarize the entities and genes related to coronavirus infection. By combining this system with generalized interaction databases, we can reveal further associations that can provide a deeper understanding of the biological mechanisms of the disease phenotype caused by virus-host interactions. Using TWIRLS, we found a possible mechanism involving ACE2/AT2R-RAS-Cytokine signaling, which becomes imbalanced under virus infection, leading to cytokine storms.
Angiotensin II (Ang II) is the main effector of the RAS and exerts most of its actions through the activation of Ang II type 1 and type 2 receptors (AT1R and AT2R) (Donoghue et al., 2000). Angiotensin II is formed by the successive enzymatic action of renin and ACE. Deficiency of ACE2 causes respiratory failure pathologies such as sepsis, pneumonia, and SARS (Boehm & Nabel, 2002; Imai et al., 2005). It has been confirmed that genetic deletion of AT1a receptor expression in mice can significantly improve lung function and reduce the formation of pulmonary edema compared with wild-type mice (Sugaya et al., 1995). In contrast, inactivation of AT2R in mice aggravated acute lung injury. This suggests that AT1R mediates the pathogenicity of Ang II, whereas activated AT2R has a protective role (Hein, Barsh, Pratt, Dzau, & Kobilka, 1995). Although Ang II was originally described as an effective vasoconstrictor, there is growing evidence that it is closely involved in the inflammatory response of the immune system. Proinflammatory [...] Ang II (Nataraj et al., 1999; Rudemiller & Crowley, 2016; Suzuki, Ruiz-Ortega, Gomez-Guerrero, Tomino, & Egido, 2003). In particular, proinflammatory cytokines regulate the production of AGT in the liver and kidney (Brasier, Ron, Tate, & Habener, 1990; Corvol & Jeunemaitre, 1997; Sriramula, Haque, Majid, & Francis, 2008) [...] of tissue inflammation. Therefore, RAS dysfunction may result in the accumulation of cytokines in the lungs, leading to excessive accumulation of immune cells and interstitial fluid, resulting in blocked airways and eventual death. In the first reports of severely infected patients diagnosed with COVID-19, a large number of patients experienced "cytokine storms" that were fatal. Figure 7 summarizes the functional changes and pathological consequences of RAS after ACE2 binds to the coronavirus. We expect that the mechanism summarized and inferred by TWIRLS can be further supported by pathological evidence.
To date, only one report of a postmortem biopsy has been published with pathological data. Although histological examination showed bilateral diffuse alveolar damage with cellular fibromyxoid exudates, the right lung showed evidence of desquamation of pneumocytes and hyaline membrane formation, indicating acute respiratory distress syndrome (ARDS), whereas the left lung showed pulmonary edema with hyaline membrane formation, suggestive of early-phase ARDS. The pathological evidence suggests that ARDS symptoms are closely related to cytokine storms (Xu et al., 2020). Based on the above results, we analyzed the clinical characteristics of COVID-19 patients, which showed that patients taking ARBs were at a lower risk of developing severe lung damage than non-ARB patients, indicating these anti-hypertensive drugs may have positive effects on COVID-19 patients.
Meanwhile, some recent hypotheses also support the conclusion that angiotensin receptor 1 (AT1R) inhibitors might be beneficial for pneumonia patients with COVID-19 (Gurwitz, 2020). In addition, the available evidence, in particular data from human studies, does not support the hypothesis that using ACEI/ARB increases ACE2 expression and the risk of complications from COVID-19 (Sriram & Insel, 2020). Therefore, we suggest that ARBs can be used as potential alternatives for COVID-19 treatment. At present, there are several ongoing clinical trials testing the efficacy of ARBs in COVID-19 patients (telmisartan (Rothlin, Vetulli, Duarte, & Pelorosso, 2020), NCT04355936; losartan, NCT04312009; valsartan, NCT04335786).
We hope that further evidence from ARB clinical trials and more histopathology-related data will support our preliminary findings obtained with this machine-based approach. At the same time, to further correct biases in the structured knowledge generated by the algorithm, more rigorous statistical methods, as well as discussions and interviews with scientists, are needed to ensure that the goals of the machine learning algorithm are consistent with those of human researchers.
That is to say, only by combining human experts and algorithms to realize machine learning with human guidance can the development of machine learning truly be promoted in the future.
FIGURE 7 Disequilibrium of RAS-cytokine signaling homeostasis causing cytokine storms triggered by ACE2-mediated coronaviral infection

DECLARATIONS ACKNOWLEDGMENTS
This work was supported by the Public Library Association Youth Talent Project 17QNP010; Chongqing Health Commission COVID-19 Project 2020NCPZX01.