Establishment of a Knowledge‐and‐Data‐Driven Artificial Intelligence System with Robustness and Interpretability in Laboratory Medicine

Laboratory medicine plays an important role in clinical diagnosis. However, no laboratory‐based artificial intelligence (AI) diagnostic system has been applied in current clinical practice due to the lack of robustness and interpretability. Although many attempts have been made, it is still difficult for doctors to adopt the existing machine learning (ML) patterns in interpreting laboratory (lab) big data. Here, a knowledge‐and‐data‐driven laboratory diagnostic system is developed, termed AI‐based Lab tEst tO diagNosis (AI LEON), by integrating an innovative knowledge graph analysis framework and “mixed XGboost and Genetic Algorithm (MiXG)” technique to simulate the doctor's laboratory‐based diagnosis. To establish AI LEON, we included 89 116 949 laboratory data and 10 423 581 diagnosis data points from 730 113 participants. Among them, 686 626 participants were recruited for training and validating purposes with the remaining for testing purposes. AI LEON automatically identified and analyzed 2071 lab indexes, resulting in multiple disease recommendations that involved 441 common diseases in ten organ systems. AI LEON exhibited outstanding transparency and interpretability in three universal clinical application scenarios and outperformed human physicians in interpreting lab reports. AI LEON is an advanced intelligent system that enables a comprehensive interpretation of lab big data, which substantially improves the clinical diagnosis.


Introduction
Artificial intelligence (AI)-based technologies in medicine are being developed rapidly. [1] More medical AI systems have been developed to mimic doctors' consultations. [2,3] For instance, AI has been applied to classify skin cancer with performance on par with human dermatologists [2] and to detect diabetic retinopathy from retinal fundus photographs with competence comparable to ophthalmologists. [3] However, even the most prominent AI supporters have recently emphasized that it is currently difficult to achieve truly usable AI in clinical diagnosis. [4][5][6] In the pioneering era of AI, reasoning methods were logical, symbolic, [7] and highly dependent on knowledge. [8] A logical reasoning engine distills knowledge into explicit computer code to guide computers in accurately processing data and making decisions. [1] Its outputs are fully interpretable because they are based on rules entirely defined by human experts. However, its practical applications are extremely limited due to the breadth and complexity of capturing all medical knowledge. [9] Recently, deep learning (DL) models have made significant progress in automatically extracting rules from large amounts of data, rather than having them explicitly specified by experts. [8] Owing to their high performance, these data-driven AI systems have made many breakthroughs, such as in sound recognition and image recognition (radiology, pathology, dermatology, etc.). [2,[10][11][12] However, data availability and reliability are a foundational challenge, as the quality of the output is highly dependent on the quality of the input. [1] In addition, accurate and scalable data interpretation is another major challenge, reflected in the US Food and Drug Administration's "black box" warning on DL models. [4] In medical diagnosis, we need not only to learn from prior data to extract and generalize knowledge [13] but also to clarify the complex inter-relationships and potential explanatory factors of that knowledge. [14] Therefore, the key to truly usable AI is to develop a transparent, interpretable, trustworthy, and robust medical diagnostic system through the integration of knowledge-driven and data-driven models.
As we all know, laboratory tests have been widely used to exclude, confirm, classify, or monitor various diseases to assist doctors in their clinical decisions. For example, more than 41% of emergency patients require a laboratory test, and many diseases, such as diabetes, cancers, thrombotic disorders, and genetic diseases, cannot be diagnosed without laboratory testing. [15,16] More importantly, laboratory data are needed to support evidence-based medical practice and the development of guidelines for the clinical diagnosis and treatment of many diseases. For example, of 1230 evidence-based clinical practice guidelines for 23 major diseases, 37% involved laboratory tests. [15] Collectively, laboratory medicine can provide a wealth of factual, explicit, and tacit knowledge for clinical practice and is therefore well suited for establishing knowledge-based interpretable AI diagnostic systems. Moreover, compared with imaging data, which require manual annotation and careful quality control, laboratory data provide several advantages. First, over the past two decades, automation has dramatically increased the capacity and improved the standardization of clinical laboratories, generating vast amounts of structured or semistructured data. [17,18] Second, over the last 30 years, much progress has been made in improving the standardization and harmonization of laboratory methods and results. [16,19] For example, the International Consortium for Harmonization of Clinical Laboratory Results (ICHCLR) has focused on the unification of laboratory results. [19] Therefore, this large volume of high-quality laboratory data is also very suitable for establishing data-driven diagnostic systems.
In laboratory-based diagnosis, the doctors usually formulate a checklist for laboratory tests based on chief complaints. By evaluating abnormal laboratory results, the doctors compile a list of possible underlying diseases on the basis of their knowledge and experience. For those complicated diseases, the doctors may further add laboratory tests to increase the accuracy of the diagnosis. Then, the most severe diseases (such as malignant tumors) and related diseases (such as hematuria) are ranked first, followed by other minor diseases (such as malnutrition). Therefore, laboratory-based diagnosis is not only a classification process but also a complex ranking one. To mimic the decision-making processes of human doctors, we need to develop a novel AI diagnostic system that not only calculates the probabilities of different diseases but also automatically ranks them according to their severity and relevance. Moreover, to better assist decision-making, such a system should also be transparent, interpretable, and comprehensible for human doctors.
In this study, we developed a novel knowledge-and-data-driven AI diagnostic system, termed "AI-based Lab tEst tO diagNosis" (AI LEON). At the technical level, we integrated knowledge-driven and data-driven models by combining a comprehensive knowledge graph analysis framework and a state-of-the-art "mixed XGboost and Genetic Algorithm (MiXG)" technique. At the application level, we implemented multiple disease recommendations in clinical practice and simulated the doctor's laboratory-based disease reasoning process through the effective integration of medical knowledge, big medical data, and real-world diseases.
For the retrospective cohort, 686 626 participants were randomly divided into a training set (509 841 participants with 65 034 574 laboratory data points) and a validation set (235 074 participants with 16 268 666 laboratory data points). These data were used to develop and validate AI LEON, which was eventually tested in a large cohort of 69 101 participants with 7 813 709 laboratory data points and 370 common diseases. A summary of the workflow is provided in Figure 1. The clinical characteristics of the eligible participants are detailed in Table 1. The participants had a median age of 56 years (interquartile range, 38 to 67 years), and 54.58% were male. All the diseases were categorized into ten organ systems: infection, respiratory system, hematological system, gastrointestinal system, urinary system, endocrine system, circulatory system, immune system, reproductive system, and nervous system (Figure 2). The most frequently encountered diseases of each system are listed in Table 1.

AI LEON to Simulate Physicians' Diagnosis
To mimic human physicians' clinical diagnostic process in real time, AI LEON was established using two core models: the knowledge-driven model (Figure 3a) and the data-driven model (Figure 3b). In this system, the knowledge-driven models could diagnose 273 commonly encountered diseases, whereas the data-driven models could predict the probability of the remaining 168 diseases. The former first provided evidence of the likelihood of the diseases, and the latter was triggered to further calculate the probability of the diseases once the former failed to diagnose them directly. The workflow of AI LEON is depicted in Figure 3c, demonstrating how it works in the clinic. First, all the laboratory data from the participants are systematically evaluated by the system. Then, the possible organ systems are recommended, followed by a list of specific diseases according to their different risk levels. Ultimately, the platform is expected to assist doctors in making accurate and timely clinical interventions.
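The trigger logic described above (a knowledge rule answers Yes/No or flags "Sus", and only a suspicious verdict escalates to the data-driven model) can be sketched as follows. The anemia rule, its threshold, and the placeholder probability function are hypothetical illustrations, not taken from AI LEON:

```python
# Minimal sketch of AI LEON's two-stage workflow (illustrative only):
# a knowledge-driven rule returns "Yes", "No", or "Sus" (suspicious);
# a "Sus" verdict triggers the corresponding data-driven model.

from typing import Callable, Dict

def knowledge_rule_anemia(labs: Dict[str, float]) -> str:
    """Toy rule: hemoglobin below a threshold -> suspicious for anemia."""
    hb = labs.get("hemoglobin")
    if hb is None:
        return "No"
    return "Sus" if hb < 120 else "No"

def data_model_anemia(labs: Dict[str, float]) -> float:
    """Placeholder standing in for a trained classifier's probability."""
    hb = labs.get("hemoglobin", 150.0)
    return max(0.0, min(1.0, (130.0 - hb) / 60.0))

def diagnose(labs: Dict[str, float],
             rule: Callable[[Dict[str, float]], str],
             model: Callable[[Dict[str, float]], float]):
    verdict = rule(labs)
    if verdict == "Sus":               # rule is uncertain -> escalate
        return ("probabilistic", model(labs))
    return ("rule", verdict)

print(diagnose({"hemoglobin": 95.0}, knowledge_rule_anemia, data_model_anemia))
```

The key design point mirrored here is that the rule engine stays fully interpretable, and the statistical model only runs when the rules cannot commit.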

AI LEON Showing High Accuracy for Disease Diagnosis
AI LEON was independently tested in the prospective test cohort, in which the diagnostic performance on 370 frequently encountered diseases across ten organ systems was assessed. Among them, 203 diseases were directly diagnosed by the knowledge-driven models of the system. The other 167 possible diseases were first evaluated by the knowledge-driven models, and then their probabilities were calculated by the data-driven models.
For the first level of diagnosis, a high level of accuracy was achieved to predict the involved organ systems; the overall mean average precision (mAP) was 87.74%, ranging from 81.43% for the circulatory system to 96.96% for infection (See Table 2). For the final level of diagnosis, the system showed a solid performance in diagnosing diseases with a median mAP of 84.40%. The diagnostic performance was also evaluated in organ-based subcohorts of patients. Under a 100% recall rate across all subcohorts, a high level of mAP was achieved, ranging from 74.54% for the nervous system to 91.72% for infection (See Table 2).
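To make the metrics concrete, here is a minimal sketch of mean average precision and recall over top-n ranked diagnoses, using toy data. This is our reading of standard mAP@n/recall@n, not the study's evaluation code:

```python
# Each patient has a ranked list of predicted diagnoses and a set of
# true diagnoses. AP averages precision at each hit within the top n;
# recall@n is the fraction of true diagnoses found in the top n.

def average_precision_at_n(ranked, truth, n):
    hits, score = 0, 0.0
    for k, d in enumerate(ranked[:n], start=1):
        if d in truth:
            hits += 1
            score += hits / k        # precision at this hit position
    return score / min(len(truth), n) if truth else 0.0

def recall_at_n(ranked, truth, n):
    return len(truth & set(ranked[:n])) / len(truth) if truth else 0.0

patients = [
    (["pneumonia", "sepsis", "anemia"], {"pneumonia", "anemia"}),
    (["ckd", "anemia", "gout"], {"anemia"}),
]
m_ap = sum(average_precision_at_n(r, t, 3) for r, t in patients) / len(patients)
m_recall = sum(recall_at_n(r, t, 3) for r, t in patients) / len(patients)
print(round(m_ap, 3), round(m_recall, 3))
```

Averaging AP across patients gives mAP, which is why a longer recommendation list can raise recall while slightly lowering mAP, as seen in the results below.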
In addition, the mAP of the top n recommendations for broad organ systems and the corresponding recall rates were evaluated (Table 3, Figure 4a). For instance, if the top five organ systems were recommended, a mAP of 89.83% and a recall rate of 92.80% were achieved, respectively. If the top 10 organ systems were recommended, the mAP was 87.75%, and the recall rate increased to 99.96%. Further, the mAP of the top n recommendations for all diseases and the corresponding recall rates were also evaluated (Table 3, Figure 4b). A high mAP of 91.95% was achieved when the top five diseases were diagnosed by AI LEON, with a recall rate of 66.99%. When the top 10 diseases were diagnosed, the recall rate increased to 85.41%, with the mAP reaching 88.85%. Similar results were observed in the ten organ-based subcohorts of patients (Table 4, Figure 4c-l). For example, AI LEON had the highest mAP when diagnosing the subcohort of patients with infectious diseases. When five infectious diseases were recommended, the mAP was 95.67% and the recall rate was 69.81%; when 10 infectious diseases were recommended, the mAP was slightly reduced to 94.26%, and the recall rate increased to 87.06%. These results indicate that very few diseases were missed and a very high degree of accuracy was achieved when ten diseases were diagnosed by AI LEON.

Figure 2. Hierarchy of the diagnostic framework of AI LEON. Using an organ-based approach, the diagnoses are divided first into ten broad organ systems and then subsequently into specific disease groups under each organ system. Abbreviations: AI LEON, Artificial Intelligence-based Lab tEst tO diagNosis.

www.advancedsciencenews.com www.advintellsyst.com

Figure 3. Overview of AI LEON. a) The basic working mechanism of the knowledge-driven models in AI LEON. The output results of the knowledge-driven model are divided into two categories: definite diagnosis and suspicious diagnosis. The former shows very strong confidence in disease diagnosis (e.g., Yes/No) based on a combination of lab results. The latter shows relatively low confidence in some diseases (e.g., Sus). b) The basic working mechanism of the data-driven models in AI LEON. A suspicious diagnosis suggested by the knowledge-driven model triggers the corresponding data-driven model, which further calculates the probability and relevance of the diseases. c) The workflow of AI LEON. The platform systematically evaluates the overall results of laboratory tests from the population. Then, the possible organ systems are recommended, followed by listing the specific diseases according to the different risk levels of the diseases. Ultimately, the AI LEON platform is expected to assist the doctors in making timely clinical interventions. Abbreviations: AI LEON, Artificial Intelligence-based Lab tEst tO diagNosis; Sus, Suspicious diagnosis.

AI LEON Showing Good Clinical Interpretability
To further illustrate how AI LEON provides the logical reasoning process of disease diagnosis, we envisioned three clinical application scenarios. The first scenario, named "gold standard-based diagnosis," uses knowledge-driven models based on guidelines and expert consensus (Figure 5a). Taking COVID-19 as an example, the disease can be confirmed from the COVID-19 RNA test report when at least two genes (ORF1ab, N, and E) are positive.
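The "at least two positive genes" criterion can be encoded directly as a rule; the function below is our illustrative encoding of that criterion, not the system's actual diagnostic formula:

```python
# Hypothetical encoding of the gold-standard rule described above:
# confirm COVID-19 when at least two of the three target genes
# (ORF1ab, N, E) are positive on the RNA test report.

def covid_rule(report):
    """report maps gene name -> 'positive' or 'negative'."""
    genes = ("ORF1ab", "N", "E")
    positives = sum(report.get(g) == "positive" for g in genes)
    return "confirmed" if positives >= 2 else "not confirmed"

print(covid_rule({"ORF1ab": "positive", "N": "positive", "E": "negative"}))
```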
The second scenario was called "specific diagnosis" using etiology-related knowledge-driven models ( Figure 5b). The first knowledge-driven model is triggered to make a preliminary diagnosis, and then the second or more knowledge-driven models are triggered to give the definitive diagnosis. Take renal anemia as an example; the definitive diagnosis is finally made by integrating the anemia diagnostic formula and chronic kidney disease (CKD) formula.
The third scenario was termed "differential diagnosis," combining knowledge-driven models with data-driven models. As shown in Figure 5c, an abnormal laboratory parameter usually represents several potential diseases predicted by the knowledge-driven model. Subsequently, the corresponding data-driven models automatically evaluate the probability of these diseases using all the available laboratory parameters. For example, the knowledge-driven model predicts that an increase in CA19-9 may be accompanied by various diseases, such as pancreatic cancer, pancreatitis, ovarian cysts, etc. Then, the corresponding data-driven models are triggered to calculate the probabilistic risks of these possible diseases. Finally, an ordered list of diagnoses is produced, ranked from the highest to the lowest probability.

AI LEON Outperforms Human Physicians in Laboratory-Based Diagnosis
We also compared the diagnostic performance of AI LEON with that of 15 certified physicians in interpreting the lab reports of 100 cases from an independent test cohort. The reader study consisted of a blank-filling study and a true–false study. In the blank-filling study, the accuracy of AI LEON was higher than that of each general practitioner (GP) group (Figure 6a and Table 5; AI LEON: 0.791 vs GP groups: 0.637, 0.487, and 0.499, respectively). AI LEON's mAP was much better than that of all three GP groups in both the overall patients (Figure 6a and Table 5) and the ten organ-based subgroups (Figure 6b–k and Table 6). AI LEON also achieved a higher recall rate of diagnoses than each GP group (Figure 6a,l–v and Table 6). Taken together, our results indicate that AI LEON may be used to improve the clinical diagnosis of human physicians.

Discussion
Here, we presented a novel interpretable knowledge-and-data-driven AI system (AI LEON) to predict diagnoses by automatically analyzing laboratory reports. By learning from a large amount of knowledge and examples, AI LEON demonstrated excellent diagnostic power across broad organ systems and specific diseases. After running the human–machine competitions, we believe that AI LEON is emerging as a potentially powerful tool to imitate and even augment the clinical diagnosis traditionally achieved by experienced physicians.

A hierarchical diagnostic system was established using the knowledge graph analysis framework and the MiXG algorithm to simulate the physicians' reasoning process. The diagnostic system automatically analyzed 2071 lab indicators and performed a primary diagnosis across ten organ systems, including infection and the respiratory and gastrointestinal systems, among others. This process mimicked the traditional framework used in physician reasoning to formulate an organ-based primary diagnosis. Then, under each organ system, the diagnostic system executed further subclassification to diagnose 441 specific diseases, simulating the disease diagnosis process of specialist physicians. In short, AI LEON could categorize the possible diseases into the corresponding organ systems and present the final disease diagnosis according to different risk levels.
It is notable that the diagnosis process is more of a ranking process than a simple binary classification. Therefore, we proposed an innovative MiXG algorithm that first calculated the probabilities of different diseases from the XGBoost classifiers and then adopted the mAP metric, optimized by the GA models, to ensure that the most relevant and severe diagnostic result was ranked first. The mAP of AI LEON reached 89.83% and 91.95% for predicting the top five organ systems and the top five specific diseases, respectively (Table 3), clearly demonstrating the high accuracy and completeness of our system. Clinically, it remains a daunting challenge to identify all the possible underlying diseases, especially when evaluating diseases on the basis of laboratory tests alone. Using routine laboratory tests as input, our platform succeeded in making reasonably accurate diagnosis predictions for broad organ systems as well as specific diseases. We further performed a human–machine competition and confirmed that AI LEON outperformed human physicians in laboratory-based diagnosis (Tables 5 and 6). Based on these findings, we believe that our system may help physicians improve their diagnostic accuracy.
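The paper does not publish MiXG's internals, so the sketch below is only one plausible reading of the idea: per-disease classifier probabilities (standing in for XGBoost outputs) are re-weighted, and a small genetic algorithm searches for the weights that maximize mAP of the resulting ranking on validation cases. The diseases, probabilities, and GA settings are all made up:

```python
# Loose sketch of the MiXG concept (our interpretation, not the
# published implementation).

import random

def ap(ranked, truth):
    """Average precision of one ranked list against a truth set."""
    hits, s = 0, 0.0
    for k, d in enumerate(ranked, 1):
        if d in truth:
            hits += 1
            s += hits / k
    return s / len(truth) if truth else 0.0

def map_fitness(weights, cases, diseases):
    """mAP of rankings produced by weight-scaled probabilities."""
    total = 0.0
    for probs, truth in cases:
        ranked = sorted(diseases, key=lambda d: -weights[d] * probs[d])
        total += ap(ranked, truth)
    return total / len(cases)

def mixg_search(cases, diseases, pop=20, gens=30, seed=0):
    rng = random.Random(seed)
    population = [{d: rng.random() for d in diseases} for _ in range(pop)]
    for _ in range(gens):
        scored = sorted(population,
                        key=lambda w: -map_fitness(w, cases, diseases))
        parents = scored[: pop // 2]                     # selection (elitist)
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            child = {d: rng.choice((a[d], b[d])) for d in diseases}  # crossover
            if rng.random() < 0.3:                       # mutation
                child[rng.choice(diseases)] = rng.random()
            children.append(child)
        population = parents + children
    return max(population, key=lambda w: map_fitness(w, cases, diseases))

diseases = ["pancreatic_cancer", "pancreatitis", "ovarian_cyst"]
cases = [({"pancreatic_cancer": 0.4, "pancreatitis": 0.5, "ovarian_cyst": 0.1},
          {"pancreatic_cancer"}),
         ({"pancreatic_cancer": 0.3, "pancreatitis": 0.6, "ovarian_cyst": 0.2},
          {"pancreatic_cancer"})]
best = mixg_search(cases, diseases)
print(round(map_fitness(best, cases, diseases), 3))
```

The design point illustrated is that the GA optimizes the ranking objective (mAP) directly, which a per-disease classifier trained on accuracy alone does not do.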
Many automated machine learning (ML) diagnostic models [20,21] (such as gradient-boosted decision trees, support vector machines, random forests, Bayesian classifiers, etc.) have been established using laboratory data. In comparison, the AI LEON system has two advantages. First, it exhibits excellent robustness and accuracy in laboratory-based disease diagnosis. The performance of ML models is known to continue to improve as the amount of input data increases. Herein, we used a massive volume of data, including 89 116 949 lab data points and 10 423 581 diagnostic data points, to develop and test the models, significantly improving the robustness of the diagnostic system. In addition, we used the knowledge graph to link the lab data with the diagnostic data and excluded data that lacked knowledge relevance. All the remaining data included in the system were well curated, including data standardization, normalization, missing value padding, denoising, etc. These high-quality data further improved the accuracy of the diagnostic system. Second, AI LEON shows good interpretability, alleviating the "black box" problems of many AI platforms. [22] It is thought that if a system's reasoning cannot be explained, such evaluation is not feasible in practice. [23] Our system, however, enables the users to understand the basis for diagnostic recommendations and provides doctors with comprehensive and accurate diagnostic support, even when the doctors are not equipped with AI knowledge. Specifically, the knowledge graph can visually display the relationship between the lab data and the diagnostic data, and the MiXG model can present each feature included in the model along with the importance scores of these features. The combination of the two technologies can greatly improve the transparency and interpretability of the AI system. Therefore, this comprehensive analysis method can help to achieve full coverage of high-level laboratory-based diagnosis.
As we all know, advanced imaging and pathological tests are less available in resource-constrained settings, whereas laboratory tests are the most commonly used diagnostic method. [24] We envision that AI LEON could be used as an assistive tool by primary healthcare (PHC) doctors. Given the model's high top-n accuracy, AI LEON could be used to narrow down the potential disease diagnoses using the top-n predicted diagnoses, followed by more informed ancillary tests to obtain the final diagnosis. In this way, the AI LEON-assisted clinical workflow could be optimized, and doctors' misdiagnosis and missed-diagnosis rates may also be reduced. In addition, in developing regions with a limited number of experienced doctors, our system could be deployed at the lowest cost to provide direct medical consultation to patients (e.g., through mobile apps). Insufficient medical resources and the continuous increase in chronic diseases are challenging our medical systems. [25] Our real-time AI diagnostic system could provide patients with preliminary diagnosis results and perform appropriate hierarchical diagnosis and treatment, thereby alleviating the shortage of medical resources. It has been well documented that the benefits of a patient's objective laboratory data include screening populations for asymptomatic complex diseases, risk stratification of diseases, and specific diagnosis of the patient's condition. [19] Therefore, we expect that our laboratory-based diagnostic system can help improve the early diagnosis of complex and refractory diseases, increasing survival rates and decreasing treatment expenses.

Despite these encouraging results, our AI system still has some limitations. First, the knowledge graph base needs to be iterated and updated continuously to ensure that our system always contains the latest medical laboratory knowledge. Second, as the accuracy of our system is heavily dependent on the quality of the input data, all the laboratory data should be carefully preprocessed. Also, if our system is to be generalized to an external test cohort, the harmonization of lab data from different healthcare systems is a key premise. Through the standardization and harmonization of laboratory methods and results, [16,19] these limitations could probably be lifted.

Figure 5. Clinical application scenarios of AI LEON. a) Gold standard-based diagnosis using knowledge-driven models based on the guidelines and expert consensus. Using COVID-19 as an example, the infection of COVID-19 could be confirmed by the COVID-19 diagnostic formula, evaluating the COVID-19 RNA testing report with at least two positive genes of ORF1ab, N, and E. b) Specific diagnosis using etiology-related knowledge-driven models. Using renal anemia as an example, the definitive diagnosis was finally made by integrating the anemia diagnostic formula and the CKD formula. c) Differential diagnosis by combining knowledge-driven models with data-driven models. Using CA19-9 as an example, the knowledge-driven model first suggests that an increase in CA19-9 may be related to some diseases, such as pancreatic cancer, pancreatitis, ovarian cysts, etc. Then, the corresponding data-driven models for each disease are triggered, and the possibility and relevance of these diseases are further calculated. At the same time, the importance scores of the features included in each model are presented. Abbreviations: AI LEON, Artificial Intelligence-based Lab tEst tO diagNosis; CKD, chronic kidney disease; RT, routine test.

Figure 6. Comparison of the diagnostic performance between AI LEON and human physicians. a) The diagnostic performance of AI LEON and three GP groups in the blank-filling study for overall patients. b–k) The diagnostic performance of AI LEON and three GP groups in the blank-filling study for ten organ-based subgroups of patients. l) The diagnostic performance of AI LEON and three GP groups in the true-or-false study for overall patients. m–v) The diagnostic performance of AI LEON and three GP groups in the true-or-false study for ten organ-based subgroups of patients. Abbreviations: AI LEON, Artificial Intelligence-based Lab tEst tO diagNosis; GPs, General Practitioners; mAP, mean average precision.
We have established a knowledge-and-data-driven intelligent laboratory system, AI LEON, with good robustness and interpretability, achieving full coverage of high-level laboratory-based diagnosis. Our study provides a proof of concept for implementing a laboratory-based intelligent system to imitate and even augment the clinical diagnosis capabilities of human physicians. AI LEON can ultimately speed up the accurate diagnosis and timely referral of complex and refractory diseases, thereby facilitating earlier treatment and improving clinical outcomes. Therefore, our tool is expected to be applicable to clinical diagnosis and particularly useful in areas with limited medical resources.

Experimental Section
Study Approval: The study protocol was reviewed and approved by the Institutional Ethics Committee at Shanghai Changhai Hospital.
Study Design and Participants: An ambispective diagnostic study was conducted at Shanghai Changhai Hospital. The eligibility criteria for patients between Jan 1, 2010, and Oct 1, 2020, were as follows: 1) having diagnostic results coded with ICD version 9 or 10 and 2) having laboratory test data within 21 days prior to diagnosis. The exclusion criteria included 1) follow-up diagnoses beyond the initial diagnosis, to minimize treatment bias, and 2) diagnoses without correlation with laboratory medicine, because many diseases are determined by pathology or imaging (such as skin cancer and fractures). Eligible participants between Jan 1, 2010, and Dec 31, 2019, were included in the retrospective cohort, and those diagnosed between Jan 1, 2020, and Oct 1, 2020, were included in the prospective test cohort. The retrospective cohort was randomly divided into a training set and a validation set (8:2) to develop and validate the AI LEON system, respectively. Then, the system was prospectively assessed in the test cohort.
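The participant-level 8:2 split can be sketched as follows; the IDs are made up, and splitting by participant (rather than by record) keeps one patient's data from appearing in both sets, which is our assumption about the intent:

```python
# Sketch of the 8:2 random split of the retrospective cohort described
# above (illustrative; the study's actual pipeline is not published).

import random

def split_participants(ids, train_frac=0.8, seed=42):
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)            # randomize participant order
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, val = split_participants(range(1000))
print(len(train), len(val))          # 800 200
```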
Data Source: To improve clinical scientific research, Shanghai Changhai Hospital developed an integrated data acquisition system and built a large-scale data integration platform comprising a Clinical Data Repository (CDR) and a Research Data Repository (RDR), [26] which enforced data mapping to controlled vocabularies such as the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), ICD versions 9 and 10, Logical Observation Identifiers Names and Codes (LOINC), Current Procedural Terminology (CPT), and the Healthcare Common Procedure Coding System (HCPCS). We also developed a data extraction package specific to this data integration platform and extracted data comprising anonymized demographics, laboratory test results, diagnosis information, medical records, and pathology and radiology reports of eligible participants.
Data Processing: 1) Data Standardization: There were three data types in a test result: quantitative, qualitative, and character. We thus ensured that all data of different types were unified and standardized. 2) Data Normalization: Our normalization process included standardizing continuous test results and setting a finite boundary for results that might go to infinity, solving the problem of an infinite left or right boundary of the reference range. From the perspective of mathematical modeling, our normalization method adopted a nonequivalent system change. Moreover, the normalized result had a symmetrical structure that was convenient for modeling and analysis. 3) Time Window: As both the independent and dependent variables have time-series characteristics, it was necessary to combine the independent and dependent variables to form a sample. We chose a time window consistent with the inclusion criteria (laboratory data within 21 days prior to diagnosis). 4) Missing Value Padding: Missing values in the data corresponded to cases where the doctor did not recommend that the patient have these tests, which may be within the normal range according to the doctor's initial inquiry. Therefore, we filled in the missing values as "negative."

The Knowledge-Driven Models: 1) Establishment of Knowledge-Driven Models: Senior specialists manually annotated a total of 3156 diagnostic flowcharts based on 24 693 knowledge items. Then, the knowledge-driven models were established based on the knowledge graph analysis framework (https://www.ontotext.com/knowledgehub/fundamentals/how-to-building-knowledge-graphs-in-10-steps/). Our knowledge-driven models partitioned the knowledge corresponding to test results into two categories: defined knowledge and ambiguous knowledge. Defined knowledge shows very strong confidence in supporting a diagnosis based on a combination of certain test results. As an example of defined knowledge, a COVID-19 infection could be confirmed with very high confidence by its RNA testing report with at least two positive genes of ORF1ab, N, and E, according to the protocol of the COVID-19 test kit.
Ambiguous knowledge indicates the possibility that certain abnormal test results may point to several suspicious diseases. As an example of ambiguous knowledge, a high CA19-9 value may be evidence of pancreatic cancer, pancreatitis, ovarian cysts, etc.; however, further evidence is needed to finalize the diagnosis.
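Returning to the Data Processing steps above, the normalization and missing-value rules are described only qualitatively; below is one plausible formulation consistent with that description, with the exact formula being ours, not the paper's: in-range results map linearly to a symmetric interval, out-of-range results are squashed to a finite symmetric bound (handling infinite reference-range boundaries), and a missing result is padded as "negative":

```python
# A hypothetical normalization consistent with the textual description
# (symmetric output, finite bounds for values that may go to infinity,
# missing results padded as "negative", encoded here as 0).

import math

def normalize(value, ref_low, ref_high):
    if value is None:                 # test not ordered -> pad as negative
        return 0.0
    mid = (ref_low + ref_high) / 2.0
    half = (ref_high - ref_low) / 2.0
    z = (value - mid) / half          # maps reference range onto [-1, 1]
    if -1.0 <= z <= 1.0:
        return z
    # outside the range: squash the excess so output stays within [-2, 2]
    sign = 1.0 if z > 0 else -1.0
    return sign * (1.0 + math.tanh(abs(z) - 1.0))

print(normalize(7.0, 4.0, 10.0))      # midpoint of the reference range
print(normalize(None, 4.0, 10.0))     # missing -> padded
```

The symmetric, bounded output is the property the paper emphasizes as "convenient for modeling and analysis."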
2) Model Expression Paradigm: The data included test results (S 1 ), diagnostic results (S 2 ), and basic information results (S 3 ). Test results included test specimen, name of the test item, result, unit, reference range, normalized result, test time. Diagnosis results included diagnosis name, diagnosis ICD10 code, and diagnosis time. Basic information result included gender, birth year, and race.
The model inputs were obtained from the above raw data, which included test parameters (P 1 ), diagnostic parameters (P 2 ), and basic information parameters (P 3 ). Test parameters included test specimen and name of the test item; diagnostic parameters only contained the name of the diagnosis; and basic information parameters included gender, birth date, and race.
In the model, S represented the patient's actual test results, passed as the arguments to a function; P represented the parameters needed by the model, analogous to a function's formal parameters. For example, a patient's CA19-9 lab result was taken from S1; in a knowledge model, an elevated CA19-9 raises an alert for pancreatic cancer, so "elevated CA19-9" was a parameter in P1 needed by the model. The overall knowledge inference process continuously compared the actual arguments in S1 with the morphological parameters in P1 to find the conditions that satisfied the knowledge model and complete the inference.
The model took an input and gave an output showing the confidence level of a diagnosis. High confidence indicates a gold-standard result. Medium confidence indicates that the knowledge inference results can be displayed directly to the physician. Low confidence means that the result requires a secondary judgment by the data-driven models.
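The argument-to-parameter matching described above can be sketched as a small rule engine. The rule schema, rule contents, and function names here are illustrative simplifications (e.g., the COVID-19 rule is reduced to an all-of match on two genes rather than the "at least two of three" condition in the text).

```python
# Hypothetical knowledge items: each pairs a parameter pattern (P1) with a
# diagnosis and a confidence level. Not the authors' actual schema.
RULES = [
    {"diagnosis": "COVID-19", "confidence": "high",
     "requires": {"ORF1ab": "positive", "N": "positive"}},   # defined knowledge
    {"diagnosis": "pancreatic cancer (suspected)", "confidence": "low",
     "requires": {"CA19-9": "elevated"}},                    # ambiguous knowledge
]

def infer(s1):
    """Compare actual arguments (s1: test name -> qualitative state)
    against each rule's parameters and return supported diagnoses."""
    matched = []
    for rule in RULES:
        if all(s1.get(test) == state for test, state in rule["requires"].items()):
            matched.append((rule["diagnosis"], rule["confidence"]))
    return matched
```

High- and medium-confidence matches would be shown directly, while low-confidence matches would be handed to the data-driven models for a secondary judgment.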
3) The Use of the Model: The model was applied as follows.
Step 2: Among all knowledge models, we found the set of knowledge models that met the requirements, recorded as M1; otherwise, go to Step 3.
Step 3: M1 included three types of models with confidence level = high, medium, and low, noted as M11, M12, and M13, respectively.
The models in the knowledge model collections M11, M12, and M13 were operated separately, and the inference results of same-level models were merged. The merged outputs were recorded as List<output1>, List<output2>, and List<output3>.
Step 4: We combined the results in List<output3> into a set L. The results of List<output1> and List<output2> were merged with (List<s1>, List<s2>, s3) and renoted as (List<ŝ1>, List<ŝ2>, ŝ3). Then, a corresponding parameter list (List<p̂1>, List<p̂2>) was generated.
Step 5: The tuple (List<ŝ1>, List<ŝ2>, ŝ3) was the final model inference result; List<ŝ2> was the diagnosis result with high- and medium-inference levels, while the diagnosis information contained in the set L was the diagnosis result with a low-inference basis. The model inference process used the original knowledge recorded in M1 as the inference basis.
A graphical representation of data flows within the model is shown in Figure 7.
The Data-Driven Models: 1) Establishing the Extreme Gradient Boosting (XGBoost) Model: For each disease, we used the classical XGBoost method to construct a binary classification model. As an ensemble method using boosted decision trees, XGBoost is known for both speed and performance. The objective function at step t was given as

ℒ^(t) = Σ_{i=1}^{n} l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t)

where the first term sums the loss l between the actual value y_i and its predicted value ŷ_i^(t−1) + f_t(x_i), and the second term Ω(f_t) is a regularization penalty. At step t, the model aims to find the best function f_t to add to the current model. As with other boosting methods, even though the contribution of each feature to the final prediction cannot be obtained directly, an importance score can be obtained for each feature. This importance score provided interpretability to the automated diagnosis, assisting the doctors in making decisions.
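The step-t objective above can be evaluated numerically. The sketch below assumes a binary logistic loss (consistent with per-disease binary classification) and XGBoost's usual penalty Ω(f) = γT + (λ/2)Σ_j w_j², where T is the number of leaves and w_j the leaf weights; these choices are illustrative rather than confirmed by the paper.

```python
import math

def logloss(y, margin):
    """Binary logistic loss of label y against a raw margin (pre-sigmoid)."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def objective(y, base_margin, f_t, gamma=1.0, lam=1.0, leaf_weights=()):
    """L^(t) = sum_i l(y_i, yhat_i^(t-1) + f_t(x_i)) + Omega(f_t).

    y: labels; base_margin: predictions yhat^(t-1) so far;
    f_t: the new tree's contribution per sample;
    leaf_weights: the new tree's leaf values (for the penalty term).
    """
    loss = sum(logloss(yi, mi + fi) for yi, mi, fi in zip(y, base_margin, f_t))
    omega = gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)
    return loss + omega
```

Each boosting step searches for the f_t that lowers this quantity; the penalty term discourages complex trees with large leaf weights.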
2) Establishing the Genetic Algorithm (GA) Model: We used a GA to choose a near-optimal combination of thresholds across all diseases that maximized the mAP metric. A GA is a search heuristic inspired by Darwin's theory of natural evolution, in which the genes of the fittest individuals are inherited by the offspring while less competitive genes are discarded. In our problem, the genes were the threshold values of the units, and the fitness function was the mAP, which measures the goodness of a ranking algorithm. It should be noted that diagnosis is more a ranking process than a classifying one: compared with multilabel classification, ranking takes into account both classification accuracy and diagnosis relevancy. An expert doctor usually writes his or her diagnoses in decreasing order of relevance and severity, and we used the mAP metric to ensure that the most relevant and severe diagnostic results were listed atop.
The genes of our algorithm were the threshold values that determined the binary diagnostic classes from the outputs of the ML algorithms; a chromosome was simply the collection of all threshold values for all outputs. We chose an initial population size of 100 and ran a total of 10 iterations. For each pair of parents to be mated, a crossover point was chosen randomly from within the genes, and a new offspring was created by exchanging the genes of the parents on either side of that point. The new offspring was then added to the population, and the individuals with the lowest fitness values were removed.
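The crossover-and-selection loop above can be sketched compactly. This is a minimal illustration of single-point crossover with lowest-fitness removal, not the authors' implementation; in the paper the fitness function would be the mAP of the ranking produced by a candidate threshold chromosome.

```python
import random

def evolve(population, fitness, generations=10, seed=0):
    """Minimal GA over threshold chromosomes.

    population: list of chromosomes (lists of per-disease thresholds);
    fitness: chromosome -> score to maximize.
    Each generation mates one random parent pair via single-point
    crossover and removes the least-fit individual.
    """
    rng = random.Random(seed)
    pop = list(population)
    for _ in range(generations):
        p1, p2 = rng.sample(pop, 2)           # pick two parents
        cut = rng.randrange(1, len(p1))       # random crossover point
        child = p1[:cut] + p2[cut:]           # exchange genes at the cut
        pop.append(child)                     # add the new offspring
        pop.remove(min(pop, key=fitness))     # discard the least fit
    return max(pop, key=fitness)
```

Because only the least-fit individual is removed each generation, the best chromosome found so far always survives to the end.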
3) Modeling with MiXG: Our data-driven model is a ranking system that mimics human doctors. Therefore, the proposed MiXG algorithm first calculates the probabilities of different diseases from the XGBoost classifiers and then ranks them according to the results from the GA models. Here, the standard evaluation metrics for classification, such as the out-of-sample misclassification error, the area under the receiver operating characteristic curve (ROC-AUC), and the sensitivity and specificity, were not appropriate evaluation metrics. Instead, our system used the mAP function as the evaluation metric for the ranking process. The mAP was formally written as

mAP = (1/N) Σ_{q=1}^{N} AP_q

where AP is the average precision, defined as

AP = ∫_0^1 p(r) dr

in which p(r) is the precision p as a function of recall r.
Recall is defined as the percentage of positive diagnoses that were retrieved; mathematically,

Recall = TP / (TP + FN)

where TP is the total number of true positives and FN is the number of false negatives; their sum constitutes all positive diagnoses in the dataset. Recall increased as we went down the ranking list. The precision did not necessarily decrease but depended on the ranking algorithm's accuracy; thus, p(r) might not be monotonically decreasing and might show a zigzag pattern. Therefore, the integral was in practice replaced with a finite sum of interpolated terms p*(r), where

p*(r) = max_{r′ ≥ r} p(r′)

which replaces the precision at a recall value r with the maximum precision to the right of that recall value. Thus, the average precision was calculated as

AP = Σ_k p*(r(k)) Δr(k)

where Δr(k) is the change in recall from rank k−1 to k. This is a general interpolated average precision for a list of precision and recall values from a given ranked list.
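The interpolated AP and mAP just defined can be computed directly from a ranked diagnosis list. The sketch below follows the definitions in the text; function names and the input format (ranked list plus a set of true diagnoses) are illustrative.

```python
def average_precision(ranked, relevant):
    """Interpolated AP: sum over recall increments of p*(r) * dr, where
    p*(r) is the maximum precision at any recall >= r (this flattens
    the zigzag of the raw precision-recall curve)."""
    recalls, precisions = [], []
    hits = 0
    for k, diag in enumerate(ranked, 1):
        if diag in relevant:
            hits += 1
        recalls.append(hits / len(relevant))   # recall after rank k
        precisions.append(hits / k)            # precision after rank k
    ap, prev_r = 0.0, 0.0
    for i, r in enumerate(recalls):
        if r > prev_r:                         # recall increased at this rank
            p_star = max(precisions[i:])       # interpolated precision p*(r)
            ap += p_star * (r - prev_r)        # p*(r(k)) * delta r(k)
            prev_r = r
    return ap

def mean_average_precision(cases):
    """cases: list of (ranked, relevant) pairs, one per patient."""
    return sum(average_precision(r, s) for r, s in cases) / len(cases)
```

A perfect ranking (all true diagnoses listed first) yields AP = 1, and pushing a true diagnosis further down the list lowers AP, which is why the metric rewards listing the most relevant diagnoses atop.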
Integration of Knowledge-Driven and Data-Driven Models: We combined the knowledge-driven and data-driven models to build AI LEON, a test-result diagnosis engine that provided evidence-based medical support and drew on past experience learnt by the data-driven models. The basic working mechanism of AI LEON was that it applied the knowledge-driven model and obtained results at a high, medium, or low confidence level from the learnt knowledge. This knowledge was important both for data filtration, improving machine understandability, and for diagnosis, improving the interpretability of the results. At the same time, AI LEON initiated a data-driven model using the MiXG algorithm, obtained probability scores for each disease, and ranked the diagnoses from the highest to the lowest probability to form an ordered list of diagnostic risks.
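The overall routing logic can be sketched as follows, with the knowledge model consulted first and low-confidence cases falling through to the data-driven ranker. The function names and interfaces are placeholders, not the authors' API.

```python
def diagnose(sample, knowledge_infer, mixg_rank):
    """Combined AI LEON-style pipeline (illustrative).

    knowledge_infer: sample -> [(diagnosis, confidence), ...]
    mixg_rank: sample -> [(diagnosis, probability), ...] sorted descending,
               standing in for the XGBoost + GA-threshold ranking.
    """
    results = knowledge_infer(sample)
    # Rule-backed results at high or medium confidence are returned directly,
    # keeping the output interpretable.
    confident = [(d, c) for d, c in results if c in ("high", "medium")]
    if confident:
        return confident
    # Otherwise fall back to the data-driven ranked list of diagnostic risks.
    return mixg_rank(sample)
```

This mirrors the mechanism described above: knowledge supplies interpretable, rule-backed answers where it can, and the MiXG ranking supplies an ordered risk list where it cannot.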
Competition between AI LEON and GPs in Reading Lab Reports: We also compared the diagnostic performance of AI LEON with that of 15 junior GPs in reading lab reports of 100 cases from the test set. GPs are health service providers with a high degree of comprehensiveness who perform general practice covering a variety of medical problems in patients of all ages. The fifteen junior GPs were randomly divided into three groups. 100 laboratory test reports (lab reports) were randomly selected from the test set and then assigned to each group of GPs. All the GPs read the lab reports and recorded the list of diagnoses. The reading consisted of two parts, a blank-filling study and a true-false one. In the blank-filling study, readers were asked to write down the list of diagnoses. In the true-false study, readers were asked to determine whether the machine-recommended diagnoses were correct. Laboratory test reviews conducted by the physicians as well as by AI LEON were all done without clinical information. Then, ten specialists were invited to provide diagnostic results for the 100 cases as the gold standards for the human-machine tests. The accuracy of each diagnosis was recorded according to the gold standards. mAP was utilized to evaluate the accuracy of multiple diagnoses, and the recall value was used to assess the rate of missed diagnoses.
Data Availability: The clinical data used for the training, validation, and test sets were collected from the data integration platform of Shanghai Changhai Hospital with strict access control in a nonidentifiable format. The use of the data was locally licensed and ethically approved. It is not publicly available, and there are restrictions on its use. However, all researchers can contact the corresponding author and request access to the relevant data.
Code Availability: We used several open-source R libraries (e.g., xgboost) to conduct our experiments. The analysis was performed with code written in Python 3.5. Source code for the data-driven MiXG model is available under an open-source license at https://github.com/hitales-tech/Mixg. We described the details of the experiments in the Experimental Section to allow for independent replication. Further inquiries regarding the specific nature of the models can be made by relevant parties to the corresponding author.

Figure 7. Graphical representation of data flows within the model. S1, S2, and S3 represent the patient's actual test results, passed as the arguments to a function; P1, P2, and P3 represent the parameters needed by the model. Patients' data, after normalization, first enter the knowledge-driven model, and results with high and medium confidence levels are returned as the output. For those whose confidence level is low, the data enter the data-driven model, and a probability for each potential disease is given. The probabilities are then rectified with the thresholds learnt by the GA to ensure a reliable ranking, and the rectified probabilities are returned as the output.