A brief introduction to medical databases and data mining technology in the big data era

Abstract Data mining technology can discover potentially valuable knowledge from large amounts of data. It is mainly divided into data preparation, data mining, and the expression and analysis of results. It is a mature information processing technology that builds on database technology. Database technology is a software science that researches, manages, and applies databases. The data in a database are processed and analyzed by studying the basic theory and implementation methods of the structure, storage, design, management, and application of databases. We introduce several databases and data mining techniques to help a wide range of clinical researchers better understand and apply database technology.

modes of medical information systems. 7 Therefore, how best to develop and utilize medical big data has become a focus of attention, and promoting the research and application of medical big data has become a key factor in modern medical research.
Big data is an abstract concept. It usually refers to data sets that are difficult to handle with existing database management tools and that are both massive and complex. Big data are frequently characterized by the five Vs: volume, velocity, variety, value, and veracity. [8][9][10] Volume refers to the huge scale of data: with the massive generation and collection of data, data sets have grown beyond traditional storage and analysis techniques. Velocity refers to big data's timeliness: data collection and analysis must be carried out quickly and on time. Variety refers to the wide range of data types, including semistructured and unstructured data, such as audio, video, web pages, and text, as well as traditional structured data. Value reflects the fact that value density is low although commercial value is high. Veracity emphasizes that meaningful data must be true and accurate. The key question when using big data is how to find value in large, rapidly generated, and diverse data sets. 11,12 The computational analysis of integrated databases has become a basic method of medicine and molecular biology. 13 Medical data are characterized by disease diversity, heterogeneity of treatments and outcomes, and complexity in collecting, processing, and interpreting data. 14 With the development of medical informatization, a large amount of digital data has been produced in the course of medical services, health care, and health management, forming medical big data. 15 Medical big data come from a variety of sources, such as administrative claims records, clinical registries, electronic health records, biometric data, patient-reported data, and more. 16,17 Big data applications and data collection bring many kinds of value to healthcare systems.
For example, people with diabetes use mobile devices to communicate with each other and to share or search for information, thus forming a large big data network. 18 The US Department of Health and Human Services has issued a policy to increase the transparency of the US healthcare system, which constitutes big data sharing of information about many patients, physicians, and medical services. 19 Faced with a huge amount of electronic data of different types, new requirements for related electronic products have been put forward to adapt to complex and competitive big data. 20,21 From massive electronic medical record data, researchers found a new use for an existing drug: metformin, long used to treat diabetes, may also be effective in cancer treatment. 22 Medical big data have several unique characteristics that differ from big data in other disciplines: medical big data are often difficult to obtain 10 ; are usually collected according to protocols and are relatively structured 23 ; professional knowledge may play a dominant role when analyzing data and interpreting results 24 ; and they involve time-dependent mixing. 3 Medical data are large in scale, extremely fast to update, polymorphic, incomplete, and time sensitive. 25 The construction of a big data platform will facilitate remote consultation and ease of operation.

MEDICAL PUBLIC DATABASE OVERVIEW
Today's society produces massive amounts of data all the time.
Database technology is a software science that researches, manages, and applies databases. The data in the database are processed and analyzed by studying the basic theory and implementation methods of the structure, storage, design, management and application of the database. The main medical public databases are described in Table 1.

Surveillance, epidemiology, and end results (SEER)
To reduce the cancer burden of the population, the National Cancer Institute established the Surveillance, Epidemiology, and End Results (SEER) program, which collects cancer incidence and survival data from population-based cancer registries in the United States.

Medical information mart for intensive care (MIMIC)
Critical care medicine is a discipline that studies the characteristics and regularities of any injury or disease that drives the body toward death, and that treats severe illness. The focus of this discipline is the monitoring of critically ill patients and the implementation of organ support for dysfunctional or failing organs, so that patients can gain time for the underlying cause to be removed while oxygen delivery is ensured and organ function is maintained.
As we all know, the intensive care unit (ICU) occupies a particularly important position in the hospital and undertakes the treatment of patients with serious diseases. 37

China health and nutrition survey (CHNS)
The CHNS was conducted in 1989, 1991, 1993, 1997, 2000, 2004, 2006, 2009, 2011, and 2015. The CHNS website updated the dataset content on 12 June 2018; the updated dataset covers longitudinally integrated data from the 10 survey waves from 1989 to 2015. 45 The China Health and Nutrition Survey (CHNS) has shown a shift in nutrients, food items, and dietary patterns, and this dietary shift is associated with education, income, urbanicity, and the macro food environment and policy. [46][47][48][49] The survey used multistage stratified cluster random sampling to collect data from 15 provinces, autonomous regions, and municipalities in China's eastern, central, and western regions. 50

Health and retirement study (HRS)
As an important measure of the level of international economic and social development, population ageing not only means an increase in the number of elderly people but also poses severe challenges to the economy and society. 54 This has become a major social problem that cannot be ignored. There are many types of research on the health of an ageing population; data types are constantly being enriched, and data reserves are growing rapidly, making it difficult to conduct effective and comprehensive statistical analysis through traditional data collection methods. The HRS database has a large sample size and high quality, but it is complex. To make the data easier to study, HRS data are classified into public data and sensitive/restricted data. Anyone can create an account on the HRS data download site to obtain public data, while restricted data and sensitive health data require a separate application. The HRS database can be accessed in seven areas: biennial data products, longitudinal data, off-year studies, sensitive health data (requiring additional registration), researcher contributions, RAND contributed data, and cognitive economics projects.
Each subdataset file can be read with three different statistical packages: SAS, SPSS, or Stata.
The HRS database is a database of resources related to ageing in the United States, covering changes in health and the economic environment. Most of the public data in this database are freely available after user registration. Its multidisciplinary data focus on surveys of income and wealth; health, cognition, and use of health services; work and retirement; and family connections. Since 2006, data collection has expanded to include biomarkers and genetics as well as greater depth in psychology and social context. This mixed economic, health, and psychological information database provides unprecedented potential for researchers' work. 59,60 The HRS database can help researchers in all disciplines obtain data more conveniently, efficiently, and clearly, improving work efficiency.

Dryad
With the advent of the era of big data, data reusability and data sharing have become increasingly important. Compared with other public database platforms, the Dryad database is more efficient in data sharing because it works with many mainstream journals. By assigning DOIs to metadata, data can be cited, increasing the utilization rate of scientific data while enhancing the academic reputation of researchers and publishers. Dryad has detailed management policies for data maintenance and disaster recovery, so data can be stored for a long time. The "zero-threshold" use of data and a friendly interface also make the Dryad database increasingly popular among researchers.

UK biobank
UK Biobank started a new medical imaging data collection program in 2014, using magnetic resonance imaging (MRI) and X-ray technology to scan the brain, heart, and bones of more than 100 000 volunteers. 67 The imaging analysis is performed to establish a database of scanned images of internal organs. This will also be the most significant health imaging study in the world to date. These vast amounts of data will help researchers analyze population differences and their causes in diseases such as cancer, heart disease, diabetes, arthritis, and Alzheimer's disease. The disadvantage of UK Biobank is that the sample provider must fill out a detailed questionnaire, including name, gender, NHS number, disease information, etc., so privacy leaks are an inevitable risk. 69 At the same time, the registration and application process is complicated and cumbersome, and the waiting period is long, which may be difficult for first-time applicants.
We believe that UK Biobank will provide more comprehensive research data and biological sample coverage in the future, providing global researchers with more efficient and convenient resource registration, application, and use services, as well as stronger information security.

Biologic specimen and data repositories information coordinating center (BioLINCC)
BioLINCC was established in 2008 by the National Heart, Lung, and Blood Institute (NHLBI). The Institute provides global leadership in the prevention and treatment of heart, lung, and blood diseases and supports basic, translational, and clinical research in these areas. For applicants who want to apply for multiple research resources, the application process is complicated; when searching for biological samples, BioLINCC requires the name of the biological sample and the research purpose, so the search method is not efficient for researchers who have not yet identified a specific sample. In the future, BioLINCC will also expand the field of data sharing, provide a more convenient resource application process, collect and maintain data and specimens in a high-efficiency, low-cost way, and maximize the utilization of existing resources.

Gene expression profiling interactive analysis (GEPIA)
The use of big data analysis has facilitated the development of cancer genomics research. In essence, cancer is a genetic disease caused by differential gene expression. GEPIA is a public database developed by Chinese researchers. Using the GEPIA database, laboratory biologists can easily explore the TCGA and GTEx data sets, find answers to questions, and test their hypotheses. With differential analysis and expression profiling, users can easily discover genes that are differentially expressed. With the application of genetic testing, the model of tumor prognosis assessment and treatment selection based on immunohistochemistry has gradually changed, and more accurate classification of tumors has greater guiding significance for prognosis evaluation and treatment.

The cancer genome atlas (TCGA)
For a long time, tumor prevention, early screening, individualized treatment, and prognosis evaluation have been key issues to which the medical community is committed. Meanwhile, the burden of cancer is increasing substantially, with over 20 million new cancer cases projected for 2025, compared with an estimated 14.1 million new cases in 2012. 74 Studies have found that genetic variation is an important microscopic molecular cause of all tumor cells. Therefore, more and more oncology researchers have begun to conduct related research from the perspective of molecular genetics. By measuring the biological identity of specific gene expression, it is possible to predict tumor growth, spread, and patient survival, and to develop a targeted diagnosis and treatment plan based on gene expression. 75

Therapeutically applicable research to generate effective treatments (TARGET)
In recent years, with the continuous development of medicine, the overall prognosis of childhood cancer has greatly improved. The TARGET database targets children's tumors; although it contains fewer types of diseases, it is more targeted. To a certain extent, the database can help researchers conduct more in-depth disease research and lead to more precise treatment options.

eICU collaborative research database (eICU-CRD)
Critical care medicine is an inevitable trend and a prominent symbol of the development and progress of modern medicine, an era product of the development of medical science to a fairly high level. There are many difficult problems in critical care medicine, including the application and management of noninvasive ventilation, the rational use of antibiotics, the implementation of nutritional assessment and nutritional support, the indications for analgesics and sedatives, and the scope of application of ICU risk assessment models. 83 Philips Healthcare is a leading provider of ICU equipment and services, offering a tele-ICU service called the eICU program. After the eICU program is implemented, a large amount of data is collected and streamed for real-time monitoring by the remote ICU team. These data were archived by Philips and converted into a research database by the eICU Institute. 84 To obtain access to the eICU Collaborative Research Database, you must first apply for registration. 86 The agreement stipulates that applicants must not share data with others, must not attempt to reidentify any patient or institution, and must abide by the principles of collaborative research. 87 There is a repository on GitHub that stores eICU Collaborative Research Database code, and the code for generating tables and descriptive statistics is available online (https://github.com/mit-lcp/eicu-code).
With the advent of health information networks, humans need to store, share, and analyze ever-growing volumes of data.

Gene expression omnibus (GEO)
The GEO database is an international public repository of functional genomics data created by NCBI. It has powerful inclusion and storage capabilities that allow users and researchers to submit, save, and retrieve many different types of data. GEO provides a simple submission process and format, and its data rely on submissions from researchers. GEO data submission follows the MIAME principles. Although the GBD database supports querying and downloading data, including many search parameters can cause problems: a query sometimes produces files that omit certain results specified in the query, such as specific age groups or years, and querying all locations together with many or all causes, age groups, or years can return incomplete data.
The GBD results tool is not available for Internet Explorer 10 and earlier.

CLINICAL DATA MINING METHODS
With the advent of the information age, data mining is increasingly being applied in clinical work. Predictive patterns are summarized from current data and include classification and regression.

Association analysis
Association analysis, also known as association rule mining, is the search for relationships hidden in large data sets. It usually proceeds in two steps: first, all high-frequency itemsets are found; second, for each high-frequency itemset obtained in the first step, if a rule satisfies the minimum confidence, the rule is an association rule. Machine learning methods for association analysis include the Apriori algorithm, the FP tree frequency set algorithm, and the lift measure.

Apriori algorithm
The Apriori algorithm is based on the a priori principle, which reflects the relationship between subsets and supersets: all nonempty subsets of a frequent itemset must be frequent, and all supersets of an infrequent itemset must be infrequent. If itemset I does not satisfy the minimum support threshold, then I is not frequent. A pattern refers to the unordered combination of items that appear together in a record, such as a shopping record. Some patterns have low frequency and some have high frequency; higher-frequency patterns are generally considered more instructive, and such high-frequency patterns are called "frequent patterns." The a priori property is therefore used mainly to prune candidates when searching for frequent itemsets, so the Apriori algorithm avoids blind search and improves the efficiency of the frequent itemset search.
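The join-and-prune search described above can be sketched in Python. This is an illustrative toy implementation, not code from any cited system; `transactions` is assumed to be a list of sets of items, and `min_support` is a fraction of all transactions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Find all frequent itemsets and their supports.

    Candidate (k+1)-itemsets are built only from frequent k-itemsets, and a
    candidate is discarded if any of its k-subsets is infrequent -- the
    a priori (anti-monotonicity) property."""
    n = len(transactions)
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    # Frequent 1-itemsets.
    current = {s for s in items
               if sum(s <= t for t in transactions) / n >= min_support}
    while current:
        for s in current:
            frequent[s] = sum(s <= t for t in transactions) / n
        size = len(next(iter(current))) + 1
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, size - 1))}
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support}
    return frequent
```

Because every superset of an infrequent itemset is pruned before its support is ever counted, the algorithm avoids scanning the data for hopeless candidates.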

FP tree frequency set algorithm
The FP tree is constructed by reading in transactions one by one and mapping each transaction to a path in the FP tree. Since different transactions may share several identical items, their paths may partially overlap. The more the paths overlap, the better the compression achieved by the FP tree structure; if the FP tree is small enough to be stored in memory, the frequent itemsets can be extracted directly from the in-memory structure without repeatedly scanning the data on disk. The main idea of the FP tree frequency set algorithm is to compress the frequency sets in the database into a frequent pattern tree after the first scan, while retaining the association information, and then to mine each conditional pattern base separately.
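A minimal sketch of the construction step described above (the mining of conditional pattern bases is omitted). The two-pass structure and the frequency-ordered insertion follow the standard FP-growth conventions; all names here are illustrative.

```python
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support_count):
    # First pass: count item frequencies and keep only frequent items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {i for i, c in counts.items() if c >= min_support_count}
    root = FPNode(None, None)
    # Second pass: insert each transaction as a path, items ordered by
    # descending frequency, so transactions sharing a prefix share a path.
    for t in transactions:
        path = sorted((i for i in t if i in frequent),
                      key=lambda i: (-counts[i], i))
        node = root
        for item in path:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root
```

Ordering items by frequency before insertion maximizes path overlap, which is exactly the compression effect the text describes.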

Lift
Whether the Apriori algorithm or the FP tree frequency set algorithm is used, in some cases the rules generated may still be useless even when both support and confidence are relatively high. Lift provides an additional indicator of rule quality: it measures how much the occurrence of the antecedent raises the probability of the consequent, compared with the consequent occurring at its baseline rate.
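Support, confidence, and lift for a single rule can be computed directly. A toy helper (illustrative; transactions are assumed to be sets, and the rule's antecedent and consequent are itemsets):

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent.

    Lift > 1 means the antecedent raises the probability of the consequent;
    lift < 1 means it lowers it, so the rule is misleading even if its
    support and confidence look high."""
    n = len(transactions)
    p_a = sum(antecedent <= t for t in transactions) / n
    p_c = sum(consequent <= t for t in transactions) / n
    support = sum((antecedent | consequent) <= t for t in transactions) / n
    confidence = support / p_a
    lift = confidence / p_c
    return support, confidence, lift
```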

Cluster analysis
A classification algorithm must know the information of each category in advance, and all the data to be classified must have corresponding categories. When these conditions are not met, we need to try cluster analysis. Cluster analysis studies how to group similar things into the same category. Clustering divides objects into groups or subsets so that member objects in the same subset have similar properties.
There are several clustering methods: partition-based algorithm, hierarchical clustering algorithm, density-based algorithm, and grid-based algorithm.

Partition-based algorithm
The K-means method is the most commonly used and most basic clustering algorithm in cluster analysis. It is a prototype-based, partitioned distance technique. Given the parameter K, the N objects are roughly divided into K classes, and the unreasonable assignments are then revised according to an optimality principle.
The advantages of the K-means algorithm are that it is simple, fast, easy to understand, and has low time complexity. However, K-means handles high-dimensional data poorly and cannot recognize nonspherical clusters.

Density-based algorithm
To find clusters of arbitrary shape, a cluster can be regarded as a dense region separated by sparse regions in the data space; this is the core idea of density-based algorithms.
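A compact illustration of this idea is a DBSCAN-style procedure (a toy sketch, not a production implementation; the parameter names `eps` and `min_pts` follow common convention). Points in dense regions get a cluster id, and points in sparse regions are labelled noise (-1):

```python
def dbscan(points, eps, min_pts):
    """Minimal density-based clustering: a point with at least min_pts
    neighbours within eps is a core point; clusters grow by expanding from
    core points, so dense regions separated by sparse ones form
    arbitrarily shaped clusters."""
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)   # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1          # sparse: noise (may later become border)
            continue
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:     # noise reachable from a core point
                labels[j] = cluster  # becomes a border point, not expanded
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbours(j)
            if len(jn) >= min_pts:  # only core points expand the cluster
                queue.extend(jn)
        cluster += 1
    return labels
```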

Grid-based algorithm
Partitioning and hierarchical clustering methods cannot effectively find clusters of nonconvex shape. Density-based algorithms can find arbitrarily shaped clusters, but they generally have high time complexity. From 1996 to 2000, data mining researchers therefore proposed a large number of grid-based clustering algorithms.
The grid method can effectively reduce the computational complexity of the algorithm, although it is sensitive to the density parameters. The grid-based clustering method uses a multiresolution grid data structure. The advantage of this method is that processing is extremely fast and depends only on the number of cells in each dimension of the quantized space.
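As a toy illustration of the grid idea (assuming 2-D points; all names are illustrative, and real grid algorithms such as STING are more elaborate), one can quantize points into cells, keep cells above a density threshold, and merge adjacent dense cells into clusters:

```python
from collections import defaultdict, deque

def grid_cluster(points, cell_size, density_threshold):
    """Toy grid-based clustering for 2-D points: quantise points into cells,
    keep cells whose point count meets the density threshold, then join
    neighbouring dense cells into clusters. Cost depends on the number of
    occupied cells, not directly on the number of points."""
    cells = defaultdict(list)
    for p in points:
        cells[tuple(int(c // cell_size) for c in p)].append(p)
    dense = {c for c, pts in cells.items() if len(pts) >= density_threshold}
    clusters, seen = [], set()
    for cell in dense:
        if cell in seen:
            continue
        group, queue = [], deque([cell])
        seen.add(cell)
        while queue:                      # flood-fill over adjacent dense cells
            c = queue.popleft()
            group.extend(cells[c])
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (c[0] + dx, c[1] + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        clusters.append(group)
    return clusters
```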

Regression analysis
Traditional regression is a statistical analysis method that uses ordinary linear regression to determine the quantitative relationship between two or more variables, and it is widely used. Its expression is y = β0 + β1x1 + … + βpxp + ε. Machine learning methods for regression models include decision trees, adaptive boosting, bagging, random forests, support vector machines, nearest neighbor algorithms, and artificial neural networks.
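For the simplest case, one predictor, the least-squares coefficients of y = a + bx have a closed form. A pure-Python sketch (illustrative only):

```python
def simple_ols(xs, ys):
    """Ordinary least squares for y = a + b*x: the closed-form slope and
    intercept that minimise the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx   # the fitted line passes through the mean point
    return a, b
```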

Classification analysis
Classification is a supervised learning process. The goal is to "tag" the data to extract valuable information; the more accurate the categories, the more valuable the results. Classical statistical methods are often used; however, there are limitations: the information of each category must be known in advance, and all the data to be classified must have corresponding categories. When the dependent variable is categorical and the independent variables include multiple categorical variables, or a categorical variable has many levels, classical statistics are not applicable; machine learning methods are more practical for processing such complex data and achieve better accuracy.
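As one example of the machine learning classifiers mentioned in the regression subsection, a toy k-nearest-neighbour classifier fits in a few lines (illustrative only; `train` is assumed to be a list of `(point, label)` pairs):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """k-nearest-neighbour classification: label a query point by majority
    vote among the k closest labelled training points (squared Euclidean
    distance)."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

This is supervised in exactly the sense the text describes: every training point must already carry a category label.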

PROSPECTS AND CHALLENGES OF MEDICAL DATA MINING
The use of new cutting-edge disciplines to generate and analyze big data is a trend in the evolution from traditional medicine to precision medicine. The development of big data will help the global application of precision medicine and the emergence of new health management models. 27 The potential of big data is still to be discovered. Although it is not easy to generate new findings and conclusions from massive amounts of data, as long as productive investments are made in the right systems and key breakthroughs in technology and workforce are achieved, the convenience and change that big data analysis, visualization, and artificial intelligence will bring to medical care and life are worth looking forward to. However, medical big data mining still faces enormous challenges, mainly the following: medical knowledge concepts are complex, and key technologies for medical knowledge reasoning have not yet broken through; medical information sources are wide, data modalities are diverse, dimensionality is high, types are unbalanced, and structure is complicated; hospital electronic medical record systems are poor in openness and scalability; and out-of-hospital processes are poorly regulated.

CONCLUSIONS
This article briefly introduces the databases and data mining methods commonly used in the era of big data. With the advent of the information age, data mining is increasingly being used in clinical practice.
With information technology, medical records and follow-up data can be stored and extracted more efficiently. At the same time, potential relationships or patterns can be sought in medical data to gain useful knowledge.