Data‐driven research on eczema: Systematic characterization of the field and recommendations for the future

Abstract
Background: The past decade has seen a substantial rise in the employment of modern data-driven methods to study atopic dermatitis (AD)/eczema. The objective of this study is to summarise the past and future of data-driven AD research, and identify areas in the field that would benefit from the application of these methods.
Methods: We retrieved the publications that applied multivariate statistics (MS), artificial intelligence (AI, including machine learning, ML), and Bayesian statistics (BS) to AD and eczema research from the SCOPUS database over the last 50 years. We conducted a bibliometric analysis to highlight the publication trends and conceptual knowledge structure of the field, and applied topic modelling to retrieve the key topics in the literature.
Results: Five key themes of data-driven research on AD and eczema were identified: (1) allergic co-morbidities, (2) image analysis and classification, (3) disaggregation, (4) quality of life and treatment response, and (5) risk factors and prevalence. ML&AI methods mapped to studies investigating quality of life, prevalence, risk factors, allergic co-morbidities and disaggregation of AD/eczema, but seldom in studies of therapies. MS was employed evenly between the topics, particularly in studies on risk factors and prevalence. BS was focused on three key topics: treatment, risk factors and allergy. The use of AD or eczema terms was not uniform, with studies applying ML&AI methods using the term eczema more often. Within MS, papers using cluster and factor analysis were often only identified with the term AD. In contrast, those using logistic regression and latent class/transition models were "eczema" papers.
Conclusions: Research areas that could benefit from the application of data-driven methods include the study of the pathogenesis of the condition and related risk factors, its disaggregation into validated subtypes, and personalised severity management and prognosis. We highlight BS as a new and promising approach in AD and eczema research.


| BACKGROUND
Atopic dermatitis (AD, also referred to as eczema or atopic eczema) is a common chronic inflammatory skin disease that affects approximately 20% of children and 10% of adults in high-income countries. 1 Recently, computational modelling 2 and data-driven analytical methods have emerged as powerful new approaches to AD research, especially to elucidate its complex pathophysiology, 3 patient-dependent response to treatment, 4,5 and endotypes or subtypes. 6-11 Big data have revolutionized the way we study disease. 12 The increased availability of large and diverse medical datasets has favoured the adoption of modern computational methods, which can integrate and interrogate large quantities of data and extract hidden patterns and associations. There are three primary analytic methodologies or disciplines for data-driven research: multivariate statistics (MS), Bayesian statistics (BS), and machine learning and other artificial intelligence methods (ML&AI). MS encompasses methods that analyse datasets with multiple independent and/or dependent variables. 13 Such multivariate structure is a key characteristic of biomedical datasets, which makes MS a popular and powerful methodology.
AI is a field concerned with building systems that can mimic human intelligence, and ML is a subfield of AI. Finally, BS allows us to combine prior knowledge and observed data, 14 in contrast to the frequentist approach, which bases its analysis only on the observed data, 12,15 and is a potentially promising approach to develop predictive models and utilize clinical data. Such data-driven approaches have been applied to identify biomarkers to diagnose disease and identify therapeutic targets. 12,16,17 Deep neural networks have been developed to aid in the detection and diagnosis of skin, 18 breast, 19 and prostate 20 cancer. In AD research, the Bayesian mechanistic model recently developed by Hurault et al. 21 can predict individual patients' next-day AD severity scores from their previous severity scores and treatments applied. These examples illustrate the benefit of employing a data-driven approach in medical fields with a growing quantity of data.
Within the AD community, data collection is increasing, providing a unique opportunity to leverage data-driven methods. 2 As we enter a period of further substantial growth in the employment of data-driven methods to study AD, we aimed to identify the areas in AD research where data-driven methods have been applied and their current state of development, and to highlight the knowledge gaps in the field that could benefit from the application of these methods. To address our aim, we conducted a bibliometric analysis highlighting the publication trends and conceptual knowledge structure of data-driven research on AD and eczema, and applied topic modelling to retrieve the key topics present within the literature. Bibliometrics uses statistical tools to study publication trends and patterns within an area of research, 22,23 and can be used to summarise a field of research in a systematic and reproducible manner. Probabilistic topic modelling explores the knowledge structure of a field by identifying the latent thematic structure of a corpus of documents. 24 A bibliometric analysis was previously conducted to understand the knowledge structure and theme trends of AD research, 25 but it considered only publications with the term AD from 2015 to 2019 and did not focus on data-driven research. Also of note, the continued absence of a consensus in nomenclature has resulted in the co-existence of two main terms for the skin condition, AD and eczema, which have been shown to be linked to different findings and biased to different disciplines. 26 Our study included both AD and eczema terms and retrieved all publications available up to March 2021 without a time constraint, to provide the full picture of the field. Additionally, we included topic modelling to provide a detailed view of the key research topics in the field and the methods employed.

| METHODS
This section summarises the analysis conducted in this paper; a detailed description is presented in Appendix S1.

| Literature search
We retrieved all publications up to March 17th, 2021, on atopic dermatitis (AD) and eczema that apply MS, ML&AI, and BS methodologies from the SCOPUS database. The keywords 'AD' and 'eczema' were used with each of MS, ML&AI, and BS methodologies.

| Bibliometric analysis
We performed a bibliometric analysis on the bibliographic information (including the authors, sources, countries, citations, and keywords) of the publications obtained from the literature search. Using the bibliometrix R package, 27 we obtained descriptive statistics on the collection of publications, including the most productive countries and the general publication trends. We also performed co-word analysis to produce keyword co-occurrence networks and thematic maps.
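The co-word analysis described above rests on counting how often pairs of keywords appear in the same publication. The following is a minimal Python sketch of that counting step only. It is not the bibliometrix implementation, and the function name `keyword_cooccurrence` and the three example keyword lists are hypothetical illustrations, not data from this study.

```python
from collections import Counter
from itertools import combinations


def keyword_cooccurrence(keyword_lists):
    """Count how often each unordered keyword pair shares a publication."""
    pair_counts = Counter()
    for keywords in keyword_lists:
        # Normalise case and drop duplicate keywords within one publication.
        unique = sorted({k.strip().lower() for k in keywords if k.strip()})
        # Every unordered pair in the publication contributes one co-occurrence.
        for first, second in combinations(unique, 2):
            pair_counts[(first, second)] += 1
    return pair_counts


# Hypothetical author keywords for three publications (illustration only).
papers = [
    ["Atopic dermatitis", "Cluster analysis", "Asthma"],
    ["Eczema", "Asthma", "Logistic regression"],
    ["Atopic dermatitis", "Asthma", "Cluster analysis"],
]
edges = keyword_cooccurrence(papers)
```

Each key of `edges` is an alphabetically ordered keyword pair and each value is an edge weight, which is the form of input a network-drawing or clustering routine would typically take.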

| Probabilistic topic modelling
We used the Latent Dirichlet allocation (LDA) algorithm 28 to explore the main topics present in the publications obtained by the literature search. LDA is an unsupervised ML method that estimates both the distribution of topics within each document and the distribution of words within each topic, by assuming each document consists of a mixture of topics and each topic consists of a mixture of words. Here, each document consisted of the title, keywords, and abstract. We used the tm R package 29 to clean the data (tokenization, lowercase conversion, removal of special characters and stop-words, standardization of words) and remove words with low frequency (words that occurred in fewer than 10 publications), the topicmodels R package 30 to run the LDA algorithm on the corpus, and generated the plots of the results using R packages such as ggplot2. 31
Most publications are labelled as either AD or eczema, and only a small portion are annotated with both terms (100 of 620 articles). This phenomenon is similarly found within the individual methodologies (Table S1). Publication numbers for each term are similar throughout the years, showing at first glance no significant frequency preference of the field in general for one term over the other (Figure 1B). Table S2. Table 1 summarizes the key methods used within the collection of publications. Cluster and factor analysis are the two most common methods. Of the 37 BS papers, a manual inspection found that only eight 21,33-39 specifically study AD. Of these eight, half 33-36 used random-effects Bayesian network meta-analysis to compare treatments for AD, and one 21 used a Bayesian mechanistic machine learning model to predict next-day AD severity for individual patients.
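To make the text-cleaning step of the topic-modelling methods concrete, the sketch below shows tokenization, lowercase conversion, stop-word removal, and removal of words that occur in too few documents, written in plain Python. It is an illustration under stated assumptions and not the tm pipeline used in the study: the stop-word set is a small hypothetical subset, word standardization (such as stemming) is omitted, the example documents are invented, and the threshold is lowered from 10 to 2 so that a three-document example remains meaningful.

```python
import re
from collections import Counter

# Illustrative subset only; a real stop-word list is much longer.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "with", "for"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop-words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_vocabulary(documents, min_docs):
    """Keep words that occur in at least `min_docs` documents."""
    doc_freq = Counter()
    for doc in documents:
        # A set counts each word once per document (document frequency).
        doc_freq.update(set(tokenize(doc)))
    return sorted(word for word, n in doc_freq.items() if n >= min_docs)


def document_term_counts(documents, vocabulary):
    """Return one word-count mapping per document, restricted to the vocabulary."""
    vocab = set(vocabulary)
    return [Counter(t for t in tokenize(doc) if t in vocab) for doc in documents]


# Hypothetical documents (title, keywords, and abstract would be concatenated).
docs = [
    "Eczema and asthma in children",
    "Clustering of eczema phenotypes",
    "Asthma risk factors in children with eczema",
]
vocabulary = build_vocabulary(docs, min_docs=2)
dtm = document_term_counts(docs, vocabulary)
```

The resulting per-document counts form a document-term matrix, which is the kind of input an LDA implementation expects.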

| Analytical methods and the use of AD or eczema terms
The use of AD and/or eczema terms is not uniform throughout the different methods. Detailed analysis is presented in Appendix S3.
Briefly, papers applying ML&AI methods use the term eczema more often. Within MS, papers using cluster and factor analysis are often only identified with the term AD. In contrast, those using logistic regression and latent class/transition models are eczema papers.

| Five central themes of data-driven AD and eczema research and their level of development
The bibliometric analysis identified five key themes within AD/eczema research employing MS, BS, and ML&AI methods, as visualized in a thematic map (Figure 2), where themes are mapped onto a two-dimensional space according to their centrality and density. The centrality is the degree of interaction of the theme with other themes and measures the significance or relevance of a theme in the development of the field at large. 40 The density measures the development of the theme. 40 Using these two measures, themes can be separated into four quadrants: emerging or declining themes (low centrality and density), niche themes (low centrality and high density), motor themes (high centrality and density), and basic themes (high centrality and low density). 40 We named the five identified themes retroactively, ordered by decreasing density (Figure 2): (1) allergic co-morbidities, (2) image analysis and classification, (3) disaggregation, (4) quality of life and treatment response, and (5) risk factors and prevalence.

Theme 4: Quality of life and treatment response. This theme includes studies investigating the quality of life and the cost-effectiveness of treatments, not only for AD/eczema but also for other skin conditions such as psoriasis. It is a basic theme with relatively low development but high relevance (purple in Figure 2).

FIGURE 2 Thematic map. Themes were generated using the top 100 authors' keywords and separated according to centrality (the degree of interaction of the theme with other themes) and density (the strength of internal connections among keywords in the theme). Up to six of the most frequent keywords in the associated theme are shown on the map.
DUVERDIER ET AL.
Thematic maps were also generated for the three methodologies and the term used (eczema or AD), Figure 2.
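The four-quadrant reading of a thematic map can be sketched in a few lines of Python. This is only an illustration of the quadrant definitions given in this section. The cut-off between "low" and "high" is assumed here to be the median of each measure, which may differ from the rule applied by the software used in the study, and the theme names and scores are hypothetical.

```python
from statistics import median


def classify_themes(themes):
    """Assign each theme to a quadrant from its (centrality, density) pair."""
    # Assumption: "high" means strictly above the median across all themes.
    centrality_cut = median(c for c, _ in themes.values())
    density_cut = median(d for _, d in themes.values())
    labels = {}
    for name, (centrality, density) in themes.items():
        high_centrality = centrality > centrality_cut
        high_density = density > density_cut
        if high_centrality and high_density:
            labels[name] = "motor"
        elif high_centrality:
            labels[name] = "basic"
        elif high_density:
            labels[name] = "niche"
        else:
            labels[name] = "emerging or declining"
    return labels


# Hypothetical (centrality, density) scores, not values from Figure 2.
example = {
    "theme_a": (0.9, 0.8),
    "theme_b": (0.8, 0.2),
    "theme_c": (0.1, 0.9),
    "theme_d": (0.2, 0.1),
}
quadrants = classify_themes(example)
```

Under these assumed scores, a theme such as `theme_b`, with high centrality and low density, falls in the same quadrant as the "basic" Theme 4 described above.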

| Eight key topics and identified gaps in employing modern computational methodologies
The LDA algorithm revealed eight key topics of data-driven AD and eczema research (Figure 3, Table S3), by breaking down the five themes obtained in the bibliometric analysis into their main components. It identified, in greater detail, the key areas of interest explored in the literature to date (Table S4) and their growth over time (Figure S3). (Figure 4 and Figure S4). This reflects the trend seen in the bibliometric analysis that the term eczema tends to be used in publications that also study other allergic diseases (Figure S5). In contrast, the term AD is often used in publications that are more specific to the condition.
FIGURE 3 Word clouds for the eight topics obtained by Latent Dirichlet allocation (LDA). The topic names were retroactively chosen to best summarize the content of topics. The 40 most probable words in each topic are plotted with the size of the words proportional to their probability.

| DISCUSSION
The first application of data-driven methods to AD and eczema research occurred in September 1973. Since then, 620 articles have been published, with over three-quarters of the publications in the last decade. The growth in scientific production over time shows an increased interest in applying data-driven methodologies to the study of AD and eczema, similar to asthma research. 12,15 Five central themes currently characterize the field: (1) allergic co-morbidities, (2) image analysis and classification, (3) disaggregation, (4) quality of life and treatment response, and (5) risk factors and prevalence.
Bayesian approaches have been used to study asthma 12,15 and the relationship between allergic diseases. 42,43 However, only eight publications to date apply BS to study AD and eczema specifically.
One of the eight developed a Bayesian mechanistic model that can predict next-day AD severity from patients' previous severity scores and treatments applied. 21
The analysis performed in this study corroborates the discrepancy in the use of AD and eczema terms within the literature that has been highlighted in previous studies. 26,44 Our results point towards a bias in term use depending on the computational method employed; this is consistent with the previously articulated notion that AD and eczema terms may be associated with different research communities that have differing views on nomenclature (Appendix S3).
The main limitation of our analysis is that it is heavily dependent on the terminology used by the authors. The authors' keywords associated with each publication were used to discover the key themes of the field of research; they were also used, in part, to retrieve publications of interest. This points to the importance of keyword choice when publishing a paper and the impact of using eczema or AD terms. A second limitation is that the LDA algorithm was applied to each publication's title, keywords, and abstract, but not to the full texts, as they were not available. Additionally, the publications were retrieved from the SCOPUS database only. Although its content is similar to that found on PubMed, future systematic reviews could aggregate the publications from multiple databases to ensure completeness of the collection of articles analysed.
Three key areas that could benefit from the application of data-driven approaches are the study of the disaggregation of the condition, quality of life and treatment response, and risk factors and prevalence. One of the greatest challenges for research in these areas concerns data curation, particularly data collection and sharing.
The study of the course of the condition, including its onset, persistence, and flare-ups, and the design of personalised treatment strategies would be greatly aided by additional longitudinal data.
Previous studies have showcased the benefit of, and need for, such data, 21,42,43 and new smartphone apps could facilitate the collection of data outside of a clinical visit. The sharing of data is also crucial, as AD is a complex disease that cannot be fully characterized in a single study. AD research would be greatly aided by a collaborative system to share and manage data from different studies across the community.
Our study particularly underlines the need for standardized data collection, including a clear and detailed record of the criteria for diagnostics and patient selection in clinical studies to allow for proper comparison between studies. A recent study highlighted the impact of different definitions of AD on prevalence estimates, risk factors, and the performance of predictive models. 47 Further work demonstrated that development of standardized composite definitions of AD derived from multiple sources (healthcare records and validated questionnaires) may help to define AD cases with more precision. 48 A data collection tool or unified database would be particularly useful for data sharing and to ensure the high quality and quantity of data needed for the proper employment of statistical methods. We may look to similar fields of study, including asthma research, 49 to guide the next steps.
As the development and employment of machine learning and other data-driven approaches gain popularity in healthcare, experts and government agencies have increasingly collaborated to develop guidelines to facilitate the growth of the field and delineate principles of best practice. 45,46 We further underline the need and benefit of cross-disciplinary collaborations for the future of data-driven research on AD and eczema. 12  highlight its availability. 26