Rethinking AI for Science: An Evolution From Data‐Driven to Data‐Centric Framework

The rapid advancements in Artificial Intelligence (AI) have found applications across a multitude of disciplines, including the scientific domain. Yet, the dominant focus on model‐centric AI often overlooks the critical role that data plays, thereby limiting its effectiveness in scientific investigations. This paper advocates for a transition to a data‐centric viewpoint, especially within the framework of AI applied to scientific advancements. The paper outlines the contrasting philosophies of model‐centric and data‐centric AI, highlighting the latter's commitment to data strategies. It explores a range of techniques that bolster the data‐centric framework. Furthermore, the paper presents an overview of key milestones in AI technology, the significance and the changing role of data‐centric AI, and the current state of research in this emerging field.


The Dichotomy: Model-Centric Versus Data-Centric AI
Historically, the AI community has been interested in the potential of AI models, often fine-tuning them to maximize performance. The focus has been on tweaking these models and architectures to eke out every last bit of performance improvement. This model-centric approach often overlooks a critical component: data. Data-centric AI redirects attention from models to data, involving rigorous data cleaning, data labeling, data valuation, coping with distribution shifts, and curating data for inference, among other things. The principle is straightforward but impactful: enhance the data and you improve the performance of AI systems, often without altering the underlying models. It's important to note that data-centric AI is distinct from data-driven AI, which typically still centers on developing models rather than engineering data.

Data-Centric AI for Science
The data-centric AI framework is equally important within science for many reasons.

The Nature of Scientific Research
Scientific research is fundamentally data-dependent. Whether it's large-scale particle physics experiments or complex computational biology simulations, data quality is pivotal for accurate scientific understanding. In addition, the majority of tasks in the scientific research lifecycle are data-related: data collection, data verification, data cleaning, data formatting, feature extraction, data analysis, data serving, data monitoring, etc.
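As a toy illustration of the verification and cleaning steps listed above, the following sketch drops records with missing fields or physically implausible measurements. The record format, field names, and valid range here are hypothetical, chosen only to make the idea concrete:

```python
def clean_records(records, required=("time", "value"), valid_range=(0.0, 100.0)):
    """Drop records with missing fields or out-of-range measurements,
    a toy version of the verification and cleaning lifecycle steps."""
    lo, hi = valid_range
    cleaned = []
    for rec in records:
        # Verification: every required field must be present.
        if any(rec.get(k) is None for k in required):
            continue
        # Cleaning: discard measurements outside the plausible range.
        if not (lo <= rec["value"] <= hi):
            continue
        cleaned.append(rec)
    return cleaned

raw = [{"time": 0, "value": 42.0},
       {"time": 1, "value": None},     # missing measurement
       {"time": 2, "value": 999.0}]    # physically implausible
print(len(clean_records(raw)))  # -> 1
```

In practice, the valid ranges and required fields would come from domain experts, which is exactly where the data-centric framework calls for their involvement.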

Robustness and Accuracy
In science, the stakes are too high for errors stemming from poor data. Such data can lead to flawed conclusions, affecting both scientific credibility and potentially societal well-being.

Data Management
The sheer volume of scientific data necessitates effective data management. A data-centric approach can streamline this, making it more efficient and reliable, while also setting consistent benchmarks for future model evaluations.

Ethical Considerations
Ethical issues like data privacy, biases, and diverse representation are integral to scientific research. A data-centric AI approach provides a framework for addressing these ethical concerns more effectively.

Emerging Initiatives
Various efforts are underway to promote data-centric AI, such as the Data-centric AI competition led by Andrew Ng (Ng et al., 2021). Initiatives like Cleanlab (Northcutt et al., 2021) aim to identify labeling errors, addressing the lack of data-quality metrics for AI. These efforts can easily be adapted for science domains.
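A minimal sketch of the idea behind such labeling-error tools follows, using a confident-learning-style heuristic: an example is suspect when the model's confidence in its given label falls well below the average confidence for that class. The threshold rule and toy data are illustrative, not Cleanlab's actual API:

```python
import numpy as np

def find_likely_label_errors(pred_probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Flag examples whose given label receives unusually low model confidence.

    pred_probs: (n_examples, n_classes) out-of-sample predicted probabilities.
    labels: (n_examples,) integer class labels as recorded in the data set.
    Returns indices of examples likely to be mislabeled.
    """
    # Self-confidence: the probability the model assigns to the *given* label.
    self_confidence = pred_probs[np.arange(len(labels)), labels]
    # Per-class thresholds: the average self-confidence within each class.
    thresholds = np.array([
        self_confidence[labels == c].mean() for c in range(pred_probs.shape[1])
    ])
    # An example is suspect if its self-confidence falls well below
    # its class threshold (the 0.5 factor is an illustrative choice).
    suspect = self_confidence < 0.5 * thresholds[labels]
    return np.flatnonzero(suspect)

# Toy demonstration: three examples, two classes.
probs = np.array([[0.95, 0.05],   # confidently class 0, labeled 0 -> fine
                  [0.90, 0.10],   # confidently class 0, labeled 1 -> suspect
                  [0.10, 0.90]])  # confidently class 1, labeled 1 -> fine
labels = np.array([0, 1, 1])
print(find_likely_label_errors(probs, labels))  # -> [1]
```

Real tools refine this with calibrated probabilities and cross-validated predictions, but the core move is the same: let the data's own statistics surface candidate labeling errors for expert review.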

Human Factor
Data-centric AI also involves human expertise, especially in data labeling and interacting with models in production. Science domain experts offer invaluable insights that models alone cannot provide, ensuring data aligns with scientific goals and enhances research reliability.

Trust in AI
The integrity of AI systems is fundamentally linked to the quality of the training data. Errors in labels or gaps in data can distort learning, leading to biased or unreliable models. To increase domain scientists' confidence in AI systems, meticulous attention to data quality, including accurate labeling and comprehensive coverage, is crucial.

Approaches and Techniques in Data-Centric AI
While many methodologies in data-centric AI are not entirely new, they serve a unified goal: enhancing data to align AI models with intended results. For example, data augmentation has long been used to diversify data, improving model generalization. Feature selection techniques help create more focused and informative data subsets, thereby boosting model performance and interpretability. Curriculum learning (Bengio et al., 2009) has also been shown to improve performance. Recent advancements include data programming methods that speed up the data labeling process, a crucial aspect of supervised learning. Additionally, algorithmic recourse mechanisms contribute to explainable AI by offering deeper insights into model decisions. Prompt engineering is another emerging technique that tweaks the input for large language models like GPT-3 (Brown et al., 2020) to generate specific outputs.
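To make the data augmentation idea concrete, here is a minimal sketch for one-dimensional numeric sequences (e.g., a sensor trace). The noise level and shift scheme are illustrative assumptions, not a prescribed recipe:

```python
import random

def augment_sequence(seq, n_variants=3, noise=0.05, seed=0):
    """Generate perturbed variants of a numeric sequence so a model
    sees plausible variations of the same underlying signal."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        # Additive Gaussian jitter mimics measurement noise.
        jittered = [x + rng.gauss(0.0, noise) for x in seq]
        # A random circular shift mimics small timing offsets.
        shift = rng.randrange(len(seq))
        variants.append(jittered[shift:] + jittered[:shift])
    return variants

original = [0.0, 0.5, 1.0, 0.5]
augmented = augment_sequence(original)
print(len(augmented))  # -> 3 variants, each the same length as the original
```

Which perturbations are physically admissible is itself a data-centric, domain-expert decision: a shift that is harmless for one instrument may destroy the signal of another.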
The data-centric AI paradigm is more than just a theoretical shift; it's backed by a variety of established and emerging methodologies. These techniques strengthen the data-centric framework, making it a versatile solution for complex challenges in scientific AI applications.
In essence, the data-centric AI approach should be integrated throughout the entire scientific research lifecycle. Figure 1 delineates the data-related tasks that are integral to AI and span the complete cycle of scientific research. It begins at the hypothesis formulation stage, where tasks like identifying the relevant scientific data, determining the necessary data volume, and specifying metadata such as data formats are crucial. During the data preparation stage for AI, tasks like feature subsetting, data cleaning, label generation, data valuation, and exploratory data analysis are essential. Additionally, the creation of evaluation data that is either in-distribution or out-of-distribution relative to training is important. If a model performs well on in-distribution data but poorly on out-of-distribution data, it may indicate that the model has overfit to the training data. For science applications, it is critical to understand how a model behaves on data that deviates from the norm. In the model training phase, tasks focus on data strategies, including data augmentation, active learning techniques, and anomaly detection. Lastly, the deployment phase involves ongoing model auditing and monitoring tasks, such as continual fine-tuning of prompts to achieve specific results, and the establishment of robust infrastructures for data management, organization, debugging, and bias identification. This framework for the entire science research lifecycle can be realized by developing supporting tools, engaging domain experts, addressing standardization, and creating data benchmarks to evaluate various data strategies against agreed-upon metrics.
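One of the lifecycle tasks above, anomaly detection during data preparation, can be sketched with a simple z-score screen. The readings and thresholds are illustrative; real pipelines would use domain-appropriate statistics:

```python
import statistics

def flag_anomalies(values, z_thresh=3.0):
    """Flag observations whose z-score exceeds a threshold, a basic
    screen applied while preparing training data."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > z_thresh]

# Small samples need a looser threshold, since one outlier inflates the stdev.
readings = [10.1, 9.8, 10.0, 10.2, 55.0, 9.9]
print(flag_anomalies(readings, z_thresh=2.0))  # -> [4]
```

Whether a flagged point is an instrument glitch to discard or a rare event to keep is precisely the kind of judgment this framework routes to domain experts rather than to the model alone.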

Current State of Progress in Data-Centric AI
While data-centric AI is a relatively nascent field, significant strides have already been made in numerous relevant areas, many of which were traditionally considered preprocessing steps in a model-centric approach. Additionally, new tasks are continually emerging, and research in these areas is ongoing. Among the core tasks in data-centric AI, namely the development of training and inference data and data maintenance, the focus has predominantly been on training data development. As the corpus of scholarly articles focusing on data-centric AI grows, evidenced by trends in Google Scholar search results for "data-centric AI" depicted in Figure 2, it is reasonable to expect further advancements in the imminent future. Nevertheless, such a trend appears absent within the scientific community; for instance, a query for "data-centric AI" within publications from the American Geophysical Union (AGU) yields no results.
It's worth noting that the rise of data-centric AI doesn't undermine the importance of model-centric AI. Rather, these two approaches are complementary and mutually beneficial in constructing AI systems. Model-centric methods can be employed to achieve data-centric objectives, and vice versa. Consequently, in real-world applications, both data and models are likely to evolve in tandem within a dynamic environment.

Foundation Models and Data-Centric AI
The emergence of foundation models (Bommasani et al., 2021), including large language models, marks a transformative shift that has catapulted AI into new dimensions. These models are inherently data-centric, pre-trained on vast quantities of unlabeled data and capable of adapting to a wide array of tasks. The advent of the transformer architecture has been a catalyst for advances in foundation models across language and computer vision. This has led to a series of models (Figure 3), most trained on increasingly large data sets and featuring larger model sizes to perform complex tasks. Beyond training data, the design of inference data has been instrumental in redirecting the model to perform critical tasks, as well as in the unlocking of novel model functionalities such as emergent capabilities. A case in point is the technique known as "prompt engineering," which achieves a variety of objectives by exclusively fine-tuning the input data to extract knowledge from an unaltered model. Applications like ChatGPT (https://chat.openai.com) have also emerged, demonstrating that downstream applications now require significantly smaller data sets for effective model deployment. These data needs can be categorized into three types: zero-shot learning through prompting, which requires virtually no data; few-shot prompting, which needs only a handful of examples; and fine-tuning, which involves adjusting a pre-trained model on a relatively small training data set. As foundation models specifically tailored for scientific applications continue to emerge, the data strategy for large volumes of unlabeled data gains equal importance. Moreover, these foundation models are used to support many downstream tasks, all of which can be fine-tuned using smaller subsets of training data. The reduced need for data samples is particularly advantageous for the scientific community, not only making research more cost-effective but also more feasible, especially when studying rare scientific events.
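The distinction between zero-shot and few-shot prompting comes down to how the inference data is assembled. The following sketch builds both kinds of prompt from the same template; the task wording and example format are illustrative assumptions, not any particular model's required format:

```python
def build_prompt(task, examples=None, query=""):
    """Assemble a prompt for a large language model.

    With no examples the prompt is zero-shot; adding a handful of
    input/output pairs turns it into a few-shot prompt.
    """
    lines = [task]
    for inp, out in (examples or []):
        lines.append(f"Input: {inp}\nOutput: {out}")
    # The query is left with an empty Output slot for the model to fill.
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

task = "Classify the sentiment as positive or negative."
zero_shot = build_prompt(task, query="The instrument calibration failed again.")
few_shot = build_prompt(task,
                        examples=[("Great results today!", "positive"),
                                  ("The detector is broken.", "negative")],
                        query="The instrument calibration failed again.")
print(few_shot)
```

Seen this way, prompt construction is pure data engineering: the model is untouched, and all the leverage comes from how the inference data is curated.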

The traditional belief that more data is always better has been supplanted by the understanding that limited, high-quality data sets are often more valuable. Therefore, the role of a data-centric AI framework remains crucial, perhaps even more so, within the context of these foundation models for science.

Conclusion
Data-centric AI is not just a complementary approach to model-centric AI; it is a necessity, especially in the realm of scientific research. By focusing on data for AI, we can pave the way for more robust, accurate, and ethical scientific discoveries powered by AI. Therefore, it is imperative for a data-centric framework to become an integral part of the scientific method, rapidly guiding hypothesis construction, data selection, experiment design, analysis, and synthesis for new scientific insights. For those involved in scientific research, the opportunity to adopt a data-centric approach is becoming increasingly relevant. It would be beneficial to engage with this emerging field, organize workshops, take part in related competitions, and make use of the resources available to improve data. By doing so, we can better position ourselves to leverage the capabilities of AI in advancing science.

Figure 1. Data strategies for incorporating data-centric AI within the science research and deployment lifecycle.

Figure 3. Rapid evolution of transformer architecture-driven models in computer vision and language.