Interpretable and explainable machine learning: A methods‐centric overview with concrete examples

Interpretability and explainability are crucial for machine learning (ML) and statistical applications in medicine, economics, law, and natural sciences and form an essential principle for ML model design and development. Although interpretability and explainability have escaped a precise and universal definition, many models and techniques motivated by these properties have been developed over the last 30 years, with the focus currently shifting toward deep learning. We will consider concrete examples of state‐of‐the‐art, including specially tailored rule‐based, sparse, and additive classification models, interpretable representation learning, and methods for explaining black‐box models post hoc. The discussion will emphasize the need for and relevance of interpretability and explainability, the divide between them, and the inductive biases behind the presented “zoo” of interpretable models and explanation methods.

In general, there is no agreement within the ML community on the definition of interpretability and the task of interpretation (Doshi-Velez & Kim, 2017; Lipton, 2018). For example, Doshi-Velez and Kim (2017) define interpretability of ML systems as "the ability to explain or to present in understandable terms to a human." This definition lacks mathematical rigor (Lipton, 2018). Nevertheless, the notion of interpretability often depends on the domain of application (Rudin, 2019) and the target explainee (Carvalho et al., 2019), that is, the recipient of interpretations and explanations. Therefore, an all-purpose definition might be infeasible (Rudin, 2019) or unnecessary. Other terms that are synonymous with interpretability and also appear in the ML literature are "intelligibility" (Caruana et al., 2015; Lou et al., 2012) and "understandability" (Lipton, 2018). These concepts are often used interchangeably.
Yet another term prevalent in the literature is "explainability," giving rise to the direction of explainable artificial intelligence (XAI) (Gunning & Aha, 2019). This concept is closely tied to interpretability, and many authors do not differentiate between the two (Carvalho et al., 2019). Doshi-Velez and Kim (2017) provide a definition of explanation that originates from psychology: "explanations are … the currency in which we exchange beliefs." Rudin (2019) draws a clear line between interpretable and explainable ML: interpretable ML focuses on designing models that are inherently interpretable, whereas explainable ML tries to provide post hoc explanations for existing black-box models, that is, models that are incomprehensible to humans or are proprietary (Rudin, 2019). Lipton (2018) stresses the difference in the questions the two families of techniques try to address: interpretability raises the question "How does the model work?," whereas explanation methods try to answer "What else can the model tell me?"

| Purpose of the review
This review is intended for a general machine learning audience interested in exploring the problems of interpretation and explanation beyond the logistic regression model or random forest variable importance. It is not an exhaustive literature survey but rather an overview with a selection of concrete, comprehensively studied examples that represent different research directions. We will address the following questions throughout this review:
1. What is the difference between interpretable and explainable ML?
2. In what settings is it desirable for an ML model to be interpretable or to be explained?
3. How can interpretability and explainability be assessed in practice?
4. What inductive biases are characteristic of interpretable models and explanation methods?
The material presented in this overview is partially based on the literature review from the article by , although the current work covers a much broader range of topics and has been updated with more recent references.

| Related work and our contribution
To date, interpretable and explainable machine learning forms an established subfield with its own research questions and directions. Numerous thorough review papers tackle the topic; many of them can be categorized into the four groups briefly summarized below. (i) Some provide a relatively nontechnical and general introduction to the fundamental problems, concepts, and research questions and directions, for example, the works by Carvalho et al. (2019), Barredo Arrieta et al. (2020), or Molnar (2020). (ii) Others view interpretability and explainability from a novel or unusual perspective or provide opinions on the progress, challenges, and future directions. For instance, Ghassemi et al. (2021) overview healthcare applications of explainability techniques and their failure cases and argue that XAI is unlikely to address the real needs of practitioners. (iii) Another category of reviews focuses on a restricted class of models or a family of methods. For example, Verma et al. (2020) discuss only counterfactual explanation methods, and Puiutta and Veith (2020) specifically survey reinforcement learning techniques. (iv) Last but not least, some reviews discuss the use of interpretable and explainable ML in a particular application area, for example, genomics (Watson, 2021) or robotics (Anjomshoae et al., 2019). Table 1 lists a nonexhaustive, manually curated selection of review articles from the four categories mentioned above.
In contrast, while starting with a broad introduction to the topic and basic concepts, this review explores interpretability and explainability via concrete, comprehensively studied examples of the latest models, methods, and their typical inductive biases. We provide an intuitive explanation for many techniques but do not shy away from examining equations and definitions behind them hands-on. At the same time, we attempt to give the reader a well-rounded overview of the various lines of methodological work. The models and methods discussed later were chosen as representative of the current state of the field.

| Organization of the paper
In the remainder of this review, we discuss the need for interpretable and explainable machine learning techniques, giving examples from several application domains (Section 2). We provide an overview of evaluation methods for interpretability and explainability (Section 3). We then outline a taxonomy of techniques for interpretable (Section 4.2) and explainable (Section 4.3) ML with concrete examples of several recent developments. Finally, Section 5 contains concluding remarks.

| MOTIVATION AND RELEVANCE
It is natural to question the utility of interpretable and explainable ML, especially given a widespread belief that a trade-off exists between accuracy and interpretability (Rudin, 2019; Semenova et al., 2019). Therefore, it is sensible to ask "Why would a designer of an ML system consider sacrificing performance for the sake of transparency?" First, it is important to note that there are many cases when interpretability is not necessary, particularly when the studied problem is well-known, well-understood, and does not have substantial consequences (Doshi-Velez & Kim, 2017), for example, mail sorting, movie recommendation, and so forth. Second, the perceived accuracy-interpretability trade-off may not necessarily apply to all datasets and prediction problems (Rudin, 2019).
Arguably, the commonest motivation behind interpretability and explainability is developing user trust (Doshi-Velez & Kim, 2017; Lipton, 2018). Lipton (2018) decomposes trust into knowing "how often a model is right" and "for which examples it is right." Sometimes we might want to gain a more profound intuition about the model's behavior. In that case, an ability to interpret or explain could be another prerequisite for a trustworthy ML system. However, this ability alone is not sufficient (Rudin et al., 2022), since it is not a substitute for accurate and reliable predictions.
In practice, interpretability and explainability are typically most useful when auditing ML systems and confirming auxiliary desiderata beyond predictive performance (Carvalho et al., 2019; Doshi-Velez & Kim, 2017; Lipton, 2018). From a legal perspective, interpretable and explainable ML is concordant with the EU General Data Protection Regulation (GDPR) (Voigt & von dem Bussche, 2017), which states data subjects' right to an explanation of algorithmic decisions and the right to be informed. It is worth mentioning that the GDPR does not prohibit black-box predictive models and that the right to an explanation is not legally binding (Carvalho et al., 2019; Wachter et al., 2017). This, however, as Wachter et al. (2017) note, does not undermine the social and ethical value of providing interpretations and explanations. Below we discuss several goals attainable with interpretability and explainability that are commonly cited in the literature. Table 2 shows a few concrete examples of the considered use cases.

TABLE 1 Overview of manually collected review papers on interpretable and explainable ML or related topics.

One could leverage an interpretable model or an explanation method to generate hypotheses about causal relationships among the observed variables in the data (Lipton, 2018). In such cases, it is often desirable for the model or explanation to pick up cause-effect relationships (Carvalho et al., 2019) rather than spurious associations. Such a formulation of interpretability is ambitious and inherently requires solving the problem of observational causal discovery (Nogueira et al., 2022). Some authors even go further and suggest that genuinely interpretable machine learning should provide causal interpretations and explanations of the data (G. Xu et al., 2020).
On a related note, interpretability and explainability can be instrumental in exploratory data analysis and scientific discovery (Doshi-Velez & Kim, 2017). For example, interpretable support vector machines have been used for discovering unknown physics in materials science (K. Liu et al., 2021). In quantum chemistry, neural networks allowed for an analytical differentiable representation of the quantum mechanical wavefunction (Schütt et al., 2019). In computational linguistics, Pimentel et al. (2019) have leveraged NLP models alongside an information-theoretic approach to quantify the relationship between word forms and meanings. These are just a few examples of emerging machine-learning-assisted scientific discovery. A comprehensive survey by Raghu and Schmidt (2020) contains many more scientific deep learning applications.
"Good" ML models should be resistant to noisy inputs and domain shifts. Interpretations and explanations can be instrumental in designing reliable, robust, and transferable models (Carvalho et al., 2019;Doshi-Velez & Kim, 2017;Lipton, 2018). For instance, an iconic example wherein interpretability facilitated model "debugging" in that regard is discussed by Caruana et al. (2015), who have used generalized additive models for pneumonia risk prediction, exhibited, and alleviated unwanted confounding in the dataset. Another noteworthy example is the Manifold-an inhouse visualization and debugging tool for ML models developed at Uber (Carvalho et al., 2019;L. Li & Wang, 2019).
When ML algorithms are incorporated into decision-making, for example, social, economic, or medical, and use sensitive personal data, we have to scrutinize their fairness (Barocas et al., 2019; Dignum, 2019) and privacy (Papernot et al., 2018). Interpretations and explanations can be instrumental in exposing demographic disparities and reliance on sensitive information in ML models (Carvalho et al., 2019; Doshi-Velez & Kim, 2017; Lipton, 2018) by making them readily auditable. For example, using explanation methods, the ProPublica analysis of the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) recidivism model (Larson et al., 2016; Rudin, 2019) revealed that COMPAS might be racially biased.

In summary, interpretability and explainability, although not necessary in many straightforward applications, become instrumental when the problem definition is incomplete and in the presence of additional desiderata, such as trust, causality, or fairness. These principles can be helpful both to the specialists designing predictive models and to end-users who want to obtain a more profound intuition about the behavior of an ML system. In practice, there exists a plethora of techniques, ranging from specially tailored interpretable neural network architectures to out-of-the-box model-agnostic explanation methods. According to Bhatt et al. (2020), who conducted interviews with 50 data scientists and practitioners from 30 different organizations, the choice of an interpretable model or an explanation technique for a specific use case should depend on the identified stakeholders' needs and expectations regarding interpretability and explainability.

| EVALUATION OF INTERPRETABILITY AND EXPLAINABILITY
Despite the abundance of methodological research, literature on evaluation approaches and metrics for interpretable and explainable ML is still relatively scarce (Carvalho et al., 2019). There appear to be no uniform, well-established standards for qualitative or quantitative evaluation, likely due to the lack of an all-purpose definition of interpretable and explainable ML and the diversity and subjectivity of the desiderata and principles investigated in the literature. Nauta et al. (2022) provide the most comprehensive survey to date of qualitative and quantitative methods. This section outlines one popular classification of the evaluation criteria, due to Doshi-Velez and Kim (2017), that is concordant with much of the current literature. Examples of how these evaluation methods could be implemented in practice are provided in Table 3.

| Application-grounded evaluation
Application-grounded evaluation requires evaluating a method or a model on the exact target task with human experts representing the target audience. For example, the best way of evaluating an explainable ML-based decision support system for medical diagnosis would be to ask doctors to diagnose diseases assisted by the system and compare their performance to a reasonable baseline. Similar evaluation methods are widely adopted, for example, in the field of human-computer interaction (MacDonald & Atwood, 2013) and, arguably, if implemented correctly, provide the strongest evidence of success. A study by Jesus et al. (2021) is an excellent example of application-grounded evaluation: the authors evaluate several explanation methods for fraud detection based on transaction data. They measure the accuracy and time of decisions made by fraud analysts assisted by different explanations and compare them with decisions based purely on the raw data and black-box model predictions.
TABLE 3 A taxonomy of evaluation approaches for interpretable and explainable machine learning due to Doshi-Velez and Kim (2017).

| Human-grounded evaluation
Human-grounded evaluation can be viewed as a relaxed version of application-grounded evaluation. It requires conducting experiments with human users performing a possibly simplified task reminiscent of the target application. For instance, Ribeiro et al. (2016) evaluate their proposed explanation technique using the human-grounded approach. They recruit human subjects on Amazon Mechanical Turk (Paolacci et al., 2010) and compare their ability to choose the best text classification model based on explanations provided by the proposed method versus baseline techniques. Notably, the recruited subjects are not experts in the subject area of the texts, and the task is merely a proxy for the end goal of the ML system. Although human-grounded evaluation is cheaper than the application-grounded approach, its results inevitably lead to less specific and insightful conclusions.

| Functionally-grounded evaluation
Last but not least, functionally-grounded evaluation is, arguably, the most appropriate for early feasibility studies and is the simplest to implement since it requires no human subject experiments. These methods use some formal mathematical definition of interpretability or explainability as a proxy measure. For example, Shrikumar et al. (2017) evaluate different explanation methods for image classification based on the decrease in classification accuracy on the MNIST dataset (LeCun et al., 2010) after masking features identified as important by an explanation method. Another example is the dataset of Kandinsky Patterns and the accompanying challenges introduced by Müller and Holzinger (2021): in brief, the challenges comprise classifying simple visual patterns in controllable synthetic image datasets while producing explanations in a specific format, for example, natural language. Similarly, by extending the CLEVR dataset (Johnson et al., 2017) for visual question answering, Arras et al. (2020) release the CLEVR-XAI benchmark for neural network explanation methods. While such evaluation approaches are compelling and can be implemented entirely in silico, their insights are often limited by the subjectivity of the chosen proxy measure and the simplicity of the toy datasets used.

The evaluation of interpretability and explainability in ML models largely remains an open problem. Interpretable and explainable ML research still often relies on anecdotal or subjective evidence; for instance, Nauta et al. (2022) observe that only 58% of the papers they survey evaluate their models and methods quantitatively, and a mere 22% conduct a user study. Performing large-scale experiments with human subjects, identifying and systematizing good proxy metrics, and developing rigorous criteria and desiderata for evaluation are all essential for the advancement of the whole field.
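To make the masking-based protocol concrete, below is a minimal sketch of a functionally-grounded evaluation in the spirit of the experiment by Shrikumar et al. (2017): occlude the features an attribution method ranks as most important and record the accuracy drop. The dataset, classifier, and the stand-in random attribution scores are illustrative assumptions, not the original setup.

```python
# A minimal sketch of a masking-based, functionally-grounded evaluation.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)

rng = np.random.default_rng(0)
attributions = rng.random(X_te.shape)  # stand-in for a real explanation method

for k in (0, 8, 16, 32):
    X_masked = X_te.copy()
    # Zero out the k features deemed most important for each test instance.
    top_k = np.argsort(-attributions, axis=1)[:, :k]
    np.put_along_axis(X_masked, top_k, 0.0, axis=1)
    print(k, clf.score(X_masked, y_te))
```

Under this proxy, a steeper accuracy decline for small k suggests that the evaluated attribution method identifies genuinely influential features.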

| INTERPRETABLE MODELS AND EXPLANATION METHODS
Now that we have established that interpretability and explainability of ML models are essential in certain settings and have seen how these properties can be evaluated, the reader might be left wondering how interpretability of a model is achieved in practice and how the predictions of a black-box model could be explained. The following sections discuss several state-of-the-art interpretable and explainable ML methods. The selection of works does not comprise an exhaustive survey of the literature. Instead, it is meant to illustrate the commonest properties and inductive biases behind interpretable models and explanation methods using concrete instances. Figure 1 provides a roadmap for the remainder of this section, compiling some of the most salient characteristics of interpretable and explainable ML identified in the previous literature (Carvalho et al., 2019; Doshi-Velez & Kim, 2017; Lipton, 2018; Molnar, 2020). These include the model class, the scale at which interpretations or explanations are produced, agnosticism with respect to the black-box model, and actionability. Tables 5 and 8 further outline the properties of the concrete techniques. These properties will be defined and discussed in detail throughout the section.

| Notation and preliminaries
This review primarily focuses on interpretability and explainability in the context of supervised learning for classification and regression tasks. However, some sections will discuss unsupervised learning scenarios, such as unsupervised representation learning (Section 4.2.11).

FIGURE 1 Roadmap for the review of interpretable models and explanation methods based on a compilation of salient characteristics identified in the literature (Carvalho et al., 2019; Doshi-Velez & Kim, 2017; Lipton, 2018; Molnar, 2020). Concrete examples for each property are shown in italics.
TABLE 4 Mathematical notation used throughout the remainder of this review.

We assume a training dataset $\mathcal{D} = \left\{ \left(\mathbf{x}_i, y_i\right) \right\}_{i=1}^{N}$ consisting of features $\mathbf{x}_i \in \mathcal{X}$ and labels $y_i \in \mathcal{Y}$. For tabular data, features are given by a $p$-dimensional vector $\mathbf{x}_i \in \mathbb{R}^p$. We use $f(\cdot)$ to refer to a classification or regression model, which may be interpretable or black-box, fitted on the training data. In the unsupervised learning scenario, we assume a dataset of unlabeled points $\mathcal{D} = \left\{ \mathbf{x}_i \right\}_{i=1}^{N}$.

Throughout this section, we will occasionally provide examples of different techniques applied to a simple dataset comprising clinical, laboratory, scoring, and ultrasound variables, as well as ultrasound images, acquired from a cohort of pediatric patients admitted to the hospital with suspected appendicitis (Roig Aparicio et al., 2021). The underlying problem for this dataset is binary classification: the prediction of the patient's diagnosis (appendicitis vs. no appendicitis). The data analysis was approved by the University of Regensburg institutional review board (Ethikkommission der Universität Regensburg, no. 18-1063-101). The dataset is publicly available at https://github.com/i6092467/pediatric-appendicitis-ml.

| Interpretable models
Interpretable models, sometimes also referred to as "white-" or "gray-boxes," are usually constrained and structured to reflect physical constraints, monotonicity, additivity, causality, sparsity, or other desirable properties (Carvalho et al., 2019). Some researchers have even argued that interpretable supervised machine learning can be viewed as an instance of constrained empirical risk minimization (Dziugaite et al., 2020). The choice of properties depends on the particular application and the end-user. For example, Lipton (2018) notes that a high-dimensional linear model is not more interpretable than a very compact neural network. In contrast, a sparse linear model is comprehensible and easy to visualize. Therefore, two desirable characteristics are (i) simulatability and (ii) decomposability (Lipton, 2018), that is, (i) a model must be comprehensible in a limited amount of time, and (ii) its inputs and parameters should be intuitively meaningful. Table 5 contains concrete examples of machine learning models that fall into this broad category.

TABLE 5 Properties of several reviewed interpretable models (columns: scale; whether a model is linear, sparse, additive, or monotonic; and support for unstructured data).

| Rule-based models
Rule-based classification algorithms have been known for a long time. One could argue that these well-established techniques are intrinsically interpretable. While single if-then rules are indeed readily comprehensible, inductive logic programming (De Raedt, 1999), for instance, yields an unordered set of rules; on the other hand, decision trees (Loh, 2011) are not monotonic and, thus, require additional mental effort. Several rule-based classification approaches have been introduced with interpretability in mind. Some examples include repeated incremental pruning to produce error reduction (RIPPER) (Cohen, 1995), which keeps the number of rules small; RuleFit (Friedman & Popescu, 2008), which induces rules from a sparse linear model with pairwise interactions; and falling rule lists (FRL) (F. Wang & Rudin, 2015b), which prioritize monotonicity across the induced rules. Herein, we will focus on FRLs more closely as an illustrative example.

FRLs (F. Wang & Rudin, 2015b) are binary classifiers motivated by the wide adoption of risk scores and risk stratification systems in healthcare. A falling rule list is a list of if-then rules such that (i) during classification, rules have to be applied in the order given by the list, and (ii) the probability of the positive class is monotonically decreasing within the list. Table 6 provides an example of an FRL for predicting the risk of appendicitis in pediatric patients, learnt from a small publicly available tabular dataset (Section 4.1; Roig Aparicio et al., 2021). Notably, the rules use simple discretized features, and the risk decreases monotonically throughout the list. The constrained format of FRLs makes them more understandable than decision trees and is natural for practical decision-making in a clinical setting; a minimal prediction sketch is shown below.
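The following sketch illustrates how prediction with an FRL proceeds: ordered if-then rules with monotonically decreasing risk estimates. The rules, feature names, and probabilities are hypothetical, not the ones learnt in Table 6.

```python
# A minimal sketch of prediction with a falling rule list (FRL).
from typing import Callable

# (condition, estimated probability of appendicitis); probabilities fall.
falling_rule_list: list[tuple[Callable[[dict], bool], float]] = [
    (lambda x: x["us_visibility"] == "visible" and x["wbc"] > 12.0, 0.95),
    (lambda x: x["alvarado_score"] >= 7, 0.80),
    (lambda x: x["body_temp"] > 38.0, 0.55),
]
default_risk = 0.15  # applies when no rule fires

def predict_risk(x: dict) -> float:
    # Rules must be checked in order; the first satisfied rule decides.
    for condition, risk in falling_rule_list:
        if condition(x):
            return risk
    return default_risk

print(predict_risk({"us_visibility": "visible", "wbc": 14.1,
                    "alvarado_score": 5, "body_temp": 37.2}))  # -> 0.95
```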
In practice, FRLs can be learnt using a Bayesian modeling approach wherein the monotonicity and sparsity constraints are encoded in the prior distribution. A simulated annealing procedure is then used to sample from the posterior distribution and obtain the maximum a posteriori (MAP) estimate. C. Chen and Rudin (2018) further relax the original optimization problem of learning FRLs by introducing softly falling rule lists. Rather than imposing hard monotonicity constraints, the authors add a non-monotonicity penalty term to the loss function. Such a formulation is better suited to noisy real-world datasets, where sparse and strictly monotonic solutions might be less performant. Another noteworthy extension is causal falling rule lists (F. Wang & Rudin, 2015a), which leverage FRLs to estimate treatment effects in the potential outcomes framework (Rubin, 2005).

| Score-based models
Another class of interpretable binary classification models, likewise motivated by medical risk scoring, is supersparse linear integer models (SLIM), introduced by Ustun and Rudin (2015). SLIMs allow learning data-driven risk scores that are reminiscent of conventional medical scoring systems, such as APACHE (Knaus et al., 1985) or SOFA (Vincent et al., 1996). In contrast to FRLs, whose key focus is monotonicity, SLIMs represent sparse decision boundaries, that is, they rely on a limited number of features. Moreover, interpretability in SLIMs is additionally facilitated by learning a linear scoring function with integer coefficients. Roughly, SLIMs require solving the following optimization problem (refer to the original paper by Ustun and Rudin (2015) for the complete formulation):

$$\min_{\boldsymbol{\beta} \in \mathcal{B}} \; \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left\{ y_i \boldsymbol{\beta}^{\mathsf{T}} \mathbf{x}_i \leq 0 \right\} + \lambda \left\lVert \boldsymbol{\beta} \right\rVert_0, \qquad (1)$$

where $\boldsymbol{\beta} \in \mathcal{B}$ is an integer-valued coefficient vector with $\mathcal{B} = \left\{ L, L+1, \ldots, U-1, U \right\}^p$, $L, U \in \mathbb{Z}$, $L < U$, and $\mathbf{1}\{\cdot\}$ denotes an indicator function. Notably, the original features, if continuous, have to be discretized and encoded as binary-valued factors. The integer linear program (ILP) defined by Equation (1) enjoys the advantages of directly minimizing the 0-1 loss and the $\ell_0$ penalty instead of convex surrogate measures commonly adopted in the statistics and machine learning literature, cf. Zou and Hastie (2005). Table 7 contains an example of a scoring system learnt using SLIM from tabular data for predicting the risk of appendicitis in children. In practice, the risk for an individual described by features $\mathbf{x}_i$ is quantified by $\boldsymbol{\beta}^{\mathsf{T}} \mathbf{x}_i$, that is, by the sum of the coefficients corresponding to the applicable conditions.

TABLE 7 A supersparse linear integer model for predicting the risk of appendicitis in pediatric patients based on tabular data comprising clinical, laboratory, scoring, and ultrasonography variables (Roig Aparicio et al., 2021).
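To illustrate how a fitted SLIM is used, below is a minimal sketch of scoring with a SLIM-style integer scorecard: the risk score is the sum of the integer points for the binary conditions that apply, as in $\boldsymbol{\beta}^{\mathsf{T}} \mathbf{x}_i$ above. The conditions, point values, intercept, and threshold are hypothetical, not those learnt in Table 7.

```python
# A minimal sketch of applying a SLIM-style integer scoring system.
slim_scorecard = [
    ("appendix visible on ultrasound", lambda x: x["us_visible"], 3),
    ("white blood cell count > 12",    lambda x: x["wbc"] > 12.0,  2),
    ("body temperature > 38.0",        lambda x: x["body_temp"] > 38.0, 1),
]
intercept, threshold = -2, 0  # predict appendicitis if score > threshold

def slim_score(x: dict) -> int:
    # Sum the integer points of all conditions satisfied by the patient.
    return intercept + sum(pts for _, cond, pts in slim_scorecard if cond(x))

patient = {"us_visible": True, "wbc": 13.5, "body_temp": 37.4}
print(slim_score(patient), slim_score(patient) > threshold)  # -> 3 True
```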
In addition to the theoretical guarantees, the ILP formulation above has the benefit of easily incorporating and enforcing additional constraints beyond integrality and sparsity, for example, introducing desirable "either-or" or "if-then" conditions on features or preferences for (not) using certain variables. Ustun and Rudin (2015) also introduce a range of extensions of SLIMs. Particularly noteworthy are personalized SLIMs with varying scoring rules for individual data points. The authors also present rule-based adaptations. Further algorithmic improvements are made by Ustun and Rudin (2017).

| Generalized additive models
As mentioned before, decomposability is a desirable property of interpretable ML models (Lipton, 2018). One class of "decomposable" functions is additively separable functions (Segal, 1994). We say that a function $f\left(x_1, x_2, \ldots, x_p\right)$ is additively separable if we can rewrite it as a sum of univariate terms: $f\left(x_1, x_2, \ldots, x_p\right) = \sum_{j=1}^{p} u_j\left(x_j\right)$. Hastie and Tibshirani (1986) introduce the class of generalized additive models (GAM) that rely on this additivity property. In particular, for $p$ features, a GAM is given by

$$g\left(\mathbb{E}\left[y \mid x_1, \ldots, x_p\right]\right) = \beta_0 + \sum_{j=1}^{p} s_j\left(x_j\right), \qquad (2)$$

where $g(\cdot)$ is a link function and $s_j(\cdot)$ are smooth functions, often referred to as shape functions (Lou et al., 2012). GAMs are an extension of the linear model that preserves the additivity but allows introducing nonlinearities in individual variables by choosing appropriate shape functions. This model class ignores interactions between variables. Therefore, the influence of each feature is easily comprehensible and can be visualized by plotting the corresponding shape function $s_j(\cdot)$. Figure 2 depicts shape functions for two continuously-valued features in a GAM for classification, fitted on a tabular dataset.

Lou et al. (2012) conduct an extensive experimental comparison among different methods for fitting GAMs and choices of $s_j(\cdot)$. They consider least squares, gradient boosting, and backfitting approaches. In addition to the standard use of spline-based shape functions (Hastie & Tibshirani, 1986), Lou et al. (2012) consider single, bagged, boosted, and boosted bagged decision trees. Building on the work by Lou et al. (2012), Caruana et al. (2015) propose a simple yet more performant extension that includes two-way interaction terms, referred to as generalized additive models plus interactions (GA²M):

$$g\left(\mathbb{E}\left[y \mid x_1, \ldots, x_p\right]\right) = \beta_0 + \sum_{j=1}^{p} s_j\left(x_j\right) + \sum_{j \neq k} s_{j,k}\left(x_j, x_k\right), \qquad (3)$$

where $s_{j,k}\left(x_j, x_k\right)$ are pairwise interaction terms. Pairwise interactions are still intelligible since they can be visualized using simple plots, for example, heat maps. Another noteworthy extension of GAMs is sparse additive models (SpAM), proposed by Ravikumar et al. (2007). SpAMs combine the ideas of Hastie and Tibshirani (1986) with sparse linear modeling for high-dimensional regression problems (Zou & Hastie, 2005). In addition to shape functions $s_j(\cdot)$, the authors introduce a weight vector $\mathbf{b} \in \mathbb{R}^p$ multiplied with the outputs of $s_j(\cdot)$ and penalize its norm in the loss function. In this way, SpAMs rely on a sparse subset of features while still preserving the additive structure of the classical GAMs. To further improve the scalability and performance of GAMs on large datasets and their modularity, Agarwal et al. (2021) and Chang et al. (2021) introduce neural generalized additive models that rely on neural networks as building blocks. Specialized cases of this model class address specific modeling tasks or induce additional inductive biases, for example, the application to survival analysis by Utkin et al. (2022) or, similar to SpAMs, the sparse neural additive models proposed by S. Xu et al. (2022). Another line of work focuses on improving the interactivity and actionability of GAMs via scalable and accessible visual diagnostics and editing programs (Fasiolo et al., 2020).
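As an illustration, the following is a minimal sketch of fitting a classification GAM with spline shape functions and plotting them, using the pyGAM library with which Figure 2 was generated (Servén & Brummitt, 2018). The synthetic dataset and the feature semantics are illustrative assumptions.

```python
# A minimal sketch of fitting a GAM (Equation 2) and visualizing shape functions.
import matplotlib.pyplot as plt
import numpy as np
from pygam import LogisticGAM, s

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))            # e.g., body temperature, WBC count
y = (X[:, 0] + np.sin(2 * X[:, 1]) + rng.normal(size=500) > 0).astype(int)

# One smooth (spline) shape function per feature; no interaction terms.
gam = LogisticGAM(s(0) + s(1)).fit(X, y)

# Plot each shape function s_j with approximate 95% confidence intervals.
for j, term in enumerate(gam.terms):
    if term.isintercept:
        continue
    XX = gam.generate_X_grid(term=j)
    pdep, ci = gam.partial_dependence(term=j, X=XX, width=0.95)
    plt.plot(XX[:, j], pdep)
    plt.plot(XX[:, j], ci, ls="--")
    plt.show()
```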

| Sparse-input neural networks
In many high-dimensional regression and classification problems, for example, genomic data analysis (Lucas et al., 2006) and social network modeling (Ravazzi et al., 2018), sparsity is an important inductive bias that allows producing parsimonious interpretable models. We have already mentioned sparsity as a desirable property when describing supersparse linear integer models (Section 4.2.2). However, the predictive performance of SLIMs could be limited by their assumption of linearity. Recently, there have been renewed efforts in leveraging sparsity-inducing regularization to understand and control the behavior of neural network models (Feng & Simon, 2017; Khanna & Tan, 2020; Lu et al., 2018; Tank et al., 2021; Valdes et al., 2021). Significant advantages of neural networks are their ability to model complex nonlinear relationships and their scalability to large datasets and unstructured data types, such as text and images.

Feng and Simon (2017) provide a thorough theoretical and empirical analysis of sparse-input neural networks (SPINN) in the context of $p \gg N$ problems (Hastie et al., 2009). SPINNs are fully connected neural networks characterized by sparse weights in the input layer and are trained by minimizing the following loss function:

$$\frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\left(f\left(\mathbf{x}_i\right), y_i\right) + \lambda \sum_{j=1}^{p} \Omega_{\alpha}\left(\mathbf{W}^{(1)}_{\cdot, j}\right), \qquad (4)$$

where $\mathbf{W}^{(1)}_{\cdot, j}$ refers to the $j$-th column of the input layer weight matrix, and $\Omega_{\alpha}\left(\boldsymbol{\beta}\right) = \left(1 - \alpha\right) \left\lVert \boldsymbol{\beta} \right\rVert_1 + \alpha \left\lVert \boldsymbol{\beta} \right\rVert_2$ with $\alpha \in (0, 1)$ is the sparse group Lasso penalty (Simon et al., 2013). Here, the parameter $\alpha$ controls the trade-off between the element-wise Lasso and the group Lasso penalties. Simply put, this penalty ensures that all input weights corresponding to a single feature are shrunk together, allowing for feature selection in the style of the classical Lasso (Figure 3). Feng and Simon (2017) prove probabilistic, finite-sample generalization guarantees for this model class and demonstrate performance gains empirically for high-dimensional data with higher-order interactions compared to other nonparametric models. Tank et al. (2021) leverage similar penalties in the context of autoregressive time series modeling. Khanna and Tan (2020) apply the approach to a more advanced neural network architecture, namely, the long short-term memory (Hochreiter & Schmidhuber, 1997).
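The following is a minimal PyTorch sketch of the sparse group Lasso penalty on the input layer, in the spirit of SPINN (Equation 4). The hyperparameter values, layer sizes, and the plain gradient step are simplifying assumptions; in practice, proximal updates are preferable for obtaining exact zeros.

```python
# A minimal sketch of the SPINN-style sparse group Lasso input penalty.
import torch
import torch.nn as nn

class SparseInputNet(nn.Module):
    def __init__(self, p, hidden=32):
        super().__init__()
        self.input_layer = nn.Linear(p, hidden)
        self.body = nn.Sequential(nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.body(self.input_layer(x))

def sparse_group_lasso(W, alpha):
    # W has shape (hidden, p); column j holds all input weights of feature j.
    # Omega_alpha(beta) = (1 - alpha) * ||beta||_1 + alpha * ||beta||_2,
    # summed over the p feature-wise weight groups, as in Equation (4).
    l1 = W.abs().sum()
    l2 = W.norm(dim=0).sum()  # column-wise (group) Euclidean norms
    return (1 - alpha) * l1 + alpha * l2

model = SparseInputNet(p=100)
x, y = torch.randn(64, 100), torch.randn(64, 1)
pred_loss = nn.functional.mse_loss(model(x), y)
loss = pred_loss + 1e-2 * sparse_group_lasso(model.input_layer.weight, alpha=0.5)
loss.backward()
```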

| Knockoff features
FIGURE 2 Example of a visual interpretation of a generalized additive model (GAM) for predicting the risk of appendicitis among pediatric patients based on tabular data comprising clinical, laboratory, scoring, and ultrasonography variables. The plots depict shape functions $s_j(\cdot)$ for two features present in the dataset: (a) body temperature and (b) white blood cell count. Functions are plotted as solid lines; 95% confidence intervals (CI) are plotted as dashed lines. Observe that the GAM predicts higher probabilities of appendicitis in children with higher body temperatures and white blood cell counts. The plots were generated using the pyGAM library (Servén & Brummitt, 2018).

One could argue that the ultimate goal of SPINNs (Feng & Simon, 2017) is "deep" feature selection. In a similar vein, Lu et al. (2018) propose another solution, deep feature selection using paired-input nonlinear knockoffs (DeepPINK),
which leverages knockoff filters (Barber & Candès, 2015) to facilitate interpretability and sparsity in deep neural networks at the input level. A compelling advantage of this technique over SPINNs is that it controls the false discovery rate (FDR) (Benjamini & Hochberg, 1995) when selecting significant features. Knockoff filters were initially proposed by Barber and Candès (2015). In brief, the knockoff filter is a variable selection procedure that controls the FDR exactly in a linear model in finite-sample settings, whenever there are at least as many observations as features. The key idea is to construct knockoff features mimicking the dependency structure of the original features; augment the original dataset with the knockoffs; and compare statistics for each original feature and its knockoff, for example, the absolute value of the regression coefficient. In this way, variables with genuine signals can be identified while controlling the FDR. Candès et al. (2018) propose "model-X" knockoffs, which rely on the assumption that the joint distribution of the variables is known, without assuming anything about the distribution of the output conditional on the features. A significant limitation is that the generation of knockoff features is based on a known multivariate Gaussian distribution. Jordon et al. (2019) alleviate this issue by introducing KnockoffGAN, a generative adversarial network (GAN) (Goodfellow et al., 2014) for knockoff generation capable of producing more complex dependency structures. Along similar lines, following the model-X framework, Romano et al. (2019) propose constructing knockoff features with deep generative models utilizing the maximum mean discrepancy (MMD) (Y. Li et al., 2015).
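To make the selection step concrete, below is a minimal numpy sketch of the knockoff filter (Barber & Candès, 2015), assuming the knockoff features have already been constructed (e.g., with model-X Gaussian knockoffs or KnockoffGAN). The lasso-based statistic and all names are illustrative assumptions.

```python
# A minimal sketch of knockoff-based variable selection with FDR control.
import numpy as np
from sklearn.linear_model import Lasso

def knockoff_select(X, X_knockoff, y, fdr=0.1):
    p = X.shape[1]
    beta = Lasso(alpha=0.05).fit(np.hstack([X, X_knockoff]), y).coef_
    # W_j compares the signal of feature j against its knockoff copy.
    W = np.abs(beta[:p]) - np.abs(beta[p:])
    # Knockoff+ threshold: smallest t with estimated FDP below the target.
    for t in np.sort(np.abs(W[W != 0])):
        fdp = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp <= fdr:
            return np.where(W >= t)[0]  # indices of selected features
    return np.array([], dtype=int)
```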

| Varying-coefficient models
While GAMs (Section 4.2.3) generalize the linear model by allowing for nonlinearities in individual features, varying-coefficient models (VCM), proposed by Hastie and Tibshirani (1993), offer a different sort of generalization. In a VCM, variable coefficients vary smoothly with so-called "effect modifiers": additional, potentially exogenous, variables $r_1, r_2, \ldots, r_p$:

$$f(\mathbf{x}) = \beta_0 + \sum_{j=1}^{p} \beta_j\left(r_j\right) x_j, \qquad (5)$$

wherein $\beta_j(\cdot)$ is a smooth function corresponding to the varying coefficient of the $j$-th feature. The choice of variables $r_1, \ldots, r_p$ depends on a particular application. Note that $r_j$ may coincide with the features $x_j$ or correspond to some additional attributes. For example, in dynamical systems, time can be a single effect modifier, producing time-varying coefficients.
Notably, VCMs are only locally interpretable since coefficients vary across data points, that is, interpretations may differ wildly between different instances. By contrast, all of the model classes described before are globally interpretable: the insights gained from inspecting model parameters are equally applicable to all data points. While the interpretation of locally interpretable models is far more cumbersome, this trade-off may be necessary in pursuit of a more flexible model with personalized interpretations. Several more recent interpretable ML models (Al-Shedivat et al., 2020; Alvarez-Melis & Jaakkola, 2018) discussed in the following sections bear a striking resemblance to VCMs: they essentially generalize this conventional statistical framework to unstructured data types, such as images, by parameterizing $\beta_j(\cdot)$ with neural networks.

FIGURE 3 A graphical representation of a fully connected neural network with $p$ input variables $x_1, \ldots, x_p$. Input weights corresponding to the feature $x_2$ are shown as bold dashed lines. If $x_2$ were not useful for predicting the output, a sparse-input network would shrink all of its input weights $\mathbf{W}^{(1)}_{\cdot, 2}$ toward 0, thus deselecting $x_2$ completely.

| Contextual explanation networks
We have seen before how sparsity could be introduced in fully connected neural networks and feature selection could be performed (Sections 4.2.4 and 4.2.5). Nevertheless, quantifying and explaining the contributions of individual inputs to the predictions in neural networks is not straightforward due to entangled interactions in downstream layers of a network (Guerguiev et al., 2017; Tank et al., 2021). While there has been a substantial effort and progress in explaining neural network models post hoc (Section 4.3.1), that is, after training, a few lines of research instead focus on building interpretable neural network architectures whose structure is decomposable and whose parameters can be interpreted directly to produce local explanations.
For instance, Al-Shedivat et al. (2020) introduce contextual explanation networks (CEN): a class of neural network architectures that jointly predict and explain their predictions without requiring additional model introspection. CENs can be defined as deep probabilistic models for learning the conditional distribution of the output variables $P_{\mathbf{w}}\left(Y \mid \mathbf{x}, \mathbf{c}\right)$, parameterized by $\mathbf{w}$, where $\mathbf{c} \in \mathcal{C}$ are context variables observed in addition to the features $\mathbf{x} \in \mathcal{X}$, and $y \in \mathcal{Y}$ are outputs, to be predicted given $\mathbf{x}$ and $\mathbf{c}$. The probabilistic model is then specified by

$$P_{\mathbf{w}}\left(Y \mid \mathbf{x}, \mathbf{c}\right) = \int_{\Theta} P\left(Y \mid \mathbf{x}, \boldsymbol{\theta}\right) P_{\mathbf{w}}\left(\boldsymbol{\theta} \mid \mathbf{c}\right) d\boldsymbol{\theta}, \qquad (6)$$

where $P\left(Y \mid \mathbf{x}, \boldsymbol{\theta}\right)$ is a predictive model parameterized by $\boldsymbol{\theta}$ that explicitly relates features to the outputs. Thus, parameters $\boldsymbol{\theta} \in \Theta$ can be seen as an explanation of a model's prediction that is specific to the context given by variables $\mathbf{c}$. For example, Al-Shedivat et al. (2020) consider the problem of poverty prediction based on categorical variables from living standards measurement surveys, with the context given by satellite images. In practice, $P_{\mathbf{w}}\left(\boldsymbol{\theta} \mid \mathbf{c}\right)$ is replaced with an encoder neural network, and the predictive distribution $P\left(Y \mid \mathbf{x}, \boldsymbol{\theta}\right)$ is parameterized by an interpretable function, for example, a linear model $f_{\boldsymbol{\theta}}(\mathbf{x}) = \mathrm{softmax}\left(\boldsymbol{\theta}^{\mathsf{T}} \mathbf{x}\right)$.

CENs are closely related to VCMs (Hastie & Tibshirani, 1993). In fact, they can be seen as a special case wherein the context variables $\mathbf{c}$ (Equation 6) act as the effect modifiers for the features $\mathbf{x}$ (cf. Equation 5). The principal contribution of CENs is to cast the VCMs into a probabilistic framework and parameterize the coefficients with neural networks. The authors demonstrate the efficacy of their approach on classification and survival analysis tasks. They show that CENs are still interpretable in datasets with noisy features where post hoc explanation techniques (Section 4.3) are often inconsistent and misleading.
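The following is a minimal PyTorch sketch of a CEN-style architecture: an encoder maps the context to per-instance linear coefficients, which are then applied to the interpretable features. The architecture sizes and the deterministic (point-estimate) encoder are simplifying assumptions relative to the full probabilistic formulation in Equation (6).

```python
# A minimal sketch of a contextual explanation network (CEN).
import torch
import torch.nn as nn

class ContextualExplanationNet(nn.Module):
    def __init__(self, context_dim, p, n_classes):
        super().__init__()
        # Encoder produces theta, the explanation, conditioned on context c.
        self.encoder = nn.Sequential(
            nn.Linear(context_dim, 64), nn.ReLU(),
            nn.Linear(64, p * n_classes),
        )
        self.p, self.n_classes = p, n_classes

    def forward(self, x, c):
        theta = self.encoder(c).view(-1, self.p, self.n_classes)
        logits = torch.einsum("bp,bpk->bk", x, theta)  # linear model in x
        return logits.softmax(dim=-1), theta  # prediction and its explanation

model = ContextualExplanationNet(context_dim=128, p=20, n_classes=2)
x, c = torch.randn(8, 20), torch.randn(8, 128)   # features and contexts
probs, theta = model(x, c)  # theta[i] explains the prediction for instance i
```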

| Self-explaining neural networks
Another class of functions related to VCMs was introduced by Alvarez-Melis and Jaakkola (2018). Similarly to Al-Shedivat et al. (2020), the authors develop an intrinsically interpretable neural network model that allows disentangling contributions of individual features or basis concepts. Self-explaining neural networks (SENN) (Alvarez-Melis & Jaakkola, 2018) are motivated by (i) explicitness, (ii) faithfulness, and (iii) stability, three desiderata for interpretability. The authors claim that SENNs are (i) explicit because their explanations are "immediate" and "understandable," (ii) faithful because explanations reflect the ground truth relationship between the basis concepts and outputs, and (iii) stable because their explanations are consistent for similar data points.
SENNs act like a simple model locally but can be highly complex and nonlinear globally. In their most basic form, SENNs are given by

$$f(\mathbf{x}) = \theta(\mathbf{x})^{\mathsf{T}} \mathbf{x}, \qquad (7)$$

where $\theta(\cdot)$ is a neural network with $p$ outputs, referred to as generalized coefficients. Without further restrictions, the model in Equation (7) is not more interpretable than a classical multilayer neural network. Therefore, SENNs are encouraged to be locally linear: it needs to hold that $\nabla_{\mathbf{x}} f(\mathbf{x}) \approx \theta\left(\mathbf{x}_0\right)$ for all $\mathbf{x}$ in the neighborhood of $\mathbf{x}_0$. Under this constraint, individual components of $\theta(\mathbf{x})$ act as interpretable and adaptive regression coefficients. Further extensions of the model in Equation (7) are possible. For example, instead of raw features, one can introduce basis concepts $h(\mathbf{x}): \mathbb{R}^p \to \mathbb{R}^k$ and use them alongside the generalized coefficients: $f(\mathbf{x}) = \theta(\mathbf{x})^{\mathsf{T}} h(\mathbf{x})$. Furthermore, depending on the ground task, some generalized interpretable link function $g(\cdot)$ can be used, resulting in the refined model definition below (see Figure 4 for a schematic visualization):

$$f(\mathbf{x}) = g\left(z_1, z_2, \ldots, z_k\right), \qquad (8)$$

where $z_j = \theta(\mathbf{x})_j\, h(\mathbf{x})_j$ is the influence score, or importance, of the $j$-th concept for data point $\mathbf{x}$. In practice, a SENN in Equation (8) is trained by minimizing the following gradient-regularized loss function that balances performance with interpretability:

$$\mathcal{L}_y\left(f(\mathbf{x}), y\right) + \lambda\, \mathcal{L}_{\theta}\left(f(\mathbf{x})\right), \qquad (9)$$

where $\mathcal{L}_y\left(f(\mathbf{x}), y\right)$ is a loss term for the ground classification or regression task, for example, the mean squared error or the cross entropy; $\lambda > 0$ is a regularization parameter; and $\mathcal{L}_{\theta}\left(f(\mathbf{x})\right)$ is the gradient penalty

$$\mathcal{L}_{\theta}\left(f(\mathbf{x})\right) = \left\lVert \nabla_{\mathbf{x}} f(\mathbf{x}) - \theta(\mathbf{x})^{\mathsf{T}} J_h^{\mathbf{x}} \right\rVert, \qquad (10)$$

where $J_h^{\mathbf{x}}$ is the Jacobian of $h(\cdot)$ with respect to $\mathbf{x}$. Alvarez-Melis and Jaakkola (2018) postulate further desirable properties that SENNs should satisfy:

1. The link function $g(\cdot)$ is monotonic and additively separable in its arguments (see Sections 4.2.1 and 4.2.3).

2. The number of concepts $k$ is small.
In addition, the authors emphasize three guiding criteria for choosing interpretable basis concepts: (i) fidelity: representations should contain relevant context information; (ii) diversity: concepts used to represent inputs should be few and nonoverlapping; and (iii) grounding: concepts should be immediately understandable to a human. Moreover, they demonstrate how such representations can be learnt using autoencoder neural networks in an end-to-end manner in conjunction with the SENN model. The class of functions described by the assumptions above is quite broad; for example, generalized linear models (Nelder & Wedderburn, 1972) and the nearest neighbor classifier satisfy them. Nevertheless, the advantage of SENNs stems from the richness of neural network architectures that can be used for the functions $\theta(\cdot)$ and $h(\cdot)$. Like CENs (Section 4.2.7), SENNs are closely related to varying-coefficient models (Section 4.2.6). The main difference is that in SENNs, the regressors themselves act as effect modifiers and the framework is augmented with interpretable basis concepts defined on top of the raw inputs. Notably, all three model classes (Al-Shedivat et al., 2020; Alvarez-Melis & Jaakkola, 2018; Hastie & Tibshirani, 1993) described so far hold the promise of local interpretability while providing room for predictively powerful models.
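Below is a minimal PyTorch sketch of the SENN gradient penalty from Equation (10), computed with automatic differentiation. The single-output regression setting, the identity link, and the small concept and coefficient networks are simplifying assumptions.

```python
# A minimal sketch of the SENN robustness (gradient) penalty, Equation (10).
import torch
import torch.nn as nn

p, k = 10, 4
theta_net = nn.Sequential(nn.Linear(p, 32), nn.ReLU(), nn.Linear(32, k))
h_net = nn.Sequential(nn.Linear(p, 32), nn.ReLU(), nn.Linear(32, k))

def senn_forward(x):
    return (theta_net(x) * h_net(x)).sum(-1)  # f(x) = theta(x)^T h(x)

x = torch.randn(p, requires_grad=True)
f = senn_forward(x)

# grad_f approximates nabla_x f(x); J_h is the Jacobian of h(.) w.r.t. x.
grad_f = torch.autograd.grad(f, x, create_graph=True)[0]
J_h = torch.autograd.functional.jacobian(lambda z: h_net(z), x, create_graph=True)
penalty = (grad_f - theta_net(x) @ J_h).norm()  # L_theta(f(x))

loss = (f - 1.0).pow(2) + 2e-4 * penalty  # task loss + lambda * penalty
loss.backward()
```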

| Attentive mixtures of experts
In natural language processing, the attention mechanism (Vaswani et al., 2017) has become a powerful tool for exploring relationships between inputs and outputs of deep neural networks and is utilized both for interpretability and for performance. Nevertheless, several works have criticized the naïve use of attention for model interpretation (Jain & Wallace, 2019; Serrano & Smith, 2019), showing that it is often uncorrelated with gradient information and other natural feature importance measures. Some works have focused on improving the attention mechanism, particularly for interpretability, for example, the models by Nauta et al. (2019) and the attentive mixture of experts (AME) by Schwab et al. (2019).

The AME is a neural network model comprising several connected "experts," that is, subnetworks. In condensed form, the AME prediction is given by

$$\hat{y} = \sum_{j=1}^{p} a_j\, c_j, \qquad (11)$$

where $c_j = E_j\left(x_j\right)$ is the output of the $j$-th expert subnetwork given the input variable $x_j$, and $a_j$ is the output of the $j$-th attentive gating network $G_j(\cdot)$, quantifying the importance of the $j$-th feature. Within the gating networks, $\mathbf{h}_j$ denotes a hidden representation from the $j$-th expert subnetwork, vector $\mathbf{u}_j$ is a projected representation of $\mathbf{h}_{\text{all}}$, vector $\mathbf{u}_{s,j}$ is a per-expert learnable context vector, and $\sigma(\cdot)$ is a nonlinear activation function. Notably, the architecture specified by Equation (11) allows disentangling contributions of individual features to predictions, using attentive gating and per-feature subnetworks.
The AME is trained end-to-end by minimizing a loss function augmented with an auxiliary objective. The auxiliary objective encourages the importance score $a_j$ to reflect the decrease in error associated with the contribution of the $j$-th expert, that is, the $j$-th feature, similar in spirit to the definition of the RF variable importance (Breiman, 2001):

$$\Delta\varepsilon_{\mathbf{x},j} = \varepsilon_{\mathbf{x} \setminus \{j\}} - \varepsilon_{\mathbf{x}}, \qquad (12)$$

where $\varepsilon_{\mathbf{x} \setminus \{j\}}$ and $\varepsilon_{\mathbf{x}}$ denote the prediction error of the model without the $j$-th feature and of the full model, respectively. The error difference above can be normalized to produce

$$\omega_{\mathbf{x},j} = \frac{\Delta\varepsilon_{\mathbf{x},j}}{\sum_{l=1}^{p} \Delta\varepsilon_{\mathbf{x},l}}. \qquad (13)$$

FIGURE 4 A schematic depiction of a self-explaining neural network model. Input variables $x_1, \ldots, x_p$ are mapped to generalized coefficients and interpretable basis concepts by neural networks $\theta(\cdot)$ and $h(\cdot)$, respectively. Generalized coefficients and basis concepts are then combined by an interpretable link function $g(\cdot)$ into a predicted value $\hat{y}$.
Ideally, the attentive gating network outputs should be correlated with the feature importance measure defined in Equation (13). Therefore, Schwab et al. (2019) introduce an auxiliary term into the loss function for training the AME. In particular, the authors minimize the discrepancy between the normalized error differences (Equation 13) and the outputs of the attentive gating networks:

$$\frac{1}{N} \sum_{i=1}^{N} D\left(\boldsymbol{\omega}_{\mathbf{x}_i}, \mathbf{a}_{\mathbf{x}_i}\right), \qquad (14)$$

where $D(\cdot, \cdot)$ is some discrepancy measure, for example, the Kullback-Leibler divergence, and $\boldsymbol{\omega}_{\mathbf{x}_i}$ and $\mathbf{a}_{\mathbf{x}_i}$ are the normalized error differences and the attentive gating network outputs, respectively, for the $i$-th data point.
The attentive mixture of experts successfully overcomes limitations of the naïvely trained attention mechanism by introducing a regularization term into the loss function that forces the learnt importance scores to reflect the increase in prediction error incurred by removing a feature. Next to CENs and SENNs (Sections 4.2.7 and 4.2.8), AMEs are yet another class of locally interpretable neural network architectures that produce individualized explanations for each data point at prediction time. However, the relationship between predictions and explanations is, arguably, more opaque within AMEs than in the other model classes discussed so far.
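The auxiliary objective in Equations (12)-(14) can be written compactly. Below is a minimal PyTorch sketch with $D$ instantiated as the Kullback-Leibler divergence; the per-feature errors are assumed precomputed, and all names are illustrative.

```python
# A minimal sketch of the AME auxiliary objective (Equations 12-14).
import torch

def ame_auxiliary_loss(err_full, err_without, attention):
    # err_full: (N,) errors of the full model; err_without: (N, p) errors
    # with feature j withheld; attention: (N, p) gating outputs summing to 1.
    delta = (err_without - err_full.unsqueeze(-1)).clamp_min(1e-8)  # Eq. (12)
    omega = delta / delta.sum(dim=-1, keepdim=True)                 # Eq. (13)
    # D = KL(omega || attention), averaged over data points, as in Eq. (14).
    return (omega * (omega / attention.clamp_min(1e-8)).log()).sum(-1).mean()
```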

| Symbolic regression
Most interpretable models tend to learn some numerical measure of feature importance that can be visualized and interpreted either globally or locally. By contrast, as its name suggests, symbolic regression tries to provide a symbolic interpretation of the data. More formally, symbolic regression is the problem of inferring an analytic form for an unknown function that can only be queried (Amir Haeri et al., 2017; Udrescu & Tegmark, 2020). Although this problem emerged long before the relatively recent interest in explainable and interpretable ML (McKay, 1995), symbolic regression fits perfectly into the broad category of interpretable models. While neural networks and many other models do provide analytic representations, symbolic regression usually seeks parsimonious equations, for example, by further restricting the search to polynomial forms. Thus, symbolic regression can be leveraged to learn interpretable functional relationships from raw data (Jin et al., 2019). For example, when regressing $y$ on features $\mathbf{x}$, the mathematical expression $f(\mathbf{x}) = x_1 + 2\cos\left(x_2\right) + \exp\left(x_3\right) + 0.1$ could be a candidate solution for symbolic regression. The optimization problem behind symbolic regression can be formalized as follows:

$$\min_{f \in \mathcal{F}} \; \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\left(f\left(\mathbf{x}_i\right), y_i\right), \qquad (15)$$

where $\mathcal{F}$ is a set of succinct mathematical expressions and $\mathcal{L}(\cdot, \cdot)$ is the loss function for the ground regression or classification task. In practice, symbolic regression often reduces to combinatorial optimization. Therefore, conventional approaches include genetic programming (Amir Haeri et al., 2017; McKay, 1995) and simulated annealing (Stinstra et al., 2007). Petersen et al. (2021) and Biggio et al. (2021) have proposed neural-network-based solutions to the problem. Namely, Biggio et al. (2021) predict symbolic expressions using a Transformer model pretrained on a large-scale corpus of procedurally generated pairs of input datasets and symbolic equations. Last but not least, we remark that symbolic regression is also helpful for explaining the behavior of black-box machine learning models post hoc; for example, a neural network could be approximated by a symbolic surrogate model (Section 4.3.3).
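As a concrete illustration of the genetic programming approach, the following is a minimal sketch using the gplearn library, an assumption of convenience not mentioned in the text. The data-generating function and hyperparameters are illustrative.

```python
# A minimal sketch of genetic-programming-based symbolic regression.
import numpy as np
from gplearn.genetic import SymbolicRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = X[:, 0] + 2 * np.cos(X[:, 1])  # ground-truth expression to recover

est = SymbolicRegressor(
    population_size=2000,
    generations=20,
    function_set=("add", "sub", "mul", "cos"),
    parsimony_coefficient=0.01,  # penalizes long, non-parsimonious programs
    random_state=0,
)
est.fit(X, y)
print(est._program)  # e.g., add(X0, mul(2.0, cos(X1))), up to GP search noise
```

The parsimony coefficient directly operationalizes the preference for succinct expressions in the set $\mathcal{F}$ of Equation (15).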

| Interpretable representation learning
In the previous sections, we have considered interpretability exclusively in supervised learning and at the level of raw input variables. Sometimes we might want to learn low-dimensional embeddings, or representations, in an unsupervised, weakly-, or semi-supervised setting instead of exploring purely discriminative relationships among raw variables. Representations can be helpful for several downstream applications, usually unknown at the time of representation learning. A desirable property targeted by some representation learning techniques is interpretability. Similarly to the classification and regression settings, interpretability is usually attained by enforcing some constraints on the representations.
Disentanglement is one such constraint: in disentangled representations, separate sets of dimensions are uniquely correlated with salient, semantically meaningful features. In addition to interpretability, disentanglement facilitates the controllable generation of synthetic data. Recently, many deep generative models have been used for disentangled representation learning; for example, X. Chen et al. (2016) demonstrate experimentally that their InfoGAN model, an information-theoretic extension of GANs, learns disentangled representations from image data. Similar results have been attained with variational autoencoders (VAE) (Higgins et al., 2016; Kingma & Welling, 2014, 2019) by introducing statistical independence constraints on embedding dimensions via a factorizing prior distribution. In theory, learning identifiable disentangled representations in a completely unsupervised manner is fundamentally impossible (Locatello et al., 2019). However, this result does not diminish the utility of disentangled representation learning with injected inductive biases and implicit or explicit forms of supervision. For instance, Adel et al. (2018) and Taeb et al. (2022) consider semi-supervised variants of the VAE wherein conditioning the representation on some side information or label helps the disentanglement and interpretability of the generative model. Tschannen et al. (2018) provide a thorough overview of noteworthy advancements and inductive biases in autoencoder-based representation learning. Many of the approaches discussed therein strive toward some form of interpretability, although they often do not state this explicitly.
Disentanglement is not the only approach to interpretable representation learning. Several lines of work have focused on introducing additional supervision to learn representations reflecting high-level concepts useful in classification or regression (Z. Chen et al., 2020; Koh et al., 2020; Marcos et al., 2021). One such approach is concept bottleneck models (N. Kumar et al., 2009; Lampert et al., 2009), recently reexplored by Koh et al. (2020). As opposed to the deep generative models discussed above, concept bottlenecks perform supervised learning. For predicting the output $y$ based on features $\mathbf{x} \in \mathbb{R}^p$, a bottleneck model is given by $f\left(g(\mathbf{x})\right)$, where $g: \mathbb{R}^p \to \mathbb{R}^k$ and $f: \mathbb{R}^k \to \mathbb{R}$. Here, $f(\cdot)$ relies entirely on interpretable concepts $\mathbf{c} = g(\mathbf{x})$, and $g(\cdot)$ is learnt in a supervised manner using additional concept labels. For instance, consider classifying bird images into species based on a set of visual traits defined by ornithologists. Koh et al. (2020) explore a range of strategies for training such models and propose to parameterize $f(\cdot)$ and $g(\cdot)$ by deep neural networks. A significant advantage of a concept bottleneck $f\left(g(\mathbf{x})\right)$ over a black-box $\tilde{f}(\mathbf{x})$ is that, at prediction time, an expert end-user, for example, a medical doctor, can intervene on incorrectly inferred concepts. However, the applicability of concept bottleneck models is limited to areas and tasks where vast domain knowledge is available and where experts can cheaply label instances.
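The following is a minimal PyTorch sketch of a jointly trained concept bottleneck in the spirit of Koh et al. (2020); the architecture sizes, loss weighting, and synthetic tensors are illustrative assumptions.

```python
# A minimal sketch of a concept bottleneck model with test-time intervention.
import torch
import torch.nn as nn

p, k, n_classes = 64, 8, 2
g = nn.Sequential(nn.Linear(p, 32), nn.ReLU(), nn.Linear(32, k))  # x -> concepts
f = nn.Linear(k, n_classes)                                       # concepts -> label

x = torch.randn(16, p)
concept_labels = torch.rand(16, k).round()  # expert-annotated concepts
y = torch.randint(0, n_classes, (16,))

c_logits = g(x)
y_logits = f(torch.sigmoid(c_logits))

# Joint objective: task loss plus concept supervision on the bottleneck.
loss = nn.functional.cross_entropy(y_logits, y) \
     + 0.5 * nn.functional.binary_cross_entropy_with_logits(c_logits, concept_labels)
loss.backward()

# Test-time intervention: an expert overrides a wrongly inferred concept.
c_fixed = torch.sigmoid(c_logits.detach())
c_fixed[:, 0] = 1.0  # e.g., the expert asserts that concept 0 is present
y_intervened = f(c_fixed)
```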
To summarize, the problem of interpretability in representation and, more broadly, unsupervised learning is still under-explored, despite a growing body of research. As seen from the previous sections, many techniques focus exclusively on classification and regression tasks, while deep clustering, generative modeling, and representation learning have attracted comparatively less attention. With the emergence of new socially consequential application domains, a need for interpretable unsupervised learning techniques is becoming apparent.

| Explanation methods
We now turn toward a completely different family of methods. According to Rudin (2019), explainable ML focuses on introspection for existing black-box models, for instance, by training a simpler surrogate model post hoc. As seen before, explanations can take various forms: textual, visual, symbolic, and so forth. Even a data point from the training set or a synthetic data point can serve as an explanation (Lipton, 2018). An explanation can be global, that is, characterizing the whole dataset, or local, that is, explaining individual classification or regression outcomes (Carvalho et al., 2019; Molnar, 2020). It can be model-specific, that is, capable of explaining only a specific class of models, or model-agnostic, that is, applicable to an arbitrary model. Carvalho et al. (2019) discuss several desirable properties of an explanation technique: (i) faithfulness: an explanation should be faithful to the original black-box model, that is, it should in some way accurately predict the behavior of the black-box; (ii) consistency and stability: explanations for different models tackling the same task should be consistent, and explanations for similar data points should be similar; (iii) comprehensibility: the end-user should be able to comprehend explanations easily; (iv) certainty and novelty: explanations should convey (un-)certainty about predictions and should warn the end-user if the data point considered is "far away" from the support of the training set; (v) representativeness: explanations should "cover" the training data evenly, particularly for prototype-based explanation techniques (Kim et al., 2014, 2016). Table 8 further expands on the list of salient characteristics and presents the explanation methods overviewed in the following sections.

TABLE 8 Salient characteristics of the explanation methods overviewed in the following sections. Note: "•" and "J" denote global and local explanation methods, respectively; "✓" denotes that a property (columns) is satisfied by a technique (rows); "~" denotes that a property either holds partially or that a method could be easily extended to satisfy the property.

| Attribution methods
Arguably, the family of explanation techniques used most frequently in practical applications is attribution methods. Sundararajan et al. (2017) define an attribution as follows: for a function $f: \mathbb{R}^p \to [0, 1]$ representing a black-box binary classifier and an input $\mathbf{x} \in \mathbb{R}^p$, an attribution for $\mathbf{x}$ with respect to some reference $\mathbf{x}_0$, also referred to as a baseline (some attribution techniques do not require a reference sample $\mathbf{x}_0$), is given by

$$\mathbf{a} = \left(a_1, a_2, \ldots, a_p\right)^{\mathsf{T}} \in \mathbb{R}^p, \qquad (16)$$

wherein $a_j$ quantifies the "contribution" of the feature $x_j$ to the prediction made by the model $f(\cdot)$ for data point $\mathbf{x}$. We will adhere to the definition and notation above throughout this section. The use of attributions as a model diagnostic has become ubiquitous in ML applications (Arcadu et al., 2019; Kelley et al., 2018; Y. Liu et al., 2020; Parsa et al., 2020), especially for image data, since attributions can be readily visualized as a heat map and are helpful for understanding on which input regions the model "concentrates." Figure 5 shows an example of an attribution heat map for a deep neural network classification model trained to predict appendicitis in children based on ultrasound images.
Most recent attribution techniques focus specifically on explaining deep neural network models, and many of them implicitly or explicitly rely on gradient information to produce attributions. Ancona et al. (2019) distinguish two categories of attribution methods: (i) sensitivity-based methods quantify how strongly the output of the model $f(\cdot)$ changes if an input variable is perturbed, whereas (ii) salience-based methods quantify marginal effects of features on the output of $f(\cdot)$ compared to some baseline, for example, the same input but with the feature of interest masked or removed. Below we describe a few archetypal examples of attribution techniques.
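As a minimal illustration of the two categories, the sketch below computes a gradient-based sensitivity score and a masking-based salience score for an assumed toy logistic model with a zero baseline; neither the model nor the masking scheme comes from Ancona et al. (2019).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy differentiable "black box": logistic regression with fixed weights.
w, b = np.array([1.5, -2.0, 0.5]), 0.1
f = lambda x: sigmoid(x @ w + b)

x = np.array([0.8, 0.3, -1.2])
baseline = np.zeros_like(x)  # reference input with all features "removed"

# (i) Sensitivity: magnitude of the gradient of f at x.
grad = f(x) * (1 - f(x)) * w          # analytic gradient of sigmoid(w.x + b)
sensitivity = np.abs(grad)

# (ii) Salience: marginal effect of masking each feature to the baseline.
salience = np.array([
    f(x) - f(np.where(np.arange(x.size) == j, baseline, x))
    for j in range(x.size)
])

print("sensitivity:", sensitivity)
print("salience:   ", salience)
```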

LIME
Local interpretable model-agnostic explanations (LIME), introduced by Ribeiro et al. (2016), seek interpretable data representations that are faithful to the given black-box classifier $f(\cdot)$. The authors define an explanation $\xi(\cdot)$ for a data point $x$ as follows:

$$\xi(x) = \operatorname*{arg\,min}_{g \in G} \; \mathcal{L}\left(f, g, \pi_x\right) + \Omega(g),$$

where $G$ is a class of surrogate models used for explaining the black box; $\mathcal{L}(\cdot, \cdot, \cdot)$ is the fidelity function quantifying the loss for $g$ approximating $f(\cdot)$ within the neighborhood of $x$ given by $\pi_x$; and $\Omega(\cdot)$ is a model complexity penalty. Essentially, $\mathcal{L}(\cdot, \cdot, \cdot)$ is a locality-aware loss function and, in practice, can be minimized in a model-agnostic manner, that is, regardless of the model class of the original black box. Usually, $G$ is chosen to be a constrained class of intrinsically interpretable models (Section 4.2), for example, linear models or GAMs. Put simply, LIME trains many interpretable surrogate models to approximate a black-box model $f(\cdot)$ locally. During training, instances are sampled around each data point $x_i$, weighted by $\pi_{x_i}$. In addition to local explanations given by $\xi(\cdot)$, Ribeiro et al. (2016) introduce a procedure for obtaining a global understanding of the model $f(\cdot)$: given a limited budget, their algorithm picks several explanations based on greedy submodular optimization (Krause & Golovin, 2014) and aggregates them into global variable importances, similar to the random forest feature importance (Breiman, 2001).
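A bare-bones version of this recipe can be written in a few lines, under simplifying assumptions: Gaussian perturbations, an exponential locality kernel for $\pi_x$, and a ridge penalty standing in for $\Omega(\cdot)$; the black box and all hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Black box to explain: any function R^p -> [0, 1] will do here.
f = lambda X: 1.0 / (1.0 + np.exp(-(np.sin(X[:, 0]) + X[:, 1] ** 2)))

def lime_explain(f, x, n_samples=500, sigma=0.75, l2=1e-3):
    """Fit a locally weighted linear surrogate around x."""
    Z = x + sigma * rng.normal(size=(n_samples, x.size))   # perturb around x
    pi = np.exp(-np.sum((Z - x) ** 2, axis=1) / sigma**2)  # locality kernel
    Zb = np.hstack([Z - x, np.ones((n_samples, 1))])       # centered + bias
    W = np.diag(pi)
    # Weighted ridge regression: (Z'WZ + l2 I)^-1 Z'W f(Z)
    coef = np.linalg.solve(Zb.T @ W @ Zb + l2 * np.eye(Zb.shape[1]),
                           Zb.T @ W @ f(Z))
    return coef[:-1]  # local linear weights = the explanation

x = np.array([0.5, -1.0])
print("local surrogate weights:", lime_explain(f, x))
```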
DeepLIFT
Shrikumar et al. (2017) introduce an efficient method for disentangling contributions of inputs in a neural network: deep learning important features (DeepLIFT). Compared to LIME, DeepLIFT is not model-agnostic, since it is explicitly tailored to neural networks; it also requires a reference, or baseline, data point. While for natural images an all-black image is typically used as a baseline input, the choice of a reference might not be so trivial for more specialized datasets and could affect the attribution (Srinivas & Fleuret, 2019).
Let $t$ denote the activation of the neuron of interest, usually one of the output neurons, and let $\eta_1, \eta_2, \dots, \eta_K$ be intermediate neurons, potentially from several layers, that suffice to compute $t$. Let $\Delta t = t - t^0$ be the difference between $t$ and a reference output $t^0$. We then seek to assign contribution scores $C_{\Delta\eta_i \Delta t}$ so that they satisfy the so-called summation-to-delta property:

$$\sum_{i=1}^{K} C_{\Delta\eta_i \Delta t} = \Delta t. \quad (17)$$

An intuitive interpretation of the equation above is that $C_{\Delta\eta_i \Delta t}$ is the amount of "blame" for the difference in outputs assigned to the difference in the activation of the $i$-th intermediate neuron. Since the neurons $\eta_1, \eta_2, \dots, \eta_K$ suffice to compute $t$, the differences in their activations $\Delta\eta_1, \Delta\eta_2, \dots, \Delta\eta_K$ should suffice to explain the difference $\Delta t$. Notably, $C_{\Delta\eta_i \Delta t}$ need not be 0 when $\partial t / \partial \eta_i = 0$ and, thus, can yield insights very different from those of gradient-based measures. By analogy with the partial derivative, Shrikumar et al. (2017) define a multiplier as follows:

$$m_{\Delta\eta \Delta t} = \frac{C_{\Delta\eta \Delta t}}{\Delta\eta}. \quad (18)$$

In practice, we may not necessarily be interested in the contributions of hidden units $\eta_1, \eta_2, \dots, \eta_K$. Therefore, the authors instead consider the following definition of multipliers for input features, which is consistent with the summation-to-delta property (Equation 17):

$$m_{\Delta x_j \Delta t} = \sum_{i=1}^{K} m_{\Delta x_j \Delta\eta_i}\, m_{\Delta\eta_i \Delta t}. \quad (19)$$

Equation (19) is informally referred to as the chain rule for multipliers. The authors propose several propagation rules for computing $C_{\Delta\eta_i \Delta t}$, which, alongside the summation-to-delta and chain rule properties, are then used to compute $m_{\Delta x_j \Delta t}$. The choice of propagation rules is not set in stone, and more complex or specialized neural network architectures require adaptations to the original DeepLIFT approach.
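The following sketch applies the Rescale rule, one of the propagation rules proposed by Shrikumar et al. (2017), to an assumed one-hidden-layer ReLU network and verifies summation-to-delta numerically; the weights and the reference input are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

# Tiny ReLU network t = v . relu(W x + b) + c, with random fixed weights.
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)
v, c = rng.normal(size=4), 0.2

def forward(x):
    z = W @ x + b
    y = np.maximum(z, 0.0)
    return z, y, v @ y + c

x, x0 = np.array([1.0, -0.5, 2.0]), np.zeros(3)   # input and reference
z, y, t = forward(x)
z0, y0, t0 = forward(x0)

# Rescale rule: the multiplier of each ReLU is dy / dz (0/0 guarded).
dz, dy = z - z0, y - y0
denom = np.where(np.abs(dz) > 1e-12, dz, 1.0)
m_relu = np.where(np.abs(dz) > 1e-12, dy / denom, 0.0)

# Chain rule for multipliers: x -> z (weights), z -> y (rescale), y -> t (v).
m_x_t = W.T @ (m_relu * v)          # m_{dx_j, dt}
contributions = m_x_t * (x - x0)    # C_{dx_j, dt}

print("sum of contributions:", contributions.sum())
print("delta t:             ", t - t0)   # the two should match
```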

SHAP
A framework of Shapley additive explanations (SHAP) (Lundberg & Lee, 2017) builds on Shapley regression values (Lipovetsky & Conklin, 2001), inspired by the game-theoretic concept of Shapley values (Hart, 1989). For the $j$-th feature, the Shapley regression value at data point $x$ is given by

$$\phi_j(x) = \sum_{S \subseteq \mathcal{F} \setminus \{j\}} \frac{|S|!\,\left(p - |S| - 1\right)!}{p!} \left[ f_{S \cup \{j\}}\left(x_{S \cup \{j\}}\right) - f_S\left(x_S\right) \right], \quad (20)$$

where $\mathcal{F} = \{1, \dots, p\}$ corresponds to the set of all input variables; $x_S$ is a feature vector composed of the components of $x$ that are in $S \subseteq \mathcal{F}$; and $f_S(\cdot)$ is a model trained only on the features from the set $S$. Intuitively, $\phi_j(x)$ quantifies the change in the output of the model resulting from adding the $j$-th variable to the set of features. Since there are exponentially many subsets of $\mathcal{F} \setminus \{j\}$, in practice, Equation (20) is not evaluated exactly but approximated, for example, by sampling subsets randomly. Lundberg and Lee (2017) propose a model-agnostic kernel approximation of the Shapley regression values described above. There also exist model-specific implementations of SHAP, for example, for decision trees and gradient boosted decision trees (Lundberg et al., 2020). A compelling advantage of SHAP is the generality of its formulation and elegant connections to statistical regression models and cooperative game theory. Moreover, both LIME and DeepLIFT described before are special cases of the SHAP framework that resort to model-specific approximations of Equation (20).

FIGURE 5  An example of attribution in medical image classification. (a) A raw appendix ultrasound image from a pediatric patient admitted to a hospital with suspected appendicitis. (b) The corresponding attribution map, overlaid with the raw image, produced using the Grad-CAM method (Selvaraju et al., 2017) for a deep neural network classifier predicting patients' diagnoses. Red denotes higher attribution values, that is, higher "importance" of pixels, whereas blue denotes lower values. According to the attribution map, the classifier concentrates on the region around the appendix.
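For intuition, the sketch below estimates Equation (20) by sampling permutations; instead of retraining $f_S$ for every subset, "absent" features are fixed to a baseline value, a common practical simplification that is not Lundberg and Lee's kernel estimator. The toy function and baseline are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Black box on p = 4 features; baseline values stand in for "absent" features.
f = lambda x: x[0] * x[1] + 2.0 * x[2] + np.tanh(x[3])
baseline = np.zeros(4)

def shapley_mc(f, x, baseline, n_perm=2000):
    """Monte Carlo Shapley values via random feature orderings."""
    p = x.size
    phi = np.zeros(p)
    for _ in range(n_perm):
        order = rng.permutation(p)
        z = baseline.copy()
        prev = f(z)
        for j in order:
            z[j] = x[j]                 # add feature j to the coalition
            curr = f(z)
            phi[j] += curr - prev       # marginal contribution of j
            prev = curr
    return phi / n_perm

x = np.array([1.0, 2.0, -0.5, 0.3])
phi = shapley_mc(f, x, baseline)
print("Shapley values:", phi)
print("sum vs f(x) - f(baseline):", phi.sum(), f(x) - f(baseline))
```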
Follow-up work has explored other explanation methods derived from the concept of the Shapley value and cooperative game theory, for example, integrated gradients (Sundararajan et al., 2017), Shapley values for individual neurons (Ghorbani & Zou, 2020), or the least core (Yan & Procaccia, 2021), based on a different solution concept. Rozemberczki et al. (2022) provide an in-depth overview of cooperative game theory and numerous applications of the Shapley value in machine learning.

Integrated gradients
Sundararajan et al. (2017) introduce another attribution method: integrated gradients (IG). They are motivated by the two following axioms. The sensitivity axiom posits that (i) if an input differs from a baseline in one feature and has a prediction outcome different from the baseline, then the differing variable should be assigned a nonzero attribution, and that (ii) if the black-box model $f(\cdot)$ is constant in some variable, then this variable should be given zero attribution. The implementation invariance axiom states that attributions should be identical for two functionally equivalent black-box models. Integrated gradients satisfy point (i) of sensitivity and implementation invariance.

For the data point $x$, the $j$-th variable, and baseline $x'$, the integrated gradient is given by

$$\mathrm{IG}_j^f(x) = \left(x_j - x'_j\right) \int_0^1 \frac{\partial f\left(x' + \alpha \left(x - x'\right)\right)}{\partial x_j}\, d\alpha. \quad (21)$$

Observe that $\mathrm{IG}_j^f(x)$ is an integral of gradients along the straight path between $x$ and $x'$. Similarly to DeepLIFT, integrated gradients defined above satisfy the completeness property: if $f(\cdot)$ is differentiable almost everywhere, then $\sum_{j=1}^{p} \mathrm{IG}_j^f(x) = f(x) - f(x')$. Equation (21) can be generalized further by considering a non-straight path between $x$ and $x'$. Path integrated gradients are then defined for a specified path $\gamma = \left(\gamma_1, \dots, \gamma_p\right) : [0, 1] \to \mathbb{R}^p$ as

$$\mathrm{PathIG}_j^f(x) = \int_0^1 \frac{\partial f\left(\gamma(\alpha)\right)}{\partial \gamma_j(\alpha)} \frac{\partial \gamma_j(\alpha)}{\partial \alpha}\, d\alpha. \quad (22)$$

Path integrated gradients are the unique attribution measure that fulfills points (i) and (ii) of sensitivity, implementation invariance, and completeness. Similarly to SHAP, path integrated gradients are rooted in cooperative game theory and correspond to a generalization of Shapley values proposed by Aumann and Shapley (1974) in the context of infinite games.
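The integral in Equation (21) is typically approximated by a Riemann sum. A self-contained numerical version, with finite-difference gradients standing in for automatic differentiation and an assumed toy function, reads as follows; the completeness check at the end illustrates the property stated above.

```python
import numpy as np

# Differentiable black box (any function with numeric gradients works here).
f = lambda x: np.tanh(x[0]) * x[1] ** 2 + x[2]

def num_grad(f, x, eps=1e-5):
    """Central finite-difference gradient, so no autodiff library is needed."""
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x); e[j] = eps
        g[j] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def integrated_gradients(f, x, x0, n_steps=200):
    """Riemann-sum approximation of Equation (21), midpoint rule."""
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    total = np.zeros_like(x)
    for a in alphas:
        total += num_grad(f, x0 + a * (x - x0))
    return (x - x0) * total / n_steps

x, x0 = np.array([1.0, 2.0, -0.5]), np.zeros(3)
ig = integrated_gradients(f, x, x0)
print("IG:", ig)
print("completeness check:", ig.sum(), "vs", f(x) - f(x0))
```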
Among more recent developments, Erion et al. (2021) introduce expected gradients (EG), which require fewer hyperparameters than the measure in Equation (21):

$$\mathrm{EG}_j^f(x) = \mathbb{E}_{x' \sim \mathcal{D},\, \alpha \sim U(0, 1)}\left[ \left(x_j - x'_j\right) \frac{\partial f\left(x' + \alpha \left(x - x'\right)\right)}{\partial x_j} \right], \quad (23)$$

where $\mathcal{D}$ is the reference distribution, for example, $x'$ could be sampled from the training dataset with replacement, and $U(0, 1)$ is the uniform distribution on the interval $[0, 1]$. Observe that rather than using a single reference $x'$, EG samples multiple references and approximates the integral as an expectation. Moreover, Erion et al. (2021) investigate incorporating attributions into the training process by imposing a prior on the expected gradients of the neural network. Attribution priors (Erion et al., 2021; Ross et al., 2017) facilitate the use of post hoc explanations, such as EG, to make the neural network more interpretable, thus building a connection with the approaches described in Section 4.2.

Bau et al. (2020) investigate the role of individual neurons in discriminative and generative deep networks and demonstrate that a sparse subset of the network's units often contributes the most to the output. Such insights facilitate a better understanding of how representation learning occurs and how high-level concepts emerge within a neural network. Several recent attribution measures have focused on providing more "fine-grained" explanations. In particular, some measures attempt to quantify the importance of individual feature detectors within a neural network, corresponding to individual neurons, also known as units, or to whole channels or filters in convolutional networks (Dhamdhere et al., 2019; Leino et al., 2018; Nam et al., 2020; Srinivas & Fleuret, 2019). As a concrete example, Ghorbani and Zou (2020) propose a Shapley-value-based importance for individual neurons.
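Returning to Equation (23), a minimal Monte Carlo estimate of expected gradients might look as follows; the toy function, the stand-in "training set," and the sample size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.tanh(x[0]) * x[1] ** 2 + x[2]

def num_grad(f, x, eps=1e-5):
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x); e[j] = eps
        g[j] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def expected_gradients(f, x, X_ref, n_samples=2000):
    """Monte Carlo estimate of Equation (23): sample a reference x' and an
    interpolation coefficient alpha, then average."""
    total = np.zeros_like(x)
    for _ in range(n_samples):
        x0 = X_ref[rng.integers(len(X_ref))]   # reference drawn with replacement
        a = rng.uniform()
        total += (x - x0) * num_grad(f, x0 + a * (x - x0))
    return total / n_samples

X_ref = rng.normal(size=(100, 3))              # stands in for the training set
x = np.array([1.0, 2.0, -0.5])
print("EG:", expected_gradients(f, x, X_ref))
```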

Further remarks
Although attribution methods have become a well-established research topic, their general applicability and usefulness have been scrutinized (Kim et al., 2018; I. E. Kumar et al., 2020; Rudin, 2019). For instance, Rudin (2019) argues that explanation and, especially, attribution methods cannot be entirely faithful to the original black-box model and that attributions do not provide any information about how the model works; instead, they tell us what the model looks at. I. E. Kumar et al. (2020) criticize Shapley-value-based explanations, such as those described above, for their reliance on the additivity axiom (Hart, 1989) and their lack of human-groundedness and contrastiveness. Through experiments on semisynthetic datasets and a user study, Adebayo et al. (2022) demonstrate that post hoc explanation methods in general, and attributions in particular, often fail to detect spurious correlations captured by the black-box model being explained. Thus, while attribution techniques are an easy-to-use and easy-to-understand model diagnostic, their effectiveness is limited by the scope of their definitions and assumptions.

| Concept-based explanations
Explanation methods described so far have mainly focused on elucidating the relationship between the input variables and the network's output. Arguably, explanations expressed w.r.t. the input space are not always straightforward. For example, individual pixels in an attribution map (Section 4.3.1) are meaningless: the user must associate the map with larger, semantically meaningful regions in the image to make sense of the attribution. Moreover, attribution methods sometimes fail to explain the relationship clearly, for example, in the case where the ground-truth explanation for classification is the object's color. One way to address such limitations is to explain the model's predictions in terms of high-level, human-understandable concepts, similar to the concept bottlenecks (Section 4.2.11). For instance, for the medical image in Figure 5, high-level concepts explaining the classification might be the visibility and diameter of the inflamed appendix. Kim et al. (2018) propose quantitative testing with concept activation vectors (TCAV): a method for quantifying post hoc the influence of a high-level concept on the representations learnt by a neural network. Consider the decomposition of the $k$-th output unit of a neural network classifier given by $f_k(x) = f_{k,l}\left(h_l(x)\right)$, where $h_l : \mathcal{X} \to \mathbb{R}^{d_l}$ refers to the activation vector of the $l$-th layer. Given a binary concept $C \in \{0, 1\}$, for the layer $l$ of the neural network $f(\cdot)$, input $x \in \mathcal{X}$, and class $y = k$, the conceptual sensitivity (CS) is defined as

$$S_{C,k,l}(x) = \nabla f_{k,l}\left(h_l(x)\right) \cdot v_C^l, \quad (24)$$

where $v_C^l \in \mathbb{R}^{d_l}$ is a concept activation vector (CAV): a unit-norm vector orthogonal to the linear decision boundary of a classifier trained in the output space of $h_l(\cdot)$ to differentiate between the categories of the concept $C$. Notably, to compute the CAV and evaluate conceptual sensitivity, a sample of data points labeled w.r.t. $C$ is required. Conceptual sensitivities can consequently be aggregated into the TCAV score given by

$$\mathrm{TCAV}_{C,k,l} = \frac{\left| \left\{ x \in S_k : S_{C,k,l}(x) > 0 \right\} \right|}{\left| S_k \right|}, \quad (25)$$

where $S_k$ is a set of inputs belonging to the $k$-th class. Intuitively, $\mathrm{TCAV}_{C,k,l}$ quantifies the proportion of inputs from class $k$ for which the activations of the $l$-th layer of $f(\cdot)$ are positively influenced by the concept $C$. The statistic in Equation (25) can then be used for hypothesis testing, for example, to decide if a specific concept has a significant influence on the network $f(\cdot)$. The TCAV score (Equation 25) only provides a global explanation, that is, it indicates whether concept $C$ influences the classifier across the entire dataset. Schrouff et al. (2021) combine the definition of the conceptual sensitivity (Equation 24) with the integrated gradients (Equation 21) to produce local concept-based explanations, referred to as integrated conceptual sensitivity (ICS):

$$\mathrm{ICS}_{C,k,l}(x) = \left\langle h_l(x) - h_0,\, v_C^l \right\rangle \int_0^1 \nabla f_{k,l}\left(h_0 + \alpha \left(h_l(x) - h_0\right)\right) \cdot v_C^l \, d\alpha, \quad (26)$$

where $h_0 \in \mathbb{R}^{d_l}$ denotes a reference activation vector. Note that, unlike the integrated gradients, ICS performs integration in the activation space of the network and uses the directional derivative, similar to the TCAV (cf. Equation 24). Another limitation of the TCAV is that, similar to concept bottlenecks, the concepts of interest have to be known, and the dataset must be at least partially labeled w.r.t. the concepts. To this end, some works have focused on the automatic discovery of concepts from neural network activations, for example, the methods introduced by Ghorbani et al. (2019) and Yeh et al. (2020).
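A compact numerical sketch of the TCAV pipeline follows: fit a linear concept classifier in an assumed network's activation space, take its normalized weight vector as the CAV, and report the fraction of class inputs with positive conceptual sensitivity (Equations 24 and 25). The random "network," the concept, and the training details are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

# A fixed random "network": h maps inputs to layer-l activations,
# f_top maps activations to the class-k logit.
Wh = rng.normal(size=(8, 4))
h = lambda X: np.tanh(X @ Wh.T)
w_top = rng.normal(size=8)
f_top = lambda a: w_top @ np.tanh(a)                  # logit for class k
grad_f_top = lambda a: w_top * (1 - np.tanh(a) ** 2)  # analytic gradient

# Probe data labeled w.r.t. the concept C (here: first raw feature positive).
X_probe = rng.normal(size=(300, 4))
c_labels = (X_probe[:, 0] > 0).astype(float)
A_probe = h(X_probe)

# CAV: normal of a linear concept classifier in activation space,
# fit with a few steps of logistic-regression gradient descent.
w = np.zeros(8)
for _ in range(3000):
    p = 1 / (1 + np.exp(-(A_probe @ w)))
    w -= 0.1 * A_probe.T @ (p - c_labels) / len(A_probe)
v_C = w / np.linalg.norm(w)

# Conceptual sensitivity (Equation 24) for inputs of class k, and the TCAV
# score (Equation 25): the fraction of inputs with positive sensitivity.
X_class = rng.normal(size=(200, 4))
sens = np.array([grad_f_top(h(x[None, :])[0]) @ v_C for x in X_class])
print("TCAV score:", (sens > 0).mean())
```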

| Symbolic metamodels
Section 4.2.10 described symbolic regression as an approach to learning interpretable mathematical expressions from raw data. Similarly to linear models and GAMs in LIME (Section 4.3.1), symbolic regression can be used for surrogate modeling of already learnt opaque predictive models (Alaa & van der Schaar, 2019; Crabbe et al., 2020). For instance, Alaa and van der Schaar (2019) propose an elegant parameterization of the symbolic regression problem (cf. Equation 15) that allows for optimization by gradient descent, in contrast to genetic programming and simulated annealing approaches that search through a discrete solution space. According to Alaa and van der Schaar (2019), symbolic metamodeling reduces to the following optimization problem:

$$g^* = \operatorname*{arg\,min}_{g \in \mathcal{G}} \; \mathcal{L}(f, g), \quad (27)$$

where $\mathcal{L}(\cdot, \cdot)$ is a metamodeling loss, and $\mathcal{G}$ is a class of succinct mathematical expressions that serve as surrogates for the black-box model $f(\cdot)$. The authors introduce a parameterization of $\mathcal{G}$ that makes the optimization problem in Equation (27) "easier." By the Kolmogorov-Arnold superposition theorem (Arnold, 1957; Kolmogorov, 1956), assuming data points $x \in \mathbb{R}^p$, the surrogate model $g(\cdot)$ can be rewritten in the following form:

$$g(x) = \sum_{i=0}^{2p} g_i^{\mathrm{out}}\left( \sum_{j=1}^{p} g_{i,j}^{\mathrm{in}}\left(x_j\right) \right), \quad (28)$$

where $g_{i,j}^{\mathrm{in}}(\cdot)$ and $g_i^{\mathrm{out}}(\cdot)$ are continuous basis functions. In Equation (28), the basis functions are parameterized by Meijer G-functions (Meijer, 1936), which are closed under differentiation. The closure property allows searching through $\mathcal{G}$ efficiently using the gradient descent procedure.
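Meijer G-functions are beyond a short example, but the overall metamodeling loop of Equation (27) can be sketched with a much cruder surrogate class: sparse degree-2 polynomials fit by proximal gradient descent. Everything here, from the black box to the basis, is an illustrative assumption rather than Alaa and van der Schaar's parameterization.

```python
import numpy as np

rng = np.random.default_rng(6)

# Opaque model to metamodel: any f: R^2 -> R queried pointwise.
f = lambda X: np.exp(-X[:, 0] ** 2) + 0.5 * X[:, 1]

# Surrogate class G: polynomials up to degree 2 (a crude stand-in for the
# Meijer-G parameterization, which is differentiable but far more expressive).
def features(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1*x2], axis=1)

X = rng.uniform(-2, 2, size=(1000, 2))          # query points for the black box
Phi, y = features(X), f(X)

# Metamodeling loss L(f, g): mean squared error over the query distribution;
# soft-thresholding (ISTA) encourages a succinct final expression.
theta = np.zeros(Phi.shape[1])
lr, lam = 1e-3, 0.01
for _ in range(5000):
    theta -= lr * Phi.T @ (Phi @ theta - y) / len(y)          # gradient step
    theta = np.sign(theta) * np.maximum(np.abs(theta) - lr * lam, 0)

names = ["1", "x1", "x2", "x1^2", "x2^2", "x1*x2"]
expr = " + ".join(f"{t:.2f}*{n}" for t, n in zip(theta, names) if abs(t) > 1e-3)
print("metamodel g(x) =", expr)
```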
Symbolic metamodeling (Alaa & van der Schaar, 2019;Crabbe et al., 2020) alongside symbolic regression (Jin et al., 2019;Stinstra et al., 2007;Udrescu & Tegmark, 2020) is a compelling alternative to attribution methods (Section 4.3.1), especially when we seek a parsimonious analytical representation of a black-box function. The parameterization proposed by Alaa and van der Schaar (2019) is a helpful reformulation of the problem that benefits from the recent advances in automatic differentiation.

| Counterfactual explanations
In some applications, it might be of paramount importance to provide human-friendly explanations (Carvalho et al., 2019) that are understandable to a broad nonspecialist audience. The techniques discussed so far mainly addressed the question "Why was this prediction made?" By contrast, counterfactual explanations try to answer the question "Why was this prediction made instead of another?" These techniques produce contrastive and actionable local explanations that can be helpful in a wide range of real-world settings, for example, when suggesting lifestyle changes to a patient to reduce her risks or providing reasons for the low creditworthiness of a company. Wachter et al. (2017) formalize counterfactual explanations in the context of ML. To find a counterfactual explanation $x'$ for a data point $(x, y)$ and a black-box model $f(\cdot)$, the authors propose solving the following optimization problem:

$$\min_{x'} \; \lambda\, \mathcal{L}\left(f(x'), y'\right) + d\left(x, x'\right), \quad (29)$$

where $d(\cdot, \cdot)$ is an appropriate distance function; $y'$ is chosen to be meaningfully different from $y$, for example, $y'$ could represent a desirable classification outcome; the loss $\mathcal{L}\left(f(x'), y'\right)$ quantifies how "different" the model's output for $x'$ is from the $y'$ chosen, for example, one could use the MSE for regression or the hinge loss for classification; and $\lambda$ is a parameter controlling the slackness on the constraint $f(x') = y'$. The problem above is loosely reminiscent of generating adversarial perturbations (Moosavi-Dezfooli et al., 2017): perturbations to the original data point $x$ are encouraged to be sparse by penalizing $d(x, x')$. Mothilal et al. (2020) extend the framework above to multiple diverse counterfactual explanations. In particular, for a data point $x$, explanations $c_1, \dots, c_K$ are found by solving the optimization problem below:

$$\min_{c_1, \dots, c_K} \; \frac{1}{K} \sum_{k=1}^{K} \mathcal{L}\left(f(c_k), y'\right) + \frac{\lambda_1}{K} \sum_{k=1}^{K} d\left(c_k, x\right) - \lambda_2 \det(S), \quad (30)$$

where $S_{k,l} = \frac{1}{1 + d(c_k, c_l)}$ and, thus, the term $\det(S)$ quantifies diversity among the explanations. In addition, Mothilal et al. (2020) propose an array of quantitative evaluation metrics for counterfactual explanation techniques, such as (i) validity, quantifying how many of the proposed explanations are actual counterfactuals; (ii) proximity, measuring the "closeness" of explanations to the original data point; (iii) sparsity, quantifying how sparse the perturbations of $x$ are; and (iv) diversity, evaluating how diverse the proposed explanations are.
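A minimal gradient-descent solver for a Wachter-style objective (Equation 29), on an assumed logistic black box with an L1 distance, might read as follows; the model, target, and hyperparameters are illustrative.

```python
import numpy as np

# Toy black-box classifier: logistic regression with fixed weights.
w, b = np.array([2.0, -1.0, 0.5]), -0.2
f = lambda x: 1 / (1 + np.exp(-(x @ w + b)))

def counterfactual(x, y_target=0.9, lam=10.0, lr=0.05, steps=2000):
    """Gradient descent on lam * (f(x') - y')^2 + ||x - x'||_1
    (an L1 subgradient stands in for the distance term)."""
    xc = x.copy()
    for _ in range(steps):
        p = f(xc)
        grad_fit = 2 * lam * (p - y_target) * p * (1 - p) * w  # d/dx' fit term
        grad_dist = np.sign(xc - x)                            # L1 subgradient
        xc -= lr * (grad_fit + grad_dist)
    return xc

x = np.array([-1.0, 1.0, 0.0])
xc = counterfactual(x)
print("f(x) =", f(x), "-> f(x') =", f(xc))
print("perturbation:", xc - x)   # sparse change flipping the outcome
```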
The counterfactual explanation methods above rely on gradient descent and, thus, assume that the black-box model $f(\cdot)$ is differentiable. Karimi, Barthe, Balle, and Valera (2020) generalize this framework, introducing model-agnostic counterfactual explanations (MACE). They transform the original optimization problem into a sequence of Boolean satisfiability problems and leverage powerful satisfiability modulo theory solvers to solve these. A significant advantage of MACE is its complete agnosticism to the choice of the black-box model $f(\cdot)$ or distance function $d(\cdot, \cdot)$ and its ability to incorporate additional plausibility constraints that allow injecting domain-specific knowledge.
The problem of counterfactual explanation naturally admits generative modeling as an approach to producing counterfactuals. Recently, several papers have utilized deep generative models (Chang et al., 2019;S. Liu et al., 2019;Mahajan et al., 2019) to solve problems similar to the ones considered by Wachter et al. (2017) and Mothilal et al. (2020). Chang et al. (2019) introduce fill-in the dropout (FIDO) saliency maps based on counterfactual generation with masking for explaining image classifiers. S. Liu et al. (2019) leverage GANs to generate minimal change counterfactual examples for image classifiers. Last but not least, Mahajan et al. (2019) propose a VAE-based counterfactual generative model that focuses on feasibility and preservation of causal constraints with regularization derived from a structural causal model (Pearl, 2010).
Another perspective on counterfactual explanations is algorithmic recourse, surveyed in detail by Karimi, Barthe, Schölkopf, and Valera (2020). Algorithmic recourse focuses on explaining the decisions and recommending further actions to "individuals who are unfavourably treated by automated decision-making systems" (Karimi, Barthe, Schölkopf, & Valera, 2020). Karimi, von Kügelgen, Schölkopf, and Valera (2020) criticize counterfactual explanations for their lack of actionability and provide a causal perspective on algorithmic recourse by considering interventions instead of explanations. To avoid infeasible or costly recommendations resulting from naïve counterfactuals, they propose finding minimal-cost structural interventions that result in a favorable outcome. While this approach certainly offers an exciting and, possibly, more user-centered perspective, the core limitation of algorithmic recourse is the unrealistic assumption of a known causal structure (Karimi, von Kügelgen, Schölkopf, & Valera, 2020).

| CONCLUDING REMARKS
Interpretable and explainable machine learning is still a young and active research area. With the recent rapid advances in designing highly performant predictive models and the inevitable infusion of machine learning into different application domains, algorithmic decision-making will have far-reaching consequences. Therefore, algorithms need to be understood and trusted by human end-users. In this overview, we surveyed interpretable machine learning models and explanation methods, described the goals, desiderata, and inductive biases behind these techniques, motivated their relevance in several fields of application, illustrated possible use cases, and discussed their evaluation.
Although the lack of universal and rigorous definitions for interpretability and explainability may seem like an impediment, it might be impossible, or even harmful, to define interpretability precisely, given the sheer breadth of contexts and applications that call for it. Nevertheless, like most developing research areas, interpretable and explainable ML could benefit from better empirical research practices, as many works still rely on purely qualitative or even anecdotal evidence. The development of standardized evaluation criteria and benchmarks could make research efforts reproducible and more focused. Last but not least, meaningful adaptations of the discussed methods to "real-world" machine learning systems and data analysis problems largely remain a matter for the future. For widespread and fruitful use of interpretable and explainable ML, stakeholders need to be involved in the discussion. Interdisciplinary collaboration on equal terms between machine learning researchers and stakeholders from application domains, such as medicine, natural sciences, and law, is the next logical step in the evolution of interpretable and explainable ML.

RELATED WIREs ARTICLES
Causability and explainability of artificial intelligence in medicine
Interpretability of machine learning-based prediction models in healthcare
A historical perspective of explainable artificial intelligence