Missing the missing values: The ugly duckling of fairness in machine learning

Nowadays, there is an increasing concern in machine learning about the causes underlying unfair decision making, that is, algorithmic decisions discriminating against some groups, especially groups defined over protected attributes, such as gender, race and nationality. Missing values are one frequent manifestation of all these latent causes: protected groups are more reluctant to give information that could be used against them, sensitive information for some groups can be erased by human operators, or data acquisition may simply be less complete and systematic for minority groups. However, most recent techniques, libraries and experimental results dealing with fairness in machine learning have simply ignored missing data. In this paper, we present the first comprehensive analysis of the relation between missing values and algorithmic fairness for machine learning: (1) we analyse the sources of missing data and bias, mapping the common causes; (2) we find that rows containing missing values are usually fairer than the rest, which should discourage treating missing values as the uncomfortable ugly data that techniques and libraries for handling algorithmic bias get rid of at the first opportunity; (3) we study the trade-off between performance and fairness when the rows with missing values are used (either because the technique deals with them directly or through imputation methods); and (4) we show that the sensitivity of six different machine-learning techniques to missing values is usually low, which reinforces the view that the rows with missing data contribute more to fairness through the other, nonmissing, attributes. We end the paper with a series of recommended procedures about what to do with missing data when aiming for fair decision making.


| INTRODUCTION
Because of the ubiquitous use of machine learning and artificial intelligence (AI) for decision making, there is an increasing urgency to ensure that these algorithmic decisions are fair, that is, that they do not discriminate against some groups, especially groups defined over protected attributes, such as gender, race and nationality. [1][2][3][4][5][6][7] Despite all this growing research interest, fairness in decision making did not arise as a consequence of the use of machine learning and other predictive models in data science and AI. 8,9 Fairness is an old and fundamental concept when dealing with data, one that should cover all data processing activities, from data gathering to data cleansing, through modelling and model deployment. It is not simply that data are biased and this can be amplified by algorithms, but rather that data processing itself can introduce more bias, from data collection procedures to model deployment. [10][11][12][13][14][15][16][17] It is therefore no surprise that fairness strongly depends on both the quality of the data and the quality of the processing of these data. 18 One major issue for data quality is the presence of missing data, which may represent the absence of information but also information that has been removed for several possible reasons (inconsistency, privacy or other interventions). 19,20 Once missing data appear in the pipeline, they become an ugly duckling for many subsequent processes, such as data visualisation and summarisation, feature selection and engineering, and model construction and deployment. It is quite common for missing values to be removed or replaced as early as possible, so that they no longer become a nuisance for a bevy of theoretical methods and practical tools.
In this context, we ask the following questions: (1) Are missing data and fairness related? (2) Are the subsamples with missing data more or less unfair? (3) Is it the right procedure to delete or replace these values, as many theoretical models and machine-learning libraries do by default? In this paper we analyse all these questions through the first comprehensive analysis, to our knowledge, of the relation between missing values and fairness. We also give a series of recommendations and guidelines about how to proceed with missing values if we are, as we should be, concerned about unfair decisions.
Let us illustrate these questions with an example. The Adult Census Data 21 is one of the most frequently used data sets in the fairness literature, where race and sex are attributes that could be used to define the protected groups. Adult has 48,842 records and a binary label indicating a salary of ≤$50K or >$50K. There are 14 attributes: eight categorical and six quantitative. The prediction task is to determine whether a person makes over $50K a year based on their attributes. Not surprisingly, as we see in Figure 1, there is a substantial number of missing values, as is usually the case for real-world data sets, and most especially those dealing with personal data. As we will see later, the missingness distribution in this data set is not missing completely at random (MCAR). This means that discarding or modifying the rows with missing values can bias the sample. But, more interestingly, these missing values appear in the occupation, workclass and native.country attributes, which seem to be strongly related to the protected attributes. As a result, the bias that is introduced by discarding or modifying the rows with missing values can have an important effect on fairness.
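As a first step in such an analysis, the per-attribute missingness can be summarised with a few lines of pandas. This is only an illustrative sketch: the column names follow the Adult schema, but the tiny inline sample is made up, not the real data set (where, in the UCI distribution, missing values are encoded as '?').

```python
import pandas as pd
import numpy as np

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and rate of missing values for each column."""
    counts = df.isna().sum()
    return pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (100 * counts / len(df)).round(2),
    }).sort_values("n_missing", ascending=False)

# Toy sample mimicking Adult; '?' is the UCI missing-value marker.
sample = pd.DataFrame({
    "workclass":  ["Private", "?", "State-gov", "?"],
    "occupation": ["Sales", "?", "Adm-clerical", "Craft-repair"],
    "sex":        ["Male", "Female", "Female", "Male"],
}).replace("?", np.nan)

print(missing_summary(sample))
```

On the real data set, the same function would reproduce the per-attribute counts summarised in Figure 1.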
On the one hand, as missing values are so commonplace, there are many techniques for data cleansing, feature selection and model optimisation that have been designed to convert (or get rid of) missing data with the aim of improving some performance metrics. To our knowledge, however, fairness has never been considered in any of these techniques.
On the other hand, many theoretical methods for dealing with fairness-the so-called mitigation methods 22 -do not mention missing values at all. Conditional probabilities, mutual information and other distributional concepts get ugly when attributes can have a percentage of missing values. Much worse, when the theory is brought to practice and ultimately to implementation, we see that the most common libraries (AIF360 toolkit, 23 Aequitas, 24 Themis-ML 25 and Fairness-Comparison library 22 ) simply remove the rows or columns containing the missing values, or assume the data sets have been preprocessed before being analysed (removing missing values), otherwise throwing an error to the user. As a result, the literature using these techniques and libraries simply reports results about fairness as if the data sets did not contain the rows with the missing data. In other words, the data sets are simply mutilated.
Apart from the Adult data set, in this paper we will also explore and analyse five other data sets: Recidivism, Titanic, Autism, Credit Approval and Juvenile Offenders. Given their intrinsic nature, these data sets may have potential fairness issues and, as usual, contain missing values. The six data sets have important differences in terms of domain, protected attributes and ratio of missing values, but it is still manageable to understand them in some detail. In what follows, we will use all of them to give a comprehensive perspective on the general analysis, the conceptual questions and the results of different algorithms. These six data sets are the only ones that, to the best of our knowledge, are commonly scrutinised in the fairness literature and also contain missing values. We also propose this collection of data sets (with two protected attributes each) as a benchmark to evaluate future studies on fairness with missing data, building on the insights of the foundational analysis performed in this paper.
The Recidivism data set contains variables used by the COMPAS algorithm 26 to assess potential adult recidivism risk in the United States, where the class indicates whether the inmate commits a crime within 2 years after being released from prison. Data for over 10,000 criminal defendants in Florida were gathered by ProPublica, 1 showing that black defendants were often predicted to be at a higher risk of recidivism than they actually were, compared with their white counterparts. In this data set, there are three attributes with missing values: days_b_screening_arrest, c_days_from_compas and c_charge_desc. We also study the data for 891 of the real Titanic passengers, a case where bias is more explicit than in the other cases. The class represents whether the passenger survived or not, where the conditional probability that a person survives given their sex and passenger class is higher for females from higher classes. There are also three attributes with missing values: age, fare and embarked. Taking into account the strong semantic meaning of these attributes and their possible association with protected groups by gender or race, it is very likely that the missing values of these attributes have an effect on fairness, as we will study in the following sections. The Autism data refer to Autistic Spectrum Disorder (ASD) screening information for 704 adults. This data set includes attributes regarding the test takers' demographics (e.g., age, gender and ethnicity) as well as 10 (binary) screening test questions. The class represents early autism diagnosis, where the conditional probability that a person is diagnosed with this condition given their Sex and Ethnicity is higher for white-European females. In this data set, there are two attributes with missing values: Age and Relation.
The Credit Approval data set collects a company's decisions to approve or deny credit card applications based on the applicant's information (e.g., prior default, years employed, credit score, income level, loan balances or number of individual credit reports). It contains information about 690 applicants. Machine-learning models attempting to generalise this sort of data usually find patterns that may be controversial or even illegal. For instance, the loan-repayment model reveals that the applicant's ethnicity plays a significant role in predicting repayment, because the training data set happened to show better repayment for white applicants than for nonwhite applicants. In this data set, there are five attributes with missing values: Age, Married, BankCustomer, EducationalLevel and ZipCode. Finally, the Juvenile Offenders data set contains information about 4753 juvenile offenders who were incarcerated in the juvenile justice system of Catalonia (Spain). Their recidivism status (after their release) is assessed with SAVRY. It has been found that, in general, machine-learning models trained with these data could end up discriminating by gender against men, foreigners or people from specific national groups. 7 Compared with the numerous fairness-related studies published on COMPAS, the literature in juvenile criminal justice is limited, 28 and only one study analyses disparities between protected groups when using ML algorithms. 7 In this data set, there are six attributes with missing values: Edat_fet_agrupat (age), Provincia (province), Comarca (county), Edat_fet (age when offence) and Fet (offence). As we will see at the end of Section 2, the distribution of these missing values is not MCAR for any of these six data sets.
The main novel contribution in this paper is to clarify the relationship between missing data and fairness, and the best way forward in real situations. In this regard, we will combine a theoretical analysis of the causes of missing data and unfairness, with an experimental analysis of different kinds of classifiers using the six data sets above. We convert this general goal into more specific technical questions, such as whether the examples with missing values are fairer than the rest in terms of the Statistical Parity Difference (SPD), a common fairness metric, 23 or whether there is a relationship between protected attributes and the attributes with missing values. We will also formally characterise the space that derives from the trade-off between a metric of fairness and accuracy, and how different subsets of the data with or without missing values are represented in that space, also including the results replacing missing data with imputed values.
The rest of the paper is organised as follows. Section 2 overviews the reasons why missing values appear, what kinds of missingness there are and how missing values are handled (e.g., imputation). Next, in Section 3 we analyse the causes of unfairness, along with some metrics, mitigation techniques and libraries. In Section 4, we put these two areas together and see why missingness and fairness are so closely entangled. We also analyse, for the running data sets and two scenarios each, whether the examples with missing values are fairer than the rest in terms of a common fairness metric, SPD. Together with a performance metric (accuracy), SPD defines a space of trade-offs bounded by an octagon, which we derive theoretically. Once this is understood conceptually, in Section 5 we analyse a predictive model that can deal with missing values directly and see whether the bias is amplified or reduced. In Section 6 we study what happens when imputation methods (IMs) are introduced so that many other machine-learning techniques can be used. We analyse how sensitive different machine-learning algorithms are to changes in the attributes with missing values. We empirically evaluate how the results with imputed attributes compare with the models learnt by removing the missing values and with some reference classifiers (majority class and perfect classifier). In Section 7 we further analyse these results to make a series of recommendations about how to proceed when dealing with missing values if fairness is to be traded off against performance. Finally, we close the paper with some final comments and takeaway messages.

| MISSING DATA
Missing data are a major issue in research and practice with real-world data sets. For instance, in the educational and psychological research domains, Peugh and Enders 29 found that, on average, 9.7% of the data was missing (with a maximum of 67%). In the same area, Rombach et al. 30 estimated that this percentage ranged from 1% to over 70%, with a median of 25%. Another clear example of how pervasive missing data are is the percentage of data sets (over 45%) that have missing values in the UCI repository, 31 one of the most popular sources of data sets for machine-learning and data science researchers.
Being such a frequent phenomenon, it is not surprising that 'missingness' has different causes. In the context of this paper, we need to analyse these causes if we want to properly understand the effect of missing data on fairness.
Three main patterns can be discerned in missing data 32 :

• Partial completion (attrition). A partial completion (breakoff) is produced for a single record or sequence when, after collecting a few values of a record, at a certain point in time or place within a questionnaire or a data collection process, the remaining attributes or measurements are missing. That means that attributes placed at the end of a questionnaire, as well as respondents more prone to fatigue or problems, are more likely to produce missing values. Note that this case may affect full rows as well: if only a few questions (attributes) are recorded, the whole example (row) could be removed. This answering pattern usually occurs in longitudinal studies where a measurement is repeated after a certain period of time, or in telephone interviews and web surveys. This kind of missing data creates a dependency between attributes and their order, which may be used for imputation and other methods for treating missing values.

• Missing by design. This refers to the situation in which specific questions or attributes will not be posed to or captured for specific individuals. There are two main reasons for items to be missing by design. (1: contingency attributes) Certain questions may be 'non-applicable' (NA) to all individuals. In this case, the missingness mechanism is known and can be incorporated in the analysis. (2: attribute sampling) A specific design is used to administer different subsets of questions to different individuals (i.e., random assignment of questions to different groups of respondents). Note that in this case, all questions are applicable to all respondents, but for reasons of efficiency not all questions are posed to all respondents. Here, we also know the missingness mechanism, but due to the random assignment of questions, this is the easiest case to treat statistically.

• Item nonresponse. No information is provided for some respondents on some variables.
Some items are more likely to generate a nonresponse than others (e.g., private items, such as income). In general, surveys, questionnaires or interviews used to collect data suffer from missing data in three main subcategories. (1: not provided) The information is simply not given for a question (e.g., an answer is not known, the value cannot be measured, a question is overlooked by accident, etc.). (2: useless) The information provided is unprofitable or useless (e.g., a given answer is impossible, unsuitable or illegible, etc.). (3: lost) Usable information is lost due to a processing problem (e.g., an error in data entry or processing, equipment failure, data corruption, etc.). The former two subcategories originate in the data collection phase, and the latter results from errors in the data processing phase.
While the three categories above originate from questionnaires involving people, they are general enough to include other causes behind the presence of missing data in real applications (also known as incomplete data), including different kinds of failures: network, power, or the device itself or its sensors. 33 For instance, if a sensor capturing data stops working because it runs out of battery, we have a partial completion. If it is only installed in some specific models, devices or locations, we have missing by design. Finally, if it is affected by the weather, we have an item nonresponse. Internet of Things (IoT) applications analysing big data are another example of a data source where data loss is very common due to, for instance, unreliable wireless links or hardware failures in the nodes (i.e., partial completion).
Missing data may also have several effects in social media environments where actors are linked together via multiple interaction contexts. 34 In this latter case, there are also a number of missing-data situations, such as the noninclusion of actors, affiliations or some other incomplete registration data (i.e., item nonresponse), and censoring by vertex degree, as there is a practical limit on the number of neighbours of a vertex that can be explored (i.e., missing by design). On many occasions, for simplicity or because the traceability to the data acquisition is lost, we can only characterise some statistical kinds of missing values. In general, a distinction is made between three types of missingness mechanisms 35,36 : (1) MCAR, where missing values are independent of both unobserved and observed parameters of interest and occur entirely at random (e.g., accidentally omitting an answer on a questionnaire). In this case, missing data are independent and simple statistical treatments may be used. (2) Missing at random (MAR), where missing values depend on observed data, but not on unobserved data (e.g., in a political opinion poll many people may refuse to answer based on demographics; missingness then depends on an observed variable [demographics], but not on the answer to the question itself). In this case, if the variable related to the missingness is available, the missingness can be handled adequately. MAR is believed to be more general and more realistic than MCAR (missing-data IMs generally start from the MAR assumption). Finally, we have (3) missing not at random (MNAR), where missing values depend on unobserved data (e.g., a certain question on a questionnaire tends to be skipped deliberately by those participants with weaker opinions). MNAR is the most complex, nonignorable case, where simple solutions no longer suffice and an explicit model for the missingness must be included in the analysis.
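The three mechanisms can be made concrete with a small simulation. The sketch below (synthetic data, illustrative variable names) generates MCAR, MAR and MNAR missingness for an income attribute; note that under MNAR even the observed mean is biased downwards, since high earners refuse to disclose.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.uniform(18, 90, n)
income = rng.normal(50_000, 15_000, n)

# MCAR: each income value is dropped with a fixed probability,
# independently of everything.
mcar_mask = rng.random(n) < 0.1

# MAR: missingness in income depends on the *observed* age
# (older respondents refuse more often), not on income itself.
mar_mask = rng.random(n) < np.where(age > 60, 0.3, 0.05)

# MNAR: missingness in income depends on the *unobserved* income
# itself (high earners refuse to disclose).
mnar_mask = rng.random(n) < np.where(income > 70_000, 0.5, 0.05)

income_mnar = np.where(mnar_mask, np.nan, income)
# Under MNAR, the mean of the observed values underestimates the truth:
print(np.nanmean(income_mnar), income.mean())
```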
Note that all three mechanisms that generate missing data may be present at the same time for different attributes. 37 While it is possible to test the MCAR assumption (t-test), distinguishing between MAR and MNAR is not testable, because the answer lies within the absent data. 38,39 The literature also describes different techniques to handle missing data depending upon the missingness mechanism. 40 However, it is quite common for practitioners to apply a particular technique without analysing the missingness mechanism. Among the most common techniques, we can name the following:

• Row or listwise deletion (LD). Whole cases (rows) with missing data are discarded from the analysis. When the data are MCAR and the sample is sufficiently large, this technique has been shown to produce adequate parameter estimates. However, when the MCAR assumption is not met, LD will result in bias. 41,42

• Column deletion (CD). This simply removes the column. It is an extreme option, as it totally removes the information of an attribute. It can also generate bias, as the values of other columns may be affected by the missing values in the removed column.

• Labelled category (LC). An attribute (usually quantitative) can be binned or discretised, adding a special category for missing values. However, the special label representing the missing value lacks any ordinal or quantitative value. Sometimes, the existence of a missing value is flagged with a new Boolean attribute and the original attribute is removed.

• Imputation methods (IMs). The missing value is replaced by a fictitious value. There are many ways to do this, such as replacing it by the mean (or median) when the attribute is quantitative and the mode when it is qualitative, or estimating the value from the other attributes, even using predictive models.
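The four treatments above can be sketched on a toy data frame (hypothetical column names) with pandas:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, np.nan, 40, 35, np.nan],
    "job": ["clerk", "sales", np.nan, "clerk", "sales"],
    "y":   [0, 1, 0, 1, 1],
})

# Listwise deletion (LD): drop any row with a missing value.
ld = df.dropna()

# Column deletion (CD): drop any column containing missing values.
cd = df.dropna(axis=1)

# Labelled category (LC): missing values become their own category.
lc = df.assign(job=df["job"].fillna("MISSING"))

# Imputation (IM): mean for quantitative, mode for qualitative.
im = df.assign(
    age=df["age"].fillna(df["age"].mean()),
    job=df["job"].fillna(df["job"].mode()[0]),
)
```

Note how much LD shrinks even this tiny sample: three of the five rows are discarded, which is the efficiency problem discussed next.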
LD is very common. Even if the MCAR assumption is not disproven, LD is claimed to be suboptimal because of the reduction in sample size, 29,40 and even under MCAR it does not use the available data efficiently. 43 Therefore, methodologists have strongly advised against the use of LD, 35,37 judging it to be "among the worst methods for practical applications" [44, p. 598]. IM, on the contrary, is increasingly frequent as a preprocessing step for many machine-learning techniques, as they cannot deal with missing values. However, many libraries do not explicitly state which IM they are using, and this may vary significantly depending on the machine-learning technique, library or programming language that is used. For example, random forest 45 handles missing values by imputation with average/mode or proximity-based measures whenever the implementation is based on Classification and Regression Trees (CARTs), 46 as originally proposed by its authors. However, in implementations where C4.5 decision tree learning 47 is used instead, the missing values are not replaced: the impurity score mechanisms take them into account (e.g., information gain is simply weighted by the proportion of missing values on a particular attribute 48 ). This implicit (usually by-default) processing happens more in research than in real practice, where a domain expert can evaluate several mechanisms and choose the best one, even replacing the by-default treatment done by a training algorithm. In research papers, many data sets are frequently used in experimental evaluations, and explicitly choosing the best way of dealing with missing values would require a full understanding and analysis of each data set and technique. Using the same missingness-handling mechanism, not to say the same IM, for a range of data sets and machine-learning techniques is clearly suboptimal: there is no good-for-all mechanism. We can now analyse the missingness mechanisms in our six running data sets (Adult, Recidivism, Titanic, Autism Screening, Credit Approval and Juvenile Offenders).
We use Little's MCAR global test, 49 a multivariate extension of the t-test approach that simultaneously evaluates mean differences on every variable in the data set. Using the R package BaylorEdPsych, we reject the null hypothesis of MCAR with p < 0.001 for all six data sets. MCAR is therefore discarded for all of them, and hence LD is inappropriate.
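Little's global test is implemented in R, but the intuition can be shown with a simplified, univariate stand-in (not the full multivariate statistic): compare the distribution of an observed attribute between the rows where another attribute is missing and those where it is not; a significant difference is evidence against MCAR. The data below are synthetic and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 5_000
age = rng.normal(40, 10, n)

# Make 'occupation' missing far more often for younger people (MAR,
# hence not MCAR): missingness depends on the observed age.
occ_missing = rng.random(n) < np.where(age < 35, 0.4, 0.05)

# t-test on age between rows with and without a missing occupation.
t, p = stats.ttest_ind(age[occ_missing], age[~occ_missing])
print(f"t = {t:.2f}, p = {p:.3g}")  # a tiny p makes MCAR implausible
```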
Given that we do not have the complete traceability of the data sets, we can only hypothesise the causes for the missing values. For instance, in the Adult Census Data it may be due to item nonresponse, because of the distribution of missing values seen in Figure 1.

| FAIRNESS
Fairness in decision making has been recently brought to the front lines of research and the headlines of the media, but in principle, making fair decisions should be equal to making good decisions. The issue should apply both to human decisions and algorithmic decisions, but the progressive digitisation of the information used for decision making in almost every domain has facilitated the detection and assessment of systematic discrimination against some particular groups. The use of data processing pipelines is then a blessing and not a curse for fairness, as it makes it possible to detect (and treat) discrimination in a wider range of situations than when humans make decisions solely based on their intuition. As we will see, metrics for fairness have led to computerised techniques to improve fairness (mitigation techniques). However, it is still very important to know the causes of discrimination, to avoid oversimplifying the problem. Also, in this paper we want to map these causes with those of missing values seen in Section 2.
What are the causes that introduce unfairness in the decision process? We can identify common distortions originally arising in the data, but also distortions in the algorithms or the humans involved in decision making that can perpetuate or even amplify unfair behaviours. From the literature, [50][51][52][53][54][55][56][57][58] we can classify them into six main groups:

• Sample or selection bias. This occurs when the (sample of) data are not representative of the target population about which conclusions are to be drawn. It happens when the sample is collected in such a way that some members of the intended population are less likely to be included than others. A classic example of a biased sample comes from politics, such as the famous 1936 opinion polling for the US presidential election carried out by the American Literary Digest magazine, which overrepresented rich individuals and predicted the wrong outcome. 59 If some groups are known to be underrepresented and the degree of underrepresentation can be quantified, then sample weights can correct the bias.

• Measurement bias (systematic errors). Systematic value distortion happens when the device used to observe or measure favours a particular result (e.g., a scale that is not properly calibrated might consistently understate weight, producing unreliable results). This kind of bias is different from random or nonsystematic measurement errors, whose effects average out over a set of measurements. Systematic errors cannot be avoided simply by collecting more data, but by having multiple measuring devices (or observers of instruments) and specialists to compare the output of these devices.

• Self-reporting bias (survey bias). This has to do with nonresponse, incomplete and inconsistent responses to surveys, questionnaires or interviews used to collect data.
The main reason is the existence of questions concerning private or sensitive topics (e.g., drug use, sex, race, income, violence, etc.). Therefore, self-reported data can be affected by two types of external bias: (1) social desirability or approval (e.g., when determining drug usage among a sample of individuals, the actual average value is usually underestimated); and (2) recall error (e.g., participants can provide erroneous responses to a self-report of dietary intake depending on their ability to recall past events).

• Confirmation bias (observer bias). This bias places emphasis on one hypothesis because it involves favouring information that does not contradict the researcher's desire to find a statistically significant result (or their pre-existing beliefs). This is a type of cognitive bias in which a decision is made according to the subject's preconceptions, beliefs or preferences, but it can also emerge owing to overconfidence (with contradictory results/evidence being overlooked). Peter O. Gray 60 provides an example of how confirmation bias may affect a doctor's diagnosis: "A doctor who has jumped to a particular hypothesis as to what disease a patient has may then ask questions and look for evidence that tends to confirm that diagnosis while overlooking evidence that would tend to disconfirm it."

• Prejudice bias (human bias). A different situation arises when the training data at hand already include (human) biases containing implicit racial, gender or ideological prejudices. Unlike the previous categories, which mostly affect the predictive attributes (model inputs), this kind of bias concerns the variables used as dependent variables (model outputs). Therefore, systems designed to reduce prediction error will naturally replicate any bias already present in the labelled data.
We find examples of this in the reoffence risk-assessment tool COMPAS, deployed in US criminal justice systems to inform bail and parole decisions, which demonstrated bias against black people. 1 Another is Amazon's AI hiring and recruitment system, which showed a clear bias against women, 61 having been trained from CVs submitted to the company over a 10-year period.

• Algorithmic bias. In this case the algorithm creates or amplifies the bias over the training data. For instance, different populations in the data may have different feature distributions (and different relationships to the class label). As a result, if we train a group-blind classifier to minimise overall error, it cannot usually fit both populations optimally, so it will fit the majority population. It may even be that the best such classifier always picks the majority class, or ignores the values of some minority-group attributes, thus leading to a (potentially) higher distribution of errors in the minority population.
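A minimal synthetic illustration of this last point: when a minority group's feature distribution is shifted, a single group-blind threshold chosen to minimise overall error fits the majority group and concentrates the errors on the minority. All numbers here are made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample(n, shift):
    """One-feature binary task; 'shift' moves the group's distribution."""
    y = rng.integers(0, 2, n)
    x = rng.normal(shift + 2 * y, 1.0, n)
    return x, y

x_maj, y_maj = sample(18_000, shift=0.0)  # majority group
x_min, y_min = sample(2_000, shift=2.0)   # minority group, shifted
x = np.concatenate([x_maj, x_min])
y = np.concatenate([y_maj, y_min])

# Group-blind threshold chosen to minimise *overall* error.
grid = np.linspace(-2, 6, 801)
errs = [((x > t).astype(int) != y).mean() for t in grid]
t_star = grid[int(np.argmin(errs))]

err_maj = ((x_maj > t_star).astype(int) != y_maj).mean()
err_min = ((x_min > t_star).astype(int) != y_min).mean()
print(t_star, err_maj, err_min)
```

The chosen threshold sits near the majority group's optimum, so the minority's error rate is several times higher, exactly the disparity described above.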
While the range of causes in general can be enumerated, the precise notion of fairness and what causes it in a particular case is much more cumbersome. It is nonetheless very helpful to become more precise with definitions or metrics of fairness, based on the notions of protected attribute and parity. Let us start with the definition of a decision problem, which is tantamount to a supervised task in machine learning. We will focus on classification problems, since they are the most common case in the fairness literature and the metrics are simpler. Let us define a set of attributes X, where the subset S ⊆ X denotes the protected attributes. A protected attribute is assumed to be categorical (e.g., race, gender and religion) and can partition a population into groups that should have parity in terms of the potential benefit obtained. For each protected attribute S_i, we have a set of values V_i (e.g., {male, female}). Groups can be created by setting the value of one or more protected attributes. We typically use the term 'privileged' group to highlight a group that has a systematic advantage in the context or domain of application (e.g., white males). Usually, fairness metrics are based on determining whether decisions differ between groups, or just between the privileged groups and the rest. Now consider a label or class attribute Y, which can take values in C (e.g., {guilty, nonguilty}). For the analysis of fairness we usually consider one of the classes as the 'favourable' (or positive) outcome, denoted by c+. Note that a favourable label value corresponds to an outcome that provides an advantage to the individual represented by the example. Similarly, c− denotes the unfavourable class. For instance, in this case, nonguilty is the favourable outcome. An unlabelled example x is a tuple assigning to each attribute X_i ∈ X a value in V_i, possibly including the extra value ⊙, representing a missing value.
A labelled example or instance ⟨x, y⟩ is formed from an unlabelled example with the class value y taken from C (e.g., ⟨{race = black, gender = female, income = ⊙}, guilty⟩). A decision problem is simply defined as mapping x to ŷ, such that ŷ is correct with respect to some ground truth y. Let us denote by M a mechanism or model (human or algorithmic) that tries to solve the decision problem. Data sets, samples and populations are defined over sets or multisets of examples, and examples are drawn or sampled from them, using the notation D. Given a population or data set D, we denote by D_{X=a} the subset of examples where attribute X takes value a. Some fairness metrics choose an indicator that does not depend on the confusion matrix, but just on the overall probabilities. For instance, the percentage of the favourable class or the unfavourable class can be applied either to data sets or to the predictions of a model. Other metrics are defined by comparing predicted and true labels, for example, through the true positive rate (TPR) and the false positive rate (FPR), so they can only be applied to models when compared with the ground truth (or a test data set). In the end, several combinations of cells in the confusion matrix lead to dozens of fairness metrics.
A very common metric is the Statistical Parity Difference (SPD), which, for the favourable class, can be defined for an attribute X_i with privileged value a as follows:

SPD = P(Y = c+ | X_i = a) − P(Y = c+ | X_i ≠ a)

For instance, if the attribute is Race, with values caucasian, black and asian, and we consider caucasian as the privileged group, SPD is the probability of a favourable outcome for Caucasians minus the probability of a favourable outcome for non-Caucasians. A value of 0 implies that both groups obtain equal benefit, a value less than 0 implies higher benefit for the unprivileged group, and a value greater than 0 implies higher benefit for the privileged group. Note that the sign of SPD changes if, for a binary data set, we swap the favourable and unfavourable classes; the same happens if we swap the privileged groups. This is an interesting property, as the choice of the favourable class, and especially of the privileged group, is sometimes arbitrary. For instance, in the Adult data set, if a model is used to grant a subsidy, the favourable class is earning <$50K. The important thing is that a value closer to zero is fairer.
Other popular metrics applicable to both data sets and models include the Disparate Impact (DI), 62 which calculates a ratio instead of a difference as SPD does (it should be noted that the use of a ratio introduces some issues with extreme values). Other metrics are only applicable to models: the Equal Opportunity Difference (EOD), 14,63 which is the difference in TPR between the groups, or the Average Odds Difference (OddsDif), 14 which also considers the FPR (see Reference [64] for a summary of fairness metric definitions). There is usually some confusion about which fairness metric to look at, a question that is not very different from the choice of the 'right' performance metric in classification. 65,66 However, as their definitions show, all of them are closely connected, since they all try to quantify differences between the unprivileged and privileged groups. For illustrative purposes, we have evaluated the strength of the relationships between all these metrics in a comprehensive simulated scenario. The absolute Spearman correlations between SPD, DI and OddsDif are all higher than 0.84, and the correlation with EOD is always higher than 0.59.
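The model-level metrics can be sketched in the same spirit (function names and toy arrays are illustrative assumptions): EOD is the TPR gap between the privileged and unprivileged groups, and OddsDif averages the TPR and FPR gaps.

```python
import numpy as np

def group_rates(y_true, y_pred, mask, favourable=1):
    """TPR and FPR of the predictions, restricted to the examples in `mask`."""
    yt, yp = np.asarray(y_true)[mask], np.asarray(y_pred)[mask]
    tpr = np.mean(yp[yt == favourable] == favourable)
    fpr = np.mean(yp[yt != favourable] == favourable)
    return tpr, fpr

def eod_and_odds(y_true, y_pred, privileged_mask):
    tpr_p, fpr_p = group_rates(y_true, y_pred, privileged_mask)
    tpr_u, fpr_u = group_rates(y_true, y_pred, ~privileged_mask)
    eod = tpr_p - tpr_u                                 # Equal Opportunity Difference
    odds = 0.5 * ((tpr_p - tpr_u) + (fpr_p - fpr_u))    # Average Odds Difference
    return eod, odds

# Toy predictions: first four examples belong to the privileged group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
priv = np.array([True] * 4 + [False] * 4)
print(eod_and_odds(y_true, y_pred, priv))
```

In this toy case the privileged group has TPR 0.5 and FPR 0.5 while the unprivileged group has TPR 1.0 and FPR 0.0, so EOD is negative while the odds differences cancel out on average.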
In conjunction with the fairness metric, we have to select the performance metric that indicates how successful a trained model is when scoring or predicting examples. Depending on the metric, performance can be measured using (1) a threshold and a qualitative understanding of error (e.g., accuracy, F-score, Kappa statistic); (2) a probabilistic understanding of error (e.g., mean absolute error, LogLoss, Brier score); or (3) how well the model ranks the examples (e.g., AUC). Our experience and empirical evidence, at least for the data sets selected for this study, show that we reach similar conclusions independently of the fairness and performance metrics chosen for the analysis. Consequently, in what follows, and for the sake of exposition and simplicity, we will use SPD as a representative fairness metric and accuracy as a metric of performance.
The introduction of formal metrics has fuelled the development of new techniques for discrimination mitigation. There are three main families: those that are applied to the data before learning the model, those that modify the learning algorithm, and those that modify or reframe the predictions (known as preprocessing, in-processing and postprocessing, respectively, in Reference [23]). Typically, maximising one fairness metric has a significant effect on performance metrics, such as prediction error, so it is quite common to find trade-offs between fairness metrics and performance metrics. 67 A possible way of studying this is through a Pareto optimisation, where different techniques can be compared to see whether they improve the Pareto front. We will use this approach in the rest of the paper.
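A Pareto front over (accuracy, |SPD|) pairs can be sketched as follows: a technique survives if no other technique is at least as accurate and at least as fair. The data points below are hypothetical, purely for illustration.

```python
def pareto_front(points):
    """Keep the points not dominated by any other point, where domination
    means accuracy at least as high AND |SPD| at least as close to zero.
    Each point is an (accuracy, abs_spd) pair."""
    front = []
    for acc, bias in points:
        dominated = any(a >= acc and b <= bias and (a, b) != (acc, bias)
                        for a, b in points)
        if not dominated:
            front.append((acc, bias))
    return sorted(front)

# Hypothetical (accuracy, |SPD|) results for five mitigation techniques.
pts = [(0.80, 0.15), (0.78, 0.05), (0.70, 0.02), (0.75, 0.10), (0.76, 0.12)]
print(pareto_front(pts))  # → [(0.7, 0.02), (0.78, 0.05), (0.8, 0.15)]
```

Comparing a new technique against this front shows whether it improves the trade-off rather than merely moving along it.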
Finally, both metrics and mitigation techniques are usually integrated into libraries. As of today, these are the most representative ones, in our opinion:
• AIF360 toolkit 23 is an open-source library to help detect and remove bias in machine-learning models. The AI Fairness 360 Python package includes a comprehensive set of metrics for data sets and models to test for biases, explanations for these metrics, and algorithms to mitigate bias in data sets and models.
• Aequitas 24 is an open-source bias audit toolkit for data scientists, machine-learning researchers and policymakers to audit machine-learning models for discrimination and bias, and to make informed and equitable decisions when developing and deploying predictive risk-assessment tools.
• Themis-ML 25 is a Python library that implements some fairness metrics (mean difference and normalised mean difference) and fairness-aware methods, such as Relabelling (preprocessing), Additive Counterfactually Fair Estimator (in-processing) and Reject Option Classification (postprocessing), providing a handy interface to use them. The library can be extended with new metrics, mitigation algorithms and test-bed data sets.
• Fairness-Comparison library 22 is presented as a benchmark for the comparison of different bias mitigation algorithms; this Python package makes available to the user the set of metrics and fairness-aware methods used for the study. It can also be extended by adding new algorithms and data sets.
Some of these libraries come (or are illustrated) with data sets that are known to have, or lead to, fairness issues. In our case, we chose those data sets from the literature that originally have missing values. Titanic is uncommon in the fairness literature, but its issues are also very relevant (although somewhat in the opposite direction than usual). This collection of data sets is proposed as a general benchmark to examine the relation between fairness and missingness. The details of these data sets were given in Section 1.

| MAPPING MISSINGNESS AND UNFAIRNESS
The new question we ask in this paper is what effect missing values have on fairness. As we will see, this is a complex relation, for which we need to map the causes of missing values to the causes of unfair treatment, as illustrated in Figure 2. Looking at the left-hand side of the figure, we see that missing values may be the consequence of innumerable factors, from basic errors while processing and acquiring data to intentional action by human agents. When fairness is taken into consideration, one must realise that missing data might not be evenly distributed between different groups, which in turn might have unwanted effects on the fairness of the data and the models created, depending on how the missing data are handled. For example, on the side of errors, an important factor is that different groups might have different semantic constructs to answer the same query, leading to different interpretations and omissions, 68 for example, with more missing values being generated by a sensitive group than by the others. At the other extreme, people might intentionally omit information as a natural coping mechanism when they believe that a truthful and complete answer might lead to a discriminatory and unfair decision. 69 If we go one by one from the causes of missingness to unfairness, or vice versa, we see that many combinations are related. While attrition, attribute sampling (if random) and lost information may be less associated with the causes of unfairness, some others, such as contingency questions, not-provided answers and useless answers, may be strongly related to fairness. We can recognise common underlying origins for these three missingness situations and three causes of unfairness: self-reporting (survey) bias, confirmation (observer) bias and prejudice (human) bias (see Figure 2).
In general, we find common origins of missingness and unfairness: the emergence of minorities or underrepresented groups that are reluctant to provide sensitive information, and discriminatory or unfavourable treatment of those individuals in the decision-making process on grounds such as gender, disability or sexual orientation, or influenced by cultural norms. What we do not know is whether some of the conscious or unconscious actions taken by the actors in the process may have a compensatory result. For instance, are women who do not declare their number of children treated in a more or less fair way than those who declare (or lie about) their number of children? In the end, filling in a questionnaire, or indeed any other kind of behaviour when a person knows that she is being observed, creates a bias. People tend to conceal their real information or behaviour in order to be classified into their desired group. In other words, some types of missing values can be used in an adversarial way by the person being modelled, who is trying to be classified into the favourable outcome. All this happens with university admissions, credit scoring, job applications, dating apps and so forth.
Bias, intentional or unintentional, can also arise from missingness in other use cases, such as the following. Discarding data because a subpopulation has missing values more commonly (useless answers or not provided) may result in underrepresentation. Decision makers may also favour a privileged group or reinforce stereotypes influenced by cultural norms or their own beliefs by, for instance, using underrepresentative data to train machine-learning models (e.g., the gender and racial bias found in AI recognition technology 70,71 ). They can also provide different subsets of questions to different groups, thus hindering objectivity and leading to a selective and possibly misleading use of data to support decisions that have already been made. Furthermore, missing values in many real-world data sets in the health or criminal justice domains can also greatly influence explanations when machine-learning pipelines remove them or perform incorrect imputations, leading to misleading counterfactuals. 72 This is the case with health records, where some tests are missing for most patients because of costs, risks or inapplicability, but end up being imputed, with unforeseen consequences when used for the rest of the analysis. Also, note that a biased population (e.g., hospitalised patients) does not provide the real distribution of the values to be imputed either.

F I G U R E 2 Mapping between causes of missingness and unfairness. Although many combinations are possible, dashed lines show those causes of missingness and unfairness that are most strongly related by having a common origin
We can gain more certainty about the relation between missingness and unfairness by analysing some real data. Let us look at our six running data sets, Adult, Recidivism, Titanic, Juvenile Offenders, Credit Approval and Autism Screening, each of them with two protected attributes. This allows us to perform 12 different evaluations.4 First, we perform a simple correlation analysis, as shown in Figure 3. We observe that the protected group attributes are usually not very correlated with the class, which means that the bias must appear through other proxy attributes. Titanic is the only data set where the protected groups are really predictive of the class, which is what motivated us to include it. Regarding the co-occurrence of missing values, we found co-occurrences in Adult between occupation and workclass, as one aggregates the other; in Recidivism between c_days_from_compass and c_charge_desc; in Juvenile Offenders between Edat_fet_agrupat and Edat, also because of aggregation, and between Provincia and Comarca, similarly due to aggregation; and in Credit Approval between all variables with missing values (except age). The influence and relation of the variables with missing values to the other variables are more diverse. What we see is that having a missing value or not shows small correlations with the protected attributes and the class (top-right part of the correlation matrix), although this is affected by the proportion of missing values per attribute being relatively small. The purpose of this correlation matrix is actually to confirm that strong correlations are not found, and that the relations, if they exist, are usually more subtle. Now let us have a look at fairness. For each of these 12 cases (6 data sets × 2 privileged groups each), Table 1 shows the fairness metric (SPD) for different subsets of each data set and privileged group.
Specifically, we focus on the following subsets of data: (a) the whole (original) data set ("all rows"), (b) the subset of examples that contain at least one missing value in any of their attributes ("with ⊙") and (c) the subset of examples that do not contain any missing values ("w/o ⊙"). If we look at the value of SPD for all rows for Adult, we see that it is positive. This means higher benefit for the privileged groups, whites and males. Nevertheless, it is important to note that the favourable class for Adult is earning more than $50K. If a model is used to determine who is entitled to a subsidy, then having the label >$50K would not be the favourable outcome. This suggests that we should not take the positive or negative sign as a good or bad bias, as this is application-dependent, but rather pay attention to the absolute magnitudes; the closer to 0 the better. A similar result is seen for Recidivism, where the positive values for SPD indicate that the privileged groups (Caucasian and female) are associated with no recidivism more often than the other groups. Then, for Titanic, the positive (and high) values for SPD indicate that the privileged groups (first-class passengers and females) fared very favourably (their survival rate was much higher than that of the complementary groups). Here, the bias is much stronger, as it followed an explicit protocol favouring females in ship evacuation at the time, and many other conditions very clearly favouring the first-class passengers. For the Autism data set, white-European females are more commonly diagnosed with autism than their counterparts, the bias being much stronger for ethnicity than for gender. For the Credit Approval data, white males are the privileged groups, granted a higher proportion of credit approvals. Finally, for the Juvenile Offenders data set, the positive values indicate that there is a bias against men and foreigners, who are associated with a higher risk of reoffending.
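The subset comparison just described can be sketched on a toy data frame (the columns and values below are hypothetical, not taken from Table 1): split the rows by whether they contain a missing value and compute SPD on each subset.

```python
import numpy as np
import pandas as pd

def spd(df, protected, privileged, label, favourable):
    """P(favourable | privileged) minus P(favourable | unprivileged)."""
    p = (df.loc[df[protected] == privileged, label] == favourable).mean()
    u = (df.loc[df[protected] != privileged, label] == favourable).mean()
    return p - u

# Toy data set with missing values (NaN) in an ordinary attribute.
d = pd.DataFrame({
    "sex":    ["m", "m", "f", "f", "m", "f"],
    "hours":  [40, np.nan, 38, np.nan, 45, 40],
    "income": [">50K", ">50K", "<=50K", ">50K", ">50K", "<=50K"],
})
has_na = d.isna().any(axis=1)
for name, subset in [("all rows", d), ("with NA", d[has_na]), ("w/o NA", d[~has_na])]:
    print(name, round(spd(subset, "sex", "m", "income", ">50K"), 3))
# all rows 0.667, with NA 0.0, w/o NA 1.0
```

In this toy example the rows with missing values happen to be perfectly fair (SPD = 0) while the clean rows are maximally biased, mirroring the kind of contrast reported in Table 1.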
The really striking result appears when we look at the metrics for the subsets of instances with missing values ("SPD (rows with ⊙)"). Ten of the 12 cases analysed show that these subsets are fairer (their values are closer to zero) than the subsets containing only clean rows (without missing values). And in 11 out of 12 cases, the bias of the subset with missing values moves in the opposite direction from the bias shown in the whole data set (in Credit Approval with Ethnicity the compensation goes too far). The difference can be seen more clearly in the means in the last row. How do we interpret this finding? It is difficult to explain without delving into the particular characteristics of each data set. For instance, for Adult, this means that for those individuals with missing values in occupation, workclass or native.country, there is less difference in the probability of a favourable outcome (>50K) between privileged and unprivileged groups than for the other rows. Note that this observation is about the data; we are not yet talking about the bias that a machine-learning model can generate. Furthermore, there seems to be no general relationship between the degree of fairness and the ratio of missing values (at least for this collection of data sets). In this regard, we have analysed different ratios of missing values per data set (undersampling) with multiple repetitions, obtaining almost constant results for the fairness metric (see Appendix B for further details).

Algorithmic bias will be affected by the metric that is used for performance. If the metric is a proper scoring rule, the optimal value is attained by the perfect model, that is, a model that is 100% correct. But a perfect classifier will necessarily have exactly the same fairness metric values as the test data set.
This leads us to the following observation: the SPD of the test data set determines the space of possible classifiers, that is, a region that is bounded by the best trade-off between the performance metric and SPD. We can precisely characterise this region. For instance, if we choose accuracy as the performance metric, the space of possible classifiers in terms of the balance of SPD and accuracy is bounded by an octagon. The proof and a graphical representation of this octagon can be found in Appendix A. In what follows, and especially in Section 6, the octagons for each particular problem (also in Appendix A) will be used to see how far the achieved trade-offs between fairness and accuracy are from the optimal result or from some trivial cases, such as the majority classifier.
The value of the SPD for each data set (which assumes a perfect model) determines the octagon. If we go back to Table 1, we can see that the SPD-accuracy space of the complete case, given by SPD (all rows), can reach better values than the space determined by the data set without missing values in nine of the 12 cases. This is even more extreme for the subset including only the rows with missing values. A perfect classifier for these data sets could have almost perfect SPD for Adult with Race, Recidivism with Sex and Juvenile Offenders with Estranger. On the contrary, the perfect classifier for the subset of examples without missing values would be unfair in all cases except Credit Approval with Sex. As a result, we see two problems with choosing this mutilated data set that only considers rows without missing values. First, the perfect 'target' that this data set determines is wrong, because any model has to be evaluated with all the examples, not only a sample that is not chosen at random. Second, it will also lead to an unfair model. In the following sections we will explore the location of classifiers in this space and see how closely we can approach the boundary of the possible space.
Independently of the explanation of the remarkable finding that the subset with missing values is usually fairer than the rest (83.3% of the cases in Table 1, with 11 out of 12 cases opposing the dominant bias), the main question is: if rows with missing values are usually fairer, why does everybody get rid of them when learning predictive models? As we saw, some of the libraries reviewed in Section 3 apply either LD (e.g., AIF360 and Fairness-Comparison) or CD (Themis-ML), or they assume that the data set is clean of missing values (Aequitas). These libraries do this because many machine-learning algorithms (and all fairness mitigation methods) cannot handle missing values, and LD seems the easiest way to get rid of them. However, do we have better alternatives? As we saw, IMs could replace the missing values and keep these rows, which seem to carry less biased information, as shown in the previous 12 cases. Before exploring the results with LD and a common IM, we need to explore how these missing values affect machine-learning models and the fairness metrics of the trained models, in comparison with the original fairness metrics. In other words, do rows with missing values contribute to bias amplification more or less than the other rows? That is what we explore in Section 5.
Since missing values are ugly and uncomfortable, they are eliminated in one way or another before learning can take place. Actually, most machine-learning methods (or at least most of their off-the-shelf implementations) cannot handle missing values directly. There are a few exceptions, and analysing the results of a method that does consider missing values can shed light on their effect on fairness when learning predictive models. One such exception is decision trees. During training, missing values are ignored when building the split at each node of the tree. During the application of the model, if a condition cannot be resolved because of missing values, the example can go through one of the children chosen at random, or go through all of them and aggregate the probabilities of each class. Another important reason why we choose decision trees for this first analysis is that they can be understandable (if of moderate size) and ultimately inspectable, which allows us to see what happens with the missing values and where unfairness is created. Also, the importance of each attribute can be derived easily from the tree.
In particular, we are using the classical CARTs 46 in the implementation provided by the rpart package. 73 This implementation treats missing values using surrogate splitting, which allows the values of other input variables to be used to perform a secondary split for observations with a missing value for the variable of the best (primary) split (see Section 5 in Reference [74] for further information). Because the six data sets we are using are not very imbalanced (at most 76% for Adult), and because accuracy is the most common metric in many studies of fairness, we will stick to this performance metric.
We separated 25% of the original data set for testing, disregarding the existence of missing values for the split. Consequently, this test data set has a mixture of rows with and without missing values approximately equal to that of the whole data set. Then, for training the decision tree, we used four different training sets. The "all rows" case used all the rows not used for testing. The "with ⊙" case used the subset of these that have missing values. The "without ⊙" case used the subset of the "all rows" training set whose rows do not have missing values. Finally, for comparison, we took a small sample of the latter, of the same size as the training set "with ⊙." We used 100 repetitions of the training/test split, where the fairness metrics (and the accuracies) are calculated with the test set labelled by the decision tree, and then averaged over the 100 repetitions. Table 2 shows the results of CART decision trees for the test set and the 12 cases we are considering. We first compare the fairness results of the model over the data with all rows against the original fairness results of the data set given in Table 1. We see that for Adult and Autism the bias is reduced (represented by ∇), but it is worse for Recidivism, Titanic and Juvenile Offenders, for which bias is amplified (represented by ∆). For the Credit Approval data set, the bias is amplified or reduced depending on the protected attribute. When we look at the other subsets, again comparing them with the data fairness in Table 1, we see a similar picture for the subset without missing values, but almost the opposite situation for the subset with missing values. Again, the rows with missing values behave very differently. Still, if we analyse which row selection is best to obtain the least biased model, we see that learning from the rows with missing values is better for fairness, now reaching the best SPD in 11 out of 12 cases.
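One repetition of this experimental protocol can be sketched as follows (a rough, assumption-laden sketch: the toy data frame and function names are illustrative, and model training is omitted). It builds the test split and the four training subsets compared in Table 2.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def split_subsets(df, test_frac=0.25):
    """One repetition: draw a test split regardless of missingness, then
    build the four training subsets ('all rows', 'with NA', 'w/o NA' and
    a size-matched sample of the clean rows)."""
    test_idx = rng.choice(df.index, size=int(len(df) * test_frac), replace=False)
    test = df.loc[test_idx]
    train = df.drop(test_idx)
    has_na = train.isna().any(axis=1)
    with_na = train[has_na]
    without_na = train[~has_na]
    # Size-matched control: a random sample of the clean rows, as large
    # as the subset with missing values.
    sample = without_na.sample(n=min(len(with_na), len(without_na)),
                               random_state=0)
    return test, {"all rows": train, "with NA": with_na,
                  "w/o NA": without_na, "sample w/o NA": sample}

df = pd.DataFrame({"x": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0, 7.0, 8.0],
                   "y": [0, 1, 0, 1, 1, 0, 1, 0]})
test, subsets = split_subsets(df)
print({name: len(part) for name, part in subsets.items()})
```

In the full protocol this split would be repeated 100 times, a model trained on each subset, and the fairness and accuracy metrics averaged over the repetitions.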
Finally, to appreciate how much of the loss in accuracy was due to the characteristics of the rows rather than the number of examples, we have the "sample w/o ⊙" setting, shown in the two rightmost columns of the table. From the results, we see that sample size seems to be the key factor. But not only is accuracy degraded: SPD also becomes better than for the full sample, something that is easier to understand in view of the octagons and the trend towards the perfect model, which is biased. Of course, this analysis disregards the performance of the model. We know that we can easily obtain a perfectly fair model by using trivial models, such as the one that always predicts the majority class (with a baseline accuracy). Figures 4 and 5 show a bidimensional representation with the performance metric (accuracy) on the y-axis and the fairness metric (SPD) on the x-axis, where we also plot this majority-class model. By looking at the trade-off between accuracy and SPD, the picture gets more complex but also more interesting. While the model learning only from the rows with missing values shows poor accuracy (because the sample is small), it gets an intermediate fairness result, usually midway between the majority model and the model with all data. This is again in accordance with the straight line between the majority model and the best model in the octagons. The result including all rows almost always has the highest accuracy, but it is fairer than the subset without missing values in only seven of the 12 cases, and these results are always very close. The big difference happens with the small samples. There are cases, such as Autism, where the model with only rows with missing values gets accuracies that are worse than the majority model, while in other cases this happens for all samples (Juvenile Offenders). 5 To understand the effect of the missing values better, we analyse the effect of those attributes that have missing values.
Table 3 uses the same configuration as Table 2, except that the columns with at least one missing value were removed for training (CD). The results are different, especially for Titanic, where bias is substantially reduced for the passenger Class and amplified for Sex, for all subsets and in comparison with both the data and the model with all columns. Still, the SPD for the subset of rows with missing values is the best, even more consistently than in the case of training with all columns, now for all 12 cases. The surprise in this case comes when we look at accuracy, which is not strongly affected by removing the attributes (and in many cases the results are better).

F I G U R E 4 Visualisation of the results from Table 2 also including the majority class model. The x-axis shows accuracy and the y-axis shows SPD. The dashed grey line shows 0 SPD (no bias and perfect fairness), and the grey area goes between −0.1 and 0.1, a reasonably fair zone that helps see the magnitudes better. SPD, Statistical Parity Difference [Color figure can be viewed at wileyonlinelibrary.com]
This suggests that including these rows, but treating the columns in a better way, could be beneficial. As we want to explore other machine-learning methods, most of which do not deal with missing values (and we must keep those rows), Section 6 turns our analysis to imputation.

| TREATING MISSING VALUES FOR FAIRNESS: DELETE OR IMPUTE?
Even if most libraries simply delete the rows with missing values, some IMs are so simple that it is difficult to understand why this option is not offered in these packages (or included in the fairness research literature). Fortunately, we can apply a preprocessing stage to every data set with missing values, in which we can use any IM that we may have available externally. Nevertheless, IMs are not agnostic, and they can introduce bias as well, which can have an effect on fairness. As we saw in Section 2, there are many reasons for missing values, and some IMs (e.g., those that impute using a predictive model) can amplify certain patterns, especially if the attributes with missing values are related to the class or the protected groups. In this first analysis we therefore prefer to use a simple IM, namely, the one that replaces the missing value by the mean if the attribute is quantitative and by the mode if it is qualitative. Once the whole data set has gone through this IM, we can extend the number of machine-learning methods that we can apply. We are using six machine-learning techniques from the Caret package 75: Logistic Regression (LR), Naive Bayes (NB), Neural Network (NN), Random Forest (RF), the RPart decision tree (DT) and a Support Vector Machine (SV) with a linear kernel. These six methods were chosen to give a representative selection of machine-learning methods. They also represent very different ways of treating attributes, [76][77][78] so that we could better understand the impact of missing values (and, to an extent, of protected attributes) on the classifications.

F I G U R E 5 Visualisation of the results from Table 2 also including the majority class model. The x-axis shows accuracy and the y-axis shows SPD. The dashed grey line shows 0 SPD (no bias and perfect fairness), and the grey area goes between −0.1 and 0.1, a reasonably fair zone that helps see the magnitudes better. SPD, Statistical Parity Difference [Color figure can be viewed at wileyonlinelibrary.com]

T A B L E 3 Results with the same configuration as Table 2 and Autism (age and relation).
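The mean/mode IM described above can be sketched in a few lines (the experiments in the paper use R's Caret; this pandas version, with hypothetical column names, only illustrates the idea):

```python
import numpy as np
import pandas as pd

def impute_mean_mode(df):
    """Replace NaNs by the column mean (numeric attributes) or the
    column mode (categorical attributes)."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Toy data with one missing value per column.
d = pd.DataFrame({"age": [20.0, np.nan, 40.0],
                  "workclass": ["private", None, "private"]})
print(impute_mean_mode(d))  # NaNs replaced by 30.0 and "private"
```

After this preprocessing the data set contains no missing values, so any off-the-shelf learner can be trained on all rows.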
As seen in Section 5, DTs use attributes very explicitly: an attribute may be used (once or more) in the branch leading to the classification of an instance, and that can happen for all, some or no instances. LR, on the other hand, may derive a small coefficient for an attribute, and this sets its influence for all examples. A high coefficient (when all attributes are normalised) is assumed to have a high impact, but higher values of other attributes with lower coefficients may also affect a decision. For both DTs and LR, the relevance of one attribute can mask the relevance of another, highly correlated, attribute. This does not happen for NB, which assumes independence between the attributes. For NB, it is easy to understand the particular contribution to each decision, as this is just a product of the attribute and class estimates. The question is more convoluted for NNs (with at least one hidden layer), given their nonlinearity. Apart from particular attribute importance metrics for NNs, 79 other specially designed techniques can unveil the relevance of an attribute for each decision. 80 Instead of painstakingly analysing each model (rules, coefficients, weights, etc.) separately for each scenario, we will try to further understand the potentially different behaviour (and, thus, performance) of each of the six previous techniques when dealing with missing values. To do this in an effective way, we introduce a metric of sensitivity that can be applied to any machine-learning method (but is comparable to some technique-specific measures, such as the proportion of examples affected by a condition in a DT or the attribute importance in NNs). We want to anticipate how a model will behave during deployment in different contexts with missing values. In this way, sensitivity could be used as an extra variable to determine whether to include the rows with missing values.
Our sensitivity analysis has been performed as follows. Given a data set D split into train and test, and a model trained on the former, for each test example we modify the original value of a specific attribute X_i (containing missing values), setting X_i = a systematically, with the value a following the original distribution of X_i. Next, we count how many of these variations are predicted into the positive class, denoted by Pos_{X_i=a}, and similarly Neg_{X_i=a} for the negative class. Therefore, for each test example and attribute X_i, the sensitivity is defined as

sensitivity(X_i) = min(Pos_{X_i=a}, Neg_{X_i=a}) / (Pos_{X_i=a} + Neg_{X_i=a}),

that is, the proportion of variations that are predicted into the minority class by the model. Sensitivity ranges between 0.5 (maximum variance in the predictions) and 0 (the variations in the attribute do not affect the predictions).
Specifically, we calculate this metric as follows. From each original data set D, we separate 25% of the data for testing. For each test example and attribute containing missing values, we vary the original value 100 times following the original categorical/numerical distribution, and then we aggregate (average) the sensitivity over all the test examples. These sensitivity results can be seen in Table 4. In general, the values are low but, before looking at them in more detail, we have to clarify that this effect depends, in the first place, on the variability of the attribute. For instance, attribute c_charge_desc has over 400 different values, and its corresponding sensitivity values can be very high. With this in mind, we can better understand the magnitudes in the table. SV is the algorithm with the lowest sensitivity values overall. This can be explained by its use of high-dimensional kernels combining many attributes, so that a single attribute change does not move the example to the other side of the boundary. NB and RF also have low sensitivity for two of the three data sets. This robustness to changes in a single attribute has already been reported in the literature 45,81,82 and is in line with previous studies about the effect of attribute noise. 83 Some other models (LR and DT) show high variability depending on the attribute, which can be explained by the correlations between these attributes, as mentioned above (also shown in Figure 3). Finally, NN has high values overall, which is related to the well-known effect of outliers in some attributes, most especially if the networks are not too deep. 84 This analysis leads to the conclusion that, in general, the low sensitivity values observed (lower than 10% in 115 out of 126 cases) reinforce the view that the rows with missing data contribute more to fairness (and should be included in the analysis) than they harm the robustness of the machine-learning models.
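The sensitivity computation above can be sketched as follows (a minimal illustration with a hypothetical threshold classifier; in our experiments the role of `predict` is played by each of the six trained models):

```python
import random

def sensitivity(predict, example, attr, attr_values, n_variations=100, seed=42):
    """Vary one attribute of a test example following its empirical
    distribution and return the proportion of variations predicted into
    the minority class: 0 means the predictions never change, 0.5 means
    maximum variance in the predictions."""
    rng = random.Random(seed)
    pos = 0
    for _ in range(n_variations):
        varied = dict(example)
        varied[attr] = rng.choice(attr_values)  # draw from the empirical distribution
        pos += predict(varied)                  # predict returns 1 (positive) or 0
    neg = n_variations - pos
    return min(pos, neg) / (pos + neg)

# Hypothetical classifier: positive iff age >= 30
clf = lambda x: int(x["age"] >= 30)
observed_ages = [18, 22, 25, 41, 47, 52]   # empirical distribution of `age`
s = sensitivity(clf, {"age": 25}, "age", observed_ages)
# A constant classifier has sensitivity 0, whatever the variations:
s0 = sensitivity(lambda x: 1, {"age": 25}, "age", observed_ages)  # -> 0.0
```

In the full procedure this per-example value is averaged over all test examples, yielding one entry per attribute and technique as in Table 4.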
Once we can determine whether performance may be affected by a wrong imputation of missing values, Figures 6 and 7 show the results of the models trained and evaluated on the imputed data sets for the six machine-learning models, the majority class model (Majority) and a perfect model (Perfect). We have introduced this perfect model as a reference to enrich our analysis and allow direct comparison in the octagons. Note that this is analogous to what we did for Table 1, but for the test set only (which explains why the values are roughly the same as those in the table). These points are not achievable in practice (only theoretically), but they help us understand the bias that is already in the data, and how much it is amplified. We can compare these plots with the octagons (the space of possible classifiers, following Proposition A1) for each of the 12 combinations, shown as separate plots # in Appendix A. This also reminds us that the perfect model is not unbiased.
For each machine-learning model in Figures 6 and 7, we show the results when trained after removing the rows with missing values (Deletion) and when trained after imputation (Imputation). Note that all models are evaluated on a test set to which imputation has been applied; otherwise, we would not be able to compare all the options on the same data.
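This evaluation protocol can be sketched as follows (a minimal illustration; the majority-class learner, the mean-based imputer and the toy data are stand-ins for the actual techniques and data sets):

```python
def fit_majority(rows):
    """Toy learner standing in for the actual techniques: always
    predicts the majority class of its training rows."""
    labels = [r["label"] for r in rows]
    maj = max(set(labels), key=labels.count)
    return lambda r: maj

def impute_mean(rows, attr):
    """Fill missing values of `attr` with the mean of the observed ones."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    m = sum(observed) / len(observed)
    return [dict(r, **{attr: m}) if r[attr] is None else dict(r) for r in rows]

def compare_protocols(train, test, attr):
    """Train under Deletion (complete rows only) and Imputation (all rows,
    imputed); both models are evaluated on the same imputed test set."""
    complete = [r for r in train if r[attr] is not None]
    test_imp = impute_mean(test, attr)
    results = {}
    for name, rows in [("Deletion", complete),
                       ("Imputation", impute_mean(train, attr))]:
        model = fit_majority(rows)
        results[name] = sum(model(r) == r["label"] for r in test_imp) / len(test_imp)
    return results

train = [{"age": 20, "label": 0}, {"age": 40, "label": 0},
         {"age": None, "label": 1}, {"age": None, "label": 1},
         {"age": None, "label": 1}]
test = [{"age": 30, "label": 1}, {"age": None, "label": 1}]
accs = compare_protocols(train, test, "age")
# Here deleting the rows with missing values flips the majority class,
# so the two protocols give different accuracies on the same test set.
```

The toy example is deliberately extreme, but it shows why both options must share the imputed test set for the comparison to be like-for-like.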
The first observation we can make is that, in terms of accuracy, imputation is generally beneficial for all cases and almost all techniques. In terms of fairness, we see in Figures 6 and 7 that for Adult, Autism and Credit Approval with Sex all methods reduce bias with respect to the perfect model, which is quite remarkable. The opposite happens for Recidivism and Titanic with Sex, while Titanic with Class and Credit Approval with Ethnicity are more mixed, depending on the technique. Juvenile Offenders presents a particular behaviour, as many models are worse than the majority class model.
When we look at the trade-off between fairness and performance comparing deletion and imputation, we see the expected trend of higher performance for imputation implying less fairness, with the exception of Titanic with Class. However, if we look more carefully method by method, we see that RF behaves quite differently: using imputation increases accuracy significantly but does not amplify (and sometimes reduces) bias. Actually, RF is the only method that draws an almost straight line between the Deletion point, the Imputation point and the Perfect point, except for Credit Approval with Ethnicity and the worse-than-majority behaviour for Juvenile Offenders.
Many other models are far from the segment joining the majority class model and the perfect model, and hence very far from the boundary of ideal models that we show in Proposition A1 in Appendix A, but there are exceptions, such as Autism with Gender and Credit Approval with Sex, and the particular case of model SV for Titanic with Class. This would be the ideal behaviour, but it depends more on the learning model than on the imputation, as the separation already happens with the deletion results. DTs are usually in a good position, dominating many other classifiers (closer to the straight line between majority and perfect). This is good news, as DTs have the advantage of being interpretable, their decisions conditioned on some attributes are easy to extract, and they could be manually modified to reach different positions of the accuracy-versus-bias space.

| DISCUSSION AND FUTURE WORK
While a predictive model can be trained on any particular subset of the data (e.g., excluding missing values), we need to evaluate any predictive model on all the examples (otherwise we would be cheating with the results and the model would ultimately need to delegate these examples to humans). Consistently, if we want to use techniques that do not handle missing values during deployment, we need to delete the columns or use IMs. Still, we can use different techniques and learn the models without these rows, and see if they are better or worse than the models trained with the rows after imputation. This is exactly what we have done in the analysis in Section 6 and it is our recommendation as a general practice. Using these visualisations and performing the same procedures and analyses, we can locate the baseline classifiers and the space of possible classifiers, representing several techniques inside it. We can then have a clearer view of what to do for a particular problem, and which classifier we have to choose depending on the requirements of fairness (on the y-axis) or performance (on the x-axis).

FIGURE 6 Performance (accuracy) and bias (SPD) for six machine-learning models, using different subsets of data for training. Imputation: all data, with the missing values being imputed. Deletion: only the data without missing values. The test set always uses imputation. The plots also include two baselines: Majority (a majority class model) and Perfect (a perfect model, i.e., using the correct label from the test set). DT, decision tree; LR, logistic regression; NB, naive Bayes; NN, neural network; RF, random forest; SPD, Statistical Parity Difference; SV, support vector [Color figure can be viewed at wileyonlinelibrary.com]
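The bias metric on the y-axis of these plots, Statistical Parity Difference (SPD), can be computed as follows (a minimal sketch assuming the usual definition, the favourable-prediction rate of the unprivileged group minus that of the privileged group; the group labels are hypothetical):

```python
def spd(predictions, groups, unprivileged, privileged):
    """Statistical Parity Difference: rate of favourable (positive)
    predictions in the unprivileged group minus the rate in the
    privileged group. 0 means parity; negative values indicate
    bias in favour of the privileged group."""
    def pos_rate(g):
        preds = [p for p, grp in zip(predictions, groups) if grp == g]
        return sum(preds) / len(preds)
    return pos_rate(unprivileged) - pos_rate(privileged)

preds  = [1, 0, 0, 1, 1, 0]              # 1 = favourable class
groups = ["F", "F", "F", "M", "M", "M"]  # hypothetical protected attribute
bias = spd(preds, groups, unprivileged="F", privileged="M")
# bias = 1/3 - 2/3 = -1/3: the model favours the privileged group
```

Plotting this value against accuracy for each trained model reproduces the kind of SPD-accuracy space shown in the figures.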
At a theoretical level, new machine-learning techniques are being introduced where one can declare whether an attribute is protected. In the future, we should also investigate new methods to determine what to do with the missing values. We could also add causal models of how the attributes are related to (or controlled by) the primitive causes of missingness and unfairness, which would ensure better traceability than current methods. With the off-the-shelf methods we have available today, however, it is much more plausible that we need to analyse the results for a particular problem (data set), favourable class and protected groups, and find the best compromise on the Pareto front between the chosen performance and fairness metrics, as we have illustrated in this paper.

FIGURE 7 Performance (accuracy) and bias (SPD) for six machine-learning models, using different subsets of data for training. Imputation: all data, with the missing values being imputed. Deletion: only the data without missing values. The test set always uses imputation. The plots also include two baselines: Majority (a majority class model) and Perfect (a perfect model, i.e., using the correct label from the test set). DT, decision tree; LR, logistic regression; NB, naive Bayes; NN, neural network; RF, random forest; SPD, Statistical Parity Difference; SV, support vector [Color figure can be viewed at wileyonlinelibrary.com]
Recently, some interesting papers have proposed new imputation methods for missing values. Many of these methods are based on machine-learning techniques that have gained popularity recently, such as Generative Adversarial Nets, 85 fuzzy-rough sets, 86 deep generative models 87 and autoencoders. 88 All these methods claim to improve the accuracy of the final machine-learning models. However, given the complexity of both phenomena (missing values and fairness) and their interaction, it is quite unlikely that a new (predictive) imputation method works well for all techniques and all possible bias mitigation methods (and all the fairness and performance metrics). In fact, in our experiments, we detect the expected trend of higher performance implying less fairness; therefore, these methods could imply a significant degradation in the fairness of the models.
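To illustrate what "predictive" imputation means in the simplest possible terms, the sketch below fills the gaps of one attribute with a one-variable least-squares fit on another, fully observed attribute (a toy stand-in for the GAN- or autoencoder-based imputers cited above): unlike mean imputation, it propagates patterns learnt from the observed data, which is precisely the mechanism by which such imputers can also propagate bias.

```python
def regression_impute(xs, ys):
    """Fill missing entries of `ys` (None) using a least-squares line
    fitted on the pairs where `ys` is observed; `xs` is fully observed."""
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(obs)
    mx = sum(x for x, _ in obs) / n
    my = sum(y for _, y in obs) / n
    slope = (sum((x - mx) * (y - my) for x, y in obs)
             / sum((x - mx) ** 2 for x, _ in obs))
    return [my + slope * (x - mx) if y is None else y
            for x, y in zip(xs, ys)]

xs = [1.0, 2.0, 3.0, 4.0]   # fully observed attribute
ys = [2.0, 4.0, None, 8.0]  # attribute with a missing value
filled = regression_impute(xs, ys)
# filled[2] is (up to rounding) 6.0: the gap follows the learnt pattern
# y = 2x, whereas mean imputation would insert 14/3 regardless of x
```

If `xs` happened to correlate with a protected attribute, the imputed values would inherit that correlation, which is why we argue such methods must also be evaluated for fairness, not only for accuracy.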
Another avenue of future analysis could explore the artificial generation of synthetic missing data, to complement the 12 scenarios with both fairness and missing-value issues that we have analysed (and arranged as a publicly available benchmark). This could follow a univariate/multivariate configuration, at different percentages (missing rates) and according to distinct missingness mechanisms: MAR, MCAR and MNAR. However, to be meaningful, we would need to build a causal model and its relation to fairness (otherwise, conclusions derived from ill-defined configurations would be invalid). Similarly, there is a huge number of combinations of fairness and performance metrics that could be explored. For instance, for very imbalanced data sets, it may be more interesting to at least compare the results with metrics such as AUC, and understand the space in this case. Note that this study should be accompanied by classifiers that predict scores or probabilities, jointly with a proper analysis of the relation between calibration and both fairness and missing values. 89
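As a first step in that direction, generating MCAR missingness is straightforward to sketch; MAR and MNAR mechanisms would instead condition the erasure probability on other attributes or on the erased value itself, which is where the causal model becomes necessary:

```python
import random

def ampute_mcar(rows, attr, missing_rate, seed=0):
    """Erase `attr` completely at random: each row loses its value
    independently with probability `missing_rate`, regardless of any
    observed or unobserved value (the MCAR mechanism)."""
    rng = random.Random(seed)
    out = []
    for r in rows:
        r = dict(r)  # do not mutate the caller's rows
        if rng.random() < missing_rate:
            r[attr] = None
        out.append(r)
    return out

data = [{"age": float(a)} for a in range(1000)]
amputed = ampute_mcar(data, "age", missing_rate=0.2)
observed_rate = sum(r["age"] is None for r in amputed) / len(amputed)
# observed_rate is close to the requested 0.2 missing rate
```

Sweeping `missing_rate` and the chosen mechanism would produce the configurations described above.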

| CONCLUSIONS
We have presented the first comprehensive analysis of the relation between missing values and fairness. We investigated the causes of missing values and the causes of unfair decisions, and we found many conceptual connections and common underlying causes. The surprising result was that, for the data sets studied in this paper, the examples containing missing values seem to be fairer than the rest. Unfortunately, missing values are usually ignored in the literature, even though we can easily turn them from ugly to graceful if handled appropriately. Indeed, we have seen that they incorporate information that could be useful to limit the amplification of bias, but they can also have a negative effect on accuracy if not dealt with conveniently. As a more precise list of lessons learnt, recovering the questions from the introduction, we give the following answers: (1) Missing data and fairness are closely related and have common causes. From the empirical results in our experimental setting, we see that the correlations between the protected attributes and the attributes with missing values are usually small for the different data sets, so the effect may happen through other proxy variables. (2) Subsamples with missing data are less unfair, but using partial samples is usually counterproductive in terms of accuracy, as fairness and accuracy are usually presented as a trade-off. The octagons give us the perspective that, while many plots get into the unfair areas as accuracy grows, this is usually worse than a straight line from the cases without imputation to those with imputation. (3) IMs can find a good trade-off between performance and fairness: imputing missing values is an easy and effective way to deal with them, instead of discarding the rows with missing data. This is also reinforced by the sensitivity analysis performed for the attributes with missing values.
In short, apart from the well-known benefits of imputation for accuracy, we found that IMs should also be used (and improved) for the purpose of fairness.
Of course, as said above, more experimental analysis and new techniques could shed more light on the question. However, given the complexity of both phenomena (missing values and fairness) and their interaction, it is quite unlikely that a new (predictive) imputation method works well for all techniques and all possible bias mitigation methods (and all the fairness and performance metrics). Instead, we have followed a reproducible methodology, based on representations and metrics, and for each case the best choice has to be found, as we have recommended in Section 7. In the end, fairness is a delicate issue for which context-insensitive modelling pipelines may miss relevant information and dependencies for a particular problem. For the moment, we have learnt that ignoring the missing values is a rather ugly practice for fairness.

ACKNOWLEDGEMENTS
098. F. Martínez-Plumed acknowledges funding of the HUMAINT project by DG CONNECT and DG JRC of the European Commission. J. Hernández-Orallo is also funded by an FLI grant RFP2-152.

ENDNOTES
* As opposed to COMPAS, 26 SAVRY is not a system based on machine-learning models, but an assessment protocol that guides the evaluating expert through the individual features that make up the overall risk assessment. 27
† https://cran.r-project.org/web/packages/BaylorEdPsych/index.html
‡ In this study we only consider binary problems where one class is the "favourable" case (c + ) and the other is the "unfavourable" case (c − ). In the case of multiclass problems, favourable classes can be merged into a single label c + and the rest merged into c − .
§ For the sake of reproducibility, all the code and data can be found in https://github.com/nandomp/missingFairness
¶ In this paper, we consider a limited version of the original data set. We only use 22 of the 100 original attributes; we exclude the other 78 because they have more than 30% of missing values. The use of a limited number of attributes explains the poor results obtained by the machine-learning models.
# We do not include the octagons in Figure 6 to keep the plots clean, but also because the results in these plots are the average of several repetitions, while calculating means of octagons is not conceptually correct. Instead, we show the octagons for the whole data set in Appendix A. With the separation, it is also easier to bear in mind that the bounding octagons can have small variations for each of the partitions.

We denote by D the data labelled with a classifier, so that it has an accuracy and an SPD as follows. Let us start with the perfect classifier, and let us assume w.l.o.g. that its SPD is positive.

SPD positive
We consider that the favourable class is the positive class, so there are four cases in which a prediction can change (converting a TP into an FN or a TN into an FP, either for the examples with X = a or for those with X ≠ a). As we did not have FP nor FN (because we started from the perfect classifier), and all the TPs for X = a have been converted into FN, we only have TN and FN for X = a: the classifier always predicts the negative class when X = a and remains always correct when X ≠ a. Then, when case 2 is exhausted, we can start with case 3 and move, one by one, all the elements from TN to FP (exactly all the original TN instances in D_{X ≠ a}). Again, the last step is explained because we did not have FP nor FN (we started from the perfect classifier) and all the TPs for X = a have been converted into FN, so we only have TN and FN for X = a, which means the classifier always predicts the negative class when X = a. As all the TNs for X ≠ a have been converted into FP, we only have TP and FP for X ≠ a, which means it always predicts the positive class when X ≠ a. This is the most negative bias, SPD = −1.
Note that, as the value of SPD is already negative after all the case-2 conversions, we know that perfect fairness (SPD = 0) has been achieved before exhausting case 2. However, the segments connecting these points and the original perfect classifier need not contain the majority classifier, which will usually lie off these two segments.
And now, if we want to explore the other two cases, we are at SPD = −1 and we can only increase. If |D_{X = a}| < |D_{X ≠ a}|, then we want to increase SPD as slowly as possible to cover the whole space, so it is now more advantageous to apply case 4, as its steps are smaller.
As in the previous cases, we now change TP to FN, so it is the positives in D_{X ≠ a} that we lose, moving incrementally over the previous point.

SPD negative
Finally, the scenario where the SPD of the perfect classifier is negative is analogous. □

An example of the octagonal SPD-Accuracy space for a test data set is included in Figure A1. The data set presents the following metrics: SPD = 0.

(obtaining different subsets of rows) and then we average the results. In Figure B1 we show the average SPD values (and error bars) for different subsets of data and privileged group ("Class or Sex"). We can see:
• The first item ("SPD (w/o Missing)") is included just for reference and should be constant, as it is not affected by the subsampling.
• If we look at the second one ("SPD (w Missing)"), we can see that the fairness metric remains almost constant across the percentages of missing values (only the standard deviation decreases as the set of rows containing missing values increases).
• Similarly, for the latter two we cannot find significant differences in the fairness metric, as it remains almost constant for the different subsets of rows containing missing values.

FIGURE A3 SPD-accuracy spaces for the data sets Autism, Credit Approval and Juvenile Offenders of Table 1. We also show the majority class classifier with a cross, and connect it to the perfect classifier (represented by a star). SPD, Statistical Parity Difference [Color figure can be viewed at wileyonlinelibrary.com]

Note that similar results can be obtained with the rest of the data sets in our paper, but with even smaller variations due to the lower number of instances with missing values.
In brief, this illustrative example shows us that there seems to be no general relationship between the degree of fairness and the ratio of missing values (at least for the set of data sets in our paper).