A survey on datasets for fairness-aware machine learning

As decision-making increasingly relies on Machine Learning (ML) and (big) data, the issue of fairness in data-driven Artificial Intelligence (AI) systems is receiving increasing attention from both research and industry. A large variety of fairness-aware machine learning solutions have been proposed which involve fairness-related interventions in the data, learning algorithms and/or model outputs. However, a vital part of proposing new approaches is evaluating them empirically on benchmark datasets that represent realistic and diverse settings. Therefore, in this paper, we overview real-world datasets used for fairness-aware machine learning. We focus on tabular data as the most common data representation for fairness-aware machine learning. We start our analysis by identifying relationships between the different attributes, particularly w.r.t. protected attributes and class attribute, using a Bayesian network. For a deeper understanding of bias in the datasets, we investigate the interesting relationships using exploratory analysis.


Introduction
Artificial Intelligence and Machine Learning are widely employed nowadays by businesses, governments and other organizations to improve their operational quality and assist in decision-making in areas such as loan approval (Mukerjee, Biswas, Deb, & Mathur, 2002), recruiting (Faliagka, Ramantas, Tsakalidis, & Tzimas, 2012), school admission (Moore, 1998) and risk prediction. Algorithmic decision-making has many advantages, as computers can efficiently analyze large amounts of data with high accuracy. Unfortunately, along with the advantages, there is plenty of evidence regarding the discriminative impact of ML-based decision-making on individuals and groups of people on the basis of protected attributes such as gender or race. As an example, racial bias was observed in COMPAS (Angwin, Larson, Mattu, & Kirchner, 2016), a software used by the U.S. courts to assess the risk of recidivism; in particular, it has been found that black defendants were assigned higher risk-of-recidivism scores than their actual risk, compared to white defendants. Another example refers to search algorithms in job search websites; it has been found that such algorithms exhibit gender bias as they display higher-paying jobs to male applicants compared to female ones (Simonite, 2015; Datta, Tschantz, & Datta, 2015).
Data are an essential part of machine learning. Usage of sensitive information during the learning process is undesirable but hard to guarantee even if known protected attributes are omitted from the analysis. The reason is the causal effects (Madras, Creager, Pitassi, & Zemel, 2019) of such attributes, including observable "proxy" attributes. As an example, the non-protected attribute "zip-code" was found to be a proxy for the protected attribute "race" (Datta, Fredrikson, Ko, Mardziel, & Sen, 2017) or the "credit rating" can be used as a proxy for "safe driving" (Warner & Sloan, 2021). Hence, even if the protected attributes like race or gender are not used, the resulting ML models can still be biased (Angwin et al., 2016) due to the causal effects of such attributes. Although methods for detecting proxy attributes exist, e.g., (Yeom, Datta, & Fredrikson, 2018) detects proxies in linear regression models by using a convex optimization procedure, eliminating all the correlated features might drastically reduce the utility of the data for the learning problem.
The domain of bias and fairness in machine learning has attracted much interest in recent years, and as a result, several surveys exist that provide a broad overview of the area, its technical challenges and solutions (Mehrabi, Morstatter, Saxena, Lerman, & Galstyan, 2021; Chhabra, Masalkovaitė, & Mohapatra, 2021; Pitoura, Stefanidis, & Koutrika, 2021; Xivuri & Twinomurinzi, 2021). However, an overview of the datasets used for fairness-aware machine learning evaluation is still missing. As data are a vital part of ML and benchmark datasets a decisive factor for the success of AI research 1 , we believe our survey serves to fill a gap in the extant research.
In this survey, we overview the different datasets used in the domain of fairness-aware machine learning, and we characterize them according to their application domain, protected attributes and other learning characteristics like cardinality, dimensionality and class (im)balance. For each dataset, we provide an exploratory analysis by first using a Bayesian network to identify the relationships among attributes. Based on the Bayesian network, we provide a graphical analysis of the attributes for a deeper understanding of bias in the dataset. The Bayesian network illustrates the conditional (in)dependence between the protected attribute(s) and the class attribute; thus, it reduces the space and complexity of data analysis that needs to be performed to discover and clarify the fairness-related problems in the dataset. We then focus our exploratory analysis on features having a direct or indirect relationship with the protected attributes. We accompany our exploratory analysis with a quantitative evaluation of measures related to predictive and fairness performance.
We believe that our survey is useful as it gathers many fairness-related datasets scattered around the web and organizes them in terms of different principles (application domain, learning challenges like dimensionality and class imbalance, fairness-related challenges like the number of protected attributes, etc.). As such, we expect that it will help researchers to easily select the most appropriate datasets for their application domain (e.g., learning analytics vs recidivism), learning challenges (e.g., balanced vs imbalanced classification), classification task (e.g., binary classification vs multiclass learning), and fairness-related challenges (e.g., single vs multiple protected attributes).
As datasets have played a foundational role in the advancement of machine learning research (Paullada, Raji, Bender, Denton, & Hanna, 2021), our survey also indicates the need for more open benchmark datasets that would reflect different application domains (from education and healthcare to recruitment and logistics), different contexts (e.g., spatial, temporal, etc.), various (machine) learning challenges (dimensionality, imbalance, number of classes, etc.) as well as different notions of fairness (multi-discrimination, temporal fairness, distributional fairness, etc.). We advocate that the community should also pay attention to benchmark datasets in parallel to new methods and algorithms. The area of fairness-aware machine learning will undoubtedly benefit from having benchmark datasets for various tasks.
The rest of the paper is structured as follows: In Section 2, we describe our methodology for dataset collection and evaluation. The most commonly used datasets for fairness are presented in Section 3 together with the results of their exploratory analysis. Section 4 demonstrates a quantitative evaluation of a classification model on the different datasets w.r.t. predictive performance and fairness. We summarize several open issues on datasets for fairness-aware machine learning in Section 5. Finally, the conclusion and outlook are summarized in Section 6.

Methodology of the survey process
In this section, we describe our dataset collection strategy and introduce Bayesian networks as a tool for learning the structure from the data. In addition, we provide a summary of fairness measures we will use for the quantitative evaluation.

Strategy for collecting datasets
To identify the relevant datasets, we use Google Scholar 2 with "fairness datasets" as the primary query term along with other terms like "bias", "discrimination" and "public" to narrow down the search. After identifying the related datasets, we use Google Scholar to find the related papers which satisfy the following conditions: 1) the public dataset is used in the experiments, and 2) the learning tasks, i.e., classification or clustering, are related to fairness problems. To restrict the investigation of related work, we consider only important works as assessed by the number of citations and the quality of the publication venue, i.e., published in ranked conferences or journals. We consider datasets that have been used in at least three fairness-related papers. Datasets that are not publicly available via some known repository like the UCI machine learning repository 3 , Kaggle 4 , etc., are not taken into consideration.

Bayesian network
A Bayesian network (BN) (Holmes & Jain, 2008) is a directed and acyclic probabilistic graphical model which provides a graphical representation for understanding the complex relationships between a set of random variables. In the case of a dataset, the random variables correspond to the attributes of the feature space in which the data are represented. The graphical structure M : {V, E} of a BN contains a set of nodes V (random variables/attributes) and a set of directed edges E. Let X_1, X_2, ..., X_d be the attributes defining the feature space X of a dataset D, such that X ∈ R^d. For two attributes X_i, X_j ∈ X, if there is a directed edge from X_i to X_j, then X_i is called the parent of X_j. The edges indicate conditional dependence relations, i.e., if we denote by X_{pa_i} the parents of X_i, the probability of X_i is conditionally dependent on the probability of X_{pa_i}. If we know the outcome (value) of X_{pa_i}, then the probability of X_i is conditionally independent of any other ancestor node. The structure of a BN describes the relationships between the given attributes, i.e., the joint probability distribution of the attributes in the form of conditional independence relations. Formally:

P(X_1, X_2, \ldots, X_d) = \prod_{i=1}^{d} P(X_i \mid X_{pa_i})     (1)

Learning the structure of a BN from the dataset D is an optimization problem (Husmeier, Dybowski, & Roberts, 2006), namely to learn an optimal BN model M which maximizes the likelihood of generating D. The parameters of a BN model M, denoted by |M|, are the edges E which represent the conditional independence relationships between the attributes V. Moreover, among the possible models M, the less complex one, i.e., the one with the smallest |M|, should be selected.
Note that in a learned BN model M, the class attribute y can occupy any position (root, internal or leaf node), since the objective is to maximize P(D | M). However, we aim to investigate the factors (protected/non-protected attributes) that determine the class attribute's prediction probability. Therefore, we additionally constrain the class attribute to be a leaf node in our learning objective. Formally, the problem is defined as:

M^{*} = \arg\max_{M} \left[ \log P(D \mid M) - \gamma |M| \right] \quad \text{subject to } y \in L     (2)

where y ∈ X is the class attribute, L is the set of leaf nodes and γ is a penalty hyperparameter controlling the effect of the model's complexity on the final model selection. The aim of the learned model is to maximize P(X_i | X_{pa_i}) for each X_i ∈ X (Eq. 1 and Eq. 2).
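The leaf-node constraint on the class attribute can be emulated in off-the-shelf structure-learning tools by forbidding all outgoing edges from the class. Below is a minimal sketch, assuming pgmpy's hill-climbing estimator with a BIC score and a discretized dataframe df whose class column is named "income" (both names are placeholders, not part of our pipeline).

```python
import networkx as nx
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore


def learn_bn_with_class_as_leaf(df: pd.DataFrame, class_col: str):
    """Learn a BN structure in which the class attribute has no outgoing edges."""
    # Forbid every edge that starts at the class attribute (leaf-node constraint).
    black_list = [(class_col, col) for col in df.columns if col != class_col]
    est = HillClimbSearch(df)
    return est.estimate(scoring_method=BicScore(df), black_list=black_list)


def path_from_protected(dag, protected: str, class_col: str):
    """Check for a direct edge and for any (direct or indirect) path to the class."""
    g = nx.DiGraph(dag.edges())
    direct = g.has_edge(protected, class_col)
    indirect = protected in g and class_col in g and nx.has_path(g, protected, class_col)
    return direct, indirect


# Hypothetical usage:
# dag = learn_bn_with_class_as_leaf(df, "income")
# print(path_from_protected(dag, "sex", "income"))
```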
A high conditional probability often indicates a strong correlation (Daniel, 2017). An attribute X_i is strongly correlated with X_j if there is a direct edge between X_i and X_j, for any pair of attributes X_i, X_j ∈ X. Intuitively, the correlation is comparatively weaker with ancestors that are not immediate parents, i.e., those connected via indirect edges. In addition, for attributes that have no incoming or outgoing connection (direct or indirect) to X_i, the correlation with X_i is negligible. As a consequence, if we find a direct or indirect edge from a protected attribute to the class attribute in the learned BN structure M, we may infer that the dataset is biased w.r.t. that protected attribute.
When learning a BN, continuous variables are often discretized because many BN learning algorithms cannot efficiently handle continuous variables (Chen, Wheeler, & Kochenderfer, 2017). Therefore, we discretize the continuous numeric attributes into meaningful categorical attributes to keep the complexity of learning the BN model polynomial.
We describe the discretization procedure for each dataset in Section 3.
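As a small illustration of this step, the snippet below bins numeric columns with pandas; the column names and bin edges are hypothetical examples, and the actual per-dataset choices are listed in Section 3.

```python
import pandas as pd


def discretize(df: pd.DataFrame, bins: dict) -> pd.DataFrame:
    """Replace numeric columns by categorical intervals defined by explicit bin edges."""
    out = df.copy()
    for col, edges in bins.items():
        # pd.cut maps each value to the interval it falls into.
        out[col] = pd.cut(out[col], bins=edges)
    return out


# Hypothetical example: bin "age" at 25 and "hours-per-week" at 40.
# df = discretize(df, {"age": [0, 25, 120], "hours-per-week": [0, 40, 168]})
```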

Fairness metrics
Measuring bias in ML models comprises the first step towards bias elimination. Fairness depends on context; thus, a large variety of fairness measures exists. In the computer science literature alone, more than 20 measures of fairness have been introduced thus far (Žliobaitė, 2017; Verma & Rubin, 2018). Nevertheless, there is no fairness measure that is universally suitable (Foster, Ghani, Jarmin, Kreuter, & Lane, 2016; Verma & Rubin, 2018). Therefore, to make the experimental results more diverse, we report on three prevalent fairness measures: statistical parity, equalized odds and Absolute Between-ROC Area (ABROCA). Statistical parity (Dwork, Hardt, Pitassi, Reingold, & Zemel, 2012) is one of the earliest and most popular discrimination measures in the fairness-aware ML literature and is also considered the statistical counterpart of the legal doctrine of disparate impact (Krop, 1981). However, one main disadvantage of statistical parity is that it does not take the ground truth labels into account; hence, it might not be ideal in many ML scenarios (Hardt, Price, & Srebro, 2016). Equalized odds, introduced by (Hardt et al., 2016), counters this problem by considering the ground truth of both positive and negative class instances and grew to be one of the most promising fairness notions, being used in leading-edge methods (Krasanakis, Spyromitros-Xioufis, Papadopoulos, & Kompatsiaris, 2018; Iosifidis & Ntoutsi, 2019). Later, (Gardner, Brooks, & Baker, 2019) argued that equalized odds does not consider any formal strategy, such as slicing analysis, to identify prevalent biases, which might be a necessity in particular domains such as education. The ABROCA measure introduced by (Gardner et al., 2019) tackles this issue and is argued to be an illustratively efficient way of representing the divergence across the values of a protected attribute. Although, as mentioned earlier, there is a rich literature of fairness notions, in this work we limit our study to the above-mentioned notions, as together they cover a diverse area of the fairness concepts currently followed in state-of-the-art fairness-aware ML practice.
The measures are presented hereafter assuming the following problem formulation: Let D be a binary classification dataset with class attribute y ∈ {+, −}. Let S be a binary protected attribute with S ∈ {s, \bar{s}}, in which s is the discriminated group (referred to as the protected group) and \bar{s} is the non-discriminated group (referred to as the non-protected group). For example, let S = "Sex" ∈ {Female, Male} be the protected attribute; s = "Female" could be the protected group and \bar{s} = "Male" the non-protected group. We use the notation s^+ (s^-) and \bar{s}^+ (\bar{s}^-) to denote the protected and non-protected groups in the positive (respectively, negative) class.

Statistical parity
Statistical parity (SP), introduced by (Dwork et al., 2012), states that the output of a classifier satisfies SP if the difference (bias) in the predicted outcome (\hat{y}) between any two groups under study (i.e., s and \bar{s}) is at most a predefined tolerance threshold ε. Formally:

\left| P(\hat{y} = + \mid S = s) - P(\hat{y} = + \mid S = \bar{s}) \right| \le \varepsilon     (3)

Based on the definition in Eq. 3, various notions for measuring the bias of a classifier (Simoiu, Corbett-Davies, & Goel, 2017; Žliobaitė, 2015) have been proposed. The violation of statistical parity can be measured as:

SP = P(\hat{y} = + \mid S = \bar{s}) - P(\hat{y} = + \mid S = s)     (4)

The value domain is SP ∈ [−1, 1], with SP = 0 standing for no discrimination, SP ∈ (0, 1] indicating that the protected group is discriminated, and SP ∈ [−1, 0) meaning that the non-protected group is discriminated (reverse discrimination).
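A direct implementation of Eq. 4 under these conventions (positive prediction rate of the non-protected group minus that of the protected group) could look as follows; y_pred and protected are assumed to be binary arrays, with protected marking membership in the protected group.

```python
import numpy as np


def statistical_parity(y_pred, protected):
    """SP = P(y_hat = + | non-protected) - P(y_hat = + | protected)."""
    y_pred = np.asarray(y_pred)
    protected = np.asarray(protected, dtype=bool)
    return y_pred[~protected].mean() - y_pred[protected].mean()
```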

Equalized odds
Equalized odds (shortly Eq.Odds) (Hardt et al., 2016) is preserved when the predictions \hat{y} conditioned on the ground truth y are equal for both groups s and \bar{s} defined by S. Formally:

P(\hat{y} = + \mid S = s, y) = P(\hat{y} = + \mid S = \bar{s}, y), \quad \forall y \in \{+, -\}     (5)

where y is the ground truth class label and \hat{y} is the predicted label.
Using Eq. 5, we can measure the prevalent bias as:

Eq.Odds_{viol} = \sum_{y \in \{+, -\}} \left| P(\hat{y} = + \mid S = s, y) - P(\hat{y} = + \mid S = \bar{s}, y) \right|     (6)

The value domain is Eq.Odds_{viol} ∈ [0, 2], with 0 standing for no discrimination and 2 indicating maximum discrimination.
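Following the reconstruction of Eq. 6 above, the violation can be computed by summing, over the two ground-truth classes, the absolute group difference in positive prediction rates. A minimal sketch:

```python
import numpy as np


def equalized_odds_violation(y_true, y_pred, protected):
    """Sum over y in {+, -} of |P(y_hat=+ | s, y) - P(y_hat=+ | non-s, y)|; range [0, 2]."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    protected = np.asarray(protected, dtype=bool)
    violation = 0.0
    for label in (1, 0):  # positive and negative ground-truth class
        mask = y_true == label
        rate_prot = y_pred[mask & protected].mean()
        rate_non_prot = y_pred[mask & ~protected].mean()
        violation += abs(rate_prot - rate_non_prot)
    return violation
```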

Absolute Between-ROC Area (ABROCA)
This fairness measure was introduced by (Gardner et al., 2019) and is based on the Receiver Operating Characteristic (ROC) curve. ABROCA measures the divergence between the protected-group (ROC_s) and non-protected-group (ROC_{\bar{s}}) curves across the whole false positive rate domain t ∈ [0, 1]. In particular, it integrates the absolute difference between the two curves in order to also capture the case where the curves cross each other, and is defined as:

ABROCA = \int_{0}^{1} \left| ROC_s(t) - ROC_{\bar{s}}(t) \right| \, dt     (7)

ABROCA takes values in the [0, 1] range. A higher value indicates a larger difference in the predictions between the two groups and, therefore, a more unfair model.
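ABROCA can be approximated numerically by interpolating the two group-wise ROC curves on a common false-positive-rate grid and integrating the absolute difference. A sketch using scikit-learn and numpy, assuming y_score holds predicted positive-class probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve


def abroca(y_true, y_score, protected, grid_size=10_000):
    """Approximate the area between the protected and non-protected ROC curves."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    protected = np.asarray(protected, dtype=bool)
    grid = np.linspace(0.0, 1.0, grid_size)

    def interpolated_roc(mask):
        fpr, tpr, _ = roc_curve(y_true[mask], y_score[mask])
        return np.interp(grid, fpr, tpr)

    diff = np.abs(interpolated_roc(protected) - interpolated_roc(~protected))
    return np.trapz(diff, grid)
```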

Datasets for fairness
In this section, we provide a detailed overview of real-world datasets used frequently in fairness-aware learning. We organize the datasets in terms of their application domain, namely: financial datasets (Section 3.1), criminological datasets (Section 3.2), healthcare and social datasets (Section 3.3) and educational datasets (Section 3.4). A summary of the statistics of the different datasets 5 is provided in Table 1.
For each dataset, we discuss the basic characteristics like cardinality, dimensionality and class imbalance as well as typically used protected attributes in the literature. When available, we also provide temporal information regarding the data collection and the timespan of the datasets.
We start our analysis with the BN structure learned from the data (see Section 2.2), which can help us to understand the relationships between attributes of the dataset. In addition, the BN visualization already provides interesting insights on the dependencies between non-protected and protected attributes and their conditional dependencies in predicting the class attribute. We further provide an exploratory analysis of interesting correlations from the Bayesian graph (for both direct-and indirect-edges), particularly those related to the fairness problem (paths to and from protected attributes).

Adult dataset
The adult dataset (Kohavi, 1996) (also known as "Census Income" dataset 6 ) is one of the most popular datasets for fairness-aware classification studies (Appendix A). The classification task is to decide whether the annual income of a person exceeds 50,000 US dollars based on demographic characteristics.
Dataset characteristics: The dataset consists of 48,842 instances, each described via 15 attributes, of which 6 are numerical, 7 are categorical and 2 are binary. An overview of the attribute characteristics is shown in Table 2. We discard the attribute fnlwgt (final weight), following the suggestions of related work (B. H. Zhang, Lemoine, & Mitchell, 2018; Calders, Kamiran, & Pechenizkiy, 2009). Missing values exist in 3,620 (7.41%) records. Many researchers remove the instances containing missing values in their experiments (Iosifidis & Ntoutsi, 2018, 2019; Choi, Farnadi, Babaki, & Van den Broeck, 2020); other studies consider the whole dataset or do not clarify how the missing values are handled. To avoid the effect of missing values on the analysis, we remove the missing data and obtain a clean dataset with 45,222 instances.

5 We use the names of the protected attributes given in the original datasets, i.e., sex and gender are used with the same meaning. We do not present the class ratio (denoted by '-') for several datasets because their class label is 'multiple'.
6 https://archive.ics.uci.edu/ml/datasets/adult

Protected attributes:

• race = {white, black, asian-pac-islander, amer-indian-eskimo, other}. Typically, race is used as a binary attribute in the related work (Chakraborty, Peng, & Menzies, 2020): race = {white, non-white}. The dataset is dominated by white people; the white:non-white ratio is 38,903:6,319 (86%:14%). In our analysis, we also encode race as a binary attribute.
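A minimal preprocessing sketch mirroring the steps above (dropping fnlwgt, removing records with missing values, binarizing race) is given below; the download URL and column names follow the UCI documentation and should be verified before use.

```python
import pandas as pd

# Assumed UCI location of the training split of the Adult dataset.
ADULT_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
           "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
           "hours-per-week", "native-country", "income"]

df = pd.read_csv(ADULT_URL, names=COLUMNS, na_values="?", skipinitialspace=True)
df = df.drop(columns=["fnlwgt"])   # discarded, following related work
df = df.dropna()                   # remove records with missing values
df["race"] = df["race"].apply(lambda r: "white" if r == "White" else "non-white")
df["income"] = (df["income"] == ">50K").astype(int)   # positive class: > 50K
```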
In some related work, marital-status and native-country are also considered as protected attributes. However, due to missing information on how these attributes were pre-processed in that work, we do not consider them as protected attributes in our survey.
Class attribute: The class attribute is income ∈ {≤ 50K, > 50K} indicating whether an individual makes less or more than 50K. The positive class is "> 50K". The dataset is imbalanced with an imbalance ratio (IR) 1 : 3.03 (positive:negative).
Bayesian network: Figure 1 illustrates the Bayesian network learned from the dataset. The class label income is the leaf node, i.e., it has no outgoing edges. As demonstrated in Figure 1, there is a direct dependency between income and education as well as between sex and education. Therefore, we explore in more detail the distribution of the population w.r.t. education, income and sex in Figure 2a. As expected, highly educated people have a high income. However, in the high-education segment of the high-income class, the number of males is at least 5 times higher than that of females, showing an under-representation of highly educated women in the high-income class. Based on the dependence of the hours-per-week attribute on sex, we plot the weekly working hours w.r.t. income and sex (Figure 2b). The number of males who work more than 40 hours per week is approximately 7 times higher than the number of females. Interestingly, there are many outgoing edges from the relationship and age attributes in the Bayesian network. We show the distribution of sex in each class based on age (x-axis) and relationship status (y-axis) in Figure 3. A first observation is that a large proportion of young (less than 25 years old) or old (more than 60 years old) people do not earn more than 50K. "Unmarried" people have an income higher than 50K when they are older than 45 years, while people in the "Own-child" group can have a high income when they are young. In general, there are more males than females for almost all relationship statuses in the high-income group. Another interesting observation is that there is a direct edge from the protected attribute sex to race. This suggests that choosing sex as the protected attribute might also lead a fairness-aware classifier to attain fairness w.r.t. race; evidence of such an outcome is seen in related work.

KDD Census-Income dataset
The KDD Census-Income 8 dataset (Dheeru & Karra Taniskidou, 2017) was collected from the Current Population Surveys conducted by the U.S. Census Bureau in 1994 and 1995. The dataset has been considered in numerous related works (Appendix A). The prediction task is to decide whether a person receives more than 50,000 US dollars annually or not, i.e., the same task as in the Adult dataset. However, as described by the dataset's authors (Dheeru & Karra Taniskidou, 2017), the two datasets differ: "the goal field was drawn from the total person income field rather than the adjusted gross income and may, therefore, behave differently than the original adult goal field".
Dataset characteristics: The dataset contains 299,285 instances with 41 attributes, 32 of which are categorical, 7 are numerical and 2 are binary attributes. An overview of the dataset characteristics 9 is shown in Table 3 and Table 16 (Appendix B). Attribute weight is omitted as proposed by the authors of the dataset (Dheeru & Karra Taniskidou, 2017).
Missing values exist in 157,741 (52.71%) instances. Because related studies only focus on a subset of the data and features, we clean the dataset by eliminating all missing values. In particular, we remove the four features migration-code-change-in-msa, migration-code-change-in-reg, migration-code-move-within-reg and migration-prev-res-in-sunbelt due to their high proportion of missing values, as illustrated in Table 3. The result is a cleaned dataset with 284,556 instances.
Protected attributes: Previous research considers sex as a protected attribute (Iosifidis & Ntoutsi, 2019; Ristanoski, Liu, & Bailey, 2013). Attribute race = {white, black, asian-pac-islander, amer-indian-eskimo, other} could also be employed as a protected attribute because it plays the same role as in the original Adult dataset. Similarly to the Adult dataset, the KDD Census-Income dataset is dominated by white people; there are 239,081 (84.01%) white people. Hence, we encode race as a binary attribute for our analysis.

8 https://archive.ics.uci.edu/ml/datasets/Census-Income+(KDD)
9 Table 3 describes the attributes used in the Bayesian network.

Class attribute:
The class attribute is income ∈ {≤ 50K, > 50K} indicating whether an individual makes less or more than 50K. The positive class is "> 50K". The dataset is very imbalanced with an IR 1 : 15.30 (positive:negative).
Bayesian network: To generate the Bayesian network, we encode the numerical attributes into categorical values. To reduce the complexity, we eliminate the attributes enroll-in-edu-inst-last-wk, major-industry and major-occupation, since they have a very low correlation with the other features. Also, for efficiency purposes, we generate the Bayesian network on a randomly selected 10% sample of the dataset rather than on the complete dataset. The learned Bayesian network is shown in Figure 4; the class label income is set as a leaf node.
As shown in Figure 4, income is conditionally dependent on the sex, occupation and number of weeks worked in the year (weeks-worked) attributes. Regarding the sex attribute, females are largely underrepresented in the high-income group, which consists of 13,691 males (∼10.03% of the male population) and only 3,711 females (∼2.51% of the female population). Regarding the number of weeks worked per year and income, as shown in Figure 5, women tend to do part-time jobs, i.e., the number of weeks worked per year is less than 26. In addition, women earn less money than men even when both work 52 weeks per year: the number of men with high income is approximately five times higher than the number of women. As mentioned, race could also be considered as a protected attribute. Based on the data, the income of non-white people is significantly different from that of the white group. Only 3.2% of the non-white group have an income above 50K, compared to 6.7% of the white group. Furthermore, since age has a conditional dependence on the marital-status attribute, we investigate the relationship between these attributes, the protected attribute sex and the class label income in Figure 6. As shown in this figure, males comprise the majority of the high-income group, especially for certain population segments like the "Married-civilian spouse present" segment, where the number of males is 5 times higher than that of females. Interestingly, among people with high income, the number of widows is 1.7 times higher than the number of widowers.
Regarding the age effect, most people have a high income when they are over 40 years old. With respect to the protected attributes, there is no edge between race and sex, which suggests that researchers should evaluate their fairness-aware models on both of these protected attributes.

Figure 6: KDD Census-Income: relationship of marital status, age, sex and income

German credit dataset
The German credit 10 dataset (Dheeru & Karra Taniskidou, 2017) consists of samples of bank account holders. The dataset is used for risk assessment prediction, i.e., to determine whether it is risky to grant credit to a person or not. The dataset is frequently employed in fairness-aware learning research (Appendix A).

Dataset characteristics:
The dataset contains only 1,000 instances without any missing values. Each sample is described by 13 categorical, 7 numerical and 1 binary attributes. An overview of all attributes is presented in Table 4. Attribute personal-status-and-sex contains information on both the marital status and the gender of a person. We disentangle gender from personal status and create two separate attributes: marital-status and sex. The original personal-status-and-sex attribute is omitted from further analysis.

Protected attributes: In the literature, sex and age are considered as protected attributes, with age typically binarized at 25 (Kamiran & Calders, 2009; Friedler et al., 2019).
• sex = {male, female}. The dataset is dominated by male instances, the ratio of male:female is 690:310 (69%:31%). The percentage of women identified as bad customers is 35.2% while that of men is only 27.7%.
• age = {≤25, >25}: The dataset is dominated by people older than 25 years; the ratio is 810:190 (81%:19%). We observe discrimination w.r.t. the age of customers: 42.1% of young people are recognized as bad customers, while this proportion among older people is only 27.2%.
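As a sketch of how the personal-status-and-sex attribute described above can be split into sex and marital-status, the mapping below follows the code table (A91-A95) in the UCI documentation of the German credit data; the codes should be double-checked against the data dictionary before use.

```python
import pandas as pd

# Codes of attribute 9 ("Personal status and sex") as listed in the UCI documentation.
SEX_MAP = {"A91": "male", "A92": "female", "A93": "male", "A94": "male", "A95": "female"}
STATUS_MAP = {"A91": "divorced/separated", "A92": "divorced/separated/married",
              "A93": "single", "A94": "married/widowed", "A95": "single"}


def disentangle(df: pd.DataFrame) -> pd.DataFrame:
    """Derive sex and marital-status and drop the combined attribute."""
    out = df.copy()
    out["sex"] = out["personal-status-and-sex"].map(SEX_MAP)
    out["marital-status"] = out["personal-status-and-sex"].map(STATUS_MAP)
    return out.drop(columns=["personal-status-and-sex"])
```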
Class attribute: The class attribute is class-label ∈ {good, bad} revealing the customer's level of risk. The positive class is "good". The dataset is imbalanced with an IR 2.33 : 1 (positive:negative).
Bayesian network: We transform the numerical attributes into categorical ones as follows: duration = {≤6, 7-12, >12} (short, medium and long term); credit-amount = {≤2000, 2000-5000, >5000} (low, medium and high amount); age = {≤25, >25}. The extracted Bayesian network is shown in Figure 7; class-label is set as a leaf node. The Bayesian network consists of two disconnected components. First, class-label is conditionally dependent on the checking-account attribute. We investigate this relationship in more detail in Figure 8a. As we can see, a very high proportion of the people having no checking account, i.e., 88.3%, are identified as good customers, while half of the customers having a balance of less than 0 DM (Deutsche Mark) in their checking account are classified as bad customers.
Second, and interestingly, credit-amount has a direct effect on several attributes such as installment-rate and duration. We discover that people who borrow a large amount of money tend to borrow for a long period. For example, 93.6% of the borrowers who take a loan of more than 5000 DM have a loan duration of more than 12 months. As illustrated in Figure 8b, the number of customers who have to pay the highest installment rate (visualized as the "red" columns) is inversely proportional to the credit amount. Regarding the protected attributes, a direct edge between sex and age is observed. This is the starting point of the research question "Does a fairness-aware model obtain fairness w.r.t. sex if age is chosen as the protected attribute?"

Dutch census dataset

Dataset characteristics: The dataset includes 60,420 samples 11 where each sample is described by 12 attributes. An overview of the attributes is presented in Table 5.

Class attribute: The class attribute is occupation ∈ {0, 1}, indicating whether an individual has a prestigious profession or not. The positive class is 1 (high-level). This is a fairly balanced dataset with an IR of 1 : 1.10 (positive:negative).
Bayesian network: We use all attributes in the dataset to generate the Bayesian network. As illustrated in Figure 9, the leaf node occupation is conditionally dependent on the economic status, education level and sex attributes. In fact, 62.6% of males (18,860 out of 30,147) have a high-level occupation, while this proportion in the female group is only 32.7%. In addition, people with high education tend to hold prestigious jobs and vice versa, as depicted in Figure 10. For example, 89.5% of people with tertiary education work in high-level jobs, while around 80% of people with lower secondary degrees do low-level work. Interestingly, age has a direct effect on many attributes.

Bank marketing dataset

A variety of researchers investigate this dataset in their studies (Appendix A). The classification goal is to predict whether a client will make a deposit subscription or not.
Dataset characteristics: The dataset comprises 45,211 samples, each with 6 categorical, 4 binary and 7 numerical attributes, as summarized in Table 6.

Protected attributes: In related work (Fish, Kun, & Lelkes, 2016), age is considered as the protected attribute; it is binarized into people who are between 25 and 60 years old versus those younger than 25 or older than 60.
Class attribute: The class attribute is y ∈ {Yes, No}, representing whether a customer will subscribe to a term deposit or not. The positive class is "Yes". The dataset is imbalanced with an IR of 1 : 7.55 (positive:negative).

Bayesian network: Figure 11 visualizes the Bayesian network of the Bank marketing dataset. The class label y, as illustrated in Figure 11, is conditionally dependent on the poutcome, month and duration attributes. An insight into the relationship between the last contact duration and the class label y is given in Figure 12. The ratio of clients who make a deposit subscription is proportional to the duration of the last contact. When the call lasts less than 2 minutes, 98.5% of people do not make a deposit subscription; however, if a marketing employee can keep the customer on the call for over 10 minutes, 48.4% of customers say "Yes". Interestingly, in the Bayesian network, both protected attributes age and marital have no effect on the class label y. However, the two attributes are connected by an indirect edge, which could explain the similar accuracy of fairness-aware models reported in related work.
Credit card clients dataset

Dataset characteristics: The dataset includes 30,000 customers described by 8 categorical, 14 numerical and 2 binary attributes, as depicted in Table 7. There are no missing values in the dataset.
Protected attributes: In the literature, sex, education and marriage (Bera et al., 2019) are considered as the protected attributes.
Class attribute: The class attribute is default payment ∈ {0, 1}, indicating whether a customer will suffer the default payment situation in the next month (1) or not (0). The positive class is 1. This is an imbalanced dataset with an IR of 1 : 3.52 (positive:negative).

Bayesian network: To generate the Bayesian network, we convert the numerical attributes as follows: age = {≤35, 36-60, >60}; the amount of the given credit (limit bal), the amount of the bill statements (bill amt 1, ..., bill amt 6) and the amount of the previous payments (pay amt 1, ..., pay amt 6) = {≤50,000, 50,001-200,000, >200,000} (corresponding to the low, medium and high levels); history of the past payments pay 0, ..., pay 6 = {pay duly, 1-3 months, >3 months}. The Bayesian network is presented in Figure 13. The class label default payment is directly conditionally dependent on the repayment status in July 2005 (attribute pay 3) and the given credit (attribute limit bal), and indirectly dependent on the amount of the bill statements (the attributes with the prefix bill amt). As demonstrated in Figure 14, the ratio of the default payment phenomenon is inversely proportional to the credit limit balance. Moreover, we discover that the percentage of males having a default payment in the next month is higher than that of females. In particular, the ratio of males with a default payment is 24.2% while that of females is only 20.8%. Interestingly, the protected attributes (sex, education, marriage) are conditionally dependent on each other.

Figure 14: Credit card clients: Relationship between the credit limit balance and default payment

Summary of the financial datasets: In general, the financial datasets are quite diverse as they were collected in several locations (from the US to Taiwan) and at very different time points (from 1994 to 2013). With respect to the collection time, the datasets are rather old, especially the Adult and KDD Census-Income datasets. These datasets have been heavily investigated in the related work and under different protected attributes. The most prevalent protected attribute is sex, followed by race, age, marriage and education. An interesting observation is that the protected attributes are often related to each other (with a strong or weak relationship), e.g., race with education. Due to these dependencies, ensuring fairness for one protected attribute may positively affect fairness for other protected attributes. Moreover, most of the datasets in this category are imbalanced, with the only exception of the Dutch census dataset which is almost balanced (see Table 1); the datasets demonstrate quite different imbalance ratios.

Criminological datasets
COMPAS dataset

The dataset comes in two subsets: COMPAS recid. (general recidivism) and COMPAS viol. recid. (violent recidivism). An overview of the attributes is provided in Table 8 and Table 17 (Appendix B).

Missing data is a common phenomenon in both subsets. There are 6,395 rows (88.6%) containing missing values in the COMPAS recid. subset, while this number in the COMPAS viol. recid. subset is 3,748 (79%). Based on (Angwin et al., 2016), we clean the dataset by removing problematic records, such as those where violent recid = NULL or where the charge date of the crime (attribute days b screening arrest) was not within 30 days of the arrest. The cleaned datasets used in our analysis contain 6,172 (COMPAS recid.) and 4,020 (COMPAS viol. recid.) records.
Protected attributes: Typically, race is employed as the protected attribute. In both subsets, black and white are the main races. In the COMPAS recid. subset, the black:white ratio is 3,175:2,103 (51.4%:34%, computed on the total number of defendants), while this ratio in the COMPAS viol. recid. subset is 1,918:1,459 (47.7%:36.3%). Figure 15 shows the distribution of defendants w.r.t. the race attribute. The recidivism rate among black defendants is higher than among white defendants in both subsets.
Sex has also been considered as the protected attribute (Diana, Gill, Kearns, Kenthapadi, & Roth, 2021; van Berkel, Goncalves, Russo, Hosio, & Skov, 2021; Chakraborty, Majumder, Yu, & Menzies, 2020).

Class attribute: The class attribute is two year recid ∈ {0, 1}, indicating whether an individual will be rearrested within two years (1) or not (0). The positive class is 1. The COMPAS recid. subset is fairly balanced with an IR of 1 : 1.20 (positive:negative), while the COMPAS viol. recid. subset is imbalanced with an IR of 1 : 5.17.

Bayesian network: We discard attributes that are not relevant for the learning task, such as screening date (the date on which the risk of recidivism score was given), in custody (the date on which the individual was brought into custody), and several ID-related attributes. A new attribute juv crime is computed as the sum of the juvenile felony count (juv fel count), the juvenile misdemeanor count (juv misd count) and the juvenile other offenses count (juv other count). We transform the numerical attributes into categorical ones: prior offenses count priors count = {0, 1-5, >5}; juvenile crime count juv crime = {0, >0}. Figure 16 and Figure 17 show the Bayesian networks of the COMPAS dataset; the class label two year recid = {0, 1} is assigned as a leaf node. They show the dependency of many attributes, such as sex and age category (age cat), on the prior offenses count (priors count). For instance, the number of prior convictions directly affects the frequency of recidivism, as shown in Figure 18. If a defendant has a long history of convictions, his or her probability of recidivism is higher; in particular, when the number of prior convictions exceeds 27, the recidivism probability is almost 100%.
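The attribute engineering described above (summing the juvenile counts into juv crime and binning priors count) can be sketched as follows, using the column names of the ProPublica compas-scores-two-years.csv file:

```python
import pandas as pd


def engineer_compas(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Combine the three juvenile offense counts into a single attribute.
    out["juv_crime"] = out["juv_fel_count"] + out["juv_misd_count"] + out["juv_other_count"]
    # Discretize as in our analysis: prior offenses {0, 1-5, >5}, juvenile crime {0, >0}.
    out["priors_count"] = pd.cut(out["priors_count"], bins=[-1, 0, 5, float("inf")],
                                 labels=["0", "1-5", ">5"])
    out["juv_crime"] = pd.cut(out["juv_crime"], bins=[-1, 0, float("inf")],
                              labels=["0", ">0"])
    return out
```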
Interestingly, the score text attribute (which defines the category of the recidivism score) has many incoming and outgoing edges, as depicted in Figure 17. To clarify this phenomenon, we investigate the distribution of age and recidivism score (score text) w.r.t. race in Figure 19. The majority of recidivists are under the age of 30. In the recidivist group, the number of black defendants is four times and two times higher than that of white defendants for the high and medium recidivism scores, respectively. In the group of defendants with a low recidivism score, the distribution of race is balanced.

Communities and Crime dataset
The Communities and Crime 16 dataset (Dheeru & Karra Taniskidou, 2017) is a small dataset containing socio-economic data from 46 states of the United States in 1990 (the US Census). The law enforcement data come from the 1990 US LEMAS survey, and the crime data come from the 1995 FBI Uniform Crime Reporting (UCR) program. The goal is to predict the total number of violent crimes per 100 thousand population. Many researchers investigate this dataset in their experiments (Appendix A).
Dataset characteristics: The dataset contains only 1,994 samples; each instance is described by 127 attributes (4 categorical and 123 numerical attributes). A description of the attributes 17 is provided in Table 9 and Table 18 (Appendix B).

Class attribute: Following (Kearns, Neel, Roth, & Wu, 2018), the label "high-crime" (positive class) is assigned if the crime rate of a community is greater than 0.7; otherwise, the label "low-crime" is given. The ratio of high-crime:low-crime is 122:1,872 (6.1%:93.9%). Therefore, the dataset is very imbalanced, with an IR of 1 : 15.34 (positive:negative).

Bayesian network: The dataset contains 122 numerical attributes normalized in the range (0, 1), which is not suitable for Bayesian network learning. Hence, we use the median value 0.5 as a threshold to transform these attributes into categorical attributes with two values {≤0.5, >0.5}. In addition, to keep the graph readable and the computation time manageable, we use the 21 attributes that have a high correlation (at a threshold of 0.25) with the class label. The Bayesian network is visualized in Figure 20. The percentage of kids born to never-married parents (PctIlleg) and the percentage of kids in family housing with two parents (PctKids2Par) have a direct impact on the class label and on race. Looking into the dataset, we discover that 92.4% of the communities where the percentage of kids in two-parent family housing is below 50% are dominated by black people, while only 55.6% of the communities where this percentage is above 50% are dominated by black people.
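The selection and binarization step just described can be sketched as follows; df is assumed to hold the normalized numeric attributes plus a 0/1 class column whose (hypothetical) name is high_crime.

```python
import pandas as pd


def select_and_binarize(df: pd.DataFrame, class_col: str = "high_crime",
                        corr_threshold: float = 0.25) -> pd.DataFrame:
    numeric = df.select_dtypes("number").drop(columns=[class_col])
    # Keep attributes whose absolute correlation with the class exceeds the threshold.
    corr = numeric.corrwith(df[class_col]).abs()
    keep = corr[corr > corr_threshold].index
    # Binarize the normalized [0, 1] attributes at the 0.5 (median) threshold.
    binarized = (df[keep] > 0.5).astype(int)
    return binarized.assign(**{class_col: df[class_col]})
```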
Summary of the criminological datasets: In summary, the criminological datasets were collected only in the US.
Race and sex are considered as protected attributes, with race being the most prevalent. Historical bias w.r.t. race has been detected in the data and comprises a challenge for ML models. Furthermore, the datasets consist of many attributes (the richest description among all datasets, see Table 1); hence, a careful selection of attributes for fairness-aware learning is required.

Healthcare and social datasets

Diabetes dataset

Dataset characteristics: The dataset contains 101,766 patients described by 50 attributes (10 numerical, 7 binary and 33 categorical). The characteristics of all attributes 19 are summarized in Table 10 and Table 20 (in Appendix B). The attributes encounter id and patient nbr should not be considered in the learning tasks since they are patient identifiers. Typically, the weight, payer code and medical specialty attributes are removed because they contain at least 40% missing values. Furthermore, we eliminate the missing values in the race, diag 1, diag 2 and diag 3 columns. The class label readmitted contains 54,864 rows with "no record of readmission"; hence, these rows are also removed. The clean version of the dataset contains 45,715 records.

Class attribute: The class attribute is readmitted ∈ {<30, >30}, indicating whether a patient will be readmitted within 30 days. The positive class is "<30". The dataset is imbalanced with an IR of 1 : 3.13 (positive:negative).

Bayesian network: To reduce the computation time, we use the 17 attributes that have an absolute correlation coefficient higher than 0.005 with the gender and readmitted attributes to generate the Bayesian network in Figure 21.
The class label readmitted is directly conditionally dependent on the number of outpatient visits of the patient in the year preceding the encounter (number outpatient). The attribute number outpatient also has an impact on 8 other features. Interestingly, there is no connection between the protected attribute gender and the class label.

Ricci dataset
The Ricci 20 dataset stems from the Ricci v. DeStefano case (Supreme Court of the United States, 2009), which concerned the results of promotion exams within a fire department in November and December 2003. Although it is a relatively small dataset, it has been employed for fairness-aware classification tasks in many studies (Appendix A). The classification task is to predict whether an individual obtains a promotion based on the exam results.
Dataset characteristics: The dataset consists of 118 samples, where each sample is characterized by 6 attributes (3 numerical and 3 binary attributes), as presented in Table 11. Protected attributes: In this dataset, only attribute race can be used as a protected attribute. Race contains three values (black, white, and hispanic). As described in the literature, "black" and "hispanic" are grouped as "non-white" community. The ratio of white:non-white is 68:50 (57.6%:42.4%).
Class attribute: The class attribute is promoted ∈ {T rue, F alse} revealing whether an individual achieves a promotion or not. The positive class is "True". The dataset is almost balanced with an IR 1 : 1.11 (positive:negative).
Bayesian network: We encode the three numerical attributes (oral, written and combine) into categorical values. The Bayesian network of the Ricci dataset is shown in Figure 22.
It is easy to observe that the combined grade (attribute combine) has a direct effect on the class label (promoted). Figure 23 illustrates the relationship between the combined grade and the promotion status: 100% of the people whose combined oral and written score is equal to or above 70 are promoted. Besides, as depicted in Figure 24, the number of promotions granted to white people is higher than that granted to non-white people; the opposite trend holds in the group of candidates without a promotion.
Summary of the healthcare and social datasets: In summary, the datasets in the healthcare and social domains were collected only in the US. Race and gender are considered as protected attributes. In terms of class imbalance, these datasets are less imbalanced than datasets in other domains, although the Diabetes dataset is still imbalanced. Interestingly, there is no direct connection between the protected attribute and the class label in either dataset, which suggests that fairness may be easier to achieve for fairness-aware ML models on these datasets.

Educational datasets

Student performance dataset
The student performance dataset (Cortez & Silva, 2008) describes students' achievement in secondary education at two Portuguese schools in 2005-2006 for two distinct subjects: Mathematics and Portuguese 21 . The regression task is to predict the final-year grade of the students. The dataset is investigated in several studies (Appendix A) with fairness-aware regression and clustering approaches.
Dataset characteristics: The dataset contains information on 395 (Mathematics) and 649 (Portuguese) students described by 33 attributes (4 categorical, 13 binary and 16 numerical attributes). The characteristics of all attributes are described in Table 12. To simplify the classification problem, we create a class label based on attribute G3: class = {Low, High}, corresponding to G3 = {<10, ≥10}. The positive class is "High". The dataset is imbalanced with imbalance ratios of 1:2.04 (Mathematics) and 1:5.09 (Portuguese).

Protected attributes: Typically, in the literature, sex is considered as the protected attribute.
In the work of (Kearns, Neel, Roth, & Wu, 2019), age is also selected as a protected attribute. Some research additionally considers the attributes romantic (relationship status) and dalc, walc (alcohol consumption) as protected attributes. However, because these attributes are rarely used as protected attributes, we do not consider them within the scope of this paper.
• sex = {male, female}: the dataset is dominated by female students; the female:male ratios are 208:187 (52.7%:47.3%) and 383:266 (59%:41%) for the Mathematics and Portuguese subjects, respectively.

Bayesian network: The class label is conditionally dependent on the grade G2 in both subsets (Mathematics and Portuguese). This is explained by a very high correlation coefficient (above 90%) between the G2 and G3 variables. In addition, we investigate the distribution of the final grade G3 w.r.t. sex, because the attribute sex has an indirect relationship with the class label. Figure 27 reveals that male students tend to receive higher scores in the Portuguese subject, while the Mathematics scores are relatively evenly distributed across both sexes.

OULAD dataset
The Open University Learning Analytics (OULAD) dataset 22 was collected from the OU Analyse project (Kuzilek, Hlosta, & Zdrahal, 2017) of The Open University (England) in 2013-2014. The dataset contains information about students and their activities in the virtual learning environment (VLE) for 7 courses. The dataset is investigated in several papers (Appendix A) on fairness-aware problems. The goal is to predict the success of students.

Dataset characteristics: The dataset contains information on 32,593 students characterized by 12 attributes (7 categorical, 2 binary and 3 numerical attributes). An overview of all attributes is illustrated in Table 13. Attribute id student should be ignored in the analysis. Typically, related work considers the prediction task on the class label final result = {pass, fail}. Therefore, we investigate the cleaned dataset with 21,562 instances obtained after removing the missing values and the rows with final result = "withdrawn". "Pass" is the positive class. The ratio of pass:fail is 14,655:6,907 (68%:32%); in other words, the dataset is imbalanced with an IR of 2.12:1 (positive:negative).
Protected attributes: gender = {male, female} is considered as the protected attribute in the literature. Male is the majority group, with a male:female ratio of 11,568:9,994 (53.6%:46.4%).
Bayesian network: The numerical attributes are encoded for generating the Bayesian network as follows: num of prev attempts = {0, >0}, studied credits = {≤100, >100}. The network is depicted in Figure 28. The final result attribute is directly conditionally dependent on the highest education level (highest education) and the number of times the student has attempted the module (num of prev attempts), while gender has a more negligible effect on the outcome. We analyze the relationship between the highest education, the number of previous attempts and the final result for each gender. As demonstrated in Figure 29, students have a higher probability of failing when they have attempted the module many times in the past. The proportion of male students whose highest education is "A-level or equivalent" or "higher education (HE) qualification" is around 1.5 times higher than that of female students.

Law school dataset
The Law school 23 dataset (Wightman, 1998) was collected through a Law School Admission Council (LSAC) survey across 163 law schools in the United States in 1991. The dataset contains law school admission records. The prediction task is either to predict whether a candidate would pass the bar exam or to predict a student's first-year average grade (FYA). The dataset is investigated in a variety of studies (Appendix A).
Dataset characteristics: The dataset contains information on 20,798 students characterized by 12 attributes (3 categorical, 3 binary and 6 numerical attributes). An overview of all attributes is depicted in Table 14.

Protected attributes: In the related work (Kusner et al., 2017; Kearns et al., 2019; Yang et al., 2020), the following are considered as protected attributes:
• race = {white, black, Hispanic, Asian, other}. As introduced in the related work, we encode race = {white, non-white} based on the original attribute. Non-white is the minority group, with a white:non-white ratio of 17,491:3,307 (84%:16%).
Class attribute: The class label pass bar = {0, 1} is used for the classification task. The positive class is 1 (pass). The dataset is imbalanced with an imbalance ratio of 8.07:1 (positive:negative).
Bayesian network: To generate the Bayesian network, we encode the numerical attributes into categorical values. The Bayesian network is visualized in Figure 30.
It is easy to observe that the bar exam result is conditionally dependent on the law school admission test (LSAT) score, the undergraduate grade point average (UGPA) and race. We discover that 92.1% of white students (16,114 out of 17,491) pass the bar exam, while this ratio for non-white students is only 72.3%. In general, the percentage of students who pass the bar exam increases with the LSAT and UGPA scores, as depicted in Figure 31.

Summary of the educational datasets: The educational datasets were collected in several countries around the world. Gender is the most popular protected attribute, followed by age and race. The typical learning task is to predict students' outcomes or grades; therefore, several machine learning tasks are applied to these datasets, such as classification, regression or clustering. In terms of class imbalance, all datasets are imbalanced, with very different imbalance ratios. Bias is observed in the datasets w.r.t. protected attributes, i.e., race and sex; hence, fairness-aware algorithms need to take these attributes into account to achieve fairness in education.

Experimental evaluation
The goal of our survey is to summarize the different datasets on fairness-aware learning in terms of their application domain, fairness-aware and learning-related challenges. An experimental evaluation of the different fairness-aware learning methods (pre-, in- and post-processing) is beyond the scope of this survey. However, in order to characterize the different datasets in terms of the difficulty of the fairness-aware learning task, in this section we present a short fairness-vs-predictive-performance evaluation 24 using a popular classification method (namely, logistic regression).

Evaluation setup
Predictive model. As our classification model, we use logistic regression (Cox, 1958), a statistical model using a logistic function to model a binary dependent variable. To simplify the task, we apply the logistic regression model to the binary classification problem.
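A minimal sketch of this setup with scikit-learn (single 70%/30% split, simple one-hot encoding, logistic regression) is shown below; it reuses the statistical_parity helper sketched in Section 2.3, and all column names are placeholders.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split


def evaluate(df: pd.DataFrame, class_col: str, protected_col: str, protected_value):
    X = pd.get_dummies(df.drop(columns=[class_col]))   # simple one-hot encoding
    y = df[class_col]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    y_pred = clf.predict(X_te)
    prot = (df.loc[X_te.index, protected_col] == protected_value).to_numpy()
    return {
        "accuracy": accuracy_score(y_te, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_te, y_pred),
        "statistical_parity": statistical_parity(y_pred, prot),
    }


# Hypothetical usage on the Adult dataset:
# evaluate(df, class_col="income", protected_col="sex", protected_value="Female")
```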
Metrics. Based on the confusion matrix in Figure 32 (in which, prot and non-prot stand for protected, non-protected, respectively), we report the performance of the predictive model on the following measures.

• Accuracy: Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Balanced accuracy: Balanced accuracy = (TPR + TNR) / 2
• True positive rate (TPR) on the protected group: TPR_prot = TP_prot / (TP_prot + FN_prot)
• TPR on the non-protected group: TPR_non-prot = TP_non-prot / (TP_non-prot + FN_non-prot)
• True negative rate (TNR) on the protected group: TNR_prot = TN_prot / (TN_prot + FP_prot)
• TNR on the non-protected group: TNR_non-prot = TN_non-prot / (TN_non-prot + FP_non-prot)
• Statistical parity (Eq. 4)
• Equalized odds (Eq. 6)
• ABROCA (Eq. 7)

Training/test set splitting. The ratio of training set to test set in our experiment is 70%:30% (single split), applied for each dataset.

Experimental results

Table 15 describes the performance of the logistic regression model on all datasets. We believe that our experimental results can be considered as a baseline for future studies. In general, a significant difference in terms of predictive performance and fairness measures is observed between the datasets. The Ricci dataset is an exception where the performance of the predictive model reaches the peak regarding both accuracy and fairness measures. Apart from that, the logistic regression model shows the best accuracy on the Communities & Crime dataset, while the worst accuracy is observed on the OULAD dataset. Regarding balanced accuracy, the predictive model performs best on the Student -Mathematics dataset, followed by the Student -Portuguese and the Dutch census datasets. The logistic regression model shows the worst balanced accuracy on the Credit card clients, Diabetes and OULAD datasets.
Regarding the statistical parity measure, in general, 10/15 datasets have an absolute value of statistical parity less than 0.1. The Diabetes, Credit card clients and OULAD datasets have the best value (0.0) of statistical parity while the Bank marketing dataset has the worst value. Interestingly, in terms of the equalized odds measure, the best value (0.0) is observed in four datasets (Credit card clients, Diabetes, OULAD and Ricci). The predictive model results in the worst performance on the COMPAS recid. dataset with a high value of equalized odds, followed by the Law school and the Communities & Crime datasets.
In addition, we plot the ABROCA slicing of all datasets in Figure 33. In the Figure, the red ROC curve represents the non-protected group (e.g., Male) while the blue ROC is the curve of the protected group (e.g., Female). The best value of the ABROCA is seen in the Ricci dataset, followed by the OULAD and the KDD Census-Income datasets. The worst cases are the German credit and the COMPAS datasets.

Open issues on datasets for fairness-aware ML
In the previous sections, we have summarized the most popular datasets for fairness-aware learning. In this section, we extend the discussion to also include recently proposed (and therefore not yet adequately exploited) real datasets (Section 5.1), synthetic datasets (Section 5.2) and datasets for sequential decision making (Section 5.3). We advocate that the community should focus more on new datasets representing diverse fairness scenarios, in parallel to new methods and algorithms for fairness-aware learning.

Adult reconstruction and ACS PUMS datasets
Both datasets were proposed in (Ding et al., 2021), with income as the prediction task. We consider two protected attributes: sex = {male, female} and race = {white, non-white}, with "female" and "non-white" being the corresponding protected values. In Figure 34, we depict the proportion of people in each income class over time, split by gender and race. A lower representation of the protected groups (female, non-white) is evident across the years. Relative to the population size, shown in Figure 35, the number of people with an income above 50K$ gradually increases for both sexes; however, the growth rate in the male group is slightly higher than in the female group. The ACS PUMS datasets were only recently proposed (Ding et al., 2021). We believe they comprise a very interesting collection since they also contain spatial and temporal information, albeit only for the US, and can therefore be used to analyze the dynamics of discrimination across space and time. As a preliminary investigation, in Figure 36, we illustrate the gender percentage differences in the positive class, i.e., income over 50K$, for different US states in 2011 and 2019. Many states have low gender differences (depicted in green) in 2011; however, the gender differences increase over the years, as seen in 2019. A further investigation of the potential effect of spatial and temporal parameters is of course required.
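The ACS PUMS data can be retrieved programmatically, for instance via the folktables package accompanying (Ding et al., 2021). A minimal sketch is shown below; the chosen state, survey year and task are illustrative only.

```python
from folktables import ACSDataSource, ACSIncome

# Illustrative choices: 2019 1-year person survey for a single US state.
data_source = ACSDataSource(survey_year="2019", horizon="1-Year", survey="person")
acs_data = data_source.get_data(states=["CA"], download=True)

# ACSIncome predicts whether annual income exceeds 50K$; the returned
# group attribute encodes race (RAC1P), while sex (SEX) is among the features.
features, label, group = ACSIncome.df_to_numpy(acs_data)
print(features.shape, label.mean())
```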

Synthetic datasets
Apart from using real-world datasets, it is common practice in machine learning evaluation (Ntoutsi, Schubert, Zimek, & Zimmermann, 2019) to also employ synthetic data, which allow for evaluation under different learning complexity scenarios. Synthetic datasets have also been used for the evaluation of fairness-aware learning methods (Loh, Cao, & Zhou, 2019; D'Amour et al., 2020; Tu et al., 2020; Reddy et al., 2021) to produce desired testing scenarios which may not yet be captured by existing real-world datasets, but which are essential for the development and evaluation of theoretically sound fair algorithms.
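As one illustration of such a controllable scenario (not drawn from any of the cited works), the sketch below generates a small synthetic tabular dataset in which the label depends on the protected attribute, so the strength of the injected bias can be tuned.

```python
import numpy as np

rng = np.random.default_rng(0)
n, bias_strength = 10_000, 1.0              # bias_strength controls the injected discrimination

protected = rng.integers(0, 2, size=n)      # 1 = protected group
x1 = rng.normal(size=n)                     # legitimate feature
x2 = rng.normal(size=n) + 0.5 * protected   # "proxy" feature correlated with group membership

# The positive outcome depends on x1 but is also directly penalised for the protected group.
logits = 1.5 * x1 - bias_strength * protected
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

print("P(y=1 | non-protected):", y[protected == 0].mean())
print("P(y=1 | protected):    ", y[protected == 1].mean())
```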
For example, the works (D'Amour et al., 2020; Tu et al., 2020) study the long-term effects of a currently fair decision-making system and therefore require data that capture the decisions of the system over time. In a different direction, (Iosifidis & Ntoutsi, 2018) use synthetic data augmentation to increase the representation of the underrepresented protected groups in the overall population. The synthetic instances are generated via SMOTE (Chawla, Bowyer, Hall, & Kegelmeyer, 2002) by interpolating between original instances.
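A hedged sketch of this augmentation idea is given below; it uses the imbalanced-learn implementation of SMOTE rather than the cited authors' own code and, as a simplification, oversamples the minority class within the protected group only.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

def augment_protected_group(X, y, protected):
    """Oversample the minority class within the protected group via SMOTE and
    keep the non-protected instances unchanged (a simplification of the scheme
    in Iosifidis & Ntoutsi, 2018). X must be numeric and the protected minority
    class needs enough samples for SMOTE's nearest-neighbour interpolation."""
    X_p, y_p = X[protected], y[protected]
    X_p_res, y_p_res = SMOTE(random_state=0).fit_resample(X_p, y_p)
    X_aug = np.vstack([X[~protected], X_p_res])
    y_aug = np.concatenate([y[~protected], y_p_res])
    return X_aug, y_aug
```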

Sequential datasets
While algorithmic fairness in decision-making has mostly been studied in static/batch settings, sequential decision-making environments have attracted increasing attention (Liu, Dean, Rolf, Simchowitz, & Hardt, 2018; Wen, Bastani, & Topcu, 2021). In such environments, a possibly infinite sequence of instances arrives continuously over time, a setting that occurs in many real-world applications such as lending and employment decisions. In contrast to batch-based static environments, sequential decision-making requires the operating model to process each new individual at each time step and to make an irrevocable decision based on the observations made so far (W. Zhang & Bifet, 2020; W. Zhang & Ntoutsi, 2019). Often, the processing also needs to happen on the fly, without storage and reprocessing of past instances (W. Zhang, Bifet, Zhang, Weiss, & Nejdl, 2021).
These unique characteristics require that the datasets used for fair sequential decision-making studies fulfill additional, demanding requirements. Among the previously discussed datasets, Adult (Kohavi, 1996) and Census (Asuncion & Newman, 2007) have been rendered as discriminated data streams for this purpose by processing the individuals in sequence (W. Zhang & Bifet, 2020). In addition, the datasets are ordered based on the sensitive attribute of the particular task at hand before sequential processing, to further simulate potential concept and fairness drifts in the online setting. Relatedly, the Communities & Crime dataset (Asuncion & Newman, 2007) has also been processed sequentially for fairness-aware studies. However, sequential-friendly datasets are still scarce due to these magnified requirements, despite their significance for the development of fair sequential models, which are widely applicable in many real-world applications (W. Zhang, Tang, & Wang, 2019). Continued effort on fair sequential datasets is therefore required for unified fairness-aware research. The new Adult dataset(s) (see Section 5.1) might be suitable for sequential learning as they contain temporal information (year of data collection).
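A minimal sketch of such a prequential (test-then-train) setup is shown below, with instances optionally ordered by the sensitive attribute before streaming and an incrementally updated linear model. All names are illustrative; this is not the cited authors' implementation.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def stream_prequential(X, y, protected, order_by_protected=True):
    """Process instances one by one: predict first, then update the model.
    Ordering by the protected attribute crudely simulates concept/fairness drift."""
    idx = np.argsort(protected, kind="stable") if order_by_protected else np.arange(len(y))
    clf = SGDClassifier(random_state=0)
    classes = np.unique(y)
    correct, seen = 0, 0
    for i in idx:
        xi = X[i].reshape(1, -1)
        if seen > 0:                       # irrevocable prediction before learning from the instance
            correct += int(clf.predict(xi)[0] == y[i])
        clf.partial_fit(xi, [y[i]], classes=classes)
        seen += 1
    return correct / max(seen - 1, 1)      # prequential accuracy
```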
More recently, uncertainty due to censorship in fair sequential decision-making has also been studied (W. Zhang et al., 2022). In contrast to existing fairness studies that assume certainty on the class label by design, this line of work addresses fairness in the presence of uncertainty on the class label due to censorship. Taking the motivating clinical prediction task as an example (e.g., the SUPPORT dataset (Knaus et al., 1995)), whether a patient relapses or is discharged (the event of interest) may be unknown for various reasons (e.g., loss to follow-up), leading to uncertainty on the class label, i.e., censorship (W. Zhang, Tang, & Wang, 2016). The problem extends beyond the medical domain, with examples in marketing analytics (e.g., the KKBox dataset (Kvamme, Borgan, & Scheel, 2019)) and recidivism prediction instruments (e.g., the ROSSI (Fox & Carvalho, 2012) and COMPAS (Angwin et al., 2016) datasets). In these datasets, the censorship information, consisting of a survival time and an event indicator in addition to the observed features, is also included, whereas it is normally excluded in fairness studies that do not consider censorship. As the exclusion of censorship information can lead to important information loss and introduce substantial bias (Wang, Zhang, Jadhav, & Weiss, 2021), more attention to the censorship aspect of fairness datasets is warranted.
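As an illustration of what such censorship-aware data look like, the sketch below builds a toy dataset with a survival time and an event indicator and compares group-wise Kaplan-Meier estimates using the lifelines package. The data are synthetic placeholders, not drawn from the cited datasets.

```python
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "group": rng.integers(0, 2, size=n),             # 1 = protected group
    "duration": rng.exponential(scale=10, size=n),   # observed follow-up time
    "event": rng.integers(0, 2, size=n),             # 1 = event observed, 0 = censored
})

# Group-wise Kaplan-Meier estimates; large gaps between the curves can hint at
# disparities that plain classification metrics (which ignore censoring) would miss.
for g, sub in df.groupby("group"):
    kmf = KaplanMeierFitter()
    kmf.fit(sub["duration"], event_observed=sub["event"], label=f"group={g}")
    print(g, kmf.median_survival_time_)
```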
Related to the topic of fairness is the topic of explainability. Explainability tools can help debug ML models and uncover biased decision making. For sequential decision making, the notion of sequential counterfactuals (Naumann & Ntoutsi, 2021) seems promising, as it takes into account the longer-term consequences of feature-value changes. The corresponding experiments were conducted on the Adult dataset; however, the fairness of the resulting decisions was not investigated. Further research in this direction is required.

Conclusion and outlook
Several open directions remain for studies on fairness-aware ML. First, in this survey, we focus on tabular data as the most prevalent data representation. In practice, however, other data types such as text (Zhao, Wang, Yatskar, Ordonez, & Chang, 2018) and images (Buolamwini & Gebru, 2018) are also used in fairness-aware machine learning problems. These data types are closely tied to their domains, and the corresponding datasets are handled in very different and specialized ways, which requires fairness-aware algorithms to be adapted accordingly.
Second, by generating the Bayesian network, we uncover conditional dependencies between the attributes. The results of our data analysis and experiments show that bias may appear in the data itself and/or in the outcome of predictive models. Understandably, if a dataset contains bias and discrimination, it is difficult for fairness-aware algorithms to find a good trade-off between fairness requirements and predictive performance. Furthermore, the significant variation in outcomes between the datasets in our experiments suggests that fairness-aware models need to be evaluated on diverse datasets.
Third, bias and discrimination are common problems in almost all domains. In this paper, we study well-known datasets covering important aspects of social life such as finance, education, healthcare and criminology. The definition of fairness, of course, differs across domains. Evaluating the effectiveness of fairness-aware algorithms is difficult because such evaluations must be grounded in appropriate fairness notions. It is therefore crucial to select or define the fairness notions suitable for each problem in each domain, as there is no universal fairness notion; this remains a major challenge for researchers.
Fourth, the selection of the protected attributes is also a matter of consideration. In the datasets surveyed in this paper, gender (sex), race, age and marital status are the prevalent protected attributes. The selection of one or more protected attributes for an experiment depends on many factors such as the domain, the problem and the purpose of the experiment. In our experiments, for each dataset, we only report the performance of the predictive model w.r.t. one of the most popular protected attributes. In addition, the identification and handling of "proxy" attributes is an issue that requires more research.
Fifth, collecting new datasets remains a standing need for data scientists. The surveyed datasets were all collected quite far in the past, with an average age of about 20 years; the oldest dataset was collected 48 years ago, while the newest one dates from 7 years ago. Naturally, the newer the data, the more closely it reflects the trends of modern society, so analyzing and applying fairness-aware algorithms to new datasets will capture social behavior more realistically. On the other hand, the older datasets remain valuable as a reference for comparing and contrasting how fairness evolves within the same or across different domains. The surveyed datasets were collected in the US and in European countries where data protection laws are in place; however, general policies on data quality and data collection still need to be studied and proposed.
To conclude, fairness-aware ML has recently attracted much attention in various domains, from criminology and healthcare to finance and education. This paper reviews the most popular datasets used in fairness-aware ML research. We explore the relationships between the variables and analyze their correlation with the protected attributes and the class label. We believe our analysis can serve as a basis for developing frameworks or simulation environments to evaluate fairness-aware algorithms. Moreover, a good understanding of the well-known datasets can also inspire researchers to develop synthetic data generators, because finding a suitable real-world dataset is never a simple task.

Funding Information
Ministry of Science and Education of Lower Saxony, Germany, project ID: 51410078