An experimental study measuring the generalization of fine‐tuned language representation models across commonsense reasoning benchmarks

In the last 5 years, language representation models, such as BERT and GPT-3, based on transformer neural networks, have led to enormous progress in natural language processing (NLP). One such NLP task is commonsense reasoning, where performance is usually evaluated through multiple-choice question answering benchmarks. To date, many such benchmarks have been proposed, and 'leaderboards' tracking state-of-the-art performance on those benchmarks suggest that transformer-based models are approaching human-like performance. Because these are commonsense benchmarks, however, such a model should be expected to generalize; that is, at least in aggregate, it should not exhibit excessive performance loss across independent commonsense benchmarks, regardless of the specific benchmark on (the training set of) which it has been fine-tuned. In this article, we evaluate this expectation by proposing a methodology and experimental study to measure the generalization ability of language representation models using a rigorous and intuitive metric. Using five established commonsense reasoning benchmarks, our experimental study shows that the models do not generalize well and may be susceptible to issues such as dataset bias. The results therefore suggest that current performance on benchmarks may be an over-estimate, especially if we want to use such models on novel commonsense problems for which a 'training' dataset, on which the language representation model could be fine-tuned, may not be available.


| INTRODUCTION
Commonsense reasoning has become a resurgent area of research in both the NLP and broader AI communities 1 (Davis, 2014; Storks et al., 2019; Zang et al., 2013), despite having been introduced as an early AI challenge more than 50 years ago, in the context of machine translation (Bar-Hillel, 1960). Traditionally, it was believed that the problem could only be solved through a combination of techniques, including Web mining, logical reasoning, handcrafted knowledge bases and crowdsourcing (Davis & Marcus, 2015; H. Liu & Singh, 2004; Moore, 1982). More recently, the advent of powerful 'transformer' neural networks, especially in NLP (Devlin et al., 2018), suggests that the time is right to build commonsense reasoners that generalize to a wide variety of situations, including those involving social and physical reasoning (Bisk et al., 2020; Sap, Rashkin, et al., 2019). There are several related reasons why commonsense reasoning is such an important topic in AI. Commonsense reasoning is an innately human ability that machines have (thus far) not proven adept at 'conquering', unlike other task-specific domains such as face recognition (W. Liu et al., 2017). Perhaps for that reason, it has always presented an enticing challenge to many AI researchers throughout the decades (Chklovski, 2003; Lenat et al., 1985; Marcus, 1998; Singh, 2002). There is also the widely held belief that, for a 'general AI' to truly emerge, commonsense reasoning is one problem, among others, that will need to be solved in a sufficiently robust manner (Baroni et al., 2017). A more functional reason for increased interest in commonsense reasoning is the rise of chatbots and other such 'conversational AI' services (e.g., Siri and Alexa) that represent an important area of innovation in industry (Basu, 2019; Gao et al., 2018; Ram et al., 2018; Young et al., 2017). Recently, the US Department of Defence also launched a machine common sense (MCS) program in which a diverse set of researchers and organizations, including the Allen Institute for Artificial Intelligence, is involved (Sap, Le Bras, et al., 2019).
Despite the success of these models, there is some evidence, not necessarily all quantitative, to suggest that the models are still superficial; that is, they do not have the same commonsense abilities as humans, despite what the performance numbers suggest. Davis and Marcus (2015) suggested in a seminal review article that, for truly human-level performance, 'knowledge of the commonsense world (time, space, physical interactions, people, and so on) will be necessary'. While we do not deny the theoretical possibility that a language representation model such as BERT, RoBERTa or GPT-3 may have learned these different aspects of the real world purely by 'reading' large corpora of natural language (Devlin et al., 2018; Brown et al., 2020), we do claim that such possibilities can, and must, be tested through rigorous evaluation. Unfortunately, as we cover in the Related Work, there has been little to no work by way of conducting such a systematic and focused analysis, with the central goal of evaluating the generalization of a system on commonsense reasoning using a publicly available and replicable system, although there is plenty of precedent for this type of study.
In this article, we attempt to address this gap by carefully designing and conducting an empirical study with the specific intent of answering the question of whether fine-tuned commonsense language models generalize in robust ways. Our goal is not to attack either a model or a particular benchmark, or a set of benchmarks, but to present clear and cautionary evidence that the current set of evaluations and reported results, as well as evaluation practices, need to be considered with more scepticism by the community. Considering the pace at which research on commonsense reasoning continues, we posit that this is a timely study and could serve as a methodology for future such studies assessing the generalization of commonsense AI. Specific contributions in this article are as follows:

1. We present a novel methodology that uses a pre-existing language representation model and commonsense question answering datasets to systematically evaluate whether the model achieves robust generalization across the datasets without necessarily having been 'fine-tuned' on the target dataset's training set.
2. We define a performance loss metric to reliably measure the generalization ability of language models across commonsense reasoning datasets. Our metric has an intuitive interpretation, and can potentially also be used in other experiments, that is, using other language representation models (or even commonsense reasoning models that are not based on language representation models) and benchmarks as they are published in this fast-moving community.
3. Our empirical results show that the language representation model is not generalizing as expected. While the language representation model achieves human-like performance on 'in-domain' commonsense reasoning tasks, such performance is not guaranteed on 'out-of-domain' commonsense reasoning tasks, even though all the tasks are commonsense. In other words, our results show that the choice of dataset on which a model is fine-tuned can make a big difference on the performance of the model on test questions, and that arguably, more research is needed before we can safely apply such models on commonsense problems in the real world.
This paper is structured as follows: first, we discuss related work and background on relevant topics in Section 2, such as transformer-based language representation models and adversarial attacks. Next, in Section 3, we introduce the commonsense question answering benchmarks and the transformer-based language model (RoBERTa) used in our experiments. We then describe the experimental methodology, including the definition of the performance loss metric, in Section 4.1. Following the methodology, the generalization of RoBERTa is carefully evaluated and analysed over five benchmarks in Section 4.2. Last but not least, in Section 5, we conclude this article and follow this with a discussion on limitations and future work.

| RELATED WORK
As noted in the Introduction, commonsense reasoning has recently experienced a resurgence in the AI research community. Central references that attest to this resurgence include (Davis, 2014, 2017; Zang et al., 2013; Tandon et al., 2018; Sap et al., 2020). We also noted that commonsense reasoning has been an ambitious agenda in the past. It is not feasible to cite all relevant work herein; instead, we refer the reader both to the review article by Davis and Marcus (2015), as well as to more recent surveys on commonsense reasoning tasks and benchmarks (Storks et al., 2019).
Much progress has been made on specific kinds of commonsense reasoning, especially in reasoning about time and internal relations (Ladkin, 1986; Pinto & Reiter, 1995), reasoning about actions and change (Narayanan, 2000), and the sign calculus (Davis & Marcus, 2015).
In contrast with WordNet and ConceptNet, CycIC focuses on designing a universal schema (a higher-order logic) to represent commonsense assertions, which also supports reasoning systems in conducting richer logical inference (Panton et al., 2006; Ramachandran et al., 2005).
In the last decade, in particular, as more accurate but also more 'non-interpretable' (or less explainable) models like neural networks have become more prevalent, a relevant line of research has developed in 'adversarially attacking' these models to understand their weaknesses in a variety of domains (Akhtar & Mian, 2018; Ilyas et al., 2018; Zügner et al., 2018). Other problems that require more precise inputs and prompts include bias in the data and also in the model (Kim et al., 2019; Lu et al., 2018). This line of work is a valuable precedent for our own work, and there has been some early work already on conducting such robustness tests on transformer-based language representation models (Cheng et al., 2020; Hsieh et al., 2019; Michel et al., 2019). However, this paper significantly departs in at least one respect from these lines of work: namely, we do not adversarially or selectively modify the input or the model in any way. Our results show, in fact, that sophisticated adversarial modifications are not necessary for concluding that generalization is a concern for transformer-based language representation models.
Transformer-based language representation models are neural networks with encoder-decoder architectures (Vaswani et al., 2017). Unlike recurrent neural network architectures, transformer-based models rely entirely on the self-attention mechanism to capture the global dependencies between input and output. With self-attention only, the total computational complexity per layer is considerably reduced, the computation can be parallelized, and the path length between long-range dependencies (between input and output positions) in the network is shortened. Transformer-based language representation models have excelled at various natural language processing tasks, such as machine translation (Lakew et al., 2018; Wang et al., 2019) and inference reasoning (Guo et al., 2019; Huertas-Tato et al., 2022), and their near-human performance has also been witnessed on commonsense reasoning tasks.
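To make the self-attention computation concrete, the following is a minimal NumPy sketch of the scaled dot-product attention described by Vaswani et al. (2017); the variable names and toy dimensions are illustrative and are not taken from any particular implementation.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Minimal scaled dot-product self-attention over a sequence.

    X: (seq_len, d_model) token representations.
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices.
    Returns (seq_len, d_k) contextualized representations.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # Every position attends to every other position in a single step,
    # which is why the path length between any two tokens is constant.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

# Toy example: 4 tokens, model dimension 8, attention dimension 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 4)
```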
Fine-tuned transformer-based language models are frequently used as baseline or reference models in the introduction of new benchmarks (Bhagavatula et al., 2020; Bisk et al., 2020; Huang et al., 2019; Ponti et al., 2020). The prominent models among them, such as GPT (Radford et al., 2018), BERT (Devlin et al., 2018) and BERT-based models (Conneau et al., 2019), generally achieve an accuracy of about 65% or more. Other more recently developed models (He et al., 2020; Zhou et al., 2021) based on BERT and T5 (Raffel et al., 2019) even, perhaps surprisingly, achieve a performance higher than human performance on certain benchmarks (Bhagavatula et al., 2020; Mihaylov et al., 2018; Zellers et al., 2018; Zellers, Holtzman, et al., 2019). Chang et al. (2021) show that supplementing pre-trained language representation models with commonsense knowledge graphs can also yield good few-shot performance. Wei et al. (2021) propose that fine-tuned language models also improve zero-shot performance on unseen tasks. However, the results presented herein show that fine-tuned language models can suffer from a significant performance loss when facing unseen, but still commonsense, tasks. These findings suggest that we should focus not only on accuracy, but also on the generalization of these models, especially on 'general' AI tasks such as commonsense reasoning.
Theoretical work on commonsense reasoning along the lines of cognitive science and computational commonsense paradigms should also be noted (J. Hobbs et al., 1987; J. R. Hobbs & Kreinovich, 2001; Gordon & Hobbs, 2017). We note this line of work because it could potentially be used for designing better evaluations, as well as for diagnosing why some transformer-based models are not generalizing better, despite (individually) good performance across the board on many benchmark datasets.

| Commonsense question answering (QA) benchmarks
As noted in both the introduction and the related work, commonsense reasoning has emerged as an important and challenging research agenda in the last several years. The usual way to evaluate systems (with the state-of-the-art systems being based, in some significant way, on the transformer-based models described in the next section) purporting to be capable of commonsense reasoning in the natural language setting is to use question answering benchmarks with multiple choices per 'question' from which exactly one correct answer must be selected by the system.
The NLP (and broader AI) community has developed numerous such benchmarks, especially in the last 3-5 years, using a range of methodologies both for acquiring the questions and for devising the answers. We describe the five benchmarks used in the research study in this paper below, with references for further reading. Examples are provided in Table 1.

TABLE 1 Question-answer instances from five commonsense benchmark datasets used for the evaluations in this paper. The question-like 'prompt' is highlighted in yellow, and the correct answer in blue.

1. aNLI (Abductive Natural Language Inference): Abductive Natural Language Inference (aNLI) 2 (Bhagavatula et al., 2020) is a commonsense benchmark dataset designed to test an AI system's capability to apply abductive reasoning and common sense to form possible explanations for a given set of observations. Researchers employ crowdsourcing (specifically, Amazon Mechanical Turk workers with qualifications) to write a hypothesis, given a context and a target label. The human-generated examples (where an example includes a context, a label and a generated hypothesis) are checked by the base model before using them in the final dataset construction. Formulated as a binary-classification task, the goal of this benchmark is to pick the most plausible explanatory hypothesis given two observations from narrative contexts. Compared to human performance of 0.93, the highest performance of a language representation model (at the time of writing) is 0.90, which is achieved by UNICORN.

3. PIQA (Physical Interaction QA): PIQA (Bisk et al., 2020) is a novel commonsense QA benchmark for naive physics reasoning, primarily concerned with testing machines on how we interact with everyday objects in common situations. The dataset is inspired by the website instructables.com, which instructs users on using everyday materials to build, craft, and manipulate objects. Human annotators are asked to provide a reasonable solution given a clear goal. Additionally, annotators are asked to provide a 'perturbation' to their solution that would make it invalid, often in a subtle way. PIQA tests, for example, what actions a physical object 'affords' (e.g., it is possible to use a cup as a doorstop), and also what physical interactions a group of objects afford (e.g., it is possible to place a computer on top of a table, but not the other way around). The dataset requires reasoning about both the prototypical use of objects (e.g., glasses are used for drinking) and nonprototypical (but practically plausible) uses of objects. Compared to human accuracy (0.95), the machine's best performance is 0.90, also achieved by UNICORN.
4. Social IQA: Social Interaction QA 6 (Sap, Rashkin, et al., 2019) is a QA benchmark for testing social common sense. In contrast with prior benchmarks focusing primarily on physical or taxonomic knowledge, Social IQA is mainly concerned with testing a machine's reasoning capabilities about people's actions and their social implications. Actions in Social IQA span many social situations, and answer candidates contain both human-curated answers and ('adversarially filtered') machine-generated candidates. The benchmark is constructed by executing crowdsourcing tasks on Amazon Mechanical Turk to create a context, a question, and a set of positive and negative answers. We note that both human and machine performance on Social IQA are slightly lower than on other benchmarks. Specifically, human accuracy on Social IQA is 0.88, with UNICORN achieving an accuracy of 0.83.

5. CycIC: The Cyc Intelligence Challenge dataset (CycIC) 7 is a set of multiple-choice and true/false questions requiring general commonsense knowledge and reasoning in a very broad variety of areas, including simple reasoning about time, place, everyday objects, events, and situations. Some questions may require some logic to arrive at the correct answer. Here, we only use the multiple-choice questions (and not the true/false questions) for experiments. The human performance on CycIC is 0.90; however, UNICORN was able to achieve a performance (0.94) higher than human performance.
One important aspect to note about these benchmarks is that, while all offer multiple-choice answer formats, the 'prompt' is not always a question. For example, in the case of the aNLI benchmark, the 'prompt' is a set of two observations, and the 'choices' are two hypotheses (of which the one that best fits these observations should be selected). For this reason, we refer to each question and corresponding answer choices as an instance. The instance formats of the benchmarks described above are stated in Table 2.
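To illustrate how such multiple-choice instances are typically scored by a fine-tuned transformer, the sketch below pairs the prompt with each candidate answer and selects the highest-scoring choice. It uses the Hugging Face transformers API with a generic roberta-base checkpoint as a stand-in; it is not the RoBERTa Ensemble system used in our experiments, and because the multiple-choice head is not fine-tuned here, the prediction is illustrative only.

```python
from transformers import AutoTokenizer, AutoModelForMultipleChoice
import torch

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMultipleChoice.from_pretrained("roberta-base")

def predict(prompt: str, choices: list) -> int:
    # Pair the same prompt with every candidate answer.
    enc = tokenizer([prompt] * len(choices), choices,
                    return_tensors="pt", padding=True, truncation=True)
    # The multiple-choice head expects (batch, num_choices, seq_len).
    enc = {k: v.unsqueeze(0) for k, v in enc.items()}
    with torch.no_grad():
        logits = model(**enc).logits  # shape: (1, num_choices)
    return int(logits.argmax(dim=-1))

# Example aNLI-style instance (hypothetical): two observations, two hypotheses.
prompt = "Obs1: The car would not start. Obs2: The car started right up."
choices = ["The battery had died.", "A mechanic replaced the battery."]
print(predict(prompt, choices))  # index of the chosen hypothesis
```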

| Transformer-based models and RoBERTa
As covered in the Related Work, transformer-based models have rapidly emerged as state-of-the-art in the natural language processing community, both for specific tasks like question answering, but also for deriving 'contextual embeddings' as representations (for more details, we refer the reader to the citations in that section).
The pre-trained representations yielded by transformer-based models preceding the Bidirectional Encoder Representations from Transformers model, or BERT (Devlin et al., 2018), are unidirectional (Peters et al., 2018; Radford et al., 2018). These models use 'left-to-right' or 'right-to-left' language modelling objectives, where every token can only attend to the previous or the later tokens in the self-attention layers of the transformers. It can be sub-optimal to apply these unidirectional models to token-level tasks, such as question answering, which require models to incorporate context from both directions for a complete understanding. BERT alleviates the unidirectionality constraint by using a masked language model (MLM) as a pre-training objective. The MLM objective asks models to predict the original identity of randomly masked tokens in the input, enabling the trained representations to fuse the left and right contexts.
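As a minimal illustration of the MLM objective (not code from the original BERT or RoBERTa training pipelines), the snippet below uses the standard Hugging Face fill-mask pipeline to let a pre-trained RoBERTa model predict a masked token from its bidirectional context; the example sentence is ours.

```python
from transformers import pipeline

# RoBERTa uses "<mask>" as its mask symbol; the model must use context on
# both sides of the mask to rank candidate tokens.
fill_mask = pipeline("fill-mask", model="roberta-base")
for candidate in fill_mask("A cup can be used as a <mask> for an open door."):
    print(f"{candidate['token_str']!r}: {candidate['score']:.3f}")
```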
RoBERTa is (in essence) a highly optimized and better-trained version of BERT and has been a particularly successful model. Two steps are included in the framework of RoBERTa: pre-training and fine-tuning. During pre-training, the model is trained on massive unlabelled data over various pre-training tasks. Following that, the pre-trained RoBERTa model is further fine-tuned using labelled data from the downstream tasks.

TABLE 2 'Instance formats' of commonsense QA benchmarks. The -> is used to separate what is given (e.g., obs1, obs2 for aNLI) from the answer choices (hypo1, hypo2). Here, 'obs', 'hypo', 'opt', 'ans', 'sol' and 'ctx' stand for observation, hypothesis, option, answer, solution and context, respectively.
Unlike the most recent model (GPT-3), a pre-trained version of RoBERTa is fully available for researchers to use and can be 'fine-tuned' for specific tasks (Y. Liu et al., 2019). Compared with BERT, RoBERTa makes several important modifications to the pre-training process. Researchers pre-train the RoBERTa model for longer, using bigger batches, more data and longer sentences. Additionally, in contrast with BERT pre-training, the next sentence prediction objective is removed from the RoBERTa pre-training objective. Finally, RoBERTa dynamically changes the masking pattern applied to the training data, rather than implementing a single static mask. Although these steps seem technical, they are highly effective, generally leading to a drastic improvement over the original BERT model in several problem domains.
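The following toy sketch contrasts static and dynamic masking; it is a simplification for intuition only and does not reproduce RoBERTa's actual pre-processing (which operates on sub-word tokens and applies additional replacement rules).

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_symbol="<mask>"):
    """Randomly replace a fraction of tokens with the mask symbol."""
    return [mask_symbol if random.random() < mask_prob else t for t in tokens]

tokens = "the cat sat on the mat because it was tired".split()

# Static masking (BERT-style): the mask pattern is sampled once during
# preprocessing and the same pattern is reused in every epoch.
static = mask_tokens(tokens)
for epoch in range(3):
    print("static :", static)

# Dynamic masking (RoBERTa-style): a fresh mask pattern is sampled each
# time the sequence is fed to the model.
for epoch in range(3):
    print("dynamic:", mask_tokens(tokens))
```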
Indeed, many of the systems occupying the top-5 leaderboard positions 8 for the commonsense reasoning benchmarks described earlier are based on RoBERTa in some significant manner. The experiments in this paper, described next, use a publicly available RoBERTa Ensemble model 9 that was not developed by any of the authors, either in principle or in practice, that can be downloaded and replicated very precisely, and that, on average, achieves over 80% on the five benchmarks when fine-tuned on the benchmark without any change to the model itself.

| EXPERIMENTS
We design and conduct a rigorous series of experiments (with full statistical analyses) to study the question noted in the title of this paper itself.
While the data and system have already been described in the previous section, we use the next section to provide some relevant technical and methodological details, followed by the results.

| Data and methodology
We use the five benchmarks described earlier for our evaluation datasets. Each of these benchmarks is publicly available, and even has a leaderboard dedicated to it. Many researchers have used these benchmarks for evaluating commonsense reasoning models (Storks et al., 2019). Note that the 'test' partition of these benchmarks is not available publicly; hence, for research purposes, the development or 'dev.' set is used as the test set. To ensure replication, we do the same. Our goal here is not to develop a superior algorithm that may do better on the unknown test set, but to explore the capabilities of a popular language model-based solution to this problem. Details on the benchmarks' training and development set partitions, as well as current state-of-the-art (SOTA) performance by a highly optimized RoBERTa system on the leaderboard (designed and fine-tuned on just that specific task or benchmark) are shown 10 in Table 3. As described in the previous section, we used the RoBERTa Ensemble model for our experiments, which achieves over 80% performance (on average) over the five benchmarks and is not substantially different from the SOTA model. While some formatting was necessary in order to ensure that the RoBERTa Ensemble system, when trained on one dataset (say CycIC), could be applied to instances from another dataset (say Social IQA), we did not modify any of the information content within the instances (either in the questions/prompts or in the answers).
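Schematically, the evaluation methodology amounts to a 'train on one benchmark, test on every benchmark' grid. The sketch below illustrates this grid; load_benchmark, fine_tune and evaluate are hypothetical placeholders standing in for the actual RoBERTa Ensemble training and evaluation scripts, which are not reproduced here.

```python
# Sketch of the cross-benchmark evaluation grid used in the methodology.
BENCHMARKS = ["aNLI", "HellaSwag", "PIQA", "SocialIQA", "CycIC"]

def cross_benchmark_accuracies(load_benchmark, fine_tune, evaluate):
    acc = {}
    for train_name in BENCHMARKS:
        train_set, _ = load_benchmark(train_name)
        model = fine_tune(train_set)              # fine-tune on one benchmark only
        for test_name in BENCHMARKS:
            _, dev_set = load_benchmark(test_name)
            # dev. sets stand in for the (non-public) test sets
            acc[(train_name, test_name)] = evaluate(model, dev_set)
    return acc  # 5 x 5 grid of accuracies (the basis of Table 4)
```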
Furthermore, since one of our goals is to test the generalization ability of precisely such models (i.e., a 'commonsense QA' model that has been trained on one kind of commonsense data, but evaluated on another), we define a performance loss (PL) metric to quantify the generalization ability of the model as follows:

PL = (Acc_indomain - Acc_outdomain) / Acc_indomain    (1)

Here, Acc_indomain is the 'in-domain' prediction accuracy achieved on the benchmark when we train the model on that benchmark's training set partition and evaluate on the development set from the same benchmark; Acc_outdomain is the 'out-of-domain' prediction accuracy achieved when one benchmark is used for training and another benchmark's dev. set partition is used for evaluation. Since there are four training options (the other four benchmarks) once the dev. benchmark has been fixed, it is possible to compute four separate Acc_outdomain values for any given benchmark.
The PL has an intuitive interpretation: how much of the 'in-domain' performance (in percentage terms) does the model 'lose' when the evaluation benchmark is changed? The PL metric ranges from negative infinity to 1, as shown in Figure 1. A negative PL is theoretically possible (and potentially unbounded from below, as Acc_indomain tends to 0), but is not observed in our experiments. When PL equals 0, the out-of-domain prediction accuracy is the same as the in-domain prediction accuracy, indicating that the model generalizes well and did not suffer any loss in prediction accuracy when the evaluation benchmark was changed. A high PL implies that the model generalizes less well. The PL can never be greater than 1 (which is attained when Acc_outdomain = 0).
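A minimal implementation of Equation (1), with purely illustrative accuracy values, is shown below.

```python
def performance_loss(acc_in: float, acc_out: float) -> float:
    """Performance loss (Equation 1): fraction of the in-domain accuracy
    that is lost when the evaluation benchmark is changed."""
    return (acc_in - acc_out) / acc_in

# Illustrative (hypothetical) numbers: an in-domain accuracy of 0.80 and an
# out-of-domain accuracy of 0.60 correspond to a performance loss of 25%.
print(performance_loss(0.80, 0.60))  # 0.25
# A model that generalizes perfectly loses nothing:
print(performance_loss(0.80, 0.80))  # 0.0
```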
Given the descriptions of the five benchmarks used in this paper in the previous section, we would expect that the PL would be greatest for the benchmarks that are highly narrow in their domains (e.g., if we train on Social IQA and test on PIQA) as opposed to cases when the training benchmark is broad (such as when the training benchmark is aNLI, HellaSwag or CycIC). In the next section, we assess the validity of this hypothesis.

| Results and analysis
The absolute accuracy results of the RoBERTa Ensemble model when trained on one benchmark and tested on another are tabulated in Table 4.
Overall, we see very clear evidence that, regardless of the training dataset used, out-of-domain performance inevitably declines, sometimes by a wide margin, and regardless of whether we test on a broad or narrow benchmark (e.g., PIQA). For better analysis of relative differences, we tabulate the performance loss (PL) metric in Table 5. The diagonals are all 0, since the performance loss is 0 when the training and testing benchmark are the same (per Equation 1). The numbers correspond closely to those in Table 4, but generally tend in the opposite direction (i.e., PL is lower when the absolute accuracy is higher for a test benchmark, all else being the same).

FIGURE 1 Value range of the performance loss (PL) metric.

TABLE 4 Accuracy (fraction of questions for which the answers are correctly predicted) of the RoBERTa Ensemble model in different evaluation settings. Each row represents the benchmark whose training set partition is used for training, while each column represents the benchmark whose development (dev.) set partition is used for testing.
Recall that, in the previous section, we stated a hypothesis that we expect test benchmarks that are too 'narrow' (such as PIQA or Social IQA) to exhibit more PL than benchmarks which are broader, except (possibly) when the training set is also broad. Table 5 shows that the data on this question are surprisingly mixed. In particular, PL on PIQA is always low when it is used as a test set, despite the fact that it covers such a narrow domain. In contrast, the PL on Social IQA is high (usually, the second highest after CycIC). Similarly, with respect to testing on the 'broader' benchmarks, PL is low on aNLI but higher on HellaSwag. When training on aNLI or HellaSwag, and comparing to training on either PIQA or Social IQA, we find that the difference is not considerable; for example, the system trained on HellaSwag achieves PLs of 16.8%, 33% and 56.7%, respectively, on aNLI, Social IQA and CycIC, and the system trained on PIQA achieves PLs of 16.9%, 33.6% and 54.4%, respectively, on the same three test sets. Therefore, it is simply not true that performance loss is observed simply because the 'domains are different' (though, by definition, they are all commonsense benchmarks), which is sometimes the cause in similarly designed (and more traditional) transfer learning and weak supervision experiments.
Interestingly, based on the data in both Tables 4 and 5, we find clear evidence that CycIC is the most 'different' benchmark, since the PL is markedly higher when CycIC is used as the training (and also the testing) dataset. Namely, the PLs observed in the CycIC 'column' are the highest among the values in the corresponding training dataset's 'row'; for example, when PIQA is used as the training dataset, the PL of 54.3% observed for CycIC is the highest in that row.

| Significance analysis
The paired Student's t-test methodology (the 'first' significance test mentioned in the previous section) was first applied to all values in Table 4, and compared against the 'diagonal' value. For example, the results obtained when training (respectively) on HellaSwag, PIQA, Social IQA and CycIC, and testing on aNLI, are tested individually using the test statistic from a paired Student's t-test analysis against the in-domain aNLI setting (the diagonal accuracy value of 81.9% in Table 4). We find that the null hypothesis can be rejected at the 99% level for all such paired tests, for all test benchmarks. The differences in accuracy are therefore significant, as are the PLs in Table 5, which are just a scaled, affine transformation of the accuracies in Table 4.
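A hedged sketch of such a paired test is shown below, using per-question correctness indicators (1 if the model answered a question correctly, 0 otherwise) for the in-domain model and one out-of-domain model on the same dev. questions; the correctness vectors here are synthetic, and the exact configuration reported above may differ in detail.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_questions = 1532                               # illustrative dev.-set size
in_domain = rng.binomial(1, 0.82, n_questions)   # synthetic correctness vector
out_domain = rng.binomial(1, 0.68, n_questions)  # synthetic correctness vector

# Paired t-test: both vectors are indexed by the same dev. questions.
t_stat, p_value = stats.ttest_rel(in_domain, out_domain)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
# A p-value below 0.01 would reject the null hypothesis at the 99% level.
```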

| Discussion on observed differences
The previous results clearly illustrate that the choice of the training benchmark matters, often in surprising (but statistically significant) ways. One hypothetical reason why this behaviour is observed is that PIQA may just be an 'easier' dataset, and CycIC may just be a 'harder' dataset (hence leading to lower and higher PLs, respectively). However, if this were the case, then the 'diagonal' values in Table 4 would reflect it. The observed values in Table 4 tell a different story; all results are clustered in the range of 75-83%, and the in-domain result for PIQA is similar to that of Social IQA, suggesting that the two are of reasonably equal difficulty. Yet, one proves to be more 'generalizable' than the other in out-of-domain settings.

TABLE 5 The performance loss (PL; Equation 1) of the RoBERTa Ensemble model when evaluated on a different benchmark ('out-of-domain') than the one it was trained on ('in-domain'). Each row represents the benchmark whose training set partition is used for training, while each column represents the benchmark whose development (dev.) set partition is used for testing.
Another hypothetical reason could be the number of answer choices available per prompt. The hypothesis is that, once there is a mismatch between the training and testing setting, the performance becomes more random. While this hypothesis may explain why CycIC has the highest PL (it has five answer choices, generally, for every question; see Table 2), it does not explain why Social IQA (which has three answer choices per question) has higher PL than HellaSwag (which has four). Furthermore, the large differences in accuracy observed in the out-of-domain settings in Table 4 cannot be explained only by differences in the number of answer choices available in the different benchmarks. Finally, if the model had become more random, expected accuracy on CycIC and aNLI would be around 20% and 50% respectively (assuming relatively equal distribution of answer choices); however, the accuracies are significantly higher, according to Table 4.
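For reference, the random-guess baselines implied by the number of answer choices per benchmark (as listed in Table 2) can be computed directly; the short snippet below makes the arithmetic explicit.

```python
# Expected accuracy of uniform random guessing, given the number of answer
# choices per instance (assuming answer choices are evenly distributed).
num_choices = {"aNLI": 2, "HellaSwag": 4, "PIQA": 2, "SocialIQA": 3, "CycIC": 5}
for name, k in num_choices.items():
    print(f"{name}: {1 / k:.2f}")
# e.g., CycIC -> 0.20 and aNLI -> 0.50, well below the accuracies in Table 4.
```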
Colloquially, the model is clearly learning 'something'. Furthermore, given the relatively large sizes and broad domains covered by some of these datasets (see Table 3; even CycIC, the smallest 'training' dataset, has more than 6000 questions in its training partition), it is unlikely that the fault lies purely with the data. The most plausible explanation that remains is that the fine-tuned RoBERTa model is subject to some degree of dataset bias, and that leaderboard performance numbers on individual benchmarks should not necessarily be assumed to be a reflection of advances in human-like 'commonsense reasoning' without significantly more qualification.

| Analysis of incorrect/correct question partitions
While Tables 4 and 5 reveal that there are differences between in-domain and out-of-domain settings, it is not clear from that data where the different models are 'going wrong'; specifically, where the in-domain training produces the right predictions, but an out-of-domain training setting does not. One such situation is when all out-of-domain models (ODMs) give the same wrong answer. On the aNLI and PIQA dev. sets, this situation is more common than the situation where all ODMs get the wrong answer but at least two ODMs differ on the answer (the same situation arises in the bottom table). More research is required to investigate the deeper implications of this result, but one hypothesis is a 'choice bias', at least in aNLI and PIQA. A choice bias arises when all answer choices do not have the same prior probability of being selected by the model. In particularly egregious cases, the question itself may not be necessary 12 for the model to choose the right answer. A set of experimental studies, not dissimilar in design to the studies herein, could be used to prove or disprove the existence of a choice bias by not using the actual 'question' (only the answer choices) during test time (and, subsequent to data collection, testing for significance against a random baseline). We leave such a study for future work.
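The partition analysis described above can be expressed as a simple counting procedure over aligned prediction vectors; the sketch below is a hypothetical reconstruction (the prediction arrays and category labels are placeholders, not the code used to generate Table 6).

```python
from collections import Counter

def partition(gold, idm_preds, odm_preds_list):
    """Count questions by in-domain (IDM) and out-of-domain (ODM) behaviour."""
    counts = Counter()
    for i, gold_label in enumerate(gold):
        idm = "IDMR" if idm_preds[i] == gold_label else "IDMW"
        odm_answers = {preds[i] for preds in odm_preds_list}
        if odm_answers == {gold_label}:
            odm = "ODMR"           # every ODM is right
        elif len(odm_answers) == 1:
            odm = "ODMW-same"      # every ODM gives the same wrong answer
        else:
            odm = "ODMW-mixed"     # ODMs disagree (at least one is wrong)
        counts[(idm, odm)] += 1
    return counts

# Toy example: 3 questions, gold answers and predictions as choice indices,
# with one in-domain model and four out-of-domain models.
gold = [0, 1, 2]
idm = [0, 1, 0]
odms = [[1, 1, 0], [1, 1, 1], [1, 1, 2], [1, 1, 0]]
print(partition(gold, idm, odms))
```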

| Qualitative analysis
To gain an 'intuitive' sense of what the different out-of-domain models are learning, it is also instructive to consider samples of questions from all five benchmarks such that the in-domain model made the correct prediction but all out-of-domain models made the same wrong prediction. Tables 7 and 8 list some examples of such questions. We find that, in some cases, the 'wrong' answer is an 'easy' wrong answer (yet either the in-domain or the out-of-domain model got it wrong). The example for aNLI in the third column (IDMW & ODMR) is particularly telling. In other cases, the wrong answer is qualitatively 'harder' (such as the example for Social IQA in the same column). In yet other cases, we may be more inclined to agree with the machine learning models, such as for the Social IQA row and the last column. Some of the questions may require a degree of specialized or culture-specific knowledge (such as the HellaSwag question in the last column).
We also provide some examples wherein out-of-domain models yielded incorrect answers that are all different in Table 9. We find that these instances tend to have shorter contexts and longer candidate answers. In some special cases, such as the IDMR CycIC instance, the candidate answers should also be considered as part of the provided context or question. These instances are undoubtedly difficult for language representation models to deal with. In addition, the correct answers in some instances are confusing, even from a human standpoint. The most obvious case is the IDMW Social IQA example. In our opinion (which may not be shared by, for instance, the authors of the benchmark itself), 'Aubrey' and his friend may choose to hit the ball, or have a lot of fun, if they decided to play baseball, rather than watch the game. Regardless of whether our opinion is the minority or majority opinion, this example suggests (at minimum) that human filtering and verification, as well as cultural dependencies and biases in commonsense instance construction, need to be more emphasized in the description of the benchmark's methodology. There also need to be appropriate quality-control measures (such as inter-annotator agreement). It is not evident that the standard commonsense benchmarks were subjected to such rigorous procedures during their initial construction.

TABLE 7 Example instances for selected settings from Table 6. IDMR and ODMR represent the population of instances where the in-domain model is right, and all out-of-domain models are also right. ODMW represents the instances where all out-of-domain models retrieved the same wrong answer. For each instance, the truth labels are highlighted in green, and the incorrect answers chosen by out-of-domain models are highlighted in orange.

The analysis here suggests several different aspects of an evaluation setup that are serving as 'confounds' to testing not only the generalization of commonsense reasoning systems, but commonsense reasoning ability itself. Some of the questions in these benchmarks may not be commonsense questions, in the way that cognitive scientists have understood the phenomenon, and may have a domain-or culture-specific dependence that may cause noise in evaluations. However, in a few questions, we clearly see that the model gets some very obvious answers wrong. Therefore, it is not the case that the model only has trouble on difficult cases, though this also plays a role in the results that we observe.
Addressing the flaws in these benchmarks, as well as using additional metrics (such as the performance loss) to evaluate candidate models, are both important avenues for future research. Eventually, a more advanced kind of benchmark may be warranted, one that has weighted or dynamically generated questions. Some researchers have suggested moving beyond multiple-choice questions altogether and focusing on generation tasks instead (B. Y. Lin et al., 2019;Nan et al., 2020).
TABLE 8 Example instances for selected settings from Table 6. IDMW and ODMW represent the instances where the in-domain model was wrong, and all out-of-domain models retrieved the same wrong answer (which may be different from the IDM's wrong answer). ODMR represents the population of instances where all out-of-domain models are right. For each instance, the truth labels are highlighted in green. The incorrect answers are highlighted in pink. Note that, in the last column (IDMW & ODMW), it may be that the IDM's wrong answer differs from the wrong answer retrieved by the ODMs, in which case we highlight the former in pink and the latter in orange. When the in-domain model and out-of-domain models retrieved the same incorrect answer, we highlight those incorrect answers in pink.

| CONCLUSION
Language representation models such as BERT, RoBERTa and (more recently) GPT-3 have received prolific academic and media attention due to their ability to achieve near-human performance on a range of individual benchmarks, including on several commonsense benchmarks. In this paper, we showed that there is still a significant performance drop when one such competitive model is trained on one kind of commonsense dataset but tested on other commonsense datasets. It is important to remember that all datasets considered in this work were supposed to test 'commonsense reasoning', although some are more diverse and broader than others. The breadth of either the training or testing dataset is not found to significantly impact the overall conclusions. At minimum, our analyses suggest commonsense models are not generalizing enough and there is a potential source of dataset bias when evaluating commonsense reasoning.
More research is required before 'human-like' commonsense reasoning performance can be confidently said to be within reach of these language representation models. Clear evidence in our experiments suggests that, on the one hand, we need to construct a 'holistic' benchmark that could be used to systematically train and evaluate the full suite of commonsense reasoning capabilities. The benchmark should not be limited to some 'local' commonsense knowledge but should capture truly 'human-like' commonsense (i.e., be more general and bias-free). The filtering and cleaning of the current benchmarks may also be useful next steps to more accurately evaluate models purporting to do commonsense reasoning. On the other hand, the argument could be made that the benchmarks are collectively 'doing their job' and that more fundamental innovations are needed to improve the robustness of the language representation model in the first place. In addition, it is still not clear whether the transformer architecture in language representation models has particular effects on the generalization of models, compared to the previously dominant recurrent neural networks, even though the former achieve excellent performance on current benchmarks.
TABLE 9 Examples wherein out-of-domain models retrieved several different incorrect answers. Similar to Tables 7 and 8, the truth labels and the incorrect answers chosen by in-domain models are highlighted in green and pink, respectively. In the IDMR column, the pink highlight does not exist because the in-domain model gave the correct answer. The incorrect answers chosen by out-of-domain models are highlighted in orange. Note that, for the two-option benchmarks (aNLI and PIQA), instances only have one incorrect answer by definition; hence, examples where different out-of-domain models choose several different incorrect answers do not exist. Therefore, these benchmarks are excluded from the table.

HellaSwag (IDMR): ctx_a: 'A group of cheerleaders run onto a stage before a cheering audience.', ctx_b: 'they', opt1: 'get into formation, then begin dancing and flipping as male cheerleaders join them.', opt2: 'are then shown performing the type of cheerleading dance, using batons and pole vaults.', opt3: 'perform a cheer routine before the girls, along with makeup artists, spread out and pose.', opt4: 'take turns jumping on each other like they are performing karate.'

HellaSwag (IDMW): ctx_a: 'A man is holding a pocket knife while sitting on some rocks in the wilderness.', ctx_b: 'then he', opt1: 'opens a can of oil put oil on the knife, and puts oil on a knife and press it through a can filled with oil then cuts several pieces from the sandwiches.', opt2: 'uses the knife to shave his leg.', opt3: 'takes a small stone from the flowing river and smashes it on another stone.', opt4: 'sand the rocks and tops them by using strong pressure.'

Social IQA (IDMR): context: 'Jan took me to NYC to celebrate my birthday before the baby was born.', question: 'What will others want to do after?', answerA: 'leave me in NYC', answerB: 'wish me a happy birthday'

Social IQA (IDMW): context: 'Aubrey met a friend at the park and they decided to go play baseball.', question: 'What will happen to others?', answerA: 'will watch the game', answerB: 'will hit the ball'

Ultimately, such language representation models would be used in the real world, not only in multiple-choice settings, but also in so-called generative settings where the model may be expected to generate answers to questions (without being given options). Even in the multiple-choice setting, without robust commonsense, the model will likely not be usable for actual decision making unless we can trust that it is capable of generalization (Kejriwal, 2021; Misra, 2022; Wahle et al., 2022). One option for implementing such robustness in practice may be to add a 'decision-making layer' on top of a pre-trained language representation model rather than aim to modify the model's architecture from scratch 13 (Hong et al., 2021; Tang & Kejriwal, 2022; Zaib et al., 2020). This layer would be trained to indicate when the model is uncertain about answering a given question, with its candidate answer choices. Properly developed, such a layer and architecture could help language representation models become more robust by skipping adversarial questions that cause degradation in performance, and achieve more human-like behaviour even when the target distribution is different from the distribution on which the model was originally fine-tuned. However, building such an architecture remains an open problem.
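As a purely hypothetical sketch (not an implementation proposed or evaluated in this paper), such a decision-making layer could, in its simplest form, abstain whenever the model's softmax confidence over the answer choices falls below a threshold.

```python
import numpy as np

def decide(choice_logits, threshold=0.6):
    """Return the index of the chosen answer, or None to abstain."""
    logits = np.asarray(choice_logits, dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over the answer choices
    best = int(probs.argmax())
    return best if probs[best] >= threshold else None

print(decide([3.2, 0.1, -1.0]))  # confident -> returns 0
print(decide([0.4, 0.3, 0.2]))   # nearly uniform -> returns None (abstain)
```

A learned layer, trained on held-out out-of-domain data, could of course do better than a fixed threshold, but even this simple rule makes the trade-off between coverage and reliability explicit.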

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.

How to cite this article: Shen, K., & Kejriwal, M. (2023). An experimental study measuring the generalization of fine-tuned language representation models across commonsense reasoning benchmarks. Expert Systems, 40(5), e13243. https://doi.org/10.1111/exsy.13243