Predictive modeling of individual human cognition: Upper bounds and a new perspective on performance

Model evaluation is commonly performed by relying on aggregated data as well as on relative metrics for model comparison and selection. In light of recent criticism of the prevailing perspectives on cognitive modeling, we investigate models for human syllogistic reasoning in terms of predictive accuracy on individual responses. By contrasting cognitive models with statistical baselines, such as random guessing or the most frequently selected response option, as well as with data-driven neural networks, we obtain information about the progress cognitive modeling has achieved for syllogistic reasoning to date and about its remaining potential.


Introduction
"What I cannot create, I do not understand." This famous quote by Richard Feynman is one of the core maxims of model-driven research. Only if we are able to capture the fundamental mechanics of nature, effectively allowing us to simulate or re-create the associated behavior, can we speak of having gained true understanding. Translated to the domain of cognitive science, this quote is a reminder to constantly keep pushing cognitive models to their limits in order to improve not only their performance but ultimately our understanding of the mental processes they reflect.
Recently, however, voices have surfaced questioning the merit of current modeling endeavors. On the one hand, there is an ongoing debate about the role of individual data in modeling. Critics of the prevailing focus on data aggregation and corresponding population-based models have demonstrated a lack of group-to-individual generalizability both for experimental (Fisher, Medaglia, & Jeronimus, 2018) and for statistical research (Molenaar, 2004). They argue that while potentially useful for insight into typical human behavior, research into aggregates cannot be used to gain understanding about a single individual's cognitive system (Miller et al., 2002). On the other hand, though undoubtedly related, there is ongoing discussion about the methodologies used in cognitive modeling. For example, with the recent efforts to make Bayesian inference models applicable for the broader research community, probabilistic models and corresponding modeling paradigms (especially with respect to model evaluation and selection) have seen a surge in popularity (Vandekerckhove, Rouder, & Kruschke, 2018). However, critics argue that while ideal for discovering statistical relationships which can be tied to high-level theoretical assumptions, Bayesian models cannot be used as algorithmic or process-focused approximations of cognition (Fugard & Stenning, 2013; Stenning & Cox, 2006). Instead, some authors consider probabilistic models "a confession of ignorance, in that one might be viewed as obliged to push the model one step further and specify the mechanisms that would generate one set of probabilities as opposed to another" (Guyote & Sternberg, 1981).
In this article, we wish to add to the ongoing discussion about the explanatory power of current cognitive models. We adopt a bird's-eye view, posing the fundamental question inspired by Richard Feynman's quote: To what degree are state-of-the-art models capable of reflecting what we are fundamentally interested in, the human mind? We investigate this for the exemplary domain of syllogistic reasoning, one of the core fields of human reasoning research.
With a long history of research stretching over 100 years and a state of the art encompassing at least 12 cognitive theories (Khemlani & Johnson-Laird, 2012), syllogistic reasoning lends itself as a demonstrative domain to investigate the levels of understanding research has achieved. In this domain, we define a prediction task querying models for precise responses to given syllogistic problems. The final model evaluation is performed by comparing the predictions with the actual human responses. To determine the absolute quality of models, we contrast cognitive accounts with data-driven methods from machine learning, namely a set of neural networks based on different features of the data. By comparing cognitive models with the data-driven results, we explore the potential that remains in the field and determine empirical upper bounds of performance to set goals for future modeling endeavors.
A syllogism is a form of categorical assertion consisting of two premises interrelating a set of three terms via quantifiers (All, Some, No, Some ... not):

All A are B
All B are C

In experimental settings, participants are asked to relate the end terms of the premises (A and C in the example above), that is, the terms occurring in only one of the premises. Psychological research has shown that human syllogistic reasoning does not strictly follow formal logic principles (Wetherick & Gilhooly, 1995). Instead, past research has produced various theories attempting to explain the cognitive principles underlying syllogistic inferences (Khemlani & Johnson-Laird, 2012). Since the domain is well defined (taking the arrangement of terms into account, there are 64 distinct syllogistic problems and a total of nine possible responses, including "No Valid Conclusion" indicating that the end terms cannot be related based on the premise information), syllogisms are an accessible domain for cognitive modeling to investigate what is assumed to be one of the fundamental concepts of human reasoning.
The remainder of this article is structured as follows. First, we introduce the state of the art in modeling human syllogistic reasoning. Second, we define the predictive modeling task as the foundation of our analysis and introduce the baseline models used to put cognitive model performances into perspective. Finally, we present the results of our analysis and discuss their implications for modeling syllogistic reasoning in particular and cognitive science in general.

Related work
Traditionally, research on human syllogistic reasoning focuses on investigating deviations between human inferences and normative first-order logic (Wetherick & Gilhooly, 1995). Over the course of time, the empirical phenomena of syllogistic reasoning matured and were integrated into theories relating statistical effects such as the figural effect (Bara, Bucciarelli, & Johnson-Laird, 1995) with assumptions about mental representations (e.g., in the Mental Models Theory; Johnson-Laird, 1983) or fundamental principles of cognition (e.g., the Probability Heuristics Model by Chater & Oaksford, 1999).
A meta-analysis (Khemlani & Johnson-Laird, 2012) compiled a list of 12 contemporary theories along with the corresponding sets of derived conclusions for each syllogism. By comparing these with a set of "reliable pooled conclusions," that is, a dichotomization based on which responses were selected by at least 16% of participants, they performed an analysis assessing how well the individual theories were able to predict human responses.
Employing classification metrics (hits, misses, correct predictions), the authors concluded that no single model clearly outperformed the others. Instead, they found that, depending on the metric of choice, all models exhibited distinct strengths and weaknesses, rendering a conclusive performance-based ordering difficult.
More recent work leveraged the differences in predictive properties of heuristics for syllogistic reasoning by constructing portfolios exploiting the strengths while avoiding the weaknesses of individual models (Riesterer, Brand, & Ragni, 2018). We showed that the predictive accuracy of the resulting composite model (43%) clearly outperformed individual models (ranging between 37% and 18% for the best and worst cognitive model, respectively). In contrast to the meta-analysis discussed above, we directly based our analysis on individual responses instead of aggregates. The resulting accuracies demonstrated the limited capabilities of heuristic models when confronted with an individual prediction task.
This shift in perspective, from modeling population data via pooled conclusions to modeling individual responses, is motivated by the fact that the core objective of modeling human reasoning is the development of functionally equivalent computational formalisms capturing the essence of the processes driving human inferences. In today's research into syllogistic reasoning, process-driven performance analyses directly on the level of individuals are scarce. Especially in light of recent work in statistics showing that group-to-individual generalizability is limited if not impossible for parts of psychology and other empirical fields of science (Fisher et al., 2018; Molenaar, 2004), modeling individual data directly will become unavoidable to ensure the future success of the field. Currently, though, cognitive models do not allow for individualization, either based on a history of responses or on additional background information containing individual difference measures (such as working memory capacity, which was shown to be linked to reasoning ability; Süß, Oberauer, Wittmann, Wilhelm, & Schulze, 2002).
In the following analyses, we investigate the potential remaining in the field by contrasting cognitive models with data-driven approaches in a prediction scenario focusing on individual human responses. It is important to note that the following work is not targeted toward model assessment in the traditional sense, but toward a comparison with methods that are expected to yield an upper bound for predictive performance.

Method
In this section we present the core modeling task of this article: predicting individual responses for given syllogistic reasoning problems.
As the foundation for our evaluation, we rely on the Ragni 2016 dataset supplied with the Cognitive Computation for Behavioral Reasoning Analysis (CCOBRA) framework.1 This dataset stems from an experiment in which 204 participants (125 female, 79 male) responded to the full set of 64 syllogisms by selecting which of the nine conclusion options followed from the premises. The experiment was conducted online via Amazon Mechanical Turk, and participants received a nominal fee as reimbursement for their participation. From the set of 204 participants, 65 had to be excluded due to performances below guessing level or exceptionally long response times. The dataset we use thus consists of 139 participants (for additional details regarding the available data, see Ragni, Dames, Brand, & Riesterer, 2019). The model evaluation was performed in a leave-one-out cross-validation setting where, for each subject to be predicted, the models were fitted using the remaining 138 participants as training data. All code and data required for the analyses are publicly available on GitHub.2
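The evaluation scheme can be sketched as follows. This is an illustrative sketch, not the CCOBRA implementation: `fit` and `predict` are hypothetical placeholders for an arbitrary model, and the data layout (one list of (syllogism, response) pairs per participant) is assumed for illustration.

```python
def leave_one_out_accuracy(fit, predict, data):
    """data: list of participants, each a list of (syllogism, response) pairs.

    For each participant, fit the model on all remaining participants and
    score the proportion of correctly predicted responses.
    """
    scores = []
    for i, test_person in enumerate(data):
        train = data[:i] + data[i + 1:]       # all other participants
        model = fit(train)                    # population-level training
        hits = sum(predict(model, syl) == resp for syl, resp in test_person)
        scores.append(hits / len(test_person))
    return sum(scores) / len(scores)          # mean accuracy over subjects
```

With 139 participants, each fold trains on the other 138 and tests on the 64 responses of the held-out individual.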

The predictive modeling problem
The modeling problem is defined as the task to generate a conclusion for a given syllogism. More formally, the goal is to find a function f: X → R which transforms a problem input x ∈ X into a response r ∈ R, where X and R correspond to the sets of 64 syllogistic problems and nine possible conclusions, respectively. Models are finally evaluated based on their predictive accuracy, that is, the proportion of correct predictions on a given evaluation dataset. In sum, the modeling problem can be formulated in terms of an optimization problem for a prediction function f(x) dependent on input x (syllogistic problem). The optimization procedure maximizes an accuracy score h, for example, hits, dependent on the prediction f(x_{i,t}) for problem x_{i,t} and target output y_{i,t} (human response), where t identifies the position in the experimental sequence for a dataset of size N:

arg max_f Σ_{i=1..N} Σ_t h(f(x_{i,t} | x_{i,1}, ..., x_{i,t-1}, y_{i,1}, ..., y_{i,t-1}), y_{i,t})

This problem definition has properties which are beneficial for cognitive modeling. First, it relies on a highly descriptive performance metric with a close connection to modern machine learning (optimization via error reduction). Consequently, good performance results (evaluated on unseen test data) are likely to translate to a sensible estimate of performance in application contexts. Second, the performance metric stretches over a clearly defined range of values between all misses (0%) and perfect prediction (100%), allowing for an assessment of absolute performance. The higher the score, the better a model is capable of approximating human reasoning behavior. The modeling task can be considered solved only if performance converges toward 100%. Finally, and arguably most importantly, it directly uses the data recorded in experiments without introducing the risk of misinterpretation due to making statements about populations or "average" reasoners which might not even exist (Miller et al., 2002).
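The accuracy score underlying this formulation reduces to a proportion of exact matches between predicted and observed responses. A minimal sketch (the response labels are hypothetical):

```python
def predictive_accuracy(predictions, responses):
    """Proportion of exact matches between model predictions and the
    responses an individual actually gave (0.0 = all misses, 1.0 = perfect)."""
    assert len(predictions) == len(responses)
    hits = sum(p == r for p, r in zip(predictions, responses))
    return hits / len(responses)
```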

Cognitive models for syllogistic reasoning
As a starting point for our analysis, we relied on the prediction table reported in Khemlani and Johnson-Laird (2012, table 7). To compile this list of predictions, Khemlani and Johnson-Laird went to great lengths collecting the most up-to-date versions of the respective approaches while maintaining close communication with the theories' inventors or current maintainers. Since the prediction table contains lists of possible conclusions for each syllogism, we generate specific responses for our analysis by performing uniform sampling. Unfortunately, however, the simplicity stemming from organizing model predictions in such a static tabular form fails to capture the intricacies of some methods (e.g., Baratgin et al., 2015). As a result, one should treat these representations as baselines for cognitive models' performances instead of comprehensive accounts reflecting their theoretical merit. Still, since prediction-oriented implementations of syllogistic models are rare, and custom implementation introduces the risk of integrating incorrect assumptions stemming from misconceptions about a theory's intent, we rely on the data from Khemlani and Johnson-Laird (2012) to obtain a conservative estimate of the general performance of cognitive models.

Baseline models for syllogistic reasoning
In order to put the predictive performances of cognitive models into perspective, we introduce a set of baseline models. The Random model assumes a uniform distribution over the nine syllogistic responses. When queried for a response, one of the nine options is sampled uniformly at random with probability 1/9. This model serves as a random baseline all models are expected to exceed.
On the upper end of the performance spectrum, we provide the Most-Frequent Answer (MFA) model, which computes the response distribution per syllogism from the given training data. Predictions are generated by returning the response with the highest probability mass (ties are resolved by uniform sampling). Since the predictive modeling scenario forces models to generate a single response to a given syllogism, the MFA is the optimal strategy when no information about the individual reasoner is provided. In the present analysis, we do not rely on additional background information containing individual difference measures. However, by providing feedback about the true response a participant gave, models can leverage the history of responses to adapt their behavior. This implies that models are only able to surpass the MFA prediction performance if they identify and differentiate between human response strategies.
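The MFA baseline can be sketched as follows, assuming training data given as one list of (syllogism, response) pairs per participant; the syllogism and response identifiers are hypothetical placeholders.

```python
import random
from collections import Counter

def mfa_model(train):
    """Most-Frequent Answer: for each syllogism, predict the modal response
    observed in the training data; ties are broken by uniform sampling."""
    counts = {}
    for person in train:
        for syllogism, response in person:
            counts.setdefault(syllogism, Counter())[response] += 1

    def predict(syllogism):
        counter = counts[syllogism]
        top = max(counter.values())
        # uniform sampling among all responses tied for the maximum count
        return random.choice([r for r, n in counter.items() if n == top])

    return predict
```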

Neural models for syllogistic reasoning
To answer the question about the remaining potential in the field of human syllogistic reasoning, we need to provide upper bounds of performance. Since it is not trivially possible to quantify the numerous noise components in the data which stem from inconsistent responses or highly individual inference strategies, we focus on providing empirical upper bounds obtained from data-driven machine learning methods. While not offering explanatory insight, the resulting accuracies give an indication of which proportion of the data can be successfully predicted by following the structural properties of the data. In particular, we introduce three neural networks focusing on three different perspectives of the predictive modeling problem. Even though neural networks are severely limited with respect to providing high-level explanations of cognitive processes, they have proven capable of achieving high levels of performance in recent years and are suitable candidates for obtaining information about the potential remaining in the field.
The first neural network model is a Multilayer Perceptron (MLP), a standard feed-forward neural network featuring a 12-256-256-9 topology: a 12-dimensional input consisting of three blocks of four bits each for the one-hot-encoded quantifiers and figure,3 which is fed into two hidden layers of dimensionality 256 equipped with rectified linear activation units, and finally into the nine-dimensional output layer, which indicates the generated response. The model is initially trained by providing batches of individual syllogistic problems and corresponding human responses, and it is optimized using the Adam optimizer (Kingma & Ba, 2014) with mean squared error as the loss function. After a prediction is obtained, the model is supplied with the true response in order to allow for adaptation to individual reasoning processes. This adaptation step is realized by training the model for an additional epoch using the new datapoint.
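The 12-bit input encoding can be illustrated as follows. The quantifier order and the figure numbering are assumptions inferred from the examples given in Note 3, not a specification from the original implementation:

```python
# Assumed one-hot order of the four syllogistic quantifiers (inferred from Note 3).
QUANTIFIERS = ["All", "Some", "No", "Some not"]

def encode_syllogism(q1, q2, figure):
    """Encode a syllogism as the 12-bit MLP input: three one-hot blocks of
    four bits for the two premise quantifiers and the figure (1-4)."""
    vec = [0] * 12
    vec[QUANTIFIERS.index(q1)] = 1        # bits 0-3: first premise quantifier
    vec[4 + QUANTIFIERS.index(q2)] = 1    # bits 4-7: second premise quantifier
    vec[8 + figure - 1] = 1               # bits 8-11: figure (term arrangement)
    return tuple(vec)
```

Under these assumptions, "All A are B; All B are C" yields the first example vector from Note 3, and "Some B are A; Some B are not C" the second.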
Second, a recurrent neural network (RNN) is employed, which explicitly integrates temporal dependencies into the conclusion generation process (for a conceptual introduction, see Elman, 1990). The model features a 12-64-64-9 topology consisting of the 12-dimensional inputs, two recurrent Long Short-Term Memory layers (Hochreiter & Schmidhuber, 1997), and the nine-dimensional outputs. Again, the model is trained using Adam, but with categorical cross-entropy as the error function (Deng, 2006). This model does not incorporate interindividual differences. However, by actively modeling the task sequence, it is technically able to identify sequence effects that may be beneficial features for the prediction generation process.
Finally, a Denoising Autoencoder is applied which frames the predictive modeling problem as a reconstruction task. Similar to the domain of image restoration, in which autoencoders have successfully been applied (Xie, Xu, & Chen, 2012), we supply the model with incomplete data about a reasoner. The goal of the model is to correctly fill in the blanks. This model is implemented as a 576-2000-576 network featuring a 576-dimensional input obtained by concatenating the one-hot-encoded responses of a single reasoner to the 64 syllogistic problems. As such, the inputs represent an individual reasoner's profile. In the hidden layer, this profile is expanded to a high-dimensional space in which relationships between the input dimensions become explicit. From this intermediate representation, the original input can be decoded again. During training, the model is presented with batches of input vectors, that is, reasoner profiles, which were manipulated by randomly setting values to zero. By training the model to approximate an identity function between noisy inputs and complete outputs, realized by minimizing their mean squared error via Adam, it learns to associate the available information in a way that enables the reconstruction of missing values. Over the course of the model evaluation, the autoencoder collects the individual's responses in the adaptation step to continuously complete the originally empty reasoner profile. Over time, it leverages the growing information about the individual, which allows it to improve its predictive accuracy.
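The profile construction and the denoising corruption step can be sketched as follows. This illustrates only the input handling (64 syllogisms × 9 response options = 576 dimensions), not the 576-2000-576 network itself, and the index-based interface is an assumption for illustration:

```python
import random

def make_profile(responses, n_syllogisms=64, n_options=9):
    """Concatenate one-hot responses into a 576-dimensional reasoner profile;
    unanswered syllogisms stay all-zero and are filled in over time."""
    profile = [0.0] * (n_syllogisms * n_options)
    for syllogism_idx, response_idx in responses.items():
        profile[syllogism_idx * n_options + response_idx] = 1.0
    return profile

def corrupt(profile, drop_rate=0.5, rng=random):
    """Denoising input construction: randomly zero out entries so the network
    must reconstruct the complete profile from partial information."""
    return [0.0 if rng.random() < drop_rate else v for v in profile]
```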

Predictive accuracies
The general evaluation results are depicted in Fig. 1. The figure shows that all models exceed the random model's predictive accuracy of 11%, attesting to the general ability of models to capture the most basic properties of human syllogistic reasoning. The blue block of models encompasses the entirety of the cognitive models, spanning a range from 18% to 34% predictive accuracy. Verbal Models, the best cognitive model, is followed by a substantial performance gap to the RNN and, more importantly, to MFA, the model always responding with the conclusion most frequently occurring in the training dataset. This constellation of model performances has a major implication for the state of the art in modeling syllogistic reasoning: There is considerable potential left to improve models even without taking interindividual differences into consideration. The fact that both the autoencoder and the MLP surpass the MFA performance shows that they were able to leverage the response histories of human reasoners by identifying underlying strategies. Note that the adaptation task forces models to initially start predicting without having information about the individual to be predicted for (the so-called cold-start problem). Therefore, MFA is the best strategy for the first predictions until enough feedback has been collected. This implies that improvements of performance are much harder to achieve above MFA than below, and means that comparing absolute differences in performance above and below MFA should be handled with caution.
Going beyond MFA, the adaptive neural networks (autoencoder and MLP) demonstrate a basic capability to capture individual reasoning patterns and exploit them to boost predictive accuracy. However, within this family of models, differences in performance emerge. Relying on temporal dependencies, the RNN model achieves the lowest accuracy scores, even falling short of MFA. The reasons for this could be manifold, ranging from the application of an unsuitable model topology to problems emerging from the limited amount of training data. However, a more data-centric argument could be that by increasing the data complexity due to the integration of a temporal axis, the models are presented with a problem that is much more difficult to learn than the basic syllogism-response transformation. As a result, temporal dependencies, or more precisely sequence effects (Aczel & Palfi, 2016), cannot be recognized and leveraged to boost the predictor's accuracy.
The autoencoder, which transforms the modeling problem into a reconstruction task, achieves higher accuracies than the RNN and exceeds the MFA strategy. This shows that treating responses as a form of reasoning profile is a suitable representation to base predictors on.
Finally, the MLP achieves the highest accuracy overall (48.3%). It demonstrates that adaptation to individual properties of cognition via continuous retraining with newly obtained information can be successfully applied to boost model performance. This approach is not exclusively tied to neural network approaches but should generalize to arbitrary parameterized models fitted to training data.

Training performance
Analyzing the reasons causing networks to perform poorly on data is a difficult task (Lee, Agarwal, & Kim, 2017). To rule out a network's inability to learn the fundamental properties of the syllogistic reasoning data, we investigate the training procedure, illustrating the accuracy progression on the training and test data per training epoch.
The accuracy progression of the network models during training is depicted in Fig. 2. The blue and orange lines represent the mean accuracies (with the shaded band reflecting the 95% confidence interval) on the training and test datasets, respectively. For the RNN, the rise of the training dataset accuracy beyond 90% suggests that, in principle, the network is able to capture the properties of the training data. However, the fact that the performance on the test data only rises for a short duration at the beginning of the training process indicates that the learned patterns cannot be generalized successfully to the test instances. The center plot for the autoencoder model paints a similar picture. Even though the effects of overfitting are not as dramatic as for the RNN, training accuracy clearly improves while damaging the network's generalization capabilities on the test data. An alternative explanation for the superiority of the autoencoder could be that information about individual reasoners is more important for the prediction process or more directly related to specific responses. Finally, the MLP model, despite its predictive capabilities, shows the least amount of learning behavior. After a quick initial bump, the model drops in performance almost instantly and remains constant for the remainder of training. This is most likely due to the limited and inconsistent input and target data. Since in the case of the RNN and autoencoder each training example is high-dimensional and directly incorporates interindividual differences, it is unlikely to observe inconsistencies, that is, different outputs for the same input. In contrast, the MLP is fed with 12-bit vectors representing syllogistic problems and produces response predictions for individuals. Since individuals respond differently to the same problems, the data are highly inconsistent and force the model to adopt a strategy similar to MFA, in which an average reasoner is approximated. Classical overfitting is not possible in this scenario.
The observed training performance leads to two conclusions. On the one hand, human syllogistic reasoning appears to follow systematic patterns which, to some degree, can be leveraged by data-driven methods. The fact that both the RNN and the autoencoder are able to fit the training data up to nearly 100% additionally suggests that inconsistencies in the given sequence data (RNN) and reasoner profiles (autoencoder) are minimal. On the other hand, the raw training capabilities of the networks do not generalize well to unseen data. Even though the accuracy on the test data is substantially higher when compared to cognitive models, the training progression shows quick stagnation. The reasons for this could be numerous, ranging from problems with respect to data complexity and informational content to the small size of the training dataset (138 training instances).

Adaptation performance
In a final analysis, we put the adaptation performance of the data-driven models into the focus of attention. By continuously retraining the MLP, actively considering potential sequential effects of reasoning in the case of the RNN, and incrementally updating the reasoner profile for the autoencoder, all network models possess basic capabilities for improving their predictions for individual reasoners. The effects of adaptation are illustrated in Fig. 3, which depicts moving averages of prediction accuracies (dashed colored lines) throughout the experimental sequences of syllogistic problems. To obtain the figure, a window (of size 3 in our case) was slid across the sequences of 64 syllogistic problems, computing the average of three consecutive accuracies at each time step. The resulting graph, which also contains regression lines (solid colored) and the MFA performance bound (dashed black), thus visualizes the progression of performance achieved by the different adaptation strategies. Note that the steep drop at the beginning of the experiment is due to an initial training phase in which the first four syllogisms were deliberately set to be easy problems in order to allow participants to familiarize themselves with the task (see Ragni et al., 2019). However, because these tasks do not reflect regular participant and, by extension, model prediction performance, the first four predictions were excluded from the regression.
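The smoothing used for Fig. 3 can be sketched as a simple moving average over the per-task accuracies:

```python
def moving_average(accuracies, window=3):
    """Slide a window of the given size across a sequence of per-task
    accuracies, averaging each group of consecutive values."""
    return [sum(accuracies[i:i + window]) / window
            for i in range(len(accuracies) - window + 1)]
```

For a sequence of 64 accuracies and a window of size 3, this yields 62 smoothed values.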
The upward trends for all models show that the different adaptation approaches are successful, albeit to different degrees. While the improvements for the MLP and autoencoder appear similar, the RNN does not benefit to a similar degree from the additional information available for predicting later tasks. Again, this could be related to the RNN only being able to generalize to a limited degree, which, in turn, could prevent it from leveraging the full potential of the sequential data.
Assessing the progression with respect to the MFA leads to an intriguing observation. Initially, both the MLP and the autoencoder perform worse than the MFA baseline model. Only after the 11th (MLP) and 15th problem (autoencoder) are they able to outperform the MFA and thereby perform at levels unreachable by aggregate models. By continuing this upward trend in performance throughout the experimental sequence, the average performances ultimately reach the high accuracies depicted in Fig. 1. This result is crucial to note because it highlights an additional aspect of adaptation: It is not enough to integrate general adaptation capabilities into models. If the unadapted base performance and the adaptation improvement rate are low, the effects of adaptation might not necessarily lead to improvements over unadapted models. For researchers, this implies that behavioral factors need to be identified which allow models to quickly obtain information about an individual's reasoning strategies.
In sum, the results show that a current upper bound in performance can be located at a predictive accuracy of roughly 50%. The fact that cognitive models fall significantly short of this, with a maximum of 35%, highlights the potential remaining in the field. Even if the current focus on aggregate evaluation of models is continued, models should be able to arrive at MFA's performance (44%). The network models demonstrate that by integrating assumptions about individuals, even higher predictive accuracies can be achieved. However, even the data-driven neural networks stagnate shortly after surpassing MFA. While this could be due to technicalities (e.g., network topologies or optimization methods), it could also indicate that the purely response-focused data are approaching an upper bound of predictability.

General discussion
We introduced a predictive modeling task to shift the focus of cognitive model evaluation from relative model selection to a form of model assessment based on absolute performance, that is, predictive accuracy. In the demonstrative domain of syllogistic reasoning (our methods are by no means restricted to this one domain), we illustrated that the current state of the art exhibits shortcomings with respect to the quality of model predictions. Without the intention of uncovering individual flaws of specific models, our analysis showed that at most 34% of our data could be successfully predicted by cognitive models. Especially when compared to baseline strategies such as responding with the most frequently chosen answer in the training dataset (MFA), which manages to achieve an accuracy of 44%, this performance is worrisome. For application in real-world scenarios such as human-agent interaction, syllogistic models are far from being ready for deployment. Even if these theories are generally able to account for core phenomena and statistical effects of syllogistic reasoning, they are of limited use if their assumptions cannot be generalized to useful predictions.
The lingering question is how much potential is left in the domain for future cognitive models to tap into. We introduced a set of neural network models focusing on different properties of the data. Since neural networks are known to be highly capable function approximators, we expected them to provide an upper bound on the performance future generations of cognitive models should be expected to achieve. The definition of such upper bounds allows us to identify prediction-based shortcomings from an unbiased point of view, that is, without adopting the perspective of a particular cognitive theory. In this spirit, the MFA allows us to analyze model performance in comparison with a theoretically optimal aggregate model, that is, one that does not consider interindividual differences in its predictions. Similarly, neural networks provide a data-driven upper bound, that is, one that allows us to compare models with an approach that (in theory) is capable of optimally leveraging the information contained in the data. Our results show that the networks were able to significantly outperform the cognitive models, reaching predictive accuracies of up to almost 50% for the adaptive MLP, the overall best predictor. Two of the networks, the MLP and the autoencoder, were able to leverage information about an individual's reasoning processes to a point that allowed them to surpass the MFA. Finding optimal ways to integrate these interindividual differences into models of cognition is key to achieving high accuracies. The discussion about which features allow for interindividual differentiation has already begun (Bara et al., 1995; Stenning & Cox, 2006; Süß et al., 2002) and should become a central focus of future research in cognitive modeling. For now, though, cognitive models do not explicitly integrate interindividual differences into their prediction generation processes. In a future of cognitive modeling where the data and corresponding models actively focus on accounting for interindividual differences, the data-driven analyses can be adjusted and reconducted to investigate new upper bounds.
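The adaptive scheme behind the MLP can be sketched in a few lines. The toy model below uses a single softmax output layer as a stand-in for the full network; the dimensions (a 12-dimensional task encoding, 9 response options), learning rate, and number of adaptation steps are our illustrative assumptions, not the authors' actual architecture. The model starts from an aggregate initialization and then continues training on one individual's observed responses:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sgd_step(W, x, y, lr=0.1):
    """One gradient step of multinomial logistic regression
    (cross-entropy loss) -- a stand-in for the MLP's output layer."""
    p = softmax(x @ W)
    return W - lr * np.outer(x, p - y)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(12, 9))   # aggregate initialization (toy)
# ... pretraining on aggregate data would go here ...
x = np.zeros(12); x[0] = x[4] = x[8] = 1  # one encoded task
y = np.zeros(9); y[0] = 1                 # the individual's observed response
for _ in range(50):                       # adaptation: continue training
    W = sgd_step(W, x, y)
pred = int(np.argmax(softmax(x @ W)))      # model now favors that response
```

The point of the sketch is the last loop: individualization here is nothing more than additional gradient steps on the single reasoner's data, which is exactly the "training continuation" idea discussed below.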
In conclusion, our work illustrated that cognitive models for syllogistic reasoning still have potential for improvement. Currently, the state of the art is unable to adequately reflect the processes underlying human syllogistic reasoning. However, even if models manage to improve, without adjusting the modeling task to focus on individual responses, they will remain stuck at the level of the MFA. The network models demonstrate that trivial individualization in the form of training continuation (MLP) is technically successful but does not lead to substantial improvements over the MFA. Rather, future models and cognitive theories should integrate interindividual differences into their core mechanics to give rise to the next generation of cognitive models exhibiting properties useful for research (explainability) and application (predictive accuracy) alike. As a potential first step, this goal could be approached via an analysis of model predictions on the level of individual responses. Comparing patterns of human responses with corresponding model predictions was out of scope for this article but could help shed light on the particular shortcomings of current models or even of the identified upper bounds.
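Such a response-level analysis could start from a simple per-task agreement score. The sketch below assumes a hypothetical data layout of parallel lists of predicted and observed responses per task:

```python
def per_task_agreement(predictions, responses):
    """For each task, the fraction of individuals whose given response
    matches the model's prediction (hypothetical layout: dicts mapping
    task identifiers to parallel lists of responses)."""
    return {task: sum(p == r for p, r in zip(predictions[task], responses[task]))
                  / len(responses[task])
            for task in responses}

# Toy example with made-up task/response labels:
preds = {"AA1": ["Aac", "Aac", "Aac"]}
given = {"AA1": ["Aac", "Iac", "Aac"]}
scores = per_task_agreement(preds, given)
```

Breaking accuracy down by task in this way would reveal whether a model fails uniformly or only on specific problems, which aggregate accuracy scores cannot show.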
We strongly feel that the discussed shortcomings originate from a prevailing focus on relative model evaluation and selection as well as statistical analyses, and that they are not limited to the domain of syllogistic reasoning but could generalize to other domains of cognitive modeling. As such, evaluations in terms of absolute performance scores such as predictive accuracies should be added to the toolbox of modelers. In combination with comparisons to noncognitive, data-driven approaches used to determine upper bounds of performance, they allow for painting a more comprehensive picture of the capabilities of individual models.

Notes

1. https://github.com/CognitiveComputationLab/ccobra
2. https://github.com/nriesterer/iccm-neural-bound
3. E.g., "All A are B; All B are C" is (1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0), "Some B are A; Some B are not C" is (0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0).
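The 12-dimensional encoding from Note 3 concatenates one-hot vectors for the two premise quantifiers and the figure (term arrangement). The sketch below reproduces the two examples from the note; the exact orderings of quantifiers and figures are our assumptions, chosen to be consistent with those examples:

```python
def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec

# Assumed quantifier ordering, consistent with the examples in Note 3:
QUANTIFIERS = ["All", "Some", "No", "Some not"]

def encode_syllogism(q1, q2, figure):
    """Concatenate one-hot encodings of the two premise quantifiers
    and the figure (1-4), yielding a 12-dimensional vector."""
    return (one_hot(QUANTIFIERS.index(q1), 4)
            + one_hot(QUANTIFIERS.index(q2), 4)
            + one_hot(figure - 1, 4))

# "All A are B; All B are C" (figure index 1 under our assumed ordering):
print(encode_syllogism("All", "All", 1))
# -> [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
```

Under the same assumptions, "Some B are A; Some B are not C" corresponds to `encode_syllogism("Some", "Some not", 3)` and reproduces the second vector from the note.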

Open Research badges
This article has earned Open Data and Open Materials badges. Data and materials are available at https://github.com/nriesterer/iccm-neural-bound.

Fig. 1. Predictive performance of the models for human syllogistic reasoning. Cognitive models are depicted in blue, baseline models in orange, and neural networks in green. The top plot depicts the total proportion of model prediction accuracy, with error bars indicating 95% confidence intervals. The bottom plot visualizes the variation in prediction accuracy between predicted individuals. Whiskers stretch along 1.5 times the interquartile range.

Fig. 2. Training progression of the recurrent neural network (RNN), the autoencoder, and the multilayer perceptron (MLP). The top plots depict the progression of the raw loss metric used for network optimization. The bottom plots represent the progression of prediction accuracy on training and test data.

Fig. 3. Moving-average visualization of the prediction accuracies for the data-driven models. The values (dashed colored lines) were computed by sliding a window of size 3 along the sequence of 64 syllogistic problems and computing averages. Regression lines for the obtained values are depicted as solid colored lines. The dashed black line denotes the prediction accuracy obtained by the most-frequent-answer (MFA) strategy.
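The moving-average computation described in the caption amounts to a sliding-window mean; a generic sketch (not the authors' plotting code):

```python
def moving_average(values, window=3):
    """Mean over a sliding window, as used for the dashed lines in Fig. 3
    (window size 3 over the sequence of 64 syllogistic problems)."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

print(moving_average([1, 2, 3, 4, 5]))  # -> [2.0, 3.0, 4.0]
```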