Event knowledge in large language models: the gap between the impossible and the unlikely

Word co-occurrence patterns in language corpora contain a surprising amount of conceptual knowledge. Large language models (LLMs), trained to predict words in context, leverage these patterns to achieve impressive performance on diverse semantic tasks requiring world knowledge. An important but understudied question about LLMs' semantic abilities is whether they acquire generalized knowledge of common events. Here, we test whether five pre-trained LLMs (from 2018's BERT to 2023's MPT) assign higher likelihood to plausible descriptions of agent-patient interactions than to minimally different implausible versions of the same event. Using three curated sets of minimal sentence pairs (total n=1,215), we found that pre-trained LLMs possess substantial event knowledge, outperforming other distributional language models. In particular, they almost always assign higher likelihood to possible vs. impossible events (The teacher bought the laptop vs. The laptop bought the teacher). However, LLMs show less consistent preferences for likely vs. unlikely events (The nanny tutored the boy vs. The boy tutored the nanny). In follow-up analyses, we show that (i) LLM scores are driven by both plausibility and surface-level sentence features, (ii) LLM scores generalize well across syntactic variants (active vs. passive constructions) but less well across semantic variants (synonymous sentences), (iii) some LLM errors mirror human judgment ambiguity, and (iv) sentence plausibility serves as an organizing dimension in internal LLM representations. Overall, our results show that important aspects of event knowledge naturally emerge from distributional linguistic patterns, but also highlight a gap between representations of possible/impossible and likely/unlikely events.


Introduction
A vital component of human intelligence is our ability to learn, store, and flexibly use rich, structured knowledge about the world. World knowledge spans different domains (from physical properties to social conventions) and covers different types of information, including knowledge of objects, agents, actions, and ideas. One important component of world knowledge is our generalized event knowledge (GEK): templates of common events observed in the world (e.g., McRae & Matsuki, 2009). We acquire GEK both through sensorimotor experiences (i.e., from performing and observing events in the world) and through linguistic experiences (i.e., from event descriptions generated by other people). The close link between event knowledge and language behavior (e.g., Bicknell et al., 2010; Federmeier & Kutas, 1999; Kamide et al., 2003; Matsuki et al., 2011; McRae & Matsuki, 2009) raises the question of the extent to which GEK can be learned from linguistic input alone, as a consequence of acquiring rich statistical knowledge of word co-occurrence patterns in text.
Large language models (LLMs) allow us to test the possibility that GEK can emerge naturally from tracking co-occurrence patterns in linguistic input. State-of-the-art LLMs, trained to predict words based on their context, have achieved remarkable success across a variety of tasks, such as generating syntactically and semantically coherent paragraphs of text (Brown et al., 2020), sentiment analysis and logical inference (e.g., Devlin et al., 2018; Liu et al., 2019; Radford et al., 2019; Yang et al., 2019), closed-book QA (Roberts et al., 2020), and certain aspects of commonsense reasoning (Talmor et al., 2020; Zellers et al., 2018).
Studies of world knowledge in LLMs so far have produced mixed results. On one hand, LLMs perform well on multiple linguistic tasks designed to probe world knowledge, such as the Winograd Schema Challenge (WNLI; Levesque et al., 2012), the Story Cloze Test (SWAG; Zellers et al., 2018), and the Choice of Plausible Alternatives Test (COPA; Roemmele et al., 2011), so much so that some authors have proposed and evaluated their use as off-the-shelf knowledge base models (Kassner et al., 2021; Petroni et al., 2019; Roberts et al., 2020; Tamborrino et al., 2020). Moreover, co-occurrence patterns learned from language and from other domains (such as vision) exhibit a remarkable degree of correspondence (Abdou et al., 2021; Lewis et al., 2019; Patel & Pavlick, 2021; Roads & Love, 2020; Sorscher et al., 2021), suggesting that language might be able to replace other modalities as a source of world knowledge, consistent with the Symbol Interdependency Hypothesis (Louwerse, 2011). On the other hand, studies using more fine-grained tests have shown that world knowledge in contemporary LLMs is often brittle and depends strongly on the specific way the problem is stated (Elazar et al., 2021a; Ettinger, 2020; Kassner & Schütze, 2020; McCoy et al., 2019; Niven & Kao, 2019; Pedinotti et al., 2021; Ravichander et al., 2020; Ribeiro et al., 2020). For example, some authors have noted that, when low-level co-occurrence statistics are properly controlled for, LLMs that were considered to have high accuracy on world knowledge tasks start to perform randomly (Elazar et al., 2021b; Sakaguchi et al., 2021), highlighting the potential discrepancy between the word-in-context prediction objective (which benefits from tracking surface-level statistics) and world knowledge acquisition (which should be invariant to surface-level statistics).
In this work, we test whether prediction-based LLMs encode human-like generalized world knowledge in the domain of events. To minimize the effect of confounding factors, we use highly curated, syntactically simple minimal sentence pairs. In two datasets, Datasets 1 and 3 (see Methods for details), plausibility within a sentence pair is manipulated by swapping the agent and patient of the sentence (e.g., The teacher bought the laptop vs. The laptop bought the teacher). This manipulation ensures identical word-level content within a sentence pair, such that the plausibility inference requires identifying the role played by each participant (e.g., teacher = agent, laptop = patient). In Dataset 2, plausibility is manipulated by replacing the event patient (e.g., The actor won the award/battle). The three datasets were selected to span event descriptions across a range of event participant compositions (interactions between two animate participants or between one animate and one inanimate participant) as well as varying degrees of semantic incongruence of the manipulated sentence (ranging from impossible to moderately implausible events). We focus on our largest dataset (Dataset 1, see Methods) for most analyses but show in the SI that the findings extend to the other datasets as well.
In Sections 3.1 and 3.2, we ask whether LLMs assign higher likelihood scores to descriptions of plausible events compared to their implausible counterparts. In Sections 3.3 and 3.4, we investigate the degree to which these scores are generalized, i.e., abstracted away from the surface-level properties of the input. Finally, we conduct detailed analyses of LLM performance by studying their error patterns (Section 3.5) and probing their internal representations of event plausibility (Section 3.6).
We hypothesize that, if general event knowledge emerges naturally from the word-in-context prediction objective, LLMs should be more likely to generate plausible sentences than implausible sentences. Furthermore, plausibility judgments should generalize across sentence surface form. If, on the other hand, LLMs fail to acquire robust event knowledge, they would fail to systematically generate event descriptions that align with GEK.
To foreshadow our key result, we find that language models perform well when distinguishing events that are possible (e.g., The teacher bought the laptop) from events that are, in the absence of contextual information, impossible (e.g., The laptop bought the teacher). However, LLMs fall short of human performance when distinguishing events that are likely (e.g., The nanny tutored the boy) from events that are unlikely but not impossible (e.g., The boy tutored the nanny). Thus, we uncover a major factor underlying the difference between sentence generation patterns in contemporary LLMs and knowledge of plausible event schemas.

Sentence sets
We compare event plausibility scores in humans and language models using three sentence sets adapted from previous cognitive science and neuroscience studies (see Tables 1 and 2 for a summary).

Dataset 1, main (based on Fedorenko et al., 2020). This sentence set contains 391 items, each of which includes (i) a plausible active sentence that describes a transitive event in the past tense (e.g., The teacher bought the laptop) and (ii) the implausible version of the same sentence, constructed by swapping the noun phrases (NPs) (The laptop bought the teacher). The dataset also includes passive voice versions of the same sentences (The laptop was bought by the teacher and The teacher was bought by the laptop). Further, 249 of the 391 items are grouped into pairs with synonymous meanings (e.g., The teacher bought the laptop and The instructor purchased the computer).
The items are split into two types: (1) animate-inanimate (AI) items (e.g., The teacher bought the laptop vs. The laptop bought the teacher; n=128; 76 with synonyms); (2) animate-animate (AA) items (e.g., The nanny tutored the boy vs. The boy tutored the nanny; n=129; 82 with synonyms). Due to the animacy differences, the role reversal manipulation on AI sentences often violates the animacy selectional restrictions on the verb, making the sentence mostly semantically impossible, whereas the plausibility violations in AA sentences are more graded. Finally, the dataset includes a set of animate-animate, reversible (AA-control) items (n=134; 78 with synonyms), where both event participants are animate and both agent-patient combinations are plausible (e.g., The cheerleader kissed the quarterback vs. The quarterback kissed the cheerleader), which we used as a control in some of the analyses.

Dataset 2 (based on Vassallo et al., 2018). This sentence set contains 395 items, each of which includes (i) a plausible active sentence that describes a transitive event in the past tense, where the animate agent entity is interacting with an inanimate patient entity that is prototypical/canonical for the agent (e.g., The actor won the award), and (ii) the less plausible version of the same sentence, constructed by varying the inanimate patient entity (The actor won the battle). All sentence pairs in this dataset describe interactions between an animate agent and an inanimate patient, making them most comparable to the AI sentence pairs from Dataset 1. However, unlike in Dataset 1, word content and not word order distinguishes between plausible and implausible sentences within a pair. Note further that the plausibility manipulation in this sentence set is graded: the events can be described as typical/atypical rather than possible/impossible.

Dataset 3 (based on Ivanova et al., 2021). This sentence set contains 38 items, each of which includes (i) a plausible active sentence that describes a transitive event in the present tense (e.g., The cop is arresting the criminal), and (ii) the implausible version of the same sentence, constructed by swapping the NPs (The criminal is arresting the cop). All sentence pairs in this dataset describe non-reversible interactions between two animate entities, making them comparable to the AA sentence pairs from Dataset 1. As in Dataset 1, only word order but not word content distinguishes between plausible and implausible sentences within a pair.

Table 2 (excerpt). Example sentence pairs: Dataset 2 (Vassallo et al., 2018): The actor won the award. / The actor won the battle. Dataset 3 (Ivanova et al., 2021): The cop is arresting the criminal. / The criminal is arresting the cop.

Human data collection
For all three sentence sets, we compared language model predictions with human plausibility judgments. Human judgments for Dataset 2 had been previously collected by Vassallo et al. (2018) on Prolific, a web-based platform for collecting behavioral data. Participants in this experiment answered questions of the form "How common is it for an actor to win an award?" on a Likert scale from 1 (very atypical) to 7 (very typical). Human judgments for Datasets 1 and 3 were collected on Amazon Mechanical Turk, another web-based platform. Here, participants evaluated the extent to which each sentence was "plausible, i.e., likely to occur in the real world" on a Likert scale from 1 (completely implausible) to 7 (completely plausible). The protocol for the study was approved by MIT's Committee on the Use of Humans as Experimental Subjects (COUHES). All participants gave written informed consent in accordance with protocol requirements.
For Dataset 1 (our main dataset), we recruited 966 participants, restricting our task to participants with IP addresses in the US. The sentences were divided into 32 experimental lists such that each of the items occurred only in one of its versions in any given list. The median response time was 20.6 min. Each participant completed between 1 and 3 lists (mean=1.1).
Participants were included in the analyses if they satisfied all the following criteria: i) self-reported location ("USA"), ii) native English proficiency (evaluated via self-report and two sentence completion trials), iii) fewer than 20% of blank responses, and iv) accurate responses to attention checks ("Please select the leftmost/rightmost option"). We additionally filtered participants based on their responses to the AI items (The teacher bought the laptop vs. The laptop bought the teacher), retaining participants with a minimum plausibility difference of 1 point (out of 7) between plausible and implausible items in this condition. These criteria left data from 658 participants for analysis. Each sentence had a minimum of 18 ratings (average: 22.9 ratings; maximum: 27 ratings). Participants were paid $4.25 (estimated completion time was 25 min), with payment contingent only on the attention-check questions and excessive blank responses (>30%).
For Dataset 3, we recruited 100 participants, restricting our task to participants with IP addresses in the US. The sentences were divided into 2 experimental lists and each of the items occurred only in one of its versions in any given list. The median response time was 15.7 min. Each participant completed 1 list. We filtered the data using the same criteria as for Dataset 1, except for the sentence completion trials for assessing English proficiency (which were not included) and the minimum plausibility difference criterion. The inclusion/exclusion criteria left data from 96 participants for analysis (48 ratings per sentence). Participants were paid $2.70, with payment contingent only on the attention-check questions and excessive blank responses (>30%).

Language model sentence scores
For the unidirectional LLMs, we define the sentence score as the sum of the log-probabilities of each token in the sequence, conditioned on the preceding sentence tokens: $\log P(s) = \sum_{t=1}^{|s|} \log P(w_t \mid w_1, \ldots, w_{t-1})$, where $w_1, \ldots, w_{|s|}$ are the tokens of sentence $s$.
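To make the scoring procedure concrete, here is a minimal sketch of the unidirectional score using the HuggingFace transformers library, with GPT-2 as an example model; it illustrates the metric and is not our exact scoring pipeline (see the GitHub repository linked under Statistical analyses).

```python
# Illustrative only: summed token log-probabilities under a unidirectional
# LLM (here GPT-2 via HuggingFace transformers). Note that the first token
# receives no left context and is therefore not scored in this sketch.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_score(sentence: str) -> float:
    """Sum of log P(w_t | w_1..w_{t-1}) over the sentence tokens."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # next-token distributions
    targets = ids[:, 1:]
    return log_probs.gather(2, targets.unsqueeze(-1)).sum().item()

# Minimal-pair comparison, as in the main analysis (typically True):
print(sentence_score("The teacher bought the laptop.") >
      sentence_score("The laptop bought the teacher."))
```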

For the bidirectional LLMs, we use a modified version of the sentence's pseudo-log-likelihood under the model (PLL; Salazar et al., 2020; A. Wang & Cho, 2019), which defines the sentence score as the sum of the log-probabilities of each token given all other tokens: $\mathrm{PLL}(s) = \sum_{t=1}^{|s|} \log P(w_t \mid s_{\setminus t})$, where $s_{\setminus t}$ denotes the sentence with token $w_t$ masked (see Figures S10 and S11 for evidence that sentence generation likelihood is a more robust indicator of event knowledge in bidirectional LLMs than other prediction-based metrics, such as last-word prediction probability or verb prediction probability, for our datasets). To avoid biasing the scores in favor of multi-token lexical items, we modify the original procedure to additionally mask tokens within multi-token words if they are located to the right of the target (see SI 7 for details and justification; and Figure S12 for supporting results).
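A corresponding sketch for the bidirectional models computes the basic PLL by masking one position at a time (shown here with bert-base-cased as an assumed checkpoint); the within-word masking adjustment described above is omitted for brevity.

```python
# Illustrative only: pseudo-log-likelihood (PLL; Salazar et al., 2020) for a
# masked LM. Each token is masked in turn and scored given all other tokens.
# The adjustment for multi-token words (see SI 7) is not implemented here.
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased").eval()

def pll_score(sentence: str) -> float:
    """Sum of log P(w_t | all other tokens) over the sentence tokens."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    for pos in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        total += torch.log_softmax(logits, dim=-1)[ids[pos]].item()
    return total
```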

Baseline models
To investigate whether knowledge of event plausibility depends on specific linguistic patterns, we additionally compared the performance of the LLMs against four baseline models. This comparison allows us to evaluate the added value of LLMs in comparison to more "traditional" but less complex distributional semantics models, typically trained on a much smaller amount of data (Lenci & Sahlgren, in press).
TinyLSTM is a two-layer LSTM recurrent neural network trained with a next-word prediction objective on the string data from the 1-million-word English Penn Treebank §2-21 (Marcus et al., 1993). As for the unidirectional LLMs, a sentence score for TinyLSTM is estimated as the sum of negative log probabilities of each token conditioned on the preceding tokens. The model is available through the LM Zoo library (Gauthier et al., 2020).
Thematic fit models the degree of semantic compatibility between an event's "prototype" verb argument, calculated from distributional text information (McRae et al., 1998), and the role filler proposed by the sentence. We follow Lenci's (2011) approach for calculating prototypical argument representations and compute a prototype representation for the event patient slot as the centroid of the vector representations of the entities most strongly associated with the predicate and the agent in the sentence. However, instead of computing updates to the prototype using Distributional Memory vectors (as in Lenci, 2011), we here perform the same computations using FastText (Bojanowski et al., 2017) static embeddings (see also Rambelli et al., 2020). A sentence's plausibility score is computed as the cosine similarity between the FastText embedding of the proposed patient and the relevant prototype vector.
The Structured Distributional Model (SDM; Chersoni et al., 2019) is a model of thematic fit that computes both a context-independent and a context-dependent representation of the prototypical role filler based on the current linguistic context. The context-independent representation is obtained by summing the FastText embeddings of all lexical items in the current linguistic context. The context-dependent representation is derived from a dynamic representation of the context: given the lexical items in the current context and the syntactic function of the next word to be predicted, SDM queries a distributional event graph (DEG) to retrieve the words with the strongest statistical associations with those items for the target function. (The DEG was extracted from a large number of dependency-parsed corpora: words are linked with their syntactic collocates, and the links are weighted with mutual information scores.) SDM then computes the centroid of the FastText embeddings associated with the highest-ranked lexical entities according to the DEG. Finally, a sentence's plausibility score is calculated as the sum of the SDM thematic fit scores for each verb argument (in our case: agent and patient), where each score is derived as the average cosine similarity of the argument filler's representation with the context-dependent and context-independent prototype representations of the role.
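The SDM scoring logic can be sketched as follows. The DEG retrieval step is stubbed out as a hypothetical function (deg_top_associates), and the ranking and weighting details only approximate Chersoni et al.'s (2019) implementation.

```python
# Illustrative only: SDM-style thematic fit scoring. `deg_top_associates` is
# a hypothetical stand-in for the distributional event graph query; `fasttext`
# maps words to embedding vectors.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sdm_fit(filler, context_words, role, fasttext, deg_top_associates, k=20):
    """Average similarity of the filler to the context-independent and
    context-dependent prototypes for the given syntactic role."""
    # Context-independent prototype: sum of context word embeddings.
    proto_ci = np.sum([fasttext[w] for w in context_words], axis=0)
    # Context-dependent prototype: centroid of the k words most strongly
    # associated (in the DEG) with the context words for the target role.
    associates = deg_top_associates(context_words, role, k)
    proto_cd = np.mean([fasttext[w] for w in associates], axis=0)
    v = fasttext[filler]
    return (cosine(v, proto_ci) + cosine(v, proto_cd)) / 2

def sdm_sentence_score(agent, verb, patient, fasttext, deg_top_associates):
    # Sentence plausibility: sum of thematic fit scores for agent and patient.
    return (sdm_fit(agent, [verb, patient], "subj", fasttext, deg_top_associates)
            + sdm_fit(patient, [verb, agent], "obj", fasttext, deg_top_associates))
```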
Lastly, the PPMI-syntax model quantifies the statistical association between verbs and their dependents (marked for syntactic role, i.e., PPMI(arrest, cop_subj) ≠ PPMI(arrest, cop_obj)) in terms of Positive Pointwise Mutual Information (PPMI). It is trained on the same dependency-parsed corpus as SDM. We apply Laplace smoothing and compute the plausibility score of a sentence as the PPMI score between the verb and the subject plus the PPMI score between the verb and the object.
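As an illustration, the PPMI score over role-marked (verb, dependent) counts might be computed as below; the count tables are placeholders to be populated from the dependency-parsed corpus, and the add-one scheme shown is one common implementation of Laplace smoothing, standing in for the exact configuration used.

```python
# Illustrative only: PPMI between a verb and a role-marked dependent,
# computed from co-occurrence counts with add-one (Laplace) smoothing.
# `pair_counts`, `verb_counts`, and `dep_counts` are placeholder tables.
import math
from collections import Counter

pair_counts = Counter()  # e.g., pair_counts[("arrest", "cop:subj")] = 42
verb_counts = Counter()
dep_counts = Counter()
total_pairs = 1          # total (verb, dependent) tokens in the corpus

def ppmi(verb: str, dep: str, alpha: float = 1.0) -> float:
    n_verbs = max(len(verb_counts), 1)
    n_deps = max(len(dep_counts), 1)
    p_joint = (pair_counts[(verb, dep)] + alpha) / (total_pairs + alpha * n_verbs * n_deps)
    p_verb = (verb_counts[verb] + alpha) / (total_pairs + alpha * n_verbs)
    p_dep = (dep_counts[dep] + alpha) / (total_pairs + alpha * n_deps)
    return max(0.0, math.log(p_joint / (p_verb * p_dep)))

def sentence_score(verb: str, subj: str, obj: str) -> float:
    # Plausibility: PPMI(verb, subject) + PPMI(verb, object).
    return ppmi(verb, f"{subj}:subj") + ppmi(verb, f"{obj}:obj")
```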
See SI 2 for additional baseline model description details.

Word frequency estimation
To account for potential effects of word frequency, we estimated the average frequency of the word/phrase denoting the agent, patient, and verb of each sentence, as well as the average frequency of all words in the sentences. Frequency was operationalized as the log of the number of occurrences of the word/phrase in the 2012 Google NGram corpus. Laplace smoothing was applied prior to taking the log.
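In code, this operationalization amounts to the following sketch, where ngram_counts stands in for a lookup table of raw counts from the 2012 Google Ngram corpus:

```python
# Illustrative only: log frequency with Laplace (add-one) smoothing.
# `ngram_counts` is a placeholder mapping from word/phrase to its raw
# occurrence count in the 2012 Google Ngram corpus.
import math

def log_frequency(phrase: str, ngram_counts: dict) -> float:
    return math.log(ngram_counts.get(phrase, 0) + 1)

def mean_log_frequency(words: list[str], ngram_counts: dict) -> float:
    return sum(log_frequency(w, ngram_counts) for w in words) / len(words)
```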

Probing analysis
To investigate the emergence of explicit plausibility information in LLMs, we trained a decoding probe to distinguish plausible and implausible sentences from their embeddings at different LLM layers. Separate logistic regression classifiers were trained for each model layer and for each model's static word embedding space. For each sentence, the input was the model-specific sequence summary token; the output was a binary plausibility label. The choice of model-specific sequence summary tokens followed the default settings from HuggingFace transformers: for the bidirectional LLMs, BERT and RoBERTa, we used the representation of the special token [CLS], which was prepended to each stimulus and was designed and trained specifically for sequence classification tasks. For the unidirectional LLMs, GPT-J and GPT-2, we prepared the stimulus by adding the [EOS] token to the beginning and end of the sequence and used the representation of the final token as the sequence's summary representation. For all analyses, probes were trained using 10-fold cross-validation, ensuring that plausible and implausible versions of the same sentence remain in the same split (train or test). To estimate the best-case model performance, we computed empirical ceiling values by training probes on the average human plausibility ratings for each sentence. The probe setup and the cross-validation procedure for ceiling probes were the same as for LLM probes.
To probe the generalization ability of the LLMs, we trained the classifiers on just one type of sentence (either on specific animacy combinations, AI or AA, or specific voice, active or passive) and evaluated the performance on the held-out type.
We used sklearn's (Pedregosa et al., 2011) Logistic Regression module with a liblinear solver for all probing analyses.
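Putting the pieces together, the probing setup can be sketched as follows; X, y, and item_ids are placeholders for the extracted summary-token embeddings, plausibility labels, and sentence-pair identifiers, and GroupKFold is used here as one way to keep both versions of an item in the same fold.

```python
# Illustrative only: layer-wise plausibility probe with 10-fold
# cross-validation that keeps plausible/implausible versions of the same
# item in the same fold. X, y, and item_ids are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold

def probe_accuracy(X: np.ndarray, y: np.ndarray, item_ids: np.ndarray) -> float:
    """X: (n_sentences, hidden_dim) summary-token embeddings for one layer;
    y: binary plausibility labels; item_ids: sentence-pair identifiers."""
    accuracies = []
    for train, test in GroupKFold(n_splits=10).split(X, y, groups=item_ids):
        clf = LogisticRegression(solver="liblinear").fit(X[train], y[train])
        accuracies.append(clf.score(X[test], y[test]))
    return float(np.mean(accuracies))
```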

Statistical analyses
Binary accuracy. Binary accuracy results were compared to chance performance of 0.5 using a binomial test. Tests of equal proportion were used to compare model performance to human performance, as well as AI sentence accuracy to AA sentence accuracy within each metric.
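For concreteness, the accuracy computation and the binomial test against chance can be sketched as follows (using scipy; the score arrays are placeholders for paired model or human scores):

```python
# Illustrative only: minimal-pair accuracy and a binomial test against
# chance (0.5). `plausible_scores` and `implausible_scores` are paired
# arrays of scores for the two versions of each item.
import numpy as np
from scipy.stats import binomtest

def pair_accuracy(plausible_scores, implausible_scores):
    correct = np.asarray(plausible_scores) > np.asarray(implausible_scores)
    test = binomtest(int(correct.sum()), n=len(correct), p=0.5)
    return correct.mean(), test.pvalue
```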
Correlations. All reported correlations are Pearson correlations. Correlation significance was assessed using the test for correlation for paired samples (cor.test in R). Model correlation was compared to human correlation using the cocor package's (Diedenhofen & Musch, 2015) implementation of Raghunathan et al.'s (1996) test for nonoverlapping correlations based on dependent groups.

Mixed effects modeling.
We fitted separate linear mixed effects models to human ratings and to each language model's scores. The key predictors for Dataset 1 were plausibility, item type (AI vs. AA vs. AA-control), and voice (active vs. passive), as well as interactions between them. We also included agent, patient, verb, and average sentence frequencies, as well as sentence length in tokens (for LLMs) or words (for humans and baseline models). Random effects included the item number intercept and the item number by plausibility slope. For Datasets 2 and 3, the formula was simplified to account for dataset structure (i.e., no item type or voice predictors).
Continuous variables were normalized before fitting. We used dummy coding for plausibility, with "plausible" as the reference level, dummy coding for item type, with "AA" as the reference level, and sum coding for voice. The analysis was conducted using the lme4 R package (Bates et al., 2014).
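Since the analysis was run in R with lme4, the following Python sketch (via statsmodels) only mirrors the structure of the Dataset 1 model to make the specification explicit; the column names and input file are hypothetical.

```python
# Illustrative only: a statsmodels analogue of the lme4 specification for
# Dataset 1 (the actual analysis used lme4 in R). Column names and the
# input file are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("dataset1_scores.csv")  # hypothetical long-format table

formula = ("score ~ plausibility * item_type * voice"
           " + agent_freq + patient_freq + verb_freq + mean_freq + length")
model = smf.mixedlm(formula, df, groups=df["item"],
                    re_formula="~plausibility")  # random intercept + slope by item
result = model.fit()
print(result.summary())
```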

Probing analyses.
To compare the performance of probing classifiers across LLM layers, we divided LLM layers into three equal-sized groups: early, middle, and late. Within each layer group, we compared average probe performance to the ceiling value (probe trained on human ratings; see Section 2.4) and tested the linear trend within the group (i.e., whether classifier performance increases, decreases, or stays constant across that group's layers).
In all analyses, the results were FDR-corrected for the number of models within each category (humans, LLMs, and baselines). For probing analyses, the results were additionally corrected for the number of classifiers used within each analysis (e.g., 5 for generalization across trial types; 5 classifiers x 4 LLMs = 20 comparisons). Analysis code and data files can be found on GitHub: https://github.com/carina-kauf/lm-event-knowledge.
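The correction step itself is standard Benjamini-Hochberg FDR; for illustration (with placeholder p-values):

```python
# Illustrative only: Benjamini-Hochberg FDR correction across comparisons
# within a category. The p-values here are placeholders.
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.04, 0.20, 0.03]
reject, pvals_fdr, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
```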

Results
We report a variety of tests to establish whether prediction-based LLMs are sensitive to event plausibility. In our main test (Sections 3.1 and 3.2), we investigate whether LLMs systematically assign higher scores to the plausible sentence compared to the implausible sentence within the same minimal pair. We compare LLM performance with human performance (whether crowdsourced plausibility scores are higher for plausible than for implausible sentences within each pair) and with baseline model performance. Then we move beyond the minimal pair setup to conduct detailed analyses of all sentence scores, aiming to determine the relative contributions of event plausibility and surface-level properties to LLM sentence scores (Section 3.3). We assess whether the event knowledge learned by LLMs is generalized by testing model judgment robustness to descriptions of the same event using (i) a different syntactic structure and (ii) different lexical items (Section 3.4), conduct an error analysis of LLM performance (Section 3.5), and use a probing analysis to track the emergence of explicit event plausibility signatures across LLM layers, as well as test whether these signatures generalize across event types (Section 3.6).

LLM results reveal a gap between impossible and unlikely events
Our primary sentence set (Dataset 1) contains two types of plausible-implausible sentence pairs: AI (animate/inanimate actors, e.g., The teacher bought the laptop vs. The laptop bought the teacher) and AA (animate/animate actors, e.g., The nanny tutored the boy vs. The boy tutored the nanny).
In most cases, AI plausibility violations result in impossible events, whereas AA plausibility violations make the event unlikely but not impossible. We found that all language models exhibited differential performance on these sentence sets, with substantially better results for AI than for AA sentence pairs.
In the main analysis, we compared model scores for plausible and implausible sentences within the same minimal sentence pair. For each sentence pair, a model received a score of 1 if it assigned a higher score to the plausible version of the sentence and 0 otherwise. The same procedure was performed on human plausibility ratings for each sentence pair.
All models showed good performance on AI sentences (Figure 1A, left). RoBERTa scores were not significantly different from the human accuracy of 1, and the other LLMs also performed well, although slightly below humans. On AA sentences, all LLMs still performed above chance (Figure 1A, right), but their performance was significantly below the human accuracy of 0.95 (RoBERTa: 0.78, χ2=22.04; BERT: 0.77, χ2=24.56; GPT-J: 0.75, χ2=27.12; GPT-2: 0.74, χ2=29.73; all p<.001). All baseline models performed at chance except for thematicFit (accuracy 0.62), indicating that information about AA sentence plausibility is more difficult to extract from subject-verb-object co-occurrence patterns in natural language.
As shown in Table 3, similar to humans, LLMs and two of the baseline models show a performance gap between AI and AA sentence sets. However, the size of the gap for the models (average 0.19 for LLMs, 0.23 for baseline models) is much larger than the one in humans (0.05), a result we explore further in Section 4.5.
For completeness, we also test the models on a set of AA-control items from Dataset 1, for which both sentences in a pair describe a plausible event (e.g., The cheerleader kissed the quarterback vs. The quarterback kissed the cheerleader). As expected, in that case the models produced comparable scores for the two events within each pair (Figures S4, S5). In addition, LLMs and most baseline models show comparable performance on the passive voice versions of AI and AA sentences (Figure S6).
Finally, we directly correlate model scores with human ratings (Figure S7) and show that the correlation is only moderate for AI sentences (mean LLM r=.59, human inter-rater r=.86) and poor for AA sentences (mean LLM r=.19, human inter-rater r=.65). See Figure S1 for detailed analyses of the score distributions for the baseline models.

The gap in model performance between unlikely and impossible events is not fully explainable by animacy or lexical variables
The gap between model and human performance on AI and AA sentences from Dataset 1 could be explained by several factors. First, implausible AI sentences in Dataset 1 mostly described impossible events (The laptop bought the teacher), whereas implausible AA sentences were often unlikely rather than impossible (The boy tutored the nanny), which resulted in a wider distribution of plausibility scores (Figure 1B). Second, as follows from their name, AI sentences described animate-inanimate interactions, such that switching the agent and the patient typically violated the animacy selectional restriction on the verb; in contrast, AA sentences described animate-animate interactions, so our plausibility manipulation did not violate the animacy restriction. Finally, the AA sentences were more difficult overall (human accuracy 0.95 vs. 1 for AI sentences), possibly because AA sentences had a lower average word frequency (Google Ngram log frequency of 10.8 for AA vs. 11.1 for AI). To determine whether the latter two factors might explain differential model performance, we compared model and human performance on two additional sentence sets. For detailed results, see Figure S8.

Dataset 2 (based on Vassallo et al., 2018)
This sentence set describes animate-inanimate (AI) interactions; plausibility is manipulated by varying the object (e.g., The actor won the award vs. The actor won the battle; Table 2). Unlike AI sentences in Dataset 1, implausible sentences here are simply unlikely rather than impossible. This difference is reflected in the distribution of human judgments for this sentence set, which are less polarized than for AI sentences from Dataset 1 (mean difference 0.55; see Figure S8 for details). If actor animacy determines model performance, their accuracy on Dataset 2 should be similarly high to that for AI sentences from Dataset 1. If, on the other hand, unlikely events are more challenging for the models to evaluate compared to impossible events, then models should perform better on AI sentences from Dataset 1.

Dataset 3 (based on Ivanova et al., 2021)
This is a small sentence set from a neuroimaging study by Ivanova et al. (2021) with the same manipulation as in Dataset 1: implausible sentences are generated by switching the agent and the patient (The cop is arresting the criminal vs. The criminal is arresting the cop; Table 2). Both agents and patients are animate. However, average word frequency is higher than in Dataset 1 sentences (Google Ngram log frequency of 11.9), and human ratings are more polarized than those of AA sentences from Dataset 1 (mean difference = 0.76). If frequency is an important contributor to performance, model accuracy on Dataset 3 should be higher than that on AA sentences from Dataset 1.
Together, the results from Sections 3.2.1 and 3.2.2 suggest that although actor animacy and frequency contribute to model performance, they do not fully explain performance patterns. In particular, unlikely sentences (across animacy configurations) pose challenges for LLMs, in spite of being easy for humans.
In the remainder of the paper, we focus on LLM performance; detailed analyses of baseline model performance can be found in SI 3.
[Figure caption fragment: accuracy on Dataset 1 (cf. Figure 1) as well as Datasets 2 and 3 (the second and third sets of bars); results ordered by LLM performance. Dotted lines indicate chance-level performance.]

LLM scores are strongly influenced by surface-level sentence properties
In addition to comparing scores within minimal pairs, we examined the extent to which human and model scores depend on surface-level stimulus properties, such as syntactic structure (active vs. passive), word frequency, and sentence length. If the scores reflect general event knowledge, we should expect them to be primarily determined by sentence plausibility (i.e., meaning) and not by surface-level factors (i.e., form). As shown in Figure 1B, human score distributions for plausible and implausible sentences in Dataset 1 show little overlap (mean difference for AI sentences = 0.78, AA sentences = 0.38). In contrast, all language models show much more overlap between plausible and implausible score distributions (mean difference for LLMs: AI = 0.19, AA = 0.06; for baseline models: AI = 0.09, AA = 0.01), which suggests that their scores are determined predominantly by factors other than plausibility.

Switching the agent and the patient strongly influences human ratings but not LLM scores.
Our plausibility manipulation (switching the agent and patient in a sentence) was specifically designed to alter the plausibility of the described event while preserving the identities of individual words. If the scores primarily track event plausibility, we should observe a negative correlation between the scores for plausible and implausible sentence versions. If, however, the scores depend primarily on the word-level makeup of the sentence, the correlation between the scores for the two versions should be positive.
Human judgments show a negative correlation for plausible and implausible versions of the same AI sentence (r=-0.29, p<.001) and a non-significant correlation for AA sentences (r=-0.17, p=.06). In contrast, LLM scores show a strong positive correlation (ranging from 0.57 for RoBERTa on AI sentences to 0.94 for BERT on AA sentences; all significantly different from humans, p<.001 for χ2 comparison), indicating that LLM scores are largely driven by individual word features, rather than by assignment of event roles to their arguments.

Both plausibility and surface-level features predict LLM scores: mixed effects modeling.
To systematically test how different factors contribute to the final sentence score in humans and models, we fitted mixed effects models to scores from each model and to human scores (Table 4; see Methods for model and contrast definition). Note that, because we normalize the scores for each metric (humans and models), the resulting coefficients can be interpreted as effect sizes and are comparable across metrics.
As expected, human scores are primarily driven by the plausibility manipulations. Notably, the effect of AI vs. AA plausibility violation (-.37) is as strong as the implausibility effect for AA sentences (-.38). All LLMs are also sensitive to both plausibility effects; however, these effects are much weaker than the effects in humans, and the implausible AI>implausible AA effect (-.13) is larger than the implausible AA>plausible AA effect (-.06), consistent with the performance gap that we observed for AI and AA sentences.
In addition, models but not humans are sensitive to the main effects of surface-level sentence properties. Each LLM's performance on the critical task is affected by at least three of the following factors: voice, agent frequency, patient frequency, average word frequency, and sentence length, whereas human plausibility judgments are not affected by any of these features.
Finally, the AI implausibility effect in humans is modulated by some surface-level properties. Compared to AA sentences, humans are likely to assign more polarized scores to AI sentences presented in active voice than in passive voice (higher for plausible, lower for implausible). RoBERTa and GPT-2 capture this effect weakly, and BERT shows an effect in the opposite direction, penalizing passive implausible AI sentences more harshly. Thus, LLMs fail to capture the fine-grained effects of surface-level properties on human judgments.
Overall, the mixed-effects model analysis is consistent with other analyses, showing a performance gap between AA and AI sentences and highlighting that LLM scores are driven by surface-level properties in addition to sentence plausibility. However, the fact that all LLMs show significant effects of plausibility indicates that they do learn certain real-world event plausibility trends.

LLMs generalize well across syntactic sentence variants, but only partially across semantic sentence variants
Here, we evaluated the extent to which model scores exhibit invariance to the surface form of the sentence by manipulating sentence voice (active vs. passive) and testing sentences with synonymous meanings.

LLMs generalize across active and passive sentences.
To test invariance to sentence syntax, we calculated the Pearson correlations between the active and passive voice versions of the same sentence (The teacher bought the laptop vs. The laptop was bought by the teacher; Figure 3A). Human scores were highly correlated (r=0.96), indicating that human plausibility ratings are indeed invariant to sentence voice. LLM scores were also strongly correlated (max: BERT, r=.93; min: GPT-J/GPT-2, r=.79), indicating that LLMs can successfully generalize across active and passive voice forms of the same sentence (cf. Pedinotti et al., 2021), probably because the distributional signal can rely on the overt morphosyntactic marking of the voice change (cf. the agent-patient swap cases, as in AA active-voice sentences).

LLMs show some generalization across synonymous sentences.
To test invariance to individual word identity, we compared scores for sentence pairs where subject, verb, and object words were synonymous (The teacher bought the laptop vs. The instructor purchased the computer; Figure 3B). Human judgments were highly correlated across synonymous sentence pairs (r=.90), indicating that they are largely invariant to specific word identity. LLMs showed some generalization (max: RoBERTa, r=.56; min: BERT, r=.27), indicating that these models are somewhat consistent in assigning scores to synonymous utterances, but this relationship is far weaker than that observed in humans or than their syntactic generalization capabilities. This result is surprising given that the internal representations of lexical items formed by LLMs are allegedly geared towards identifying semantically similar terms.

LLM deviations from ground-truth labels are partially, but not fully, explained by plausibility violation strength
To understand the nature and severity of LLM errors, we conducted a quantitative and a qualitative analysis of the sentence pairs that most LLMs got wrong.
We first tested whether the severity of the plausibility violation correlates with model performance. To do so, we correlated the violation magnitude in each sentence pair (operationalized as the difference between human scores for plausible and implausible sentence versions) and the number of LLMs (0 through 4) that correctly evaluated that sentence pair. For both AI and AA sentences, we observed a moderate positive correlation, suggesting that sentence pairs that are more ambiguous to humans are also more challenging for LLMs.
Then, we conducted a qualitative analysis of sentence pairs that all or most LLMs got wrong (Table 5). We found that these include several sentence pairs where human judgments actually deviated from ground truth labels (e.g., The orderly assisted the dentist vs. The dentist assisted the orderly; see Table S4), but in ⅔ of the cases there was at least a 0.1 difference between plausible and implausible sentence ratings in humans. Some errors might be explained by low-level features of the input, such as non-standard spelling (e.g., tour-guide instead of tour guide) and low-frequency words (e.g., milliner), but some likely reflect a failure to identify typical agent/patient roles (e.g., all LLMs fail to identify trainee as a typical patient for the verb taught, even though human judgments in this example are rather unambiguous). Overall, we conclude that the knowledge gap for unlikely (AA) sentences cannot be fully explained by labeling errors or by low-level input properties. See Figure S3 for baseline model results and Figure S9 for the same analysis for Datasets 2 and 3.

Event plausibility is linearly decodable from middle and late LLM layers
The previous sections have investigated the behavioral performance of LLMs in distinguishing plausible and implausible events. In this section, we investigate which LLM layers contain linearly decodable information about event plausibility and whether the features that determine plausibility generalize across different sentence types (impossible vs. unlikely; active vs. passive). We investigate these questions by training a diagnostic classifier that takes layer-specific sentence representations as its input and predicts sentence plausibility, systematically holding out parts of the dataset.
Across model architectures, we find that sequence representations of later model layers are more suitable for decoding sentence plausibility than those of earlier layers (Figure 5; Table 6). This finding is consistent with previous results showing that semantic information tends to be encoded more strongly in later layers (Belinkov et al., 2017; Papadimitriou et al., 2022; Tenney et al., 2019). Probes that are trained to distinguish plausible vs. impossible AI sentence representations perform best, reaching ceiling performance in middle layers (or, for BERT, in late layers). Thus, linearly decodable information required to distinguish possible and impossible events emerges relatively early on in the LLM processing pipeline and stays high throughout. For other decoding types, all models reach peak decodability in late layers except GPT-J, whose performance plateaus in middle layers. Generalizing to AA sentences from AI sentences leads to a drop in probe accuracy compared to testing an AI-trained probe on AI sentences (Figure 5A). This is true for ceiling values (classifying based on human ratings), but model probe performance is significantly worse than even this lower ceiling value. In contrast, probes that are trained to distinguish plausible vs. implausible AA sentences have similar performance on AI and AA test sets, although they fall short of the probes trained and evaluated on sentence representations from both sentence sets (labeled "all" in the figure).
Furthermore, we find that probes fail to generalize across syntactic structures when trained on representations from only one voice type. Evaluating an active-voice-trained probe on passive sentences and vice versa (i) substantially decreases the model's plausibility prediction performance relative to control conditions across layers and (ii) leads to below-chance performance for the early layers of the model (Figure 5B; light and dark red lines). Nevertheless, when the training set includes both active and passive sentences, the probe reliably decoded plausibility judgments from LLM sentence representations, indicating that the embeddings do contain syntax-invariant plausibility information. Note that given the high correlation between active and passive scores in the human data (Figure 3A), empirical ceiling values remain high across syntactic generalizations. For detailed statistical comparison for voice generalization probes, see Table S9. For probing results across the three datasets, see Figure S13; Table S10.
Table 6. Statistical analysis of probing results, generalization across trial type (AI vs AA). "Trend" refers to a linear trend within each layer group.

Discussion
To what extent can language be a source of generalized event knowledge? Do prediction-based LLMs trained on vast amounts of natural language data learn to generate descriptions of plausible events with higher likelihood than descriptions of events that are implausible? To find out, we compared the likelihood scores that LLMs assigned to plausible vs. implausible event descriptions using syntactically simple, tightly controlled minimal pair sentence stimuli. We demonstrated that LLMs acquire substantial event knowledge and improve over strong baseline models, especially when it comes to distinguishing possible and impossible events (The teacher bought the laptop vs. The laptop bought the teacher); however, they fall significantly short of human performance when distinguishing likely events from events that are unlikely but not impossible (The nanny tutored the boy vs. The boy tutored the nanny). Using three different sentence sets, we demonstrated that this gap in performance cannot be fully explained by the animacy of the participants or word frequency.
We further conducted a rigorous set of analyses to elucidate the relationship between an LLM sentence score (which reflects its generation probability) and plausibility, showing that LLM scores depend both on sentence plausibility and surface-level sentence properties. In generalization analyses, we found that both LLM and human scores are consistent for active and passive voice versions of the same sentence, but LLMs are less consistent than humans for synonymous sentence forms. Lastly, we found that linearly decodable plausibility information peaks in the middle layers of the LLMs and persists in later layers, with the same gap between impossible and unlikely event performance as that observed in behavioral tests.

When identifying impossible events, LLMs might leverage selectional restrictions
LLMs in our study were significantly worse at distinguishing likely and unlikely events than possible and impossible events. A notable feature of the impossible event descriptions in our datasets is the violation of selectional restrictions (sometimes also called selectional preferences) on the verb, i.e., the set of semantic features that a verb requires of its arguments (such as an animate agent) (Chomsky, 1965; Katz & Fodor, 1963; Levin, 1993). When plausibility violations were not driven by selectional restrictions (as in the "unlikely" sentence sets), model performance dropped.
Our findings suggest that selectional restrictions are a linguistic property that is learnable from corpus data (as also confirmed by the large number of computational methods for selectional restriction acquisition from texts; e.g., Erk, 2007; Thrush et al., 2020) and whose violations are meaningfully distinct from violations of graded world knowledge (Warren et al., 2015; Warren & McConnell, 2007; cf. Matsuki et al., 2011). The asymmetry between acquisition of selectional restrictions vs. acquisition of graded event knowledge is evidenced not only by the LLM performance gap, but also by the fact that our baseline models similarly performed above chance on distinguishing possible and impossible events but struggled with distinguishing likely and unlikely events. Furthermore, a classifier probe trained on possible vs. impossible sentence embeddings performed almost perfectly on other sentences from the same category but completely failed to generalize to likely vs. unlikely events, indicating that selectional restrictions have a distinct representational signature. These results are consistent with psycholinguistic evidence from reading times and EEG indicating that violations of selectional restrictions and violations of world knowledge evoke distinct processing signatures (e.g., Paczynski & Kuperberg, 2012; Sitnikova et al., 2008; Warren et al., 2015; cf. Hagoort et al., 2004), as well as recent computational evidence suggesting that BERT models are able to generalize their knowledge of selectional restrictions in novel word-learning paradigms (Thrush et al., 2020) and can partially rely on the semantics of the head predicate to predict upcoming event participants (Metheniti et al., 2020).
The ability to master selectional restrictions but not fine-grained event schema knowledge is an important limitation of LLMs, as both of these factors affect plausibility judgments in humans (e.g., Hagoort et al., 2004; Warren et al., 2015). To verify and extend our findings, future work should test LLMs' knowledge of selectional restrictions on features other than animacy (as in S. Wang et al., 2018), as well as evaluate their performance on impossible events that do not violate selectional restrictions per se (e.g., She gave birth to her mother, The man was killed twice, or After 10 coin tosses, she got 12 heads.). Furthermore, the fact that LLMs perform below humans even for syntactically simple sentences (The X Ved the Y) suggests that testing them on longer sequences of text might uncover even larger deviations from GEK.

LLMs can infer thematic roles
The stimuli in Datasets 1 and 3 are constructed such that the model has to leverage word order information to successfully determine event plausibility. LLMs successfully accomplish this task for most possible vs. impossible events and for a number of likely vs. unlikely events. Furthermore, they produce highly correlated scores for active and passive versions of the same sentence, suggesting that thematic role information generalizes beyond a specific word order.
Probing results produce additional insight into the emergence of thematic role information in the LLMs (Figure 5B). A probe trained on a mix of active and passive sentences performs as successfully as the probe trained and tested on only one voice type, suggesting that plausible and implausible sentence embeddings in late LLM layers are linearly separable by the same hyperplane across syntactic structures. This finding aligns with recent computational work showing that even though most sentences in the language input describe prototypical events, LLMs are able to correctly represent the argument structure of non-prototypical event descriptions in late layers (Papadimitriou et al., 2022). Thus, LLMs' decreased performance on distinguishing likely and unlikely events is unlikely to be caused by the models' failure to appropriately assign thematic roles.

The 'reporting bias' in language corpora makes it harder to distinguish likely and unlikely events
A core challenge for modeling plausibility based on linguistic input is the fact that the frequency with which events are described in language is not a reliable predictor of the frequency with which events occur in the real world. Because much of our world knowledge is shared across individuals (e.g., McRae et al., 2005) and human communication is shaped by efficiency (Gibson et al., 2019) and cooperation (Grice, 1975), language is biased towards reporting extraordinary facts and events rather than trivial ones (Gordon & Van Durme, 2013). Many commonsense facts about the world are thus presupposed rather than stated explicitly; in contrast, unusual events are discussed extensively. As a result, likely events are underrepresented in linguistic corpora, whereas unlikely events are overrepresented.
The reporting bias towards rare and newsworthy events in language corpora has long posed a difficulty for modeling semantic knowledge via text mining (e.g., Lucy & Gauthier, 2017; S. Wang et al., 2018). Recent studies probing world knowledge in LLMs show that although the generalization capabilities of these models can overcome the reporting bias to some extent (Shwartz & Choi, 2020; Weir et al., 2020), they still tend to reflect biases that exist in their training corpus (e.g., Shwartz & Choi, 2020; Vig et al., 2020; Zmigrod et al., 2019). As a result, one explanation of the performance gap that we observe for likely vs. unlikely events in LLMs could be that unlikely events are overrepresented in the corpus, leading the models to predict them as frequently as likely events. In contrast, impossible events are nearly absent from the training data, and so the models correctly assign them low likelihood scores.
A possible solution to overcoming the reporting bias would be to adjust the event distribution via injecting manually elicited knowledge about object and entity properties into models (S. Wang et al., 2018) or via data augmentation (e.g., Zmigrod et al., 2019). Alternatively, information about event typicality might enter LLMs through input from different modalities, such as visual depictions of the world in the form of large databases of images and/or image descriptions. In the future, we plan to extend our analysis of generalized event knowledge to multimodal LLMs (e.g., CLIP; Radford et al., 2021) in order to investigate the role of extralinguistic evidence, which might reduce the impact of the reporting bias and better simulate the multimodal information humans use to acquire GEK.

Sensitivity to surface-level features complicates the use of LLMs as knowledge bases
We have shown that the probability of generating a particular sentence under a given LLM depends not only on plausibility, but also on surface-level features of that sentence, such as word frequency. This result is largely expected, because distributional models are naturally geared toward producing more frequent tokens more often. However, this means that the score distributions we observe for plausible and implausible sentences are highly overlapping, suggesting that many implausible sentences are generated with higher likelihood simply because they contain frequent words.
Should we expect LLMs to prefer plausible sentences to implausible ones? On the one hand, sentence plausibility substantially facilitates language processing in humans (e.g., Bicknell et al., 2010; Federmeier & Kutas, 1999; Kutas & Hillyard, 1984; McRae & Matsuki, 2009). If prediction-based LLMs learn to produce text that is not only fluent but also matches humans' expectations about the structure of everyday events, they should be less likely to produce descriptions of implausible events than of plausible ones (e.g., Porada et al., 2021). On the other hand, humans are also sensitive to lexical frequency effects when processing linguistic inputs (e.g., Broadbent, 1967; Goodkind & Bicknell, 2021; Haeuser & Kray, 2022; Rayner & Duffy, 1986) and can use both linguistic knowledge and event knowledge in real time depending on task demands (Willits et al., 2015). Thus, the fact that LLMs are sensitive to both plausibility and frequency effects actually makes them better candidate models of human language processing.
That said, sensitivity to linguistic features of the input makes LLMs unreliable as knowledge bases. Due to this sensitivity, they produce inconsistent results if the same description is phrased differently (Elazar et al., 2021a; Ravichander et al., 2020; Ribeiro et al., 2020) and fail to learn commonsense event schemas (Pedinotti et al., 2021; see also Section 4.2). The ability to abstract away from specific inputs is a key feature of GEK; thus, the ability of future LLMs to acquire robust, flexible event schemas will depend crucially on their ability to generalize beyond corpus statistics.

Linguistic and conceptual knowledge dissociate in humans
Distributional models like LLMs provide us with the unique opportunity to test the relationship between language and world knowledge. The fact that LLMs master selectional restrictions but not fine-grained event schemas suggests a distinction between linguistic and conceptual knowledge. The striking difference in score distributions in humans and LLMs (Figure 1B,C; Figures S4 and S14) further highlights the fact that the way in which semantic categories are represented and combined in language models differs markedly from how they are represented and used by humans.
The dissociation between language and GEK observed in LLMs is consistent with the wealth of human evidence showing that language processing relies on mechanisms that are distinct from other cognitive capacities, such as logic and math (e.g., Amalric & Dehaene, 2016; Coetzee & Monti, 2018; Monti et al., 2007, 2009, 2012; Varley et al., 2005), music perception (e.g., Basso & Capitani, 1985; Chen et al., 2021; Luria et al., 1965), gesture perception (Jouravlev et al., 2019; Pritchett et al., 2018), and social reasoning (Lecours & Joanette, 1980; Paunov et al., 2019, 2022; R. Varley & Siegal, 2000). Many of these capacities are important for language use in real-life situations, yet their neural processing mechanisms are distinct from the core language network (Fedorenko & Varley, 2016; Mahowald, Ivanova et al., in prep). GEK, as well as semantic knowledge more generally, might be considered somewhat of an outlier among these functions due to a tight coupling between language and semantics/pragmatics. After all, how is it possible to process language without accessing the underlying meaning? Nevertheless, evidence from brain-damaged individuals points to a dissociation between linguistic and conceptual processing (e.g., Caramazza et al., 1982; Lambon Ralph et al., 2017; Patterson et al., 2007), including an event plausibility task performed on pictures (Ivanova et al., 2021). That said, the language network does respond during event plausibility judgments performed on both verbal and nonverbal stimuli (Ivanova et al., 2021), indicating that the information stored in the language circuits might be recruited, even if not required, during event processing in humans. Our results here support this account: distributional linguistic information acquired by LLMs carries some event knowledge but is not equivalent to GEK.

Generating descriptions of unlikely events: a feature rather than a flaw?
Do we even want LLMs to serve as knowledge bases? We argue no. Language and world knowledge are two fundamentally different capabilities; even if world knowledge can, in principle, be acquired through linguistic input, the objective functions for linguistic proficiency and world knowledge acquisition are vastly different. As discussed in the previous section, LLMs' sensitivity to surface-level input features makes them ill-equipped to serve as knowledge bases but, at the same time, makes them better at mimicking human language processing. Thus, the word-in-context prediction objective that LLMs are trained with is well-suited for acquiring the formal competence needed for modeling human language (e.g., Gauthier et al., 2020; Hu et al., 2020; Mahowald, Ivanova et al., in prep) but not event knowledge or world knowledge in general. Future models, if trained appropriately, might be able to successfully balance linguistic fluency and systematic world knowledge. However, we predict that robust GEK cannot be acquired for free simply from the word-in-context prediction objective.
In fact, the ease with which LLMs generate both likely and unlikely event descriptions could be considered a feature rather than a flaw. The power of language lies not only in its ability to convey factual knowledge: language allows humans to brainstorm, fantasize, discuss counterfactuals, speculate, and dream. With enough backstory, even an impossible event like The laptop bought the teacher can be rendered plausible, eliminating the processing difficulty in humans (e.g., Jouravlev et al., 2019; Nieuwland & Van Berkum, 2006; Warren et al., 2008) and in LLMs (Michaelov et al., 2022). Thus, restricting the models to the realm of a priori plausible events would handicap their potential as models of human language. Of course, in the absence of contextual information (as is the case in our study), we would still expect LLMs to generate plausible event descriptions more often than implausible ones. However, an overly strong alignment between an LLM and a knowledge base would likely be counterproductive for its linguistic fluency.
Language models are an important tool for investigating which cognitive capacities can, in principle, rely on language processing mechanisms. Contemporary LLMs show that large amounts of world knowledge can be learned from language alone; yet controlled, targeted manipulations like the ones used in this study can also reveal their limitations and highlight areas of knowledge where LLM behavior is not aligned with human behavior. Future work should explore the extent to which LLMs master other types of event knowledge, such as knowledge of typical/possible event sequences, and the extent of their sensitivity to selectional restrictions other than animacy. Overall, detailed investigations of world knowledge in language models are a valuable source of evidence for clarifying the relationship between language and meaning.

SI2. Baseline model description details
Baseline models. We are interested in investigating whether knowledge of event plausibility emerges as a natural by-product of attending to word co-occurrence statistics. To this end, we compare the performance of the LLMs against four baseline models designed to encode information relevant for building an accurate event representation from linguistic input.
The TinyLSTM model is a vanilla two-layer LSTM recurrent neural network, trained with a next-word prediction objective on the 1-million-word English Penn Treebank, sections 2-21 (Marcus et al., 1993). For TinyLSTM, a sentence's plausibility score is estimated as the average surprisal (Hale, 2001; Levy, 2008) of each sentence token, i.e., its negative log probability conditioned on the preceding sentence tokens: $-\log P(w_i \mid w_1, \ldots, w_{i-1})$.
The model is available through the LM Zoo library (Gauthier et al., 2020).
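As a concrete illustration of this scoring scheme, the sketch below computes a sentence's average surprisal with a generic autoregressive HuggingFace model standing in for TinyLSTM (which is accessed through LM Zoo); the choice of gpt2 here is purely illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative stand-in for TinyLSTM: any autoregressive LM scores a
# sentence the same way, as the average token surprisal.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def average_surprisal(sentence: str) -> float:
    """Average negative log-probability of each token given its left context."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Position i predicts token i+1, so shift logits and targets by one.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    idx = torch.arange(ids.shape[1] - 1)
    token_log_probs = log_probs[idx, ids[0, 1:]]
    return -token_log_probs.mean().item()

# Lower average surprisal corresponds to a higher plausibility score.
print(average_surprisal("The teacher bought the laptop."))
print(average_surprisal("The laptop bought the teacher."))
```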
Thematic fit models the degree of semantic compatibility between a "prototype" argument of an event's verb, calculated from distributional text information (McRae et al., 1998), and the role filler proposed by the sentence. Different models for measuring thematic fit have been proposed (e.g., Erk, 2007; Greenberg et al., 2015; Lenci, 2011; Sayeed et al., 2016). Here, we follow Lenci (2011) for calculating prototypical representations. Given an event described by a predicate and a subject:
1) we use Local Mutual Information (Evert, 2008) to retrieve the 200 entities most strongly associated with each (in the specific syntactic position);
2) we compute the intersection of the two entity lists to find the entities compatible with the compositional event description (in case the intersection is empty, we prioritize the entities associated with the verb and use only them to create the prototype);
3) we rank entities by the product of their association scores with the subject and the verb, and select the 20 entities most strongly associated with both;
4) we compute the prototype vector as the centroid of these entities' representations, i.e., as the average of their FastText (Bojanowski et al., 2017) word embeddings.
After computing the prototype representation, we obtain a sentence's plausibility score as the cosine similarity between the FastText embedding of the proposed object and the relevant prototype vector.
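The sketch below illustrates steps 2-4 and the final scoring step under simplifying assumptions: `verb_assoc` and `subj_assoc` are hypothetical precomputed dictionaries of Local Mutual Information scores for candidate entities, and `embed` is a placeholder for a FastText embedding lookup:

```python
import numpy as np

def prototype_vector(verb_assoc: dict, subj_assoc: dict, embed) -> np.ndarray:
    """Build the prototype for the object slot from two dictionaries mapping
    candidate entities to their association (LMI) scores."""
    # Step 2: entities compatible with both verb and subject; fall back to
    # verb-associated entities if the intersection is empty.
    shared = set(verb_assoc) & set(subj_assoc)
    if shared:
        # Step 3: rank by the product of the two association scores, keep top 20.
        ranked = sorted(shared, key=lambda e: verb_assoc[e] * subj_assoc[e],
                        reverse=True)
        top = ranked[:20]
    else:
        top = sorted(verb_assoc, key=verb_assoc.get, reverse=True)[:20]
    # Step 4: prototype = centroid of the entities' embeddings.
    return np.mean([embed(e) for e in top], axis=0)

def plausibility(obj: str, proto: np.ndarray, embed) -> float:
    """Cosine similarity between the proposed object and the prototype."""
    v = embed(obj)
    return float(v @ proto / (np.linalg.norm(v) * np.linalg.norm(proto)))
```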
The Structured Distributional Model (SDM; Chersoni et al., 2019) improves on standard models of thematic fit by leveraging insights from Discourse Representation Theory (DRT; Heim, 1982; Kamp, 1981), a formal theory of dynamic semantics, in addition to distributional information extracted from text corpora. DRT assumes that each clause describes an event or situation, and that listeners dynamically build representations of these events as the sentence unfolds over time. The novel contribution of SDM is to infuse these dynamic discourse representations with distributional knowledge about events and their typical participants. In computing the compatibility between a proposed role filler and its distributional prototype, SDM combines two tiers of semantic representation. On the one hand, SDM computes a context-independent representation of the linguistic context (linguistic condition; LC) by summing the embeddings of all lexical items in the leftward context. On the other hand, SDM computes a context-dependent representation of the prototypical argument via a distributional event graph (DEG) that is external to the model and was extracted from parsed text corpora. In this graph, the nodes represent the lexical items in the corpus and the edges encode the statistical syntactic relations between these items. Given a set of linguistic items, SDM queries the DEG for the most common role fillers associated with the items in the active context (AC) and computes a prototype representation for that slot as the centroid of the FastText embeddings of the highest-ranked entities. A sentence's plausibility score is finally calculated as the sum, over the proposed verb argument fillers provided by the linguistic input, of each filler's average cosine similarity with (i) the representation of the preceding context (LC) and (ii) the context-dependent prototype for the target role (AC).

Finally, the syntax-based PPMI (PPMI-syntax) model is trained to encode statistical associations between verbs and their dependents on a concatenation of the dependency-parsed ukWaC corpus (Baroni et al., 2009), a 2018 dump of the English Wikipedia, and the British National Corpus (Leech, 1992). For each sentence, we first extract triplets of syntactic relations <verbal head, nominal dependent, syntactic role> with a minimum frequency of 2, and compute the Positive Pointwise Mutual Information (PPMI) score for each such triplet, where N denotes the total frequency of all triplets:

$$\mathrm{PPMI}(v, n, r) = \max\left(0,\; \log \frac{f(v, n, r) \cdot N}{f(v) \cdot f(n, r)}\right)$$

In the testing phase, a sentence's plausibility score is then computed as the sum of the PPMI score for the verb and the subject and the PPMI score for the verb and the object. We apply Laplace smoothing (also called add-one smoothing), which consists of adding 1 to all counts.
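As a rough sketch of the PPMI-syntax scoring step under stated assumptions: the `counts` Counter of triplet frequencies stands in for the statistics extracted from the parsed corpora, the role labels are illustrative, the marginals are computed naively rather than precomputed, and add-one smoothing is applied to the aggregated counts here for simplicity:

```python
import math
from collections import Counter

def ppmi(counts: Counter, v: str, n: str, r: str) -> float:
    """PPMI of a <verb, dependent, role> triplet with add-one smoothing."""
    f_vnr = counts[(v, n, r)] + 1
    # Naive marginal counts; a real implementation would precompute these.
    f_v = sum(c for (vv, _, _), c in counts.items() if vv == v) + 1
    f_nr = sum(c for (_, nn, rr), c in counts.items() if (nn, rr) == (n, r)) + 1
    N = sum(counts.values()) + 1
    return max(0.0, math.log(f_vnr * N / (f_v * f_nr)))

def sentence_score(counts: Counter, verb: str, subj: str, obj: str) -> float:
    """Plausibility = PPMI(verb, subject) + PPMI(verb, object)."""
    return ppmi(counts, verb, subj, "nsubj") + ppmi(counts, verb, obj, "dobj")
```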
SI3. Baseline models: detailed results (Dataset 1)

Figure S1. Baseline performance on Dataset 1 (results in A are the same as in Figure 1). [Figure not reproduced; example sentence pair from the figure: "The orderly assisted the dentist." / "The dentist assisted the orderly."]

SI6. Alternative metrics for LLMs
For bidirectional models, which cannot evaluate the likelihood of a sentence via the chain rule, it is unclear whether a sentence's pseudo-log-likelihood score (Salazar et al., 2020) is a good proxy for sentence plausibility. To investigate the effect of metric choice on evaluating the plausibility of an event, we additionally compare the following ways of computing sentence plausibility under a bidirectional model:
• Last-word probability, i.e., the average log-likelihood of the subtokens that compose the last word in the sequence according to the model's tokenizer;
• Left-to-right (l2r) causal sentence-generation probability, i.e., the average log-likelihood of each token in the sequence, conditioned only on the preceding tokens according to the model.
We find that a sentence's pseudo-log-likelihood score is a more robust indicator of event knowledge in bidirectional LLMs than other prediction-based metrics, such as last-word or verb-production likelihood (Figures S10 and S11). This finding aligns with recent research showing that estimating the plausibility of a proposition by comparing prediction probabilities for target linguistic items at a single masked-out position (as used in, e.g., Abdou et al., 2020; Kocijan et al., 2019; Pedinotti et al., 2021) can result in underestimation of model performance. This underestimation derives in part from the number of suitable competitors (i.e., other surface forms representing the same underlying concept, such as computer and PC) across which the LLM has to split its probability mass (Holtzman et al., 2021), and from biases introduced by factors irrelevant to the task, such as the number of tokens a word receives under a given LLM tokenizer (Elazar et al., 2021b). Although our framework does not fully address these issues, we show that comparing likelihoods across minimal sentence pairs matched on many common confounding factors provides a principled way of estimating model performance even when the critical manipulation is not sentence-final.
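For reference, here is a minimal sketch of the pseudo-log-likelihood computation, using bert-base-uncased as an illustrative bidirectional model (special-token handling is simplified to skipping the first and last positions):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Illustrative bidirectional model; any HuggingFace masked LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log-probabilities of each token when it alone is masked out
    (Salazar et al., 2020)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    # Skip the [CLS] and [SEP] special tokens at the sequence edges.
    for i in range(1, len(ids) - 1):
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        total += log_probs[ids[i]].item()
    return total

# Higher pseudo-log-likelihood = higher plausibility under the model.
print(pseudo_log_likelihood("The teacher bought the laptop."))
print(pseudo_log_likelihood("The laptop bought the teacher."))
```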

Multiclass probing
In the MTurk study, humans do not perform binary classification but rather assign each sentence a graded plausibility score from 1 to 7. To investigate whether LLMs are able to predict these finer-grained plausibility estimates, and to account for the difference between impossible and unlikely events, we trained an ordinal multiclass classifier mapping LLM activations to the rounded average human judgment for each sentence. Across the seven possible classes, the human score distributions are unbalanced. To enable classifier convergence, we aggregated the human estimates into three classes: "impossible", subsuming human judgment scores of 1 and 2; "unlikely", subsuming scores of 3 to 5; and "plausible", subsuming scores of 6 and 7. We used sklearn's (Pedregosa et al., 2011) Support Vector Classification module with a linear kernel and balanced class weights. We found that the overall results were similar to those reported in the main text, with some generalization differences.
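A minimal sketch of this probing setup follows; the activation matrix and labels below are random placeholders for the real sentence-level LLM activations and aggregated human judgments, and the train/test split is illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder inputs: one activation vector per sentence plus its 3-way label
# ("impossible" = human scores 1-2, "unlikely" = 3-5, "plausible" = 6-7).
rng = np.random.default_rng(0)
X = rng.normal(size=(1215, 768))  # stand-in for real LLM activations
y = rng.choice(["impossible", "unlikely", "plausible"], size=1215)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Linear-kernel SVC with balanced class weights, as in the probing analysis.
clf = SVC(kernel="linear", class_weight="balanced")
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```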