Modeling structure-building in the brain with CCG parsing and large language models

To model behavioral and neural correlates of language comprehension in naturalistic environments, researchers have turned to broad-coverage tools from natural-language processing and machine learning. Where syntactic structure is explicitly modeled, prior work has relied predominantly on context-free grammars (CFG), yet such formalisms are not sufficiently expressive for human languages. Combinatory Categorial Grammars (CCGs) are sufficiently expressive, directly compositional models of grammar with flexible constituency that affords incremental interpretation. In this work we evaluate whether a more expressive CCG provides a better model than a CFG for human neural signals collected with fMRI while participants listen to an audiobook story. We further test between variants of CCG that differ in how they handle optional adjuncts. These evaluations are carried out against a baseline that includes estimates of next-word predictability from a Transformer neural network language model. Such a comparison reveals unique contributions of CCG structure-building predominantly in the left posterior temporal lobe: CCG-derived measures offer a superior fit to neural signals compared to those derived from a CFG. These effects are spatially distinct from bilateral superior temporal effects that are unique to predictability. Neural effects for structure-building are thus separable from predictability during naturalistic listening, and those effects are best characterized by a grammar whose expressive power is motivated on independent linguistic grounds.


Introduction
At the sentence level, there remain many unanswered questions regarding language comprehension. What algorithm best describes this cognitive process? The interactive and dynamic character of this sort of cognition (Marslen-Wilson, 1975; Tanenhaus et al., 1995) has motivated a turn to the brain. The hope is that neural data with more granular spatial and/or temporal resolution can help to tease key pieces apart. This brain-informed strategy is promising, but faces a number of challenges related to scale. One issue is how insights from experimental designs that probe isolated phrases (Bemis and Pylkkänen, 2011; Zaccarella et al., 2017a; Murphy et al., 2022; Matchin et al., 2019) and sentences (Pallier et al., 2011; Nelson et al., 2017; Zaccarella et al., 2017b) might scale to more natural instances of language processing (Hasson and Egidi, 2015). A promising approach to this scale problem deploys broad-coverage tools from natural-language processing (NLP) to operationalize cognitive models for language understanding and to annotate more naturalistic instances of language. Statistical alignment between annotations and neural signals is then used to localize specific cognitive processes and to adjudicate between alternative models (e.g. Brennan and Pylkkänen, 2017; Brennan et al., 2016, 2020; Wehbe et al., 2014; Reddy and Wehbe, 2021; Shain et al., 2020; Nelson et al., 2017; Bhattasali et al., 2018). Such research has revealed a compelling alignment in the neural bases of different aspects of sentence comprehension. Over the past few years, results with both experimentally-controlled stimuli as well as more naturalistic materials have converged upon a view of posterior temporal regions (Wilson et al., 2011; Murphy et al., 2022; Zaccarella et al., 2017b; Brennan et al., 2020) as working in tandem with anterior temporal and inferior frontal areas to subserve the apperception of linguistic structure. Of the latter, anterior temporal brain areas have been linked with semantic combinatorics (e.g. Li and Pylkkänen, 2021) while inferior frontal regions may subserve working memory operations (e.g. Amici et al., 2007; Matchin et al., 2014); Matchin and Hickok (2020) offer an integrated framework for these findings.
The body of work cited above rests on two key assumptions that have yet to be put to the test. The first assumption has to do with the grammar: existing studies have for the most part modeled syntax using context-free grammars (CFG) that capture some constituency facts, but are not expressive enough to capture the full range of natural language dependencies (Joshi, 1985; Stabler, 2013; Steedman, 2000). The second assumption concerns the trade-off between complexity and predictability. It has long been recognized that language use might be modulated by structural complexity (Frazier, 1985; Hawkins, 2004; Miller and Chomsky, 1963), perhaps giving rise to a "bottleneck" whereby processing costs are primarily driven by usage-based factors like predictability which may, in turn, reflect factors including structural complexity (see Levy, 2008, but also Hawkins, 1994, 2004, 2014). While previous work has sought to tease these apart, the newest generation of transformer-based Large Language Models (LLMs) (Vaswani et al., 2017, et seq.) offers an unprecedented tool to capture these usage-related factors. Indeed, the apparent match between the outputs of these models and human neural signals suggests they offer a very strong baseline for isolating neural responses that reflect statistical patterns alone (Schrimpf et al., 2021; Kumar et al., 2022; Caucheteux and King, 2022; Caucheteux et al., 2023; Heilbron et al., 2022).
In this paper, we model structure-building using a Combinatory Categorial Grammar (CCG) with human-like expressiveness (Stanojević and Steedman, 2019, 2020). We quantify the number of parsing steps incrementally, word-by-word, and use that quantity to model whole-brain fMRI time-series recorded while participants listen to an audiobook story. This neural modeling effort includes state-of-the-art estimates of word-predictability from the Chinchilla LLM (Hoffmann et al., 2022). With these materials, we ask three questions. First, do CCG parser steps capture neural variance better than steps derived from a CFG? Second, do such correlations hold above-and-beyond LLM-based estimates of predictability? And, third, which of several alternative formulations of CCG parsing is most human-like?

Incremental parsers as models for neural signals
Stabler (2013) highlights the "hidden consensus" among diverse groups of linguists that the world's languages surpass the limits of context-free grammar (CFG), but not by very much. This consensus implicates the "mildly context-sensitive" formalisms characterized by Joshi (1985). One example of the kind of structure that requires such greater expressivity is the crossing serial dependency found in numerous languages. Example (1) illustrates such a dependency with an expression from Dutch; here, the embedded clauses are ordered such that the paired subjects and verbs are inter-leaved, rather than being nested (Steedman, 2000, p. 25).
Prior efforts to model naturalistic neural signals associated with structure building have not, on the whole, reflected this consensus. Structural complexity estimates based on less-expressive CFGs have been found to correlate with regions in the left temporal lobe (Brennan et al., 2012; Nelson et al., 2017; Brennan et al., 2016; Reddy and Wehbe, 2021). Attempts to extend this to more expressive grammars have been extremely limited. Shain et al. (2020) start from a Generalized Categorial Grammar, which may be quite expressive, but ultimately proceed to compile that down to a CFG. Brennan et al. (2020) use a Recurrent Neural Network Grammar (RNNG; Dyer et al., 2016) and estimate structural complexity via the number of parser transitions attempted across a parallel "beam" of partial analyses. Yet, the phrase structures parsed by this particular RNNG were comparatively naïve. They reflected only the constituency annotations in the Penn Treebank (Marcus et al., 1993), without explicitly treating long-distance dependencies such as filler-gap constructions and WH questions (see §4 of Bies et al., 1995). Finally, Brennan et al. (2016) estimate structure using a CFG as well as a more expressive Minimalist Grammar (MG); the latter captures neural variance robustly in posterior temporal regions and more modestly in anterior regions. However, the MG deployed there was hand-built for the stimulus text; it was not broad-coverage in a way that would generalize to other instances of natural language. Despite these limitations, prior studies pairing structural annotations with naturalistic data have revealed left-temporal correlates for structural complexity that are broadly consistent with results from more constrained experimental designs. However, these studies differ in their details, particularly regarding the relative contribution of anterior versus posterior middle and superior temporal regions. Such discrepancies could reflect the limitations of context-free grammar per se, differences in parsing strategy, or other analytical differences such as the selection of regions of interest. The present effort aims to address each of these limits.

Combinatory Categorial Grammar
Combinatory Categorial Grammar (CCG) is a mildly context-sensitive formalism that fits into realistic processing models (Steedman, 2000). Its flexible constituency allows for a very high degree of incrementality even with simple analysis methods like shift-reduce parsing. In addition, it is directly compositional in the sense of Barker and Jacobson (2007); each syntactic constituent is assigned a corresponding semantic interpretation in the form of a lambda term that has explicit truth conditions. Together, these properties of the grammar offer a good match to human sentence processing, where comprehension operates immediately and incrementally.² Thanks to the treebanking efforts of Julia Hockenmaier, CCG is broad-coverage in exactly the manner required for matching naturalistic experimental stimuli (Hockenmaier and Steedman, 2007). Panels (b) and (c) of Figure 1 present two different CCG analyses that provide the same semantic interpretation. These contrast with the naïve phrase structure in panel (a) that typifies context-free approaches to sentence structure.
² Human sentence processing can proceed in a way that is closely time-locked to the incoming stream of words (Marslen-Wilson, 1975; Altmann and Steedman, 1988; Tanenhaus et al., 1995). Sag and Wasow (2011) distill this and related considerations into a short list of requirements on performance-plausible grammars. In subsequent years, Stabler's (1997) formalization emerged as a version of transformational grammar that meets Sag and Wasow's requirements (see e.g. Hale, 2006; Graf et al., 2017; Stanojević and Stabler, 2018; Hunter et al., 2019; Chen and Hale, 2021).
Figure 1: Comparison of different syntactic representations for the sentence "Mary reads papers". (A) shows a phrase-structure constituency tree that follows a context-free grammar. (B-C) Left- and right-branching Combinatory Categorial Grammar derivations; both yield the same final semantic interpretation yet differ in the order by which phrases are composed.
As suggested above, the existence of two alternative analyses grants a CCG-based parser the power to operate incrementally, word-by-word. These alternatives differ in terms of the eagerness of a corresponding parsing process: the "left-branching" derivation 1c eagerly composes structure as soon as it is grammatically possible. The "right-branching" derivation 1b is less eager; there are points in the derivation where multiple words must be recognized before a composition step can be taken. For instance, the subject "Mary" is not composed with its main verb "reads" until the third and final word. Both analyses are perfectly well-formed and are associated with exactly the same semantic interpretation as shown in red.
Although context-free grammar lacks flexible constituency, there exists a universe of alternative parsing strategies for these grammars that likewise can be viewed as varying in eagerness. Top-down parsing is maximally eager, whereas bottom-up parsing is minimally eager. This generalized perspective is elaborated in the Methods section, which includes Figures 4-5 illustrating all of these strategies. Hale (2014, chapter 3) offers a pedagogical treatment. Crucially, to model naturalistic behavioral and neural signals from language-users, one must commit both to the grammatical structures that the language-user is using, and also to the strategy (more or less eager) by which structures are composed (Brennan, 2016).
A traditional, if Anglo-centric, view is that human sentence processing operates according to the "left-corner" strategy (Johnson-Laird, 1983; Abney and Johnson, 1991). This traditional view is supported by patterns of expected memory-use during comprehension (Resnik, 1992). Indeed, Brennan and Pylkkänen (2017) present evidence that neural signals from the anterior temporal lobe recorded with magnetoencephalography during reading are consistent with the left-corner, not bottom-up, strategy. Nelson et al. (2017), on the other hand, report electrophysiological data from intra-cranial recordings in frontal and temporal regions that are most consistent with either a bottom-up or left-corner strategy but not, in their analysis, with an eager top-down strategy. Among their differences, this latter neurolinguistic study involves isolated sentences, while the former uses more natural story-book reading; this highlights the potential tension between different literatures mentioned above.
Bottom-up parsing is thus a relatively simple strategy which enjoys some empirical support. It can be applied together with CCG or naïve phrase structure. Regardless of which grammar is used, bottom-up parsing always faces a prima facie problem capturing the incremental interpretation that humans appear to show during real-world comprehension, especially regarding right adjunction. The issue can be illustrated with the simple sentence "Mary reads papers daily." From an incremental perspective, a bottom-up analysis would yield a complete sentence after just the words "Mary reads papers", which is indeed a grammatically acceptable string. When confronted with the modifier "daily", the parser would need an extra sequence of processing steps to reject the existing analysis and re-compose the modifier "daily" with the verb phrase "reads papers". These extra processing steps seem to be at odds with human processing evidence, which shows little to no difficulty with right-adjunction of this sort. Lewis (1993, ch. 2) reviews a host of what he terms "unproblematic ambiguities" in this vein which together place constraints on the kind of flexibility needed to account for human sentence processing patterns.
Experimental evidence sharpens the challenge for bottom-up analyses and CCG. Sturt and Lombardo (2005a) examine patterns of eye-fixations while participants read sentences combining reflexive anaphora and conjunction, such as "The pilot embarrassed Mary and put herself in a very awkward situation." Here, co-reference between "the pilot" and "herself" requires a connected path ("c-command within a local domain" in the terminology of Chomsky, 1981). Yet, bottom-up strategies do not make such a path available until the second coordinated phrase has been fully parsed, as schematically illustrated in Figure 2. Despite this lack of syntactic connection, eye-fixation data reveal that comprehenders in fact do resolve the relevant co-reference relationship immediately, without waiting for the clause-final word "situation." Evidence for immediate interpretation of co-reference in sentences such as this is incompatible with the bottom-up parsing strategies discussed thus far. Stanojević and Steedman (2019, 2020) present a CCG parser with components designed to address this challenge. They do so by implementing an incremental tree-rotation algorithm via a REVEAL operation added to the parser along-side SHIFT and REDUCE. Under this revised strategy, the parser incrementally converts left-branching to right-branching structures that are suitable for composing with the modifier.
With this backdrop, we aim to test which of a family of broad-coverage syntactic parsers offers the best fit to human neural signals. We compare models that differ in the grammar used to build structures (less expressive CFG or human-like CCG), as well as the parsing strategy that is applied incrementally. For CCG, we test between a model that includes the REVEAL operation, and one which does not. For CFG, we evaluate both a top-down and a bottom-up strategy (as discussed in the Methods section below, the top-down and left-corner strategies are not distinguishable with the fMRI data used in this study). Comparing models along these dimensions alone, however, is not sufficient. Indeed, language processing is highly sensitive to the predictability and frequency of expressions; we turn next to psycholinguistic and computational models of predictability that we might probe in comparison to the parsing models discussed thus far.

Predictability, processing, and large language models
Language users are exquisitely sensitive to the statistical properties of their linguistic experience (e.g. Bybee, 2006). Indeed, Levy (2008) proposes that linguistic expectations might serve as a "causal bottleneck between the linguistic representations constructed during sentence comprehension and the processing difficulty incurred at a given word within a sentence" (page 1128). The leading idea is that language processing is modulated, perhaps in large part, by expectations grounded in usage and, in turn, usage is affected by linguistic (including syntactic) complexity. This idea builds on a rich tradition of research at the intersection of psycholinguistics, typology, and linguistic theory, including observations that language processing is facilitated when input is syntactically predictable (e.g. Hale, 2001, 2011), studies showing preferences to disambiguate towards structurally simpler expressions (Frazier, 1985), and evidence that such preferences shape the typological distribution of languages as they exert performance pressures as languages develop and change (Hawkins, 2004; see also Futrell et al., 2015). The well-documented link between structural complexity, usage, and linguistic expectations demands careful attention in any effort to tease out neural signatures of structure-processing.
In NLP, neural-network language models, especially LLMs based on the transformer architecture of Vaswani et al. (2017), have shown tremendous capacity to capture and reproduce certain statistical characteristics of huge amounts of text. These models are usage-based in the sense that their architecture is not tuned to the structural properties of human language, and their initial state is random; the linguistic patterns that they induce are due solely to input they receive. The inputs to the models are strings of text and they are trained via gradient descent to predict a masked word given some amount of surrounding ("bidirectional") or just preceding ("unidirectional") context. Left to right unidirectional, or "causal" language models operate incrementally, word by word, in the same order as human language processing.
Next-word prediction, the task optimized by LLMs, seems at odds with the comparatively rich task of language comprehension. Yet these models have offered an unprecedented window into the rich information latent in the statistical patterns of language use, showing evidence of inducing aspects of syntactic structure (Manning et al., 2020) with some notable limitations (Ettinger, 2020); see Linzen and Baroni (2021) for a review. Such structure may come to bear on behavioral and neural correlates of processing via the "causal bottleneck" mentioned above. Indeed, probing for evidence of structure through the lens of how it might constrain expectations has been precisely the focus of prior work examining behavioral and neural indices of syntactic structure in naturalistic settings (Frank and Bod, 2011; Frank et al., 2015; Brennan and Hale, 2019; Shain et al., 2020; Henderson et al., 2016). Those efforts complement the path we take here to probe structural complexity more directly and independently of any mediation between structure and predictability.
Given that both LLMs and humans are highly sensitive to patterns of language use, it is perhaps not surprising that LLMs have been found to reliably correlate with a variety of human neural signals. Heilbron et al. (2022) evaluate the fit between next-word predictions furnished by an LLM and neural signals in a variety of ways (see also Caucheteux et al., 2023). In that work, model outputs are quantified in terms of surprisal, which is a transformation of conditional probability. These quantities, as defined by a language model, can be binned into syntactic categories or classes defined by shared phonological form(s). Previous work has reported robust correlations between model-derived prediction at these various levels and distinct neural signals recorded with both electroencephalography (EEG) and magnetoencephalography (MEG).
Other groups have focused on testing for alignment between the internal states of LLMs and human neural signals. Schrimpf et al. (2021) evaluate the fit between the neural network activations of a range of language models, including transformer-based LLMs, and human neural signals recorded with fMRI as well as electrocorticography (ECoG). They find statistically reliable fits between patterns of activation in the models and human neural signals recorded while participants comprehend sentences and more natural stories. Moreover, they find the next-word prediction performance of the models is the single best predictor of the degree of match between model activations and neural signals. Caucheteux and King (2022) also find that LLMs with better missing-word prediction performance on text show a better statistical match with fMRI and MEG neural signals recorded while participants read isolated sentences. These effects span broad swaths of left frontal and temporal cortices. Probing the nature of these fits further, they find the best model-to-brain match obtains with the middle layers of the LLMs (e.g. layers 8-9 of a 12-layer feed-forward transformer network) rather than input or output layers. Kumar et al. (2022) report similar results for fMRI data collected while participants listen to narratives.
A persistent challenge for the above-mentioned work is that the LLMs whose activation patterns show a high degree of match to brain signals are themselves "black box" models whose internal states are not directly interpretable. Kumar et al. propose an interesting strategy to confront this limitation by focusing specifically on how different "attention heads" in the transformer network, which serve in this architecture to mediate the spread of feed-forward activation as a function of context, drive neural performance in different brain regions. Analyzing the distribution of activation across attention heads is one of several strategies that have been pursued with some success in rendering LLMs interpretable in terms of linguistic structure (e.g. Kuncoro et al., 2016; Voita et al., 2019; Manning et al., 2020; Yu and Ettinger, 2020, and others).
We take a different, parallel, approach in this project by beginning with parsing models whose internal states are directly interpretable in terms of linguistic representations and computations. Along-side measures from those models, we add estimates of next-word predictability from a state-of-the-art LLM in order to tease apart structural processing from usage-based expectations. To this end, we use the Chinchilla LLM introduced by Hoffmann et al. (2022). This particular model was developed with a focus on finding the optimal tradeoff between model size and training data. It is a top-performer of the current generation of transformer-based language models on a range of tasks, including next-word prediction, outperforming DeepMind's Gopher (Rae et al., 2021), OpenAI's GPT-3 (Brown et al., 2020), and other current state-of-the-art LLMs.
In sum, we test for alignment between neural signals and processing steps from a parser with human-level expressive power (CCG) along-side the strongest-to-date baseline model for quantifying next-word expectations. To preview our results: we find that the CCG parser correlates with left-localized fMRI signals above-and-beyond simpler context-free estimates of parsing, and these correlations are made stronger when CCG is augmented with the REVEAL operation that better handles right-adjunction. These effects for structure-building localized to left posterior temporal cortices (and elsewhere) are linearly independent of neural effects for next-word predictability, which separately show high correlations with neural activity along the left superior temporal gyrus.

Computational modeling
Our target predictors come from three types of models: (1) constituency tree parser steps, (2) CCG parser steps and (3) Chinchilla large-language model surprisals. We describe each of these in the following sections.

Predictors from Constituency Parsers
Most constituency parsers fall into three different categories that differ by how much they speculate about the future words that have not been observed yet. A bottom-up parser, sometimes also called a Shift-Reduce parser, is the least speculative and forms tree nodes only over the words that have been observed. Figure 3 shows an example parsing trace of this strategy for the sentence "Mary reads papers daily". Because it speculates very little, this strategy is not very incremental. For instance, the words "Mary" and "reads" do not get connected until the last step, when all of the words in "Mary reads papers daily" have been observed. This clearly does not fit with the incremental nature of human sentence processing.
At the other extreme is a top-down parser that is maximally speculative. Figure 4 shows the trace of a top-down parser for the same example sentence. Here the parser speculates the whole path from the root node S to the observed words, even though this path may not be consistent with future words that are not yet observed. This parser is as incremental as possible: all words are connected as soon as they are observed. For instance, the top-down parser can establish that "Mary" is a subject of "reads" as soon as it observes "Mary reads".
Left-corner parsing is a third type of constituency parsing that fills the space between extreme speculation (top-down parsing) and extreme non-incrementality (bottom-up parsing). A left-corner parser builds all constituents bottom-up, just like the bottom-up parser, but it also predicts the parent node that is not fully complete and of which the current constituent is a left child ("left-corner"). For instance, in Figure 5, Step 2 builds the same constituent as the bottom-up parser, but in Step 3 it also builds (i.e. speculates) the parent that is not fully complete. This partial speculation makes the left-corner parser more incremental than the bottom-up strategy, but still less incremental than top-down. Left-corner parsing establishes the relation between "Mary" and "reads" after consuming the words "Mary reads papers".

Measuring the effort of constituency parsing
We apply a notion of computational work that is inspired by Kaplan's Number of Transitions Made or Attempted (Kaplan, 1972). This is a count of basic operations in a parsing mechanism such as an Augmented Transition Network (for an overview of computational psycholinguistics that introduces this mechanism see Hale, 2017). In this work with naturalistic texts as opposed to known garden-path sentences, we set aside ambiguity resolution and simply count derivation tree nodes in the manner of Frazier (1985). To arrive at an incremental complexity metric, we sum up the number of nodes that would be visited (on a given parsing strategy) between one derivation-tree leaf node and its successor in the linear word-string. This summation is schematized as the number of panels per row in the diagrams shown in Figures 3-5. It is this count that quantifies the amount of work predicted for each word. We have one indicator of processing effort for each constituency parsing strategy: cfg_bottomup and cfg_topdown (cfg_leftcorner is excluded for methodological reasons discussed in Section 2.3). The parsing strategies are computed over the constituency trees provided by the Benepar parser (Kitaev et al., 2019), a state-of-the-art constituency parser with an accuracy of 95.9% in recovering labelled constituents in English.
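As an illustration of the node-counting metric described above, the following Python sketch counts, for each word, the nodes that a top-down strategy would open and the nodes that a bottom-up strategy would close. It is a minimal sketch and not the pipeline used in this study: the nested-tuple tree encoding, the function name node_counts, the toy tree for "Mary reads papers", and the choice to count every labelled node (including preterminals) are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical encoding: a tree is (label, child, child, ...); a word is a str.
TOY_TREE = ("S", ("NP", "Mary"), ("VP", ("V", "reads"), ("NP", "papers")))


def node_counts(tree):
    """Return (words, bottom_up, top_down) node counts per word.

    top_down[i]  = nodes whose leftmost word is word i (opened before reading it)
    bottom_up[i] = nodes whose rightmost word is word i (closed after reading it)
    """
    words = []
    opened, closed = Counter(), Counter()

    def visit(node):
        if isinstance(node, str):          # a word (leaf)
            words.append(node)
            return
        opened[len(words)] += 1            # next word to be read is the leftmost leaf
        for child in node[1:]:
            visit(child)
        closed[len(words) - 1] += 1        # last word read is the rightmost leaf

    visit(tree)
    n = len(words)
    return (words,
            [closed[i] for i in range(n)],
            [opened[i] for i in range(n)])


words, bottom_up, top_down = node_counts(TOY_TREE)
for w, bu, td in zip(words, bottom_up, top_down):
    print(f"{w:8s} bottom-up={bu} top-down={td}")
```

Both strategies visit the same total number of nodes; they differ only in which word each node is charged to, which is what makes the two predictors distinguishable word by word.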

Predictors from Combinatory Categorial Grammar (CCG)
As introduced earlier, Combinatory Categorial Grammar (CCG) has multiple advantages over context-free grammar. These arise from CCG's flexible notion of constituency and its flexibility in building up this constituency structure. The first advantage is that CCG can generate some constructions that appear in human language whose analysis within context-free grammar would be linguistically inadequate (Stanojević and Steedman, 2021). The second advantage is that CCG can be very incremental without resorting to a speculative top-down parsing strategy.
We illustrate the contrast between CFG and CCG with an example repeated from Figure 1 for the sentence "Mary reads papers". In that sentence, a CFG offers no way of building a node for the subject-verb combination "Mary reads", while CCG can represent it as S/NP; this term may be glossed as "a sentence that is missing a noun-phrase on its right". Such flexibility allows CCG to derive the same semantic representation in several different ways. As shown in Figure 1, CCG can form both left- and right-branching structures for the sentence "Mary reads papers". By contrast, CFG can form only a right-branching structure. To accomplish this flexibility, CCG makes use of a set of combinators with type-raising and function composition that we do not review here. For a compact introduction to CCG combinators see Steedman (1996).
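To make the category S/NP for "Mary reads" concrete, the following Python sketch derives it with forward type-raising and forward composition, and then applies it to the object. This is a minimal sketch under stated assumptions: the Atom and Func classes and the three helper functions are hypothetical names introduced here, only the forward variants of the rules are covered, and semantics are omitted entirely.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Atom:
    name: str

    def __str__(self) -> str:
        return self.name


@dataclass(frozen=True)
class Func:
    result: "Cat"
    slash: str          # "/" seeks its argument to the right, "\\" to the left
    arg: "Cat"

    def __str__(self) -> str:
        def wrap(c):
            return f"({c})" if isinstance(c, Func) else str(c)
        return f"{wrap(self.result)}{self.slash}{wrap(self.arg)}"


Cat = Union[Atom, Func]


def type_raise(cat: Cat, target: Cat) -> Cat:
    """Forward type-raising: X  =>  T/(T\\X)."""
    return Func(target, "/", Func(target, "\\", cat))


def forward_compose(left: Cat, right: Cat) -> Optional[Cat]:
    """Forward composition: X/Y  Y/Z  =>  X/Z."""
    if (isinstance(left, Func) and left.slash == "/"
            and isinstance(right, Func) and right.slash == "/"
            and left.arg == right.result):
        return Func(left.result, "/", right.arg)
    return None


def forward_apply(left: Cat, right: Cat) -> Optional[Cat]:
    """Forward application: X/Y  Y  =>  X."""
    if isinstance(left, Func) and left.slash == "/" and left.arg == right:
        return left.result
    return None


S, NP = Atom("S"), Atom("NP")
mary = type_raise(NP, S)                      # S/(S\NP)
reads = Func(Func(S, "\\", NP), "/", NP)      # (S\NP)/NP
mary_reads = forward_compose(mary, reads)     # S/NP
sentence = forward_apply(mary_reads, NP)      # S, after "papers"
print(mary, reads, mary_reads, sentence)
```

The left-branching derivation is possible only because type-raising turns the subject into a function over verb phrases, so that it can compose with the verb before the object arrives.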
The CCG right-branching tree is quite similar to the CFG tree and processing it with a bottom-up parser would be equally non-incremental as shown by the trace in Figure 6. However, the left-branching tree has completely different properties; the key advantage of the left-branching structure is that now even the simplest bottom-up parser is very incremental. Figure 7 illustrates this with the trace of a left-branching bottom-up CCG parser.

Right Adjunction and CCG Parsing with Revealing
While left-branching trees offer a simple solution for achieving incrementality, they are not sufficient. One particular problem is presented by right adjuncts, or optional modifiers that appear to the right of the constituents they are modifying. Their optionality makes these constituents hard to predict across parsers and grammar formalisms. Hale (2014) proposed a solution for the CFG case in the form of generalized left-corner parsing (Demers, 1977). This generalized form allows CFG rules to be annotated with the level of incrementality they should have. Hale proposes to make adjunction rules less incremental than non-adjunction rules. The problem with that approach is that there is no way for an incremental parser to know if it should use the adjunction or non-adjunction rule until after it sees whether there is an adjunct.
Ideally the parser would be able to incrementally process the sentence and assume that there is no right adjunct, but in case the right adjunct appears the parser should be able to incorporate it without costly backtracking. This is precisely what the Revealing CCG parser from Stanojević and Steedman (2019) does by exploiting the flexibility of CCG trees and combinators. To understand the Revealing parsing strategy, consider the example shown in Figure 8. Until Step 7 the parsing steps here are the same as in a left-branching CCG tree (Figure 7). In Step 7 the Revealing strategy applies a ROTATION operation that converts the already-built left-branching tree into a semantically equivalent right-branching tree. This is a computationally efficient operation that, crucially, is fully deterministic. It does not incur any significant processing cost, but it does provide a representation that will be useful later if a right-adjunct appears. In Step 8 the right adjunct has appeared. From its type the parser recognizes that it looks to modify a verb phrase (i.e. a constituent with CCG type S\NP). The parser looks for the candidate for modification in the right edge of the preceding subtree. It finds the constituent "reads papers" to be of the right type and REVEALs it for modification in Step 9.
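The bracketing side of this rotate-then-reveal idea can be sketched in a few lines of Python. This is only an illustration under strong simplifying assumptions, not the algorithm of Stanojević and Steedman (2019): trees are bare binary bracketings without CCG categories or semantics, the function names are hypothetical, and the constituent to be modified is passed in explicitly, whereas the parser described above identifies it by its CCG type on the right edge.

```python
def to_right_branching(tree):
    """Rotate a binary bracketing so that it is right-branching.

    A tree is either a word (str) or a pair (left, right).
    ((a b) c) becomes (a (b c)); word order is preserved.
    """
    if isinstance(tree, str):
        return tree
    left, right = tree
    right = to_right_branching(right)
    while not isinstance(left, str):
        left_left, left_right = left
        # Move the right child of the left subtree over to the right side.
        right = to_right_branching((left_right, right))
        left = left_left
    return (left, right)


def attach_right_adjunct(tree, adjunct, target):
    """Walk down the right edge of `tree` and adjoin `adjunct` to `target`."""
    if tree == target:
        return (tree, adjunct)
    if isinstance(tree, str):
        raise ValueError("target constituent not found on the right edge")
    left, right = tree
    return (left, attach_right_adjunct(right, adjunct, target))


left_branching = (("Mary", "reads"), "papers")
rotated = to_right_branching(left_branching)
with_adjunct = attach_right_adjunct(rotated, "daily", ("reads", "papers"))
print(rotated)
print(with_adjunct)
```

In the left-branching bracketing, "reads papers" is not a constituent at all, so there is nothing for "daily" to attach to; after rotation it sits on the right edge and can be modified without rebuilding the rest of the analysis.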
Just as in the CFG case, we measure parsing effort in CCG by counting the operations that the parser needs to conduct for each word of the input sentence. We compute parsing effort for two different CCG parsing strategies: CCG left-branching (ccg_left) and CCG-revealing (ccg_revealing). The trees over which we derive these CCG parsing predictors are obtained using the Rotating-CCG parser of Stanojević and Steedman (2019), a highly accurate parser that recovers labelled dependencies in English with 90.8% accuracy.
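The per-word counting described above can be sketched as follows. This is a minimal illustration under stated assumptions: the action names and the example derivation are hypothetical and are not taken from the parser of Stanojević and Steedman (2019), and we assume that a word's count includes its own shift plus every operation carried out before the next word is shifted.

```python
from typing import List, Tuple

# (operation name, word consumed by the operation or "" if none)
Action = Tuple[str, str]


def steps_per_word(actions: List[Action]) -> List[Tuple[str, int]]:
    """Count parser operations attributed to each word.

    Assumption: every operation between one shift and the next is
    attributed to the most recently shifted word.
    """
    counts: List[List] = []
    for op, word in actions:
        if op == "shift":
            counts.append([word, 1])  # the shift itself counts as one step
        elif not counts:
            raise ValueError("non-shift operation before any word was read")
        else:
            counts[-1][1] += 1
    return [(w, n) for w, n in counts]


# Hypothetical action sequence for "Mary reads papers daily".
hypothetical_derivation: List[Action] = [
    ("shift", "Mary"), ("type_raise", ""),
    ("shift", "reads"), ("compose", ""),
    ("shift", "papers"), ("apply", ""), ("rotate", ""),
    ("shift", "daily"), ("reveal", ""), ("apply", ""),
]

effort = steps_per_word(hypothetical_derivation)
print(effort)
```

Under these assumptions the word that triggers a REVEAL receives a higher count than a word that is simply shifted and combined, which is the kind of word-by-word contrast the two CCG predictors are meant to capture.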

Modeling predictability with a Large Language Model
To quantify the predictability of successive words, we use Chinchilla, which is DeepMind's state-of-the-art large language model (Hoffmann et al., 2022). This is a Transformer-based neural network with 70 billion parameters that was trained on 1.4 trillion tokens of text (close to 10 TB of data).
As a predictor of processing difficulty derived from a probabilistic language model we use surprisal (Hale, 2001), which is the negative logarithm of the probability of the next word, given all the preceding words. More formally:

$$\mathrm{surprisal}(w_i) = -\log P(w_i \mid w_1, \ldots, w_{i-1})$$
A technical issue exists in that the probabilities provided by Chinchilla are over tokens instead of words. Tokens are automatically constructed using SentencePiece (Kudo and Richardson, 2018), which produces tokens that are not necessarily meaningful in a linguistic sense, i.e. they need not represent morphemes. To get word probabilities we sum the log-transformed probabilities of all the tokens that are extracted from the given word. If the $k$ tokens of word $w_i$ start at token position $f(w_i)$, we compute surprisal for $w_i$ as:

$$\mathrm{surprisal}(w_i) = -\sum_{j=f(w_i)}^{f(w_i)+k-1} \log P(t_j \mid t_1, \ldots, t_{j-1})$$

where $t_j$ denotes the token at position $j$.
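The token-to-word aggregation just described can be sketched in a few lines. This is an illustration only: the function name, the toy probabilities, and the choice of base-2 logarithms are assumptions made here, and no call to Chinchilla itself is shown.

```python
import math
from typing import Sequence


def word_surprisal(token_logprobs: Sequence[float], first: int, k: int) -> float:
    """Surprisal of a word whose k tokens start at index `first`.

    token_logprobs[j] is assumed to hold the log-probability of token j
    given all preceding tokens.
    """
    if k < 1 or first < 0 or first + k > len(token_logprobs):
        raise ValueError("token span falls outside the sequence")
    return -sum(token_logprobs[first:first + k])


# Toy conditional token probabilities (made up for illustration).
token_probs = [0.5, 0.25, 0.5]
token_logprobs = [math.log2(p) for p in token_probs]  # base 2 is an assumption

# Word 0 is a single token; word 1 is split into two tokens.
surprisal_word0 = word_surprisal(token_logprobs, first=0, k=1)
surprisal_word1 = word_surprisal(token_logprobs, first=1, k=2)
print(surprisal_word0, surprisal_word1)
```

Summing log-probabilities is equivalent to multiplying the conditional token probabilities, so the result is the surprisal of the whole word under the chain rule.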

Control Variables
In addition to the model-derived predictors described above, we also created a set of control predictors to capture lower-level lexical and acoustic properties of the stimulus. For lexical properties, we include log lexical frequency (freq) derived from the SUBTLEX-US corpus (Brysbaert and New, 2009) alongside a simple predictor for word offsets (wordrate), following Brennan et al. (2012). For acoustic properties, we include continuous measures of root-mean-squared power of the audiobook story (rms) and the prosodic contour of narration in terms of the fundamental frequency of speech (f0) derived using the Praat software package (Boersma, 2001).

Participants & Data
The data for this project come from the LPPC-fMRI corpus. The full corpus comprises 112 fMRI datasets collected while participants listen to the audiobook The Little Prince by Antoine de Saint-Exupéry in their first language of either English (N = 49), French (N = 28), or Mandarin (N = 35) (Figure 9, top left). In brief, participants listened on MRI-safe headphones to the audiobook story for about 90 minutes while lying supine for fMRI scanning. The audiobook was presented across nine runs, each corresponding to a chapter and lasting about 10 minutes. Images were acquired at 3T (MRI GE Discovery MR750). Structural scans were acquired with a T1-weighted MPRAGE sequence; functional scans were acquired with a multi-echo planar sequence with parameters: TR = 2 s, TEs = [12.8 27.5 43] ms, FA = 77°, matrix size = 72 × 72, FOV = 240.0 × 240.0 mm, image acceleration = 2×. The first four volumes of each run were discarded. Functional data were denoised using multi-echo independent component analysis (Kundu et al., 2012) prior to co-registering functional and structural volumes and normalization to the MNI atlas. We use the first 20 English-language datasets here (subject IDs 57-78).

Figure 9: The data comprises N=20 fMRI time-series recorded while participants listened to an audiobook story. Voxels were grouped according to the Human Connectome Project's multi-modal cortical parcellation and time-series were averaged within each region (top-right). The audiobook story was annotated for acoustic and lexical control variables along with state-of-the-art estimates of lexical prediction from the Chinchilla LLM and a family of structure-based predictors using both a CFG and the more expressive CCG formalism (bottom left). Hierarchical linear regression was used to evaluate the fit between linguistic predictors and neural time-series (bottom-right).
The choice of language reflects the availability of English-language training resources for the CCG parser (Hockenmaier and Steedman, 2007). This particular subset of English-language datasets was chosen randomly; the number of datasets seeks to balance broad coverage against computational feasibility, as detailed further below. All of the selected participants showed high accuracy on the comprehension questions (Mean = 94%, Range = [86% 100%]).
For each of these datasets, structural scans were parcellated into 360 cortical regions based on a volumetric preparation of the HCPMM1 atlas (Glasser et al., 2016) using the Nilearn package (Abraham et al., 2014). 5 Functional data from voxels within each parcel were averaged together, yielding fMRI time-series for each participant for each of 360 regions spanning the whole cortex (Figure 9 top right). These time-series were then converted to z-scores separately for each participant and for each scanner run. These time-series were divided into training and test sets by extracting the first 400 fMRI samples per participant and per ROI for model training (1,404,000 datapoints per hemisphere spanning approximately 13.3 minutes of scan time, excluding the first 10 volumes and the first volume from each subsequent run) and samples 401-800 per participant per ROI for out-of-sample testing (1,436,400 datapoints per hemisphere, excluding the first volume from each run).
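The datapoint totals quoted above can be checked with simple arithmetic. In the sketch below, the split of the 360 parcels into 180 per hemisphere follows the HCP multi-modal parcellation, while the numbers of retained volumes per split (390 and 399) are assumptions inferred from the reported totals rather than values stated in the text.

```python
n_participants = 20
n_regions_per_hemisphere = 360 // 2   # 180 parcels in each hemisphere
tr_seconds = 2.0                      # repetition time reported above

# Assumed numbers of volumes retained after exclusions (inferred, not reported).
train_volumes_kept = 390
test_volumes_kept = 399

train_points = n_participants * n_regions_per_hemisphere * train_volumes_kept
test_points = n_participants * n_regions_per_hemisphere * test_volumes_kept

# Scan time covered by 400 samples at TR = 2 s, in minutes.
train_minutes = 400 * tr_seconds / 60

print(train_points, test_points, round(train_minutes, 1))
```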

Statistical Analysis
To align the target and control predictors with the fMRI time-series, each predictor was convolved with the hemodynamic response function (modeled with the Nilearn library as a mixture of two gamma distributions) and re-sampled to 0.5 Hz to match the sampling rate of the data. For this operation, all predictors defined at the word level were represented as unit impulse functions time-aligned to word offset. The convolution operation increased collinearity between predictors, as they are all influenced by the speech rate of the audiobook stimulus. For each target regressor we therefore isolated the component of that term that is orthogonal to the space spanned by the wordrate control regressor and the constant vector 1. Even after orthogonalization, the correlation between cfg_topdown and cfg_leftcorner was very high (r > 0.95) and so only one term, cfg_topdown, was included in our analysis. Figure 10 shows the bivariate distributions and correlation coefficients for all pairwise combinations of regressors that were entered into a statistical model against the fMRI time-series.
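The orthogonalization step can be sketched as a least-squares residual. This is a minimal illustration with synthetic data; the function name and the simulated regressors are assumptions, and the HRF convolution and resampling steps are not shown.

```python
import numpy as np


def orthogonalize(target: np.ndarray, wordrate: np.ndarray) -> np.ndarray:
    """Return the part of `target` orthogonal to span{1, wordrate}."""
    target = np.asarray(target, dtype=float)
    wordrate = np.asarray(wordrate, dtype=float)
    basis = np.column_stack([np.ones_like(wordrate), wordrate])
    # Least-squares projection onto the basis; the residual is orthogonal to it.
    coef, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return target - basis @ coef


# Synthetic regressors: the target is deliberately correlated with wordrate.
rng = np.random.default_rng(0)
wordrate = rng.normal(size=200)
target = 0.8 * wordrate + rng.normal(size=200)

residual = orthogonalize(target, wordrate)
print(np.corrcoef(target, wordrate)[0, 1], residual @ wordrate)
```

After this step the residual has zero mean and zero inner product with wordrate (up to floating-point error), so any remaining association with the fMRI signal cannot be attributed to word rate alone.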

Fitting procedure
We quantify the fit from target and control regressors to the fMRI time-series using a Bayesian multi-level linear regression fit with the Stan platform for statistical modeling via the brms package in R (Bürkner, 2018; Stan Development Team, 2022). As schematically illustrated at the bottom of Figure 9, this model was fit against the parcellated fMRI data set aside for training described above (first ≈ 13.3 min per dataset); target and control regressors were included in the model as fixed effects along with random slopes by participant and by brain region (see Shain et al., 2020). The statistical model was defined as given below using an  Regressors were mean-centered prior to being entered into the model; note that because the data were z-scored, and therefore also centered, we simplify the modeling by excluding by-participant and by-region intercepts. We also simplify by not estimating the covariances between random terms. We specified weakly informative priors, including N(0, 1) for the model intercept, N(0, 2.5) for all other fixed effect parameters, and an exponential distribution with a rate of 1 for the standard deviation of hierarchical terms. Priors for all other terms conformed to defaults for brms 2.17.0. The posterior distribution was sampled with four Markov chains, each comprising 2,000 samples (1,000 for warm-up).
We fit two such models, one per hemisphere, using 32 cores (3.0 GHz Intel Xeon Gold 6154) on the "Great Lakes" high-performance computing cluster at the University of Michigan. The fitting procedure made use of within-chain parallelization across 8 threads per chain, and QR decomposition was applied to the population-level design matrix; fitting took approximately 6 days to complete per hemisphere. Model diagnostics did not indicate any problems with the procedure: there were no divergent transitions and we observed good mixing between chains (all R̂ < 1.01).

Statistical inferences
Inferences were conditioned on the fitted models in two ways. First, we examined the posterior distribution of the regression coefficients (β) for each target effect per region (see Figure 9 bottom right). Second, we also evaluate how each term contributes to the out-of-sample goodness of fit per region in the following way. 6 We define a measure of goodness-of-fit using the root-mean-squared error,

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where $y$ comes from the $n$ held-out testing samples of fMRI data per participant and $\hat{y}$ is the estimated value for each held-out sample using 500 draws of the posterior distribution of coefficients. We compute $\mathrm{RMSE}_{\mathrm{full}}$ for the model with all terms and, iteratively, $\mathrm{RMSE}_{\mathrm{term}}$, where each regressor is ablated by applying a circular shift of $n/2$ to the predictor vector of length $n$ prior to computing $\hat{y}$. This disrupts the contribution of each term to the out-of-sample model fit by removing the temporal alignment between predictor and data. From this, we derive the change in goodness-of-fit that can be uniquely ascribed to each term: $\Delta\mathrm{RMSE}_{\mathrm{term}} = \mathrm{RMSE}_{\mathrm{term}} - \mathrm{RMSE}_{\mathrm{full}}$. This value is higher when a particular term has a greater impact on out-of-sample goodness-of-fit in a particular region.
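The ablation procedure above can be sketched with NumPy. This is an illustration under simplifying assumptions: a single coefficient vector stands in for the 500 posterior draws, the data are synthetic, and the function names are chosen here for exposition.

```python
import numpy as np


def rmse(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Root-mean-squared error between observed and predicted values."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


def delta_rmse(X: np.ndarray, y: np.ndarray, beta: np.ndarray, term: int) -> float:
    """Change in RMSE when column `term` is circularly shifted by n // 2."""
    n = X.shape[0]
    full = rmse(y, X @ beta)
    X_ablated = X.copy()
    # The circular shift keeps the predictor's values but breaks its
    # temporal alignment with the data.
    X_ablated[:, term] = np.roll(X[:, term], n // 2)
    return rmse(y, X_ablated @ beta) - full


# Synthetic held-out data: column 0 drives the signal, column 1 does not.
rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 2))
beta = np.array([2.0, 0.0])
y = X @ beta + 0.1 * rng.normal(size=n)

d_informative = delta_rmse(X, y, beta, term=0)
d_null = delta_rmse(X, y, beta, term=1)
print(d_informative, d_null)
```

In this toy case, ablating the informative column raises the error substantially, whereas ablating a column with a zero coefficient leaves it unchanged. Repeating the computation over posterior draws would yield a distribution of values per term rather than a single number.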
We recognize as "statistically reliable" results where the following two conditions hold: (i) the 99% credibility interval (CI) of the β coefficient posterior distribution excludes zero, and (ii) the 99% CI of the ΔRMSE distribution excludes zero. ΔRMSE also offers a means to directly compare which of two terms offers a "better fit" to a given region. For a pair of terms with $\Delta\mathrm{RMSE}_{\mathrm{target}}$ and $\Delta\mathrm{RMSE}_{\mathrm{control}}$, we evaluate whether the posterior distribution of one is reliably greater than the other. This condition is met when the 99% credibility interval of $\Delta\mathrm{RMSE}_{\mathrm{target}} - \Delta\mathrm{RMSE}_{\mathrm{control}}$ excludes zero and the above conditions (i-ii) hold for the target term. This quantifies where held-out data are better captured when one term is added to the model over-and-above another.
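The decision rule can be sketched as a pair of interval checks on posterior draws. The sketch assumes an equal-tailed interval computed from percentiles, which may differ from the interval type used in the actual analysis; the function names and toy draws are illustrative.

```python
import numpy as np


def ci_excludes_zero(draws: np.ndarray, level: float = 0.99) -> bool:
    """True if the equal-tailed credible interval of `draws` excludes zero."""
    tail = (1.0 - level) / 2.0 * 100.0
    lower, upper = np.percentile(draws, [tail, 100.0 - tail])
    return bool(lower > 0 or upper < 0)


def reliable(beta_draws: np.ndarray, delta_rmse_draws: np.ndarray) -> bool:
    """Conditions (i) and (ii): both intervals must exclude zero."""
    return ci_excludes_zero(beta_draws) and ci_excludes_zero(delta_rmse_draws)


def better_fit(beta_target: np.ndarray, d_target: np.ndarray,
               d_control: np.ndarray) -> bool:
    """Target term beats control term in a region (pairwise comparison)."""
    return reliable(beta_target, d_target) and ci_excludes_zero(d_target - d_control)


# Toy posterior draws (made up): one clearly positive, one centered on zero.
positive_draws = np.linspace(0.5, 1.5, 1001)
centered_draws = np.linspace(-1.0, 1.0, 1001)

print(reliable(positive_draws, positive_draws))   # both intervals exclude zero
print(reliable(positive_draws, centered_draws))   # second interval spans zero
```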

Validation Checks
Before turning to our principal questions we first validate our procedures by checking that control predictors yield familiar results. The sound power of the audiobook stimulus, rms, should drive neural activity in primary auditory regions, and that is exactly what we observe in the results shown in Figure 11A. We also see similar confirmatory results for sentence prosody via the f0 control predictor (Figure 11B), which shows a reliable positive correlation with superior temporal activity bilaterally with a right-hemisphere bias; this matches other fMRI investigations of sentence prosody (e.g. Humphries et al., 2005).

Figure 11: Posterior mean of the regression coefficients (β) for several control variables; regions are colored where the 99% CIs of both β and ΔRMSE exclude zero. (A) Root-mean-square sound power localizes to the primary auditory cortex bilaterally. (B) Fundamental frequency, or F0, correlates with superior temporal regions and shows a right hemisphere bias. Together, these results show a concordance between our approach and several familiar effects.

CCG-based parser metrics outperform CFG metrics
We now turn to the primary analyses of interest, beginning with the question of whether predictors derived from the more human-like CCG parser, operationalized with ccg_left and ccg_revealing, better capture neural signals compared to the CFG parser steps used in prior work, operationalized as cfg_bottomup and cfg_topdown. Figure 12 shows the fitted values for the two CFG predictors along with a comparison between them. On their own terms, these results accord with previous studies of parser steps (e.g. Bhattasali et al., 2018; Brennan et al., 2016), including the statistical comparison favoring the more eager of the two strategies (Figure 12C; but cf. the electrophysiological results of Nelson et al., 2017). Note that the two comparisons of ΔRMSE plotted in panel C and in subsequent figures are not symmetrical: each comparison plot is masked based on the posterior distribution of the target regression coefficient. For example, ΔRMSE values shown on the left-hand side of Figure 12 are only plotted for regions where the 99% CI of the posterior distribution for the cfg_topdown coefficient excludes zero.
Our focus, however, is on the comparison of these predictors to those derived from the CCG parsers; those fitted values are summarized in Figure 13 and the comparison between CCG and CFG predictors is shown in Figure 14. CCG predictors, especially with the REVEAL operation, show a strong statistical fit with left perisylvian regions extending to the temporo-parietal junction (Figure 13). This pattern is also observed when comparing the relative contribution to out-of-sample model fit for each pair of CCG and CFG predictors (Figure 14, especially panel A). When they are ablated, CCG-derived predictors show a more reliable impact on model fit in fronto-temporal regions in comparison to the cfg_topdown predictor. On the other hand, we see some evidence that the cfg_bottomup predictor has a greater impact on the fit with fMRI activity in bilateral temporo-parietal regions (see right-hand sides of Figure 14B,D). This complex pattern of results points to an important nuance in our modeling effort: processing strategies need not be mutually exclusive. This recognizes that interpreting sentence structure is multifaceted and draws on a range of neural circuits for which the models we are considering are partial estimators (see Brennan and Pylkkänen, 2017, for some discussion). Still, to offer a more global comparison of these two classes of models, we sum ΔRMSE across all cortical regions for the two CFG terms and two CCG terms together. This yields a distribution estimating the effect of the family of terms on overall model fit.

The REVEAL operation improves the match between CCG and the brain
Next we turn to which formulation of the CCG parser offers the best fit to these fMRI data. The measure of parser steps derived when the REVEAL operation is included, ccg_revealing, reliably correlates with a range of left temporal, inferior frontal, and temporo-parietal regions, as shown in Figure 13A. The pattern of results is qualitatively different from what we observe with the ccg_left predictor that is derived without the REVEAL operation, which shows instead reliable fits with ventral and medial frontal regions (Figure 13B). A direct comparison of how each term affects goodness-of-fit against out-of-sample data shows that ablating ccg_revealing has the largest impact on posterior temporal regions, more so than ccg_left. There are no areas where the reverse pattern holds. These comparisons are shown in Figure 13C and they favor the REVEAL operation of Stanojević and Steedman (2019) inasmuch as it delivers a sequence of parser steps that better matches human left frontal and temporal hemodynamic signals recorded during sentence processing.

CCG-based parser steps capture fMRI data independently of effects for predictability
The above results are all observed in a statistical model that "controls" for predictability by including surprisal from the Chinchilla LLM as a co-regressor. We now turn to a direct comparison of the goodness-of-fit offered by surprisal in comparison to CCG-derived parser steps. Surprisal from an LLM, chinchilla_surprisal, reliably correlates with a broad range of bilateral fronto-temporal activity, with the strongest patterns observed in the superior temporal gyri (Figure 15A). This result matches previous efforts that specifically compare surprisal from large language models with human neural signals (Caucheteux et al., 2023), and also studies drawing on alternative methods for quantifying word-by-word expectations (Lopopolo et al., 2017; Henderson et al., 2016; Shain et al., 2020; Willems et al., 2016).
We check for linearly independent contributions of chinchilla_surprisal in comparison to the ccg_left and ccg_revealing terms by evaluating how removing each affects out-of-sample prediction via ΔRMSE. Indeed, we see that removing ccg_revealing reliably impacts goodness-of-fit in several regions including the left posterior temporal cortex, left anterior temporal pole, and left temporo-parietal regions to a degree that is reliably greater than any impact of chinchilla_surprisal. This comparison is illustrated on the left-hand side of Figure 15B. The reverse comparison, shown on the right-hand side of Figure 15B, reveals unique contributions of chinchilla_surprisal across the superior temporal gyri and superior temporal sulci bilaterally. The impact of predictability on superior temporal regions is also evident in the comparison to ccg_left shown on the right side of Figure 15C. The latter CCG predictor shows some evidence of independent contributions in ventral frontal regions, but this result should be interpreted cautiously given the overall better performance of ccg_revealing discussed above.
The broader pattern evident in these comparisons reinforces the complementary nature of expectation-based measures like chinchilla_surprisal and measures tied more directly to structural complexity like ccg_revealing. Both classes of predictors reliably and independently correlate with the fMRI signals and their effects are spatially distinct: the former is most prominent in bilateral superior temporal regions, while the latter is associated with inferior frontal, posterior temporal, and anterior temporal foci.

Figure 12 (caption, beginning truncated): ... and the 99% CI of ΔRMSE exclude zero. Panel (C) shows the direct comparison of these two predictors in terms of ΔRMSE; mean differences between predictors are plotted where ≥ 99% of the posterior distribution is above zero and the 99% CI of β for the first term excludes zero. Top-down parser steps show improved fit to the data compared to the bottom-up predictor in left middle temporal, temporo-parietal and superior frontal regions. There are no areas where the bottom-up predictor outperforms the top-down predictor.

Figure 13 (caption, beginning truncated): ... shows results for a left-branching strategy without that operation. Panel (C) shows the direct comparison of these two predictors in terms of ΔRMSE; mean differences between predictors are plotted where ≥ 99% of the posterior distribution is above zero and the 99% CI of β for the first term excludes zero. The CCG-based predictor incorporating the REVEAL operation leads to improved model fits in left posterior temporal and bilateral frontal regions; there are no regions where the absence of REVEAL reliably improves model fit.

Figure 14: Comparisons between CCG- and CFG-derived parser step predictors in terms of the relative out-of-sample goodness-of-fit quantified with ΔRMSE. Each panel plots the pairwise mean differences between predictors where ≥ 99% of the posterior distribution is above zero and the 99% CI of β for the first term excludes zero. CCG, especially with the REVEAL operation, outperforms CFG-based models along anterior and posterior regions of the left temporal lobe and the frontal lobes bilaterally. Bottom-up and top-down CFG-based predictors show relative improvements compared to CCG in temporo-parietal regions bilaterally and along the left superior frontal gyrus. The CCG-based predictors show an overall greater improvement in model fit than the CFG models (see main text).

Discussion
In this study we examine how well computational models of parsing effort capture variance in hemodynamic neural signals collected while participants listened to an audiobook story. The models we deploy span dimensions that have been debated in prior work in computational psycholinguistics and neurolinguistics: we compare parsing models based on a context-free grammar (CFG) with those based on a mildly context-sensitive grammar (CCG), and we also compare parsing models that differ in how eagerly they postulate new structure. For the CCG parsers, this includes testing whether the REVEAL operation yields a more "human-like" account of incremental parsing. These comparisons are evaluated in the context of control covariates which include a state-of-the-art estimate of next-word predictability from the Chinchilla large language model (LLM). The comparisons reported here carry particular significance for cognitive theories of language processing based on artificial neural networks, whose operation relates directly to human brain signals (e.g. Schrimpf et al., 2021; Caucheteux and King, 2022). They also speak to the theoretical possibility that structural complexity might only impact incremental processing through the "bottleneck" of predictability (e.g. Levy, 2008).

General considerations in reasoning about computational models of language processing
Before discussing the results in detail, we first briefly consider the kind of reasoning about cognition that is supported by the modeling approach we adopt. The overall approach fits into a tradition that seeks to address which computational model of language processing best describes the human system (e.g. Kaplan, 1972; Kimball, 1973; Berwick and Weinberg, 1984; Frazier and Clifton, 1996; Steedman, 2000). Brennan (2016) lays out in very general terms how this traditional approach applies to neural signals that have been recorded during naturalistic listening. In the terminology of that review, the present work is an "L-study." L-studies test linguistic and/or psycholinguistic theories by comparing different models against the same empirical measure of brain activity. All models meet a basic adequacy condition of "processing," "generating" or in some way "analyzing" the same text stimulus, so the meaning of the stimulus is held constant.
In fact the present work reports two closely related L-studies, each examining a different dimension. The first dimension is the grammar, i.e. Combinatory Categorial Grammar (CCG), naïve phrase structure (CFG) or no explicit grammar at all (LLM). The second is the parsing strategy, which only makes sense in the first two cases but not the third. These alternatives are detailed above in subsection 2.1. The form of the claims to be discussed in this section is always the same: because model M fits the brain data better than M′, the linguistic or psycholinguistic idea reified in M but not M′ is supported. Obviously, this sort of inference is limited to models whose consequences on the stimulus text can actually be calculated. Perhaps less obvious is the absence of any sort of exhaustivity claim. Rather than excluding classes of possible models, the methodology delivers relative rankings of actual models. For this reason it is important to prioritize, as we have done here, models that already enjoy some degree of plausibility in subdisciplines such as Linguistics and Artificial Intelligence.

Figure 15: (A) Posterior mean of the regression coefficient (β) for the surprisal predictor derived from the Chinchilla LLM; regions are colored where the 99% CIs of both β and ΔRMSE exclude zero. Direct comparisons between surprisal and CCG-derived parser steps are given for ccg_left (B) and ccg_revealing (C) in terms of ΔRMSE; mean differences between predictors are plotted where ≥ 99% of the posterior distribution is above zero and the 99% CI of β for the first term excludes zero. On its own terms surprisal from Chinchilla correlates with temporal and frontal activity bilaterally. In comparison, a CCG-derived predictor with REVEAL shows improved model fits in posterior middle temporal, anterior temporal, superior frontal, and temporo-parietal areas. On the other hand, Chinchilla shows improvements in model fit in the superior temporal gyrus bilaterally compared to CCG-derived predictors.

Mildly context-sensitive CCG and the REVEAL operation improve fit to neural signals
We turn first to the comparison of different parsing models in terms of their fit to hemodynamic data. In this effort, we build on prior work using fMRI (Brennan et al., , 2020; Henderson et al., 2016; Shain et al., 2020; Wehbe et al., 2014; Reddy and Wehbe, 2021; Bhattasali et al., 2018) and electrophysiology (Nelson et al., 2017; Brennan and Hale, 2019; Hale et al., 2018); that work itself is situated within a broader literature that aims to narrow down the neural subsystems uniquely engaged in sentence-level combinatoric processing (for reviews, see Zaccarella et al., 2017b; Matchin and Hickok, 2020; Pylkkänen, 2019). The key theoretical issue is whether the more expressive CCG, which better matches human languages, derives improved fits to fMRI signals relative to the less expressive CFG predominantly used in prior work. Models using the mildly context-sensitive CCG grammar show an overall better fit to neural signals, especially spanning the left frontal and temporal lobes, in comparison to models based on a CFG. This result is perhaps best illustrated in Figure 14A, which shows the comparison between ccg_revealing and cfg_topdown. This particular comparison is valuable, as those two models each showed the best fit to neural signals in comparison to others based on the same grammar (see Figures 12 and 13).
The improvement in fit for the CCG parsers obtains in spite of the fact that the accuracy of the CCG parser, while high on its own terms, is lower than that of the CFG parser that we use (CCG accuracy: 90.8%, Stanojević and Steedman, 2019, Table 4; CFG accuracy: 95.9%, Kitaev et al., 2019, Table 1). This result affirms the "hidden consensus" that human language syntax is built on a formal system with such expressivity (Joshi, 1985; Stabler, 2013). It is also consistent with one prior study comparing naïve phrase structure to X-bar structures derived from a more expressive Minimalist Grammar. That earlier work was limited insofar as the grammar was not broad-coverage. The current effort takes advantage of CCGbank (Hockenmaier and Steedman, 2007). The implicit CCGbank grammar is broad-coverage and seems to generalize in a psychologically realistic way to our stimulus text, an English translation of The Little Prince. Shain et al. (2020) and Brennan et al. (2020) also deploy grammars that might in principle be as expressive as CCG, but in practice both of those efforts ultimately rely on CFG-equivalent variants.
We note, again, that the conclusions we draw from the present data are limited to the "small world" of the models under consideration. Other instantiations of either context-free or mildly context-sensitive formalisms may yield different fits to neural time-series. The present effort does not allow for statistical generalization beyond the specific models under analysis. However, we contend that the present comparisons are well motivated: the bulk of prior neurolinguistic work in this domain has relied on CFGs of the type we deploy, and CCG is especially useful in our effort because its design principles reflect direct and incremental composition, which accords with human sentence processing (Steedman, 2000).
Comparison between models also addressed the dimension of eagerness: how readily is structure postulated based on partial input? Of special interest here is the utility of the REVEAL operation for CCG parsing developed by Stanojević and Steedman (2019). That operation was introduced to resolve a challenge in the efficient incremental parsing of optional elements such as right adjuncts. Our analysis suggests that the sequence of parsing steps when this operation is included in the model yields a better match to the dynamics of neural activity than when this operation is not included. This pattern was most striking in left posterior temporal regions (Figure 13), in accordance with theoretical accounts positing a special role for that region in combinatoric processing (Matchin and Hickok, 2020) and recent intracranial recordings that test for sensitivity to linguistic phrase composition (Murphy et al., 2022).

Structural complexity and next-word predictability capture complementary neural signals
This study also aims to tease apart structural processing, operationalized here as parser steps, from predictability. The import of the latter has been studied extensively in the cognitive sciences (de Lange et al., 2022, offer a broad perspective) and has been of central importance in the study of sentence comprehension (see Traxler, 2014, for a review). Crucially, structural complexity has been argued to modulate usage and, in turn, linguistic expectations derived from usage; under a strong formulation, structural complexity might only affect processing via the "bottleneck" of predictability (Levy, 2008; cf. Hawkins, 2014). Thus, a theoretical question is whether direct indices of structural complexity can be teased apart from word-by-word expectations within a naturalistic language comprehension task. To address that question we use the strongest-to-date estimator for next-word predictability, derived from the Chinchilla neural network LLM (Hoffmann et al., 2022). Against that baseline, CCG-derived structural complexity reliably correlates with activity in the left middle temporal gyrus, left temporal pole, angular gyrus and superior frontal gyrus (see especially Figure 15B). We observe complementary activation for predictability, operationalized as chinchilla_surprisal. That activation is observed, linearly independent of any effects for structural complexity, in the superior temporal gyrus bilaterally (right-hand panels in Figures 15B,C). This result accords with the striking match between LLM-estimated predictability and neural signals reported by Heilbron et al. (2022) and Caucheteux et al. (2023) for single-sentence comprehension. The bilateral temporal localization also matches reports from other efforts that examine more naturalistic comprehension using alternative methods to estimate predictability such as n-gram language models (Willems et al., 2016; Lopopolo et al., 2017; Brennan et al., 2016; Shain et al., 2020).
While these comparisons between complexity and predictability are consistent with predictability having a (relatively) focal effect in comprehension, we caution against drawing an overly strong conclusion here. For one, we expect based on both theoretical and empirical arguments that next-word predictions will correlate, at least to some extent, with structural complexity and other factors (such as word frequency). Within this context, the particular analysis just mentioned aims to isolate only those spatial correlates that are linearly independent among the effects included in our models. Indeed, areas where fMRI activity shows a reliable correlation with chinchilla_surprisal alongside other factors include the temporal lobes broadly, the temporo-parietal junction, the inferior frontal gyrus and posterior superior frontal areas (Figure 15A). At the least, our data are consistent with predictability having a large-scale modulatory effect on multiple, spatially distinct, facets of language comprehension.

On parsimony
Our analyses do not include predictors derived from the internal state of an LLM (cf. Schrimpf et al., 2021; Kumar et al., 2022; Caucheteux and King, 2022). That choice is driven by a focus on interpretability and parsimony in our modeling efforts; estimators for structural complexity are derived from the internal states of psycholinguistically plausible parsing models by counting parser steps. Deep neural networks play a role in instantiating the parsing models, both for the CCG parser used here (Stanojević and Steedman, 2019) and the estimates of predictability from Chinchilla (Hoffmann et al., 2022). To pursue a more direct comparison between interpretable parsing models and the internal states of LLMs opens up the question of how to balance the statistical fit of a model to neural signals against the complexity or parsimony of that model.
LLMs are, famously, not parsimonious. Chinchilla has 70 billion parameters, which is substantially smaller than other state-of-the-art LLMs of the same class (GPT-3 has 175 billion, Gopher 280 billion, etc.). The size of these models is only exceeded by the size of the training data used to fit them: Chinchilla is trained with 1.4 trillion language tokens.⁷ The CCG parser used here has ≈ 0.1 million trainable neural network parameters and is trained on ≈ 0.9 million tokens (CCGbank; Hockenmaier and Steedman, 2007).⁸ However, these counts overstate the parameter space spanned by the interpretable parsing model, because the trained parameters are used only for the disambiguation/predictability of the parse trees and do not participate directly in the complexity metrics used for our analysis. Our CCG parsing effort is measured only by the number of parsing actions taken per word, not by their predictability. In principle, the CCG parser makes one of a finite set of choices per step (shift, reduce, reveal, etc.). From that perspective, the size of the CCG model is on the order of 445 possible choices.⁹ The upshot of this discussion is twofold. First, independent effects of structural complexity from the CCG parser, in comparison to Chinchilla, stand out against the massive difference in complexity between those two models. Second, connecting our work with efforts that align brain data directly to the internal states of LLMs is made difficult by the lack of parsimony in those efforts.
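The size comparison above can be checked with a few lines of arithmetic. The sketch below uses only figures stated in this section and its footnotes; the CCG parameter and token counts are the approximate ("≈") values given in the text, and the variable names and the 365-day year are our own simplifying choices.

```python
# Transition inventory of the CCG parser, as listed in the footnote.
shift_transitions = 425
reduce_transitions = {
    "binary combinators": 5,
    "unary type-raising combinators": 2,
    "unary type-changing rules": 12,
    "revealing operation": 1,
}
total_reduce = sum(reduce_transitions.values())
total_transitions = shift_transitions + total_reduce

# Parameter and training-data counts (CCG values are approximate).
chinchilla_params = 70_000_000_000
ccg_params = 100_000
chinchilla_tokens = 1_400_000_000_000
ccg_tokens = 900_000

param_ratio = chinchilla_params / ccg_params
token_ratio = chinchilla_tokens / ccg_tokens

# Exposure rate needed to hear Chinchilla's training data in 10 years,
# assuming 365-day years and input at every second of every day.
seconds_in_10_years = 10 * 365 * 24 * 60 * 60
words_per_second = chinchilla_tokens / seconds_in_10_years

print(f"possible parser choices: {total_transitions}")
print(f"parameter ratio: {param_ratio:,.0f}")
print(f"token ratio: {token_ratio:,.0f}")
print(f"words per second over 10 years: {words_per_second:,.0f}")
```

Under these figures the two models differ by roughly six orders of magnitude in both parameters and training tokens, and the required exposure rate comes out above 4,000 words per second, in line with the footnote.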

Conclusion
We find evidence that CCG captures neural activity above and beyond that which correlates with predictability from a large language model and parsing steps of context-
⁷ To put this in perspective, to have equivalent linguistic experience a 10-year-old child would need to be exposed to over 4,000 words per second every hour of every day (see Bergelson et al., 2022, for a large-sample, cross-cultural study of naturalistic linguistic input).
⁸ The parser of Stanojević and Steedman (2019) also contains non-trainable parameters (i.e. they are fixed during parser training) that come from pretrained ELMo embeddings (Peters et al., 2018). But even if ELMo data and parameters are included in the CCG model size, it is still many orders of magnitude smaller than Chinchilla in both data and parameter size. Similar numbers also hold for the Benepar constituency parser, which uses pretrained BERT embeddings (Devlin et al., 2019).
⁹ The parser has 425 different shift transitions, one for each possible lexical category, and 20 different reduce transitions (5 for binary combinators, 2 for unary type-raising combinators, 12 for unary type-changing rules and 1 for the revealing operation).
Figure S1 caption for details.