Task‐induced brain functional connectivity as a representation of schema for mediating unsupervised and supervised learning dynamics in language acquisition

Abstract Introduction Based on the schema theory advanced by Rumelhart and Norman, we shed light on the individual variability in brain dynamics induced by hybridization of learning methodologies, particularly alternating unsupervised learning and supervised learning in language acquisition. The concept of “schema” implies a latent knowledge structure that a learner holds and updates as intrinsic to his or her cognitive space for guiding the processing of newly arriving information. Methods We replicated the cognitive experiment of Onnis and Thiessen on implicit statistical learning ability in language acquisition but included additional factors of prosodic variables and explicit supervised learning. Functional magnetic resonance imaging was performed to identify the functional network connections for schema updating by alternately using unsupervised and supervised artificial grammar learning tasks to segment potential words. Results Regardless of the quality of task performance, the default mode network represented the first stage of spontaneous unsupervised learning, and the wrap‐up accomplishment for successful subjects of the whole hybrid learning in concurrence with the task‐related auditory language networks. Furthermore, subjects who could easily “tune” the schema for recording a high task precision rate resorted even at an early stage to a self‐supervised learning, or “superlearning,” as a set of different learning mechanisms that act in synergy to trigger widespread neuro‐transformation with a focus on the cerebellum. Conclusions Investigation of the brain dynamics revealed by functional connectivity imaging analysis was able to differentiate the synchronized neural responses with respect to learning methods and the order effect that affects hybrid learning.


| INTRODUC TI ON
So-called unsupervised learning (USL) implies spontaneous grasping efforts to acquire knowledge and skills (Clark, 2001), whereas supervised learning (SL) refers to accepting the guidance of a teacher to improve levels of ability in target domains (Fagg, 1994). The relationships between USL and SL have been defined according to the theory of "schema" (Bartlett, 1932;Schank & Abelson, 1977), which has been previously defined as "active, interrelated knowledge structures, actively engaged in the comprehension of arriving information, guiding the execution of processing operations" (Rumelhart & Norman, 1976). The schema theory was initially applied to the field of learning motor skills (Salmoni et al., 1984;Sherwood & Lee, 2004). The term was later broadened to accommodate an artificial intelligence-oriented perspective: "a schema is what is learned about some aspect of the world, combining knowledge with their corresponding application processes; a schema instance is an active deployment of these processes" (Arbib, 1992). To date, the schema theory has been further diversely conceptualized across various cognitive domains.
The schema theory can similarly be applied to characterize verbal learning (Carrel & Eisterhold, 1983), although its implementation in the language learning literature has been largely limited to the domain of reading or listening comprehension (Al-Issa, 2006;Long, 1989). Extending the schema theory to speech acquisition paradigms would allow for the better understanding of innate language development and subsequently facilitate targeted language training. For example, the canonical babbling of infants recruits both procedural and declarative learning processes, which can be further elucidated as the cooperation between USL and SL. The babbling process is initially predominated by USL, where an initial schema is formed and guides the implicit recognition of auditory patterns that were previously encountered. SL then begins to contribute at around 6 months of age. Selective attention emerges at this age, allowing for direct comparisons between explicit mental storage of auditory patterns and external canonical feedback that in turn facilitates schema update and development (Vihman, 2017). On the other hand, even though SL is a typical model of successful second-language learning at school (Lyster et al., 2013), the learning of canonical categories models (Gogate et al., 2006;Goudbeek et al., 2009) will be somewhat influenced by the "schema" pre-established from the speaker's native language acquisition, making it subject to a mixed model that includes both SL and USL. This is because, through implicitly learning the prosodic cues (Brodsky et al., 2007;Da Silva, 2015;Johnson & Heidl, 2009) and transitional probabilities between patterns for word identification (Pelucchi et al., 2009;Saffran et al., 1996;Xie, 2012), infants not only acquire their native tongue but also establish a lasting "schema" that continues to be influential in later learning of a second language (Cutler et al., 1992;Finn, Hudson, & Kam, 2008;Otake et al., 1993;Rodriguez-Fornells et al., 2005;Spivey & Marian, 1999;Weber & Cutler, 2006). These examples show that both first-and second-language learning fall under a form of "self-supervised learning" with interactive processing and organization of knowledge and can therefore be effectively modeled via the schema theory (Kröger & Bekolay, 2019).
A latent "schema" is the medium responsible for creating learning dynamics by relaying active spontaneous formation and passive reception and thus mediates between USL and SL (Kröger & Bekolay, 2019;Selinker, 1972;Vihman, 2017). According to Rumelhart and Norman (1976), a schema can be updated according to three modes of learning: "accretion," "tuning," and "restructuring." When incoming information coincides with the agent's original intrinsic schema, that particular experience will be engendered through USL and added to the pre-established cognitive space through the process of "accretion." However, when discrepancy arises between new information and the original schema, the agent actively employs SL to update the latter, either through "tuning" variables to make minor structural changes when conflicts are mild or through largescale "restructuring" events where a completely new schema is developed to account for the incoming information.
Note that the modulation of the schema can be reformulated and computationally implemented through the connectionist model based on the error-based learning theory (Cleeremans & McClelland, 1991;. More specifically, we presume that updating the schema would be initiated in each learner by a cycle of prediction, error detection, and subsequent readjustment of the learning function. This learning process is analogous to back propagation in the middle (hidden) layers of an artificial neural network. By repeating the learning steps, the weights given to the middle (hidden) layers are sequentially renewed such that the residuals between the output and teacher signals to be learned are minimized and converge toward stability. This process of back propagation, a key defining step in machine learning, might be algorithmically similar to that of underlying mechanisms of schema update to account for learned new experiences. We therefore hypothesize that this schema-update mechanism is physically implemented in the human brain.
However, there is limited literature on the neural mechanism that creates such dynamic schema regulation processes in language learning. A number of clues have been found that might eventually elucidate the neural underpinnings of USL, SL, and their intermediating schema. Previous studies have demonstrated that the neural systems underpinning nonmodality-specific learning involve the basal ganglia, cerebellum, and limbic system (Caligiore et al., 2019;Hertrich et al., 2016). Referencing the functional network connecting these areas in the brain, Caligiore et al. (2019) formulated a hypothesis of "superlearning" as an integration of SL, USL, and reinforcement learning. His hypothesis elaborated a model based on the ideas of Schweighofer et al. (2004) to explain how USL could strengthen SL via an information transfer mechanism. However, the dynamism of the mutual relationship between the various types of learning is hitherto unexplored from the perspective of individual variability in modulation of the functional network.
The neural implementation of schema update is further complicated by previous study designs that are longitudinal and sparsely sampled in nature, where data from two or more functional magnetic resonance imaging (fMRI) sessions were obtained for subjects over the course of several months (Bubbico et al., 2019;Chai et al., 2016;Saidi et al., 2013;Vinals, 2016). Although such experimental designs allow for the detection of long-term changes induced by learningfor example, for a change in association weights of the default mode network (DMN) through resting-state functional connectivity (FC) analyses (Fox & Raichle, 2007;Raichle, 2015;Raichle et al., 2001;Stawarczyk et al., 2011)-they lack sufficient specificity to capture the time-sensitive neural dynamics of online language learning. We therefore propose a more tailored paradigm in which participants perform language learning tasks while in the fMRI machine. This would capture neural response corresponding to subjects' linguistic performance in real time. We employed task-based dynamic brain FC analyses, as it is the most appropriate target measurement used in studies featuring similar study designs and analytical purposes (Fong et al., 2019). On the behavioral end, by exposing participants to a continuous stream of language stimuli, we explore what synchronization across brain areas supports more or less self-paced learning through immersion (Wong et al., 2014).
Task-induced FC has been suggested to play an important role in conventional fMRI research (Bassett et al., 2006;Eguíluz et al., 2005;Hutchison et al., 2013;Preti et al., 2017), and a recent review by Gonzalez-Castillo and Bandettini (2018) recapitulates the main features of this network connectivity: i) the DMN most characterizing the resting-state connectivity can still be identified during task performance with executive control networks (ECNs) (Cole et al., 2014;Krienen et al., 2014); ii) there is an overall increase in across-network connectivity, specifically in domain-specific across-network connectivity, that is positively correlated with task performance (Cole et al., 2014;Shine et al., 2016); and iii) dynamic FC patterns reveal performance quality, which allows inferences of interindividual differences in perception, learning, and other cognitive abilities (Deng et al., 2016;Gonzalez-Castillo & Bandettini, 2018;Kepinska et al., 2017;Mennes et al., 2010). In particular, spontaneous cognitive processes that typically reflect resting scans (Hurlburt et al., 2015), such as slow rumination, self-interrogation in interior monologue, repeated memorization, and recall, might also be observed in individuals who surrender their ears to an immersive flow of words with a certain rhythm (Simony et al., 2015); the study of spontaneous cognitive processes would be effective for elucidating the interlanguage schemas in USL and SL.
When tapping into the process of schema update, an FC analysisbased approach would also provide insight into connectionist modeling because the artificial neural network must have approximate counterparts in the real brain systems. Several prior studies have put forth principles of correspondence, including the dual-path model of a recurrent neural network (Chang, 2002;Chan et al., 2006). Such a model prevails due to its potential to reflect any possible viewpoints that could be postulated in linguistic processing while allowing for the identification of underlying neural mechanisms. Future studies can consider this neural network perspective with the goal of characterizing task performance and its underlying processes by mapping brain regions to nodes in an artificial neural network. However, the neural basis of schema updating, which would provide insights into the connectionist modeling, has yet to be uncovered. Looking toward the future of artificial intelligence, this study attempted to elucidate how task-induced functional connectivity-especially DMN-could be modulated in the brain as a result of the schema update achieved by superlearning in language acquisition, which would be formed by the alternation of USL and SL and controlled by transitional probabilities between patterns for word identification.

| Overview
It is acknowledged that the online language learning paradigm in fMRI sessions should be designed to meet requirements of simplicity and distinctness in terms of syntactic rule and phonetic regularity. Designed for this purpose, artificial languages have often been used in neurolinguistic experiments that apply mathematical approaches to probe probabilistic sequential pattern learning (Dienes et al., 1995;Kepinska et al., 2017). In our fMRI test, in which we use an artificial grammar language as a proxy of second language (Ettlinger et al., 2016), a USL run is provided at first with a meaningless sound stream composed of sequences of syllables. We ask participants to detect patterns of syllables that they consider to be "words" without revealing the correct segmentation pattern: participants choose their preferred "word" from each of six pairs of syllable patterns in the subsequent testing block. From this first test trial, we attempt to identify a particular characteristic of statistical learning that might depend on a subjective difference in probability, or "preference" in decision-making, in implicit but unsupervised sequential learning (Folia et al., 2011;Kaufman et al., 2010;Zizak & Reber, 2014). Here, any disyllabic word segmentation that the candidate extracts from the artificial speech follows either the high/low or low/high appearance frequency patterns, stemming from the candidate's bias toward forward and backward transitional probabilities and pitch patterns that relate to the stress position in each word candidate. This USL run is followed by three consecutive SL runs with different auditory materials, all of which conform to a structural pattern rule that remains identical throughout the whole experiment. For each SL run, prior to presenting the continuous syllable stream, six disyllabic words to be learned in that particular run are introduced as correct answers to be chosen in the ensuing test with the same alternative form as the preference selection in the USL run. The upcoming and final USL run, identical to the one presented at the very start of the experiment, completes the experiment while demonstrating how the previous three SL runs affected the participant's initial schema, effectively changing their responses without being noticed.
In this study, each subject's preference in word segmentation revealed through the first USL runs represents his/her own covert schema or interlanguage. The subsequent SL runs specify pseudowords that follow the same canonical schema (without the subject being fully aware) throughout the rest of the experiment, maintaining one of the coherent interpretations of the artificial grammar. The private schema of each subject, activated in the USL runs, might either accord with or contradict the "teacher signal word" that subjects are coerced to learn and recognize in the SL runs, although they might be unaware of it despite a gradual influence of the hidden oriented pattern control. If there is a subject-level match between the subject's inherent USL schemas and those provided during SL, the schema is "accreted" and reinforced throughout the entire experiment, allowing the subject to learn the new language comparatively effortlessly. Otherwise, they will be obliged to thoroughly "tune" and even "reconstruct" the individual schema to be more or less in line with (or even against) the latent but objective presence of the linguistic rules presented. In short, dynamic FC enables the detection of the brain networks involved in maintaining or updating schemas between USL and SL of a language that are contingent upon characteristics specific to the individual.

| Participants
The study participants were 22 Japanese university students (13 male, 9 female; aged 20-27 years and all right-handed) who attended the Kenshinkai Tokyo Medical Clinic, Japan. Ethical approval was obtained from the Human Investigation Committee of Tokyo Institute of Technology. All volunteers signed a written informed consent form. Prior to the fMRI session, a General Self-Efficacy Scale questionnaire was administered to all study participants to glean information on their characteristics. All subjects were presented with USL demonstration material, containing sound stream and disyllabic combinations that were different to those used later on in the actual run, before entering the scan room. This experience was provided so that the study participants could understand the task procedure easily but without giving any hint of the structure of the test material, except the knowledge that the word candidates were all disyllabic and embedded several times within the exposure (learning) phase.
At the end of the session, the subjects were asked to complete a post hoc survey to determine how they evaluated their performance in the experiment. The survey included items that queried the quality of the participants' performances in the first USL, the SL, and the second USL. We also asked participants to report using the 4-point Likert scale the extent to which they were convinced that they faithfully identified the "true" words which appeared in the USL sessions.
Data for two male subjects were excluded from the analysis because of excessive head motion (n = 1) and falling asleep during the scan (n = 1).

| Task design
This study replicated the experiment on implicit statistical learning ability (Mirman et al., 2008;Perruchet & Desaulty, 2008;Xie, 2012;Yim & Windsor, 2010) performed by Onnis and Thiessen (2013) and included the same experimental task for the USL runs with the addition of new factors: namely, prosodic variables and explicit SL runs. The original unsupervised artificial grammar learning experiment aimed to detect the sensitivity of the subjects to patterns of forward-backward transitional probabilities as a function of language background and environment through the task of finding word boundaries in a continuous speech stream designed on an AG (Onnis et al., 2015). According to Onnis et al., the forward transition probability can be defined for any given sequence of elements XY as the probability of Y given X, calculated by normalizing the co-occurrence frequency of X and Y by the frequency of X. In contrast, the backward conditional probability for the sequence of XY is the probability of X given Y, so is calculated by normalizing the co-occurrence frequency of X and Y by the frequency of Y. As a consequence, the combination of low frequency followed by high frequency syllables is defined as S-HiLo, and vice versa. We slightly revised the terminology of Onnis et al., such that the "forward low-backward high" probability was abbreviated as "syntactically HiLo" or "S-HiLo" and "forward high-backward low" probability as S-LoHi. The reason for this modification was that in the present fMRI experiment, another HiLo or LoHi contrast, "prosodic-HiLo" ("P-HiLo") or "prosodic-LoHi" ("P-LoHi"), was also implemented from the information of pitch patterns that varied the stress location in each word candidate unit.
Note that because we took advantage of a text-to-speech engine designed to replicate the specific phonation features of each language in our study, we had no rigid or consistent distinction between "pitch," "accent," and "stress," and these phonetic factors for differentiating P-Hi and P-Lo were configured to render some independent syllables more explicit as "Hi" than "Lo" in each language. Some syllables are more easily interpreted as P-Hi or Pi-Lo (they tend to fall at the extremes of the pitch spectrum and are therefore more easily classified) compared to other syllables due to their intrinsic phonetic characteristic.
We faithfully recreated the experimental design of Onnis et al. in the first and last USL runs, when subjects underwent a word segmentation activity for deciding ambiguous lexical borders within a text spoken in similar types of artificial languages. Given the equal frequency of Hi-Lo and Lo-Hi patterns in the unsupervised training material, there was no "correct" extraction criterion in the USL for splitting the units from the sound stream, revealing an individual preference of the underlying syllable frequency and/or stress distribution patterns. The auditory stimuli were transmitted to each participant via the MRI-safe Serene Sound system (Resonance Technology Inc., Northridge, CA, USA). In the learning phase, subjects were prompted to detect compact sound sequences in a continuous syllable stream that were noticed multiple times, thus seeming optimal to be recognized as "word" instances. In the testing phase that followed soon after exposure to the speech stream, 12 pairs of disyllabic word candidates were provided in succession, prompting the participant to select one or the other as a valid word that they extracted from the learning phase; that is, each set of 12 forced choices was performed after the subjects had compared two disyllabic items representing "S-HiLo" and "S-LoHi" patterns (or vice versa, with equal probability of appearance) in turn. Subjects held a two-button MRI-safe fiberoptic response pad (Current Design Inc. 932-fORP) in their left hand and were instructed to press either the yellow button with the index finger if they chose the first item or the blue button with the middle finger if they chose the second.
When pairs of disyllabic units were presented for comparison, the stress was always assigned to the first syllable, so there could have be some units where the stress position was different between their presentation during the learning and testing phases.
In the USL runs, with a voice of moderate modulation, words with "forward high-backward low" probability, that is, S-HiLo, all have prosodic HiLo patterns (P-Hilo). This means that their stress position in the learning phase remains the same as that in the testing phase. However, the "forward low-backward high" probability is phonologically implemented in the converse order between the learning and testing phases so might confound the judgment of learners who inherently prefer the S-LoHi and/or P-LoHi pattern (since all the word candidates proposed during testing were P-HiLo).
Taking into account such inconsistency at the pronunciation level, we use the brevity code of "-l" for listening and "-t" for testing to express the exact stress information, such as P-HiLo-l, P-LoHi-l, and P-HiLo-t (with no instance of P-LoHi-t). The three SL runs shared the same transitional probability patterns but subjects were requested to learn six preselected disyllabic words in advance whether or not they liked them. The correct "words" to be memorized were all of the S-HiLo pattern and, unlike the word candidates in USL, the language of which lacked rich intonation, their prosodic patterns in the speech streams were diverse, including prosodically neutral ones, so the prosodic information itself did not provide particularly decisive cues for detecting word boundaries. The task of subjects in the SL runs was to recognize the learnt words in the listening part and perform the recall test, the form of which was identical to that of the USL runs.
In summary, each fMRI session was composed of five runs in the order of one USL run, three SL runs, and a repetition of the first USL run. In the SL runs, different stimulus sets were provided in a fixed order across participants. Each run was subdivided into a learning (listening) phase and a testing phase. In the USL runs only, the learning phase was preceded by a resting-state scan for 40 s with a fixation cross mark at the center of the screen. This part of the USL was replaced with instruction of correct words in SL, in which six S-HiLo F I G U R E 1 Time courses of the USL and SL runs. Above: Time course of the USL runs voiced by Veena. Below: Time course of the SL runs with the voices of Amelie (featured in this figure), Anna, and Luciana, for each of the three runs, respectively. They are composed of a learning phase that differs by the opening stimulus (a 40-s resting state for USL and repeated presentation of correct wordsall "S-HiLo"-for SL), and a following testing phase. SL, supervised learning; USL, unsupervised learning words were sequentially presented six times. The forced choice task was carried out at the end of each run to evaluate the articulation preference in the USL and the memorization accuracy in the SL. The auditory and orthographic stimulus creation, playback, and behavioral response recordings were made using E-Prime 2.0-Standard (Psychology Software Tools, Inc., Sharpsburg, PA, USA). The time course of a trial for the USL and SL is depicted in Figure 1.

| Learning phase
Although these frequency patterns were fixed in this experiment, the particular set of stimuli involved were entirely different across runs except for the two USL sessions. The integer at the end of each pattern string is the number of a word candidate in an artificial-language sample set. Each sound stream in the learning phase was created by the "say" command of MAC OS using different synthesized female voices ("Veena" as a voice for Indian English in USL; "Amelie," "Anna," and "Lucianna," respectively representing Canadian French, German, and Brazilian Portuguese in SL) with the speech rate set to 200. It consisted of the following continuous sequence based on the rules of a stochastic Markovian grammar chain conceived by Onnis et al.: X, a1, Y, b1, X, a2, Y, b3, X, a3, Y, b1, X, a2, Y, b3, X, a3, Y, b2, X, a1, Y, b1, X, a2, Y, b2, X, a1, Y, b1, X, a2, Y, b2, X, a1, Y, b3, X, a3, Y, b2, X, a2, Y, b3, X, a1… The probabilities of appearance of X and Y (highly frequent) were both 25% through one learning sound stream and the remaining half of the syllables were almost equally assigned to a1, a2, a3, b1, b2, and b3 (infrequent). This sequence can be segmented into either of the following syllable patterns. Note that 'Hi'/'Lo' does not mean a high/low frequency of appearance; on the contrary, 'S-HiLo' as a whole represents the "forward high-backward low" probability Note: A bold font is used for listing the materials in the testing phases and to indicate the locations of prosodic stress within each of the disyllabic items within the speech stream of the learning phases (during testing, the stress was consistently given to the first syllable).
whereas 'S-LoHi' represents the "forward low-backward high" probability in parsing. The stimulus materials are summarized in Table 1. The concrete stimulus materials were as follows.

S-
The first and last USL run: The gothic fonts used in the list of word candidates in Table 1 imply the position of stress in the continuous speech stream, so that the word corresponding to S-HiLo.1 'b1 X' in the SL run 2, that is "ropi," for example, has a prosodic emphasis at the second syllable "pi" during the listening phase as an instance of "P-LoHi-l." Thus, this word was heard by the subjects with different pronunciations between the learning and testing phases because any disyllabic word when cut up in the testing had an initial stress as "P-HiLo-t." The words without gothic fonts were more or less phonetically monotonous in the learning.
Noteworthy is that the Indian English voice "Veena" used for the USL sessions solely presented a consistent prosodic stressing rule, enabling us to adjust the syntactic and prosodic patterns such that all the S-HiLo words were P-HiLo and all the S-LoHi words were P-LoHi in our learning phase dataset.
The number of syllables was always 524 across the training sets but those contained within approximately 8 s at the beginning and ending of each sequence were more or less difficult to hear under the fade-in and fade-out effect used to eliminate hints (temporal bias) for detecting word boundaries. Although the speech rate was always set to 200, the total time lengths were different because of the specifics of the voices generated by the "say" program (minimum 210 s, maximum 300 s). Therefore, the numbers of scan volumes differed across runs. Our artificial grammar samples shared no surface features and were meticulously constructed to eliminate cues for phonetic or conceptual association due to a resemblance to real Japanese words.

| Testing phase
There were 12 testing trials posterior to the learning phase of each run. We induced in this phase a set of forced choices to be performed after letting the subjects compare two disyllable items representing "S-HiLo" and "S-LoHi" patterns (or vice versa, with equal probability of either order being presented) in turn. Each subject held an MRIsafe response pad in the left hand and was requested to press either the yellow button by the index finger if choosing the first item or the blue button by the middle finger if choosing the second one. The time course for each trial was as follows. The subject was asked to perform the forced choice after being presented with the second disyllabic word (which had a 50% chance being S-HiLo or S-LoHi), and the subject was not allowed to make choices before presentation of both words was completed (Table 2).
All the testing sets across the five runs consisted of the following 12 ordered combinations but using completely different voice materials. The integer at the end of each pattern string represents each word candidate number in an artificial language sample set. Table 3 represents the order of the samples provided in each testing trial.

| Methodology
The initial data processing was performed through the default pre- We mainly used the ROI-to-ROI connectivity analysis to evaluate the magnitude of t-contrasts between learning conditions. Given that the aim of our research is to explore the neural basis of broad general concepts ("meta-concepts") that can be recapitulated by the abstract term "schema," it is neither worthwhile to pre-determine a few seeds to serve as important nodes nor to extract significant voxels within each of the target regions scrutinizing tables of functional anatomy. Hence, we analyzed the FC graphs of nodes in the whole region and performed an exhaustive assessment of the overall edges using t tests corrected for multiple comparisons. The significant ROIto-ROI connectivity was thus taken as the neural "representation" of each learning method with its corresponding schema; this representational difference between USL and SL is viewed as the modulation of schema brought about by the switch of learning methodologies.

| RE SULTS
Upon completion of the session, no participant stated that he or she was aware of the equal frequency distributions of disyllabic TA B L E 3 Twelve patterns of forced choice task administered during the testing phase of each run and their order of presentation Test 1: S-HiLo5, S-LoHi5 Test 2: S-HiLo1, S-LoHi1 Test 3: S-LoHi2, S-HiLo3 Test 4: S-HiLo2, S-LoHi2 Test 5: S-HiLo4, S-LoHi4 Test 6: S-LoHi3, S-HiLo4 Test 7: S-HiLo6, S-LoHi6 Test 8: S-LoHi5, S-HiLo6 Test 9: S-LoHi1, S-HiLo2 Test 10: S-LoHi6, S-HiLo1 Test 11: S-LoHi4, S-HiLo5 Test 12: S-HiLo3, S-LoHi3 TA B L E 4 Results of ROI-to-ROI connectivity analysis for contrast between first unsupervised learning, the supervised learning, and second unsupervised learning combinations, that is, the objectively equal probabilities of existence of all the candidate words. Interestingly, subjects who rated their performance as poor in the first USL run tended to report being satisfied by their competence in the second (last) USL run (r = .739, p = .0002 for the correlation between the binary answers to the two questions posed about the two USL runs in reverse for the coding of yes: 1, no: 0). Nevertheless, there was no correlation between these self-assessments of each separate USL performance and test scores in the SL run (r = .1 for the first USL and r = .204 for the last USL).
Hence the accuracy of memory testing in the SL could be treated as a factor independent of self-assessment on the preceding and following USL performances. However, there was a significant correlation between the SL test accuracy rate and the four-scale conviction degree ratings of the subjects when asked whether or not they could finally ascertain the "assumed intended" words as a whole (r = .543, p < .05).

| Contrast analysis between the first USL, SL, and second USL runs
By forcing the subjects to alter their learning strategies (USL versus SL), which started with USL and then moved on to three SL trials and then looped back to USL for a second time, we could set three contrasts between any two of the three merged sessions (first USL with one run, SL with three runs, and the final USL that was identical to the first one) for evaluation of the difference in strength of association across the exhaustively enumerated edges, that is, ROI pairs, via a t test with a corrected false discovery rate of p < .01: A) first USL versus last USL; B) first USL versus SL, and C) last USL versus SL. The results of the ROI-to-ROI analysis for these contrasts are as follows and listed in Table 4.
A The between-region links that were significantly stronger for the first USL than for the SL were between i) the right inferior fron-

| High-score group versus low-score group
Next, we highlight the key findings related to how the subjects answered correctly or not before finishing each SL session. Considering the SL accuracy distribution (mean 0.72, standard deviation 0.21), we subdivided the cohort into a high-score group of 10 subjects, certified as skilled learners, who reported with more than 80% precision, and a low-score group of 10 subjects with less than 80% accuracy.
Reviewing the 4-point Likert scale scores for self-assessment of task performance in the scanner, there was a significant correlation between the accuracy scores for the post-SL tests and the extent to which the subjects thought that they could understand the intended words as a whole, including on the USL runs (r = .543, p < .05). The apparent assumption here is that there was relationship to be explored in each subject between the discovery of knowledge during the USL run and word acquisition in the SL, despite the language of each run being independent and unlike material-wise, except at the level of syntactic or phonetic grammar. However, it is noteworthy that, at the end of each USL, half of the high-score group showed a tendency to choose lexical candidates with the S-HiLo pattern, specifically used to provide the words to be learnt in the SL runs, whereas the low-score group showed no such tendency. The precision rate in the post-SL tests was significantly correlated with the number of times they chose the s-HiLo pattern syllables in the first post-USL (r = .498, p = .026) and the total two post-USL sessions (r = .504, p = .024), although this relationship did not reach significance in the second post-USL session (r = .428, p = .060). Table 5 enumerates the significant links (p <.01 FDR, two-sided) in the ROIto-ROI analysis and contrasts the performances of the high-score group and the low-score group from the connectivity data that tar-

| Schema-matching group versus schemamismatching group
As previously mentioned, the preference in syntactic or prosodic patterns, which was revealed through the USL runs, is crucial for lexical acquisition in SL, where these patterns serve as hidden cues for memory retrieval. The disyllable segments with the S-HiLo pattern shared the same accent position between learning and testing (preference disclosure) in the USL runs, but for the S-LoHi words, the prosodic pattern was inverted to P-LoHi for listening and P-HiLo for testing. Therefore, the subjects were subdivided into a schemamatching group and schema-mismatching group according to the outcome of preference. The schema-matching group included 6 subjects who elicited S-HiLo propensity at least once during the two USL runs, while the schema-mismatching group contained the remaining 14 subjects, who had never shown such behavior despite the prosodic incongruity in the opposite S-LoHi words. All the members of the schema-matching group belonged to the high-score group except for one subject who had an accuracy score of 77.8%. The mean difference in SL scores between the schema-matching group (mean accuracy 0.88, standard deviation 0.057) and schema-mismatching group (mean accuracy 0.65, standard deviation 0.218) was significant according to the t test (t = 3.615, p = .002). Note that there was a significant correlation across all subjects between the accuracy rate recorded in the SL runs and the preference for S-HiLo over S-LoHi during the two USL runs (r = .503, p < .05). Concerning the relationship between the SL test scores and USL schema matching, the TA B L E 5 Results of ROI-to-ROI connectivity analysis for contrast measured in the second unsupervised learning runs between the highscore and the low-score groups performances of all the subjects are shown in Figure 3, where each dot represents a single subject. We crossed the two binary criteria (High-score group versus low-score group and Schema-matching group versus schema-mismatching group) to classify the subjects into four subgroups. The color and shape of the dot indicate the group to which each individual belonged. This scatter plot reveals a significant correlation between the test accuracy rates in the SL sessions (x-axis) and preference rates of S-HiLo (schema matching) over S-LoHi (schema mismatching) in the USL sessions (y-axis). Table 6 lists the significant links (p <.01 FDR, two-sided) in the ROI-to-ROI analysis for the first USL in order to compare the network responses of the schema-matching group and the schema-

F I G U R E 2
Results of connectivity analysis of functional MRI data recorded in the listening phase of the final (second) USL run between the high-and low-score groups (high >low): difference in performance for tests at the end of each SL run. Graph of ROI-to-ROI effects (p <.01 FDR, two-sided)

F I G U R E 3
Performances of all the subjects in the USL and SL sessions. This scatter plot elicited a significant correlation between the test accuracy rates from the SL sessions (x-axis) and the preference rates for S-HiLo (schema matching) over S-LoHi (schema mismatching) from the USL sessions (y-axis). The linear regression model was: y = 0.456 x + 0.046 with a p value of 0.024 for the weight to independent variable. The Pearson's correlation coefficient between the variables (r) was 0.504 (p <.05). Each dot represents an individual subject classified by color and shape according to two types of grouping. The boundary between the high-score group (abbreviated as "High") and the low-score group ("Low") was set to x = 0.8 for evaluating the accuracy in the SL testing. The criterion for distinguishing the schemamatching group (abbreviated as "Match") and the schema mismatching group ("Mismatch") is that the subjects belong to the former preferred and chose, at least once during the two USL runs, S-HiLo patterns more than six times out of twelve comparison tests showing the S-HiLo propensity, whereas those within the latter group were the remaining subjects

| Connectivity changes across the switch in learning methodology
We examined three types of t-contrasts between the ROI-to-ROI FC values brought from the listening portions of the first USL, SL composed of three runs, and the final USL that was identical to the first one. The FC toolbox (CONN) enables statistical evaluation of all combinations across 132 areas and 32 representative nodes with MNI coordinates of the eight intrinsic networks. ROI-to-ROI contrast analysis, although on a rough scale for a brain map, would be the most effective exploratory approach to the neural basis of metaconcepts, such as USL or SL frames, which are either too general and abstract or too large-scaled to narrow down and underpin any voxel-wise signal interpretation that would be supported in the existing literature. The ROI-to-ROI connectivity analysis was a provisional way of evaluating the task-state FC that was expressed simply as an amalgam of genuinely task-evoked active and hodological interactive effects; the correlation computed in the default manner was considerably inflated by that mixture so it does not represent the exact amount of interregional functional synchronization (Cole et al., 2018;Janata et al., 2002). However, we assessed the firstorder effects of task-evoked activations as well as FC effects and used this confounding coefficient as the neural "representation" of a schema for methodologically oriented language learning.
The first contrast, set between the first and second USL runs, represents exclusively the order effect that the medial SL runs engendered as an interference modulator that acted on the identical USL content (Table 4). The traditional language network between the left inferior frontal gyrus, pars opercularis, and the left posterior middle temporal gyrus was detected as a crucial edge where the association was significantly stronger in the first USL than in the second one. This result is convincing if we consider the posterior effects of SL that could interfere with spontaneous linguistic performance in some subjects, that is, efforts to find correct answers independent of their original propensity for sense in language. The other implications of this contrast would be related to the efficiency of the short-term working memory in the first USL run, given that fresh exposure to a sound stream might activate the liveliest response in TA B L E 6 Results of ROI-to-ROI Connectivity Analysis for Contrasts measured in the First USL Run between the Schema Matching and Mismatching Groups temporo-parietal areas during auditory segmentation tasks is a consequence of auditory short-term memory. In this research, selective involvement of the left inferior frontal gyrus can be ascribed to the role of this area in sound segmentation and maintenance of auditory working memory (Burton et al., 2000;Hsieh et al., 2001;Pedersen et al., 2000;Poldrack et al., 2001). Regarding the significant association between the left inferior frontal gyrus and the left amygdala, the argument that is noteworthy is the theory of mediation between affect and cognition, especially concerning the formation of memories in acquisition of a second language (DeLuca et al., 2019). In this respect, it has been acknowledged that the amygdala is identified as a hub of the corticofugal path of memory formation that is responsible for regulating and appraising the objects to be learned and for maintaining sufficient motivation in learning (Schumann, 1990(Schumann, , 2001. The USL-related temporal contrast was thus emphasized by the quality difference in performing the task, and it is possible that the SL altered the USL with the same task contents and conditions.

Seed/Sources
Interestingly, there is no information on this contrast pertaining to the DMN or its reorganization process, at least at the level of the subjects overall. The DMN was featured through the second contrast, which was between the first USL and the following SL runs.
Compellingly, as indicated in Table 4, the DMN, and in particular its posterior subsystem, subserves the USL in the genuine condition and without confounds from the traces of memory training of SL.
Compared with SL, the within-connection during the first USL was extremely strong because it encompassed the bilateral inferior parietal lobules and the medial nodes of the posterior cingulate cortex and precuneus. Presumably, some inward thoughts and a flash of intuition for choosing parcels of sounds as meaningful to the subject activated the DMN during spontaneous activity in USL. It is likely that the DMN plays an important role in USL as long as the effects of SL are filtered to leave the DMN as the marker of pure USL. In contrast, the neural circuit of SL was identified, by force of contrast, in the edges between the posterior part of the left superior temporal gyrus (Wernicke's area) and the juxtapositional lobules bilaterally. According to recent studies, the former corresponds to the auditory modality aspect of language processing while the latter is also specified for relevant functions, especially inner speech mechanisms during language encoding, lexical disambiguation, syntax, and prosody integration, (Hertrich et al., 2016;Saur et al., 2010). We will underscore the importance of this connection later in the betweensubject analysis as a clue to the success of updating "schema" or "tuning" in the context of Rumelhart and Norman's theory of learning. In this section, we provide the outline for neural underpinning of the SL task in contrast with the functions of the DMN as a symbol of genuine USL.
The third contrast assigned between the second USL and the preceding SL runs was characterized by a set of multimodal between-ROI edges featured in the former but not in the latter.
The connectivity between Heschl's gyri (auditory cortex) and the lingual gyri (visual cortex) that increased in the second USL run when compared with the SL runs might reflect the auditory occipital association (AOA). The AOA, as a synchronization between the different perceptive modalities, represents, according to Cate et al., the persistent engagement of auditory attention; we can recognize this phenomenon when boosted in more difficult listening conditions (Cate et al., 2009) that need additional support from the visual cortex. Unexpectedly, this intriguing phenomenon emerged in the second USL run, which should have been more relaxed in the sense that only the subject's spontaneous reactions were required. A possible explanation for this finding is that the second USL run was affected by the executive control needed in the preceding SL, such that the subjects were no longer relaxed and suspected hidden correct answers. The neural mechanism underlying AOA remains unclear, in particular very little is known about the occipital regions (particularly the lingual gyrus) engaged in AOA and their specific functions with respect to auditory processing (Burton et al., 2005;Hasegawa et al., 2002;Janata et al., 2002;Zimmer & Macaluso, 2005). The remaining hubs in the third contrast other than those that were modalityspecific were the bilateral superior parietal lobules, which were connected to the right inferior frontal gyrus (the homologous portion of Broca's area) or the left anterior insula, representing the SN ( Figure 4). The left superior parietal lobule is known to be a central region in the ECN involved in working memory-related task performance (Koenigs et al., 2009;Liang et al., 2016). Interestingly, increased connectivity between the left superior parietal lobule

F I G U R E 4
Results of connectivity analysis of functional MRI data recorded in the listening phase of the first USL run between the schema-matching group and the schema-mismatching group (match >mismatch): difference in choice of S-HiLo pattern words for tests at the end of the two USL runs. Graph of ROI-to-ROI effects (p <.01 FDR, two-sided) and the right inferior frontal gyrus has been cited previously as evidence of brain plasticity promoted after 4 months of learning a second language (Bubbico et al., 2019). Furthermore, accumulating evidence indicates that the right inferior frontal gyrus plays a role in inhibiting control of word choice in language (Abutalebi & Green, 2016;Aron et al., 2004). We can interpret the SL-ECN link that connects the left anterior insula with the right superior parietal lobule in light of the resource allocation theory of Lerman et al. (2014), who posited that the SN simultaneously influences the DMN and ECN and balances the allocation of resources for bidirectional control of cognitive activity. This balance has been identified to be disrupted in individuals with Internet gaming disorder (Zhang et al., 2017). Note that the dominant DMN of the first USL minus the SL runs again involves a portion of the SN that presumably influences the DMN and reduces the activity of the ECN, which connects the anterior cingulate gyrus and the parietal operculum.

| Skilled learners coactivate bihemispheric auditory language resources with the DMN
The analyses to this point were performed by including all the subjects regardless of their individual variability in performance (in SL) and preference (in USL). From this point onward, we treat the between-subject contrasts based on these two viewpoints. The difference in precision of SL performance (between the high-and low-score groups) became manifest in FC only after finishing all the tests in the SL session, that is, during the final USL run when the responses were made spontaneously ( Table 5). The interregional interactions were expanded to the bilateral areas, which were homologous in structure and to some extent in function for natural linguistic processing (left superior temporal gyrus) and complex sound or spectrotemporal pattern analysis (right planum temporale) (Griffiths & Warren, 2002;Obleser et al., 2008).
Although pitch processing and assimilation of experience with tones (Xu et al., 2006) are often ascribed to the left planum temporale, the contralateral hemisphere is considered to play a role by connecting filtered acoustic information to areas of higher-order cognitive function (Griffiths & Warren, 2002;Price, 2006). We noticed that during FC in the high-score group, the right hemispheric nodes were located at the corresponding part of the left temporal language network, so were putatively concerned with speech recognition in parallel with Wernicke's area. Furthermore, the left posterior temporal lobe as a hub in this skilled learner characteristic connection was linked to the posterior subsystem of the DMN (precuneus and posterior cingulate cortex). This suggests that the group of higher scorers did not have to mobilize task-positive resources to maintain a relaxed state with a moderate cognitive burden (Pyka et al., 2009) and spontaneous reactions in the final USL run. As already mentioned, many DMN associations emerged in the first (but not the final) USL when contrasted with the SL conditions. This long-lasting participation of the DMN might allow us to isolate its significance in the success of USL and SL hybrid learning.

| Difference in quality of learning between schema-matching and schema-mismatching groups
In this section, we analyze the reason why and how the schemamatching and schema-mismatching groups diverged in the course of switching learning methodology. The distinction between these groups involves intricate phases that originate from the complexity of the artificial grammar languages used in this experiment. The schema-matching group, which more or less showed the S-HiLobiased decision, was unarguably assisted by favorable prosodic features that were consistent across listening and evaluation in USL.
The schema-mismatching group followed the S-LoHi order, which was unfavorable for consolidating word identity; for this syntactic pattern, pronunciation differed between speech and recognition of words. Interestingly, no subject in the schema-matching group made any comment on the post-hoc questionnaire about syntactic information, although they might have found that condition favorable. One explanatory hypothesis is that they were inclined toward a type of pitch learning where congruity of the stress position in the USL material could provide a key to word segmentation. However, viewed in light of the responses to SL testing, they did not persist in using prosodic cues exclusively but succeeded in "tuning" their preference schema, which could reflect not only aspects of prosodic learning but also sequential learning of lexemes independently of auditory tone. In contrast, the schema-mismatching group that never showed a preference for S-HiLo might have had incongruity in the stress positions of the S-LoHi words between the two steps in each SL run. If they adhered nevertheless to the S-LoHi pattern, discarding the auditory fluctuation between the P-LoHi in speech and the P-HiLo in the test, their linguistic schema may have remained "accreted" to the syntactic formation built on the arrangement of frequent and nonfrequent syllables.
Interestingly, the difference in performance quality between the schema-matching and the schema-mismatching groups is in line with the results of a comparison of implicit (not rule-instructed) and explicit (rule application) learning processes when performing a probabilistic continuous fine motor task (Green & Flowers, 2003 Figures 2 and 3).
However, the DMN did not represent the disparity in preference during the initial USL run (it was not included in the ROI-to-ROI contrast map of connectivity between the groups; Table 6; Figure 4), although at this stage the schema-matching group could already proceed smoothly with learning by receiving favors for existence of disyllabic words, which were preferred in terms of both syntactic and prosodic HiLo patterns. In contrast, subjects in the schemamismatching group might have suspected that there was a dilemma regarding the choice because the syntactic LoHi pattern they preferred was implemented in disyllabic units with the prosodic LoHi pattern that they did not always favor. Moreover, the stress position in the syntactic LoHi pattern words was inverted in the testing phase of the USL run, which could be perplexing for those who consider AG rules owing to the lack of prosodic consistency. In the first spontaneous learning session, such complexity in the preference retrieval process was not reflected in the activity of the DMN; rather, it was the neural basis of the "self-supervised learning" aspect (Kröger & Bekolay, 2019) that could capture the trace of the schema matching through the input from exposure to the speech stream. It may be that this "self-supervised learning" has something to do with "superlearning" (Caligiore et al., 2019) as an integration of SL, USL, and reinforcement learning. "Superlearning" implies a set of learning mechanisms that act in synergy to trigger widespread neuro-transformation at various modality and abstraction levels that are transmitted through multiple cortical/subcortical pathways and loops, including the cerebellum.
Self-supervised learning-recruited networks that started at the peripheral nodes in the cerebellum bilaterally encompassed crucial regions for cognitive management in linguistic processing, for example, the right insula representing the SN, the right homologous portion of Broca's area that is involved in linguistic executive control, and some left lateral regions included in the sensorimotor network. This finding is consistent with the suggestion of Caligiore et al., (2019) that USL mechanisms could boost cerebellar performance during SL and for the schema-matching group already in the opening stages of USL. Moreover, the argument regarding cerebellar learning mechanisms can be validated by referring to the many papers on the cognitive functions of the cerebellum as well as labeling of the Connectivity toolbox for the cerebellar subsystems.
For example, there has been mounting interest in the involvement of Crus I and I in the cerebellum in lexical acquisition and word retrieval based on their roles in executive control, salience detection, episodic memory, and self-reflection (Habas et al., 2009). The functional topography of the cerebellar cortex consists of a set of complicated enclaves that do not conform to the geographic features of the subregions but instead reflect the motor and cognitive functions to be projected to the FC in the forebrain (Guell et al., 2018;Sokolov et al., 2017). According to Buckner et al. (2011), approximately half of the cerebellum, especially Crus I and II (language areas in the cerebellum), is functionally connected to the cortical regions related to the DMN (Dobromyslin et al., 2012). For the neural response of schema-matching subjects, recruitment of the cognitive (and linguistic) cerebellum will be assessed by virtue of how the DMN represents success in USL (Figure 4).

| Limitations and extensions
This study has some limitations, some of which may merit future inquiry. We attempted to analyze the characteristics of the subjects in light of three stages of changes in the schema that Rumelhart and Norman theorized could explain the evolution of learning, namely, "accretion," "tuning," and "reconstruction." In our study, possibly because of the small cohort size for scans, we could not record any dynamic sample of the final stage ("reconstruction"), which should have created a new schema for fundamental renovation of the learning strategies. There were no cases of schema-mismatching whereby a subject with an unfavorable proto-schema during the SL runs went on to have results accurate enough to warrant reclassification in the schema-matching group.
In addition to the paucity of subjects, it was difficult to include scenario-making or contextual simulation for reconstruction of schemas in the experimental design because we had to allow for timing of critical triggers for shifting of the schema while controlling for any spurious order effects. The pitch control of the speech stream was unavoidably imperfect for the three artificial speeches (sound streams) in the SL runs because the Macintosh specifications of the "say" command aim at synthesizing voices that have the most natural pitch pattern for the given natural languages.
This factor gave rise to a slightly complicated material structure that could confuse subjects and worsen their performance. These trivial aspects of experimental stimulation and the research design of alternating the USL and SL for detection/retrieval of similar knowledge targets would not guarantee generalizability of the current approach to mixed learning methodologies or have neural correlates that could be mapped to "schema" dynamics. Moreover, in terms of more in-depth analysis of FC, there are still aspects of our experimental design that are yet to be uncovered. Given that the whole session was composed of a variety of stimuli and tasks that took considerable time for repeated learning and testing, we could not set any independent resting state for fMRI that was sufficiently long, stable, and immune to fluctuation for more than several minutes. Such scan data should have enabled a total independent component analysis (ICA) to determine, in a bottom-up manner, the intrinsic functional networks that would be specific to the experimental groups. This is the reason why we opted to use the network templates prepared in CONN in a top-down manner to perform the connectivity analysis.

| CON CLUS ION
Despite these limitations, our study demonstrated the dynamic topography of enhanced or weakened connectivity and has helped to elucidate the neural basis of schema modulation in learning development. Notably, the inquiry of functional brain mapping for abstract and comprehensive (meta-) concepts like USL, SL, and hybrid superlearning should be started from re-acknowledgement of global activation patterns across brain regions with fluctuation between canonical intrinsic networks serving as FC templates. In that sense, the contribution of the present study to neuro-pedagogy is can expand beyond the present research paradigm but should be in keeping with the source ROIs found to underlie the dynamic process of language acquisition.

CO N FLI C T S O F I NTE R E S T
The authors declare no conflicts of interest.

PE E R R E V I E W
The peer review history for this article is available at https://publo ns.com/publo n/10.1002/brb3.2157.

DATA AVA I L A B I L I T Y S TAT E M E N T
Data available on request from the authors.