Is critical thinking happening? Testing content analysis schemes applied to MOOC discussion forums

Learners’ progress within computer‐supported collaborative learning environments is typically measured via analysis and interpretation of quantitative web interaction measures. However, the usefulness of these “proxies for learning” is questioned as they do not necessarily reflect critical thinking—an essential component of collaborative learning. Research indicates that pedagogical content analysis schemes have value in measuring critical discourse in small scale, formal, online learning environments, but research using these methods on high volume, informal, Massive Open Online Course (MOOC) forums is less common. The challenge in this setting is to develop valid and reliable indicators that operate successfully at scale. In this study, we test two established coding schemes used for the pedagogical content analysis of online discussions in a large‐scale review of MOOC comment data. Pedagogical Scores are derived from manual ratings applied to comments by raters and correlated with automatically derived linguistic and interaction indicators. Results show that the content analysis methods are reliable, and are very strongly correlated with each other, suggesting that their specific format is not significant in this setting. In addition, the methods are strongly associated with the relevant linguistic indicators of higher levels of learning and have weaker correlations with other linguistic and interaction metrics. This suggests promise for further research using Machine Learning techniques, with the goal of providing realistic feedback to instructors, learners, and learning designers.

Dialogue helps learners build personal social capital and gain exposure to new ideas [41], and the language used in these environments has been shown to indicate the depth of critical thinking [1]. However, educational research has predominantly focused on discussion within formal learning environments rather than the informal setting provided by Massive Open Online Courses (MOOCs) [88], and has been dominated by assessments of the quantity rather than the quality of interaction.
Discussion forums within online learning environments have been identified as rich seams of instructor and learner interaction data that can be mined to monitor levels of participation, but they can also reveal significant aspects regarding the quality of interaction through the adoption of content analysis techniques [11,33,50]. Weber defines these techniques as research methods that build on "procedures to make valid inference[s] from text" [84, p. 1], and in this study, we are proposing to interpret and categorise critical thinking within MOOC comment forums (e.g., Reference [25]).
Although a precise definition of critical thinking remains unresolved, it is recognised as a key objective when encouraging learners to adopt in-depth, rather than surface, learning approaches [55]. In the context of this study, we agree with Lipman, who argues that critical thinking is best acquired within the social context of a community of inquiry [49], as well as with Biggs' [5] association of deep learning with "affective involvement" through interaction.
Dialogue between learners stimulates cognitive conflict which encourages reflection, assimilation of new knowledge, and continued interaction. In this context "critical thinking" is perhaps best defined as "reasonable and reflective thinking that is focused upon deciding what to do or believe" [56, p. 1].
Although the high volume of MOOC data provides an unprecedented opportunity for insight into how CSCL is used in practice, the reliability of analysis methods used to explore this data is questioned. Specifically, some argue that the methods lack coherence and validity [15], and others identify a research-inhibiting lack of consistency in their application [85].
A high volume of data combined with uncertainty regarding analysis methods emphasises the importance of constructing theoretically sound methods that can reliably, and automatically, analyse this data. To address these issues, this study employs appropriate pedagogical content analysis methods, using instruments that have previously been adopted in studies exploring the depth of critical thinking evidenced in CSCL, and seeks to explore the potential of established methods for identifying critical thinking in MOOC forums.
In particular, our study aims to identify features that would distinguish behaviour suggestive of levels of critical thinking, and sets out to answer three questions:

• RQ1: Are coding schemes used for pedagogical content analysis of online discussions reliable in the context of MOOC discussion forums? In particular, can different people consistently apply them, and do different frameworks identify the same levels of critical thinking?
• RQ2: Are the linguistic characteristics of comments significant indicators of levels of critical thinking when applied to MOOC discussion forum comments, as identified through pedagogical content analysis?
• RQ3: To what extent do more typical measures of attention to learning (such as social interactions) indicate levels of critical thinking when applied to MOOC discussion forum comments, as identified through pedagogical content analysis?
Our work lays the foundations for further research into the analysis and visualisation of Web-based learning, with the potential to improve learner reflection, MOOC development tools, and the discoverability of high-quality learning materials.

| BACKGROUND
As the ultimate aim of this study is to develop automated methods of assessing comments that can be readily comprehended by users (i.e., educators), it is important to ground this method in established pedagogic theory. In Weltzer-Ward's [85] analysis of 56 content analysis coding schemes used between 2002 and 2010, Bloom's Taxonomy [7] and analyses adopting Community of Inquiry: Cognitive presence (CoI) [25] were recognised as established methods with high citation counts, accounting for a high number of the reviewed papers. They are, therefore, a good choice for the content analysis in our study and inform the rubrics we developed, which follow in the tradition of similar work adopting these methods. In this section, we review these manual coding schemes and give an overview of existing work on linguistic and interaction analysis.

| Bloom's Taxonomy of the cognitive domain
The Taxonomy of Educational Objectives: Handbook 1: Cognitive Domain, commonly referred to as "Bloom's Taxonomy", was developed to improve the "exchange of ideas and materials among test workers, as well as other persons concerned with educational research" [6,7, p. 1], and to promote the use of teaching methods that encourage higher-order learning. Though directed by a small committee, the Taxonomy resulted from a collaborative effort that took input and feedback from a wide range of educators, educational psychologists, administrators and researchers.
Bloom's Taxonomy consists of a hierarchy of categories of educational goals or outcomes, starting from the lower-order learning goals of "remember" and "understand", to the mid-level uses of knowledge as evidenced in "apply" and "analyse", with "evaluate" and "create" indicating the achievement of deeper understanding. Further Taxonomies classifying the Affective and Psycho-Motor domains were published but did not reach the high level of recognition and use achieved by Handbook 1.
Researchers have found the categories useful as a framework for analysing learning processes. Bloom himself used them to evaluate the types of learning that take place in class discussions, compared with lectures [6]. His key finding, that learners spend more time engaged in higher-order thinking in class discussions than in lectures, led him to suggest that increasing opportunities for learner interaction would lead to improved development of problem-solving skills.
Kember's [37] association of Bloom's dimensions with Mezirow's [51] "thoughtful action" category (e.g. writing), Gibson, Kitto, and Willis' [28] use of Bloom to map word types to levels of cognition, and Karaksha et al.'s [35] use of Bloom to evaluate the impact of e-learning tools in a higher education setting, support the use of the Taxonomy in this study.
Furthermore, in Chan et al.'s [11] study, two raters were employed to analyse essay papers and classroom discussions, applying Bloom, the Structure of the Observed Learning Outcomes (SOLO) taxonomy, and the reflective thinking measurement model. Finding strong correlations between the models in long essays, but not in short discussions, they proposed further research using more than two raters to improve agreement and using the new version of Bloom to improve the accuracy of assessing cognitive learning outcomes. By engaging a team of seven raters and Krathwohl's updated version of Bloom [42], this study aims to advance understanding of these research issues. Table 1 presents the categories used by human raters in this study and includes descriptions and verb types associated with each category.

| Community of Inquiry
Community of Inquiry is based on the interaction of the forms of engagement, or "presence", within Web-based learning communities: Cognitive presence, social presence, and teaching presence [26]. As our study looks for evidence of critical thinking in MOOC forums, our focus is on the cognitive presence dimension, which Garrison, Anderson, and Archer [25] define as "critical, practical inquiry" as evidenced within four types of text-based dialogue: Triggering, exploration, integration, and resolution [57, p. 14]. These categories refer to stages of dialogue, starting with an initiating "triggering" comment and ending with assertions that conclude the discussion (Table 2).
Table 1: Bloom's Taxonomy [7,11,42]

0 (Off-topic): There is written content, but it is not relevant to the subject under discussion.

1 (Remember): Recall of specific learned content, including facts, methods, and theories. Verbs: Name, describe, relate, find, list, write, tell.

2 (Understand): Perception of meaning and being able to make use of knowledge, without understanding full implications.

4 (Analyse): Deconstruct learned content into its constituent elements to clarify concepts and relationships between ideas.

5 (Evaluate): Assess the significance of material and value in specific settings. Verbs: Check, decide, rate, choose, recommend, justify, assess, prioritise, critique.

6 (Create): Judge the usefulness of different parts of content and produce a new arrangement. Verbs: Synthesise, invent, plan, compose, construct, design, imagine, generate.

As an established pedagogic content analysis method, CoI has been applied within many studies. In her paper exploring the application of learning analytics, Dringus asserts that CoI provides "an array of meaningful and measurable qualities of productive learning and communication in online learning environments", and suggests converting CoI dimensions into datatypes that can be mined to "draw out coherent patterns" [19, p. 96] in online courses. Tirado, Hernando, and Aguaded [78] apply CoI in their study on the quality of knowledge construction in social environments and call for the strong validation of content analysis methods that evaluate the processes of the construction of knowledge in this setting. Shea et al. [69] adapt the approach to measure students' practice of successful learning strategies and compare their results with social network analysis methods. They recognise the importance of further research into the relationship between cognitive presence and interaction and suggest that its detection contributes to the understanding of learners' networking behaviours. Joksimovic et al. [33] associate linguistic proxies for learning with CoI stages in discussion forums within small-scale online courses. Their findings indicate the usefulness of further research that explores the effects of different levels of cognitive presence on learners with different levels of prior knowledge.
Finally, Waters et al. [83] implement a machine learning approach to predict students' critical thinking levels in formal online discussions according to CoI. In their study, they adopt word count, post similarity, chronological order, and other features to build a model that achieves a moderate level of accuracy. Although we do not implement an automated approach, our study adopts Waters et al.'s suggestions for future work, which include the use of Linguistic Inquiry and Word Count (LIWC) analysis to identify phases of critical thinking.
The research questions identified by these studies consider the utility of data derived from pedagogical content analysis and its potential in measuring performance in online courses. These are relevant to our study, which seeks to contribute to the development of robust and effective ways of understanding large-scale comment data that are based on established theory.

| Bloom and CoI as content analysis methods
Weltzer-Ward [85] argues that understanding of online environments may be improved through the use of pedagogical content analysis methods, and calls for research on their application outside of online academic classroom contexts and for the exploration of opportunities for synthesis. Our study addresses these research areas by using different methods that adopt complementary approaches to analyse discussion in large-scale, informal settings. Although Garrison et al. [25] acknowledge the consistency of their framework with socioconstructivist learning theory emerging from Dewey's [17] ideas on the importance of sociological as well as psychological aspects of learning, Bloom et al. do not explicitly recognise a single theoretical basis. However, the authors of both frameworks adopt hierarchical approaches identifying changes in learners' behaviour that have much in common with Piaget's theory of staged development [64] and implicitly recognise the value of social learning explored by Vygotsky [81].

Table 2: Community of Inquiry: Cognitive presence [26,57]

0 (Off-topic): There is written content, but it is not relevant to the subject under discussion.

1 (Triggering event): A contribution that exhibits a sense of puzzlement deriving from an issue, dilemma or problem. Includes contributions that present background information, ask questions or move the discussion in a new direction. Verbs: Evoke, induce, contradict.

2 (Exploration): A comment that seeks a fuller explanation of relevant information. This can include brainstorming, questioning and exchanging information. Contributions are unstructured and may include unsubstantiated contradictions of previous contributions, different unsupported ideas or themes, personal stories and descriptions, or facts that are not used as evidence.

3 (Integration): Previously developed ideas are connected. Contributions include references to previous messages followed by substantiated agreements or disagreements, developing and justifying established themes, cautious hypotheses, combining different sources, and providing a tentative solution to an issue. Verbs: Test, conjecture, check.

4 (Resolution): New ideas are applied, tested and defended with real-world examples. This involves methodically testing hypotheses, critiquing content in a systematic manner and expressing supported intuition and insight.
Although there are similarities, there are also distinct differences in their focus as well as in their approaches to the evaluation of critical thinking: Bloom facilitates generalisable evaluations of educational outcomes that can be applied to assessing learners in any number of settings, while CoI focuses on the appraisal of participation in the specific CSCL environment. In addition, some educational psychologists argue that individual and distributed cognition are two distinct, interrelated processes [53], and the methods we adopt in this study emphasise these different aspects: Bloom the individual, and CoI the distributed. By comparing two distinct approaches to measuring critical thinking, we seek to establish whether different methods yield significantly different results and to identify opportunities for synthesis. This leads to our first hypothesis (RQ1): that there are significant differences between levels of critical thinking as measured by each method.
In this study, we evaluate MOOC forum comments in terms of the extent to which they provide evidence of deep learning approaches that reflect critical thinking through the lens of each method and attempt to identify significant differences in outcome from their use. Specifically, we seek to establish if these pedagogical content analysis methods can be applied consistently by different people and if these methods identify the same types of learning activity.
In addition to comparing and critically evaluating two different pedagogical content analysis methods, we compare the outcomes of this analysis with established proxies for learning in the form of linguistic analysis and typical interaction analysis; these are each explored in the next two sections.

| Linguistic analysis
The content and style of language used in everyday communications provide important indicators of psychological and social meaning that may be measured by quantitative methods, including content analysis and word pattern analysis [62]. Characteristic approaches to quantitative language analysis involve the identification and coding of similar patterns and the interpretation of content supported by statistical tests of significance [39]. Writing, speech, and the types of words used are seen as important proxies for emotional and cognitive processes [60].
In recent years, there has been increased emphasis on content analysis studies exploring language use in CSCL [85]. For example, Delfino and Manca [16] discuss the use of "figurative" language in online social contexts; Miller [52] explores gender-related language patterns; Uzuner [79] identifies educationally valuable talk in CSCL; Tausczik and Pennebaker [76] adopt a real-time language feedback system to improve learner collaboration; Robinson, Navea, and Ickes [65] correlate student language use with educational attainment; Joksimovic et al. [33] correlate word categories with CoI dimensions; and Allen, Snow, and Mcnamara [1] use linguistic indicators to predict learners' reading comprehension abilities. Evidence that pedagogically meaningful dialogue in Web-based environments can be automatically identified using learning analytic techniques has importance for this study [14], as does the use of mixed linguistic and interactional data to identify potentially "at-risk" learners [86].

| Linguistic inquiry and word count
Among computational approaches to language analysis, LIWC [24] was chosen as suitable for the analysis of online discussion and the evaluation of cognitive processes. LIWC was developed as a result of studies into the therapeutic effects of writing about psychological traumas. The application adopts a quantitative, word-count approach that aims to reveal the psychological meaning of words taken out of their original context [62]. It searches text files for over 2,300 words or word stems, classified into 82 dimensions covering stylistic aspects of language use (e.g., articles, prepositions, pronouns), psychological processes (e.g., positive and negative emotion categories, cognitive processes), and other categories.
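LIWC's dictionaries are proprietary and are not reproduced here, but the word-count mechanic the tool implements can be sketched as follows. The mini-dictionary, category names, and stem-matching rule below are illustrative assumptions, not LIWC's actual data or API:

```python
import re
from collections import Counter

# Illustrative mini-dictionary: the real LIWC dictionaries map over
# 2,300 words/word stems to 82 dimensions. A trailing "*" marks a stem
# that matches any continuation (e.g., "caus*" -> "causes", "causal").
DICTIONARY = {
    "because": {"cogproc", "cause"},
    "caus*": {"cogproc", "cause"},
    "maybe": {"cogproc", "tentat"},
    "perhaps": {"cogproc", "tentat"},
    "i": {"pronoun", "i"},
    "we": {"pronoun", "we"},
}

def liwc_style_counts(text):
    """Return per-category percentages of total words, LIWC-style."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        cats = DICTIONARY.get(word)
        if cats is None:
            # fall back to stem ("*") matching
            for entry, entry_cats in DICTIONARY.items():
                if entry.endswith("*") and word.startswith(entry[:-1]):
                    cats = entry_cats
                    break
        for cat in cats or ():
            counts[cat] += 1
    total = len(words) or 1
    return {cat: 100.0 * n / total for cat, n in counts.items()}
```

For a comment such as "Maybe I failed because we argued", two of the six words fall in the cognitive-process category and two are pronouns, so each of those categories scores roughly 33% of total words.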
In addition to the developers' experiments aimed at validating the program, a number of studies suggest the usefulness of LIWC in detecting the meaning of words. Although not seen as a replacement for qualitative analysis, Carroll [9] found that LIWC provided meaningful results in his analysis of essays written for a critical thinking course. He discovered that, compared with writing at the start of the course, learners' final essays demonstrated less use of pronouns and of words related to insight (think, know, consider), discrepancy (should, would, could), and tentativeness (maybe, perhaps, guess), and were more likely to express causal thinking (because, effect, hence).

| LIWC categories
In this study, though we explored correlations with all LIWC categories, previous research indicates that specific categories are more closely associated with critical thinking than others. Therefore, as we set out to answer the question of whether the linguistic characteristics of comments are reliable proxies for levels of critical thinking (RQ2), we expected to find significant associations between the results of our pedagogical analysis and several LIWC characteristics.

| Word count
The number of words used in comments is often understood as a rough guide to levels of participation [3] and is commonly associated with the intensity of engagement [13,68]. Ferguson and Buckingham Shum's [21] research into synchronous text chat, and Joksimovic et al.'s [32] linguistic analysis of online discussions similarly suggest close associations between high word counts and thoughtful, "exploratory" exchanges.

| Pronouns
In their comparison of self-assessment with traditional (non-reflective) assignments, Peden and Carroll [58] found that learners writing self-assessment essays included more pronouns, insight and emotion words, and used simpler language than in traditional academic assignments. Kacewicz et al. [34] suggest that higher-status contributors use fewer first-person singular pronouns, and Vosecky, Leung, and Ng's [80] research into tweet quality suggests that "I-talk" signifies "low quality", non-factual communication. Robinson, Navea and Ickes [65] discovered that they could predict learners' course performance from their use of "word simplicity, first-person singular pronouns, present tense, details concerning home and social life, and words pertaining to eating, drinking, and sex" (p. 469), concluding that low-performing learners tended to exhibit egocentricity in their writing.

| Causal words
Within LIWC dictionaries, causal words are categorised as a subgroup of cognitive process words, which suggest an engagement with active reappraisal, or processing, of information [61]. Although Joksimovic et al. [33] found that counts of causal words did not differ significantly between the higher phases of CoI, several studies (e.g., References [34,46,59]) have found that causal words are related to the level of cognition. Linguistic analysis of journals and essays indicates that causal words are more evident in precise and concise descriptions, and indicate progress in the level of cognition and understanding [59]. In addition, increased levels of differentiating between competing ideas have been linked to higher levels of cognition [76].

| Power and affiliation
The LIWC categories of power and affiliation are developed from thematic apperception test (TAT) research and relate to assessments of an individual's unconscious drives and social motives, where the affiliation motive is related to friendliness and establishing rapport, and power is associated with making an impact and exerting control [87]. Although the literature does not suggest causal associations between TAT scores and levels of critical thinking, in LIWC a higher incidence of power words suggests the writer's perception of themselves as having high status or expertise. In this study, we conceptualise this form of self-assurance as a potential indicator of critical thinking.

| Emotion words
Using sentiment analysis to measure relationships between mood and different variables, from consumer confidence to managing disaster relief, is commonplace wherever people's behaviour is under scrutiny. Research suggests that though positive language can suggest a focus on group cohesion, which may encourage individuals to work harder [23,47], correlation with positive sentiment can also suggest disconnection, and high levels of empathetic discussion may distract learners from key tasks [47]. Conversely, the expression of negative sentiment has been associated with "cognitive disequilibrium" and higher levels of thinking [18,27].

| Word length
Although complex cognitive processes and critical thinking are often associated with the use of long words [3,38], some researchers have found that counts of long words are not significant indicators of cognitive load on their own, but are useful in supporting analyses that include other significant features [38]. However, long sentences should not necessarily suggest increased cognitive attention: in their research into predictors of students' reading comprehension, Allen et al. [2] assert that shorter sentences can suggest more sophisticated writing strategies.

| Other word types
Researchers have found negation, auxiliary verbs, and conjunctions to be significant indicators of cognitive load [39], and in the analysis of undergraduate writing these categories have shown significant differences between the triggering and other phases of CoI [33]. Research also associates a high incidence of prepositions with reflective behaviour: heavy use of prepositions is identified as a significant indicator of increased cognitive load [38], and prepositions are prevalent in the discussion sections of journal articles, which are "often the most complex part of an article" [31, p. 35].
In addition, Joksimovic et al.'s [33] study found distinctive use of dictionary, functional, inhibition, inclusive and cognitive words, as well as articles, prepositions, and conjunctions, in the triggering phase, but found no significant difference in the use of pronouns or insight words across the four phases.

| Limitations
Although supported by numerous research outputs, linguistic analysis is limited in its reliability [75]. In addition to the uncertainty, referred to above, over the meaning of high words-per-sentence counts, analysis of word categories may also be compromised. Contributors to discussion forums often use symbolic, oblique and indirect ways of communicating meaning, which may lead to classification errors [60]. Multiple meanings of words, complicated sentence formation, and unclear use of pronouns may obscure meaning and require more complex methods to resolve uncertainty than are available in the software used in this study [31]. However, notwithstanding the potential for error, we agree with Pennebaker and Francis' claim that LIWC analysis is "as valid as a judge-based system that requires multiple judges who, themselves, are prone to error" [39, p. 622].

| Interaction analysis
Where use of language acts as a more-or-less unconscious indicator of mood, interaction analysis looks at more direct actions. "Likes" are a common intentional rating mechanism used to signify personal feelings [45,77]. Some research suggests that this metric is ambiguous [43] and unreliable [74]; however, this indicator, as well as sentiment analysis, is widely used in learning analytics, for example to identify learner attrition [86], self-confidence [71], and learners' opinions of courseware [74]. In addition, there is some evidence that this cumulative rating system may provide learners with timely prompts that can lead to higher levels of learning [13]. The platform hosting the MOOCs explored in this study includes a "like" button associated with each individual comment, which allows learners to provide immediate, simple feedback.
Furthermore, by placing discussion forums within the context of each activity and providing mentor support [48], the platform encourages sharing and situated debate [4,36], with the explicit intent of building communities of inquiry and inspiring higher-level learning. In this context, we set out to discover if learners' use of the "like" button was significantly associated with pedagogical content analysis methods. Specifically, we aimed to answer the question of whether the number of likes awarded to comments or the sentiment of posts are a reliable indicator of the level of critical thinking (RQ3).

| METHODOLOGY
To answer our three research questions, comment data from three Massive Open Online Courses (MOOCs) offered on the FutureLearn platform in 2014-15 were analysed. The MOOCs were chosen to facilitate the analysis of writing produced in diverse subject areas: business, education, and science. More than 41,500 registered learners engaged with the courses, with nearly 15,000 contributors posting over 174,500 comments containing more than 8.5 million words. Each MOOC was delivered via an average of 20 "steps" per week throughout each of the 3- to 6-week courses, and each step provided the facility for instructors and registered learners to contribute to discussions within that step's comment field. As all comment data was provided in anonymised form, it was not possible to separate comments by type of participant. Although the random sample of 1,500 comments used in this study may have included comments from instructors as well as learners, as the literature reports low levels of instructor intervention (e.g., Reference [8]), our assumption is that the bulk of comments were made by learners.
Sample size was limited by the time each rater would need to provide a reliable evaluation, the cost of employing raters, and the available financial resources. To obtain accurate results, we anticipated that raters had to be motivated to undertake the tasks in an expert manner. Many studies support the proposition that the strongest motivating factor for this type of work is the ability to earn money [8,36]. As raters were expected to carry out expert assessments, we decided to pay them the normal rate for similar professional activity (e.g., teaching).
Having undertaken similar work in earlier studies, we estimated that each rater would spend an average of 30 s evaluating each comment. As each comment was rated twice (once for each analysis method), we estimated that within our small research budget raters could reasonably be expected to evaluate no more than 1,500 comments in total (the MOOC2015 corpus).
Although amounting to less than 1% of the total number of comments submitted on the three chosen MOOCs, the MOOC2015 corpus is considerably larger than the 20 to 140 sample sizes that Beleites et al.'s [4] algorithm design research suggests are required to train "good classifiers".
To select 1,500 comments from a total of 174,500, we used a simple random sampling method [67]. Labelling each comment with a unique number and choosing 500 comments from each MOOC using a random number generator would have provided a satisfactorily random sample. However, raters expressed concern during training that comments selected in this way would be seen out of context, which may have led to inaccurate ratings. As individual comments taken out of context could be misconstrued by raters, we organised them into batches of 20 consecutive comments (the minimum number of comments considered large enough to facilitate context-based judgements). Eight batches from each MOOC were then selected for rating using a random number generator [29]. Three randomly selected batches of 20 consecutive comments from each MOOC were also selected to facilitate test rating before undertaking analysis of the rest of the sample (see Figure 1).

Qualitative analysis was undertaken by a team of seven raters recruited from postgraduate students registered at a UK university, using content analysis methods based on Bloom's Taxonomy (Table 1) and CoI (Table 2) to rate whole comments (Figure 2). Two of the seven had backgrounds in education, two in anthropology, and one each in physics, psychology, and languages. Five had previous experience of assessing written work. The raters were given a short face-to-face instruction session, where they scored a variety of typical comments and observed how others scored the same comments. They were instructed to work alone on the coding task and not to compare results with, or request advice from, anyone.
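The batch-based sampling described above, splitting each MOOC's comment stream into runs of consecutive comments and then drawing batches at random, can be sketched as follows. The function name, seed, and default counts are ours, not the study's actual tooling:

```python
import random

def sample_comment_batches(comments, batch_size=20, n_batches=8, seed=42):
    """Split a MOOC's comments (in posting order) into consecutive
    batches of batch_size and randomly select n_batches of them,
    preserving the local context raters need for judgement."""
    batches = [comments[i:i + batch_size]
               for i in range(0, len(comments) - batch_size + 1, batch_size)]
    rng = random.Random(seed)  # fixed seed for reproducibility
    return rng.sample(batches, n_batches)
```

Drawing eight batches of 20 from each of the three MOOCs in this way yields 480 rated comments plus the separately drawn test batches, which together build towards the 1,500-comment sample.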
To identify outlying scores and possible misunderstandings among raters, an initial test selection of 60 comments, comprising 20 randomly selected consecutive comments from each MOOC, was scored by the coding team. Intraclass correlation coefficients were calculated using a two-way mixed, consistency, average-measures definition and provided inter-rater reliability scores of 0.93 for Bloom and 0.898 for CoI, suggesting "almost perfect" agreement [67, p. 165] and indicating that levels of critical thinking were scored similarly across raters.
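The two-way mixed, consistency, average-measures ICC is Shrout and Fleiss's ICC(3,k), which can be computed from two-way ANOVA mean squares. A minimal sketch (the ratings below are illustrative, not the study's data):

```python
def icc3k(ratings):
    """ICC(3,k): two-way mixed, consistency, average-measures.
    `ratings` holds one row per rated comment, one column per rater."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between comments
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_err = ss_total - ss_rows - ss_cols
    bms = ss_rows / (n - 1)                 # between-comments mean square
    ems = ss_err / ((n - 1) * (k - 1))      # residual mean square
    return (bms - ems) / bms

# Two raters who differ only by a constant offset are perfectly consistent:
assert abs(icc3k([[1, 2], [2, 3], [3, 4], [4, 5]]) - 1.0) < 1e-9
```

Because ICC(3,k) measures consistency rather than absolute agreement, a rater who is systematically stricter than another does not lower the score.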
Rating of a larger sample then went ahead. Comments from all three MOOCs were numbered in batches of 60 consecutive comments, from which eight batches were randomly chosen and distributed among the raters (two batches from each MOOC). Raters scored each comment twice (once with Bloom and once with CoI), with each batch being scored by two raters working independently. To avoid confusion between content analysis methods, scoring sheets were distributed with a 10-day time lag between the two methods, and only after scoring with the first method had been completed. In total, 1,440 comments were scored.
Comments from the test and full sample were combined (n = 1,500) and Pedagogical Scores (PS) for each comment were generated based on the average score of the two raters who had examined that comment, where the PS value is equivalent to the level identified for that comment within the analysis framework.
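The PS derivation is simple arithmetic: the mean of the two independent ratings for a comment, on the scale of the framework's levels. A minimal sketch (the level values are illustrative):

```python
def pedagogical_score(rating_a, rating_b):
    """PS for one comment: the mean of the two raters' framework levels."""
    return (rating_a + rating_b) / 2

# One rater codes a comment at Bloom level 4, the other at level 5:
ps = pedagogical_score(4, 5)  # 4.5
```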
Statistical analysis software was used to conduct two-way analyses of variance and to generate scatter plots with fitted lines to assess the existence and strength of simple linear relationships. Histograms show a close-to-normal distribution of mean scores (Figures 3 and 4, and Table 3), and these scores were compared with the number of words per comment, the number of likes, and LIWC2015 categories to produce correlation and prediction values. LIWC2015 word-category analysis is believed to be unreliable for texts containing fewer than 50 words [63], and as 40% of comments fell below this threshold, results were explored on three levels: 1,500 individual comments (where analysis methods, likes, and word count were compared), 150 aggregated batches of 10 contiguous comments (LIWC2015 compared with average scores for each batch), and 607 individual comments containing 50 or more words (LIWC2015 compared with individual average scores). The aggregated batches were formed by grouping together the text from 10 contiguous comments; the average PS was calculated for each batch and correlated with the LIWC2015 analysis.
In addition, PS generated by the two different frameworks (CoI and Bloom) were correlated to explore whether the frameworks were measuring the same sorts of pedagogical activity. In Table 4, we have shaded the cells to show the two strongest, significant correlations: word count and first-person pronoun count. Given the number of significance tests (103 in total), we report p values at <.05, <.01, and <.001.

| Research Question 1: The reliability of the pedagogical analysis methods
As the key test of objectivity in content analysis research, establishing the degree to which raters agree is vital; unfortunately, many studies either fail to report rater agreement or report only discussion leading to full agreement [15]. Krippendorff [43] argues that this approach is of little use, as it tests the reliability of the individual raters rather than of the method. As there is no settled method of testing agreement, our study follows de Wever et al.'s [15] recommendation and reports two indices.
FIGURE 1 Sample selection process. MOOC, Massive Open Online Course
To establish the reliability of the pedagogical analysis methods used in this study, intraclass correlation coefficients were calculated between pairs of raters and provided inter-rater reliability scores of 0.832 for Bloom and 0.818 for CoI. This suggests a high degree of agreement between raters and indicates that the pedagogical frameworks were interpreted and applied similarly across raters. Furthermore, reliability analysis using Krippendorff's [44] α method provided inter-rater reliability scores of 0.7287 for Bloom and 0.6961 for CoI, which supports the use of these methods to reach tentative conclusions.
When comparing PS derived from the two frameworks there is a high correlation score of .909 (p < .001), suggesting a close association between Bloom's levels of learning and CoI's measures of meaningful and productive discourse. This suggests that while they describe pedagogical activity in different ways, they are relatively consistent in measuring its presence and strength.

| Research Question 2: Linguistic content analysis as an indicator of learning
We sought to establish which LIWC characteristics had significant correlations with levels of critical thinking, using the LIWC2015 software to examine all characteristics. Two moderate-to-strong indicators were identified across all approaches to corpus analysis: word count (Figure 5; highest correlation: r = .759, p < .001) and first-person singular pronouns (Figure 6; r = −.533, p < .001).
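Correlations of this kind can be computed directly from the paired values; a minimal Pearson-r sketch with synthetic data (the numbers below are illustrative, not the study's):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Word counts rising with PS give a strong positive r:
word_counts = [12, 25, 40, 55, 80, 120]
ps = [1.0, 1.5, 2.5, 3.0, 4.0, 5.5]
r = pearson_r(word_counts, ps)
```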
Finally, word types with low correlations across all approaches to corpus analysis included conjunctions (and, also, although: r = .28, p < .001) and words with six letters or more (sixltr: r = .2, p < .05).

| Research Question 3: Social interactions as an indicator of learning
We also explored correlations between the content analysis methods and metrics commonly used to measure social interaction: "likes" and sentiment analysis. Although "likes" gave positive, significant results across all approaches to analysis, the correlation was weak (Table 4; maximum r = .298, p < .001). In terms of sentiment, typical measures include positive and negative emotion and emotional tone. Although all three produced significant, moderate correlations within aggregated comments, results were weakly correlated in ≥50-word comments, with negative emotion words (negemo) producing non-significant results.

| ANALYSIS AND DISCUSSION
In their wide-ranging review of content analysis methods, de Wever et al. [15] identify six interrelated criteria for assessing their effectiveness: instruments should be "accurate, precise, objective, reliable, replicable, and valid" (p. 8). Central to these criteria are the theoretical basis of the instrument, the unit under analysis (i.e., the comment as a whole, or in part), and the extent to which they can be replicated across a variety of settings: from an individual rater agreeing with themselves, through two or more raters reaching agreement, to reliable use by many different groups of researchers [66]. The content analysis methods used in this study were applied by seven raters, who applied the analysis criteria to individual, whole comments. The high level of agreement in this study suggests that these methods may be successfully applied in other settings and provide the foundation of automated rating systems that closely conform to commonly held values regarding levels of critical thinking.
FIGURE 3 Distribution of average Bloom scores
By exploring the comment data using three sampling techniques, we were able to investigate how the analysis methods behave in different contexts. Looking at all 1,500 comments allowed us to make inferences about general word count and interaction categories, individual ≥50-word comments facilitated LIWC word-category analysis at an individual-contributor level, and aggregations of all comments provided an overview of how contributors were commenting. These approaches are useful in different contexts. For example, understanding language dynamics at an individual level is important for analysing the behaviour of specific contributors, while an aggregated approach can indicate how activity within the course is generally progressing.

| Pedagogical content analysis methods
The pedagogical content analysis methods used in this study were highly correlated. As each method emphasises a different aspect of cognition (Bloom: individual; CoI: distributed), this suggests that, in this study, there is a strong connection between individual levels of critical thinking and how these develop through discussion. This outcome may result from aspects of learning design that are particular to the FutureLearn platform.
For example, though online learning environments are not always synonymous with improved critical thinking [73], there is some evidence that providing learners with timely and detailed prompts can lead to higher levels of learning [13]. By placing discussion forums within the context of each activity and providing mentor support [48], the FutureLearn platform encourages sharing and situated debate, with the explicit intent of building communities of inquiry and inspiring higher-level learning. In addition, the two instruments may be measuring very similar behaviours related to the depth and intensity with which people write about what they are thinking. If we accept that there is an approximate connection between the complexity of writing and depth of understanding, it is reasonable to assume that someone who has applied greater attention to their learning, and wishes to share this with others, will use more elaborate arguments ("Create" in Bloom) or attempt to summarise arguments ("Resolution" in CoI); this suggests that comments evidencing these types of focus will tend to be ranked in a similar manner. Although the instruments based on those frameworks are sensitive to different aspects of learning, our results show consistency in measuring the presence and strength of critical thinking, which suggests their interchangeability in quantifying these properties in this setting.

| LIWC analysis
An important aim of this study was to determine predictors that closely align with cognitive processes in CSCL, and the literature indicates that LIWC is an accurate tool for measuring significant aspects of language use in this setting.
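LIWC2015 itself is proprietary software, but its core computation, the percentage of a text's words falling into each dictionary category, can be sketched with a toy dictionary. The word lists below are illustrative assumptions, not LIWC's actual dictionaries.

```python
import re

# Toy category dictionaries; LIWC2015's real dictionaries are proprietary.
CATEGORIES = {
    "i": {"i", "me", "my", "mine", "myself"},              # first-person singular
    "cause": {"because", "therefore", "hence", "effect"},  # causal words
}

def category_percentages(text):
    """Return {category: percentage of words in that category}, LIWC-style."""
    words = re.findall(r"[a-z']+", text.lower())
    return {name: 100 * sum(w in vocab for w in words) / len(words)
            for name, vocab in CATEGORIES.items()}

pcts = category_percentages("I think this happened because of the experiment.")
# Both categories match 1 of the 8 words, i.e. 12.5% each.
```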
The most relevant outcome from regression analysis comparing the outputs of LIWC2015 and the content analysis instruments is the clear, statistically significant, positive correlation between word count and level of critical thinking, which confirms the findings of studies that associate high word counts with thoughtful, "exploratory" exchanges in formal CSCL environments. In addition, our results for first-person singular pronouns ("I-talk") are also supported by the literature, showing strong, significant results across all profiles, with a negative association with high-level learning.
Our findings for causal, differentiation, cognitive process, and power words all provided moderate positive correlations with both pedagogical content analysis methods, across aggregated comments and ≥50-word comments. The findings for causal, differentiation, and cognitive process words accord with studies of language use in formal education as well as informal settings. The correlation of power words with higher levels of critical thinking in our study suggests self-confidence in expressing opinions. With regard to "emotion" and positive sentiment words, the statistically significant, though moderately negative, correlation between these categories and learning objects with high PS is another noteworthy outcome of this study. Some studies suggest that correlation with positive sentiment indicates loss of focus from key tasks; by showing a higher incidence of these categories associated with lower levels of cognitive engagement, our study appears to agree. However, the positive association of positive sentiment words with higher levels of cognitive engagement in longer comments (those containing 50 or more words) in our study, while weak, implies agreement with other studies suggesting that higher levels of positive language equate with a greater focus on group cohesion and encouragement to work on-topic.

Table 4 (extract): correlations between PS and selected metrics across the three samples (all comments; aggregated batches; ≥50-word comments), with Bloom and CoI columns in each case. *p < .05; **p < .01; ***p < .001; ns, not significant.

Methods          r = .909***
Likes            r = .237***  r = .243***  r = .263***  r = .298***  r = .146***  r = .149***
WC               r = .687***  r = .704***  r = .759***  r = .759***  r = .422***  r = .465***
Cause            r = .125***  r = .101***  r = .573***  r = .523***  r = .224***  r = .196***
Differentiation  r = .220***  r = .195***  r = .443***  r = .429***  r = .100*    r = .122**
Negation         r = .122***  r = .110***  r = .458***  r = .451***  r = .058 ns  r = .052 ns
Results for negative emotion were also mixed, with a significant, moderate positive correlation with the level of critical thinking in aggregated comments but no significant results in ≥50-word comments. Although this category is associated with higher levels of critical thinking in the literature, our study suggests that it, along with positive sentiment, may not be a reliable measure in all samples.

| Interaction analysis
The significant, positive association of likes in this study with high PS was unexpected, as this metric has been reported as ambiguous and unreliable, and our previous work produced non-significant results for it. Although results from this study, and good research practice, suggest caution when making inferences from web paradata about ambiguous phenomena like individual behaviour or cognition (especially with correlation values of less than r = .3), the significant result across all comments and aggregated comments implies that a process of "aggregated trustworthiness" may be at work [32]. In this setting, a sufficient number of MOOC forum contributors may be using the "like" button as an indicator of the trustworthiness and expertise of certain posts (rather than using it to signify agreement or ironic disagreement). Although we were unable to uncover any empirical evidence in the literature to support this particular argument, Facebook "likes" have been used to predict intelligence levels [40], and some researchers have found that the "like" feature can moderately stimulate learner motivation [70], suggesting this may be a fruitful area for further research.
FIGURE 7 Scatter plot showing the correlation between percentage of causal words and Bloom score in aggregated comments (r = .573, p < .001)
Finally, words containing six or more letters returned weak correlations, and words per sentence, negation, auxiliary verbs, conjunctions, and prepositions returned mixed and non-significant results in the ≥50-word comment sample. Although the use of long words and long sentences has been associated with higher levels of critical thinking, the literature reports mixed findings for this category. In our study, negation, auxiliary verbs, and conjunctions produced moderate, significant results in the aggregated comment sample, but analysis of the ≥50-word comment sample revealed no significance for these features in individual CoI-coded comments, with conjunctions and auxiliary verbs producing very weak correlations in Bloom-coded comments. These inconclusive results suggest that using these categories as a sole indicator of critical thinking is not advisable, but that they may have a place in supporting analyses that include other significant features.
Together with the unexpected significant result for likes, the low correlation values and lack of significance for prepositions were also unanticipated. When aggregating all three MOOCs, our findings do not appear to agree with the large number of studies that have found statistically significant positive associations between prepositions and attention to reflective behaviour or increased cognitive load. However, exploratory analysis of results filtered by MOOC revealed non-significant results for prepositions in the business-related MOOC, with significant, moderately correlated results for this word type in the other two. This may be explained by the very low incidence of off-topic comments in the latter two MOOC samples, which further suggests that aspects of language analysis are highly context-dependent.

| CONCLUSIONS
This study set out to answer three key questions.

RQ1. Are pedagogical content analysis methods reliable, can different people consistently apply them, and do they identify the same types of learning activity?

Converting informal MOOC comments into comparable scores based on multiple pedagogical frameworks is a significant research challenge. In this study, a group of seven raters achieved a high degree of reliability using both pedagogical analysis methods, which gives us some confidence in the generalizability of these methods in future studies. Building on previous research in formal CSCL environments, we have established close associations between two distinct methods applied to informal settings, which contradicts previous findings (e.g., Reference [11]) and suggests the value of further investigation of critical thinking evaluation in MOOCs. Although the pedagogical content analysis methods have different theoretical foundations and were developed to assess different aspects of learning (individual and distributed cognition), when applied in this context and correlated against language categories, sentiment, and "likes", there appears to be very little difference in how they measure levels of critical thinking.

RQ2. Are linguistic content analysis measures significant indicators of levels of critical thinking?

Confirming previous research (e.g., References [12,34,59]), through LIWC2015 analysis we identified word count and first-person singular pronouns as convincing indicators of levels of critical thinking, with causal words, power words, and all pronouns providing moderate results. Other word categories provided mixed results within the two sampling methods used, suggesting a supporting role for these categories in future research.

RQ3. To what extent do typical measures of attention to learning indicate levels of critical thinking when applied to MOOC discussion forum comments, as identified through pedagogical content analysis?

Although producing significant results, and confirming previous work suggesting that "likes" prompt engagement with higher levels of learning [13], both measures of sentiment and "likes" were weakly correlated with measures of critical thinking. Therefore, as with weakly correlated word types, this suggests secondary roles for these measures in future research.
Henri [30] suggests that the object of analysing education-based CMC interactions is to "improve the efficacy of interaction with students" (p. 117). Despite progress in codifying content analysis methods and the development of automated Natural Language Processing techniques, the absence of effective tools means the process of coding remains arduous and time-consuming [54]. For instructors, the timely identification of learners in need of pedagogical support is as relevant now as it was when Henri addressed the issue 25 years ago, and the strong correlations between LIWC2015-based proxies for pedagogical activity and the pedagogical content analysis methods in this study suggest significant promise for automated tools.
Our future work will thus involve exploring the development of real-time, automated analysis tools based on Machine Learning techniques. Although we aim to explore a range of methods, Random Forests classifiers appear the most promising. They are among the most widely adopted classifiers in Learning Analytics research [72], offer opportunities for high accuracy and reliability [10,22] and are considered to be relatively straightforward to apply [82]. These algorithms have the potential to go beyond the linear regressions presented in this paper, combining multiple metrics to predict PS. It is our hope that such software could support learners in their personal reflection, help tutors to identify excelling or struggling students, and aid learning designers in identifying areas of weakness in their MOOCs. In the future, such tools will be essential if tutors are to effectively manage learning interactions at a massive scale.
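As a sketch of the intended direction (assuming scikit-learn is available; the feature names and synthetic data are illustrative, not the study's), a Random Forest can regress PS on several LIWC-style metrics at once rather than one at a time:

```python
import random
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins for per-comment metrics: word count, first-person
# pronoun %, and causal word %. Real features would come from LIWC2015.
rng = random.Random(0)
X = [[rng.randint(10, 200), rng.uniform(0, 10), rng.uniform(0, 5)]
     for _ in range(300)]
# Toy target loosely echoing the paper's findings: PS rises with word
# count and causal words, and falls with first-person pronouns.
y = [0.02 * wc - 0.1 * fp + 0.3 * cause + rng.gauss(0, 0.3)
     for wc, fp, cause in X]

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)
predicted_ps = model.predict(X)
```

In a real deployment the model would be trained on rater-derived PS and evaluated on held-out comments before being used to feed dashboards for instructors, learners, or learning designers.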

ACKNOWLEDGEMENT
This study was supported in part by a grant from EPSRC, award: 1383089.