
SPECIAL ISSUE ARTICLE
Open Access

From signals to knowledge: A conceptual model for multimodal learning analytics

Daniele Di Mitri (Corresponding Author)

Welten Institute ‐ Research Centre for Learning, Teaching and Technology, Open University of the Netherlands, Valkenburgerweg 177, 6419 AT Heerlen, The Netherlands

Email: daniele.dimitri@ou.nl
Jan Schneider, Marcus Specht, and Hendrik Drachsler

Welten Institute ‐ Research Centre for Learning, Teaching and Technology, Open University of the Netherlands, Heerlen, The Netherlands
First published: 23 July 2018

Abstract

Multimodality in learning analytics and learning science is under the spotlight. The landscape of sensors and wearable trackers that can be used for learning support is evolving rapidly, as are the methods for data collection and analysis. Multimodal data can now be collected and processed in real time at an unprecedented scale. With sensors, it is possible to capture observable events of the learning process, such as the learner's behaviour and the learning context. The learning process, however, also consists of latent attributes, such as the learner's cognitions or emotions. These attributes are unobservable to sensors and need to be elicited through human-driven interpretation. We conducted a literature survey of experiments using multimodal data to frame the young research field of multimodal learning analytics. The survey explored the multimodal data used in related studies (the input space) and the learning theories selected (the hypothesis space). The survey led to the formulation of the Multimodal Learning Analytics Model, whose main objectives are (O1) mapping the use of multimodal data to enhance the feedback in a learning context; (O2) showing how to combine machine learning with multimodal data; and (O3) aligning the terminology used in the fields of machine learning and learning science.

Lay Description

What is already known about this topic:

  • Multimodal data can capture fine‐grained measurements of educational traces.
  • Many sensors can now be used in the domain of education to collect data.
  • These data are records of learning and can be used to investigate it.
  • Learning happens across physical and digital spaces.

What this paper adds:

  • It reports the results of a literature survey in the field of multimodal learning analytics.
  • It provides a taxonomy that organizes, for the first time, the different modalities in learning from a sensor perspective.
  • It introduces the concept of the observability line.
  • It explains how machine learning can be used on multimodal data to improve learning.
  • It aligns the terminologies used by the learning science and the machine learning communities.

Implications for practice and/or policy:

  • The model proposed can be used in future multimodal learning analytics research to enhance feedback for learners.
  • The feedback provided to the learner can become more adaptive and therefore make the learning more effective.
  • The multimodal learning analytics community can profit from a shared understanding of how to use multimodal data for learning.
  • The model supports continuous assessment, which may in the future replace traditional examinations.

1 INTRODUCTION

With the rise of data-driven techniques such as learning analytics for discovering insights and generating predictions from the learning process, the need for 360° data about learners has grown consistently. Combining data from multiple sources has become a prominent necessity in learning research and has led to an increased interest in multimodality and, consequently, in multimodal data analysis. To clarify the concept of multimodality, we use the definition provided by Nigay and Coutaz. The term "multi" refers to "more than one", whereas the term "modal" stands both for "modality" and for "mode". The modality is the type of communication channel used by two agents to convey and acquire information, and it defines the data exchange. The mode is the state that determines the context in which the information is interpreted (Nigay & Coutaz, 1993). The reasons why multimodality in learning is drawing so much attention can be summarized according to four developments.

First of all, multimodality is a consolidated theory. It has been under investigation for over two decades in different fields, including functional linguistics, conversational analysis, and social semiotics (Jewitt, Bezemer, & O'Halloran, 2016). Research in multimodal interaction has investigated how different modalities interact and complement each other to convey and densify meaning (Norris, 2004). Experiments using multimodal data in learning scenarios also date back to the early 90s. In 1993, Ambady and Rosenthal found that observers were able to predict college teachers' end-of-semester evaluations from "thin slices" of interaction, that is, by watching their physical and non-verbal behaviour in short video clips (Ambady & Rosenthal, 1993). These early findings paved the way for a new research hypothesis: the possibility of inferring cognitive and social processes from multiple data sources using social signal processing (Poggi & Errico, 2012).

Second, multimodal tracking has recently become more feasible. This is due to recent technological developments such as the Internet of Things, wearable sensors, cloud data storage, and the increased computational power available for processing and analysing big data sets. To date, sensors can be used to gather high-frequency and fine-grained measurements of micro-level behavioural events such as movement, speech, body language, or physiological responses. The Internet of Things approach, that is, connecting sensors to physical world objects or to human bodies, allows computers to take measurements of the physical world as well as of physiological phenomena, encoding them into machine-interpretable data.

Third, modelling across physical and digital worlds is a rising need. A general "call for multimodality" has been fostered in the computer-supported collaborative learning and learning with interactive surfaces communities (Schneider & Blikstein, 2015). Multimodal data systems are needed to link digital and physical interactions and shed light on collaborative learning and collective sense-making (Martinez, Collins, Kay, & Yacef, 2011; Pijeira-Díaz, Drachsler, Järvelä, & Kirschner, 2016). Sensors and wearable trackers can be used in learning settings to collect attributes of face-to-face physical interactions between learners, such as speech, body movement, and gestures. These bodily micro-actions can be combined with digital interactions recorded with tabletops and stored in log files. A similar need exists in the Learning Analytics & Knowledge community, for achieving a more complete picture of the learning process. This need originates from the fact that traditional data sources, like logs, clickstreams, and content interactions taking place within the learning management system, only represent a small proportion of the learning activities and not the whole learning process (Pardo & Kloos, 2011). Multimodal data, in summary, can mitigate the streetlight effect,1 by adding more streetlights that expand the visible area and complete the learner's digital profile in the computer (Heckmann, 2005).

Finally, the multimodal approach is more aligned with the nature of human communication. The use of multiple modalities in human communication is redundant and complementary (Calvo, D'Mello, Gratch, & Kappas, 2015). This also holds when humans interact with computers. Humans communicate their intentions and emotions using multiple modalities such as facial expression, voice intonation, or body movements. When analysing incomplete data sets, especially those with missing data (e.g., due to hardware failures), the information overlap across multiple modalities is convenient because it allows the overall meaning to be preserved (Bosch, Chen, Baker, Shute, & D'Mello, 2015).

The developments described here paved the way for a new approach to data-driven learning support: multimodal learning analytics (MMLA; Blikstein, 2013). MMLA is a research field located at the crossroads between learning science and machine learning. MMLA leverages the advances in multimodal data capture and signal processing to investigate learning in complex learning environments (Ochoa & Worsley, 2016). MMLA can establish a bridge between complex learning behaviour and learning theories (Worsley, 2014). MMLA can offer new insights into learning spaces and tasks in which learners have open choices to differentiate their learning trajectories, by facilitating the provision of feedback (Blikstein, 2013).

Despite the increased interest that the MMLA research field is receiving, it still remains a new kind of "data geology," which faces several challenges. Some of these challenges are inherited from the complex and multiform nature of multimodal data. In this respect, the most relevant multimodal data challenges were described by Lahat, Adali, and Jutten (2015) and include high dimensionality, different modality resolutions, noise, missing data, data fusion techniques, and the choice of the right model.

MMLA also faces challenges specific to its application domain of education and learning. In this paper, we aim to provide an overview of the MMLA field and its challenges. First, we propose a classification framework for MMLA research consisting of an input space and a hypothesis space divided by the observability line. Thereafter, we conducted a literature survey in which we explored MMLA empirical studies (Section 2) and further operationalized the input and the hypothesis spaces. The literature survey helped to identify three main challenges in the field of MMLA: (C1) There is a lack of understanding of how multimodal data relate to learning and how these data can be used to support learners in achieving their learning goals; (C2) it is still unclear how to combine human and machine interpretations of multimodal data; and (C3) the fields of machine learning and learning science use different terminologies that are ambiguous and need to be aligned. The surveyed literature allowed us to go a step further and address these challenges by introducing the Multimodal Learning Analytics Model (MLeAM, Section 3). MLeAM was designed to fulfil three objectives: (O1) mapping the use of multimodal data to enhance the feedback in a learning context; (O2) showing how to combine machine learning with multimodal data; and (O3) aligning the terminology used in the fields of machine learning and learning science.

2 LITERATURE SURVEY

In this section, we first describe the classification framework (Section 2.1) used to conduct the literature survey in the field of MMLA. We detail the two main components of the classification framework: the input space (Section 2.1.1) and the hypothesis space (Section 2.1.2). In Section 2.2, we describe the selection process and criteria adopted to identify the relevant articles. In Section 2.3, we present the results of the survey by proposing the taxonomy of multimodal data for learning (Section 2.3.1) and the classification table of the hypothesis space (Section 2.3.2). Lastly, in Section 2.4, we discuss the results, and we draw the conclusions in terms of future challenges for the MMLA community.

2.1 Classification framework

Some aspects of the learning process, such as the learner's behaviour, can be directly observed and measured by means of sensors. Other aspects, such as the learner's cognition or emotions, are latent attributes that cannot be directly measured by sensors and thus can only be inferred. For our literature survey, we named these aspects the input space and the hypothesis space, a distinction widely used in machine learning. In the case of human learning, the input space includes, for example, the learner's behaviour and the learning context. These aspects of learning can be captured automatically as multimodal data. It is relevant to point out that sensors have a different viewpoint than humans: sensors are not capable of making interpretations or assigning meaning to the data they collect. The hypothesis space encompasses the range of possible interpretations, that is, attributes not directly observable by sensors but that can also be expressed as data. The hypothesis space includes semantic interpretations of the multimodal data, which can be based on psychological and learning-related constructs such as emotions, beliefs, motivation, cognition, or learning outcomes. These attributes belong to the learner's sense-making process, which in classroom activities remains invisible to educators and researchers (Kim, Meltzer, Salehi, & Blikstein, 2011).

The input and hypothesis spaces are therefore conceptually separated by the observability line: a line of separation between the observable evidence and all the possible interpretations. The attributes of both spaces are facets of the same iceberg: the ones "above the water line" are noticeable from the point of view of a generic sensor, whereas the attributes "underwater" require multiple levels of interpretation, depending on how deep below the observability line they stand. The distinction between observable and unobservable is conceptual and can vary in practice. Figure 1 presents one possible instantiation of this concept. The distinction is useful when employing sensors and using machine-guided interpretations. For computers, the interpretation process, that is, moving from the input to the hypothesis space, becomes increasingly difficult the deeper the attribute lies.

Figure 1. The observability line: the multimodal data can capture only the observable attributes

Although input and hypothesis spaces are separated for computers and sensors, they are tightly intertwined for humans. Humans can interpret behavioural cues by reasoning and drawing conclusions, for example, that yawning corresponds to boredom or tiredness. Psychological and educational theories tell us how these relationships can be drawn. For example, the affective-behaviour-cognition theory connects observed behaviour with emotions and cognition (Ostrom, 1969). Similarly, Damasio proposed the idea of "somatic markers", special instances of feelings in the body associated with emotions, such as a rapid heartbeat associated with anxiety or nausea associated with disgust (Damasio, Tranel, & Damasio, 1991). At the biological level, the process of self-regulation as a response to physical and external demands is known as homoeostasis, which supports the idea of the human body working as a complex system. An example of homoeostasis relevant to learning is arousal, the degree of physiological activation and responsiveness caused by a situation or collaborative activity (de Lecea, Carter, & Adamantidis, 2012). Low arousal indicates a physiological state harmful for learning, such as frustration or boredom, whereas high arousal indicates an active or responsive mode that is supportive for learning (Bjork, Dunlosky, & Kornell, 2013; Pijeira-Díaz, Drachsler, Kirschner, & Järvelä, 2018).

2.1.1 Input space: Multimodal data

Learning is a complex and multidimensional process (Wong, 2012). Defining the input space, that is, identifying the relevant modalities and extracting informative attributes, is not a trivial task. To facilitate this task, we expand the initial notion of multimodal data for learning by describing their distinctive features.

An important requirement to be fulfilled is that the modalities must be periodically measurable. To explain this, we pick the counterexample of biomarker testing, extensively employed in medicine (Koh & Jeyaratnam, 1998). By analysing samples of blood, body fluids, or tissue, biomarker tests can be used to investigate the genomic structure, the presence of molecules, or the concentration of hormones like dopamine or norepinephrine (noradrenaline). The presence or lack of one of these substances can indicate a potential disease or a certain body state. The way these tests are conducted does not allow for continuous measurement and monitoring: for this reason, these dimensions are out of the scope of multimodal data analytics.

The modalities belong to the input space and can be either endogenous or exogenous (behaviour vs. context), depending on whether they describe the learner's behaviour or the learning environment affordances that are external but might influence the learning process. The behavioural modalities can be divided into motoric and physiological. Motoric modalities are movements and describe events mainly governed by the somatic nervous system and actuated by the muscles and the skeleton. These modalities are generally deliberate; they should be seen as random events, as there exists no evident correlation between consecutive values. Conversely, the physiological modalities, which are governed by the autonomic nervous system, are generally involuntary; their role is to support self-regulation, and they should be seen as continuous events. An example is the cardiovascular activity controlled by the heart: the value of the heart rate at one time point depends on the previous values and, for this reason, must fall within a range. The division between intentional and unintentional events is, however, not as black and white as it seems. Anderson (2002) illustrates how humans have different levels of cognition, such as biological, cognitive, rational, or social, and human actions can be classified according to these levels depending on the timescale on which they take place. The reaction time for an action can span from microseconds for biological reactions to minutes, hours, days, or weeks for social actions. The rational and social actions are pondered; they require enough time to go through different layers of consciousness and, for this reason, are associated with a higher level of intentionality. One example is standing in a very hot room with closed windows: a common unintentional biological reaction is starting to sweat, whereas a common rational action is opening the window. Both actions and reactions can be considered self-regulatory.

At the biological level, the Neurovisceral Integration Model supports the idea of coordination: it shows how the human body works as a complex interconnected system, adapting its functioning according to the stimuli it receives and the goals it wants to reach (Thayer, Hansen, Saus-Rose, & Johnsen, 2009). For example, the mind under effort is associated with physiological arousal and therefore with increased heart rate. The discipline that studies the correlations between physiological activity and psychological states, including cognitive and emotional phenomena, is called psychophysiology. Cacioppo, Tassinary, and Berntson (2000) found interesting correlations between heart rate accelerations and emotions such as anger, fear, and sadness. For example, an increase in heart rate variability (HRV) is correlated with joy and amusement, whereas a decrease in HRV is correlated with happiness.

Another distinction that can be made is between verbal and non-verbal modalities. Non-verbal expressions are thought to make up to 93% of the meaning during face-to-face communication and social interaction (Mehrabian, 1971). In particular, kinesics, commonly referred to as body language and physical appearance, is thought to play an important role especially during learning. Teachers, for instance, often use kinesics to reinforce the meaning of their words (Leong, Chen, Feng, Lee, & Mulholland, 2015). Verbal modalities, on the other hand, use natural language as communication and, for this reason, have a much higher interpretation complexity. For an intelligent computer, it is far more complicated to make sense of what a person is saying (or writing, or drawing) than of how she is saying it. The surveyed studies using speech modalities focus on prosodic features rather than discourse analysis.

2.1.2 Hypothesis space: Learning theories

The hypothesis space, a term widely used in inductive logic and in machine learning, specifies the range of possible states of a phenomenon. In the case of the MMLA field, the hypothesis space lists all the possible interpretations that can be assigned to the observed learning process, driven by validated learning theories or by psychological constructs. One state in the hypothesis space is a unique value combination of the attributes describing a phenomenon. The learning states are represented in data through the learning labels. The learning labels are typically assigned by human inference to specific time intervals of multimodal data recordings. The act of repeatedly assigning learning labels to multimodal data intervals is called annotation. The annotation is often the only way to provide a baseline for multimodal data, that is, the ground-truth values that will be used to train the machine learning models and test their accuracy. A careful definition of the hypothesis space weighs heavily on the success of the data-driven solution. Defining the hypothesis space consists of three steps: (a) defining actionable components; (b) selecting the most appropriate data representation for the learning labels; and (c) devising an annotation strategy.

Defining actionable components for the hypothesis space

The size of the hypothesis space is proportional to its descriptive power, that is, the number of possible interpretations that it describes, but it is inversely proportional to its generalizability. This is the well-known bias-variance trade-off (Friedman, 1997). One good heuristic for deciding on the most useful hypothesis space is thinking in terms of actionability. The predicted state in the hypothesis space should support the design of valuable and actionable feedback for the learner. Hence, the hypothesis space specification must be guided by the question: "what is relevant for the learner to know to improve the performance?" The answer to this question is not trivial and can be properly addressed with careful feedback design (Hattie & Timperley, 2007). The machine learning models, alongside predicting the learning labels in the hypothesis space, can contribute by determining the attribute importance, that is, the extent to which each attribute (e.g., a modality) contributes to predicting the learning labels; this can be used for targeted suggestions. Multimodal data can also provide historical records of values and shed light on changes over time in both the input and the hypothesis spaces. Predictions, attribute importance, and the historical multimodal records are three integrative elements that can enhance the learner's feedback.
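To make the notion of attribute importance concrete, the following minimal sketch (our illustration; the attribute names and values are invented) trains a random forest on a handful of hypothetical multimodal attributes and reports how much each one weighs in predicting a learning label:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical multimodal attributes, one row per annotated time window.
data = pd.DataFrame({
    "heart_rate":    [72, 95, 88, 110, 65, 102],      # physiological
    "speech_pitch":  [180, 220, 210, 250, 170, 240],  # verbal/prosodic
    "gbm_intensity": [0.1, 0.7, 0.5, 0.9, 0.2, 0.8],  # motoric
})
labels = ["focused", "not_focused", "focused",
          "not_focused", "focused", "not_focused"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(data, labels)

# Attribute importance: the extent to which each modality-derived
# attribute contributes to predicting the learning label.
for name, importance in zip(data.columns, model.feature_importances_):
    print(f"{name}: {importance:.2f}")
```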

Data representation of the learning labels

From a data representation point of view, the learning labels of the hypothesis space can be represented as binary variables (e.g., focused vs. not focused), on a numerical scale, or as discrete categories (e.g., bored, engaged, and confused). The number of required learning labels depends on the size of the input space, that is, the number of attributes selected from the multiple modalities. In general, the number of labels required to properly run supervised machine learning is still quite high, for example, thousands of labels per individual learner. Many researchers in the machine learning field are currently investigating techniques based on transfer learning to mitigate this label requirement, for example, pretraining with unlabelled data (Pan & Yang, 2010). The annotation interval can also vary, from 10 s to hours.
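As a small illustration of these representation choices (the labels and scores below are hypothetical), categorical labels can be encoded as integers for a classifier, while a numerical scale can optionally be binarized with a threshold:

```python
from sklearn.preprocessing import LabelEncoder

# Discrete categorical labels, e.g., from an affect annotation scheme.
labels = ["bored", "engaged", "confused", "engaged", "bored"]
encoded = LabelEncoder().fit_transform(labels)  # [0, 2, 1, 2, 0]

# Numerical labels, e.g., Flow reported on a 0-100 scale, need no
# encoding but can be binarized if a binary classifier is preferred.
flow_scores = [35, 80, 60, 90, 20]
binary = [1 if score >= 50 else 0 for score in flow_scores]
print(list(encoded), binary)
```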

Annotation strategy

Generally, there are two approaches for annotating multimodal data recordings: the first is asking experts to provide the learning labels, and the second is asking the learner to fill in self-reports on a regular or random basis. Both approaches come with their own set of pros and cons, and both are subject to bias. One advantage of using external experts is that they do not interfere with the natural flow of task execution during learning; the disadvantage is that experts are expensive and hard to organize. Self-reports, instead, produce imbalanced class distributions (Hussain, Monkaresi, & Calvo, 2012), which require some down-sampling approach and therefore mean losing data. Self-reports, however, can be given in the moment, leveraging short-term memory and, for this reason, producing more trustworthy reports compared with retrospective ratings (Edwards et al., 2017).
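A minimal sketch of the down-sampling approach mentioned above, assuming the annotated instances live in a pandas DataFrame with hypothetical columns; every class is randomly under-sampled to the size of the rarest one, which balances the distribution at the cost of discarding data:

```python
import pandas as pd

# Hypothetical annotated data set with an imbalanced class distribution.
df = pd.DataFrame({"label": ["engaged"] * 90 + ["bored"] * 10,
                   "heart_rate": range(100)})

# Down-sample every class to the size of the rarest one.
minority_size = df["label"].value_counts().min()
balanced = (df.groupby("label", group_keys=False)
              .apply(lambda g: g.sample(minority_size, random_state=0)))
print(balanced["label"].value_counts())  # 10 instances per class
```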

2.2 Literature survey selection process

Using the concept of the observability line, we conducted a literature survey of empirical studies in the field of MMLA. The survey was aimed first at discovering the most frequent modalities and learning theories used in MMLA research, and therefore the existing patterns and commonalities in the definition of the input and hypothesis spaces. In this survey, we identified representative MMLA studies, and we used them to specify our conceptual model for MMLA (see Section 3). The selected articles were found by going through all the papers of the last 5 years of Learning Analytics & Knowledge conference proceedings (2014–2018), the six editions of the MMLA Data Challenge workshop series (2013–2018), the Learning Analytics Across Physical and Digital Spaces workshop series (2016–2018), and additional publications by influential researchers in the MMLA field. We filtered the retrieved studies by applying two selection criteria: (a) the data set analysed in the study was generated using more than one modality and (b) the multimodal data were linked to a clear learning theory. We obtained a subset of 20 empirical studies fulfilling these criteria. We consider this number sufficient for getting an overview of the field; we foresee, however, an increase of similar studies in the future.

2.3 Results of the literature survey

Following the description of the input space (Section 2.1.1) and the hypothesis space (Section 2.1.2), we further operationalize both spaces with insights gained from the literature survey: in Section 2.3.1, the Taxonomy of multimodal data for learning and in Section 2.3.2, the Classification table of the hypothesis space.

2.3.1 Taxonomy of multimodal data for learning

The Taxonomy of multimodal data for learning is a first approach to organizing the complexity of the observable modalities (input space) that can be monitored by sensors and are mentioned in the surveyed studies. This taxonomy is not meant to be an exhaustive classification of the modalities for learning or a technical review of different sensor types. For the latter, we refer to the review by Schneider, Börner, van Rosmalen, and Specht (2015a), which provides an extensive list of sensors that can be applied in the domain of education.

The taxonomy is presented from the perspective of a generic sensor. The underlying idea is that a sensor can monitor one (or multiple) modalities. We consider here the modality as a measurable property belonging to a specific part of the body or the context. The modalities are communicated through signal channels. Continuous sampling of a signal channel leads to the longitudinal collection of one (or multiple) modalities. For instance, a microphone (sensor) can sample the voice (channel) to detect speech (modality), or a video camera can track voice, movements, and facial traits at the same time and therefore provide speech, gross body movements (GBMs), and facial expressions. To give an overview of the proposed taxonomy, we analyse its two main branches, (a) the behavioural motoric and (b) the behavioural physiological modalities, by providing meaningful examples found in the surveyed literature of multimodal experiments. For the third main branch, (c) the contextual modalities, we refer to the work of Zimmermann, Specht, and Lorenz (2005), who propose a framework for context-aware systems in ubiquitous computing that combines personalization and contextualization.
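The sensor → channel → modality relationship can be made concrete with a small data structure; the sketch below is only an illustration of the taxonomy's vocabulary, with example entries drawn from the text:

```python
from dataclasses import dataclass

@dataclass
class Modality:
    name: str      # measurable property, e.g., "speech"
    channel: str   # signal channel it travels on, e.g., "voice"
    category: str  # "motoric", "physiological", or "contextual"

@dataclass
class Sensor:
    name: str
    modalities: list  # the modalities this sensor can monitor

# A microphone samples one channel; a video camera serves several
# modalities at once, as described above.
microphone = Sensor("microphone", [Modality("speech", "voice", "motoric")])
camera = Sensor("video camera", [
    Modality("speech", "voice", "motoric"),
    Modality("gross body movement", "body", "motoric"),
    Modality("facial expression", "face", "motoric"),
])
```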

For simplicity, the motoric modalities can be split between those concerning the "body" and those concerning the "head." The subcategory body comprises the torso, legs, arms, and hands. The movements of the torso can provide GBM, which is typically derived from video cameras. GBM was used by Raca and Dillenbourg (2014) in their study assessing students' attention from their body posture, gesturing, and other cues. Similarly, Bosch et al. (2015) used GBM to detect learners' emotions in combination with facial expression and learning activity. Although movements of the legs can be tracked with step counters and provide good indicators of physical activity, arms and hands are body parts richer in meaning. Movements of the arms can be detected by video cameras; a popular choice, in this case, is the Microsoft Kinect for gesture and body posture recognition. Several studies in the survey opted for this solution, especially those focusing on presentation skills (Barmaki & Hughes, 2015; Echeverría, Avendaño, Chiluiza, Vásquez, & Ochoa, 2014; Schneider, Börner, van Rosmalen, & Specht, 2015b). Alternatively, arm movements and gestures can be traced with electromyographic (EMG) sensors: Hussain et al. (2012), for instance, used EMG in their study on emotion detection. Finally, the hands are probably the parts of the body that can provide the best insights into the learner's activity: hand movements can be traced in search of specific hand signs or to track the handling of objects, as well as pen strokes or drawings. For instance, Oviatt, Cohen, Weibel, Hang, and Thompson (2013) gathered a data set known as the Math Data Corpus, in which they analysed pen strokes combined with modalities captured from video and speech records in group settings, with the aim of distinguishing expert from non-expert students.

The motoric modalities of the head include facial expressions, eye movements, and speech. These three sources can provide information so relevant that three well-established research communities are dedicated to advancing the techniques and methodologies for their data acquisition. Facial expressions are highly investigated for emotion recognition in affective computing research and have been quite extensively used in multimodal human–computer interaction experiments (e.g., Alyuz et al., 2016; Bosch et al., 2015; Hussain et al., 2012). Eye-tracking, commonly used as an indicator of learners' attention, has also been used with multimodal data sets (Edwards et al., 2017; Prieto, Sharma, Dillenbourg, & Rodríguez-Triana, 2016; Raca & Dillenbourg, 2014). Finally, the analysis of speech spans from paralanguage analysis, like speaking time, keywords pronounced, or prosodic features like tone and pitch (e.g., Prieto et al., 2016), to the actual recognition of spoken words in dialogic settings like student–teacher interactions (D'Mello et al., 2015). In theory, speech recognition opens up the possibility to transcribe discourse and use natural language processing to look for deeper-level semantic interpretations. In practice, due to its high technical complexity, discourse analysis is a frontier that we envision for multimodal learning but that has not yet been explored in related work.

The physiological modalities can also be divided by the corresponding body parts. The heart, brain, and skin are the main organs from which it is possible to derive physiological information. The most popular approach to detecting brain activity is the electroencephalogram (EEG), which measures differences of electrical potential in the brain. EEG was used by Prieto et al. (2016), in combination with eye tracking, from a teacher analytics perspective to predict the social plane of interaction and the concrete teaching activity. Different techniques can be used to measure heart activity, such as heart rate and HRV: the electrocardiogram (ECG) or photoplethysmography (PPG). Galvanic skin response (GSR), also referred to as electrodermal activity (EDA), is the measure of the electrical conductance of the skin. If the body receives stimuli that are physiologically arousing, the skin conductance increases. Arousal is widely considered to be one of the two main dimensions of an emotional response. Alzoubi, D'Mello, and Calvo (2012) used the combination of EEG, ECG, and GSR to detect naturalistic expressions of affect. EDA was used by Pijeira-Díaz et al. (2016) in combination with blood volume pulse (BVP), heart rate, skin temperature, and pupil size. Heart rate was used by Di Mitri et al. (2017) to predict Flow in combination with steps and activity data. Edwards et al. (2017) used EDA to detect the presence and lack of attention. Hussain et al. (2012) combined ECG, EMG, EDA, and respiration with video features to predict emotions. Grafsgaard, Wiggins, Boyer, Wiebe, and Lester (2014) also used multimodal analysis to predict emotions, combining EDA (skin conductance) with facial expression derived from video, gestures, and posture.
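As an illustration of how such heart measurements are derived (a standard computation with hypothetical input values, not code from the surveyed studies), heart rate and RMSSD, a common HRV metric, can be computed from the inter-beat (RR) intervals extracted from an ECG or PPG signal:

```python
import math

# Hypothetical RR intervals in milliseconds (time between heartbeats).
rr_intervals = [810, 790, 845, 800, 830, 795, 820]

# Mean heart rate in beats per minute.
mean_rr = sum(rr_intervals) / len(rr_intervals)
heart_rate = 60_000 / mean_rr

# RMSSD: root mean square of successive differences between RR
# intervals, a widely used time-domain HRV metric.
diffs = [b - a for a, b in zip(rr_intervals, rr_intervals[1:])]
rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))

print(f"HR: {heart_rate:.1f} bpm, RMSSD: {rmssd:.1f} ms")
```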

2.3.2 Classification table of the hypothesis space

Table 1 provides a summary of the learning theories found in the selected studies that used multimodal data. The table classifies the studies according to the chosen theoretical construct, hypothesis space specification, data representation type, and annotation method, and it provides a reference to the study.

Table 1. Classification table of the hypothesis space

| Construct | Hypothesis space | Representation type | Annotation method | Used by |
|---|---|---|---|---|
| Emotions in learning | Low, medium, and high valence | Numerical | Self-reports with video records | Hussain, Monkaresi, and Calvo (2012) |
| | Satisfied, bored, and confused | Categorical | N.A. | Alyuz et al. (2016) |
| | Boredom, confusion, curiosity, delight, flow, and surprise | Categorical | Self-reports | Alzoubi, D'Mello, and Calvo (2012) |
| | Confidence, frustration, excitement, and interest | Categorical | Self-reports | Arroyo et al. (2009) |
| | Boredom, confusion, delight, engaged concentration, and frustration | Categorical | N.A. | Bosch, Chen, Baker, Shute, and D'Mello (2015) |
| | Happiness, sadness, surprise, fear, disgust, anger, and neutral | Categorical | Mimicking | Bahreini, Nadolski, and Westera (2015) |
| | Engagement, frustration, and learning | Categorical | Self-reports | Grafsgaard, Wiggins, Boyer, Wiebe, and Lester (2014) |
| Flow | Flow 0 to 100 | Numerical | Self-reports | Di Mitri et al. (2017) |
| Attention | N.A. | Categorical | Self-reports | Raca and Dillenbourg (2014) |
| Relevance of the lecture | N.A. | Categorical | Self-reports | Raca and Dillenbourg (2014) |
| Action codes | Build, Plan, Test, Adjust, and Undo | Categorical | Clustering | Worsley and Blikstein (2013) |
| Activity types | N.A. | N.A. | Expert | Prieto, Sharma, Dillenbourg, and Rodríguez-Triana (2016) |
| Social plane of interaction | N.A. | N.A. | Expert | Prieto et al. (2016) |
| Expertise | Expert and non-expert; high, medium, and low | Categorical ordinal | N.A. | Ochoa et al. (2013) and Worsley and Blikstein (2013) |
| Activity performance | Good and bad | Categorical | N.A. | Echeverría, Avendaño, Chiluiza, Vásquez, and Ochoa (2014) |
| Cognitive load | Low and high | Categorical ordinal | Expert | Eveleigh et al. (2010) |

Note. N.A.: not applicable.

The most advanced studies using multimodal data focus on predicting emotions. Emotions are considered readouts of physiological changes in the body, occurring as responses to certain stimuli. According to the Somatic Marker Hypothesis, physiological changes occur in the body and are passed to the brain, where they are interpreted as emotions (Damasio et al., 1991). People adapt to their environment and to emotional stimuli via autonomic nervous system responses (Kemper & Lazarus, 1992). It is therefore possible to correlate certain autonomic nervous system activity with emotional states. Emotions are also thought to play an important role in learning (Boekaerts, 2010). Typical emotions during learning are confusion, boredom, engagement, curiosity, interest, surprise, delight, anxiety, and frustration (Hussain et al., 2012). D'Mello (2013) provided a meta-analysis of the incidence of emotions during learning.

Another psychological construct used is Flow, a mental state of operation that individuals experience whenever they are immersed in a state of energized focus, enjoyment, and full involvement with their current activity. "Being in the flow" means feeling complete absorption in the current activity and being fed by intrinsic motivation rather than extrinsic rewards (Csikszentmihalyi, 1997). Flow naturally occurs whenever there is a balance between the level of difficulty of the task and the level of preparation of the individual for the given activity.

Another construct found in the literature is cognitive load, which refers to the demands placed on working memory during learning: too little load fails to engage learners sufficiently, whereas too much load overruns the capacity of working memory (Van Merriënboer & Sweller, 2005). Eveleigh et al. (2010) measured the cognitive load of basketball players using speech during think-aloud protocols, with external experts annotating low or high cognitive load on a 9-point Likert scale.

Epistemological frames are a way of understanding student reasoning and relate to the student's motivation toward the learning activity. Examples of these frames are hesitant, calm, and active (Andrade, 2017) and talk, flow, action, and stress (Worsley & Blikstein, 2015). Similar frames were named action codes by Worsley and Blikstein (2013), who aimed to develop a system that, based on speech and gesture recognition, would be able to detect three levels of expertise in construction building.

2.4 Discussion

This literature survey deepens the knowledge about the modalities for learning and the learning theories, and about how these were operationalized in the learning scenarios investigated in related studies. Alongside the "Taxonomy of multimodal data for learning" and the "Classification table of the hypothesis space," we identified three main challenges for MMLA revealed by the literature survey.

First of all, analysing the literature according to the proposed observability line (Section 2.1) evidenced that the MMLA community has not yet clarified how multimodal data can ultimately support learners in their learning process. None of the studies describes how multimodal data can be used to provide actionable feedback or even an intervention to learners. Hence, the first challenge identified is that (C1) there is a lack of understanding of how multimodal data relate to learning and how these data can be used to support learners in achieving their learning goals.

Second, we noticed that generating analytics with multimodal data and letting humans (learners and teachers) make sense of them is increasingly complex. Raw multimodal data are generally very noisy and have a large number of attributes and a low semantic value (Dillenbourg, 2016). When the number of attributes in the data set increases, the data become hard for humans to visualize and interpret. In contrast, intelligent computer agents are able to deal more efficiently with multimodal data and can be employed to process vast amounts of data at scale and be trained to perform interpretations. Therefore, the second challenge is that (C2) it is still unclear how to combine human and machine interpretations of multimodal data.

Third, MMLA is a field located at the intersection of different disciplines, including learning science, machine learning, and social signal processing. We have noticed that learning science and machine learning talk differently about "learning", which results in very ambiguous meanings and less fruitful discussions. The third challenge identified is that (C3) the fields of machine learning and learning science use different terminologies, which are ambiguous and need to be aligned.

3 THE MULTIMODAL LEARNING ANALYTICS MODEL

To address the challenges found by the literature survey (Section 2.4), we introduce the MLeAM, a conceptual model for the emerging research field of MMLA.

The design of MLeAM originates from the necessity to make optimal use of multimodal data for supporting learning activities through intelligent tutoring and learning analytics. The intended MLeAM contributions are framed into the following three main objectives, respectively addressing the three challenges described in Section 2.4.

The first objective of MLeAM (O1) is to map the use of multimodal data to enhance the feedback in a learning context. Although other conceptual models have been proposed, such as, for example, the Learning Analytics Framework (Greller & Drachsler, 2012), until today, no conceptual model for learning has been specifically designed to deal with multimodal data. MLeAM can therefore provide more structure to drive further research in the new field of MMLA and help researchers to design future experiments. With such a structured approach, the community can better identify and describe major challenges that can then be addressed by independent research teams globally.

The second objective (O2) is to show how to combine machine learning with multimodal data. Multimodal data have the potential to provide a digital representation of the physical world in a way that both humans and artificial agents can process. The MLeAM shows explicitly for the MMLA community how to best combine human interpretations with machine learning and automatic inference.

The third objective (O3) is to establish a joint terminology across the two main scientific disciplines that the MMLA field combines: learning science and machine learning. With MLeAM, we hope to establish a shared MMLA terminology that is meaningful for educational researchers but also conveys well-established terms from the educational world to the machine learning community.

To address these objectives, we propose the MLeAM represented in Figure 2. Along with the observability line (Section 2.1) separating the input and hypothesis spaces, MLeAM introduces a second, orthogonal dimension: the mixed reality line. Mixed reality is defined as the contiguous space where the physical and digital worlds meet (Milgram, Takemura, Utsumi, & Kishino, 1994). We believe that the separation between the physical and digital worlds helps to understand the benefits that intelligent computer agents and digital technologies can bring to the learning process. The behaviour of the learners and the feedback transmitted to them happen in the physical world. The multimodal data representation of the modalities, as well as their processing and annotation, lives in the digital world. The intersection between the observability line and the mixed reality line creates four quadrants, as represented in Figure 2. The transition between these quadrants is guided by a process ("P") that generates a result ("R"). The model proceeds clockwise iteratively, starting from the top centre.

Figure 2. Multimodal Learning Analytics Model (MLeAM)

3.1 From sensor capturing to multimodal data

The model starts with (P1) sensor capturing. This process consists of automatically sampling sensors that record data from several modalities. The chosen modalities relate to the attributes of the input space (see Section 2.1), such as the learner's body position, gaze direction, and facial expression. These data can be extracted from the learner's behaviour and actions or from the learning environment; in either case, the modalities reside in the physical world. P1 continuously transforms the different modalities into their digital representation: multiform data streams that we call (R1) multimodal data. A transversal cut through the multimodal data streams corresponds to a digital snapshot of the learner in the learning context at one specific time point. There are three important aspects to be considered when designing a P1 implementation: (a) definition of the input space used: heuristic selection of the modalities and their data representation; (b) identification of the most suitable sensors to capture the selected modalities for the specific learning scenario; and (c) design and implementation of a sensor architecture, a hardware and software infrastructure for collecting and serializing the data streams from multiple sensors (Di Mitri et al., 2017). The design of the sensor architecture must take care of several technical aspects, including sensor network engineering, raw data synchronization, fusion techniques, and the data storage logic for sensor data persistence. A similar challenge regarding sensor data collection has also been addressed by Specht (2015) in the AICHE model.
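A minimal sketch of what a P1 implementation could look like, with simulated read-outs standing in for real sensor drivers; each sampling tick serializes one timestamped snapshot of all modalities as a JSON line, a simple instance of the data storage logic mentioned above:

```python
import json
import random
import time

def read_sensors():
    """Stand-in for real sensor drivers; all values are simulated."""
    return {
        "heart_rate": random.randint(60, 120),       # physiological
        "gbm_intensity": round(random.random(), 2),  # motoric
        "noise_level_db": random.randint(30, 70),    # contextual
    }

# (R1) multimodal data: one timestamped snapshot per sampling tick.
with open("multimodal_stream.jsonl", "w") as stream:
    for _ in range(5):
        frame = {"timestamp": time.time(), **read_sensors()}
        stream.write(json.dumps(frame) + "\n")
        time.sleep(0.1)
```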

3.2 From annotation to learning labels

The second process is the (P2) annotation, a repeated procedure driven by a human, such as an expert, or by the learner. P2 aims at enriching the low-semantic multimodal data with human judgments according to some predefined assessment scheme. The scheme is based on the hypothesis space (see Section 2.1.2), that is, the unobservable interpretations that the machine learning algorithms will later derive automatically from the multimodal data. P2 can be seen as the assessment of a learning task in relation to some learning goals. P2 is achieved through triangulation: a judge is exposed to some human-interpretable evidence of the learning task (e.g., videos or direct observation) and assigns (R2) learning labels to time segments of the multimodal data. The P2 annotation process thus provides meaning to specific time intervals of the raw data. Similarly to P1, P2 requires defining all the possible learning labels; this task corresponds to defining the hypothesis space and its data representation. It also requires devising an annotation strategy consisting of a reporting tool and an annotation procedure. The procedure must minimize the interpretation bias to provide the most reliable labels and should take into account the nature of the observed tasks (i.e., the learning context and activities). The minimum number of labels should be decided a priori; this is usually dependent on the estimated number of attributes to be considered in the model.
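A minimal sketch of P2's output, assuming the JSON-lines stream produced by the previous sketch; each hypothetical annotation covers a time interval, and its learning label is attached to every multimodal frame that falls inside it:

```python
import json

# Hypothetical annotations: (start, end, label) in seconds from the
# beginning of the recording.
annotations = [(0.0, 0.25, "focused"), (0.25, 0.6, "distracted")]

def label_for(t, intervals):
    for start, end, label in intervals:
        if start <= t < end:
            return label
    return None  # unannotated segment

with open("multimodal_stream.jsonl") as stream:
    frames = [json.loads(line) for line in stream]

# (R2) learning labels attached to time segments of the raw data.
t0 = frames[0]["timestamp"]
for frame in frames:
    frame["label"] = label_for(frame["timestamp"] - t0, annotations)
```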

3.3 From machine learning to predictions

The third process is (P3) machine learning. The purposes of supervised machine learning are (a) to learn statistical models (functions) from observed (R1) multimodal data and manually annotated (R2) learning labels and (b) to generalize to future, similarly structured, unobserved data to generate (R3) predictions (Mohri, Rostamizadeh, & Talwalkar, 2012). The core machine learning task can be expressed with a mathematical formalism, calculating a function y = f(X) + ε, where

  • X is a multimodal observation, the input of the function f. X is a vector of n attributes <x1, …, xn> derived from the multiple learning modalities. All the possible value combinations of X constitute the input space, the domain of f.
  • y is the learning label(s), which locates each input observation in the hypothesis space, the range of f of all possible learning labels.
  • The function f is a generalization of the relationship between observations X and learning labels y, plus some error term ε.
  • Given a new multimodal observation Xnew, the prediction task corresponds to calculating the learning label(s) ynew = f(Xnew).

P3 also includes the following iterative steps: (a) preprocessing: resampling and handling missing data; (b) fitting the model to the data; (c) post-processing: selecting relevant attributes and tuning the parameters; (d) validating the generalizability of the model on new data; and (e) diagnostics: deriving the importance that each attribute holds in predicting the learning labels. If the obtained model is trained with reasonable accuracy, the system is able to predict the learning labels for unseen multimodal data. This prediction is a machine-assisted estimation of the learner's standpoint in the learning process. P3 uses machines to automate the annotation procedure that otherwise has to be driven by humans. Predictions can be used to enrich the learner model, to make the feedback model more adaptive for the learners, and to nudge them towards positive behavioural change. Both the learner model and the feedback model, as shown in Figure 2, are not part of MLeAM but are connected to and extended by it.
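These steps can be illustrated with a minimal supervised-learning sketch on synthetic data (attribute meanings and values are hypothetical): imputation as preprocessing, model fitting, validation on held-out data, and a prediction for a new observation Xnew:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical multimodal observations X and learning labels y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # e.g., heart rate, pitch, GBM
X[rng.random(X.shape) < 0.05] = np.nan   # simulate missing sensor data
y = (np.nan_to_num(X[:, 0]) + np.nan_to_num(X[:, 1]) > 0).astype(int)

# (a) preprocessing (imputation) and (b) model fitting in one pipeline.
model = make_pipeline(SimpleImputer(strategy="mean"),
                      RandomForestClassifier(random_state=0))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)

# (d) validating the generalizability of the model on new data.
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))

# (R3) prediction for a new multimodal observation Xnew.
X_new = np.array([[0.5, -0.2, 1.1]])
print("predicted learning label:", model.predict(X_new)[0])
```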

3.4 From feedback interpretation to behavioural change

The final process is the (P4) feedback interpretation, which closes the machine-driven feedback loop back to the learner. The purpose of P4 is to exploit the support of the multimodal data and lead to (R4) behavioural change. P4 requires a feedback model that has to be designed in advance. Devising an efficient feedback model is not within the scope of MLeAM (see Figure 2). The feedback model is in fact highly dependent on the learning activity and is defined by the task model. MLeAM does not deal with any of the feedback dimensions (Mory, 2004) and also does not inform about effective feedback strategies, which depend on the learning activity. Nonetheless, MLeAM can supply different models of feedback with relevant, already analysed information about the learners' behaviour and context. Different forms of feedback can be prompted to the learner based on the predictions obtained through MLeAM. The feedback design should facilitate the process of feedback interpretation and lead the learner to some new learning behaviour. Similarly to the (P2) annotation, the P4 feedback interpretation is fully human-driven.

4 CONCLUSIONS

In this paper, we analysed the emerging field of MMLA. In Section 1, we introduced the origins of this new field according to four main developments. We highlighted the main mission of MMLA: using multimodal data and data-driven techniques to fill the gap between observable learning behaviour and learning theories. In Section 2.1, we described two components, the input space and the hypothesis space, separated by the observability line. We used them as a classification framework to conduct a literature survey (Section 2) of MMLA studies. By analysing the related literature, we were able to derive general characteristics of the multimodal data for learning (the input space, Section 2.1.1) and of the learning theories and other constructs (the hypothesis space, Section 2.1.2). As a result of the literature survey, we proposed the Taxonomy of multimodal data for learning (Section 2.3.1, Figure 3) and the Classification table of the hypothesis space (Section 2.3.2, Table 1). The literature survey also unveiled three main challenges for the MMLA field (Section 2.4). We addressed these challenges by introducing the MLeAM (Section 3), a conceptual model to support the emerging field of MMLA. MLeAM has three main objectives: (O1) mapping the use of multimodal data to enhance the feedback in a learning context; (O2) showing how to combine machine learning with multimodal data; and (O3) aligning the terminology used in the fields of machine learning and learning science.

Figure 3. Taxonomy of multimodal data for learning. EMG: electromyogram; ECG: electrocardiogram; PPG: photoplethysmography; EEG: electroencephalogram; GSR: galvanic skin response; GBM: gross body movement; HR: heart rate; HRV: heart rate variability; EOG: electrooculogram; BVP: blood volume pulse; EDA: electrodermal activity; RR: respiration rate

We acknowledge that MLeAM is not to be considered in its final stage. In the future, we aim to extend MLeAM through various activities. First of all, it is important to extend the literature survey, because some aspects were intentionally not covered, for instance, the social dimension of learning, that is, the extent to which the teacher and the learning peers influence each other, for example, during dialogic learning. We encourage readers to contribute to expanding the Taxonomy of multimodal data for learning (Figure 3, available online for comments at http://bit.ly/MLEAMtree) and the Classification table of the hypothesis space (Table 1, available online for comments at http://bit.ly/MLEAMtheory) with further studies using combinations of different modalities and presenting convincing results in terms of accuracy and adaptability to different learning settings.

Further empirical studies and meta-analyses can also focus on the most suitable data representation for each modality; heuristics for the best modality combinations; the best pairing between modalities and commercially available sensors; and guidelines for the data analysis of multimodal data sets (P3 in MLeAM). On this particular point, multimodal data for learning need best practices for achieving real-time time series analysis and classification, in combination with random events and a proper balance between learner specificity and generalization across groups. Baselines for future experiments must be established to avoid reinventing the wheel every time. This could technically be done, on the one hand, by extending the current interoperability standards (e.g., the Experience API—xAPI) to better handle high-frequency sensor data and the consequent data analysis. Meaningful baselines can also be software prototypes, such as the Multimodal Learning Hub (Schneider, Di Mitri, Limbu, & Drachsler, 2018), or hardware prototypes that can be used off-the-shelf for data collection: for instance, the Process Pad (Salehi, Kim, Meltzer, & Blikstein, 2012) or the Multimodal Selfie (Domínguez, Echeverría, Chiluiza, & Ochoa, 2015), two low-cost devices that can be used in classrooms for capturing multimodal data.
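As an illustration of the kind of extension meant here, the sketch below wraps a sensor sample in an xAPI-style statement expressed as a Python dictionary; the overall actor-verb-object structure follows the xAPI specification, but the extension URI and payload layout are hypothetical, not an existing standard:

```python
import json

# Sketch of an xAPI-style statement carrying one multimodal sample.
statement = {
    "actor": {"mbox": "mailto:learner@example.org"},
    "verb": {"id": "http://adlnet.gov/expapi/verbs/experienced"},
    "object": {"id": "http://example.org/activities/presentation-practice"},
    "timestamp": "2018-07-23T10:15:30Z",
    "context": {
        "extensions": {
            # Hypothetical extension URI and payload, for illustration only.
            "http://example.org/xapi/sensor-sample": {
                "heart_rate": 84,
                "speech_pitch_hz": 210,
            }
        }
    },
}
print(json.dumps(statement, indent=2))
```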

Finally, the MLeAM classification evidenced a shortage of studies that focus on feedback and interventions for the learner and their learning process. In particular, more research is needed to investigate feedback systems that use timely predictions generated from multimodal data. We encourage further collaboration with feedback experts to discover what kind of feedback is valuable for the learner and whether it is able to trigger fundamental behavioural changes.

  • 1 The streetlight effect describes the common practice in science of searching for answers (i.e., the lost key) only in places that are easy to explore, that is, under the streetlights (Freedman, 2010).
