Big Data and data science: A critical review of issues for educational research
Abstract
Big Data refers to large and disparate volumes of data generated by people, applications and machines. It is gaining increasing attention from a variety of domains, including education. What are the challenges of engaging with Big Data research in education? This paper identifies a wide range of critical issues that researchers need to consider when working with Big Data in education. The issues identified include diversity in the conception and meaning of Big Data in education, ontological and epistemological disparity, technical challenges, ethics and privacy, the digital divide and digital dividend, and a lack of expertise and academic development opportunities to prepare educational researchers to leverage the opportunities afforded by Big Data. The goal of this paper is to raise awareness of these issues and initiate a dialogue. The paper was inspired partly by insights drawn from the literature but mostly informed by experience of researching Big Data in education.
Practitioner Notes
What is already known about this topic?
- The potential of Big Data to transform education is growing
- Researchers in educational technology and computational sciences, in particular, have generated a reasonable volume of literature on the promise of Big Data and analytics in higher education in influencing teaching, learning, and research
- Over the last 6 years, a number of institutional research projects have focused on the development of tools, systems, and strategies for successful deployment of learning analytics
- Research is also available on challenges of implementing learning analytics systems across institutions.
What this paper adds
- Identifies a broad range of issues that educational researchers need to consider when working with Big Data in educational research.
- Sets the stage for discourses on the development of educational research design with the theoretical and epistemic tools and approaches to Data Science in educational research
- Introduces Data Science as the fourth research methodology tradition in educational research
Implications for practice and policy
- This paper aims to bring awareness of the fundamental issues facing educational researchers in fully leveraging the promise of Big Data
- The paper appeals for an immediate reconceptualization of value and relevance of Big Data in educational research
- Work presented in this paper will prompt institutions to think about creating educational research programs in Data Science that support the successful implementation of Big Data in education
Introduction
Big Data describes a phenomenon involving complex and dynamic growth in data. Researchers conceptualize Big Data along structural and functional dimensions. The structural dimension of Big Data covers elements of volume, velocity, veracity, variety, verification and value (Manyika et al., 2011; Poulovassilis, 2016). The structural diversity and complexity of Big Data are attributed to the emergence of new forms of data generated by sensor networks, social media applications and other mobile and ubiquitous devices (Manyika et al., 2011; Snijders, Matzat, & Reips, 2012; Ward & Barker, 2013). The functional dimension, in turn, describes the use of innovative technologies for capturing, storing, distributing, managing and analyzing large and heterogeneous datasets (Dede, Ho, & Mitros, 2016; Lazer, Kennedy, King, & Vespignani, 2014).
As a new research paradigm, Big Data in education stimulates new ways of framing research questions, designing studies, and analyzing and visualizing data (Daniel, 2015; Dede et al., 2016). With the availability of large amounts of data in education, researchers can investigate subgroups within a population (a particular group of people), without necessarily relying on sophisticated probabilistic methods (Mayer‐Schönberger & Cukier, 2013). Further, Big Data tools enable researchers to collect large amounts of research data at relatively low cost (Mayer‐Schönberger, 2015).
Big Data provides educational researchers with a comprehensive set of tools for manipulating and visualizing data on learning and teaching (Baker & Siemens, 2013; Bhat & Ahmed, 2016). Greer and Mark (2016) propose the use of visualization techniques to identify useful patterns in educational data that may not be obvious for teachers working with conventional statistical approaches. Research has also demonstrated that visualization dashboards can help teachers with limited mathematical knowledge to easily navigate and interpret student data (Bueckle, Ginda, Ranga Suri, & Börner, 2017; Ong, 2015).
The analysis of a large set of educational data can inform the development of predictive models for identifying opportunities and addressing challenges of educational institutions (Daniel & Butson, 2013). It is also argued that insights gained from predictive models can be used to explore student learning trajectories to facilitate the design of adaptive and personalized learning environments (McKenney & Mor, 2015).
Though Big Data in education is a new phenomenon, with the availability of vast amounts of educational data stored in institutional databases (eg, data obtained from social media and learning management systems), educational research is likely to become a data‐intensive field, utilizing methods and techniques from Data Science. Data Science is primarily concerned with the development and use of tools as well as processes for extracting and discerning valuable knowledge from complex data (Leek, 2013; Waller & Fawcett, 2013).
Data Science can provide educational researchers with the structure and principles necessary for tackling complex educational data. It offers a set of fundamental principles that support the extraction of information and knowledge from data (Provost & Fawcett, 2013, p. 52). The application of Data Science principles and techniques in education can yield high‐quality benefits (Klašnja‐Milićević, Ivanović, & Budimac, 2017).
Though the literature on Big Data in education offers educational researchers numerous opportunities, various issues need to be addressed. This paper examines the emerging promise of Big Data in education and identifies a broad range of issues likely to affect the future utilization of Big Data in education.
Related research
The analysis of student data has only become an important phenomenon in education in the last decade (Lodge & Corrin, 2017). However, the use of data to support student learning can be traced back to research on intelligent tutoring systems (ITS) and artificial intelligence in education (AIED) (see Figure 1). Today, the primary purpose of using data in education is to identify strategies for designing better learning environments (Mor, Ferguson, & Wasson, 2015).

Figure 1. A brief overview of educational technology fields related to Big Data
ITS, as shown in Figure 1, utilizes computational approaches to track student learning activities and build diagnostic learner models (Anderson, Boyle, & Reiser, 1985; Brusilovsky, Schwarz, & Weber, 1996; Nwana, 1990). As the need to support diverse and complex students in different forms of learning environments increased, new methods of data gathering and analytics were developed. Researchers in AIED and educational data mining (EDM) in particular have proposed various student modeling techniques (eg, Bayesian Networks, Regression models, Cognitive models, etc), and mechanisms for analysis and visualization of data (see, eg, Slater, Joksimović, Kovanovic, Baker, & Gasevic, 2016).
Learning analytics (LA) describes a set of various tools and approaches for handling large and complex student data and the contexts in which learning occurs (Greer & Mark, 2016). Although EDM preceded LA, the two research communities share a common goal of supporting education. EDM is mostly concerned with automated knowledge discovery, and offers a collection of automated data gathering and visualization tools intended to support adaptive learning (see, eg, Baker, 2010; Jones & Jo, 2004; Luan, 2002; Romero, Ventura, & García, 2008). LA research, on the other hand, aims to provide students and teachers with actionable tools to support education (Mor et al., 2015; Siemens & Baker, 2012).
Reconceptualization of big data in education
Big Data in education is a new phenomenon (Picciano, 2012), with most of the research discourses centered on the use of data to inform the quality of instruction and research (Eynon, 2013). For instance, Kalota (2015) suggested that the utilization of Big Data techniques in education allows academic institutions to understand the challenges students face and identify strategies to address them. The availability of large educational data, in particular, provides educational researchers with opportunities to use automated tools and techniques to explore complex educational phenomena on a massive scale. Daniel (2015) proposed three uses of Big Data in education, namely supporting learning, teaching and administration (see Figure 2).

Figure 2. Use case scenarios of Big Data in education (Daniel, 2015)
Various sources of Big Data in education are noted elsewhere in the literature. For example, Poulovassilis (2016) describes various sources of Big Data in education, including data generated and stored in virtual learning environments, assessment data, student personal records, learner models, video data and physiological data (eg, heart rate, blood pressure, etc). LA also allows teachers to identify risk factors associated with student engagement in learning and optimize the design of learning environments (Lodge & Corrin, 2017; Mor et al., 2015). Teachers can use LA dashboards to visualize student learning pathways and identify areas where students struggle the most, so that they can design better intervention strategies. Similarly, providing students with access to personalized dashboards fosters a greater sense of self‐awareness and promotes self‐directed learning dispositions (Tan, Koh, Jonathan, & Yang, 2017).
Although Big Data offers a number of opportunities to education, Big Data in education and educational research are two separate areas of inquiry, requiring different sets of skills and knowledge (Table 1). While educational research is broadly concerned with the investigation of various aspects of education, such as student learning, teaching methods and technology‐enhanced learning, Big Data in education deals with the analysis of large and complex data, using Data Science techniques. Working with Big Data in education, therefore, requires an adequate knowledge of Data Science and the ability to work with automated techniques (eg, machine learning) and high‐performance distributed data processing frameworks such as Hadoop and MapReduce.
| Educational research | Big Data research in education |
|---|---|
| • Context of data known to researchers | • Context of data might be unknown to researchers |
| • The researcher may be involved in data collection | • The researcher might use data already collected |
| • Focused epistemology and ontology | • Emergent epistemology and ontology |
| • Clear ethical protocols and accountability | • Ethical accountability might be unknown |
| • Requires expertise in education and research methods | • Needs additional knowledge of Data Science |
| • Clean, and often small/manageable sample size data (measured in megabytes, gigabytes) | • Large and complex data structures (measured in terabytes, petabytes, exabytes) |
| • Does not require real‐time analysis | • Might employ real‐time analysis |
| • Data are stored within the limits/possibilities of available storage mechanisms | • Distributed file systems (eg, HDFS) or NoSQL stores |
| • Analysis is manual or uses standalone software systems such as SPSS, NVivo and Stata | • Uses Hadoop, MapReduce systems, web mining applications, sensor networks, traffic monitoring |
Though, currently, there are not many data scientists working in education (Buckingham et al., 2013; Koprinska, Stretton, & Yacef, 2015), some universities have started offering degrees in LA, a subset of Big Data in education (see, eg, Teachers College Columbia University, University of Queensland and Northeastern University), opening up future opportunities for extending Data Science into the educational domain.
It is apparent that educational researchers often work with relatively small data. However, massive quantities of educational data can now be easily collected, stored, analyzed and shared across individuals and institutions. The affordance of Big Data in education, however, requires an understanding of the fundamental differences between educational research and Big Data in education (Table 1), as well as addressing possible challenges (Figure 3) that might occur as researchers transition from educational research to research with Big Data in education.

Figure 3. Big Data in education and critical issues for educational research
Big data and educational research: issues of conception
The data generated rapidly by different devices and people have reached sizes that exceed the capacity of hardware or human abilities to process and manipulate them (Vaitsis, Hervatis, & Zary, 2016). Consequently, there is a general tendency in the literature to conceptualize Big Data along the magnitude of data. This, in turn, has led to the belief that educational data is not big enough and, therefore, cannot be considered Big Data. However, there is little agreement among researchers in many areas outside education on what constitutes Big Data in terms of magnitude. For example, can a terabyte of data qualify as Big Data? Others argue that the characterization of Big Data in terms of size is relative to the domain (Baker, 2015).
While there are different views on what constitutes Big Data in the literature, Daniel and Butson (2013) propose a theoretical framework for describing Big Data in higher education along: institutional analytics (IA), information technology analytics (ITA), academic analytics (AA) and LA.
IA is concerned with the analysis of administrative data to enhance the quality of the decision‐making process. ITA relates to the collection and analysis of data associated with student and administrative use of technology services (eg, data warehouse, data standards, tools and policies).
AA refers to the analysis of data on the activities and performance of academic programs (measured in terms of completion and graduation rates, passing and failure rates, etc). The outcome of AA informs strategic decisions relating to aspects of administration such as resource allocation and student retention (Charlton, Mavrikis, & Katsifli, 2013; Siemens, 2013).
LA is the measurement, collection, analysis, and reporting of data about learners and the context in which learning occurs (Jones, 2012; Siemens & Long, 2011). Researchers use the outcomes of LA to understand and optimize the process of learning.
It is important to note that the variety of conceptions of what constitutes Big Data in education gives rise to differing interpretations, which are likely to affect the implementation of Big Data projects in education.
Big data and educational research: technical issues
Working with Big Data systems requires access to a high‐speed computational infrastructure capable of handling a massive amount of data, which by and large can incur a significant cost associated with data capture, storage, analysis and visualization (Chen & Zhang, 2014). Though many academic institutions are currently collecting various forms of data, this data is kept in disparate databases, making analysis difficult. Further, the lack of interoperability of institutional data systems makes aggregating data for analysis from disparate systems cumbersome (Daniel, 2015). Also, the absence of data sharing agreements and data governance models can constitute an additional bottleneck for cross‐institutional data integration and comparison (Miyares & Catalano, 2016).
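The aggregation problem described above can be made concrete with a minimal sketch. It assumes two hypothetical exports, one from a learning management system and one from a student registry, that identify the same students with different field names and ID formats. All record layouts, field names and the `normalize_lms_id` helper are invented for illustration and do not describe any particular institutional system.

```python
from collections import defaultdict

# Hypothetical exports from two unconnected institutional systems.
# Field names and ID formats are assumptions made for this sketch.
lms_records = [
    {"student_id": "S-00123", "logins": 42},
    {"student_id": "S-00456", "logins": 7},
    {"student_id": "S-00789", "logins": 19},
]
registry_records = [
    {"StudentNo": 123, "program": "BEd"},
    {"StudentNo": 456, "program": "BSc"},
    {"StudentNo": 999, "program": "BA"},
]


def normalize_lms_id(raw_id: str) -> int:
    """Map an LMS-style ID such as 'S-00123' to the integer key 123."""
    return int(raw_id.split("-")[1])


def merge_sources(lms, registry):
    """Join both sources on a normalized key and report unmatched keys."""
    merged = defaultdict(dict)
    for row in lms:
        merged[normalize_lms_id(row["student_id"])]["logins"] = row["logins"]
    for row in registry:
        merged[row["StudentNo"]]["program"] = row["program"]
    # Keep only students present in both systems; list the rest separately.
    complete = {key: value for key, value in merged.items() if len(value) == 2}
    unmatched = sorted(key for key, value in merged.items() if len(value) < 2)
    return complete, unmatched


complete, unmatched = merge_sources(lms_records, registry_records)
print(complete)
print(unmatched)
```

Even in this toy case, a mapping rule between identifiers has to be written by hand and some records cannot be matched at all, which is the kind of overhead that grows with every additional system.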
Concerns about protecting individual and institutional privacy through authentication and security are other major issues in Big Data systems. For example, Big Data systems such as Hadoop, designed for managing public data, have only a single level of data protection, making them difficult to employ in the educational domain. Building an additional layer of control and encryption to protect data in education can require significant resources.
Furthermore, one of the values of Big Data in education is the extensive use of predictive modeling. However, over‐reliance on predictive modeling can be limiting, since educational issues are far too complex and difficult to handle in a single model. For example, building models that can accurately identify students at risk of failing their program of study requires a thorough analysis of possible triggers of computationally intractable problems. Computationally intractable problems in education include student socio‐economic challenges (such as family background, health status, available resources and living conditions at home). These problems are, by and large, beyond the learning environment; as such, they can be difficult to capture and incorporate into a predictive model. In addition, accurate interpretation of predictive models requires technical knowledge of Data Science. However, such knowledge may not be accessible to many educational researchers.
Big data and educational research: ontological issues
Ontology constitutes a particular view of reality. Researchers use a particular ontology to situate their understanding within a theoretical perspective. In educational research, especially qualitative research, engaging with the process of data collection is a critical part of research integrity located within a particular ontological thinking or world‐view, because researchers infer the meaning of a phenomenon based on the context in which data is collected and analyzed. However, researchers working with Big Data are rarely involved in data collection or the study design (Dede et al., 2016), raising the question of how educational researchers can meaningfully engage and analyze data generated for different intent and context.
The relationship between the researcher and what is being researched, referred to as reflexivity, is an essential component of the educational research process. An ontological orientation facilitates an understanding of both the phenomenon being researched and the research process itself (Watt, 2007). A lack of engagement during data collection in Big Data research in education can drastically diminish the value of reflexivity, possibly compromising the rigor of research outcomes.
Since researchers working with Big Data might make use of data already collected, the essential elements of experimental research such as randomization, countermeasures of threats to various forms of validity and manipulative control are hard to achieve (Boyd & Crawford, 2012; Crawford, Gray, & Miltner, 2014).
Big data and educational research: epistemological issues
The use of a particular epistemology (eg, positivism vs. interpretivism) informs the choice of research methodology and, in turn, shapes the design of a study. Working with Big Data in education requires an understanding of universal scientific theories for inductive inferences (Frické, 2015). It entails embracing new forms of empiricism (Kitchin, 2014) that transcend quantitative and qualitative traditions.
The new forms of empiricism are characterized by emergent research design, shaped by the technological environment and by complex and dynamic data. This new kind of empiricism constitutes the fourth research methodology tradition (Data Science). The first research methodology tradition is based on the scientific method (quantitative), characterized by positivist epistemology. The second tradition consists of research practices that are theoretically situated in interpretivism (qualitative methods). Mixed methods, with its overarching epistemology of pragmatism, forms the third tradition. The four traditions, together with their associated forms of data, are shown in Figure 4.

Figure 4. The four research methodology traditions
The epistemology associated with Data Science differs from conventional methods (Harford, 2014), because the research process in Data Science does not depend on predetermined and hegemonic paradigms, but rather requires a continuous negotiation of meaning constrained by the environment in which the research is carried out. Unlike the three other methodological traditions, working effectively with Data Science requires researchers to be able to deal with complex and heterogeneous data (Fan, Han, & Liu, 2014). The fourth research methodology tradition proposed in this paper substantiates Tansley and Tolle's (2009) views on the fourth research paradigm, which appeals to new approaches and procedures for undertaking scientific research, in the light of new forms of publicly generated data, which can be repurposed and curated within certain regulatory constraints (Tolle, Tansley, & Hey, 2011).
Big data and educational research: methods and data analysis
Romero and Ventura (2010) noted that researchers in higher education have worked with relatively small amounts of data that have limited interpretative power, latency, and validity. Big Data in education offers researchers robust approaches for discovering subtle population patterns unlikely to be detected with small‐scale data (Fan et al., 2014). However, the outcome of Big Data research is by and large limited to correlational models and predictive analytics, leaving the causality of educational research results desirable but to some extent unattainable.
Many methods in Big Data research are concerned with asking the “what” rather than the “why” questions. However, the outcome of educational research is often needed to address particular learning problems. Therefore, identifying the causes of problems, rather than simply describing problems, is needed to develop better strategies to achieve desirable educational outcomes.
Research contests the use of correlation versus causality (see, eg, Bollier, 2010; Mayer‐Schönberger, 2015). Mayer‐Schönberger (2015) argues that correlational analysis of Big Data research can often yield useful connections for the development of interventions even in the absence of causality. However, mistakenly treating correlation as causality can result in choosing ineffective interventions, even if such an outcome is based on the analysis of a large dataset.
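The risk of treating correlation as causality can be illustrated with a small simulation. The sketch below is built entirely on assumptions: a hidden factor (labelled `support`, standing in for something like resources at home) drives both the number of logins and the grade, while logins have no effect on the grade at all. The variable names, the data-generating process and the sample size are invented for illustration and are not drawn from any study cited here.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n, seed, forced_logins=None):
    """Simulate n students; optionally intervene by fixing their logins."""
    rng = random.Random(seed)
    logins, grades = [], []
    for _ in range(n):
        support = rng.gauss(0, 1)            # hidden confounder (assumed)
        natural_logins = support + rng.gauss(0, 1)
        grade = support + rng.gauss(0, 1)    # grade ignores logins entirely
        logins.append(natural_logins if forced_logins is None else forced_logins)
        grades.append(grade)
    return logins, grades


# Observational data: logins and grades are clearly correlated.
logins, grades = simulate(n=5000, seed=1)
r_observed = pearson(logins, grades)

# Intervention: force every student to a low or a high login count.
_, grades_low = simulate(n=5000, seed=1, forced_logins=-2.0)
_, grades_high = simulate(n=5000, seed=1, forced_logins=2.0)
mean_low = sum(grades_low) / len(grades_low)
mean_high = sum(grades_high) / len(grades_high)

print(round(r_observed, 2), mean_low == mean_high)
```

In this constructed example, a large dataset shows a substantial correlation between logins and grades, yet an intervention that changes logins leaves the average grade unchanged, because the correlation arises only from the shared hidden factor.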
Big data and educational research: digital divide and digital dividend issues
Big Data is a source of competition for some institutions because researchers can extract useful insights from data and use it to enhance productivity (Gurstein, 2011; McGuire, Manyika, & Chui, 2012). Junqué de Fortuny, Martens, and Provost (2013) stated that institutions with larger data assets could take advantage of Big Data to achieve competitive advantage over other institutions (digital dividend).
Big Data research in education requires specialized skills lacking in many educational researchers. The use of data visualization in particular, requires knowledge of statistics and information visualization, limiting accessibility to many educational researchers.
Further, many educational researchers are unfamiliar with technologies associated with Big Data research (eg, Hadoop, NoSQL and MapReduce). There is also a lack of professional development opportunities for educational researchers interested in working with Big Data (digital divide). Working with Big Data requires the involvement of a data scientist who is knowledgeable about the right educational research questions. However, there is a limited number of data scientists who are familiar with or interested in working in the domain of education.
Big data and educational research: ethical and privacy issues
Big Data in education presents potential threats to student safety and security. The use of LA, where students are being tracked and their performance flagged, can lead to unintended outcomes. For instance, the use of student data to make a decision might deny a student access to future programs. Moreover, some students might object to the use of their data even if proper consent is obtained. As Prinsloo, Archer, Barnes, Chetty, and Van Zyl (2015) noted, collecting data without any clear purpose or without obtaining appropriate consent from students raises issues of ethics, privacy and data ownership. Concerns over ethics and privacy in Big Data in education are complex, requiring an understanding of power relations between students and institutions (Slade & Prinsloo, 2013). To address issues of ethics and privacy, institutions need to consider creating data governance models and data protection policies, as well as the context in which data can be used (Diaries et al., 2014; Dyckhoff, Zielke, Bültmann, Chatti, & Schroeder, 2012; Metcalf & Crawford, 2016).
However, meeting the current standards for obtaining participants' consent in Big Data research is challenging, since most of the data already exist in institutional databases. Another ethical dilemma associated with the use of Big Data for research is maintaining research integrity when using publicly accessible data, because those who generated such data might not be willing to consent to the use of their data, or might no longer be accessible to researchers.
Rights to data ownership and access are additional issues to consider. For instance, should a student have access to the same data as a lecturer? Should educators be able to see analytics from other courses? Moreover, would it be appropriate for academic institutions to make student data available to a third party, including employers? There are also questions of institutional moral obligations associated with the use of student data for predictive modeling. For example, if it becomes apparent that a particular student is struggling, will an institution be morally obliged to help the student, even if the cause of the difficulty lies in a complex social and financial background?
As educational researchers explore the analysis of data stored in cloud‐based computing, issues of privacy and safety are likely to become even more complex, necessitating the establishment of global ethics and moral obligations for the use of educational data.
Issues of trust need to be addressed when sharing research data across institutions. A growing number of academic journals (eg, British Journal of Educational Technology) encourage researchers to share data with other researchers. However, sharing data without proper guidelines might raise concerns about intellectual property rights and informed consent, since those who consented to the use of their data might not have permitted sharing with third parties.
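One technical safeguard that data sharing guidelines sometimes include is pseudonymization of identifiers before data leave an institution. The sketch below is offered only as an illustration of that step, assuming a hypothetical record layout and a secret key held by the institution; the function names and field names are invented. It does not resolve the consent and ownership questions raised above, and pseudonymized records may still be re-identifiable from the remaining fields.

```python
import hashlib
import hmac


def pseudonymize(student_id: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256), truncated."""
    digest = hmac.new(secret_key, student_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def prepare_for_sharing(records, secret_key, shareable_fields):
    """Keep only approved fields and swap the ID for a pseudonym."""
    shared = []
    for row in records:
        out = {"pid": pseudonymize(row["student_id"], secret_key)}
        out.update({f: row[f] for f in shareable_fields if f in row})
        shared.append(out)
    return shared


# Hypothetical records and key, invented for this sketch.
records = [
    {"student_id": "S-00123", "name": "A. Student", "logins": 42},
    {"student_id": "S-00456", "name": "B. Student", "logins": 7},
]
key = b"institution-held-secret"

shared = prepare_for_sharing(records, key, shareable_fields=["logins"])
print(shared)
```

Using a keyed hash rather than a plain hash means that a third party who knows the ID format cannot recompute the pseudonyms without the key, while the institution can still link records for the same student across datasets.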
Conclusion and future research
Big Data in education has prompted researchers and developers to see possibilities for introducing different technologies to process and generate information to support student learning. Despite the growing research into Big Data in education and its apparent value to learning, many academic institutions are slow in implementing Big Data projects (Macfadyen, 2017). Eynon (2013) cautioned against the overuse of Big Data in education as a form of “technical fix” instead of as a way of empowering researchers to carry out better educational research.
As Big Data in education becomes a mainstream research paradigm, issues of conceptualization need to be addressed before it is widely embraced. A new conception of Big Data in the context of educational research is needed, one that takes into account the complexity of educational environments and the nature of data being collected. Big Data creates unique opportunities for research. However, these opportunities are not immediately accessible to all educational researchers, unless professional development opportunities are provided (Daniel, 2017). Furthermore, establishing educational research programs using Big Data will require addressing issues of epistemology, ontology, methodology and inequality in leveraging the outcomes of Big Data in education.
Issues of infrastructure, tools and human capacity required for the efficient collection, cleaning, analysis and distribution of large datasets are important to address. Further, issues of privacy, ethics, access and governance remain a major concern (Gasevic, Dawson, & Jovanovic, 2016). As the need for institutions to share educational data increases, it will be imperative that national and international standards be developed to address issues of data security and interoperability, privacy and access. Educators can engage in collecting various forms of data for class improvement, rather than for research (Ho, 2017); repurposing this data for research might not be ethical.
Future research needs to explore these issues and identify strategies to support educational researchers. Moreover, the successful implementation of Big Data in education depends on the ability of educational researchers to work with principles and approaches of Data Science driven by insights in the fourth research methodological tradition.
Statements on open data, ethics and conflict of interest
The ideas presented in this article are developed from a review of published literature. The work does not pose any risks to individuals or institutions. No potential conflict of interest was reported by the author.
Biography
Ben Kei Daniel is a Senior Lecturer in Higher Education, and the convener for Educational Technology for the University of Otago, New Zealand. He studies the value of Big Data and Learning Analytics in Higher Education. He is also investigating Data Science approaches for educational research, as well as developing pedagogical theories and praxis for research methodologies in Business and Academia.




