A state‐of‐the‐art survey of predicting students' performance using artificial neural networks

Predicting students' performance is one of the most important issues in educational data mining. To investigate the state‐of‐the‐art in predicting students' performance with artificial neural networks (ANN), we surveyed 39 important studies on this issue published from 2016 to 2021. The results show that: (1) most prediction models target learners' performance in a program or course; (2) the datasets used to train prediction models are collected from the logs of learning management systems; (3) the most commonly used ANN is the feedforward neural network; (4) researchers use stochastic gradient descent and the Adam algorithm to optimize the parameters of the ANN, and configure its hyper parameters manually; (5) feature selection is not strictly necessary, because an ANN can automatically adjust the weights of its artificial neurons; and (6) ANN performs better than classical classifiers in predicting student performance. The challenges in previous studies are analyzed comprehensively, and suggestions for future research are put forward.

intervention and guidance to students' learning process in advance. Specifically, instructors can use a prediction model to predict students' upcoming performance from their characteristics, such as demographics, online learning behaviors, and scores in previous courses. If the predicted result is low or a failure, instructors can provide students with more supervision or guidance. Based on the students' predicted performance on different learning resources, an adaptive system can provide resources of more appropriate difficulty and devise efficient personalized learning paths. In recent years, researchers have collected learning datasets from eLearning systems such as LMS and MOOC, and used these datasets to train prediction models based on decision tree (DT), support vector machine (SVM), logistic regression (LogR), K-nearest neighbor (KNN), and other classifiers, 3,4 which achieved good prediction performance. 5 The universal approximation theorem 6 shows that a hidden layer with sufficiently many neurons, each consisting of an affine transformation followed by an activation function with a "squeezing" effect, can approximate to arbitrary accuracy any continuous function $\mathbb{R}^d \to \mathbb{R}$ defined on a bounded closed subset of the real space $\mathbb{R}^d$. According to this theorem, an artificial neural network (ANN) has stronger nonlinear mapping and fitting ability than classical classifiers such as DT, SVM, and LogR. In addition, ANN does not need additional feature engineering components and can be used to construct end-to-end prediction models. It has become one of the most popular methods in machine learning.
At present, a new generation of ANNs such as the convolutional neural network (CNN), 7 recurrent neural network (RNN), and graph neural network (GNN) 8 has achieved great success in many fields, such as smart healthcare systems, 9 speech recognition, 10 image classification, 11,12 and natural language processing. 13 Meanwhile, more and more researchers have begun to use ANN to predict students' performance. The feedforward neural network (FNN) is the ANN most widely used by researchers. Many researchers also use CNN to extract spatial features from learning datasets, or GNN to extract features of the correlation between learners and learning resources, and thereby achieve better prediction performance.
Students' family status, assignment performance, study behavior, and 21 other factors have been identified as factors that affect students' performance. Five machine learning classifiers, including DT and naïve Bayes (NB), are the methods most commonly used by researchers to predict students' performance, and the classifier with the best average accuracy is KNN. 14 In classroom learning, the factors affecting students' academic performance include student background and behavior, assessments, teaching, and domain knowledge. It is possible to predict the grades or scores of courses before finals by using association rule mining, regression, classification, clustering, and text mining. 15 In higher education, the factors used to predict student performance can be divided into four categories, namely, students' previous grades and class performance, students' eLearning activity, students' demographics, and students' social information. Additionally, the machine learning methods most commonly used for predicting students' performance are decision trees, naïve Bayes, and artificial neural networks. 16 Deep learning technology represented by the artificial neural network has been widely used in predicting students' performance, detecting undesirable student behaviors, and generating recommendations and evaluations. 17 Using educational data mining methods to predict students' performance includes the following main stages: data collection, data preprocessing, establishing the prediction model, evaluating the performance of the model, and interpretation. 5 So far, there has been no dedicated state-of-the-art survey of studies that use ANN to predict students' performance. To investigate the design details of ANN-based student performance prediction models, summarize the challenges in the latest research, and put forward suggestions for future research, we provide a state-of-the-art survey of ANN for predicting students' performance and review in detail the important studies in this field from 2016 to 2021.
Specifically, the main contributions of this survey are as follows.
1. We listed the important studies on this issue from the past five years and classified them according to the objectives and results of prediction.
2. We analyzed the sources and important properties of the training datasets used by the researchers.
3. We summarized the types and structures of ANNs, the methods of optimizing parameters and hyper parameters, feature selection, and the methods of representation learning used by researchers in these studies. The advantages and disadvantages of the different methods are also discussed.
4. We compared the performance of the prediction models in different studies and identified factors affecting the performance of the prediction model.
5. We summarized the latest achievements on this issue and provided some suggestions for future research.
The rest of this article is organized as follows. In Section 2, we formally define the problem of predicting students' performance using ANN, introduce the commonly used ANNs, and discuss the related work. In Section 3, we introduce the methodology of this study. In Section 4, we classify the previous studies according to the objectives of prediction and the training datasets, and report our investigation results according to six research questions. In Section 5, we summarize the previous studies by reviewing the research problems, and put forward some suggestions for future research.

Problem definition
In educational data mining, researchers generally regard the prediction of students' performance as a classification problem in machine learning. For example, predicting whether students can pass a course or complete a program is binary classification: the result of prediction is success or failure. The results of prediction can also be expressed in multiple grades, such as "Excellent, Good, Medium, Pass, Fail." Predicting students' performance can also be regarded as a regression problem, in which the result of prediction is a numeric value, such as a student's grade point average (GPA) or course score. We can formally define the problem of predicting students' performance using ANN as (1).

$$g = f(s; \theta, h) \tag{1}$$

where
$f$: the ANN used for predicting students' performance, which can be regarded as a function.
$s$: the feature vector given as input to the ANN; these features may be demographic characteristics of students, course scores, or behavioral characteristics extracted from the LMS log.
$\theta$: the learnable parameters of $f$, including the weights and biases of all artificial neurons in the ANN.
$h$: the hyper parameters of $f$, which cannot be obtained through training; they mainly include the number of hidden layers of the ANN, the number of artificial neurons in each hidden layer, the activation function, and the learning rate and loss function used in training.
$g$: the result of prediction, that is, the performance of students expressed in multiple grades or as a numeric value.

ANN for predicting students' performance
At present, researchers use FNN, CNN, 7,18 RNN, 18 GNN, 8 and other ANNs to predict students' performance. FNN is the most classical ANN and the one most used by researchers at present. In FNN, all neurons are divided into layers. The input of FNN is a d-dimensional feature vector in $\mathbb{R}^d$, and the output of FNN is a number or an element of $C = \{c_1, c_2, \ldots, c_n\}$. FNN thus represents a mapping $\mathbb{R}^d \to \mathbb{R}$ or $\mathbb{R}^d \to C$. Each neuron accepts the output of the neurons in the previous layer and generates output for the neurons in the next layer, so the data in FNN flows in one direction. The structure of FNN is shown in Figure 1. As can be seen from Figure 1, the output of the i-th layer of FNN, denoted by $y^{(i)}$, is the result of applying an affine transformation and an activation function to the output of the (i−1)-th layer (2), where $f^{(i)}$ is the activation function used by the i-th layer, and $W^{(i)}$ and $b^{(i)}$ are the weights and biases of the affine transformation.

$$y^{(i)} = f^{(i)}\left(W^{(i)} y^{(i-1)} + b^{(i)}\right) \tag{2}$$

FNN can therefore be regarded as a composite function (3), written here for a network with $m$ layers, where $x$ is the input vector and $W$ and $b$ collect the weights and biases of all neurons in the network.

$$y = f^{(m)}\left(W^{(m)} f^{(m-1)}\left(\cdots f^{(1)}\left(W^{(1)} x + b^{(1)}\right) \cdots\right) + b^{(m)}\right) \tag{3}$$

When using FNN to predict students' performance, we can directly input the feature vector $x = [x_1, x_2, \ldots, x_n]$ into FNN. FNN can output not only course grades and other categorical results, but also GPA and other numeric results.
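As a minimal sketch of the layer-by-layer computation described above, the following Python code applies an affine transformation followed by an activation function at each layer. The weights, biases, layer sizes, and the helper names `affine` and `fnn_forward` are hypothetical and chosen only for illustration; they are not taken from any surveyed study.

```python
def relu(z):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in z]


def identity(z):
    """Linear output, as used for a numeric (regression) prediction."""
    return list(z)


def affine(W, b, y_prev):
    """Return W @ y_prev + b, with W given as a list of rows."""
    return [sum(w * y for w, y in zip(row, y_prev)) + bias
            for row, bias in zip(W, b)]


def fnn_forward(x, layers):
    """Apply each (W, b, activation) layer in order; data flows one way."""
    y = list(x)
    for W, b, activation in layers:
        y = activation(affine(W, b, y))
    return y


# Hypothetical network: 3 input features -> 2 hidden neurons -> 1 output.
layers = [
    ([[0.5, -1.0, 0.25], [1.0, 1.0, 1.0]], [0.0, -1.0], relu),
    ([[2.0, 0.5]], [0.1], identity),
]
x = [1.0, 0.5, 2.0]  # made-up feature vector for one student
prediction = fnn_forward(x, layers)
```

Replacing the final `identity` activation with a softmax over several outputs would turn the same structure into a classifier over grades; that variant is not shown here.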
CNN adds convolution and pooling layers to the front end of FNN, and extracts the implicit spatial features in the raw dataset through convolution and pooling. Figure 2 shows an example of the convolution of a two-dimensional input x with kernel k.
Convolution is used to extract the features of a local region in the dataset. Let the input be $x \in \mathbb{R}^{n_h \times n_w}$ and the kernel be $k \in \mathbb{R}^{k_h \times k_w}$; when the stride is one, the result of convolution is $o \in \mathbb{R}^{(n_h - k_h + 1) \times (n_w - k_w + 1)}$, and each element $o_{ij}$ is generated according to (4). The result of convolution can be used as the input of FNN. Researchers generally use convolution to extract spatial features from the raw learning dataset, such as students' learning behavior features within a learning resource or time period, and use the extracted features to predict students' performance.
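The output size stated above can be checked with a small sketch of a stride-one convolution without padding. The code assumes the cross-correlation convention common in deep learning libraries (the kernel is not flipped), which may differ from the indexing used in (4); the function name `conv2d_valid` and the input values are hypothetical.

```python
def conv2d_valid(x, k):
    """Slide kernel k over input x with stride 1 and no padding.

    Assumption: cross-correlation convention (kernel not flipped).
    The output has shape (nh - kh + 1) x (nw - kw + 1).
    """
    nh, nw = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    out = []
    for i in range(nh - kh + 1):
        row = []
        for j in range(nw - kw + 1):
            # Weighted sum over the local region starting at (i, j).
            row.append(sum(x[i + a][j + b] * k[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out


# Made-up 3 x 3 input and 2 x 2 kernel.
x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
k = [[1, 0],
     [0, -1]]
o = conv2d_valid(x, k)  # 2 x 2 result, since 3 - 2 + 1 = 2
```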
The neurons in RNN accept not only the input $x_t$ at the current time but also the hidden state $s_{t-1}$ from the previous time, which makes RNN especially suitable for handling sequential data. The structure of a basic RNN is shown in Figure 3.
It can be seen from Figure 3 that RNN uses $s_{t-1}$ to store the hidden state from $t_0$ to $t-1$. The hidden state $s_t$ is determined by $s_{t-1}$ and $x_t$ (5). W, U, and V are weights and b is the bias; these parameters can be optimized through training. The output of RNN, $o_t$, is determined by $s_t$ (6). Since the information in all hidden states from $s_0$ to $s_{t-1}$ is carried into $s_t$, RNN can be used to handle sequential data. Students may generate a large amount of sequential data in learning activities, such as sequential records of visiting learning resources or answering questions. Researchers use RNN to extract implicit features from these sequential datasets to improve the accuracy of predicting students' performance.
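The recurrence can be illustrated with a short sketch of a basic RNN. Two points are assumptions rather than details from the text: the code lets U act on the input and W on the previous hidden state (the roles may be assigned differently in (5)), and it uses tanh as the activation with a linear output through V. All numeric values are made up.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rnn_step(x_t, s_prev, U, W, b):
    """Assumed form: s_t = tanh(U x_t + W s_{t-1} + b)."""
    pre = [u + w + bias
           for u, w, bias in zip(matvec(U, x_t), matvec(W, s_prev), b)]
    return [math.tanh(p) for p in pre]


def rnn_forward(xs, s0, U, W, V, b):
    """Run the recurrence over a sequence; o_t depends only on s_t."""
    s = list(s0)
    outputs = []
    for x_t in xs:
        s = rnn_step(x_t, s, U, W, b)
        outputs.append(matvec(V, s))
    return outputs, s


# Hypothetical sizes: 1-dimensional input, 2-dimensional hidden state.
U = [[0.5], [-0.3]]
W = [[0.1, 0.0], [0.0, 0.1]]
V = [[1.0, 1.0]]
b = [0.0, 0.0]
xs = [[1.0], [0.0], [1.0]]  # e.g. a made-up sequence of activity counts
outputs, final_state = rnn_forward(xs, [0.0, 0.0], U, W, V, b)
```

Feeding a sequence whose only nonzero input is at the first step still leaves a nonzero final state, which is the sense in which the hidden state carries earlier inputs forward.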
FIGURE 3 RNN generates output $o_t$ based on previous state $s_{t-1}$ and present input $x_t$

FIGURE 4 F obtained by aggregating neighbor states in C used for prediction
A graph is a powerful data structure composed of vertices and edges. Vertices represent the information of entities, and edges represent the associations between entities. We can use GNN to handle datasets with a graph structure. From Figure 4, we can see that the input and output of GNN are graphs. The output graph F is generated from the input graph C through aggregation in multiple hidden layers. GNN uses the information of adjacent vertices or edges to update those in C. After several rounds of updating, the information of the vertices, the edges, or the whole graph in F can be input into FNN to perform classification and prediction.
Node $v$ uses its state in the previous layer, $h^{(v)}_{t-1}$, and the received state $m^{(v)}_t$ to update its state in the current layer, $h^{(v)}_t$ (7). The received state $m^{(v)}_t$ contains the previous-layer states $h^{(u)}_{t-1}$ of all adjacent vertices $u \in N(v)$ and the states of all edges $e^{(u,v)}$ (8). After many iterations, the nodes in the last layer contain the states of many neighbors.
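A minimal sketch of one such update round is given below. The aggregation (a sum of neighbor states scaled by an edge weight) and the update rule (an equal mix of the old state and the received state) are hypothetical choices made for illustration; the functions used in (7) and (8) may differ, and real GNNs learn these functions. Node states are scalars here for brevity.

```python
def message_passing_round(states, edges, edge_weight):
    """One round of neighbor aggregation on an undirected graph.

    states: dict mapping node -> scalar state from the previous layer
    edges: list of (u, v) pairs
    edge_weight: dict mapping frozenset({u, v}) -> scalar edge state
    """
    neighbors = {v: [] for v in states}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    new_states = {}
    for v, h_v in states.items():
        # Received state: neighbor states combined with the edge states.
        m_v = sum(edge_weight[frozenset((u, v))] * states[u]
                  for u in neighbors[v])
        # Hypothetical update function mixing old state and message.
        new_states[v] = 0.5 * h_v + 0.5 * m_v
    return new_states


# Made-up path graph a - b - c; only node a starts with a nonzero state.
states = {"a": 1.0, "b": 0.0, "c": 0.0}
edges = [("a", "b"), ("b", "c")]
edge_weight = {frozenset(e): 1.0 for e in edges}
round1 = message_passing_round(states, edges, edge_weight)
round2 = message_passing_round(round1, edges, edge_weight)
```

In this toy graph, node c is two hops from node a, so it only receives information originating at a in the second round.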
Much learning data can be represented by graphs, such as a knowledge graph composed of concepts, or the communication and cooperation relationships between learners. Researchers organize such associated data into an original graph C. After multilayer aggregation, the vertices and edges in the output graph F contain the states of their neighbors, which provides a new approach to predicting students' performance.

Related works
Romero et al. reviewed the latest developments in educational data mining, including important publications, the educational environments in which data are generated, the tools used by researchers, and freely available datasets, as well as the methods, main objectives, and future trends of educational data mining. 2 The authors of this review also discussed the similarities and differences between educational data mining and learning analytics, emphasizing that educational data mining focuses on technical challenges while learning analytics focuses on educational challenges. The main task of educational data mining is to design algorithms or models better suited to educational data. Grusti et al. investigated 73 studies from 1999 to 2019 that used educational data mining methods to predict college students' dropout rate. They identified six classification techniques: decision tree, K-nearest neighbors, support vector machines, Bayesian classification, neural networks, and logistic regression, of which the most used is the decision tree (60%). Their results showed that researchers used a multilayer perceptron to predict college students' dropout rate in 11 studies, but the survey did not cover the use of CNN, RNN, and other recent ANNs. 19 Charitopoulos et al. reviewed 316 studies published between 2010 and 2018 that used DT, random forest (RF), ANN, and evolutionary algorithms to mine educational datasets. Their survey results showed that the assessment of learning outcomes and skills is by far the top educational problem addressed through educational data mining (EDM) and learning analytics (LA) research, and that predicting students' performance is one of the problems of greatest concern in educational data mining. They also pointed out that the most widely used learning environments for EDM are e-learning and computer-based learning (LMS), and that ANN is mainly used for data statistics and outlier detection in e-learning. 20 Khan et al. presented a systematic review of studies focusing on predicting students' performance in classroom learning. They reviewed 140 studies that used association rule mining, regression, clustering, text mining, and social network analysis to predict students' performance, but did not include studies that used ANNs. 15 Their survey results show that regression and classification are two widely used techniques for determining the impact of various factors, but that researchers rarely use text mining or social network analysis to extract valuable insight from unstructured data. Hernández-Blanco et al. reviewed the application of deep learning to 13 main problems in educational data mining. They investigated 41 studies published between 2015 and 2018. Their survey results showed that predicting students' performance is the problem of greatest concern. 17 However, the studies selected in that survey were published in or before 2018, so its results cannot reflect the latest achievements in using ANN to predict students' performance. Because of the importance of predicting students' performance and the rapid development of artificial neural networks, it is necessary to investigate and summarize the latest developments and put forward suggestions for future research. A comparison of existing surveys is shown in Table 1.

TABLE 1 Comparison of existing surveys
Ref. (2) Learning contexts within relevant research.
(3) Relation between classic and soft computing methods employed to solve specific problems. (2) Methods used for finding these factors.
(3) How to predict the actual grade or score. (1) EDM tasks that have benefited from deep learning and those that are pending to be explored.
(3) Key concepts, main architectures, and configurations of deep learning and applications to EDM.
(4) Current state-of-the-art and future directions on deep learning used for educational data mining.

METHODOLOGY
To investigate the latest developments in predicting students' performance based on ANNs, we used a simplified systematic literature review (SLR) methodology. According to the guidelines proposed by Kitchenham, 21 an SLR includes three main stages: planning, conducting, and reporting. The methodology used in our survey is shown in Figure 5.

Planning
In this subsection, we introduce the planning of this survey; the results are reported in detail in the next section.
To specify the objectives of this survey and summarize our contribution, we formalized six research questions:
RQ-1: What specific objective do researchers use ANN to predict?
RQ-2: What are the sources and properties of the datasets that researchers collect for training ANN?
RQ-3: What types and structures of ANN did the researchers use to predict students' performance?
RQ-4: What methods have the researchers used to optimize the parameters and hyper parameters of ANN?
RQ-5: What features do researchers provide to ANN to predict students' performance?
RQ-6: How well do the prediction models that researchers build using ANN perform?
According to our research objectives and research questions, the search keywords are ("predict" OR "forecasting") AND ("Deep Learning" OR "Neural Network") AND ("Student Performance"). These keywords mean that we collected not only research applying FNN or multilayer perceptron (MLP), but also research on recent ANNs such as CNN, RNN, and GNN.
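The Boolean structure of the search string can be made explicit with a small sketch: three OR-groups combined with AND. The code uses case-insensitive substring matching, which is only an assumption; each online database applies its own matching rules. The function name `matches_query` and the example titles are hypothetical.

```python
# Each inner tuple is an OR-group; the groups are combined with AND.
QUERY = (
    ("predict", "forecasting"),
    ("deep learning", "neural network"),
    ("student performance",),
)


def matches_query(text, query=QUERY):
    """Return True if every OR-group has at least one term in the text.

    Assumption: case-insensitive substring matching, so "predict"
    also matches "predicting".
    """
    lowered = text.lower()
    return all(any(term in lowered for term in group) for group in query)


# Made-up titles, used only to exercise the filter.
titles = [
    "Predicting student performance with a deep learning model",
    "Forecasting student performance using decision trees",
    "A neural network approach to image classification",
]
selected = [t for t in titles if matches_query(t)]
```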
The online databases we searched are Web of Science, Engineering Village, IEEE Xplore, and Science Direct. Our inclusion criteria are shown in Table 2. Only the studies in the search results that met the inclusion criteria were selected for detailed investigation.
As can be seen from Table 2, we selected only empirical research on predicting students' performance using ANN. Each paper had to describe the design details of the ANN-based prediction model and report experimental results evaluating the performance of the model. To reflect the latest developments on this research problem, we selected the latest research from the recent 5 years (2016-2021).
According to our research questions and inclusion criteria, we extracted the following information from each selected study: 1. Specific objectives of the prediction based on ANN. 2. The source, number of instances and features of the training dataset collected by the researcher.

Conducting
We searched the designated online databases using the search keywords described in the previous subsection, filtered the results using the inclusion criteria, and finally selected 39 studies for information extraction. The publication years of these papers are shown in Figure 6. As can be seen from Figure 6, research on predicting students' performance using ANN has increased rapidly since 2019. Although ANN has a history of many years, the proposal of AlexNet in 2012 attracted widespread attention from researchers, and ANNs have since achieved great success in image recognition, language translation, and object detection. 22 The development of ANN in many fields has prompted researchers to apply it to predicting students' performance.
Educational data mining is an interdisciplinary research field including pedagogy, computer science, and statistics. Researchers in this field mainly include experts in pedagogy and computer science. The source of studies selected in this survey is shown in Figure 7.
As can be seen from Figure 7, research on predicting students' performance using artificial neural networks comes from the fields of pedagogy and engineering technology, and most of the studies selected in this survey come from journals and international conference proceedings on data mining, artificial intelligence, and intelligent systems. Researchers in the field of pedagogy have also begun to use artificial neural networks to predict students' performance. Clearly, using ANN to predict students' performance has attracted the attention of researchers in many fields and is an interdisciplinary research issue.

RESULTS
In this section, we report the results of this survey in detail according to six research questions. These results are synthesized according to the information extracted from the selected studies which can respond to research questions.

RQ-1: What specific objective do researchers use ANN to predict?
Students' performance can be predicted for different objectives, such as a question, a course, or a program. The result can be represented by two simple classes, Pass and Fail, for example whether students can successfully complete the course or program. We can also use multiple grades to represent students' performance; for example, the five classes Excellent, Good, Normal, Pass, and Fail can represent students' performance in a course or program. We can also use numeric values to describe students' performance, of which the most commonly used are course score and GPA. The statistics of prediction objectives and results in the selected studies are shown in Table 3. It can be seen from Table 3 that the objectives of the selected studies include course, program, and question. Among the selected studies, the most common objectives of prediction are performance in courses (56%, 22/39) and performance in programs (41%, 16/39). Only one study predicts students' performance on a question. The predictive results for courses and programs are of three types: binary grades (Pass or Fail), multiple grades (Fail, Pass, Medium, Good, Excellent), and numeric results (scores or GPA). Most researchers use ANNs to solve a classification problem, and the predictive results are classes (72%, 28/39). In the other studies, researchers use ANNs to solve a regression problem, and the predictive results are scores or GPA (28%, 11/39). In general, predicting students' course grades is the most common goal for researchers who use ANN to predict students' performance.

RQ-2: What are the sources and properties of the datasets that researchers collect for training ANN?
In 85% of the selected studies, researchers collected training datasets from specific institutions or systems independently. Most researchers reported in detail the source, number of instances, and features of the training dataset (Table 4). The sources of the datasets include public data repositories, student learning records and logs recorded by eLearning systems such as learning management systems, and information collected manually through questionnaires. We use "P", "E", and "M" to represent these three data sources in Table 4; where the source is not specified, we use "-". Clearly, most of the training datasets are private. As can be seen from Table 4, the volume of the training datasets used by researchers varies greatly. Small datasets are generally collected through manual statistics or questionnaires, while datasets of more than 500 instances are exported from eLearning systems such as LMS or online question pools. More training data can strengthen the generalization ability of ANN, but also requires more computing resources. To reduce the computational load, researchers randomly sample smaller independent and identically distributed subsets from the original dataset. For example, Ma et al. sampled 20,000 instances from 13,900,000 records for training ANN according to experimental conditions. 45,46 There is no unified standard for the training datasets used in different studies, and the datasets differ greatly. Among the selected studies, only six used public datasets: four used the well-known student performance dataset 62 from the UCI machine learning repository, 63 and two used the OULA. 64 The performance of ANNs trained with public datasets can be compared, which makes the construction methods in these studies more valuable. ANNs trained with private datasets can only be used by the institutions that supplied the data; they cannot be compared or transferred, which limits their application.

RQ-3: What types and structures of ANN did the researchers use to predict students' performance?

Types of ANN
In the selected studies, FNN is the ANN most used by researchers. Because it can be used for both classification and regression, researchers use FNN to predict students' performance in the form of multiple grades, scores, and GPA. In addition to FNN, researchers also use recent ANNs such as CNN, RNN, and GNN to extract hidden features from the raw datasets through the representation learning these networks provide. The number of ANNs of each type in the selected studies is shown in Figure 8.
As can be seen from Figure 8, FNN is the ANN most used by researchers. The reason is that FNN is very similar to the traditional multilayer perceptron and is the ANN most familiar to researchers. After the feature vector is input into FNN, the result of classification or regression is obtained. However, FNN can only accept a one-dimensional vector as input; although it can automatically adjust the significance of different features through the parameters of its artificial neurons, it cannot extract new implicit features. To overcome this limitation, researchers use RNN, CNN, and GNN to extract sequential, spatial, and relational implicit features from the raw datasets, respectively. We report these research achievements in the next section. To address the problem of small training datasets, researchers have begun to use generative adversarial networks (GAN) to generate additional training data and improve the prediction performance of ANN. Chui et al. designed an ANN based on GAN 65 to generate new training data from the existing training data, so that the classifier DSVM yielded a specificity of 0.968, a sensitivity of 0.971, and an AUC of 0.954 when this ANN contained three hidden layers. 36

Number of hidden layers and artificial neurons in ANN
The most significant characteristics of an ANN's structure are the number of hidden layers and the number of artificial neurons in each layer. Although recent ANNs for image classification or natural language processing contain dozens of hidden layers, such as 22 hidden layers in GoogLeNet, 66 the number of hidden layers in the ANNs used by researchers to predict students' performance is still very small. In the selected studies, not all researchers described the detailed structure of the ANN they constructed. According to the concrete descriptions that could be extracted from the selected studies, the number of hidden layers of the ANNs constructed by the researchers to predict students' performance is shown in Figure 9. From Figure 9, we can see that most ANNs constructed by researchers have fewer than five hidden layers, and only three ANNs have more than 10 hidden layers. This shows that, for predicting students' performance, an ANN with no more than 10 hidden layers can achieve good performance. It also suggests that the training datasets used by researchers are simple, so that an ANN with few hidden layers can fit them well. The number of hidden layers is one of the most important hyper parameters of ANN and has a significant impact on its performance. Increasing the number of hidden layers usually expands the capacity of the ANN and improves its fit to the training dataset. However, in some cases, increasing the number of hidden layers reduces the performance of the ANN. At present, researchers also find the most appropriate number of hidden layers through experiments. In the studies we selected, researchers generally set the number of hidden layers directly based on experience, and only a few use hyper parameter optimization methods to set it, which we report in detail in the next section. In particular, Salal et al. used a simple method to derive the number of hidden layers from the size of the input vector and the number of classes to be predicted. 25 We investigated the studies that reported the concrete structure of the ANN in detail. The maximum number of artificial neurons contained in a single hidden layer in the ANNs constructed by the researchers is shown in Figure 10.

FIGURE 8 Types of ANN used by the researchers in selected studies

FIGURE 10 Maximum number of neurons in a single hidden layer in the ANN
From Figure 10, we can see that the maximum number of neurons in a single layer varies greatly: the minimum is less than 10 and the maximum is more than 500. In most studies, the maximum number of neurons in a single layer is less than 100. Figures 9 and 10 also show that the ANNs constructed by researchers to predict students' performance are very small, far smaller than well-known recent ANNs such as bidirectional encoder representations from transformers (BERT) 67 and visual geometry group (VGG). 68 As training datasets become more complex, researchers will need to increase the number of hidden layers and neurons to increase the capacity of the ANN. Because the derivative of the rectified linear unit (ReLU) 69 function is easy to calculate, most researchers use ReLU as the nonlinear activation function in artificial neurons.
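The point about the ReLU derivative can be illustrated with a short sketch that places it next to the sigmoid, one example of a "squeezing" activation. The comparison with the sigmoid is our illustration and is not drawn from the surveyed studies; the value of the ReLU derivative at exactly zero is set to 0 here by convention.

```python
import math


def relu(z):
    """Rectified linear unit: passes positive values, zeroes the rest."""
    return max(0.0, z)


def relu_derivative(z):
    """Piecewise constant: 1 for positive inputs, 0 otherwise.

    The derivative is undefined at z == 0; returning 0 there is a
    common convention and an assumption of this sketch.
    """
    return 1.0 if z > 0 else 0.0


def sigmoid(z):
    """Logistic function, shown for comparison."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Requires an exponential evaluation, unlike the ReLU derivative."""
    s = sigmoid(z)
    return s * (1.0 - s)


for z in (-2.0, 0.5, 3.0):
    print(z, relu(z), relu_derivative(z), sigmoid_derivative(z))
```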
Modern ANNs are generally composed of several general-purpose modules. The use of modules greatly reduces the difficulty of design and construction and helps popularize ANNs. In the selected studies, the modules used by researchers include the fully connected layer in FNN, the convolution and pooling layers in CNN, and long short-term memory (LSTM) in RNN. Researchers have also begun to use recent modules such as attention to construct ANNs for predicting students' performance. 35,45

RQ-4: What methods have the researchers used to optimize the parameters and hyper parameters of ANN?

Methods used to optimize the parameters in ANN
ANN contains many parameters that need to be optimized in the training stage. These parameters include the weights and biases of all neurons in FNN and the weights of the different gates in LSTM. All parameters in an ANN can be represented by a vector $\theta$ (9). Training an ANN is a process of continuously optimizing the parameters according to the value of the loss function $L(\theta)$ over all samples in the training dataset. The goal of training is to find the minimum value of $L(\theta)$ and the parameters $\theta^*$ that attain it (10). Finding the minimum of the loss function $L(\theta)$ is therefore a mathematical optimization problem, and gradient descent is the most commonly used optimization algorithm; it uses the gradient $g$ of the loss function $L$ (11) to update the parameters $\theta$.

$$\theta^* = \arg\min_{\theta} L(\theta) \tag{10}$$

$$g = \nabla_{\theta} L(\theta) \tag{11}$$
In 28 of the selected studies, researchers reported the specific methods (optimizers) they used to optimize parameters. According to our investigation, these optimizers include Levenberg-Marquardt, 70 stochastic gradient descent (SGD), 71 and adaptive moment estimation (Adam) 72 (Table 5).
As can be seen from Table 5, SGD and Adam are the optimizers most used by researchers. The two are very similar: both use the gradient g to optimize the parameters. To reduce the computational cost, SGD evaluates the loss function on a small batch of training samples, giving L′(θ). Because all samples in the training dataset are independent and identically distributed, and the samples in each small batch are randomly selected, the mathematical expectations of L(θ) and L′(θ) are the same (12). SGD uses the gradient g_t from the previous step and the learning rate η to generate the new parameters θ_{t+1} (13). The learning rate η is an important hyper parameter. Researchers set it based on experience, generally between 10^−3 and 10^−5.
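The mini-batch idea can be sketched as follows. This is an illustrative example under our own assumptions: a one-neuron linear model, synthetic noise-free data, and a learning rate of 0.05 that is far larger than the range quoted above because the toy problem is tiny. The function names are hypothetical.

```python
import random

def minibatch_gradient(theta, batch):
    """Gradient of the mean squared error computed on one mini-batch only."""
    w, b = theta
    n = len(batch)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return (gw, gb)

def sgd(data, lr=0.05, batch_size=4, steps=3000, seed=0):
    """Mini-batch SGD: each step uses a small random subset of the data."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    theta = (0.0, 0.0)
    for _ in range(steps):
        batch = rng.sample(data, batch_size)   # random batch without replacement
        g = minibatch_gradient(theta, batch)   # estimate of the full gradient
        theta = tuple(p - lr * gi for p, gi in zip(theta, g))
    return theta

# Hypothetical samples from y = 2x + 1 with x in [0, 1.9].
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(20)]

if __name__ == "__main__":
    print(sgd(data))
```

Each step touches only four samples instead of all twenty, which is the computational saving that motivates SGD. The random batches give an unbiased estimate of the full gradient.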
Adam introduces an adaptive learning rate η_t and momentum m_t into SGD to make the optimization smoother and less prone to oscillation (14). η_t is a time-dependent learning rate schedule; normally, η_t decreases over time, which means that the learning rate decreases during training. m_t is the weighted sum of the gradients from time 0 to t−1. The update is also scaled by the root mean square of the gradients from time 0 to t−1. Adam therefore uses not only the gradient at the current time but also the gradients from previous times, so that the optimization of the parameters has inertia. 73
In the selected studies, researchers used a search method based on particle swarm optimization (PSO) 74 to optimize the hyper parameters of ANN. 26,42,57 PSO is a classic heuristic search method that aims to find the global optimum quickly. By optimizing hyper parameters, researchers further improved the performance of ANN used to predict students' performance. Unfortunately, most researchers set the hyper parameters of ANN manually based on experience and do not optimize them.
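The Adam update can be sketched as below. This follows the commonly used formulation with exponential moving averages and bias correction, which may differ in notation from Equation (14). The default constants, the toy data, and the function names are assumptions made for illustration.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1). Returns (theta, m, v)."""
    new_theta, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(theta, grad, m, v):
        mi = beta1 * mi + (1.0 - beta1) * g       # momentum: running mean of gradients
        vi = beta2 * vi + (1.0 - beta2) * g * g   # running mean of squared gradients
        m_hat = mi / (1.0 - beta1 ** t)           # bias correction for the zero start
        v_hat = vi / (1.0 - beta2 ** t)
        # Step is scaled by the root mean square of past gradients.
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v

def loss(theta, data):
    """Mean squared error of the model y = w * x + b."""
    w, b = theta
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def gradient(theta, data):
    """Gradient of the mean squared error with respect to (w, b)."""
    w, b = theta
    n = len(data)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in data) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in data) / n
    return [gw, gb]

def train(data, steps=2000):
    """Run Adam from a zero start on the toy regression problem."""
    theta, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
    for t in range(1, steps + 1):
        theta, m, v = adam_step(theta, gradient(theta, data), m, v, t)
    return theta

# Hypothetical noise-free samples generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

if __name__ == "__main__":
    theta = train(data)
    print(theta, loss(theta, data))
```

Because the step is divided by the root mean square of recent gradients, the first update moves each parameter by roughly the learning rate regardless of the gradient's magnitude. The running mean `m` carries the inertia described above.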

Methods for selecting features
Feature selection is a classical feature engineering method. It selects, from the original features, the significant features that have a prominent impact on the prediction results, and improves prediction performance by removing redundant features. 75 The general framework of feature selection methods is shown in Figure 11. Feature selection is generally a necessary part of machine learning tasks. However, an ANN can automatically adjust the significance of different features in the input vector by tuning the weights of its artificial neurons. In theory, it therefore does not need feature selection and is well suited to end-to-end prediction models, which is a significant advantage of ANN. In the selected studies, 60% of the prediction models designed by researchers are end-to-end, include no additional feature selection, and still achieve excellent prediction performance.
Some researchers still use additional feature selection methods to select the most important of the original features as the input of ANN, in order to improve the accuracy of predicting students' performance. To evaluate the significance of features, researchers used random forest 76 as a tool to evaluate feature subsets, 32,56 taking the prediction accuracy of the random forest as the criterion. To evaluate the correlation between features and prediction results, the authors of several studies 26,33,41,42,48,52 used the Pearson or Spearman correlation coefficient to score each feature. According to the experimental results reported by the researchers, additional feature selection further improved prediction accuracy on their private training datasets.
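A correlation-based filter of the kind described above can be sketched as follows. The feature names, the values, and the threshold of 0.5 are hypothetical and chosen only to make the example concrete. The surveyed studies may rank or threshold features differently.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0.0 or sy == 0.0:
        return 0.0  # constant column: treated as uncorrelated (our convention)
    return cov / (sx * sy)

def select_features(columns, target, threshold=0.5):
    """Keep the features whose absolute correlation with the target is high."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores

# Hypothetical per-student features and final scores.
columns = {
    "logins": [2, 5, 7, 9, 12],
    "forum_posts": [0, 1, 1, 3, 4],
    "id_last_digit": [3, 9, 1, 7, 4],
}
target = [50, 60, 68, 75, 90]

if __name__ == "__main__":
    kept, scores = select_features(columns, target)
    print(kept)
    print(scores)
```

Here the two activity features correlate strongly with the score and are kept, while the arbitrary identifier digit is dropped before the data reach the ANN.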

Methods of representation learning
In addition to using the original features, researchers have begun to use representation learning to re-encode the original features, or to extract new features from them, as the input of artificial neural networks. These new features may be sequential, spatial-based, or relational implicit features. To represent the sequence of students' learning activities, Rodolfo et al. used a coding method similar to bag of words (BOW) 77 to encode the sequence of a student's interactions with the LMS into a vector as the input of ANN. 27 Table 6 is an example of this representation method. We can use the vector "[1,2,4,6]" to represent the set of learning activities of student 02. This representation method is simple and easy to use, but it cannot represent the sequential relationship between activities. Kim et al. used one-hot 78 encoding to convert each learning activity into a vector, and expressed the sequence of students' learning activities as a set of vectors. 55 Figure 12 is an example of this representation method, which can represent the sequence of different learning activities (Table 6).

FIGURE 11 Framework of feature selection method
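The difference between the two encodings can be seen in a small sketch. The activity identifiers and function names are hypothetical. The set-style encoding mirrors the "[1,2,4,6]" example rather than any specific implementation from the cited studies.

```python
def encode_as_set(sequence):
    """Order-free encoding: the sorted set of activity ids that occurred."""
    return sorted(set(sequence))

def encode_one_hot(sequence, num_activities):
    """Order-preserving encoding: one one-hot vector per step (ids are 1-based)."""
    vectors = []
    for activity in sequence:
        vec = [0] * num_activities
        vec[activity - 1] = 1
        vectors.append(vec)
    return vectors

# Two hypothetical students who performed the same activities in opposite order.
seq_a = [1, 2, 4, 6]
seq_b = [6, 4, 2, 1]

if __name__ == "__main__":
    print(encode_as_set(seq_a), encode_as_set(seq_b))
    print(encode_one_hot(seq_a, 6))
    print(encode_one_hot(seq_b, 6))
```

Both students receive the same set-style vector, so the order of their activities is lost. The one-hot sequences differ, which is what allows a sequence model to use that order.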
Convolution is a feature extraction method based on the principles of translation invariance and feature locality. It extracts spatial-based features effectively and has been widely and successfully used in image classification. Researchers have begun to use convolution to extract spatial-based features from students' data as the input of ANN to predict students' performance. Ma et al. proposed a method that uses convolution to extract features from students' smart campus card swiping data. They convert a student's card swiping data into a three-dimensional tensor in R^(t×l×d) (Figure 13) and use three convolution kernels to extract three features of students, namely duration, variation, and periodicity at different locations on campus, to predict students' GPA. 46 The CNN 45 established by Zong et al. is similar to that proposed by Ma et al. The difference is that Zong et al. introduced the attention mechanism to further distinguish and enhance the results of convolution. Zhang et al. used multiple convolution and pooling kernels of different sizes to extract structural features from course scores to predict students' course grades, and used a global attention mechanism to find the correlation between courses. 35

A graph is composed of vertices and edges between vertices. It can represent various types of data, such as images, texts, and molecules, which makes it one of the most flexible data structures. GNN has achieved great success in the fields of computer vision, natural language processing, and biomedicine. 8 In the process of students' learning, the communication between students and the interaction between students and learning resources can be represented by graphs. Researchers have begun to use GNN to extract implicit features from the raw datasets to predict students' performance. Pu et al. constructed an undirected graph representing the relationship between students. Each vertex in the graph is a student and each edge carries the degree of association between two students, expressed by the Pearson correlation (15) between them. In the undirected graph, some students have been marked with academic grades (Figure 14). The researchers fed the constructed graph into a GNN to predict the performance of the unmarked students. 37 Li et al. constructed a heterogeneous graph containing students and questions (Figure 15). The edges between vertices represent the students' scores on the questions they answered, and some edges in the graph have been marked. The researchers fed the constructed graph into a GNN to predict students' performance on unanswered questions. 61 In these studies, researchers use GNN to solve semi-supervised tasks, in which only some samples in the training datasets are marked. The unlabeled samples can be predicted or automatically marked through the association between samples.

FIGURE 15 A heterogeneous graph representing the relationship between students and questions
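The graph construction and the semi-supervised setting can be sketched as below. This is not a GNN: the prediction step is a plain similarity-weighted vote over labeled neighbors, used here only as a simple stand-in to show how labels spread along edges. The student names, feature vectors, labels, and the 0.5 edge threshold are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def build_graph(features, threshold=0.5):
    """Undirected weighted graph: an edge exists where Pearson r > threshold."""
    names = list(features)
    edges = {}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            r = pearson(features[u], features[v])
            if r > threshold:
                edges[(u, v)] = r
                edges[(v, u)] = r  # store both directions for easy lookup
    return edges

def predict_label(node, edges, labels):
    """Similarity-weighted vote over the labeled neighbors of `node`."""
    votes = {}
    for (u, v), weight in edges.items():
        if u == node and v in labels:
            votes[labels[v]] = votes.get(labels[v], 0.0) + weight
    return max(votes, key=votes.get) if votes else None

# Hypothetical weekly activity counts; s4 has no grade yet.
features = {
    "s1": [1, 2, 3, 4],
    "s2": [2, 4, 6, 9],
    "s3": [4, 3, 2, 1],
    "s4": [1, 3, 5, 8],
}
labels = {"s1": "pass", "s2": "pass", "s3": "fail"}

if __name__ == "__main__":
    graph = build_graph(features)
    print(sorted(graph))
    print(predict_label("s4", graph, labels))
```

A GNN replaces the fixed vote with learned message passing over the same graph, but the input structure (vertices, weighted edges, partially labeled vertices) is the one shown here.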

4.6 RQ-6: How about the performance of the prediction model established by researchers using ANN?

Performance of classification model
In the selected studies, researchers evaluated the ANNs used to predict students' performance with standard machine learning indicators for classification and regression. For prediction models that output categorical results, such as multi-level grades for courses and programs, researchers used indicators based on the confusion matrix, including accuracy, precision, recall, and F1-score. By definition, accuracy is the share of all samples that the model classifies correctly; precision is the share of samples predicted as positive that are actually positive; recall is the share of actual positive samples that the model identifies; and F1-score is the harmonic mean of precision and recall. Among these indicators, accuracy is the one most used by researchers. Unfortunately, the training datasets are usually class-imbalanced. For example, far more students obtain a medium grade in a course than fail or achieve an excellent grade. Accuracy therefore poorly reflects the performance of ANN on the minority classes.
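The limitation of accuracy on imbalanced data can be made concrete with a short sketch. The class counts and the trivial "always predict medium" model are hypothetical and chosen only to show the effect.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # 0.0 when nothing is predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical course with 90 medium-grade students and 10 failing students.
y_true = ["medium"] * 90 + ["fail"] * 10
y_pred = ["medium"] * 100  # a model that never predicts "fail"

if __name__ == "__main__":
    print(binary_metrics(y_true, y_pred, positive="fail"))
```

The model reaches 90% accuracy while identifying none of the failing students, so recall and F1 for the minority class are both zero.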
Because the selected studies use different training datasets, we cannot directly compare the performance of their prediction models, but we can use the experimental results reported in the studies for comparative analysis. Since accuracy is the evaluation indicator most used by researchers, we show the accuracy reported for each prediction model in Figure 16.
As can be seen from Figure 16, the reported accuracy is below 80% in only two studies, so the prediction models established by the researchers have achieved high accuracy on their own private datasets. Many researchers have compared the performance of the constructed ANN with traditional classifiers such as support vector machine, random forest, logistic regression, naive Bayes, K-nearest neighbor, and decision tree. In these comparisons, ANN consistently performs better than the traditional classifiers on the researchers' private datasets, which suggests that ANN has greater potential in predicting students' performance. It is worth noting that the prediction models with low accuracy need to extract implicit features from the raw datasets through representation learning, and the quality of the extracted features may be the main factor affecting their accuracy.

Performance of regression model
Researchers used mean square error (MSE) and root mean square error (RMSE) to evaluate ANNs that output numerical predictions, such as course scores and students' GPA. Of the two, RMSE is the measure researchers generally use for ANN-based regression models. In the selected studies, the RMSE results reported in detail by the researchers are shown in Table 7.
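For reference, the two regression indicators can be computed as in the following sketch. The GPA values are hypothetical.

```python
import math

def mse(y_true, y_pred):
    """Mean square error: the average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error: same unit as the predicted quantity."""
    return math.sqrt(mse(y_true, y_pred))

# Hypothetical true and predicted GPAs for four students.
gpa_true = [3.0, 3.5, 2.0, 4.0]
gpa_pred = [2.5, 3.5, 2.5, 3.0]

if __name__ == "__main__":
    print(mse(gpa_true, gpa_pred), rmse(gpa_true, gpa_pred))
```

Because RMSE is expressed in the same unit as the target (here, GPA points), it is easier to interpret than MSE, which may explain its wider use.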
As can be seen from Table 7, the relative error of ANN in predicting course scores, students' GPA, and other regression tasks is very small, and ANN performs better than traditional regression methods such as linear regression. 79 According to the experimental designs and results in the selected studies, the factors affecting the performance of ANN used to predict students' performance include the quantity and quality of the training datasets, the quality of the implicit features extracted from the raw data, feature selection, and the hyper parameters of ANN, including the number of hidden layers and neurons, the activation function, the optimizer, and the learning rate.

CONCLUSION AND PROSPECTS
In recent years, artificial neural networks, especially deep neural networks, have achieved great success in many fields.
In this survey, we comprehensively investigated the technical details of building prediction models of student performance based on artificial neural networks, including the types of ANN, the sources and features of training datasets, the methods of optimizing the parameters and hyperparameters of ANN, and the schemes of feature selection and representation learning, and we compared the performance of existing models. The key findings of this survey are summarized in Section 5.1 by reviewing the six research questions. The results can help researchers quickly understand the latest achievements on this topic. In view of the challenges in previous studies, some suggestions for future research are proposed in Section 5.2, which may help researchers design more general and valuable prediction models.

Review of research questions
RQ-1: What specific objective do researchers use ANN to predict? In the past 5 years, the specific objectives predicted by researchers include students' performance in programs, courses, and questions. The prediction results include pass or fail, multi-level grades, scores, and students' GPA. By predicting students' program performance, educational institutions can estimate future student numbers, appraise institutional income, and adjust the allocation of teachers accordingly. Predicting students' course performance helps teachers to intervene and guide students at risk of failure or drop-out as early as possible. The predicted performance of students on questions, tests, assignments, and other small tasks can be used as the basis for providing personalized learning resources in the next stage. By comparison, research on predicting students' performance on small tasks is rare.

RQ-2: What are sources and properties of datasets collected by researchers for training ANN? To achieve the objectives of prediction, most researchers retrieve the records and logs of students' learning activities from many kinds of digital learning and management systems to train ANN. The sources of training datasets also include the historical statistical data of educational institutions and questionnaires specially designed by the researchers. There is no unified standard for the training datasets used by different researchers, and the datasets differ greatly. Compared with datasets collected manually through surveys and questionnaires, the datasets from the records and logs of LMS and MOOC have larger volume and higher quality. However, the lack of uniform dataset standards makes it impossible to compare the performance of the prediction models established in different studies, and these models are not portable.

RQ-3: What types and structures of ANN did the researchers use to predict students' performance? The ANN most used by researchers for predicting students' performance is FNN. Some researchers also use CNN, RNN, GNN, and other recent ANNs with stronger representation learning capability to predict students' performance by extracting implicit spatial, sequential, and relational features from the raw datasets. Because representation learning can describe the learning traits of students more accurately, deep neural networks will be more widely used in the future.

RQ-4: What methods have the researchers used to optimize the parameters and hyper parameters of ANN? Researchers often use gradient-based optimizers such as Adam and SGD to optimize the parameters of ANN during training. For the number of hidden layers, the number of neurons, the activation function, the learning rate, and other hyper parameters of ANN, researchers generally set values manually based on experience. Only a few researchers use heuristic search to optimize the hyper parameters of ANN and improve prediction performance. Although these methods have been widely used by researchers, most of them are manual and empirical, and optimization methods specially designed for the characteristics of the training datasets are lacking.

RQ-5: What features are provided to ANN by researchers to predict students' performance? Because ANN can automatically adjust the weights of features to achieve feature selection, most researchers have not performed additional feature selection. It is worth noting that CNN, RNN, GNN, and other deep neural networks can extract hidden features from the original datasets, and using these features can further improve the performance of the prediction model.

RQ-6: How about the performance of the prediction model established by researchers using ANN? According to the researchers' reports, the accuracy of artificial neural networks in predicting students' performance generally exceeds 90%, which is better than traditional classifiers such as SVM, RF, NB, KNN, and DT. ANN also achieves better performance than multiple linear regression in predicting students' GPA and course scores. However, students' performance is usually imbalanced: students with medium performance far outnumber students with low or high performance. The accuracy used by researchers therefore cannot fully reflect the prediction performance of these models for all performance classes.

Suggestions for future research
Although researchers have made remarkable achievements in previous studies, we also find some challenges: inconsistent standards of training datasets, insignificant extracted features, deficiency of effective methods of hyper parameter setting and optimization, and absence of pedagogical interpretation of the prediction. Our suggestions for future research are as follows.
1. Update the data standard of students' learning activities and promote the quality of training datasets. Students generate multimodal data in the learning process, and researchers should try to use multimodal datasets to train ANN to further improve the performance of prediction. Most of the training datasets used in previous studies are private and lack consistent standards and the guidance of pedagogy, cognitive psychology, and other theories. As a result, the ANN-based prediction models constructed in different studies lack universality. We suggest that researchers take constructivism, connectionism, and other recent learning theories as guidance, update the popular xAPI 80 and other data standards for students' learning activity, and generate higher-quality training datasets of student performance similar to ImageNet 81 and EMNIST, 82 so that the prediction models established by different researchers can be compared and transferred.
2. Improve the representation learning methods for extracting implicit features from multimodal learning data. Our survey shows that CNN, RNN, GNN, and other recent ANNs with strong representation learning capability can extract implicit spatial, sequential, and relational features from the raw datasets. Improving these methods can further improve prediction accuracy. The multimodal neural network (MNN), a new generation of ANN that can deal with multimodal data, has attracted increasing attention from researchers and has shown better performance than conventional ANN in many fields. 83 In eLearning, an LMS can generate learning data in various modalities, such as metadata of learning resources and activities, students' self-test records from previous stages, sequences of learning activities, images collected by webcam, and logs of learners' interactions in the community. Researchers can use MNN to realize the joint representation and fusion of multimodal learning data to achieve better prediction performance.
3. Propose initialization and optimization methods for the hyper parameters of ANN used to predict students' performance. In previous studies, researchers manually set the hyper parameters of ANN based on experience, and these settings have a significant impact on the accuracy of prediction. Researchers need to explore the nature of the training datasets in terms of feature types, density distribution, and interaction between features, and propose effective initialization and optimization methods for hyper parameters to further improve the performance of the prediction model.
4. Improve the interpretability of prediction results and processes. In previous studies, few researchers explained the process and results of prediction on the basis of pedagogical theory, which limited the usefulness of the prediction model. We suggest that researchers combine the interpretability methods of machine learning 84 with modern learning theories to develop methods that explain both the prediction process and the prediction results. In this way, the prediction model can play a greater role in identifying the key factors affecting learning and in analyzing students' cognitive characteristics, and can better serve educators and learners.

FUNDING INFORMATION
This work was supported by key projects of philosophy and social science research in colleges and universities in Anhui Province (Grant Number 2022AH050160).