A novel classification model of collective user web behaviour based on network traffic contents

Web behaviour analysis of a collective user provides a powerful means for studying collective user interests on the Internet. However, existing research merely analyses the behaviour of a single user who accesses multiple applications or of multiple users who access one application. The authors propose a web behaviour classification model for a collective user, in which the title fields of HTTP flows are extracted from mirrored network traffic captured over any given period of time. The title fields, considered as short abstracts of the web pages browsed by the users, are vectorized with natural language processing technologies. Specifically, the Latent Dirichlet Allocation (LDA) algorithm is used to calculate the topic distribution probability matrix. Afterward, multi-class classifiers are trained and tested on the manually labelled probability distribution matrix output by the LDA algorithm to classify the user behaviour topics. The experiments demonstrate that the highest classification accuracy of the model reaches 81.2% by combining the LDA algorithm with a Random Forest classifier.


| INTRODUCTION
Collective behaviour refers to the relatively spontaneous and relatively unstructured behaviour of a group of people acting with or being influenced by each other [1]. Collective web behaviour is one form of collective behaviour: the interaction between a group of people and browsers [2]. Widely used in network measurement and network environment simulation [3], collective web behaviour classification is a method for classifying the web accessing behaviour of a group of users over a continuous period. The classification results for a collective user can also be generalized and applied in many critical industrial applications, such as search engines [4], user behaviour modelling [5] and recommendation systems [6]. They can also provide data support for collective user behaviour simulation [7] in all kinds of network testbeds, for example, the National Cyber Range [8]. In order to realistically simulate the web behaviour of a group of users over a period of time, we classify the behaviours into six mutually distinct categories, namely video, music, social network/blog, news, Internet information retrieval, and e-commerce, according to the content of the objects users access, as described in [9]; these categories cover almost all user access to web objects. We further summarize the research methods for analysing user web behaviour into, but not limited to, the following three types: (1) Web-content-based analysis methods. These analyse user behaviour by mining text, HTML and other text formats on a typical business management website. For example, incorporating content, structure and usage data from web sources, Loyola et al. [10] analysed web users' behaviour with an Ant Colony Optimization algorithm, and Awad et al. [11] predicted the next set of web pages to be visited by a user based on the Markov model.
(2) Web-connection-based analysis methods. In social networks, connections are abstracted into user relationships and interactions [12][13][14], which have been widely used to mine and analyse the evolution of social networks; for example, Al-Garadi et al. [13] predicted the probability of a connection occurring at a specific time using time-series models of single-link occurrences. (3) Log-based analysis methods. Here, users' logs, collected from servers [15][16][17] or clients [18][19][20], are mapped into terminal operation behaviour. Researchers explored the interactions of users with existing systems to study the issues users are most concerned with and to further understand users' access intent.
Although the methods mentioned above can analyse user web behaviour either for a single user or for a single website, the calculation of collective user web behaviour remains a challenge in terms of data collection and classification methods. On the one hand, it is difficult to collect large-scale users' sessions or cookies from clients; on the other hand, most of the data necessary for web log analysis is saved on web servers, proxy servers, enterprise log servers, and so forth, and it is almost impractical to collect logs from them due to the servers' security policies. Table 1 lists the abbreviations and acronyms used throughout this paper.
Inspired by the web-content-based analysis methods mentioned above, and to overcome the above limitations, we propose a new model for classifying the web behaviour of a collective user (CWBCU) based on network traffic contents, which makes use of network traffic to analyse the topics of the web behaviour of collective users. To classify these topics, we acquire network traffic from a network device and then employ our proposed model, which combines the LDA algorithm [21] with a multi-class classifier [22]. Based on the CWBCU model, we can classify the topics of the web behaviour of the collective user into the six categories from network traffic.
In summary, the main contributions of our work are listed as follows: (1) We propose an algorithm to extract web page names (i.e. title fields in an HTTP flow) from pcap files, by first extracting HTTP flows from a pcap file, then splitting the flows, and finally saving the results into HTML files. The web page names are acquired by analysing the headers of the HTML files. (2) An algorithm is designed to classify the topics of collective user web behaviour. First, we generate a document-word frequency matrix by analysing the web page names with NLP techniques. Then, the matrix is transformed into a probability matrix by the LDA algorithm. Finally, the multi-class classifiers are trained with the probability matrix, which is labelled with the topics of user behaviour types. The rest of this paper is organized as follows. Section 2 provides a summary of related work, and the CWBCU model is proposed in Section 3. Section 4 outlines the experimental results on the dataset, followed by Section 5, which provides conclusions and suggestions for future research.

| RELATED WORK
Web behaviour calculation of a collective user is one of the most important means of understanding users' intentions, and it provides data support for collective user behaviour simulation. The analysis of the behaviour of a collective user is more important than that of a single user's behaviour in many ways, as shown in the following scenario. When using a Cyber Range platform, a realistic network environment needs to be established, and this includes building the foreground streams. The foreground streams are generated by simulating user behaviours. In our experiment, taking a collective user as a unit, we can obtain its behavioural topics as percentages, for example, 20% of the behaviours belong to music. The simulation of collective user behaviour can then be performed on the Cyber Range platform according to these percentages to generate foreground streams, so that the simulation environment is more realistic, whereas foreground streams generated by analysing a single user's behaviours are less realistic. Currently, most research focuses on web-service-based user behaviour analysis, while our study is based on network traffic.

| User web behaviour analysis
Based on the assumption that web behaviour is dependent on the type of query, Park et al. [24] analysed user search behaviour in a large-scale query log from Yahoo Image Search, in which two orthogonal classifications are employed to classify queries and identify consequential query types. Considering the sequential aspects of a web user session and taking the sequential information of web pages into account, Mishra et al. [25] proposed a recommendation system for web users. Lopes et al. [26] did similar research, constructing a real-time dynamic recommendation system with web mining techniques to analyse web logs. Also, based on knowledge of the previously accessed pages, Awad et al. [11] predicted the set of web pages that a user might access and proposed a new modified Markov model to reduce the memory requirements. Web logs were also used to assess user learning behaviour and cognition; for example, Rao et al. [27] collected mouse click and movement data during human-machine interaction and then applied statistical and data mining techniques to model users' learning behaviours. The results of the model were used to predict possible cognitive bias.
In a nutshell, the methods mentioned above can be used to analyse or infer user web behaviour in certain applications of interest. However, they are not suitable for analysing the topics of collective web behaviour or the topic distribution during an arbitrary period of time. On the one hand, the data, such as web logs, connections and mouse movements, are collected from only certain web applications of concern; on the other hand, their analysis seldom involves web content. In our study, for data collection, we adopt the network traffic captured from managed switches instead of web logs. With respect to content analysis, the LDA algorithm is employed to calculate the collective behaviour topics by analysing the names of the web pages that the users accessed.
The LDA is a probabilistic generative model in which each document is viewed as a mixture of a small number of topics in a corpus, and each topic is modelled as an infinite mixture over an underlying set of topic probabilities [22]. Krestel et al. [28] applied an LDA-based method to elicit a shared topical structure from tagged resources to solve the tag recommendation problem; their evaluation shows that this method achieves better accuracy and recall than association rules.
Chen et al. [29] proposed Forum-LDA to model the root-post and reply process, based on the assumption that the topic of a reply is determined only by the author's serious interests and the root post. The model is capable not only of learning more coherent topics and serious interests, but also of identifying unserious users who post irrelevant replies. Guo et al. [30] employed the LDA algorithm to identify the critical dimensions of users' evaluations of hotel services and obtained a total of 19 controllable dimensions, which are key to hotel management.

| Network traffic analysis
Network traffic analysis is the process of inspecting and analysing network traffic for performance, security, or general network management. Supervised multi-class classifiers, which assign each sample to one and only one class label, are broadly used to perform traffic classification; for example, Random Forest has been used for traffic identification [31]. Nair et al. [32] proposed an improved decision tree classifier to detect P2P traffic. Guy et al. [33] integrated a Random Forest classifier into the AdaBoost algorithm to perform traffic congestion prediction, achieving a lower error rate than naïve predictors. In [34], Random Forest classifiers were also used to detect the presence of domain name system tunnels in a complex defence-in-depth enterprise environment; the experimental results showed that the classifiers detected not only the three types of tunnels that were in the training dataset, but also four new types of tunnels that were not.
To sum up, the methods mentioned above merely studied traffic classification using machine learning algorithms trained and evaluated on annotated datasets; the traffic contents were not analysed. Some researchers focus on learning the specific actions a user performs via network traffic analysis [35, 36]. For example, Conti et al. [35] designed a system that identifies a user's actions on mobile devices from network traffic using machine learning classifiers. Furthermore, in [37], fine-grained user activities were identified within the wireless network traffic generated by certain apps based on traffic behaviour. In contrast, our work focuses on classifying the topics of the web behaviour of a collective user rather than analysing specific actions. To the best of our knowledge, ours is the first work to study collective user behaviour using network traffic contents, in which the title fields from HTTP traffic are aggregated into multi-category documents using the LDA algorithm and the probability matrix under each category is exported. Considered as the eigenvalues of the topics, the probability matrix is manually annotated with the six categories of user behaviour topics. Finally, supervised machine learning algorithms are introduced to perform multi-class classification. The trained model is capable of calculating the topics of the web behaviour of the collective user from the contents of network traffic captured from most network devices, such as managed switches and firewalls.

| MODEL DESCRIPTION
In this section, we define and describe our proposed CWBCU model, which is composed of three layers, the pre-processing layer, the encoding layer and the computing layer, as shown in Figure 1.
The network traffic contents are processed by these three layers in the given order, and finally the web behaviour topics of a collective user are classified by the model. In Figure 1, network traffic, in the form of pcap files, is the input to the pre-processing layer, which extracts the web page names from the pcap files acquired by bypassing network traffic from network devices. Each web page name is considered as a document, which implies the behaviour topics of the user who is browsing the websites. The documents are pre-processed with NLP techniques, and then a document-word frequency matrix is created, which serves as the input to the LDA algorithm.
The encoding layer transforms the document-word matrix into a probability distribution matrix for each topic of web behaviour. The probability distribution matrix is considered as the eigenvectors of web page names and labelled manually with the six categories of topics, which will be used in the computing layer as a dataset to train and test the classifiers. The LDA algorithm cannot directly compute the web behaviour topic of each web page name; therefore, supervised machine learning algorithms are necessary for web behaviour computing.
In the computing layer, the dataset is used to train the supervised classifiers and test the classification accuracy of the classifiers applied in the computing layer iteratively, and finally, the best model is used to classify the topics of web behaviour of the collective user from traffic network contents. It should be noted that any supervised multi-class classifier can be applied in the computing layer, for example, Support Vector Machine (SVM) or Random Forest. The trained classifier will be employed in the CWBCU model to calculate the web behaviour of a collective user.
Overall, by extracting valid information from network traffic contents that pass through the pre-processing, encoding and computing layers, the CWBCU model can classify the topics of collective web behaviour over a continuous period. The results can be widely used in user web behaviour simulation and user interest analysis. A detailed description of the implementation of the three layers is given in the following sections.

| Data collection and extraction mechanism
In this section, we first introduce the method for collecting data from network flows and then propose an algorithm to extract the data of interest, the web page names, from the traffic data.
The network traffic is captured in managed switches by port mirroring, as shown in Figure 2, in which all the packets passing through the managed switch are bypassed to the storage server and saved as pcap files.
In Figure 2, the traffic generated by any device in the network is captured by the managed switch. However, to extract the web page names from the traffic, we have to merge TCP packets into TCP flows through the quintuple <timestamp, source IP, source port, destination IP, destination port>. The quintuple reveals which source IP accessed which service (destination port) on which server (destination IP) at what time (timestamp). All the fields in the quintuple can be obtained directly by parsing the pcap files. When a user accesses a web page with a URL in a web browser, the browser sends an HTTP request to the server, and the server responds with an HTML file. The title tag (web page name) displayed at the top of the web browser, which implies the user's network behaviour topics, can be obtained by parsing the HTTP response packets in a pcap file. In detail, when responding to the request, the server may perform chunked transfer-encoding and GZIP compression on the HTML file and send it to the destination IP chunk by chunk to improve data transmission efficiency.
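As an illustration only, the flow-merging step described above can be sketched as follows; the packet dictionaries and their field names are hypothetical stand-ins for values parsed from a pcap file, not the paper's actual implementation.

```python
from collections import defaultdict

def merge_into_flows(packets):
    """Merge parsed TCP packets into flows keyed by
    <source IP, source port, destination IP, destination port>;
    the earliest timestamp in each group gives the flow's start time."""
    groups = defaultdict(list)
    for pkt in packets:
        key = (pkt["src_ip"], pkt["src_port"], pkt["dst_ip"], pkt["dst_port"])
        groups[key].append(pkt)
    flows = {}
    for key, pkts in groups.items():
        # Order packets for payload reassembly (sequence number breaks ties).
        pkts.sort(key=lambda p: (p["timestamp"], p["seq"]))
        flows[key] = {
            "start_time": pkts[0]["timestamp"],
            "end_time": pkts[-1]["timestamp"],
            "payload": b"".join(p["payload"] for p in pkts),
        }
    return flows
```

A usage example: two out-of-order packets of one flow are merged into a single payload with the correct start time.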
To extract web page names from a pcap file, we first detect HTTP response packets according to the HTTP status codes (e.g. HTTP 200 OK) and, meanwhile, search for the packets whose content-type is 'text/html' in the HTTP flows. Next, we seek the next packet in the same HTTP flow according to the ACK and sequence numbers in the TCP header until the PSH flag is 1. Finally, the packets are merged into one HTTP flow. A detailed procedure for extracting web page names from a pcap file is given in Algorithm 1, whose output is title_infor, a matrix that stores the web information extracted from the pcap file.

Algorithm Extracting web page names from a pcap file
In the matrix title_infor, each row vector is composed of title_name, start_time, end_time, source_IP, source_port, destination_IP and destination_port, in this order. The start_time is the time at which the HTTP flow is established and the end_time the time at which it disconnects. Keep-alive connections in HTTP/1.1 allow a client and a server to exchange multiple HTTP requests and responses over the same TCP connection. Therefore, during a period of time there may be multiple web page names in one TCP flow with the same quadruple <source IP, source port, destination IP, destination port> in HTTP/1.1.
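A minimal sketch of the title-extraction step of Algorithm 1 applied to a single reassembled HTTP response is given below. It detects a 200 text/html response, reverses the chunked transfer-encoding and GZIP compression mentioned earlier, and reads the title tag; real traffic requires far more robust parsing than this illustration.

```python
import gzip
import re

def extract_title(raw_response: bytes):
    """Return the <title> text of a reassembled HTTP response,
    or None if it is not a 200 text/html page."""
    head, _, body = raw_response.partition(b"\r\n\r\n")
    headers = head.decode("latin-1").lower()
    if "200" not in headers.splitlines()[0] or "text/html" not in headers:
        return None
    if "chunked" in headers:
        # Re-join the chunks: each chunk is "<hex size>\r\n<data>\r\n".
        data, rest = b"", body
        while rest:
            size_line, _, rest = rest.partition(b"\r\n")
            size = int(size_line, 16)
            if size == 0:
                break
            data += rest[:size]
            rest = rest[size + 2:]  # skip the trailing \r\n of the chunk
        body = data
    if "gzip" in headers:
        body = gzip.decompress(body)
    match = re.search(rb"<title[^>]*>(.*?)</title>", body, re.S | re.I)
    return match.group(1).decode("utf-8", "replace").strip() if match else None
```

Note that de-chunking must precede decompression, since the server compresses first and chunks the compressed stream for transfer.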
For the training data, we capture and save the traffic, then extract the titles of the pages in it as short texts. Data tagging is then done by nine people working in groups of three: each person first tags the well-recognized titles, and the category of each uncertain title is decided jointly by the three people in the group.
In the next section, we propose a web behaviour classification method for a collective user by combining the LDA algorithm and a multi-class classifier, as shown in Figure 3. The web page name extracted by Algorithm 1 is a piece of short text, which is used as the input of the encoding layer, as mentioned above. We perform feature engineering to construct the document-word frequency feature matrix based on word segmentation.

| Text pre-processing
Here, we first perform word segmentation using the jieba API [38] on the title_name column of the matrix title_infor, the output of Algorithm 1. Then, the stop words in the web page names are removed to exclude interference items from topic analysis. Finally, a document-word frequency matrix, the result of text pre-processing, is generated as shown in Figure 4.
As demonstrated in Figure 4, the row d_i is the ith document, the column w_j is the jth word in the corpus, and dw_ij is the frequency with which w_j appears in d_i.
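The pre-processing just described can be sketched as follows. The paper segments (largely Chinese) titles with jieba; to keep this illustration self-contained we split on whitespace and use a toy stop-word list, both of which are stand-ins for the actual segmenter and stop-word dictionary.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of"}  # illustrative stop-word list

def doc_word_matrix(titles):
    """Build the document-word frequency matrix of Figure 4:
    row i is a document (web page name), column j a vocabulary word,
    and entry dw_ij the count of word j in document i."""
    docs = [[w for w in t.lower().split() if w not in STOP_WORDS]
            for t in titles]
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: j for j, w in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for word, count in Counter(doc).items():
            row[index[word]] = count
        matrix.append(row)
    return vocab, matrix
```

For example, the titles "The best music videos" and "music news of the day" yield a 2-row matrix over the stop-word-filtered vocabulary.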

| Encoding with the LDA algorithm
In this section, to suit our purpose, we build an algorithm to implement the LDA framework.
The LDA algorithm is a generative probabilistic model of a corpus, here the set of all user browsing records. Each topic in the corpus is characterized by a probabilistic distribution over web page names. A web page name can be considered as a document, and each document is associated with one of the six topics. The LDA algorithm can analyse these web page names to find not only the hidden topics in the corpus but also the hidden topics of the users. A detailed procedure of our algorithm is shown in Algorithm 2. The LDA algorithm is composed of three hierarchies, as shown in Figure 3, and assumes the following generative process to create a web page name in user browsing records d using K topics.
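The generative process referred to above can be illustrated with a toy sampler: draw a per-document topic distribution from a Dirichlet prior, then for each word position draw a topic and a word from that topic's word distribution. The vocabulary, topic-word distributions and hyperparameters below are invented for illustration (the paper additionally draws the title length N_m from a Poisson distribution, which we omit here).

```python
import random

def dirichlet(alpha, rng):
    """Sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_title(vocab, phi, alpha, n_words, rng):
    """Toy LDA generative process: theta ~ Dir(alpha), then for each
    word position draw a topic z ~ theta and a word w ~ phi[z],
    where phi's rows are per-topic word distributions."""
    theta = dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]
        w = rng.choices(range(len(vocab)), weights=phi[z])[0]
        words.append(vocab[w])
    return words
```

Inference then runs this process in reverse: given observed words, it estimates theta and phi.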

Algorithm Hidden topics calculation
Our algorithm implements all the elements and processing steps in Figure 3, in which α is the Dirichlet prior on the per-document topic distributions, β is the Dirichlet prior on the per-topic word distributions, θ_m is the topic distribution for document m, φ_{z_{m,n}} is the word distribution for topic z_{m,n}, z_{m,n} is the topic for the nth word in document m and w_{m,n} is the specific word. The w_{m,n} are observable variables, while z_{m,n} and θ_m are latent variables. The outer plate, the document level, refers to a document, while the inner plate, the word level, refers to the process of iteratively choosing a latent topic and a word in the document in order to traverse all the topics and words. According to the graphical model, a document's generation probability can be obtained by Equation (1).
In Equation (1), θ_m is the topic distribution of document m and φ_{z_{m,n}} is the term distribution of topic z_{m,n}; z_m denotes all topics in document m and w_m all specific words in document m; n indexes the words in the range [1, N_m], where N_m is the length of the web page name and obeys a Poisson distribution with parameter ξ; Φ is the parameter of the word distributions. Both θ_m and φ_{z_{m,n}} are inferred by Gibbs sampling methods [39], which are capable of extracting the topics from large-scale text. Parameter estimation with the Gibbs sampling method can be regarded as the inverse of the text generation process: given a known text set (the generated result), the parameter values are obtained by parameter estimation. Furthermore, to estimate the actual parameters, the topic of each word is sampled rather than obtained by integration.
Once the topic of each word is determined, the parameters can be calculated from frequency statistics. Therefore, the parameter estimation problem is transformed into calculating the conditional probability of the topic sequence given the word sequence via Equation (2).
In Equation (2), z_i is the topic variable of the ith word, the subscript ¬i indicates that the ith item is excluded, and n_k^t is the number of times that term t appears in topic k. β_t is the Dirichlet prior of term t, n_m^k is the number of times that topic k appears in document m, and α_k is the Dirichlet prior of topic k. The necessary parameters are calculated by Equations (3) and (4) once the topic indices are obtained.
In Equations (3) and (4), φ_{k,t} is the probability of term t in topic k, while θ_{m,k} is the probability of topic k in document m.
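The equations referenced above are not reproduced in the text. Assuming they follow the standard collapsed Gibbs sampling treatment of LDA with the notation just defined, they presumably take the following form (a reconstruction, not the paper's own typesetting):

```latex
% Collapsed Gibbs sampling update (Equation (2)): probability that word i
% is assigned topic k given all other topic assignments
p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}) \;\propto\;
  \frac{n_{k,\neg i}^{t} + \beta_t}{\sum_{t'} \bigl( n_{k,\neg i}^{t'} + \beta_{t'} \bigr)}
  \cdot \bigl( n_{m,\neg i}^{k} + \alpha_k \bigr)

% Parameter estimates after sampling (Equations (3) and (4)):
\varphi_{k,t} = \frac{n_k^{t} + \beta_t}{\sum_{t'} \bigl( n_k^{t'} + \beta_{t'} \bigr)},
\qquad
\theta_{m,k} = \frac{n_m^{k} + \alpha_k}{\sum_{k'} \bigl( n_m^{k'} + \alpha_{k'} \bigr)}
```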
As an unsupervised algorithm, the LDA algorithm can only divide the web page names into a set number of categories according to topic, but it cannot determine what each topic is. In the next section, supervised methods are introduced into the model to learn the topics from the calculated results of the LDA algorithm.

| Topic classification based on multiclass classifiers
In this paper, we view the output of the LDA algorithm as an eigenmatrix, which is manually labelled to form a dataset with six categories of user behaviour topics, that is, video, music, social network/blog, news, Internet information retrieval, and e-commerce. Supervised multi-class classifiers, for example, Random Forest, XGBoost, and Naïve Bayesian, are employed to train and test the classification model in our evaluation experiments. Naïve Bayes classifiers are particularly popular in text classification and are effective for problems such as spam detection. XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework to solve data science problems quickly and accurately; each new tree is generated so as to correct the errors made by the previously trained trees. Random Forest is also a tree-based ensemble machine learning algorithm, but in contrast to XGBoost it trains each tree independently and merges multiple decision trees to acquire a more accurate and stable result. These three algorithms are widely used in classification problems, so we use them as representatives to test our model. The trained model will be used to calculate the topics of the web behaviour of the collective user. The flow diagram of the model is shown in Figure 5.
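As an illustration of how a supervised classifier can consume the LDA topic-probability vectors, here is a minimal Gaussian Naive Bayes written from scratch; the paper's experiments use library implementations, and the feature vectors and labels in the usage below are invented.

```python
import math
from collections import defaultdict

class GaussianNB:
    """Minimal Gaussian Naive Bayes: each feature (topic probability)
    is modelled as an independent Gaussian per class."""

    def fit(self, X, y):
        by_class = defaultdict(list)
        for x, label in zip(X, y):
            by_class[label].append(x)
        self.stats = {}
        for label, rows in by_class.items():
            n = len(rows)
            means = [sum(col) / n for col in zip(*rows)]
            variances = [max(sum((v - m) ** 2 for v in col) / n, 1e-9)
                         for col, m in zip(zip(*rows), means)]
            self.stats[label] = (math.log(n / len(X)), means, variances)
        return self

    def predict(self, x):
        def log_posterior(label):
            log_prior, means, variances = self.stats[label]
            return log_prior + sum(
                -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
                for v, m, var in zip(x, means, variances))
        return max(self.stats, key=log_posterior)
```

The classifier picks the class whose Gaussian likelihood (times the class prior) best explains the topic-probability vector.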
As depicted in Figure 5, firstly, web page names are extracted from the title fields of the filtered HTTP flows in a pcap file. Then, text processing is conducted to generate a document-word frequency matrix from the web page names. Next, we input the document-word frequency matrix into the LDA algorithm to calculate the probability distributions over the possible topics. Finally, the calculated results are labelled and then divided into two datasets, a training dataset and a test dataset, for training and testing the multi-class classifiers, respectively. The trained model is used to classify the topics of collective web behaviour from pcap files.

| EXPERIMENTAL RESULTS AND DISCUSSION
The primary purpose of this section is to test and evaluate the classification accuracy of the CWBCU model using the dataset from Section 3. We also test the convergence of the model with different iterative processes.

| Dataset and experiment setup
As shown in Figure 2, the network traffic is captured by bypassing traffic in a managed switch and is saved in the form of pcap files. We collected the network traffic from one managed switch for three months, with at most 56 users, in our laboratory. The laboratory members are between 20 and 40 years old, of whom more than 80% are students between the ages of 20 and 30. The HTTP flows are extracted from all the pcap files. We conducted experiments on a balanced dataset of 30,000 non-repeating samples, which are the output of the LDA algorithm marked with the six categories of topics. Five persons in our laboratory worked as assessors who independently performed data labelling. Disagreed labels were relabelled by another two persons, and the final result was decided by a vote of the seven assessors. We validated the labelling results for each type by random sampling, fixing incorrectly labelled data and confirming consistent labelling. The raw agreement rate in the first round reached 99.5% and eventually reached 100% for all labelled data. To further improve the performance of the LDA model on sparse and noisy short texts, we aggregate the web page names by user (source_IP in title_infor), web site (destination_IP in title_infor) and time interval (20 min) for the training set [40, 41].
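The aggregation step can be sketched as follows; the record layout (timestamp, source IP, destination IP, title) is a hypothetical simplification of the title_infor rows, and the bucketing by fixed 20-minute windows is one plausible reading of the time-interval grouping.

```python
from collections import defaultdict

def aggregate_titles(records, window_seconds=20 * 60):
    """Aggregate short title texts into longer pseudo-documents keyed by
    (source IP, destination IP, 20-minute time bucket), mitigating the
    sparsity of short texts for LDA."""
    buckets = defaultdict(list)
    for ts, src, dst, title in records:
        key = (src, dst, int(ts // window_seconds))
        buckets[key].append(title)
    # Join each bucket's titles into one pseudo-document.
    return {key: " ".join(titles) for key, titles in buckets.items()}
```

Titles from the same user and site within one window are concatenated, while titles outside the window form separate documents.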

| Experiment evaluation
We first verify the convergence of the model and then compare the classification accuracy among three concrete methods under the CWBCU model.

| The convergence evaluation of the CWBCU model
In order to evaluate the convergence of the model, we wrote a testing program in which perplexity [42] and log-likelihood [43] are used. The perplexity, defined as the geometric mean of the inverse marginal probability of each word in the held-out set of documents, is used to evaluate the LDA algorithm on held-out data and is widely used as an indirect measure of predictive performance, as shown in Equation (5).
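Equation (5) is not reproduced in the text; assuming the standard held-out perplexity formulation matching the definition just given, it presumably reads:

```latex
% Held-out perplexity (presumably Equation (5)): the exponentiated negative
% average per-word log-likelihood over the M test documents
\mathrm{perplexity}(D_{\mathrm{test}}) =
  \exp\!\left\{ - \frac{\sum_{d=1}^{M} \log p(\vec{w}_d \mid \alpha, \beta)}
                       {\sum_{d=1}^{M} N_d} \right\}
```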
Because log p(w_i^test | α, β) cannot be computed directly, the lower bound on perplexity shown in Equation (6) is used to approximate it.
In Equation (7), ŷ_i is the predicted value of the ith sample and y_i is the corresponding actual value.
(1) The classification accuracy of the CWBCU model with the LDA-Random Forest method. We first combine the LDA algorithm with a Random Forest classifier to test the classification accuracy of the model. Specifically, the relationship between classification accuracy and the number of decision trees is evaluated in a wide range of experiments, as shown in Figure 7. The number of decision trees ranges from 5 to 70, and the accuracy is the average over ten rounds of experiments in each group, obtained by running a grid search of hyperparameters with the combination maxDepth = 5, maxBins = 32 and minInfoGain = 0.0.
As shown in Figure 7, the classification accuracy with K = 23 is superior to that with K = 20 and K = 30, with accuracy ranging from 72.5% to 81.2%. Table 2 lists the number of web sites under each category. (2) The classification accuracy of the CWBCU model with the LDA-XGBoost method. We then combine the LDA algorithm with an XGBoost classifier and evaluate it under different hyperparameter combinations, as shown in Figure 8.
As illustrated in Figure 8, the classification accuracy differs across hyperparameter combinations. The highest classification accuracy is 79.2% when K = 14, max_depth = 10, and learning rate = 0.15. The lowest accuracy is 71.1% when K = 33, max_depth = 15, and learning rate = 0.25.
(3) The classification accuracy of the CWBCU model with the LDA-Naïve Bayesian method. We finally evaluate the model by combining the LDA algorithm with a Naïve Bayesian classifier under different K values. The experimental results are shown in Figure 9.
In Figure 9, the classification accuracy of the LDA-Naïve Bayesian method is above 70.6%, and the accuracy is stable under the different K values (K = 10, 14, 20, 23, 30, and 33). The F value = 31.4265, p-value = 4.32×10^-6 and F critical value = 3.68232, so we can conclude that the difference among them is significant and not a random effect. Under the best classification result with LDA-Random Forest, the classification accuracy of each type is summarized in Table 3.
From Table 3, we can see that the classification accuracy is relatively balanced. The accuracy for news is the lowest, because the keywords of news titles are not distinctive enough and news titles may therefore be classified as other types, while the accuracy for video is the highest. Furthermore, we have noticed that most web page names containing the word 'video' are indeed classified into the video type. From Table 2, it should be noted that the number of web sites in each category does not directly impact the classification accuracy. The unstructured nature of the web page names extracted from pcap files makes topic classification challenging. Moreover, the same words appearing in different web page names lower the discrimination among web behaviour categories. By considering the context of the web page names with the LDA model, the CWBCU model can acquire better classification results.

| Comparison with LSA and pLSA
We also compare LDA with LSA and pLSA on the same dataset using different K values (K = 5, 10, 15, 20, 23, 30, and 35), with the number of decision trees in the Random Forest ranging from 5 to 70. The relationship between classification accuracy and the number of decision trees is evaluated in a wide range of experiments, as shown in Figures 10 and 11.
As illustrated in Figure 10, the classification accuracy ranges from 72% to 78% for LSA, and the highest accuracy is 77.9% at K = 30 and numtrees = 53. In Figure 11, the accuracy for pLSA ranges from 62% to 79%, and the best value is 78.5% at K = 30 and numtrees = 67. According to the results for LSA and pLSA, the accuracy obtained using the Random Forest algorithm improves as the value of K increases but becomes stable after a certain value. The classification accuracy of pLSA with K = 20 is obviously lower than the others.

| Discussion
We compare the classification accuracy of the LDA-Random Forest, LDA-XGBoost and LDA-Naïve Bayesian methods; the highest classification accuracy of each method is 81.2%, 79.2% and 70.6%, respectively. The accuracy of the LDA-Naïve Bayesian method is the lowest. This might be because the features are not sufficiently independent; for example, the same words may be assigned to different types in different contexts. Random Forest is a bagging algorithm that reduces variance and is not sensitive to outliers, whereas XGBoost is a supervised classifier that implements parallel boosting. As shown in Figure 8, the maximum accuracy obtained by grid hyperparameter searching is 79.2% for the LDA-XGBoost method, so the LDA-Random Forest method is superior to the LDA-XGBoost method in terms of classification accuracy. Moreover, in terms of running time, the former takes much less time than the latter: LDA-Random Forest takes approximately 10.2 s to perform the grid hyperparameter search, while XGBoost takes 105.3 s. These times do not include data importing and LDA analysis (approximately 84 s).
Compared with the LSA-Random Forest and pLSA-Random Forest methods, the LDA-Random Forest method that instantiates the CWBCU model yields higher classification accuracy by aggregating the users and websites. In our experiments, the best method is LDA-Random Forest, followed by LDA-XGBoost, then pLSA-Random Forest, then LSA-Random Forest, and finally LDA-Naïve Bayesian.

| CONCLUSION AND FUTURE WORK
In this study, we use recent developments in topic modelling techniques to study user behaviour calculation methods and propose a novel model to classify collective web behaviour from network traffic contents. We have performed extensive experiments to compare the classification accuracies of the LDA-Random Forest, LDA-XGBoost and LDA-Naïve Bayesian methods. The highest classification accuracy reaches 81.2% with the LDA-Random Forest method, which demonstrates that the model is useful for classifying the topics of collective web behaviour. Overall, by analysing network traffic contents, the CWBCU model can obtain the topics of the web behaviour of a collective user, and the network traffic is easily collected from network devices. In addition, compared with analysing whole web pages, calculating with the web page name obtains relatively accurate results more quickly. In order to obtain the distribution of the behaviour of collective users in a community over the six categories, the model we build analyses the web topics in the traffic generated by the collective users in a community and identifies which of the six categories each topic belongs to. Therefore, the results are able to reflect the relationship between collective user behaviour and multiple behavioural topics.
Since our research is the first in the area of web behaviour calculation of a collective user from network traffic contents, it may have some limitations. One of them is that the parameters of the LDA algorithm and the multi-class classifier may be coupled, which should be considered carefully in future research. In future work, we wish to apply deep learning algorithms with NLP (e.g. other word segmentation algorithms) to improve the classification accuracy and understand a collective user's web behaviour better.