Deobfuscation, unpacking, and decoding of obfuscated malicious JavaScript for machine learning models detection performance improvement

Abstract: Obfuscation is rampant in both benign and malicious JavaScript (JS) codes. It generates obscure, hard-to-detect code that hinders comprehension and analysis. Therefore, accurate detection of JS codes that masquerade as innocuous scripts is vital. Existing deobfuscation methods assume that a specific tool can recover an original JS code entirely. For multi-layer obfuscation, general tools produce a formatted JS code, but some sections remain encoded. For the detection of such codes, this study performs Deobfuscation, Unpacking, and Decoding (DUD-preprocessing) by function redefinition using a Virtual Machine (VM), a JS code editor, and a Python int_to_str() function to facilitate feature learning by the FastText model. The learned feature vectors are passed to a classifier model that judges the maliciousness of a JS code. In the performance evaluation, the authors use Hynek Petrak's dataset for obfuscated malicious JS codes, and the SRILAB dataset and the Majestic Million service top 10,000 websites for obfuscated benign JS codes. They then compare the performance to that of other models on the detection of DUD-preprocessed obfuscated malicious JS codes. The experimental results show that the proposed approach enhances feature learning and improves accuracy in the detection of obfuscated malicious JS codes.


Introduction
JavaScript (JS) obfuscation is a transformation aimed at generating a JS code that is obscure to the human eye and undetectable to scanners. Obfuscation hinders the comprehension and analysis of JS code content. This method of code transformation is rampant in both benign and malicious JS codes. Therefore, it would be wrong to naively render a JS code as malicious simply because it contains some form of obfuscation [1][2][3]. Obfuscation in software and programs, such as JS codes and HyperText Markup Language (HTML), has advantages of, among others, proprietary code protection, code compression, performance optimisation, and curbing code reverse engineering [4,5]. However, obfuscation is also used by cyber-attackers to camouflage a JS code's malicious intentions while preserving the overall code behaviour. This property, when used in malicious JS codes, is exploited to circumvent detection and identification, significantly affecting the performance of network and information security tools such as intrusion detection systems and anti-virus software.
The use of data- and encoding-based obfuscation techniques can reduce the detection rate of signature- and heuristics-based tools by ∼40 and 100%, respectively [6]. Other languages, such as HTML, have a finite number of obfuscation methods [2,7]; for example, an HTML element can be obfuscated by transforming an iframe to perform evasion attacks. On the other hand, JS code is dynamic. Hence, it lends itself to a variety of widely used obfuscation techniques, such as variable and function name randomisation, string splitting and concatenation, keyword substitution, dead code insertion, packing, and encoding, such as Base64 [8,9]. It is also possible to implement customised encoding to obfuscate a JS code [6], where a special function is inserted and used to decode the JS code at runtime. Listing 1 (see Fig. 1) is the encoding function, where each character's Unicode code point is increased by one. Listing 2 (see Fig. 2) is the resultant obfuscated JS code, which is decoded before execution [6]. The function call 'document.write('Hello world!')' in the encoding function is encoded into 'epdvnfou/xsjuf)(Ifmmp!xpsme(*' in the resultant obfuscated JS code.
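The code-point-shift encoding described above can be illustrated with a minimal Python sketch. The function names shift_encode and shift_decode are ours for illustration; the real obfuscator embeds the decoder as a JS function in the obfuscated code itself:

```python
def shift_encode(s, shift=1):
    """Obfuscate by shifting each character's Unicode code point up by `shift`."""
    return "".join(chr(ord(c) + shift) for c in s)

def shift_decode(s, shift=1):
    """Inverse transform, applied by the injected decoder function at runtime."""
    return "".join(chr(ord(c) - shift) for c in s)

payload = "document.write('Hello world!')"
encoded = shift_encode(payload)  # begins with "epdvnfou/xsjuf)", as in Listing 2
assert shift_decode(encoded) == payload
```

The round-trip assertion mirrors the behaviour in Listings 1 and 2: the shifted string is meaningless to a signature scanner, yet the original payload is fully recoverable at runtime.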
Therefore, it is vital to accurately detect malicious JS codes that use obfuscation to masquerade as innocuous scripts. There exist several methods to analyse such obfuscated JS codes. These approaches perform either static or dynamic analysis of obfuscated JS codes. Some of the JS code deobfuscation approaches assume that it is possible to recover an original JS code entirely using a specific deobfuscation tool. We apply such a strategy to analyse obfuscated malicious JS codes using general deobfuscation tools such as Dirty Markup [10], Online JS Code Beautifier [11], and Dan's Tools JS Code Formatter [12]. Applying the deobfuscation tools helps in realising a readable and pretty JS code that is easier to edit and analyse. However, as part of our findings, there still exist some sections of a JS code that prove hard to deobfuscate, for example, in a JS code with multiple layers of obfuscation.
In our previous work [13,14], we performed feature learning for regular JS code using Plain-JS code (Plain-JS) features, abstract syntax tree (AST) features, and paragraph vectors. The AST form of JS code (AST-JS) gives a robust property against some perturbations in JS codes. Using AST-JS, it is possible to detect obfuscated JS codes whose obfuscation techniques do not affect the original semantics of the code contents. However, it is not appropriate for the representation of a JS code that contains packing and/or encoding obfuscation techniques, which convert JS code contents into ASCII, Unicode, hexadecimal, and encrypted values. Only about 50% of obfuscated JS codes preserve their original semantics [15]. This study focuses on the detection of obfuscated malicious JS codes, specifically packed and encoded JS codes, including ones that contain multi-layer obfuscation.
This paper proposes a hybrid-analysis approach by integrating the use of a JS code beautifier and formatter, a virtual machine (VM), a JS code editor, and a Python int_to_str() function to deobfuscate an obfuscated JS code, unpack a packed JS code, and decode an encoded JS code, respectively. The FastText model, a machine learning model, performs feature learning for DUD-preprocessed JS codes. A classifier model judges the maliciousness of an obfuscated JS code using the learned fixed-length vectors. Our approach automatically learns feature vectors, in contrast to previous methods that learn a set of manually extracted features. Besides, the learned features are of low dimensionality, hence ensuring faster detection. The organisation of the rest of this paper is as follows. Section 2 highlights the related work. We present our approach in Section 3. Section 4 carries out the performance evaluations through a comparison to the paragraph vector models, the StarSpace model, the long short-term memory (LSTM) model, and the term frequency-inverse document frequency (TF-IDF) model. Finally, we give our discussion and conclusion in Sections 5 and 6, respectively.

Related work
There exist several approaches for the detection of obfuscated malicious JS codes. The most basic and widely used approach is the use of JS code beautifier tools and formatters, such as DirtyMarkup JS code beautifier [10], Online JS code beautifier [11], and Dan's Tools JS code viewer, beautifier and formatter [12], among others. These tools make an obfuscated JS code look pretty, readable, editable, and analysable. The tools are intended to reverse the effects of obfuscation for JS code analysis. However, these JS code deobfuscation tools are no match for an obfuscated JS code with multiple layers of obfuscation, as some sections of the code remain obfuscated even after deobfuscation and formatting.
Other studies attempt to perform JS code deobfuscation for the discovery of the original and malicious JS codes by converting the obfuscated JS codes to regular ones for analysis [7,16,17]. In other approaches, a set of heuristics is created by analysing specific obfuscator features. However, the heuristics-based approach needs security personnel intervention during code analysis, and it is limited to specific obfuscator tools [15,18]. Other methods use a set of features such as JS code functions known to be prevalent in obfuscated malicious JS codes, such as eval(), unescape(), and document.write(). These syntactic-based features are prone to a high false positive rate [1][2][3]. Recently, approaches that use machine learning and deep learning to detect obfuscated JS codes are also increasing [15,17,18,19,20].
Skolka et al. [15] analyse the extent of application of minification and obfuscation in the web with an emphasis on understanding the most popular tools used for code transformation. The study points towards the development of targeted defence techniques that focus on particular obfuscation tools. They use a tree-based convolutional neural network to classify ASTs for three classification tasks, namely, TRANS, which determines whether a given piece of JS code has been minified or obfuscated using a tool; OBFUS, which checks for the presence of obfuscation; and TOOL-X, which determines the specific tool used to obfuscate a JS code. They, however, do not differentiate between the obfuscation methods applied in benign and in malicious JS codes, and the study of specific obfuscation tools could lead to biased features that cannot easily generalise to other datasets. Some deobfuscation techniques have been developed to conduct reverse engineering of an obfuscated JS code. For example, Yadegari et al. [16] developed a generic deobfuscation method to extract a simplified code. For this, the paper combined dynamic analysis with concolic (symbolic) executions in order to collect JS code traces. These traces are simplified in order to reconstruct a control-flow graph (CFG) and data-flow of the target program using taint propagation. Obfuscation adds unnecessary transformation instructions and constructs. These details are discarded, and only the parts essential for functionality are maintained. Some automatic attacks are often specific to particular implementations of obfuscation transformations or individual attacker goals, such as CFG simplification. Some JS code analysis techniques require execution traces that cover all the code of the program being analysed. This process ensures that the CFG is complete with no missing statements, basic blocks, or arcs [21].
However, the proposed system [16] only emphasises the simplification of JS code execution traces without performing code coverage. This paper does not make any assumption about the obfuscation technique. Moreover, it is a dynamic unpacker that needs to unpack the code in memory at runtime to obtain the unpacked code.
Lu and Debray [7] proposed a semantics-based approach for automatic deobfuscation of a JS code using dynamic analysis. They collect bytecode execution traces from the target program and use dynamic slicing and semantics-preserving code transformations to simplify the trace automatically, then reconstruct deobfuscated JS code from the simplified trace. The resulting code becomes observationally equivalent to the original program for the execution path considered. Obfuscations are simplified away. This method exposes the core logic of the computation a JS code performs and achieves excellent detection performance. However, the deobfuscation and simplification methods do not identify a JS code's maliciousness, as they are designed for the detection of code-unfolding and string-based obfuscations. Attackers use JS code functions, such as eval(), unescape(), and escape(), to obfuscate malicious JS codes. These functions can be easily detected using signature-based tools. Hence, the functions are mostly hidden. Gorji and Abadi [17] use internal function debugging and map the behaviour of a webpage to a sequence of internal function calls, generating a behavioural signature for each JS code malware family. To identify these hidden functions, they intercept the suspicious function invocations, which can be triggered at runtime using a browser. They use reverse engineering to identify internal function calls. However, obfuscation tools can result in a JS code with behavioural characteristics that are different from those of the original JS code [15]. Also, they select a limited set of 15 internal function calls of jscript.dll and mshtml.dll from the jscript and mshtml libraries, which they assume are fully representative of each JS code malware family.
Multiple classifiers have been proposed to detect obfuscation and evaluate the feasibility and accuracy of distinguishing between different classes of JS code. Tellenbach et al. [18] collect a dataset of obfuscated and non-obfuscated JS codes and select and extract a set of 45 features from the dataset. The features used include the frequency of specific keywords, number of lines, characters per line, number of functions, and entropy, among others. This paper assumes that there are few JS codes that are both benign and obfuscated, and hence, they mostly focus on classifying obfuscated and non-obfuscated JS codes. They also fail to detect JS codes obfuscated with tools not present in the training set, and the detection precision depends on the manually specified limited set of features.
For detection of obfuscated malicious JS code instances, Fass et al. [19] leverage the use of syntactic units using n-grams features from an AST traversal and a random forest classifier. They use specific constructs for accurate detection distinguishing obfuscation from maliciousness. However, they grouped units with the same abstract syntactic meaning, such as FunctionDeclaration and VariableDeclaration as Declaration node, and ForStatement and WhileStatement as statement node, therefore losing context information. They also use specific features, namely extracted syntactic units, which translates to limited learning components. In contrast, we use all AST information so that the FastText model can learn the most detailed features.
Algorithms such as the J48 decision tree, Naïve Bayes, Logistic Regression, and linear SVM have been proposed [20] to develop a machine-learning-based approach to detect obfuscated malicious JS code. The approach dynamically executes the trace of a JS code and extracts unordered and non-consecutive sequence patterns. A semantics-based deobfuscation technique [7] is adopted to simplify the trace using deobfuscation slicing. This approach assumes that the deobfuscated trace will be useful for feature extraction that represents the internal characteristics of a malicious JS code. However, obfuscation tools have been shown to result in codes that are structurally different from the original ones.
These existing approaches adopt either a static or dynamic approach to deobfuscation and detection of obfuscated malicious JS codes. The main issue in static analysis is coping with obfuscation, specifically packing and encoding obfuscation. Static techniques have also been employed to assess if a malicious JS code is similar to a previously seen variant, without actually performing the costly task of unpacking and decoding. However, with attacks such as zero-day JS code malware using obfuscation, a static analysis technique generates many false positives. Dynamic analysis techniques have been shown to suffer from limited code coverage and the complexity of the JS code runtime environment [22]. Limited code coverage results from cloaking, a technique employed by attackers to fingerprint the victim's web browser. The malicious content can only be visible if certain conditions are met, for example, the use of a specific browser version and a vulnerable plugin. The complexity of the JS code runtime environment arises because JS code runs in a variety of applications, each utilising the functionalities of the plugin extensions it supports. A dynamic analysis set-up will mostly include well-known browsers and plugins, ignoring less popular plugins and their vulnerabilities.
Most of the JS code dynamic analysis techniques use unfolded code context logs and assume that the original code can be retrieved from the deobfuscated one. The JS code event callback feature, such as a mouse click event, also makes it challenging for dynamic analysis to trigger a JS code automatically. For example, the deobfuscator by Lu and Debray [7] is a fully dynamic solution. However, the obfuscated part of a JS code requires localisation as preprocessing to unfold loops during the abstraction stage.
Some of these JS code deobfuscation and detection approaches assume that the transformation applied to a JS code preserves the overall semantics of the code. Benign JS codes may apply simple obfuscation techniques such as string splitting. However, in the case of malicious JS codes, attackers apply sophisticated obfuscation techniques such as packing and encoding in an attempt to perform evasion attacks. In some instances, a single code contains different obfuscation techniques, for example, double packing or packing coupled with encoding. These multi-layered obfuscation techniques dramatically affect the structure of the resultant JS code. In such a case, the resultant AST-JS would not accurately capture the structure of the original JS code. Therefore, to improve the detection performance of machine learning, we propose a combination of the deobfuscation, unpacking, and decoding (DUD-preprocessing) approach to enhance feature learning.

Proposed method
Malicious JS codes attempt to trick users using items such as text, function names, function types, variables, and other contents. Obfuscation is employed to evade detection by security devices. This property can also hinder the feature learning process of a machine learning model as such JS codes contain numbers and special characters in addition to JS code words. It is essential to capture JS code details and structure by code structure representation to detect such malicious JS codes effectively. Ndichu et al. [14] developed the AST-JS approach for code structure representation and accurately detected some obfuscated malicious JS codes which contain the same semantics and syntax as the original JS codes. However, malicious JS codes employ sophisticated obfuscation methods such as packing, encoding, and multi-layered obfuscation methods. The resulting obfuscated malicious JS codes are characterised by unnatural and unreadable syntax and corrupted code structure that is different from that of the original JS code. To effectively conduct feature learning on such obfuscated JS codes, it is essential to deobfuscate, unpack, and decode the obfuscated JS codes. We propose the use of a JS code beautifier, formatter, VM, JS code editor, and a python int_to_str() function to obtain a Plain-JS code from an obfuscated one and the FastText model for feature learning coupled with a classifier for the prediction task.
In the following sections, we explain the preprocessing for transforming an obfuscated JS code into a Plain-JS code and the FastText model that we use for feature learning.

Preprocessing
Deobfuscation: Deobfuscation is the process of analysing and formatting an obfuscated JS code to make it readable again. It is also referred to as simplification or beautification. Deobfuscation uncovers the actual functionality of a JS code. We use a JS code beautifier to format an obfuscated JS code resulting in a deobfuscated JS code. However, if upon analysis of the resultant deobfuscated JS code, we still find the presence of obfuscation in the dataset, there is still a need to process such an obfuscated JS code further. We identify such JS codes as hard-to-deobfuscate JS codes as sections of the code remain obfuscated even after deobfuscation attempts.
Unpacking: A JS code is packed or encrypted to hide its maliciousness, thereby hindering its interpretation by static analysis. This concept is also used in benign JS codes for code compression and privacy purposes. JS code packers are diverse and more common with attackers attempting to evade detection. These tools wrap the entire code using the eval() function that takes a string and attempts to run it as a JS code. The code is then loaded at runtime via the function.
We evaluate the packed JS codes using an Oracle VM and a JS code editor. We strip the script tags such that each JS code takes the form JS_code = "eval(function(p,a,c,k,e,d)...obfuscated_JS_code…)", replace eval() with console.log(), and parse the packed JS code to extract the function arguments using the expression unpacked_JS_code = eval('unpack' + JS_code[JS_code.find('}(') + 1:-1]). It is important to disable the drag-and-drop and shared-clipboard options in the VM VirtualBox manager and to use an isolated Wi-Fi network to avoid infecting the host personal computer and the local area network with malware. This process results in an unpacked JS code. However, upon analysis of the resultant unpacked JS codes, we still find the presence of obfuscation in the dataset. The JS codes that remain obfuscated contain encoding implemented using basex() functions.
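The function redefinition and argument extraction steps above can be sketched in Python. The packed sample below is a hypothetical skeleton (payload and keyword list elided), not a real malware sample; actual samples are only handled inside the isolated VM:

```python
# Hypothetical eval-packed JS code skeleton, for illustration only
packed = ("eval(function(p,a,c,k,e,d){/* unpacker body */}"
          "('payload',62,1,'words'.split('|'),0,{}))")

# Function redefinition: print the unpacked code instead of executing it
neutralised = packed.replace("eval(", "console.log(", 1)

# Extract the arguments passed to the unpacker, mirroring the expression
# JS_code[JS_code.find('}(') + 1:-1] used in the text
args = packed[packed.find("}(") + 1:-1]
# `args` now holds the parenthesised argument list, ready to be passed to the
# redefined unpack function inside the VM's JS code editor
```

The slice starts just after the closing brace of the unpacker function body and drops the final parenthesis that closed the eval() call, leaving only the argument list of the packer.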
Decoding: JS code encoding is the process of converting all string literals of the stringArray, for example, by a JS code obfuscator tool, using either Base64 or RC4 [8,9]. JS code packers also encode strings in a JS code to integers. The tools encode a JS code using a basex() function. The resultant JS code is a shorter version that represents the long original one. It is a JS code compression method to reduce download time and to represent a uniform resource locator (URL) anonymously. Encoding methods such as ASCII, Unicode, and HEX affect the semantics of a JS code.
Python allows conversion of a string to an integer of a given base using the class constructor int(x=0) / int(x, base=10), which returns an integer object constructed from a number or string x, or returns 0 if no arguments are given [23]. We evaluate encoded JS codes by performing the inverse of this function. We implement an int_to_str(num, base) function in Python such that, for a number num and a given base, int(int_to_str(num, base), base) == num; the function returns a string object constructed from the number num, or returns 0 if no arguments are given. Finally, we parse the encoded JS codes to extract the function arguments using a JS code editor in a VM.
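A minimal sketch of such an int_to_str() function follows. This is our illustrative implementation of the inverse described above; the digit alphabet 0-9 then a-z matches what int(s, base) accepts:

```python
import string

DIGITS = string.digits + string.ascii_lowercase  # 0-9 then a-z, as int(s, base) expects

def int_to_str(num, base):
    """Inverse of int(s, base): render integer num as a string in the given base."""
    if num == 0:
        return "0"
    sign, num = ("-", -num) if num < 0 else ("", num)
    out = []
    while num:
        num, rem = divmod(num, base)
        out.append(DIGITS[rem])
    return sign + "".join(reversed(out))

# Round-trip property required by the text:
assert int(int_to_str(255, 16), 16) == 255   # int_to_str(255, 16) == "ff"
assert int(int_to_str(1000, 36), 36) == 1000
```

Applying this function to the integer literals emitted by a packer recovers the original string tokens, which can then be substituted back into the JS code before parsing.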
This process results in a Plain-JS code. The Plain-JS code is parsed to AST-JS for feature learning using the FastText model. Listing 3 (see Fig. 3) shows a JS code with encoding obfuscation. The string 'New user' in the hello("New user") function call from Listing 4 (see Fig. 4) has been replaced with a call to a function that retrieves its value at runtime, such as var _0x74f5. Fig. 5 shows the overview of the steps to deobfuscate, unpack, and decode an obfuscated JS code. It consists of three steps, namely, deobfuscation or formatting, unpacking, and decoding. During deobfuscation, a JS code beautifier and formatter are used to deobfuscate a JS code to make it pretty, readable, and easier to edit and analyse. Then, the resulting formatted JS code is analysed for obfuscation. If there is still evidence of obfuscation, a VM and a JS code editor are used to unpack the JS code by function redefinition. The unpacked JS code is further examined for the presence of encoding, and an int_to_str() decoding function is applied to realise a Plain-JS code.

Feature learning by the FastText model
There exist several models that learn continuous representations for words, sentences, and paragraphs in natural language processing (NLP) tasks. Models such as bag-of-words [24,25], Word2Vec [26,27], and Doc2Vec [28,29] result in a unique vector for each word, thereby ignoring the internal structure of words. Besides, the produced vectors do not share parameters.
The FastText model [30,31] is an approach based on the continuous skip-gram training algorithm [26]. The model represents each word as a bag of character n-grams. The resulting vector representation for a word is the sum of its character n-gram vectors, taking into account subword information. The model is fast, therefore suitable for feature learning on big datasets, and can learn vector representations for out-of-vocabulary words.
The Word2Vec [26] continuous skip-gram training algorithm learns a vector representation for each word given a word vocabulary of size $X$, with each word identified by its index $x \in \{1, \ldots, X\}$. The model predicts the words in a context given a word. For a large training corpus, given the centre word $x_t$ represented by $X_t$, the model predicts the surrounding context words $X_c = \{x_{t-2}, x_{t-1}, x_{t+1}, x_{t+2}\}$. The objective of the Word2Vec continuous skip-gram training algorithm is to maximise the log-likelihood

$$\sum_{t=1}^{T} \sum_{c \in C_t} \log p(X_c \mid X_t),$$

where $C_t$ is the set of indices of the context words $X_c$. Given a scoring function $f$ which maps pairs $(X_t, X_c)$ to scores in $\mathbb{R}$, the softmax can be used to define the probability of a context word:

$$p(X_c \mid X_t) = \frac{e^{f(X_t, X_c)}}{\sum_{j=1}^{X} e^{f(X_t, X_j)}}.$$

This type of model only predicts one context word $X_c$ given a word $X_t$. The task of predicting context words can instead be viewed as a set of independent binary classifications with the objective of predicting the presence or absence of $X_c$. For a word $X_t$, all $X_c$ are the positive examples; negative examples are sampled randomly from the dictionary. The binary logistic loss is used to obtain the negative log-likelihood for $X_c$:

$$\log\left(1 + e^{-f(X_t, X_c)}\right) + \sum_{n \in N_{t,c}} \log\left(1 + e^{f(X_t, X_n)}\right),$$

where $N_{t,c}$ is a set of negative examples sampled from $X$. Denoting the logistic loss function $\ell: x \mapsto \log(1 + e^{-x})$, the objective function can be given by

$$\sum_{t=1}^{T} \left[\, \sum_{c \in C_t} \ell\!\left(f(X_t, X_c)\right) + \sum_{n \in N_{t,c}} \ell\!\left(-f(X_t, X_n)\right) \right].$$

The FastText model introduces a different scoring function $f$ in which a bag of character n-gram vectors represents each word. This scoring function takes into account the internal structure of a word. Prefixes and suffixes are distinguished using the special boundary symbols < and >, added at the start and end of a word. For a word such as encode with n = 3, its character n-grams will be <en, enc, nco, cod, ode, de> and <encode>, a special sequence that is used to learn each word's representation. Given a dictionary of $G$ n-gram vectors and a word $x$, $G_x \subset \{1, \ldots, G\}$ gives the set of n-grams appearing in $x$.
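The n-gram decomposition above can be reproduced with a short sketch (char_ngrams is our illustrative helper, not part of the FastText implementation):

```python
def char_ngrams(word, n=3):
    """Character n-grams with < and > boundary symbols, plus the whole-word
    special sequence, as in the FastText subword model."""
    w = "<" + word + ">"
    grams = [w[i:i + n] for i in range(len(w) - n + 1)]
    return grams + [w]

# Matches the decomposition of 'encode' given in the text:
assert char_ngrams("encode") == ["<en", "enc", "nco", "cod", "ode", "de>", "<encode>"]
```

Because the word vector is the sum of these n-gram vectors, morphologically related identifiers such as encode, encodeuri, and htmlencode share many n-grams and therefore end up close in the embedding space.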
The scoring function $f$ is given by the sum of the vector representations of a word's n-grams:

$$f(X_t, X_c) = \sum_{g \in G_{x_t}} Z_g^{\top} V_{X_c},$$

where $Z_g$ is the vector representation of each n-gram $g$ and $V_{X_c}$ is the context word vector. The n-gram vector representations are shared across words. This ensures reliable vector representations are learnt for rare words [32]. A computation of the most similar words to encode using the trained FastText model, model_JS.wv.most_similar("encode"), results in ('encodedurl', 0.91), ('encodeuri', 0.89), ('htmlencode', 0.89), ('enc_str', 0.87), and ('decodeuri', 0.83). Such functions are used in malicious JS codes to obfuscate the contents of a JS code. The words identified by the model are correct because each encoding function is accompanied by a decoding function: the encoding function encodes a string or a URL in a JS code, and the decoding function decodes it at runtime. The FastText model implementation [33] creates n-gram vectors for words. We obtain a JS code feature vector by superimposing the n-gram word vectors. Given $M$ as the total number of words in a JS code, the vector representation of a JS code is given by

$$V_{\mathrm{JS\,code}} = \frac{1}{M} \sum_{n=1}^{M} X_n,$$

where $V_{\mathrm{JS\,code}}$ is the vector representation of a JS code and $X_n$ is the n-gram vector representation of word $n$. That is, a JS code vector is given by the sum of the word n-gram vectors divided by the total number of words in the JS code. An implementation [33] of the FastText model is used to learn the feature vectors for a JS code, producing fixed-length vectors. Labels obtained from a JS code dataset are used to train the network for the prediction tasks. The vectors are then used to classify a JS code as either malicious or benign. The FastText model uses several parameters for the learning of feature vectors, such as sg, which selects the skip-gram training algorithm, word_ngrams, which enriches word vectors with subword (n-gram) information, and min_n and max_n, which control the minimum and maximum length of n-grams, respectively.
Other parameters include min_count, which ignores all words with a total frequency lower than a preset one; size, which represents the word vector dimensionality; iter, which represents the number of epochs over the corpus; and window, which controls the maximum distance between the current word and the predicted one. Fig. 6 shows the overview of DUD-preprocessing of a JS code. The figure also presents the feature learning process by the FastText model and the classification of the AST-JS features by a classifier model. It is a two-stage process. During the first stage, an obfuscated JS code is formatted, unpacked, and decoded, and a JS code parser is used to perform syntactic analysis of the resultant Plain-JS code. Second, the AST-JS obtained from the JS code parser is passed to the FastText model for feature learning, and a classifier is used to judge the maliciousness of a JS code using the learned AST-JS feature vectors. Using a training set of the learned feature vectors, we perform cross-validation and classifier learning to find the optimal hyperparameters. We then compare the performance of the FastText model to the paragraph vector models, the StarSpace model, the LSTM model, and the TF-IDF model on obfuscated malicious JS code detection on the same dataset.
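The JS-code vector computation can be sketched in pure Python. The toy 3-dimensional word vectors below are for illustration only; in practice these would be the 200-dimensional subword-enriched word vectors produced by the trained FastText model:

```python
def js_code_vector(word_vectors):
    """Fixed-length JS code vector: the mean of the M word vectors of a JS code,
    each word vector itself being the sum of its character n-gram vectors."""
    M = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / M for i in range(dim)]

# Toy example: a JS code of three words, each with a 3-dimensional vector
vecs = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0]]
assert js_code_vector(vecs) == [2.0, 2.0, 2.0]
```

Averaging rather than summing keeps the resulting feature vectors comparable across JS codes of very different lengths, which is what the downstream classifier requires.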

Experimental setup
In the experimental setup, we use a dataset of 40,000 malicious JS codes and 150,000 benign JS codes. We also add 50,000 benign JS codes by crawling the Internet. The malicious JS codes were obtained separately from the Hynek Petrak's dataset [34] and the benign JS codes are from the SRILAB [35] and the Majestic Million service top 10,000 websites [36] datasets. These datasets are used for both training and test purposes. As shown in Table 1, some malicious and benign JS codes in the dataset could not be successfully parsed because of Syntax Error, Unicode Decode Error, Attribute Error, Name Error, Index Error, Not Implemented Error, Key Error, and Recursion Error.
We perform two types of feature embedding: the FastText model for regular obfuscated JS codes and the FastText model for DUD-preprocessed JS codes. We conduct feature learning using the datasets for the FastText model and compare its performance to the distributed bag of words version of the paragraph vector model (the PV-DBoW model), the distributed memory model of the paragraph vector model (the PV-DM model), the StarSpace model, the LSTM model, and the TF-IDF model. First, we conduct feature learning using regular obfuscated JS codes with input defined as AST-JS, that is, by learning feature vectors from both obfuscated benign and malicious JS codes represented as AST-JS. Second, we conduct feature learning using DUD-preprocessed JS codes with input defined as AST-JS, that is, by performing feature learning on the code structure representation of DUD-preprocessed JS codes. We also perform AST-level merging and AST sub-tree realignment [14] on the AST-JS to enhance feature learning. This AST-level manipulation facilitates simulation of a real web setting by ensuring that benign AST-JS features outnumber malicious ones. The FastText model parameters such as sg, size, window, min_count, word_ngrams, min_n, and max_n significantly affect the performance of the FastText model. We conducted experiments with dimensions in the range 100-900 for size, 1-8 for window, 1-8 for min_count, and 3-6 for min_n and max_n. We learned feature vectors of 200 dimensions for the word vectors, with sg = 1 to select the skip-gram training algorithm, window = 8 for the number of context words, min_count = 5 for the word frequency, word_ngrams = 1 to activate the use of subword information, and min_n = 3 and max_n = 5 for the n-gram minimum and maximum length, respectively. These are the parameters that we found optimal for the task of feature learning on DUD-preprocessed JS codes for the FastText model.
For performance evaluation, we used cross-validation with k = 10 to obtain ten folds of feature vectors. The folds translate to ten iterations, divided into nine folds for classifier training and one fold for model evaluation. For feature vector classification, we used SVM with kernel = linear and C = 1 parameters. SVM is a supervised learning method used for regression and classification. The algorithm is mostly used as a binary classifier, and it constructs a hyperplane in high-dimensional space [37,38]. For this, we computed precision, recall, and F1-score. Precision is the ability of a classifier not to identify a benign JS code as malicious, and recall is the classifier's ability to identify all malicious JS codes. F1-score can be interpreted as a weighted harmonic mean of precision and recall [38]. Precision, recall, and F1-score are computed as

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where TP is the number of malicious JS codes correctly classified as malicious, TN is the number of benign JS codes correctly classified as benign, FN is the number of malicious JS codes classified as benign, and FP is the number of benign JS codes classified as malicious.
For performance visualisation, we used a receiver operating characteristic (ROC) graph, plotting the true positive rate against the false positive rate. We calculated the area under the curve (AUC) score [38] for the ROC curve to determine model performance. We compared the difference in performance between our approach and other feature learning approaches: the PV-DBoW model (vector_size = 200, window = 8, min_count = 5), the PV-DM model (dm = 1, vector_size = 200, window = 8, min_count = 5), the StarSpace model, the LSTM model, and the TF-IDF model.
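The AUC score summarises the ROC curve as a single number: the probability that a randomly chosen malicious sample receives a higher classifier score than a randomly chosen benign one. A small stdlib-only sketch of this rank-based (Mann-Whitney) formulation, with illustrative scores:

```python
def auc_score(labels, scores):
    """AUC as the probability that a random positive (malicious, label 1)
    sample is scored above a random negative (benign, label 0) one.
    Ties count as half a win (Mann-Whitney U formulation)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfectly separated scores give an AUC of 1.0.
perfect = auc_score([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
```

An AUC of 0.5 corresponds to the random-guess diagonal of the ROC graph, which is the performance threshold referenced in the figures.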

Performance comparisons
Precision, recall, and F1-score: The results of the experiments are presented in this section. Table 2 presents the performance of the FastText model, the PV-DBoW model, the PV-DM model, the StarSpace model, the LSTM model, and the TF-IDF model when performing feature embedding for regular obfuscated JS codes with input defined as an AST-JS, similar to the work by Ndichu et al. [14]. As shown in the table, the models can detect regular obfuscated malicious JS codes with error rates in recall of ∼0.0489, 0.0717, 0.0649, 0.0892, 0.1205, and 0.1657 for the FastText model, the PV-DBoW model, the PV-DM model, the StarSpace model, the LSTM model, and the TF-IDF model, respectively. An AST representation can see past surface-level obfuscation. Therefore, this approach can detect regular obfuscated JS codes that have syntactic and semantic characteristics similar to the original ones at an abstract level. However, malicious JS codes employ methods such as packing, encoding, and multi-layer obfuscation, which result in complex codes that hinder the detection of JS-based attacks by signature, heuristic, and machine learning approaches. The performance of the paragraph vector models in Table 2 is far from perfect, as it would result in many misclassified obfuscated malicious JS codes, which translates to multiple successful attacks. Table 3 presents precision, recall, and F1-score for the FastText model, the paragraph vector models, the StarSpace model, the LSTM model, and the TF-IDF model trained using a deobfuscated JS code dataset with input defined as an AST-JS, and includes the standard deviation. The models achieved an above-average performance, with precision, recall, and F1-score all above 50%. The objective of this study is to improve the performance of machine learning models on the detection of obfuscated malicious JS codes. A model with a high recall rate is desired to achieve this objective.
Such a model would result in few misclassified obfuscated malicious JS codes, that is, a low false-negative rate. JS code deobfuscation results in a formatted code but can do away with only the simple forms of obfuscation. This property explains why there is only a slight improvement in recall between the models trained using regular obfuscated JS codes and those trained using a deobfuscated JS code dataset: ∼0.0115 for the FastText model, 0.0302 for the PV-DBoW model, 0.0176 for the PV-DM model, 0.0016 for the StarSpace model, 0.0062 for the LSTM model, and 0.0205 for the TF-IDF model. The input for the models in both cases is defined as an AST-JS. The best performing model when using a deobfuscated JS code dataset, the FastText model, has an error rate of ∼0.0374 in recall. Even though the FastText model performs best here, many obfuscated malicious JS codes would still go undetected, which translates to multiple successful attacks. The failure in detection arises because sections of the obfuscated malicious JS codes that employ sophisticated obfuscation techniques cannot be easily resolved by deobfuscation. Therefore, further processing of the deobfuscated malicious JS codes is needed to realise improved performance. Table 4 presents precision, recall, and F1-score for the FastText model, the PV-DBoW model, the PV-DM model, the StarSpace model, the LSTM model, and the TF-IDF model obtained with AST-JS features by feature learning using an unpacked JS code dataset. The FastText model achieves an improvement of over 2.78 and 2.07% in precision, 2.09 and 0.94% in recall, and 2.54 and 1.82% in F1-score, when compared to the same model trained using a regular obfuscated JS code dataset and a deobfuscated JS code dataset, respectively. The resultant code structure explains the improvement in performance for the FastText model after deobfuscation and unpacking.
Deobfuscation results in a well-formatted JS code, which facilitates the feature learning process of the model. When unpacking is applied to the deobfuscated JS code dataset, the complex code structure that remains after deobfuscation is simplified away. Therefore, the model can learn better features from a deobfuscated and unpacked JS code dataset than from a regular obfuscated or a deobfuscated JS code dataset. Table 5 presents precision, recall, and F1-score for the FastText model, the PV-DBoW model, the PV-DM model, the StarSpace model, the LSTM model, and the TF-IDF model obtained with AST-JS features by feature learning using a DUD-preprocessed JS code dataset. The FastText model, the best performing model, achieves an improvement of over 5.21, 4.5, and 2.43% in precision, 4.2, 3.05, and 2.11% in recall, and 4.14, 3.42, and 1.6% in F1-score when compared to the same model trained using a regular obfuscated JS code dataset, a deobfuscated JS code dataset, and an unpacked JS code dataset, respectively. The performance of the PV-DBoW model, the PV-DM model, the StarSpace model, the LSTM model, and the TF-IDF model on this task is also improved. However, the FastText model achieves the highest recall rate for the task of obfuscated malicious JS code detection. This is because deobfuscation, unpacking, and decoding result in a Plain-JS code with JS code transformations simplified away. Therefore, the resultant JS code provides better AST-JS features during feature learning.
True positive and false positive rate: The improved performance of the FastText model on the task of detection of obfuscated JS codes is also shown in Fig. 7a, which presents the ROC graph and AUC results for the model. As shown in Fig. 7a, the model achieved a higher true positive rate and a lower false positive rate compared to the PV-DBoW model, presented in Fig. 7b, the PV-DM model, presented in Fig. 7c, the StarSpace model, presented in Fig. 7d, the LSTM model, presented in Fig. 7e, and the TF-IDF model, presented in Fig. 7f. The PV-DBoW model and the PV-DM model achieved the same AUC score. However, the PV-DM model performed worse than the PV-DBoW model during cross-validation, as shown by sections of its ROC curve. The StarSpace model, the LSTM model, and the TF-IDF model achieved the lowest AUC scores, with sections of the curves in Figs. 7d-f going below the performance threshold. The coloured curves represent the performance of the model for each fold during cross-validation. The paragraph vector models, the PV-DBoW and PV-DM models, ignore the internal structure of words, whereas the FastText model takes into account a word's subword information. The n-gram word vectors learned by the FastText model during feature learning are shared across words. This property enables the FastText model to learn better and more reliable vector representations for the DUD-preprocessed JS code dataset. Using a trained FastText model, words that do not exist in the model's vocabulary can be inferred from the learned character n-gram vectors. The model also yields low-dimensional representations for the learned n-gram word vectors.
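The out-of-vocabulary inference mentioned above works by composing a word vector from the character n-gram vectors learned during training. A hedged sketch of the idea (the n-gram table below is a toy placeholder, not a trained model): the unseen token `escape` borrows the n-gram vectors it shares with the trained token `unescape`.

```python
import random

def ngrams(word, min_n=3, max_n=5):
    """FastText-style character n-grams with '<'/'>' boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]

def oov_vector(word, ngram_vectors, dim=4):
    """Average the vectors of the word's known character n-grams;
    n-grams never seen during training contribute nothing."""
    known = [ngram_vectors[g] for g in ngrams(word) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]

# Toy n-gram table for "unescape" (random values stand in for trained vectors).
random.seed(0)
table = {g: [random.random() for _ in range(4)] for g in ngrams("unescape")}

# "escape" was never trained, but shares n-grams such as "esc", "cap", "ape".
vec = oov_vector("escape", table)
```

This is why, in this study, new and previously unseen tokens in future JS codes can still receive usable vector representations.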

Discussion
One of the objectives of obfuscation is to hinder reverse engineering of a JS code. Therefore, performing DUD-preprocessing on an obfuscated JS code may not recover the same original JS code as existed before obfuscation was applied. However, as part of our findings in this research, an obfuscated JS code that has been DUD-preprocessed results in a regular Plain-JS code. The detection performance of the machine learning models is improved because the resultant JS code has the same semantic and syntactic properties as the original one. After performing deobfuscation, which is the first step in our obfuscated JS code preprocessing, we have to manually inspect the resultant JS codes for the presence of obfuscation. Some approaches have been proposed for the automatic detection of obfuscated JS codes. Some of these approaches focus on specific JS code obfuscator tools that are unique to specific obfuscation techniques. Therefore, a dataset that contains JS codes obfuscated by a JS code obfuscator tool other than the one targeted by the approach would result in misclassification of obfuscated JS codes. The existing methods for obfuscated JS code detection cannot be easily applied to other JS code obfuscator tools. Other studies use features such as the structure of the obfuscated JS code and functions such as eval() and unescape() occurring within a given number of bytes of each other. These are heuristic-based approaches that are limited in capability when faced with sophisticated JS code obfuscation techniques. In the case of a packed JS code, single-step unpacking is not enough, as the majority of the obfuscated malicious JS codes contain multiple layers of obfuscation.
To address this issue, as future work, (i) there is a need to develop a technique to check for obfuscation in the dataset automatically. This technique should remain independent of specific JS code obfuscator tools or obfuscation techniques. (ii) Recent studies focus on embedding source code directly. Both regular and obfuscated JS codes follow code structure rules for sentences, functions, variables, operation procedures, and stop conditions. As such, a word embedding model may lose much of the information in a JS code structure. Embedding of JS source code would be language-specific and share the same concepts as the NLP embedding models, with a few tweaks. (iii) While the existing word embedding models, such as the FastText and paragraph vector models, yield high-quality vector representations on program and source code tasks such as obfuscated malicious JS code detection, they are not well tailored for obfuscated JS code parts. As such, parsing errors are common when trying to realise a Plain-JS code for static analysis. It is envisioned that directly embedding JS source code would facilitate the capturing of JS code structure and avoid such errors.

Conclusion
In this paper, we propose DUD-preprocessing of obfuscated malicious JS codes and feature learning using the FastText model. NLP models have previously achieved good results, yielding high-quality vector representations for words, sentences, and paragraphs. Consequently, the models have been used for malicious JS code detection using features such as Plain-JS and AST-JS. However, the models struggle to learn useful representations when faced with obfuscated JS codes, as the JS codes or sections of them do not maintain an appropriate code structure. One expectation is that DUD-preprocessing of the obfuscated JS codes would result in improved performance of machine learning models on the task of obfuscated malicious JS code detection and enable detection of JS codes transformed using sophisticated obfuscation techniques. We have conducted experiments using feature embedding for regular obfuscated JS codes and DUD-preprocessed JS codes. We compared the performance of the FastText model to the paragraph vector models (the PV-DBoW model and the PV-DM model), the StarSpace model, the LSTM model, and the TF-IDF model on the detection of obfuscated malicious JS codes. We analyse the resultant JS code after DUD-preprocessing to check for the presence of obfuscation. This further processing is essential to simplify away sophisticated JS code obfuscation techniques such as multi-layer obfuscation. We use AST-JS for code structure representation of the resultant Plain-JS code and realise an improvement in performance for the models used for feature learning. AST-JS makes the structural relationships of elements within a JS code distinct while preserving all the relevant information of a JS code. Our proposed approach provides improved detection performance on the task of obfuscated malicious JS code detection compared to feature learning on regular obfuscated JS codes.
In addition, there is considerable improvement in performance compared to our previous work [14]. The FastText model is fast and suitable for big datasets and, therefore, well suited for JS code feature learning. The model can also learn vector representations for JS code words that are not contained in the vocabulary at the time of model training and would, therefore, produce reliable vector representations for future, previously unseen JS codes. As future work, there is a need for a study on obfuscated URLs and their properties. URLs in JS codes are heavily obfuscated. The obfuscation is even more complicated for URLs in obfuscated malicious JS codes, as attackers endeavour to hide their attack methods, such as URL redirection to attackers' untrusted websites. Such websites employ insecure protocols that would otherwise be easily flagged by implemented security controls such as firewalls. URL obfuscation is stealthily and deliberately applied to evade such detection. A malicious URL detection system must be developed to realise a comprehensive security system against web-based attacks.