Network intrusion detection system by learning jointly from tabular and text‐based features

Network intrusion detection systems (NIDS) play a critical role in maintaining the security and integrity of computer networks. These systems are designed to detect and respond to anomalous activities that may indicate malicious intent or unauthorized access. The need for robust NIDS solutions has never been more pressing in today's digital landscape, characterized by constantly evolving cyber threats. Deploying effective NIDS can be challenging, particularly in accurately identifying network anomalies amid increasingly sophisticated and difficult-to-detect attacks. The motivation for our research stems from the recognition that, while NIDS studies have made significant strides, more effective and accurate methods for detecting network anomalies are still needed. Commonly used features in NIDS studies include network logs, with some studies exploring text-based features such as payload. However, traditional machine and deep learning models struggle to learn jointly from tabular and text-based features. Here, we present a new approach that integrates both tabular and text-based features to improve the performance of NIDS. Our internal experiments revealed that the deep learning approach utilizing tabular features produces favourable results, whereas the pre-trained transformer approach on its own fails to perform sufficiently well. Our proposed approach, which integrates both feature types using deep learning and pre-trained transformer approaches, achieves superior performance. These findings indicate that integrating both feature types can significantly improve the accuracy of network anomaly detection.
Moreover, our proposed approach outperforms the state-of-the-art methods in terms of accuracy, F1-score, and recall on the commonly used NIDS datasets ISCX-IDS2012, UNSW-NB15, and CIC-IDS2017, with F1-scores of 99.80%, 92.37%, and 99.69%, respectively, indicating its effectiveness in detecting network anomalies.


| INTRODUCTION
In recent decades, cybersecurity-related incidents have increasingly captured society's attention. The development of security systems is primarily motivated by the disastrous effects of cybercrimes. Since cyber-infrastructure is engaged in nearly every aspect of the current digital society, including industry, the military, business, and medicine, it is crucial to secure the confidentiality and integrity of information (Djenna et al., 2021; Perwej et al., 2021).
The rapid expansion of cyberspace has ushered in a proliferation of cyber threats, raising significant concerns for society's digital security. In response, effective security systems have become imperative to combat these malicious activities. Among these systems, the intrusion detection system (IDS) serves as a guardian by monitoring system and network operations for potential breaches and activating protective measures when needed. The primary goal of an IDS is to prevent cyberattacks or, when necessary, to mitigate ongoing intrusions (Sabahi & Movaghar, 2008). IDSs come in two primary types: host-based and network-based. A host-based IDS is installed on individual devices, where it monitors specific device activity, alerting users or administrators to unusual or potentially malicious behaviour. In contrast, a network-based IDS operates at the network level, offering a broader view of potential threats by identifying patterns or anomalies across multiple devices (Vacca, 2009).
The need for robust network intrusion detection systems (NIDS) has never been more critical, given the rising frequency and severity of cyber-attacks. According to a report by Cybersecurity Ventures, the projected annual cost of cybercrime is set to reach an alarming $10.5 trillion by 2025 (mor, 2020). In 2020 alone, data breaches reached an all-time high, with over 4000 reported incidents and the exposure of more than 16 billion records (Neto et al., 2021). Furthermore, ransomware attacks surged dramatically, with a staggering 365% increase reported from Q2 2018 to Q2 2019 (Richardson et al., 2021). Small and medium-sized businesses, in particular, bore the brunt of cyber-attacks, with 60% reporting such incidents, according to the Ponemon Institute-Keeper report (Institute, 2019). Globally, only 38% of organizations felt adequately equipped to manage complex cyber-attacks, while approximately 54% of companies reported experiencing one or more cyber-attacks in the past year, as per a report by Cybit News. These statistics underscore the critical need for reliable and efficient NIDS to detect, prevent, and mitigate these types of attacks, ensuring the confidentiality and integrity of information (uni, 2018).
False negatives are a critical concern in NIDS, as they represent instances where an actual attack goes undetected, potentially leading to significant harm or data breaches (Joo et al., 2003). This underscores the need for effective detection methods. NIDSs are categorized into two primary methods: signature-based and anomaly-based detection. Signature-based methods rely on predefined attack signatures but may struggle to detect new threats or polymorphic attacks that adapt to evade detection. In contrast, anomaly-based NIDSs excel at identifying unusual or unexpected behaviour, making them particularly effective against zero-day attacks and unknown threats (Ahmed et al., 2022; Khraisat et al., 2019; Ozkan-Okay et al., 2021). By analyzing traffic patterns and identifying anomalies, anomaly-based NIDSs can detect previously unknown attacks that would otherwise go undetected by signature-based systems. While anomaly-based NIDSs have an advantage in reducing false negatives, they may produce a higher rate of false positives, flagging non-malicious activity as an attack. Achieving the right balance between false positives and false negatives is crucial to ensure optimal security (Ahmed et al., 2022).
Machine learning (ML) has proven to be a powerful asset in anomaly-based NIDS, enabling the detection of new intrusions and various network behaviours, including those previously unseen or unknown (Ahmed et al., 2022; Tufan et al., 2021). NIDSs analyze network logs, encompassing both tabular features, like source and destination IP addresses, ports, and timestamps, and non-tabular features, like payload data, often in non-human-readable forms such as binary or encoded text. Overcoming the challenges posed by non-human-readable payload data is valuable for identifying and classifying network anomalies in NIDSs. Tokenizer-free approaches can be employed to extract information from payload data in non-human-readable formats, potentially enhancing the overall performance of NIDS by identifying complex and stealthy network anomalies that traditional methods may miss.
In this article, we propose a new approach to anomaly-based NIDSs that leverages the power of ML and NLP to enhance the performance of these systems. Our model architecture utilizes ML algorithms to detect unusual or suspicious activity in network traffic while utilizing NLP techniques to analyze the payload of packets and extract relevant information. By integrating these two approaches, we can more accurately detect a wide range of security threats and uncover valuable insights from large volumes of log data.
Our contributions can be summarized as follows:
1. To the best of our knowledge, we are the first to use the pre-trained CANINE transformer model in the field of NIDS.
2. We show the effectiveness of using ML and NLP techniques together for network intrusion detection, as well as the individual performance of each technique.
3. Our proposed model, 'OverPowered' or 'OP', surpasses the state-of-the-art results obtained on the well-known NIDS datasets ISCX-IDS2012, UNSW-NB15, and CIC-IDS2017.

This article is structured as follows: Section 2 presents the related work. The overview and construction of the datasets and the descriptions of the evaluation metrics and models are presented in Section 3. The test results are discussed and compared with related studies in Section 4, and lastly, the conclusion and future work are given in Section 5.

| RELATED WORK
NIDS play a pivotal role in safeguarding networks against cyber threats. They analyze network packets to detect potential attacks and have two primary types: signature-based and anomaly-based NIDS. While signature-based NIDS rely on predefined attack patterns, anomaly-based ones identify deviations from normal traffic, requiring a balance between sensitivity and specificity to minimize false positives (Khraisat et al., 2019; Li et al., 2021; Ozkan-Okay et al., 2021).
ML techniques have been increasingly applied to improve the accuracy and reduce the false positives of anomaly-based NIDS (Ahmed et al., 2022; Tufan et al., 2021). Researchers have leveraged various datasets and models to develop effective NIDS. Sarhan et al. emphasized the importance of high-quality datasets for training ML-based NIDS (Sarhan et al., 2021). Albasheer et al. presented a comprehensive survey of NIDS approaches, highlighting the role of alert correlation and advanced ML techniques (Albasheer et al., 2022). Moustafa compared popular NIDS datasets, favouring UNSW-NB15 for its comprehensiveness (Moustafa & Slay, 2015a).
Rizvi et al. introduced a deep learning approach for intrusion detection in resource-constrained environments, achieving high accuracy (Rizvi et al., 2023). Panigrahi et al. provided a comprehensive assessment of supervised classifiers for designing NIDS, identifying the J48 Consolidated classifier as ideal (Panigrahi et al., 2021). Disha and Waheed conducted a comparative study of ML models using the UNSW-NB15 dataset (Disha & Waheed, 2021).
Ghurab et al. conducted a detailed analysis of benchmark datasets for NIDS (Ghurab et al., 2021).
Siddiqi and Pak present an optimized process flow for feature selection within IDSs. Their methodology encompasses normalization, power transformation, and multi-objective optimization. On the ISCX-IDS2012 dataset, their model delivers an F1 score of 93.17%, while achieving a remarkable F1 score of 99.69% on the CIC-IDS2017 dataset, signifying a notable progression in feature selection strategies for IDSs (Siddiqi & Pak, 2020).
Kanna and Santhi propose an inventive IDS harnessing deep learning methodologies, specifically the synergy of convolutional neural networks (CNNs) and long short-term memory networks (LSTMs). Their model excels in accuracy, maintains a low false positive rate, and achieves competitive classification coefficients, reflecting a substantial enhancement in IDS performance (Kanna & Santhi, 2021). Khan et al. introduce a scalable hybrid IDS built upon convolutional-long short-term memory (Conv-LSTM) networks. Their IDS employs a two-stage anomaly detection module to identify potential malicious traffic and a misuse detection module to categorize anomalous traffic flows into known attack types. This IDS achieves an F1 score of 93.17% on the ISCX-IDS2012 dataset, offering scalability and robust pattern learning capabilities (Khan et al., 2019).
Vinayakumar et al. focus on enhancing IDSs through deep learning algorithms, particularly deep neural networks (DNNs). Their DNN-based approach, supported by advanced natural language processing (NLP) techniques and text representation methods, including Bag-of-Words, N-grams, and Keras embedding, has demonstrated impressive performance. On the UNSW-NB15 dataset, their model achieves an F1 score of 82.00%, while on the CIC-IDS2017 dataset, their approach attains a notable F1 score of 93.90% (Vinayakumar et al., 2019). Ho et al. propose an innovative intrusion detection strategy involving a flow-to-image conversion technique combined with a vision transformer (ViT) classifier. This methodology transforms network traffic flows into a sequence of vectors, encodes them into a latent space, and subsequently decodes them into images. Notably, their experiments yield substantial results, with binary classification F1 scores of 98.5% on the CIC-IDS2017 dataset and 96.3% on UNSW-NB15. In multi-class classification, their method attains an F1 score of 96.4% (Ho et al., 2022). Das et al. introduce an ensemble-based network IDS that unites ML and feature selection. Their IDS achieves an F1 score of 99.5% in successfully identifying intrusions, all while maintaining an impressively low false alarm rate of 0.5%. This approach represents a significant step forward in network security (Das et al., 2021).
Payload analysis, a crucial component of network intrusion detection, has seen significant progress. By analyzing the payload in addition to other packet metadata, IDSs can gain a more complete understanding of network traffic and detect threats that might otherwise go unnoticed (Sen et al., 2004). Several studies have demonstrated that analyzing payload can significantly enhance the accuracy and efficiency of IDSs, making them more effective at identifying security threats and reducing the number of false positives. Min et al. (2018) demonstrated the efficacy of NLP techniques in identifying payload attacks, achieving an impressive 99.13% accuracy rate. This approach enhances the system's ability to recognize subtle threats within network traffic. Kim et al. (2020) introduced an artificial intelligence-based intrusion detection system (AI-IDS) tailored for real-time HTTP traffic. Their work highlights the model's capacity to distinguish complex attacks from benign traffic patterns. Furthermore, it contributes to refining Snort rules and strengthening signature-based NIDS. Hassan et al. (2021) proposed payload embeddings, a method that harnesses byte embeddings and a shallow neural network. Across various datasets, this approach consistently achieved accuracy rates ranging from 75% to 99%, outperforming traditional intrusion detection techniques. It showcases the potential of leveraging embeddings for effective network intrusion detection.
Recognizing the growing need for standardized approaches in modern network intrusion detection datasets, Farrukh et al. (2022) introduced Payload-Byte. This versatile tool streamlines dataset curation and establishes a standardized foundation for future research. It offers the capability to extract and label various protocols, enabling the transformation of labelled data into byte-wise feature vectors for ML model training.
Collectively, these studies emphasize the vital role of payload analysis in network intrusion detection. Utilizing diverse methodologies, ranging from NLP techniques to innovative tools, they contribute significantly to the continuous advancement of NIDS, providing valuable insights for addressing cybersecurity threats.
In recent years, transformer-based models have gained much interest and achieved state-of-the-art results in various sequence-related tasks (Lin et al., 2022). Transformer models were first introduced in the article 'Attention Is All You Need' (Vaswani et al., 2017). These models utilize self-attention mechanisms to compute representations of each element in a sequence, capturing both short- and long-range dependencies efficiently.
Transformers have excelled in NLP tasks such as language modelling, machine translation, sentiment analysis, and named entity recognition (Gillioz et al., 2020). They have also demonstrated versatility in non-NLP tasks like speech recognition and image captioning (Khan et al., 2022; Sun et al., 2019). Key advantages include parallel processing of sequences of varying lengths and the ability to capture global dependencies (Lin et al., 2022; Vaswani et al., 2017).
Recent surveys (Casola et al., 2022; Min et al., 2021; Minaee et al., 2021; Qiu et al., 2020), along with the studies discussed in the previous paragraphs, clearly demonstrate that current transformer-based models, particularly pre-trained transformer models fine-tuned on downstream tasks, have consistently outperformed traditional machine and deep learning models on sequence classification tasks in various domains, including cybersecurity.
Table 1 summarizes studies on developing and evaluating NIDS. These studies have been selected for inclusion based on their specific focus on creating NIDS and assessing their performance. The table provides insights into the methods employed, the datasets used for evaluation, and the limitations identified in each study. The information presented in this table is a valuable reference for understanding the landscape of NIDS research, showcasing the diverse approaches, datasets, and challenges encountered in this field.

| METHODOLOGY
In the methodology section, first, the datasets used in the experiments are introduced. Second, the dataset generation steps, including preprocessing, are described, and the reasoning behind the process is explained. Then, the pre-trained transformer model CANINE and our proposed OP (a.k.a. OverPowered) model architecture are explained.
T A B L E 1 Summary of studies on network intrusion detection systems: Methods, datasets, and limitations.

Rizvi et al. (2023). Focus: developing a deep learning-based network intrusion detection system for resource-constrained environments. Method: 1D-dilated causal neural network architecture with dilated convolution. Datasets: CIC-IDS2017, CSE-CIC-IDS2018. Limitation: the 1D convolutional network without dilated convolutions is not explored as a natural predictor, potentially limiting a comprehensive assessment of its effectiveness.

Disha and Waheed (2021). Focus: a comparative analysis of ML models for building an intrusion detection system using the UNSW-NB15 dataset. Methods: decision tree, random forest, gradient boosting tree, and multi-layer perceptron models; chi-square test, response coding, min-max scaling. Dataset: UNSW-NB15. Limitations: dataset specificity may limit generalization; the selected algorithms and their parameters could introduce bias.

Siddiqi and Pak (2020). Focus: optimizing filter-based feature selection methods for intrusion detection systems in the context of big data security. Methods: normalization, power transformation, and multi-objective optimization. Dataset: ISCX-IDS2012. Limitation: reliance on a single dataset may limit the generalizability of the findings and the ability to assess the method's performance across diverse data sources.

| Datasets

The ISCX-IDS2012 dataset was particularly useful for our study because it provides text information for our method's NLP component and tabular features for the MLP component. The UNSW-NB15 and CIC-IDS2017 datasets presented a different challenge: their shared format provides only tabular features, so we needed another way to obtain the text information required by the NLP component of our method. We overcame this challenge using the packet capture (PCAP) files associated with these datasets; the text information from the PCAP files could be matched to the corresponding labels in the shared dataset via common features (Section 3.1.2, Dataset generation).

The main reason for choosing these datasets is that they share many common features with the PCAPs, which ensures reliable matching. Moreover, the datasets contain a mix of normal and attack samples, allowing models to be developed for detecting network threats.

The distributions of label counts after removing duplicates and extracting missing labels from these three datasets are presented in Table 2.

| Dataset generation
The UNSW-NB15 and CIC-IDS2017 datasets originally only provided tabular features but did not include text information in the labelled dataset.
A dataset generation process is implemented for these datasets in order to use the proposed method with text information.
The reasoning behind this dataset generation process is that by extracting the payload information from PCAP files and merging it with the labelled dataset, we can add text information to the labelled dataset, which can then be used as input for text classification with NLP techniques. The dataset generation process involved several steps of data preprocessing and cleaning to ensure that the final dataset is consistent, reliable, and ready for use in the proposed method for intrusion detection using payload information.
The following is a step-by-step explanation of the dataset generation process applied to the UNSW-NB15 and CIC-IDS2017 datasets individually:
1. Parsing the PCAP files of the UNSW-NB15 and CIC-IDS2017 datasets for each day to extract the payload information and generate a CSV file that includes the following features: source IP, destination IP, source port, destination port, protocol, and payload. These features are selected because the labelled datasets also include them (except payload).
2. Merging the PCAP-generated CSV file with the corresponding labelled dataset for each day, using the five columns (source IP, destination IP, source port, destination port, and protocol) as the key columns, hence obtaining a new dataset for each day.
3. Decoding the payload data from its original format (such as bytes) to a human-readable format (such as text or ASCII).
   - It is important to note that not all payloads can be decoded to a human-readable format, as some payloads may be encrypted or encoded in a way that makes them unreadable without the proper decryption or decoding method. In such cases, as much of the payload as possible is decoded, and the non-decoded portion is stored in its original format. Therefore, these payloads include non-human-readable text.
   - The challenges presented by semi-decoded payload data, and the fact that even a fully decoded payload can include meaningless words or structures (Table 3), can be addressed using a tokenizer-free deep-learning-based algorithm (Section 3.3.2, CANINE).
4. Concatenating all the resulting datasets for the different days into one final dataset. The final dataset includes the labelled dataset features, the payload information extracted from the PCAP files, and the label.
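Steps 2 and 3 above can be sketched with a few lines of pandas. The PCAP parsing in step 1 would typically be done with a packet library such as scapy, so the sketch below starts from already-extracted rows; the column names and sample values are illustrative assumptions, not the exact fields of the shared datasets.

```python
import pandas as pd

# Step 1 output (sketch): per-packet rows extracted from a day's PCAP.
pcap_df = pd.DataFrame({
    "src_ip": ["10.0.0.1", "10.0.0.2"],
    "dst_ip": ["10.0.0.9", "10.0.0.9"],
    "src_port": [5050, 6060],
    "dst_port": [80, 80],
    "protocol": ["tcp", "tcp"],
    # Raw payload bytes; the second one is deliberately not valid UTF-8.
    "payload": [b"GET /index.html", b"\xffdata"],
})

# The day's labelled dataset (tabular features only, no payload).
labeled_df = pd.DataFrame({
    "src_ip": ["10.0.0.1", "10.0.0.2"],
    "dst_ip": ["10.0.0.9", "10.0.0.9"],
    "src_port": [5050, 6060],
    "dst_port": [80, 80],
    "protocol": ["tcp", "tcp"],
    "label": ["normal", "attack"],
})

KEYS = ["src_ip", "dst_ip", "src_port", "dst_port", "protocol"]

# Step 2: attach payloads to labels via the five shared key columns.
merged = labeled_df.merge(pcap_df, on=KEYS, how="inner")

# Step 3: decode each payload to text where possible; bytes that cannot
# be decoded are kept as escaped fragments rather than dropped, so the
# resulting text may contain non-human-readable parts, as noted above.
merged["payload_text"] = merged["payload"].apply(
    lambda b: b.decode("utf-8", errors="backslashreplace")
)
print(merged[KEYS + ["payload_text", "label"]])
```

An inner merge keeps only flows present in both sources; rows whose five-tuple has no labelled counterpart are dropped, which is one reasonable way to realize the matching described above.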
T A B L E 2 Label count distribution of the datasets.

T A B L E 3 Different types of payload information examples.

Hexadecimal: 474554202f646f6c61646d696e2e…
Binary representation: b'…

The ISCX-IDS2012 dataset originally included four payload features. Since the data generation process described above is applied to generate the source payload for the UNSW-NB15 and CIC-IDS2017 datasets, only the source payload in UTF format is utilized to ensure consistency between datasets.
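The hexadecimal payload example in Table 3 is itself a readable HTTP request fragment once decoded; two lines of Python make this concrete (only the truncated prefix shown in the table is used):

```python
# Hexadecimal payload prefix from Table 3 (truncated in the source).
hex_payload = "474554202f646f6c61646d696e2e"

# Convert the hex string back to bytes, then decode as ASCII text.
decoded = bytes.fromhex(hex_payload).decode("ascii")
print(decoded)  # -> GET /doladmin.
```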
The following is a step-by-step explanation of the data preprocessing applied to the ISCX-IDS2012, UNSW-NB15, and CIC-IDS2017 datasets individually:
1. Deleting the duplicate rows according to the columns: source IP, destination IP, source port, destination port, and protocol.
2. Deleting rows where the payload information is empty or NaN.
3. Transforming the problem into a binary problem by labelling the network traffic data as either normal or attack, without considering the specific attack types present in the data.

The preprocessing steps described above were implemented to ensure consistency and reliability in the datasets. Although the resulting datasets are reduced in sample size, they still contain sufficient data to produce meaningful results. Note that after these steps, the label distributions in the ISCX-IDS2012 and UNSW-NB15 datasets remained relatively consistent with the originals, whereas the CIC-IDS2017 dataset became more imbalanced than it was initially. The effect of preprocessing on the datasets is shown in Table 4.
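The three preprocessing steps can be sketched as follows; the column names and label strings are illustrative assumptions, not the exact ones used in the datasets.

```python
import pandas as pd

KEYS = ["src_ip", "dst_ip", "src_port", "dst_port", "protocol"]

# Toy frame: row 1 duplicates row 0 on the key columns, row 2 has no payload.
df = pd.DataFrame({
    "src_ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "dst_ip": ["10.0.0.9"] * 4,
    "src_port": [5050, 5050, 6060, 7070],
    "dst_port": [80] * 4,
    "protocol": ["tcp"] * 4,
    "payload": ["GET /a", "GET /a", None, "GET /c"],
    "label": ["normal", "normal", "normal", "DoS"],
})

# Step 1: drop duplicate rows with respect to the five key columns.
df = df.drop_duplicates(subset=KEYS)

# Step 2: drop rows whose payload is empty or NaN.
df = df.dropna(subset=["payload"])
df = df[df["payload"].str.len() > 0].copy()

# Step 3: collapse the specific attack types into a binary label.
df["binary_label"] = (df["label"] != "normal").astype(int)
print(df[["payload", "label", "binary_label"]])
```

After these steps the toy frame keeps two rows: one normal sample and one attack sample, mirroring the reduction in sample size noted above.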

| Evaluation metrics
In network intrusion detection problems, the nature of the data often results in a highly imbalanced class distribution, where the number of negative examples (representing normal behaviour) far surpasses the positive examples (representing intrusive behaviour) (Rodda & Erothi, 2016). As demonstrated in Table 4, the datasets leveraged in our study exhibit a highly imbalanced class distribution, highlighting the importance of selecting appropriate evaluation metrics to accurately evaluate a classification model's performance.
Several metrics can be used to evaluate performance in the presence of imbalanced data. Some commonly used metrics include:
1. Accuracy: Accuracy measures how well a model can correctly predict the outcome. It is calculated as the proportion of correct predictions made by the model out of all predictions made. In imbalanced data problems, accuracy can be misleading, as it only measures the proportion of correct predictions and does not consider the data distribution (Equation 1) (Boughorbel et al., 2017).
2. F1 score: The F1 score is a balanced measure that combines precision and recall, providing an overall evaluation of the model's performance by considering both (Equation 4).
3. Recall (sensitivity or true positive rate (TPR)): Recall measures the model's ability to identify all positive observations, including intrusions. In NID, recall can be used to evaluate the model's ability to detect intrusions (Equation 2).
4. Specificity (true negative rate (TNR)): Specificity measures the model's ability to identify true negatives, that is, normal traffic, without producing false positives. In NID, specificity can be used to evaluate the model's ability to recognize normal network traffic without flagging it as an attack (Equation 5).
5. Matthews correlation coefficient (MCC): The MCC is a balanced measure that considers true positive, true negative, false positive, and false negative predictions, providing an overall evaluation of the model's performance (Equation 6).
6. Fooling rate (FR): The FR is a critical metric for assessing the success of adversarial attacks. It measures the percentage of data samples whose predicted label changes after adversarial perturbation (Equation 7). This metric is of great significance in adversarial attack assessments, particularly for targeted attacks, where it reveals the proportion of samples successfully misclassified as the intended target label (Akhtar & Mian, 2018).

Fooling rate = (Number of samples with changed predictions) / (Total number of adversarial samples). (7)

T A B L E 4 Total sample size and label percentage distribution of the datasets.

In addition to the evaluation metrics, a confusion matrix is a supplementary technique used to assess the performance of a classification model. It provides a comprehensive evaluation by capturing the total number of accurate and inaccurate predictions made by the model, thereby offering a more nuanced view than a single evaluation metric.
In the realm of NIDS, the false negative rate is deemed to be a vital criterion. Hence, evaluating a model's performance through the confusion matrix is more valuable than relying exclusively on conventional metrics such as F1, accuracy, TNR, TPR, and MCC (Luque et al., 2019).
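As a concrete reference, the metrics above can all be computed from the four confusion-matrix counts. The sketch below uses a toy prediction vector; the equation numbers in the comments follow the order used in this section, and precision (assumed here to be Equation 3) appears only as an intermediate for F1.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN), treating 1 as attack and 0 as normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

# Toy labels: 3 attacks, 5 normals (imbalanced, like the real datasets).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + tn + fp + fn)          # Equation (1)
recall = tp / (tp + fn)                             # Equation (2), TPR
precision = tp / (tp + fp)                          # assumed Equation (3)
f1 = 2 * precision * recall / (precision + recall)  # Equation (4)
specificity = tn / (tn + fp)                        # Equation (5), TNR
mcc = (tp * tn - fp * fn) / math.sqrt(              # Equation (6)
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# Equation (7): fraction of predictions flipped by adversarial samples.
y_pred_adv = [1, 0, 0, 0, 0, 0, 1, 0]
fooling_rate = sum(
    1 for a, b in zip(y_pred, y_pred_adv) if a != b) / len(y_pred_adv)
```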

| Models
This study utilized three models: MLP; CANINE (character architecture with no tokenization in neural encoders), a pre-trained transformer model; and the newly proposed OverPowered (OP) model. These models are described in detail below.

| Multi-layer perceptron (MLP)
MLP is an artificial neural network widely used for supervised learning tasks. MLPs are composed of multiple layers of interconnected nodes, also known as artificial neurons, that process the input data and produce the final output (Taud & Mas, 2018).
Each node in an MLP takes inputs from the previous layer, performs a weighted sum of these inputs, and applies an activation function to produce its output. The activation function adds non-linearity to the network, allowing it to learn complex relationships between inputs and outputs. This characteristic makes MLPs particularly useful for solving problems where a simple linear model is inadequate. The outputs from one layer are then used as inputs for the next layer, and this process is repeated until the final layer produces the final output (Aitkin & Foxall, 2003).
MLPs exhibit exceptional potential as a solution for diverse problems, including image classification, NLP, and finance, due to their capability to learn complex, non-linear relationships between inputs and outputs. MLPs demonstrate versatility and adaptability through the number of layers and nodes and the type of activation function used. Figure 1 shows the MLP architecture in detail.
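The per-node computation described above, a weighted sum followed by a non-linearity, can be written compactly. Below is a minimal NumPy sketch of a two-layer MLP forward pass; the layer sizes (10 tabular features, 16 hidden units, 1 output) are illustrative assumptions, not the dimensions of our model.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # Hidden-layer activation: adds the non-linearity described above.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Output activation: squashes the score into a (0, 1) probability.
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes: 10 tabular features -> 16 hidden units -> 1 output.
W1, b1 = rng.normal(size=(10, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

def mlp_forward(x):
    """Each layer: weighted sum of the previous layer's outputs + bias,
    followed by an activation; the last layer yields the prediction."""
    h = relu(x @ W1 + b1)
    return sigmoid(h @ W2 + b2)

x = rng.normal(size=(4, 10))   # a batch of four feature vectors
probs = mlp_forward(x)
print(probs.shape)  # -> (4, 1)
```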

| CANINE
Transformer models are a type of neural network architecture that has revolutionized the field of NLP.These models are designed to handle sequential data, such as text or speech, using self-attention mechanisms to weigh the importance of each input element when making predictions.
This allows them to effectively capture long-range dependencies and relationships between elements in the sequence, which is important for many NLP tasks such as language translation, text classification, and named entity recognition (Lin et al., 2022;Vaswani et al., 2017).
Pre-trained transformer models are transformer models that have already been trained on large amounts of data and are then made available for fine-tuning on specific NLP tasks. These pre-trained models provide powerful features that can be used as a starting point for new NLP models, allowing developers to train models quickly without investing the time and resources needed to train models from scratch (Qiu et al., 2020).
CANINE (character architecture with no tokenization in neural encoders) is a pre-trained encoder model designed to overcome the limitations of traditional tokenization techniques, such as word-piece and sentence-piece tokenization. Unlike conventional pre-trained models, CANINE uses neural encoders that encode the sequence of characters or sub-words without explicitly tokenizing the input data (Clark et al., 2021).
CANINE is pre-trained on the masked language modelling (MLM) and next sentence prediction (NSP) tasks. The lack of tokenization makes CANINE more versatile, as it can be used in specialized domains where traditional tokenization may not be suitable (Clark et al., 2021).
In the field of network intrusion detection, the tokenization-free strategy employed by CANINE might be particularly well-suited, as payload information (Table 3) may not be amenable to traditional word tokenization. Using character-based encoding allows CANINE to capture the fine-grained details of the input data, which is critical for accurate network intrusion detection. Figure 2 illustrates the CANINE model structure.
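CANINE's tokenization-free input scheme is easy to illustrate: each character maps directly to its Unicode code point, so arbitrary, even semi-decoded, payload text always gets a valid input id with no vocabulary lookup. The sketch below is a simplification; the special-token values in the private use area are an assumption for illustration based on the CANINE design, not necessarily the exact ids used by the released model.

```python
# Hypothetical special-token ids in Unicode's private use area
# (an assumption for illustration; see the CANINE paper/implementation).
CLS, SEP = 0xE000, 0xE001

def canine_style_ids(text):
    """Map every character to its Unicode code point: no vocabulary and
    no tokenizer, so any payload string encodes without unknown tokens."""
    return [CLS] + [ord(ch) for ch in text] + [SEP]

ids = canine_style_ids("GET /x")
print(ids)  # -> [57344, 71, 69, 84, 32, 47, 120, 57345]
```

Because the mapping is just `ord`, the non-human-readable fragments left over from payload decoding encode as cleanly as ordinary text, which is the property that motivates using CANINE here.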
The performance of each model is evaluated by comparing their results on each dataset using various metrics such as F1, accuracy, TPR, TNR, and MCC.

| Results
The results have been analyzed for each individual dataset used in our study. A detailed performance comparison of the models on each dataset can be found in Tables 5-7, providing an in-depth analysis of the results through various evaluation metrics.
Based on the comprehensive evaluation of performance metrics across all three datasets, the OP model consistently demonstrates superior F1 score, accuracy, MCC, TPR, and TNR results. These findings strongly suggest that the OP model outperforms the other two models in this study. In contrast, the CANINE model consistently lags in all metrics, indicating its limited effectiveness in intrusion detection tasks. While the MLP model exhibits performance comparable to the OP model, it falls short of achieving the same level of performance.
An analysis of the UNSW-NB15 dataset distribution, where benign samples comprise 92.52% and malicious samples make up 7.48% (as shown in Table 4), underscores the limitations of the CANINE model. Notably, the CANINE model's performance is even poorer than a simple baseline or 'dummy' model that predicts all test samples as normal. These results strongly indicate that relying solely on text-based features may not provide adequate accuracy for detecting network anomalies within this imbalanced dataset.
The evaluation results indicate that the MLP and OP models are the more appropriate candidates to proceed with. To gain a more comprehensive understanding of the capabilities of these two models, which appear more favourable than the CANINE model, it is essential to examine their confusion matrices on each of the datasets, depicted in Figure 4. This provides a thorough insight into the models' ability to distinguish between malicious and benign samples.
When evaluating the MLP and OP models' confusion matrices, comparing their false negative counts is crucial to determine the more suitable model for scenarios where even a single false negative result can have substantial consequences.
To make the comparison easier, we created a table of false negatives, which includes each model's total attack count and false negatives based on the confusion matrices for all datasets.This table enables a straightforward evaluation of the models' performance.
The results presented in Table 8 show that the OP model outperforms the MLP model regarding false negatives across all three datasets. The case of the CIC-IDS2017 dataset is particularly noteworthy, where the OP model has a significantly lower number of false negatives, with only two recorded. These results highlight the superiority of the OP model in reducing false negatives, making it a more suitable choice for NIDS scenarios where minimizing false negatives is of the utmost importance.
In the context of intrusion detection, adversarial attacks pose a unique challenge. These attacks involve deliberate, subtle manipulations of input data, crafted to deceive an IDS and evade identification while preserving the data's functionality. Researchers have traditionally focused on exploring adversarial attacks and developing corresponding defense mechanisms in intrusion detection (He et al., 2023; Jmila & Khedher, 2022).
This study, however, diverges from conventional defense mechanisms. Instead, the primary objective is to assess the inherent resilience of the model itself: evaluating how well it can withstand adversarial attacks without relying on explicit defenses, providing valuable insights into its robustness and reliability.
Within the realm of adversarial attacks on NIDS, it is essential to distinguish between attack types. In white-box attacks, adversaries possess comprehensive knowledge of the target intrusion detection model, including its architecture, parameters, and internal workings, and use this information to craft adversarial samples that deceive the model. In this study, we specifically explored the fast gradient sign method (FGSM) (Goodfellow et al., 2014), a well-known white-box attack first proposed in 2014. FGSM operates on gradients: neural networks, including intrusion detection models, minimize their loss by adjusting weights using backpropagated gradients during training. To mount an attack, FGSM instead uses those same gradients to maximize the loss, producing carefully calculated perturbations of the input features.
The FGSM-based adversarial attack is formulated by modifying the input data x as follows:

x_adv = x + ϵ · sign(∇_x J(θ, x, y)),

where x represents the model's inputs, y the true label, θ the model parameters, ϵ the magnitude of the perturbation, and ∇_x J(θ, x, y) the gradient of the adversarial loss with respect to the input. This attack strategy deceives the intrusion detection model by perturbing input data in the direction that maximizes the loss, thereby revealing vulnerabilities in its performance.
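To make the perturbation concrete, here is a minimal NumPy sketch of FGSM applied to a binary logistic-regression scorer standing in for an intrusion detection model. The model, weights, and feature values are illustrative assumptions, and a realistic NIDS attack would additionally clip perturbed features back to valid ranges:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_logreg(w, b, x, y, eps):
    """FGSM for binary logistic regression with cross-entropy loss.
    Here grad_x J = (sigmoid(w.x + b) - y) * w in closed form, so
    x_adv = x + eps * sign(grad_x J). Illustrative sketch only."""
    grad_x = (sigmoid(w @ x + b) - y) * w
    return x + eps * np.sign(grad_x)

w = np.array([1.0, -2.0]); b = 0.0        # toy "detector" weights
x = np.array([0.5, 0.5]); y = 1.0          # true class: attack
x_adv = fgsm_logreg(w, b, x, y, eps=0.2)
print(x_adv)  # each feature nudged by ±eps in the loss-increasing direction
```

Note that every feature moves by exactly ±ϵ, which is why larger epsilon values in Table 9 correspond to more aggressive, but also more detectable, perturbations.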
Comparing the FN counts before and after subjecting the network intrusion detection models to FGSM attacks with varying epsilon values (ϵ = 0.02, ϵ = 0.2, ϵ = 0.5) (Tables 8 and 9), it becomes evident that the MLP model had a notably higher FN count on all three datasets than the OP model. Specifically, for the ISCX-IDS2012 dataset, the FN count increased from 22 to 2559 with ϵ = 0.02 for the MLP model, whereas the OP model exhibited a more modest increase from 13 to 1738 under the same conditions. Similarly, on the UNSW-NB15 dataset, the MLP model's FN count rose from 68 to 5022 with ϵ = 0.02, while the OP model's rose from 17 to 2858. On the CIC-IDS2017 dataset, the FN count for the MLP model increased from 21 to 2442 with ϵ = 0.02, while the OP model's went from 2 to 1652.
Furthermore, when subjected to the FGSM white-box attack, our experiments with the MLP model across all three datasets revealed varying fooling rates (Table 10). These rates align with findings in the broader NIDS literature: models similar to ours, without specific defenses against adversarial attacks and with ϵ set at 0.02, typically exhibit fooling rates ranging from 40% to 100% (Abou Khamis et al., 2020; Deng et al., 2020; Wang et al., 2022; Zhang et al., 2022). In our case, the MLP model demonstrated fooling rates of 45.13%, 48.88%, and 42.45%, underscoring its susceptibility to adversarial perturbations and the significant deviations observed during these attacks.
In stark contrast, our OP model exhibited significantly higher robustness against the FGSM attack, consistently demonstrating lower fooling rates of 30.69%, 28.03%, and 28.93%. These results signify the OP model's enhanced ability to withstand adversarial perturbations and its superior performance when faced with such challenges.
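For clarity, the fooling rates above can be computed as follows, under our assumed definition (among samples classified correctly on clean inputs, the fraction misclassified after the attack; exact formulations vary across the literature):

```python
def fooling_rate(y_true, pred_clean, pred_adv):
    """Fooling rate: among samples the model classified correctly on
    clean inputs, the fraction flipped to a wrong label after the
    attack. Returns 0.0 if nothing was classified correctly."""
    correct = [i for i in range(len(y_true)) if pred_clean[i] == y_true[i]]
    fooled = sum(1 for i in correct if pred_adv[i] != y_true[i])
    return fooled / len(correct) if correct else 0.0

# Toy example: 4 of 5 samples are correct before the attack,
# and 3 of those 4 are flipped by the perturbation.
y     = [1, 1, 0, 0, 1]
clean = [1, 1, 0, 1, 1]
adv   = [0, 1, 1, 1, 0]
print(fooling_rate(y, clean, adv))  # 0.75
```

Restricting the denominator to initially-correct samples keeps the metric from crediting an attack for samples the model already got wrong.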
TABLE 8 False negatives (FN) comparison.
The enhanced robustness of the OP model against adversarial attacks can be attributed to its holistic approach, which leverages a combination of features and advanced contextual understanding. By incorporating structured tabular data and textual information, the OP model gains a broader perspective on network traffic. This multifaceted view enhances its capacity to detect subtle adversarial manipulations effectively. Including the CANINE pre-trained transformer model facilitates transfer learning and bolsters the OP model's contextual comprehension of network payloads. Furthermore, the OP model's higher complexity, arising from the fusion of MLP and transformer components, allows it to capture intricate patterns and potentially benefit from ensemble-like effects. Collectively, these factors contribute to the OP model's superior resilience against adversarial attempts in intrusion detection tasks.
We compared our results with previous literature studies to better understand the OP model's performance.
Table 11 presents the performance comparison of the OP model with previous studies on the ISCX-IDS2012 dataset. The OP model achieves strong results, with an accuracy of 99.98%, precision of 99.80%, recall of 99.77%, and an F1 score of 99.80%. Notably, a couple of these studies achieved precision rates of 100%, surpassing our model's precision.
The performance comparison on the UNSW-NB15 dataset is detailed in Table 12. Our OP model achieves an accuracy of 98.77%, precision of 85.94%, recall of 99.83%, and an F1 score of 92.37%. Despite some previous studies demonstrating higher precision, the OP model maintains a significantly improved recall rate, which is crucial for intrusion detection tasks.
Table 13 outlines the performance comparison on the CIC-IDS2017 dataset. The OP model excels with an accuracy of 99.97%, precision of 99.41%, recall of 99.96%, and an F1 score of 99.69%. Although one of these studies achieved a slightly higher precision rate than our model's, the overall results reaffirm the OP model's suitability for intrusion detection tasks.
Across the evaluated datasets, it is evident that there is room for improvement in precision, particularly when compared to certain previous studies where higher precision has been achieved. Our OP model consistently achieves the highest accuracy and recall rates across all datasets, underscoring its reliability in identifying network anomalies. Notably, in the ISCX-IDS2012 and CIC-IDS2017 datasets, the F1 score, which balances precision and recall, also ranks highest for our model. However, on the UNSW-NB15 dataset, where precision is comparatively lower, the F1 score follows suit.
TABLE 10 Fooling ratios comparison after FGSM attack (epsilons: ϵ = 0.02, ϵ = 0.2, ϵ = 0.5).
These findings highlight the importance of a balanced approach in intrusion detection, where both precision and recall play vital roles.While precision helps minimize false alarms, recall ensures that genuine intrusions are not missed.Given the potential risks associated with network attacks, a slightly higher emphasis on recall is favoured in our model to prioritize minimizing missed detections and ensuring comprehensive security coverage.
Intrusion detection using NLP techniques on payload features has been explored in prior research, including studies by Hassan et al. (2021) and Vinayakumar et al. (2019). In our study, we extend the work of Hassan et al. in the intrusion detection domain, specifically leveraging payload features and NLP techniques. Our approach combines the analysis of packet headers, handled by the MLP component of our OP model, with payload data, handled by CANINE. This dual perspective lets us understand network traffic and detect anomalies effectively. Notably, Hassan et al. primarily focused on payload data analysis and did not incorporate packet header information into their approach.
Our findings reveal that our model consistently outperforms Hassan et al.'s approach on the ISCX-IDS2012 and CIC-IDS2017 datasets across all metrics, showcasing its robustness in these contexts. However, on the UNSW-NB15 dataset, Hassan et al. achieved a slightly higher F1 score, primarily due to their significantly higher precision, indicating their strength in accurately identifying intrusions in this specific dataset. This observation highlights an interesting aspect of intrusion detection: the relevance and effectiveness of header information versus payload information can vary depending on the dataset or specific attack patterns. On the other hand, our model struggles with precision on this dataset, indicating that it may classify some normal traffic as intrusions. This divergence in performance on the same dataset underscores the dataset's complexity and the need for further investigation.
Our OP model demonstrates robust intrusion detection performance across diverse datasets, leveraging a fusion of MLP and NLP techniques enhanced by the pre-trained transformer model CANINE. While acknowledging room for precision improvement, our model consistently excels in accuracy and recall, achieving top-tier F1 scores in select datasets. This approach contributes to more reliable network security solutions, addressing evolving cyber threats. Additionally, our research highlights the OP model's resilience against adversarial attacks, underlining the significance of holistic, NLP-based strategies in advancing intrusion detection and network security.

| Limitations
A crucial limitation to consider when working with network intrusion detection datasets is that even if multiple studies use the same dataset, they may not use the same samples, and the distribution of samples can differ significantly from the original version. This can make it challenging to compare results across studies and hinder the development of standardized evaluation procedures. To overcome this limitation, researchers must pay close attention to how they sample from datasets to ensure greater comparability across studies. Additionally, researchers may consider providing more detailed information on the samples they use and how they were selected to enable more accurate comparisons between studies.
The challenge of working with large PCAP files is a common issue in many network intrusion detection datasets, including the ISCX-IDS2012, UNSW-NB15, and CIC-IDS2017 datasets. These files capture and store network traffic, which can be voluminous and require high-performance computing resources to extract information. The size of PCAP files can pose challenges in portability, sharing, and processing, limiting the types of analyses and evaluations that can be performed. In real-life settings, the size of PCAP files can be an even more significant problem, as network traffic captures can span days or weeks, making it more challenging and computationally costly to analyze and process the data. To address these issues, researchers may consider alternative data storage and processing techniques, such as data compression, distributed computing, or federated learning, or use smaller subsets of the datasets focused on specific attack types or network behaviours.
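One practical mitigation is to stream PCAP records instead of loading entire captures into memory. A minimal stdlib sketch that walks a classic libpcap file record by record, skipping payload bytes with seek() (it assumes the little-endian magic 0xA1B2C3D4 and does not handle the pcapng format):

```python
import struct
import io

def iter_pcap_lengths(fp):
    """Stream original packet lengths from a classic libpcap file
    without loading it into memory: read each 16-byte record header,
    then seek past the captured payload bytes."""
    global_hdr = fp.read(24)                  # fixed-size global header
    magic, = struct.unpack("<I", global_hdr[:4])
    assert magic == 0xA1B2C3D4, "unsupported pcap variant"
    while True:
        rec = fp.read(16)                     # per-record header
        if len(rec) < 16:
            break
        _ts_sec, _ts_usec, incl_len, orig_len = struct.unpack("<IIII", rec)
        fp.seek(incl_len, 1)                  # skip payload bytes
        yield orig_len

# Build a tiny two-packet pcap in memory to demonstrate
hdr = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)

def pkt(n):
    return struct.pack("<IIII", 0, 0, n, n) + b"\x00" * n

buf = io.BytesIO(hdr + pkt(60) + pkt(1500))
print(list(iter_pcap_lengths(buf)))  # [60, 1500]
```

Because only per-record headers are ever held in memory, the same loop works unchanged on multi-gigabyte captures opened with `open(path, "rb")`.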

| CONCLUSION AND FUTURE WORK
This study explored deep learning and NLP techniques to improve the performance of NIDS based on network logs, focusing on the strength of payload features. Three models, MLP, CANINE, and OP, were evaluated on three datasets to determine their performance.
The findings from our internal evaluation demonstrate that the deep learning approach, MLP, exhibited strong independent performance.
However, the pre-trained transformer approach CANINE, which relies solely on text-based features, failed to deliver sufficiently effective results.
In contrast, the OP model has surpassed the MLP and CANINE models by integrating tabular and text-based features, indicating the importance of utilizing both feature types for accurate network anomaly detection.
Through a comparative analysis of our study's results with the existing literature, it has been observed that the OP model has achieved state-of-the-art outcomes in network anomaly detection. However, we acknowledge that there is still room for improvement in terms of precision. Further research could focus on optimizing the model's performance by exploring alternative feature engineering techniques and implementing more advanced deep learning architectures.
Federated learning is a promising approach that enables collaborative training of ML models while preserving data privacy and security for multiple participants.This technique can potentially overcome the limitations of centralized data processing in NIDS and enhance the real-time performance of network anomaly detection.We believe that further exploration of this research direction is warranted, as it presents a significant potential for future advancements in the field of NIDS.
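As a sketch of the idea, federated averaging (FedAvg) aggregates locally trained model parameters as a sample-size-weighted mean, so raw traffic never leaves each participating site. Flat NumPy vectors stand in for real NIDS model weights here:

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Federated averaging: the server combines client parameter
    vectors as a sample-size-weighted mean. Each client trains
    locally; only parameters, never raw traffic, are shared."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

w_a = np.array([1.0, 3.0])   # client A's local weights, 100 local flows
w_b = np.array([3.0, 1.0])   # client B's local weights, 300 local flows
print(fed_avg([w_a, w_b], [100, 300]))  # [2.5 1.5]
```

In an NIDS deployment, each round would repeat local training on fresh traffic followed by this aggregation, letting organizations collaborate on a shared detector without exchanging sensitive captures.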
3.1.1 | Overview of datasets

1. ISCX-IDS2012 (Shiravi et al., 2012): The ISCX-IDS2012 dataset is a network traffic dataset created in 2012 by the Information Security Centre of Excellence (ISCX). It contains network traffic captures for different types of attacks, including botnet, DoS, port scan, and brute-force attacks. The dataset consists of seven days of network traffic captures, with each day's capture lasting approximately 24 hours. It also includes features extracted from the network traffic captures, such as packet and flow features and payload information.

2. UNSW-NB15 (Moustafa & Slay, 2015b): The UNSW-NB15 dataset is a network traffic dataset created by the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) at the University of New South Wales (UNSW). The dataset includes network traffic captures for nine attack categories, including exploits, reconnaissance, and backdoor attacks, as well as captures of normal network traffic to provide a realistic representation of modern network traffic. It consists of approximately 2.5 million instances and includes features extracted from the network traffic captures, such as packet and flow features, along with PCAP files containing the raw network traffic.
TABLE 13 Comparison of OP model performance with previous studies in literature on the CIC-IDS2017 dataset.
3. CIC-IDS2017 (Sharafaldin et al., 2018): The CIC-IDS2017 dataset is a network traffic dataset created by the Canadian Institute for Cybersecurity (CIC). It includes captures of benign and malicious network traffic and several types of attacks, such as DoS, botnet, and web attacks. The dataset consists of approximately 10 million instances and includes features extracted from the network traffic captures, such as packet and flow features. It is designed to provide a realistic representation of modern network traffic and is intended for use in the research and development of intrusion detection systems and other cybersecurity applications. It also includes PCAP files containing the raw network traffic.
TABLE 5 Performance comparison of models on ISCX-IDS2012 dataset.
TABLE 6 Performance comparison of models on UNSW-NB15 dataset.
TABLE 7 Performance comparison of models on CIC-IDS2017 dataset.
Note: Bold indicates the highest score achieved in each respective performance metric across the models compared.
TABLE 11 Comparison of OP model performance with previous studies in literature on ISCX-IDS2012 dataset.
TABLE 12 Comparison of OP model performance with previous studies in literature on UNSW-NB15 dataset.
Note: Bold indicates the highest scores achieved in each respective metric when compared to the results reported in the referenced studies.