ASYv3: Attention-enabled pooling embedded Swin transformer-based YOLOv3 for obscenity detection

The rampant spread of explicit content across social media can leave a damaging mark on society, so the need to be vigilant in detecting and curtailing sexually explicit content cannot be overstated. It is paramount to discern and manage sexually explicit material to curb its dissemination and safeguard our digital communities from its harmful effects. In this article, we propose a unique technique entitled attention-enabled pooling (ABP) embedded Swin transformer-based YOLOv3 (ASYv3) for detecting obscene areas in images, marking the offensive regions with bounding boxes. ASYv3 employs a two-step approach for enhanced performance in obscenity detection. In the first step, a scalable and efficient Swin transformer block is integrated, utilizing self-attention and model parallelism to train massive models effectively. In the second step, the embedding layer of the Swin transformer is replaced with ABP, mitigating the disruption of feature context. ABP projects raw-valued features into linear form with proper attention to feature context information at specified locations, resulting in optimized feature extraction. The proposed ABP embedded Swin transformer-based YOLOv3 (ASYv3) was trained with the annotated obscene images (AOI) dataset. The proposed ASYv3 model surpassed state-of-the-art methods by achieving 97% testing accuracy, 96.62% precision, 97.40% sensitivity, 3.48% FPR, 97.37% NPV, and 95.59% mAP.

Additionally, many platforms have implemented community guidelines and terms of service that prohibit posting obscene content, and users can report content that violates these guidelines. However, the large scale and rapid pace of content sharing on social media platforms limit the effectiveness of these measures. Furthermore, deepfake technology and AI-generated images and videos make it increasingly hard to detect and remove obscene content. According to a survey of parents of teens aged 13 to 17 conducted in spring 2022 by the Pew Research Center, parents have a variety of concerns about their children's use of social media, the most common of which are their children's exposure to pornographic content and the amount of time their children waste on these sites (Kota et al., 2014; PewResearch, 2023). Figure 1 depicts a graphical representation of parents' concerns regarding their teens' use of social media. Exposure to pornography at a young age can negatively impact a child's sense of morality and lead to harmful behaviours such as sexual assault. Child pornography, for instance, should never be accepted, and content that supports such crimes should be removed from the internet as quickly and thoroughly as possible so that the harmful message does not spread worldwide. Because it is harmful to humankind as a whole, offensive content such as obscene material or anything that demeans someone should never be made available to the general public. Detecting obscene information is essential for restricting access to such content. It is also a complex task with significant legal, ethical, and moral implications, requiring careful consideration. Preventing such cybercrime therefore necessitates a detection algorithm capable of automatically monitoring obscene content on social media. Several other disciplines have benefited greatly from the implementation of deep learning algorithms (Del Ser et al., 2022; Goel et al., 2022; Stevenson et al., 2023). Several obscene classification methods described in the literature can be used to identify offensive material and prevent its further dissemination or access. Classification techniques are broadly divided into machine learning models and deep learning models. Machine learning models are less accurate, have a high computational cost, and are unable to accurately detect obscenity in images; deep learning models surpass them in accuracy, training speed, and stability during training, amongst other factors. Based on our findings, there is currently no established method for automatically detecting and removing explicit content from images shared on social media platforms. This motivates a method that both classifies an entire image as sexually suggestive and accurately pinpoints the specific objectionable area with a bounding box.
FIGURE 1 A survey conducted from April 14 to May 4, 2022, regarding parents' increasing concern about their teens watching obscene content (PewResearch, 2023).

In this paper, we exploit the advantages of the detection-based algorithm YOLOv3 and redesign its backend by incorporating Swin transformer blocks and attention-enabled pooling (ABP) blocks. In response to the problems of (i) misclassification of obscene regions and (ii) a high false positive rate, the primary emphasis of this study is on improving obscene region detection performance in sexually explicit images. In this model, we use a Swin transformer-based feature extractor to inherit the spatial features during the learning process. To improve the efficacy of the Swin transformer inside YOLOv3, ABP is incorporated into the Swin transformer by replacing the linear embedding layer, avoiding the disruption of feature locations. The training procedure comprises 9000 images along with their respective annotation files. The images are thoroughly processed, and bounding box annotation is performed for all the images. The following are the significant contributions of the work:

1. The proposed model features a novel backend architecture, an ABP embedded Swin transformer-based YOLOv3. Incorporating the Swin transformer into YOLOv3 offers a unique opportunity to improve the model's precision beyond the baseline version. The integration of convolutional and Swin attention aims to create a novel architecture that leverages the strengths of both components. The Swin transformer helps improve scalability, efficiency, generalization, and computational resource utilization by selectively focusing on relevant parts of the input. Attention-based pooling, in turn, enables the model to weigh the importance of different parts of the input, rather than simply taking the maximum or average value as in traditional pooling methods. The design of the proposed model is supported by incorporating a Swin transformer and ABP to produce hierarchical feature maps. This combination can enhance the accuracy of object detection tasks, such as obscenity detection, as it allows the model to maintain both global and local features for accurate pixel-level predictions.
2. The Swin transformer encoder block was inserted after the output of the Darknet-53 block to substitute several CBL (Convolution + Batch Normalization + Leaky ReLU) and upsampling layers. To alleviate the computational burden and preserve memory, we apply the Swin transformer block after the Darknet-53 block, where the feature map is small. This technique can enhance the efficacy of the integrated model and make it usable even on devices with limited processing power.
3. Furthermore, ABP is used at the foundational level of the Swin transformer block inside YOLOv3. The Swin transformer has a linear embedding layer at the initial position of the block. This layer applies a simple projection of features, which disrupts the feature locations. To address this issue, we substitute the embedding layer with ABP. The context information in a small neighbourhood is crucial for images; ABP projects raw-valued features into linear form whilst giving equal priority to each feature map via the attention mechanism. The integration of attention pooling into the Swin transformer-based YOLOv3 backend is a novel approach proposed in this study for obscenity detection in images.
4. The proposed model is trained with the annotated obscene images (AOI) dataset, collected from copyright-free pornography websites. All the images went through a process of bounding box generation using Label Studio software. As a result, 13,500 obscene images with their respective bounding box coordinate files are used to train and validate the modified YOLOv3 algorithm to obtain the annotated obscene regions, and the remaining 3000 images are employed for testing.
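To make the annotation format concrete, the snippet below shows how a single bounding box line might be parsed, assuming the common YOLO text export from Label Studio (one line per box: class id followed by normalized centre coordinates and box size). The file contents and image size here are purely illustrative, not taken from the AOI dataset.

```python
# A hypothetical YOLO-format annotation line, as commonly exported by Label Studio
# (assumed format: class_id, then normalized centre x/y and box width/height).
line = "0 0.512 0.634 0.210 0.187"

cls, xc, yc, w, h = line.split()
# Convert the normalized box to pixel corner coordinates for a 640x480 image.
img_w, img_h = 640, 480
x1 = (float(xc) - float(w) / 2) * img_w
y1 = (float(yc) - float(h) / 2) * img_h
x2 = (float(xc) + float(w) / 2) * img_w
y2 = (float(yc) + float(h) / 2) * img_h
print(cls, x1, y1, x2, y2)  # class id and box corners in pixels
```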
The following is the outline for this paper: Section 2 addresses relevant related studies, whereas Section 3 details the dataset, the proposed model, and its underlying methodology. Section 4 describes the experimental setup, results, and a discussion of the proposed method by analysing the acquired experiment values. Section 5 presents the conclusion.

2 | RELATED WORK
This section presents a concise summary of the significant contributions to the literature on classifying and detecting obscene material. Owing to a small number of cyber criminals, digital platforms are contaminated with obscenity, and categorizing obscene information is an increasingly challenging problem. As images and videos proliferate on social media, privacy concerns arise in the context of digital video forensics, where the authenticity and integrity of media content can be compromised, highlighting the need for robust privacy measures in the age of digital media (Javed et al., 2021; Nagasree et al., 2023). In the literature, there exist classification-based and detection-based approaches for obscene images. Methods for classifying obscenity are usually based on skin pixels, visual features, or deep learning. Skin pixel estimation was the first approach to obscenity classification. In (Lin et al., 2003), an SVM-based method for classifying obscenity based on the skin pixel count was proposed. In (Basilio et al., 2011), researchers used the YCbCr colour space to identify skin pixels in a dataset consisting of 450 nude images and 550 non-obscene images; they achieved an accuracy of 88.4% with this method but also encountered 5% false positives. In (da Silva Eleuterio & de Castro, 2010), the authors detected child pornography in photographs using Nu-Detective, an image-oriented method that relied on frame segmentation and frame extraction with the PHP video toolkit. However, the authors noted that explicit content detection through skin colour segmentation was highly dependent on the range of skin colours present. According to the Bayesian model used in (Chai & Bouzerdoum, 2000), the distribution of skin tones follows a Gaussian distribution in the YCbCr colour space. Due to high misclassification rates and processing costs, these methods could not meet the requirements. When people wore swimwear or short wrestling uniforms, the large number of exposed skin pixels confused the skin detector, leading to incorrect obscene classifications. Later, methods were proposed to extract features related to visual motion. From an MPEG video stream, the authors of (Zhiyi et al., 2009) retrieved motion data and smoothed it with a median filter to detect pornography. Pornography detection was accomplished by establishing a threshold for the measured intensity and direction of each motion vector. To identify pornographic images, the authors of (Mao et al., 2018) employed seven different types of features and a gradient-boosted decision tree model (GBDTM). However, the considerable computing cost was not justified by the low accuracy of these methods.
To deal with such issues, researchers turned to deep learning. The authors of (Moustafa, 2015) introduced the first deep learning approach to obscene classification using AlexNet (Krizhevsky et al., 2017) and GoogleNet (Szegedy et al., 2015), which were trained on the publicly available NPDI dataset (Avila & Araújo, 2018). The scores from the individual networks were fused at the classification level, achieving an accuracy of 94.1%. In (Song & Kim, 2020), the authors utilized a multi-modal stacking approach to detect pornographic content on social media platforms in real-time; they combined a bidirectional recurrent neural network and the VGG16 model to achieve a high accuracy rate of 95.60%. The Pornography-2k dataset (Avila et al., 2013) was used for training and testing. Overall, the study aimed to improve the identification and filtering of inappropriate content on social media. In (Perez et al., 2017), the authors fused features at the intermediate level using GoogleNet to extract spatial and temporal data from videos. These attributes were used as inputs to a support vector machine classifier to identify pornographic material; an accuracy of 97.9% was attained by training the GoogleNet model on images from the publicly available Pornography-2k dataset. In (Nurhadiyatna et al., 2017), the authors utilized VGG16 (Simonyan & Zisserman, 2014) and ResNet-18, -34, and -50 (He et al., 2016) to develop a computer-based solution for categorizing pornographic content. The models were trained and tested using the NPDI dataset; with an accuracy of 75.08%, ResNet-34 was found to be the most effective model. In (Qamar Bhatti et al., 2018), the authors trained a ResNet-50 model for obscene classification with 1000 obscene and 1000 non-obscene photos; the model achieved a respectable 95% accuracy. In (Wehrmann et al., 2018), the ACORDE model was suggested, which used a convolutional neural network (CNN), ResNet-152, and a long short-term memory (LSTM) network (Hochreiter & Schmidhuber, 1997); this model achieved 95.3% classification accuracy. In (Papadamou et al., 2020), the authors addressed the challenge of detecting sexually explicit videos by employing the Inception V3 model. These are some instances of deep learning-based classification approaches. One further study classified obscene images using multiple DL models and merged the characteristics of the two best-performing models using feature fusion.
The literature offers relatively little in the way of methods for detecting pornographic regions in obscene material (AlDahoul et al., 2020; Samal, Zhang, et al., 2023; Srivastava et al., 2020). One of these studies proposed the SBMYv3 model, a BAM-enabled detection network. Convolutional neural network-based methods are more accurate and computationally efficient than conventional methods for detecting obscene images, but their accuracy depends on access to large datasets and high-end training modules. The accuracy of these methods can also be unreliable when dealing with images taken under varying conditions. To address this issue, we developed a model that utilizes a bounding box to identify obscene areas and detection algorithms to achieve high testing accuracy. By enclosing the obscene area in a bounding box, the algorithm can focus on analysing only that specific region, which significantly improves its efficiency and accuracy. Additionally, the bounding box provides valuable information such as the location and size of the obscene area, which can be useful in content moderation and criminal investigations. We also incorporated the Swin transformer, a technique that has not been widely used in this context, to focus attention on specific obscene areas. The Swin transformer is a neural network architecture that can assist in detecting obscene regions in images by breaking the input image into smaller patches. These patches are processed by a sequence of self-attention layers trained to recognize relevant features and relationships between the patches. By doing so, the Swin transformer can selectively concentrate on the parts of the image that are more likely to contain obscene content and dismiss unimportant areas. As a result, it can enhance the model's accuracy by decreasing the amount of extraneous and unnecessary information the algorithm needs to handle. Moreover, integrating the Swin transformer with the bounding box approach enables the model to target the specific obscene areas of the image, resulting in a more efficient and precise detection algorithm. To achieve a high degree of precision, we believe it is necessary to develop both a powerful algorithm and a real-time annotated obscene dataset.

3 | MATERIALS AND METHODS
This section provides a comprehensive explanation of the proposed methodology ASYv3, and Figure 5 presents a graphical representation of the proposed framework. This section also explains the dataset and augmentation process.

3.1 | Dataset and augmentation
This section discusses the method of data collection, the number of images included in the dataset, and the augmentation mechanism used to obtain a diverse range of images. Over 15,000 pornographic images were selected from websites that do not require purchasing a copyright. Since existing models can misclassify images with different angles, skin tones, contrast levels, and other variations, it is important to comprehensively augment the collected dataset. To cover objectionable images with different pixel intensities, levels of contrast (strong, medium, and light), chroma values, colour spaces, and saturation values, the collected dataset was thoroughly enriched.
The training utilizes 9000 images, the validation process includes 4500, and testing uses 3000. The datasets were meticulously annotated with the help of Label Studio software. The specifications of the dataset used for training are outlined in Table 1. Figure 4 illustrates an example of the data augmentation process, showcasing how it generates diverse images by applying various techniques.
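As a minimal sketch of the augmentation operations named above (brightness/contrast/saturation jitter, zooming, and alpha blending), the snippet below shows one possible implementation; the enhancement factors, crop scale, and blend weight are assumptions, since the paper does not list its exact parameters.

```python
# Assumed augmentation pipeline (illustrative only); expects RGB input images.
from PIL import Image, ImageEnhance
import random

def augment(img: Image.Image) -> list[Image.Image]:
    out = []
    # Brightness / contrast / saturation jitter ("light" and "strong" variants
    # are assumed settings, mirroring the contrast levels named in the text).
    for factor in (0.6, 1.4):
        out.append(ImageEnhance.Brightness(img).enhance(factor))
        out.append(ImageEnhance.Contrast(img).enhance(factor))
        out.append(ImageEnhance.Color(img).enhance(factor))
    # Zooming: crop a central region at a random scale and resize back.
    w, h = img.size
    scale = random.uniform(0.7, 0.9)
    cw, ch = int(w * scale), int(h * scale)
    left, top = (w - cw) // 2, (h - ch) // 2
    out.append(img.crop((left, top, left + cw, top + ch)).resize((w, h)))
    # Alpha blending with a uniform grey canvas.
    grey = Image.new("RGB", img.size, (128, 128, 128))
    out.append(Image.blend(img, grey, alpha=0.3))
    return out
```

Note that geometric augmentations such as zooming also require the bounding box coordinates in the annotation files to be transformed accordingly.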

3.2 | Theories of proposed ASYv3
The following is a detailed explanation of the proposed ABP-embedded Swin transformer-based YOLOv3 (ASYv3) for obscenity detection. Figure 5 provides a diagrammatic representation of the proposed approach. ASYv3 is a two-step modification of YOLOv3, which utilizes Darknet-53 as its feature extractor block. In the first step, the Swin transformer is integrated into YOLOv3. In the second step, the Swin transformer is modified by replacing the embedding layer with an ABP, and the modified Swin transformer is then utilized inside YOLOv3. The subsequent sections discuss the YOLOv3 architecture, Swin-YOLOv3, and the proposed ABP embedded Swin transformer-based YOLOv3.

FIGURE 3 Some non-obscene samples from the AOI dataset.

FIGURE 4 Process of data augmentation for a diversified dataset (alpha blending, zooming, and brightness adjustment of the original image).

3.2.1 | Backbone detection model: YOLOv3
YOLOv3 is a state-of-the-art real-time object detection technology (Redmon & Farhadi, 2018). In the literature, there are several detection methods (Liu et al., 2016;Redmon et al., 2016;Ren et al., 2015); however, in terms of mean average precision (mAP) and intersection over union (IOU) values, YOLOv3 is both precise and quick. Other detection models may not be able to detect small features, which is a significant limitation.
As a result, in some pornographic images, the private parts may not be clearly visible due to seductive movements, making it difficult for a model to determine whether the image is indecent. YOLOv3 is a desirable choice for accurately detecting small objects, which may be challenging for other models. The suitability of YOLOv3 versus other versions of the YOLO algorithm depends on several factors, such as the specific requirements of the application, user expertise, resources, and constraints. Compared with other detection algorithms, YOLOv3 offers a favourable combination of accuracy and resource efficiency, whereas some alternatives require more computational resources. This makes YOLOv3 a better option for users with limited resources, as it can achieve high accuracy whilst being less demanding in terms of memory and processing power. Specifically, for the objective of detecting pornographic content in small areas of images, YOLOv3 may be preferable due to its high accuracy in detecting small objects, particularly compared with versions like YOLOv4 or YOLOv5, which may offer lower accuracy at similar inference times. Furthermore, YOLOv3 comprises 106 neural network layers and three anchor boxes per scale, enabling the identification of tiny features to detect less obvious indecent parts in certain photos or films. It features a deep neural network architecture consisting of several blocks: a backbone network, a neck, and a head. Darknet-53 provides the feature extractor backbone, whilst the neck is made up of convolutional and pooling layers that compress the features spatially and increase their depth.
FIGURE 5 The framework of the proposed methodology, ABP-embedded Swin transformer-based YOLOv3 (ASYv3).

The head, the network's detection portion, has multiple output layers that predict the bounding boxes and class probabilities of objects in an image at varying scales. The head combines convolutional and fully connected layers to make these predictions. The predicted bounding boxes are then filtered using non-maximum suppression to eliminate overlapping boxes and keep only the most confident detections. The core mathematical principle behind YOLOv3 is the use of CNNs to extract features from the input image. These networks have numerous layers of filters trained to detect specific features in the input data, with each layer passing its output to the next. The final result is a set of features used to make predictions about the input image. YOLOv3 also uses anchor boxes, which are mathematical constructs with varying aspect ratios that guide the prediction of the final bounding boxes of objects in the image. These anchor boxes represent the model's expected object shapes. Finally, YOLOv3 adopts a localization-classification loss function that trains the model and adjusts network weights to reduce the gap between predicted and ground-truth bounding boxes. When YOLOv3 processes an image, it first splits the image into an S × S grid. Each cell in the grid then predicts N bounding boxes and their confidence levels, which reflect how accurately the bounding boxes depict real-world objects and whether they contain an object. YOLOv3 can also predict the categorization of each box for each trainable category, allowing the probability that each category is present in a given box to be combined into a single value. As a result, YOLOv3 predicts S × S × N boxes in total.
In brief, the YOLOv3 pipeline in ASYv3 can be broken down into several steps, as sketched after this list:

1. Input preprocessing: the input obscene/non-obscene image is resized and normalized.
2. Feature extraction: the Darknet-53 backbone extracts hierarchical features from the image.
3. Multi-scale feature maps: the neck compresses the features spatially and increases their depth, producing feature maps at three scales.
4. Grid division: each feature map is divided into an S × S grid of cells.
5. Bounding box prediction: each cell predicts N bounding boxes from the anchor boxes, together with confidence scores.
6. Class prediction: class probabilities are predicted for every box.
7. Non-maximum suppression: overlapping boxes are filtered so that only the most confident detections remain.
8. Loss calculation: the distinction between the expected and actual bounding boxes is calculated using a loss function. This loss is then used to adjust the network weights.
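The sketch below illustrates how a YOLOv3-style head output is organized and decoded into the S × S × N box predictions described above. The grid size, anchor count, and class count are assumptions for illustration, not the paper's exact configuration.

```python
# Schematic decoding of one YOLOv3-style detection head (illustrative only).
import torch

S, N, num_classes = 13, 3, 1                         # grid size, anchors per cell, classes
pred = torch.randn(1, N * (5 + num_classes), S, S)   # raw head output

# Reshape to (batch, S, S, N, 5 + classes): tx, ty, tw, th, objectness, class scores.
pred = pred.view(1, N, 5 + num_classes, S, S).permute(0, 3, 4, 1, 2)

xy = torch.sigmoid(pred[..., 0:2])   # box centre offsets inside each grid cell
wh = pred[..., 2:4].exp()            # width/height scales applied to the anchor boxes
obj = torch.sigmoid(pred[..., 4:5])  # objectness confidence
cls = torch.sigmoid(pred[..., 5:])   # per-class probabilities

# Final per-box score = objectness * class probability; boxes above a
# threshold are kept and passed on to non-maximum suppression.
scores = obj * cls
keep = scores > 0.5
```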

3.2.2 | Swin transformer integrated YOLOv3 (Swin-YOLOv3)
Since the original vision transformer, arguably the most intriguing development has been the Swin transformer. The Swin transformer, developed by (Liu et al., 2021), is a transformer-based deep learning model that achieves state-of-the-art results in visual challenges. Swin transformers form the foundation of many modern vision-based models because of these admirable characteristics. Hierarchical feature maps and shifted window attention were two significant features introduced by the Swin transformer to remedy the shortcomings of the original vision transformer. In fact, the name "Swin transformer" derives from "Shifted window transformer".
There are two units inside each Swin transformer block, as sketched below. Each unit includes an initial normalization layer, an attention module, an additional normalization layer, and a multi-layer perceptron (MLP) layer. The window multi-head self-attention module is utilized by the first unit, whereas the shifted window multi-head self-attention module is used by the second unit. To facilitate global self-attention, the default multi-head self-attention (MSA) computes the relationship between each patch and all other patches. Consequently, the overhead of the approach grows quadratically with the number of patches, rendering it inappropriate for high-resolution images. Hence, the Swin transformer employs a window-based multi-head self-attention (W-MSA) method to resolve this issue.
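A condensed sketch of this two-unit layout (LayerNorm, (S)W-MSA, LayerNorm, MLP, each with a residual connection) is given below. Window partitioning, window shifting, and the relative position bias are omitted for brevity, so this illustrates the block structure under those simplifying assumptions rather than a full Swin implementation.

```python
# Illustrative Swin block layout; a plain MultiheadAttention stands in for
# the windowed (S)W-MSA modules, which in real Swin restrict attention to
# (shifted) local windows.
import torch.nn as nn

class SwinUnit(nn.Module):
    def __init__(self, dim: int, heads: int, shifted: bool):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.shifted = shifted  # True for the SW-MSA unit (shift omitted here)

    def forward(self, x):                  # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]      # residual around attention
        x = x + self.mlp(self.norm2(x))    # residual around MLP
        return x

# One Swin block = a W-MSA unit followed by an SW-MSA unit.
block = nn.Sequential(SwinUnit(96, 3, shifted=False),
                      SwinUnit(96, 3, shifted=True))
```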
A window is a collection of patches, and only the content within a window receives computational attention. An illustration of how attention is calculated within each window using W-MSA and a window size of two patches by two patches is provided below. The initial sublayer consists of W-MSA. It divides the feature map into non-overlapping windows before computing self-attention in each of these localized windows. Given that each window comprises m × m patches, the computational costs of a global MSA unit and a window-based MSA unit built on an image of ht × wt patches are given in Equations (1) and (2):

Ω(MSA) = 4·ht·wt·cl² + 2·(ht·wt)²·cl,    (1)

Ω(W-MSA) = 4·ht·wt·cl² + 2·m²·ht·wt·cl,    (2)
where ht, wt, and cl are the height, width, and channel count of the image, respectively, and m is the window size of the respective image feature. An evident drawback of window-based MSA is that limiting self-attention to individual windows reduces the network's modelling capacity. As a result, the Swin transformer employs a shifted window MSA (SW-MSA) module following the W-MSA module to solve this problem. The subsequent layer adopts a different windowing configuration from the preceding W-MSA by shifting the windows away from the regularly partitioned windows by a certain number of pixels. A visual diagram of the SW-MSA is illustrated in Figure 6. Successive Swin transformer blocks using the shifted window partitioning method can be calculated with Equation (3):

ẑ^l = W-MSA(LN(z^(l−1))) + z^(l−1),
z^l = MLP(LN(ẑ^l)) + ẑ^l,
ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l,
z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1),    (3)

where z^l denotes the output of the l-th unit and LN is layer normalization. In Swin-YOLOv3, we substitute several CBL and upsampling layers with the Swin transformer encoder block after the output produced by the Darknet-53 block. As the Swin transformer block is applied after the Darknet-53 block, where the feature map size is very small, it reduces the computational load and saves memory.
This can help to make an integrated model more efficient and allow it to run on devices with limited computational resources. This approach can potentially increase the performance of obscene region detection tasks.
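To make the savings implied by Equations (1) and (2) concrete, the small sketch below evaluates both costs for an assumed 56 × 56 patch map with 96 channels and 7 × 7 windows; these sizes are illustrative, not the configuration used in ASYv3.

```python
# Numeric illustration of Equations (1) and (2): global MSA cost grows
# quadratically in the number of patches, W-MSA only linearly.
def msa_cost(ht, wt, cl):
    return 4 * ht * wt * cl**2 + 2 * (ht * wt)**2 * cl       # Equation (1)

def wmsa_cost(ht, wt, cl, m):
    return 4 * ht * wt * cl**2 + 2 * m**2 * ht * wt * cl     # Equation (2)

# Example: a 56x56 patch map with 96 channels and 7x7 windows.
print(f"MSA:   {msa_cost(56, 56, 96):,}")      # ~2.0e9 operations
print(f"W-MSA: {wmsa_cost(56, 56, 96, 7):,}")  # ~1.45e8 operations, ~14x cheaper
```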

3.2.3 | Attention-enabled pooling embedded Swin transformer inside YOLOv3
The attention mechanism has recently been implemented as an aggregate function in multiple instance learning (Ilse et al., 2018). Attention-enabled pooling uses the same attention mechanism proposed in the transformer model (Vaswani et al., 2017) with a significant modification. In normal attention, the implementation dimension is the same as the input X. In ABP, however, the batch image dimensions are reduced from (N, C, H, W) to (N, C′), so that the output is one-dimensional and linear, where N is the number of input samples and C′ is the desired size of the 1D output.
Furthermore, learned aggregation in attention pooling is an important modification developed by (Touvron et al., 2021). With certain tweaks, it can function as a conventional self-attention mechanism whilst also serving as a learnable pooling mechanism. ABP uses a learnable fixed vector q_c to create a channel query Q_c via the linear layer W_Q; Q_c is generated by matrix multiplication of the channel query vector q_c and the query weight matrix W_Q, as depicted in Equation (4). The matrix size of the channel query vector q_c is (1, C), where C is the number of channels needed to pass the output from ABP to the Swin transformer block. Equations (4) and (5) give the mathematical formulation of Q_c and the ABP strategy, where N, C, H, and W are the batch, channel, height, and width, respectively:

Q_c = q_c · W_Q,    (4)

ABP(X) = softmax(Q_c · K^T / √d_k) · V.    (5)
Apart from Equation (5), the other self-attention mechanisms operate as proposed in the transformer model to generate a learnable embedding of linear size. q_c is the learnable embedding vector generated by the Swin transformer's channel attention mechanism; it represents the importance of each channel in the input data and is used to weight the corresponding rows of the Q matrix in the self-attention calculation. W_Q is the learnable weight matrix applied to the q_c vector to generate the final Q matrix used in the self-attention calculation. The purpose of this weight matrix is to transform the channel attention vector into a form that is compatible with the other self-attention mechanisms in the transformer model.
Here K is the key matrix and V is the value matrix, as proposed in the original transformer model, and d_k is the dimension of the key matrix. The factor 1/√d_k is used to scale down the dot product of the q_c vector and the transpose of the K matrix before the softmax function is applied. This allows for more stable training, as multiplying values can have an exploding effect; scaling down prevents the values from becoming too large, which can cause numerical instability during training. The parameters in the equations above collaborate to facilitate the Swin transformer's self-attention whilst also considering the significance of each channel in the input data. The channel attention mechanism, denoted by q_c and W_Q, enables the model to learn the most important channels for the given task. Additionally, the key and value matrices K and V help the model capture relationships between the various elements of the sequence. The softmax function and scaling factor normalize and stabilize the computations during training, guaranteeing effective learning by the model.
In the original Swin transformer model, linear embeddings are used, which transform the input into a linear representation with an arbitrary dimension C. The linear embedding layer applies a simple projection of features, which disrupts the feature context regardless of the neighbourhood. In images, the context information in a small neighbourhood is essential. As a result, we replaced the linear embeddings with ABP, which projects raw-valued features into linear form whilst also giving fair consideration to each feature map via the attention module. Figure 8 shows the ABP-embedded Swin transformer; a minimal sketch of the ABP idea follows.
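The sketch below is one possible PyTorch rendering of Equations (4) and (5): a learnable channel query q_c is projected by W_Q and attends over the spatial positions, pooling (N, C, H, W) features into a linear (N, C′) output. Variable names follow the text; the layer sizes and the placement of the projections are assumptions, not the paper's exact implementation.

```python
# Assumed ABP sketch: learnable-query attention pooling over spatial positions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.q_c = nn.Parameter(torch.randn(1, in_channels))  # learnable fixed vector q_c
        self.W_Q = nn.Linear(in_channels, in_channels)        # Equation (4): Q_c = q_c W_Q
        self.W_K = nn.Linear(in_channels, in_channels)
        self.W_V = nn.Linear(in_channels, out_channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (N, H*W, C)
        q = self.W_Q(self.q_c)                         # (1, C)
        k = self.W_K(tokens)                           # (N, H*W, C)
        v = self.W_V(tokens)                           # (N, H*W, C')
        # Equation (5): scaled dot-product attention with the 1/sqrt(d_k) factor.
        scores = (q @ k.transpose(1, 2)) / (c ** 0.5)  # (N, 1, H*W)
        weights = F.softmax(scores, dim=-1)
        return (weights @ v).squeeze(1)                # (N, C')
```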

4 | RESULTS
This section illustrates the efficiency of detecting obscene images using our proposed ASYv3 model. The following are the three stages of the discussion.

4.1 | Experimental setup
Multiple measures, including testing accuracy, precision, sensitivity, false positive rate (FPR), negative predictive value (NPV), false negative rate (FNR), and mean average precision (mAP), are utilized to quantify the detection accuracy of our proposed model. Equation (6) gives the mathematical formulas for computing the various performance measures:

Accuracy = (TP + TN) / (TP + TN + FP + FN),
Precision = TP / (TP + FP),
Sensitivity = TP / (TP + FN),
NPV = TN / (TN + FN),
FPR = FP / (FP + TN),
FNR = FN / (FN + TP),
AP = ∫₀¹ p(r) dr,    (6)

where TP, TN, FP, and FN are the true positives, true negatives, false positives, and false negatives, respectively. Accuracy is the proportion of correct predictions across all test samples. Precision is the percentage of samples predicted as nude that truly are. Sensitivity is the number of accurately predicted nude samples relative to all true obscene images. NPV is the ratio of the number of samples correctly predicted as non-obscene to the total number of samples predicted as non-obscene. FNR is the proportion of incorrectly predicted non-obscene samples across all obscene images. FPR is the ratio of erroneously predicted obscene samples to real non-obscene images. The average precision is the area under the precision-recall (PR) curve, where p and r are the precision and recall values, respectively; average precision values range between 0 and 1. Based on these measures, we selected Swin-YOLOv3 for further modification and included the ABP unit in it to prevent features from being disrupted during training.
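As a worked illustration of Equation (6), the sketch below computes these measures from confusion-matrix counts; the counts used in the example are invented for illustration and are not taken from the paper's experiments.

```python
# Compact implementation of the measures in Equation (6).
def detection_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "precision":   tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "npv":         tn / (tn + fn),
        "fpr":         fp / (fp + tn),
        "fnr":         fn / (fn + tp),
    }

print(detection_metrics(tp=950, tn=980, fp=35, fn=25))  # hypothetical counts

# Average precision is the area under the precision-recall curve; given sampled
# precision/recall points it can be approximated numerically, e.g. with
# numpy.trapz(precision, recall).
```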

4.2 | Evaluation and discussion of experimental findings
Several experiments examined the viability of the proposed model and its progressively developed variants. Lower FPR and FNR values are preferred, as they indicate higher accuracy and reliability in detecting obscene regions. Furthermore, the proposed ASYv3 has the highest mAP value of 95.59%, followed by Swin-YOLOv3 with an mAP of 90.57% and YOLOv3 with an mAP of 82.62%. This indicates that ASYv3 has the best overall accuracy in detecting objects across different classes, as it has the highest mAP value amongst the three models. mAP is a crucial metric for evaluating the performance of object detection models, as it considers both precision and recall, which are important factors in detecting objects accurately.
The confusion matrices of YOLOv3, Swin-YOLOv3, and ASYv3 are illustrated in Figure 9. The training and validation accuracy curves for all the developed models on the AOI dataset are presented in Figure 10. Table 2 displays the results of experiments run on images from the AOI dataset using both the original and upgraded versions of YOLOv3. As Table 2 shows, ASYv3 outperforms Swin-YOLOv3 and YOLOv3 in terms of detection accuracy, precision, sensitivity, NPV, and mAP. ASYv3 also has significantly lower false positive rate (FPR) and false negative rate (FNR) values than the other models. Figure 11 displays the validation loss curves for all the models; the curve of ASYv3 converges more quickly and reaches a lower loss value. In summary, the ASYv3 model maintains a high level of detection accuracy owing to effective deep feature extraction and appropriate attention to the relevant features needed in the obscene detection task. Figure 12 shows some detection results with annotated obscene regions produced by the proposed ASYv3 model. Table 3 details the results of an ablation study on three different models (M1, M2, and M3), each representing a different combination of the ABP and Swin transformer components in the architecture. In an ablation study, different components of a model are removed or modified to analyse the effect on the overall efficiency of the model. In this table, the two columns marked with "×" or "✓" indicate whether certain components were included in each model. Specifically, "×" indicates that the corresponding component was not used, whilst "✓" indicates that it was.

4.3 | Ablation study
Here's a breakdown of the three models and the differences between them: 1. Model 1 (M1) does not include a transformer or any other modifications to the architecture.
2. Model 2 (M2) includes a transformer inside YOLOv3, but not the ABP module.
3. Model 3 (M3) includes both a transformer and the ABP module in the architecture.

FIGURE 9 Confusion matrix of all the models using the AOI testing dataset.

TABLE 2 Comparison results of all the models using the AOI dataset.

4.4 | Performance analysis with state-of-the-art methods
The ASYv3 technique is compared to previous research conducted in the field of obscene content identification (AlDahoul et al., 2020; Liu et al., 2016; Redmon et al., 2016; Ren et al., 2015). Table 4 provides the evaluation results using the performance metrics for various object detection algorithms on different datasets. The datasets used for evaluation are the NPDI dataset, the Pornography-2k dataset, and the AOI dataset. The evaluation was conducted consistently by manually annotating the NPDI and Pornography-2k datasets to match the AOI dataset. In (AlDahoul et al., 2020), the authors utilized a YOLO-CNN, a variant of the YOLO algorithm, for feature extraction; the extracted features were then categorized into two distinct classes using a support vector machine (SVM). In (Redmon et al., 2016), the base YOLO algorithm is described: YOLO is an object detection algorithm that divides an image into a grid and predicts bounding boxes and class labels for the objects in each grid cell in a single pass. In (Ren et al., 2015) and (Liu et al., 2016), two further object detection algorithms are described.
Faster R-CNN is described in (Ren et al., 2015); it is a region-based convolutional neural network (R-CNN) approach that uses region proposal networks (RPN) to generate potential object proposals and then classifies and refines them. SSD (Single Shot MultiBox Detector) is described in (Liu et al., 2016); it is another object detection algorithm that uses a series of convolutional feature maps of different resolutions to detect objects at multiple scales and aspect ratios in a single shot. The algorithms compared in the table are YOLO-CNN, YOLO, Faster R-CNN, SSD, and the proposed ASYv3. The comparison results are presented in Table 4 and visualized in Figure 13. The performance metrics reported in Table 4 include accuracy values in percentage for each algorithm on the respective datasets. These values are broken down into categories such as overall testing accuracy, precision, sensitivity, NPV, and mean average precision (mAP), a common evaluation metric for object detection algorithms. Additionally, the table includes values for false positive rate (FPR) and false negative rate (FNR), which measure the algorithm's false detections and missed detections, respectively.

FIGURE 13 Comparison of the proposed ASYv3 with state-of-the-art obscene detection models using the AOI dataset.
It is worth noting that the proposed ASYv3 consistently outperforms the other algorithms in terms of accuracy on all three datasets, with the highest values reported in most categories. YOLO and Faster R-CNN also show competitive performance, whilst YOLO-CNN and SSD generally have slightly lower accuracy values. Overall, the table compares the performance of different object detection algorithms on different datasets, with the proposed ASYv3 showing promising results. The results, presented in tabular form in Table 4 and in graphical form in Figure 13, clearly show that the proposed ASYv3 model achieved higher accuracy than state-of-the-art methods on all three datasets. Specifically, the ASYv3 method achieved an average detection accuracy of 97.00% on the AOI dataset, surpassing the performance of the existing models. These findings highlight that the ASYv3 technique outperforms previous methods in identifying obscene content.

5 | CONCLUSIONS
This paper presents an innovative version of YOLOv3, entitled ASYv3, for the efficient detection of obscene content. Obscenity detection can have a positive impact on society by safeguarding vulnerable populations from explicit content, preventing online harassment and cyberbullying, combatting illegal activities, and promoting digital ethics. We rebuilt the YOLOv3 backend for enhanced feature extraction to mark obscene regions more accurately in obscene images. To effectively annotate the obscenity in obscene images, we redesigned the backend of YOLOv3 by incorporating a Swin transformer, which can enhance the accuracy of obscene image detection by capturing fine-grained details and contextual information from images. In addition, we replaced the embedding patches in the Swin transformer with attention-based pooling to focus on the necessary set of features. The proposed ASYv3 approach was evaluated against various existing methods using two publicly available datasets, namely NPDI and the Pornography-2k dataset. The proposed ASYv3 method outperforms all other models with 97% testing accuracy, 96.62% precision, 97.40% sensitivity, 3.48% FPR, 97.37% NPV, 2.60% FNR, and 95.59% mAP using the AOI dataset. Our proposed ASYv3 model can be integrated into multiple social media platforms, enabling efficient detection and blocking of pornographic images. This advancement can play a pivotal role in safeguarding the online environment, ensuring a clean and secure digital landscape for users in the future.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.