SBMYv3: Improved MobYOLOv3, a BAM attention-based approach for obscene image and video detection

Countless cybercrime instances have shown the need for detecting and blocking obscene material on social media sites. Deep learning methods (DLMs) have outperformed other approaches in recognizing the obscene content that floods many online platforms. However, these contemporary DLMs primarily treat the recognition of obscene content as a simple binary classification task rather than labelling the obscene regions. Hence, many of these methods fail to account for the diversity of misclassified samples. This paper therefore focuses on two aspects: (i) developing a deep learning model that can both classify obscene images and label their obscene portions, and (ii) generating a labelled obscene image dataset with a wide variety of obscene samples to minimize the risk of inaccurate recognition. We propose a method named S3Pooling based bottleneck attention module (BAM) embedded MobileNetV2-YOLOv3 (SBMYv3) for automatic detection of obscene content using an attention mechanism and a suitable pooling strategy. The key contributions of this article are: (i) generation of a well-labelled obscene image dataset with a variety of augmentation strategies using Pix-2-Pix GAN, (ii) modification of the backend architecture of YOLOv3 using MobileNetV2 and BAM to ensure focused and accurate feature extraction, and (iii) selection of an optimal pooling strategy, namely S3Pooling, while taking the design of the feature extractor into account. The proposed SBMYv3 model outperformed other state-of-the-art models with 99.26% testing accuracy, 99.39% recall, 99.13% precision, and 99.13% IoU.

As experts have noted, pornographic content on the internet deteriorates the moral standards of society and leads to cybercrimes such as sexual harassment, child exploitation, and so on. This calls for an effective detection algorithm that can automatically monitor obscene content on social media to prevent cybercrime. Deep neural networks have enabled considerable gains in a variety of domains, including object detection, pattern recognition, digital video forensics, and semantic segmentation (Ibrahim et al., 2022; Tiwari et al., 2022; Javed et al., 2021). Over the past few decades, machine learning and deep learning methods have also shown considerable achievements in categorizing obscene and non-obscene content. However, existing methods are limited to binary classification, that is, they can only categorize the image/video under consideration as obscene or non-obscene; they do not label the obscene regions in obscene content. During misclassification, a model may incorrectly classify certain obscene images as non-obscene owing to the presence of a large number of non-obscene background objects. Figure 2 depicts some obscene samples (Shutterstock, 2022) in which women stand in a forest surrounded by nature; the model is confused by the very small amount of space occupied by the naked person in the image and hence classifies it as non-obscene. In such a scenario, a detection algorithm that can annotate the naked/objectionable area of the obscene image becomes essential. This motivates us to propose a detection technique that classifies obscene images and annotates their naked/objectionable areas with a bounding box.
Initially, obscene image or video classification centred on human skin volume, with the notion that a higher volume of detected skin would indicate a greater likelihood of nudity. However, these methods are susceptible to a high proportion of false positives, particularly for the samples shown in Figure 2.
The major limitations of the BoW approaches are high processing costs and the inability to discern between naked and semi-naked sexual content. In contrast, deep learning models outperformed these methods. However, deep learning algorithms are primarily used to perform classification of obscene content with no emphasis on the annotation of the naked regions, which may result in a high rate of misclassification when a diverse set of test images is provided. In such instances, annotation of naked images is crucial. There exists a great deal of unexplored potential in CNN-based detection algorithms, which produce the target output with a bounding box. Recently, detection algorithms such as R-CNN and SSD MobileNetV2 (Srivastava et al., 2020), and a single method based on YOLO-CNN (AlDahoul et al., 2020), have been suggested to identify pornographic regions in obscene content. However, the aforementioned techniques suffered from mediocre classification accuracy, slow detection speed, and a time-consuming training procedure.
FIGURE 1 Distraction caused by obscene ads on social media
FIGURE 2 Misclassification of obscene samples by the existing models
The motivation of this study is to address the aforementioned issues. Here, nudity detection is performed using the single-stage detector YOLOv3, which is upgraded in three major steps. We propose an S3Pooling based BAM embedded MobileNetV2-YOLOv3 (SBMYv3) model for the detection of obscene images and the identification of pornographic regions within them. First, the proposed SBMYv3 model utilizes the lightweight MobileNetV2 as its backbone instead of the model's original backend, DarkNet-53. The MobileNetV2 architecture is adopted to optimize feature extraction while minimizing cost. Second, in the backend of the proposed YOLOv3, we incorporate an appropriate attention module, namely the bottleneck attention module (BAM). The purpose of attention-based YOLOv3 is to train the algorithm to pay close attention to the annotated naked regions in the dataset. The third major modification is a more refined pooling strategy: we upgrade to S3Pooling to make the model more generalizable by increasing its robustness to random distortion and its capacity for data augmentation at the pooling layer. For increased detection accuracy, we carefully generate a well-annotated dataset.
The key contributions of this paper are summarized as follows:
1. Generation of the GAN generated labelled obscene images (GLOI) dataset. The acquired obscene images are enriched by pixel transformations using Pix-2-Pix GAN followed by accurate labelling. A variety of image enhancement techniques, such as contrast enhancement, cropping, and blur processing, are utilized in the pixel transformation process.
2. Replacement of the backend architecture of the YOLOv3 model with the lightweight MobileNetV2 instead of Darknet-53, considering the benefits of the depthwise and pointwise convolutions along with the bottleneck residual layers.
3. Embedding the BAM attention module with the intention of paying special attention to the naked parts that are annotated in the dataset.
4. Replacement of average pooling with S3Pooling after the architecture was upgraded with MobileNetV2 and the BAM attention module. S3Pooling makes the model robust to random distortion and provides intermediate data augmentation at the pooling layer.
The remainder of the article is organized as follows. Section 2 summarizes relevant work on obscene content classification utilizing CNN, YOLO, and attention mechanisms. Section 3 highlights the proposed method and its framework, including dataset generation and analysis, YOLO's underlying architecture, and the proposed design framework. Section 4 outlines the implementation process, including the setup and analysis of the results using appropriate evaluation metrics. Section 5 provides the conclusion.

| RELATED WORK
This section reviews some significant contributions in the field of obscene content classification using machine learning (ML), CNN, and YOLO-based approaches. Additionally, it provides a comprehensive explanation of the attention mechanisms used by different CNNs to classify nudity content.

| ML-based obscene content classification
Initially, explicit content categorization relied on machine learning techniques such as skin colour segmentation, bag-of-words (BoW) analysis, and assessment of visual and motion features. SVM was the first method used to locate pornographic images, with a success rate of approximately 75% and an error rate of approximately 14% (Yin et al., 2012). Afterwards, different colour spaces, such as the RGB and YCbCr colour spaces, could identify explicit content with an accuracy of 88.8% and 5% false positives (Adnan & Nawaz, 2016). The BossaNova descriptor was used for obscene classification in (Caetano et al., 2016), but it experienced issues with beach sceneries, breastfeeding films, and toddler bathing. In (Tripathi & Piccinelli, 2008), an MPEG video stream-based method was presented to lower the error rate by fusing static and dynamic motion information on the Pornography-2k dataset; in this regard GoogleNet was found to be superior to other CNN models, with an accuracy of 82%.
However, all of these algorithms exhibit significant limitations, including a high proportion of false positives from incorrect skin pixel counts, high processing costs (in the BoW approach), and repeated obscene image/video misclassifications, which motivated the investigation of CNN-based categorization of such sensitive content.

| CNN-based obscenity classification
Considering the drawbacks of machine learning algorithms, researchers focused on the deep learning classification process. A VGG-16 model was used to classify pornographic images with 93.8% accuracy (Agastya et al., 2018). Using optical flow and MPEG motion vectors, a unique method was proposed for fusing static and dynamic motion data for obscene video classification, which yielded a classification accuracy of 97.9% with an error reduction of 64.4% (Perez et al., 2017). In (Qamar Bhatti et al., 2018), pornography was classified using a ResNet-50-based residual network, which achieved an accuracy of 95%. ACRODE, a combination of LSTM and CNN, was developed to categorize pornographic content using the NPDI dataset, achieving 95.3% accuracy (Wehrmann et al., 2018). In addition, VGG-16, ResNet-18, 34, 50, and other neural networks were used to construct a portable system for identifying pornographic content on smartphones and tablets (Nurhadiyatna et al., 2017). A DCNN was suggested in (Cheng et al., 2019) for adult picture categorization by integrating global and local contexts. Spurred by their high performance, CNNs have recently emerged as new possibilities for solving the classification and detection challenge. Based on how region predictions are made, deep learning-based object detection techniques fall into two types: R-CNN and Fast R-CNN are two-stage, while SSD and YOLO are one-stage detection algorithms. A two-stage detector first creates candidate bounding boxes through a region proposal network (RPN); classifying these proposed boxes then yields the final result. A two-stage detection architecture is computationally expensive and inefficient because the potential bounding boxes retrieved by the first-stage RPN restrict the effective range of the second-stage feature extraction. One of the most well-known one-stage detectors is the YOLO series, which treats object detection as a regression problem. Therefore, the one-stage object detection approach is plausibly superior to the two-stage detection algorithms for obscene detection. Nudity detection with 92% accuracy was accomplished using R-CNN and an SSD MobileNetV2 model (Srivastava et al., 2020). A YOLO-based approach (AlDahoul et al., 2020) using multiple CNNs was trained and validated with several classifiers for nudity detection. However, in certain scenarios some of the private areas are not visible enough in the image/video for the classification model to correctly categorize it as obscene. To avoid such misclassification, several detection algorithms have been proposed in the literature that can be used for both classification and annotation of targeted obscene areas in an obscene image. The YOLO family (Redmon et al., 2016; Redmon & Farhadi, 2017; Yi et al., 2019; Redmon & Farhadi, 2018; Bochkovskiy et al., 2020; Jocher et al., 2021), SPP-Net (He et al., 2015), R-CNN (Kido et al., 2018), Faster R-CNN (Girshick et al., 2014), and SSD (Liu et al., 2016) have shown significant advancement in achieving both classification and detection.

| Attention-based CNN for obscene content classification
Deep learning based pornography detection methodologies outperform other methods. However, they show limited performance when the pornographic content is low in contrast, the person wears skin-coloured clothing, a naked person stands in a forest, and so on. Often, the model is trained only for binary classification, that is, either obscene or non-obscene content. But some images or videos fall somewhere between the two and are considered seductive or worrisome for kids and teens. In such cases, deep learning models integrate an attention mechanism that focuses on the objectionable portions of obscene images to avoid classification errors. The attention module has a significant impact on the training experience (Guo et al., 2022). The various advantages of the attention mechanism have led to its widespread use in image classification, object recognition, natural language processing, and so on. However, its adoption in the field of obscene image classification is only slowly emerging. A unique dot-product-based attention technique using a CNN-based architecture was proposed for pornography detection with an accuracy of 92.72% (Gangwar et al., 2021). A moderate CNN known as DOCAPorn was proposed, utilizing the visual attention mechanism CBAM and Scale Constraint Pooling (SCP) to better evaluate relevant portions of the image while minimizing within-class fluctuation and maximizing the distance between classes; it attained an accuracy of 98.41% (Chen et al., 2020). Owing to the scarcity of such attention-based approaches for pornographic image classification, we aim to combine YOLO, a modified backend CNN, and a suitable attention mechanism to improve the classification and detection of pornographic images and videos.

| PROPOSED METHOD
This section describes the step-by-step approach of the proposed S3Pooling based BAM attention embedded MobileNetV2 YOLOv3 (SBMYv3) framework. Alongside the framework, it also illustrates the generation of the labelled dataset and its augmentation procedure. A thorough ablation study of the proposed framework is also conducted. Figures 3 and 4 illustrate the proposed framework and the basic workflow for detecting obscene images using the proposed model. The following subsections briefly discuss the conventional YOLOv3 model, followed by a detailed discussion of each modification integrated into the architecture. Finally, the entire process of generating the GLOI dataset is also described in this section.

| YOLOv3
YOLO stands for You Only Look Once. This algorithm uses a CNN for real-time object recognition and is popular due to its enhanced learning capability, operational efficiency, and recognition accuracy. It operates primarily through the use of residual blocks, bounding box regression, and intersection over union (IoU). A single convolutional network predicts multiple bounding boxes and their associated class probabilities simultaneously. Each image in the training set is partitioned into S × S grids by the YOLO network. To detect a target, a grid cell has to know the location of the ground truth target region. Using class probability information, each grid cell predicts B bounding boxes and their confidence ratings. Within the object's bounding box, the current grid cell predicts the following values for the object: location (x, y), width (w), height (h), confidence (c), and IoU. The conventional YOLOv1 model uses a (448 × 448) input image and consists of 24 convolutional layers and two fully connected layers. The major shortcoming of YOLOv1 is its failure to grasp tiny details. In pornographic images the private parts are sometimes barely visible because of seductive gestures, so the model becomes confused about whether the image is obscene or not. Afterwards, various higher versions of the YOLO family were proposed in the literature. Compared to YOLOv1, YOLOv2 consists of 30 layers and includes anchor boxes, which assist in locating and detecting objects. The YOLOv3 version contains a 106-layer neural network and three anchor boxes per scale, which helps to identify everything from microscopic structures to large objects through three-scale predictions (Redmon & Farhadi, 2018). Hence, it can be utilized to identify less visible obscene areas in some images or videos. Figure 5a explains the detection process of YOLOv3 by extracting the bounding boxes and class probability results. Figure 5b depicts the architecture of the YOLOv3 model.
FIGURE 3 Illustration of the proposed framework
FIGURE 4 Basic work frame of the proposed model
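To make the IoU term concrete, the following minimal sketch shows how the overlap between a predicted box and a ground-truth box can be computed; the centre-based (x, y, w, h) box format and the function name are illustrative assumptions, not code from the original YOLOv3 implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_center, y_center, w, h).

    A minimal sketch of the IoU term used in YOLO's confidence score;
    the box format and names are illustrative assumptions.
    """
    # Convert (x, y, w, h) to corner coordinates (x1, y1, x2, y2).
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


# Example: two heavily overlapping boxes yield a high IoU (about 0.78 here).
print(iou((0.5, 0.5, 0.4, 0.4), (0.55, 0.5, 0.4, 0.4)))
```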

| Embedding MobileNetV2 in the backend of YOLOv3 (MYv3)
The conventional YOLOv3 model has Darknet-53 in its backend to execute feature extraction. In the proposed work, this backend is replaced by the MobileNetV2 model, and the result is termed the MobileNetV2 based YOLOv3 (MYv3) model. MobileNetV2 performs depthwise convolutions followed by pointwise convolutions (Sandler et al., 2018). Here, the convolutional operation is performed in two independent layers. In the first layer, one convolutional filter is used per input channel to accomplish lightweight filtering, known as depthwise convolution. Then additional features are created by a (1 × 1) convolution that linearly combines all input channels, known as pointwise convolution. Together, depthwise and pointwise convolution preserve a wide variety of input features at a lower computational burden than conventional 2D convolution. Along with this benefit, the linear bottleneck strategy in the model helps to reduce the non-linearities in the data, whereas the shortcut connections between the bottleneck layers allow faster convergence. Hence, the replaced lightweight MobileNetV2 model extracts deeper features at a faster rate. Figure 6a represents the depthwise and pointwise convolution architecture, and Figure 6b elucidates the bottleneck layers of the stride-1 and stride-2 blocks in MobileNetV2.
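The depthwise/pointwise pair can be sketched in a few lines of PyTorch, shown below. This is a minimal illustration of the factorized convolution, not the exact MobileNetV2 block (which additionally expands channels and adds residual shortcuts); layer names and tensor sizes are our own choices.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A minimal sketch of the depthwise + pointwise convolution pair
    used by MobileNetV2; layer names and sizes are illustrative."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups == in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution that linearly combines the channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)  # MobileNetV2 uses ReLU6

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))


x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 56, 56])
```

Because the 3 × 3 filtering is applied per channel and the cross-channel mixing is a 1 × 1 convolution, the parameter and FLOP counts drop by roughly a factor of the kernel area compared with a standard convolution, which is the cost saving the text refers to.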

| Inclusion of S3Pooling in MYv3 model (SMYv3)
This subsection describes the inclusion of S3Pooling in the MYv3 model to enable the learning of invariant features and to further alleviate overfitting. The MYv3 model using S3Pooling is termed the SMYv3 model. Instead of average pooling, the SMYv3 model utilizes the S3Pooling strategy, a combination of mixed pooling and stochastic downsampling (Sharma & Mehra, 2019). To do this, the feature map is first partitioned into h = r/g horizontal and v = c/g vertical strips, where r and c are the rows and columns of the feature map, g is the grid size, and s is the stride. Afterwards, g/s rows and g/s columns are randomly chosen from each horizontal or vertical strip. This yields a downsampled feature map of size r/s × c/s. Figure 7 presents an example demonstrating the S3Pooling strategy. In this example, the feature map size is 4 × 4. With stride s = 1 in step 1, no padding is applied to any of the sides, whereas stride s = 2 is used in step 2 to obtain the downsampled feature map. At each training epoch, S3Pool introduces a random distortion into the feature maps due to the stochastic nature of sampling in step 2. It also offers the ability to manage the distortion by adjusting the grid size (g) to suit a variety of architectures and applications. These distorted feature maps lead to a virtual data augmentation at the intermediate layers, which makes S3Pool a strong regularizer. It enables greater generalization by preserving the varieties of intermediate augmented images. As a result, the model is familiarized with three types of samples: original images, images augmented by Pix-2-Pix GAN, and images augmented by S3Pooling.
FIGURE 7 S3Pooling process
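The two-step procedure can be sketched as follows, assuming a 4-D feature tensor whose spatial dimensions are divisible by the grid size; the function name and defaults are our own, and this is a simplified illustration of the sampling scheme rather than the exact implementation used in SMYv3.

```python
import torch
import torch.nn.functional as F

def s3pool(x, grid=2, stride=2):
    """A minimal sketch of S3Pooling for a (N, C, H, W) tensor, assuming
    H and W are divisible by `grid`; names and defaults are illustrative.

    Step 1: max pooling with stride 1 (spatial size preserved).
    Step 2: from every strip of `grid` rows/columns, randomly keep
            grid/stride of them, giving an (H/stride, W/stride) map.
    """
    n, c, h, w = x.shape
    x = F.max_pool2d(x, kernel_size=stride, stride=1, padding=stride // 2)
    x = x[..., :h, :w]  # trim the padding overshoot back to (H, W)

    keep = grid // stride  # rows/cols retained per strip
    rows, cols = [], []
    for start in range(0, h, grid):
        rows += sorted((start + torch.randperm(grid)[:keep]).tolist())
    for start in range(0, w, grid):
        cols += sorted((start + torch.randperm(grid)[:keep]).tolist())

    # Index the randomly chosen rows, then columns.
    return x[:, :, rows, :][:, :, :, cols]


x = torch.randn(1, 3, 4, 4)
print(s3pool(x, grid=2, stride=2).shape)  # torch.Size([1, 3, 2, 2])
```

Because the retained rows and columns are re-drawn at every forward pass, each epoch sees a slightly different downsampled map, which is the virtual augmentation effect described above.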

| BAM attention embedded in SMYv3 (SBMYv3)
In the SMYv3 model, the bottleneck attention module (BAM) is included in the MobileNetV2 network to narrow down the focus when localizing specific obscene areas. The BAM-embedded SMYv3 is termed the SBMYv3 model. The BAM module has two components: the channel attention module and the spatial attention module (Park et al., 2018). For an input feature map $F_m \in \mathbb{R}^{C \times H \times W}$, BAM provides an attention map $A(F_m) \in \mathbb{R}^{C \times H \times W}$, where C, H, and W are the channels, height, and width of the feature map. The updated feature map $F_m'$ after the incorporation of BAM attention is computed using Equation (1):

$$F_m' = F_m + F_m \otimes A(F_m) \tag{1}$$

where $\otimes$ is the element-wise product. Channel attention $A_{channel}(F_m)$ and spatial attention $A_{spatial}(F_m)$ are computed independently, and the required BAM attention map is computed using Equation (2):

$$A(F_m) = \sigma\left(A_{channel}(F_m) + A_{spatial}(F_m)\right) \tag{2}$$

where σ is the sigmoid function. Keeping this module at each bottleneck tier is a sound decision because it is quite adaptable. BAM begins by suppressing low-level information, such as background texture features, and then narrows its focus to the specific target, which carries high-level semantics. The final attention map is obtained by combining channel and spatial attention. To prevent vanishing/exploding gradients, we need a smooth gradient flow in the model during backpropagation. Backpropagation updates the weights during feature extraction by finding the gradient, the derivative of the loss function; if the gradient flow is smooth, the neural network can move in the proper direction. Owing to this need for smooth gradient flow, element-wise summation is chosen over multiplication when combining channel and spatial attention. Element-wise multiplication can assign a large gradient to a small input, making it hard for the network to converge and resulting in mediocre performance. Element-wise summation easily integrates and protects the data from previous levels. In addition, it enables the network to utilize data from the two distinct branches, channel and spatial, without any information loss. A schematic diagram of the BAM architecture is depicted in Figure 8.
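Equations (1) and (2) map directly onto a short PyTorch module, sketched below. The reduction ratio and dilation follow the defaults reported by Park et al. (2018), but exact layer sizes and the omission of batch normalization are simplifying assumptions of this sketch, not the paper's precise architecture.

```python
import torch
import torch.nn as nn

class BAM(nn.Module):
    """A minimal sketch of the bottleneck attention module (Park et al., 2018):
    channel and spatial branches are summed, squashed by a sigmoid (Eq. 2),
    and the result refines the input as F' = F + F * A(F) (Eq. 1)."""

    def __init__(self, channels, reduction=16, dilation=4):
        super().__init__()
        # Channel branch: global average pool followed by a bottleneck MLP.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # Spatial branch: 1x1 reduction, two dilated 3x3 convs, 1x1 to one map.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.Conv2d(channels // reduction, channels // reduction, 3,
                      padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels // reduction, 3,
                      padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, 1),
        )

    def forward(self, x):
        # Element-wise summation of the two branches (broadcast over
        # channels/space), sigmoid, then the residual refinement of Eq. (1).
        attention = torch.sigmoid(self.channel(x) + self.spatial(x))
        return x + x * attention


x = torch.randn(1, 64, 32, 32)
print(BAM(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```

Note that the channel branch outputs a C × 1 × 1 map and the spatial branch a 1 × H × W map; the summation broadcasts both to C × H × W, which is exactly the element-wise combination preferred over multiplication in the discussion above.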
However, most existing datasets lack annotations of the obscene regions. Moreover, detection models trained on sample images from these datasets provide erroneous classifications when tested with images set in the countryside or on beaches. Here, we have generated a well-labelled dataset by using a Pix-2-Pix GAN augmentation technique. The input images to this GAN model were manually collected from copyright-free online pornographic platforms such as "https://tubepornlist.com/", "https://pornabc.com/", "www.pornhub.com", "https://thepornlist.com/", and "https://www.shutterstock.com/search/naked". Similarly, non-obscene images were collected from copyright-free online platforms such as "https://www.shutterstock.com/search", "https://freerangestock.com/", and "https://pixabay.com/". The following discussion provides the detailed steps to generate the dataset.

| GAN generated labelled obscene images
We initially gathered obscene images from copyright-free online pornographic sites with adult material that offer free downloads. We used the Google search engine with keywords such as "naked in nature", "nude images in beaches", "semi-naked images", "young teen exposed selfie", "seduced semi-naked pictures", and so forth, to collect a variety of obscene images. Similarly, we used keywords such as "celebrities", "industrial people", "food & drink", "sports people", "holidays family", "artist", and so forth, to collect non-obscene images. We categorized "infant nude" in the non-obscene category, whereas "children nude", "nude teenagers", "nude adults", and "older nude people" were categorized as obscene. We collected images of individuals from all over the world, such as "African", "American", "east Asian", "middle eastern", "southeast Asian", "Chinese", "Japanese", "native American", and "Brazilian". A total of 6000 images were collected, with a 50:50 split between obscene and non-obscene images to ensure roughly the same number of training examples in each class and hence a balanced dataset; otherwise, the classifier may be biased towards assigning an image to the majority class with the most training samples. Furthermore, these images were subjected to pixel transformations via Pix-2-Pix GAN to generate more augmented samples. Through data augmentation, the volume of data increases from 6000 to 13,000. These images were manually annotated via the LabelImg software (Labellmg, 2022).
FIGURE 8 BAM attention module
Images with obscene content are labelled as "Obscene", whereas images with no explicit content are labelled as "Non-Obscene". The generated dataset is termed the GAN generated labelled obscene image (GLOI) dataset. It consists of 13,000 images: 6500 obscene and 6500 non-obscene. Figure 9 depicts some sample images from each category of the GLOI dataset.
Section 3.6.2 describes the pixel transformation procedure via Pix-2-Pix GAN for data augmentation. While creating the dataset, the greatest attention was paid to the labelling component.

| Data augmentation
The dataset is augmented by performing pixel-by-pixel transformations such as contrast enhancement, noise addition, and brightness transformation. To diversify the dataset, Pix-2-Pix GAN (Henry et al., 2021) is used. The GAN executes a minimax optimization process over an adversarial loss function; the adversarial optimization function is depicted in Equations (3) and (5). The Pix-2-Pix GAN contains a generator network and a discriminator network. The generator conditionally creates the output image by performing the aforementioned operations on the input images. The generator network produces new images that match the input ground truth image by minimizing the loss function loss_L1(G) as described in Equation (6). The discriminator network measures how much the generated image differs from the input image and maximizes the loss function loss_GAN(G, D) as described in Equation (5). U-Net is used as the generator in the Pix-2-Pix GAN architecture, while PatchGAN is used as the discriminator. Figure 10 represents the block diagram of the Pix-2-Pix GAN architecture, and the hyperparameters used in Pix-2-Pix GAN are depicted in Table 1. Skin colour pixels, background colour (indoor or outdoor), and contrast level distinctions all contribute significantly to naked image recognition. Numerous individual images were subjected to the entire augmentation procedure, resulting in more complex, training-appropriate samples.
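The display equations referenced above did not survive extraction; for reference, the standard conditional-GAN objective on which Pix-2-Pix is built takes the form below. The original equation numbering (3), (5), and (6) could not be recovered, so it is omitted here.

$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x,z}\left[\log\left(1 - D(x, G(x, z))\right)\right]$$

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\left[\lVert y - G(x, z) \rVert_1\right]$$

$$G^{*} = \arg\min_{G} \max_{D}\; \mathcal{L}_{GAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G)$$

Here x is the input image, y the ground-truth target, z random noise, and λ weights the L1 reconstruction term against the adversarial term.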
FIGURE 9 Non-obscene and obscene samples from GLOI dataset

| RESULTS AND DISCUSSION
This section demonstrates the classification and detection efficiency of the proposed SBMYv3 model. The results are discussed in three steps: (i) performance evaluation of the proposed and sequentially evolved models, (ii) analysis of ablation studies of the three major modifications, and (iii) performance comparison with state-of-the-art methods along with a statistical significance test and measurement of effect size. The parameters used in the proposed architecture are listed in Table 2.

| Experimental configuration
The detection performance of the proposed model is quantified using several performance metrics: accuracy, precision, recall, IoU score, false-positive rate (FPR), Fowlkes-Mallows index (FM), true negative rate (TNR), and mAP. mAP is a prominent metric used by the computer vision community to evaluate detection models. These metrics are defined in terms of TP, TN, FP, and FN, the true positive, true negative, false positive, and false negative counts, respectively.
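The display equations defining these metrics were lost in extraction; the standard confusion-matrix definitions, with which the reported values are consistent, are restated below for reference.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

$$\text{FPR} = \frac{FP}{FP + TN}, \qquad \text{TNR} = \frac{TN}{TN + FP}, \qquad \text{FM} = \sqrt{\text{Precision} \times \text{Recall}}$$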

| Model evaluation and analysis of experimental results
Extensive experiments have been performed to evaluate the performance of the proposed and sequentially evolved models. For a fair analysis, training and testing of each sequentially developed model is performed using the GLOI dataset, as depicted in Table 3. Of the 13,000 images in the GLOI dataset, 3500 images of each class are used for training; after training, 3000 images of each class are used for validation and testing. Figure 11 shows the confusion matrices of the SBMYv3, SMYv3, MYv3, and YOLOv3 models; out of 1500 test images in each class, the SBMYv3 model correctly classifies 1491 non-obscene images and 1487 obscene images. Figure 12 depicts the area under the curve (AUC) of all the subsequently evolved models through precision-recall (PR) curves. The SBMYv3 model (red curve) covers a larger area than the other models, which signifies its superior detection capability. Figure 13 illustrates the comparison of mAP curves: the proposed SBMYv3 model achieves higher mAP values than the other models and stabilizes around 0.98 after 130 epochs. The overall performance measures as well as the GFLOPs and memory of all the models are depicted in Table 4, along with the testing execution time (ms). From Table 4 it is observed that the SBMYv3 model outperforms the other models in terms of detection accuracy, achieving 99.26% accuracy and a 0.6% false positive rate, whereas its GFLOPs and memory are slightly higher than those of the MYv3 and SMYv3 models due to the integration of BAM.

TABLE 2 Parameters used in the proposed architecture
FIGURE 12 P-R curve of the four YOLO models using GLOI dataset
FIGURE 13 Illustration of the mAP curve using GLOI dataset
The superior performance of the proposed model can be attributed to:
1. the generalizing capability of the GLOI dataset,
2. efficient deep feature extraction and rejection of irrelevant features via BAM and S3Pooling.
3. the lightweight MobileNetV2 backend design.
The detailed analysis and contribution of each component are described in the ablation study.

| Ablation studies
The proposed SBMYv3 model is an upgraded version of YOLOv3 with three key improvements: modification of the backend design, integration of BAM, and utilization of S3Pooling. It is essential to evaluate how each of these contributions impacts the overall performance of the model. The following subsections describe the generalizing capability of the generated GLOI dataset, followed by ablation experiments demonstrating the impact of each modification.

| Generalizing capability of GLOI dataset
The GLOI dataset is generated by the Pix-2-Pix GAN-based data augmentation technique. The specification of the generated dataset is provided in Table 1. Table 5 shows the mAP values of each model using raw data without any augmentation and using the GLOI dataset; the benefits of the GAN-based augmentation technique are quite evident in this table. Table 6 provides the average testing and training accuracy of the models to check the generalization capability of the GLOI dataset compared to publicly available datasets such as NPDI and Pornography-2k. For a fair analysis, these two existing benchmark datasets were manually annotated and are termed the labelled NPDI dataset (LND) and the labelled Pornography-2k dataset (LPD), respectively. From Table 6, it is observed that, like LND and LPD, the GLOI dataset provides stable and consistent generalizing ability for adapting to unseen data. Moreover, irrespective of the model type, the GLOI dataset provides superior training and testing accuracy.

| Performance comparison with state-of-the-art methods
In this section, we compare the performance of the proposed method with existing techniques. Section 4.4.1 explains the baseline approaches, and Section 4.4.2 evaluates the proposed model against the baseline approaches using several performance measures.

| Evaluation measures
The classification performance of the proposed model is compared with the baseline approaches using the performance metrics described in Equation (7), along with the Brunner-Munzel statistical significance test (Brunner & Munzel, 2000) and Cliff's effect size (Cliff, 1993). The Brunner-Munzel test is used to check the statistical significance between the data obtained from the baseline and proposed methods; a p-value is calculated from this test, and when the p-value is less than 0.05 the proposed method is deemed statistically significant. Further, the magnitude of the observed discrepancy (δ) is quantified using Cliff's effect size. The value of δ lies in the range [−1, 1]; the significance of various δ values is provided in Table 8 (Brunner & Munzel, 2000). If δ is 0, then all of the values between the two models are the same, and if 0.474 ≤ |δ| < 1.000, the model is considered to exhibit a significant change over the compared model. Table 9 presents the performance metrics comparison of the proposed methodology with the baseline approaches, and the average values of these parameters are plotted in Figure 15. From Table 9 and Figure 15 it is observed that the proposed SBMYv3 model outperforms the baseline approaches, achieving a testing accuracy of 99.26%. The false positive rate of SBMYv3 is 0.6%, lower than that of the baseline models, which signifies its efficiency in providing accurate detection results.
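As an illustration of how these two statistics can be computed, the sketch below uses SciPy's brunnermunzel test and a direct pairwise implementation of Cliff's delta; the per-run accuracy arrays are made-up placeholders, not results from the paper.

```python
import numpy as np
from scipy.stats import brunnermunzel

def cliffs_delta(a, b):
    """Cliff's delta: (#(a_i > b_j) - #(a_i < b_j)) / (len(a) * len(b))."""
    a, b = np.asarray(a), np.asarray(b)
    greater = (a[:, None] > b[None, :]).sum()  # pairs where model a wins
    less = (a[:, None] < b[None, :]).sum()     # pairs where model b wins
    return (greater - less) / (len(a) * len(b))


# Hypothetical per-run accuracies of the proposed and a baseline model.
proposed = [99.2, 99.3, 98.9, 99.4, 99.0]
baseline = [98.8, 99.0, 98.7, 99.1, 98.9]

stat, p_value = brunnermunzel(proposed, baseline)
print(f"p-value = {p_value:.4f}, Cliff's delta = {cliffs_delta(proposed, baseline):.3f}")
```

A |δ| at or above the 0.474 threshold in Table 8 would then be read as a large effect, matching the interpretation used in the comparison tables.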
The results of both the statistical significance test and Cliff's effect size are shown in Table 10. The p-values of the SBMYv3 model with respect to all the baseline models are less than 0.05, which establishes the statistical significance of the proposed detection model. Moreover, the Cliff's effect size values of the SBMYv3 model with respect to all the baseline models show that it exhibits a substantial effectiveness level. From Tables 9 and 10, the SSD MobileNetV2-based method is the next best performing model after the proposed SBMYv3 model. With respect to the SSD MobileNetV2 model, the proposed model achieves average gains of 6.92% in testing accuracy and 6.46% in precision, and a 93.75% reduction in FPR. Figure 16 depicts the classification of obscene and non-obscene images from the GLOI dataset; in the classified obscene images, the obscene area is annotated with a bounding box.

| CONCLUSION
This paper proposes a modified YOLOv3 model, termed the SBMYv3 model, for the classification and detection of obscene images and videos. The backend of the YOLOv3 model was replaced with the lightweight MobileNetV2 combined with the bottleneck attention module (BAM) to exhibit adequate focus on the annotated region of the image during feature extraction. In addition, the model utilizes the S3Pooling strategy to retain intermediate augmentation results. The model is trained and tested using the generated GLOI dataset, where a Pix-2-Pix GAN based augmentation technique is used to generate the well-labelled obscene images. Extensive experimental analysis demonstrates the superior behaviour of the suggested SBMYv3 model over the existing baseline models. The results indicate that the proposed model exhibits a high testing accuracy of 99.26%, precision of 99.13%, and false positive rate of 0.6%; furthermore, an FM index of 99.2% is attained. The model not only classifies and detects obscene images and videos but also highlights the obscene or target areas with a bounding box. In terms of computational burden, the suggested SBMYv3 model requires less memory and fewer GFLOPs than the conventional YOLOv3. The proposed model is also validated using the p-value and δ-value from the Brunner-Munzel test and Cliff's delta effect size, respectively; these tests reveal that the performance of SBMYv3 is statistically significant with respect to the state-of-the-art techniques. In summary, the proposed architecture is efficient regarding statistical significance, training stability, computational cost, and classification accuracy. Hence, the proposed model is reasonably fit for the real-time classification and detection of obscene images.