Dual‐Stream Fusion and Multi‐scale Analysis: Introducing the Synergistic Dual‐Stream Network (SDS‐Net) for Image Manipulation Segmentation

This article introduces the synergistic dual-stream network (SDS-Net), a novel neural network architecture that significantly enhances the detection of image manipulations. SDS-Net employs a dual-stream fusion strategy that processes both the RGB image and its corresponding noise map. It innovatively combines feature computation across the blocks of its dual backbones and leverages a multi-scale spatial pyramid pooling (MSPP) module to expand the receptive fields of shallow features. This approach not only enriches the feature representation but also enables precise localization of manipulated regions. Extensive experiments on various public datasets demonstrate the superiority of SDS-Net over several state-of-the-art methods.


Introduction
Digital images are repositories of vast information, playing a pivotal role in diverse domains such as media, social networks, and criminal investigations. However, the proliferation of sophisticated tampering tools has precipitated a crisis of authenticity, undermining the credibility of digital media. Manipulated images have been implicated in spreading insidious rumors, facilitating telecommunications fraud, corrupting academic integrity, and even fabricating forensic evidence, thereby exerting a profoundly detrimental impact on society. [1][4][5] Image manipulations commonly fall into three categories: copy-move, splicing, and inpainting. Copy-move involves duplicating a segment within the same image, whereas splicing entails transferring a segment between different images. Inpainting, or removal, conceals specific content by synthesizing background pixels, effectively erasing targeted information. The ubiquity of these manipulations for nefarious purposes underscores the urgent need for robust detection research.
Traditionally, specialized algorithms have been developed to detect manipulation traces, targeting anomalies such as resampling, [6,7] median filtering, [8,9] contrast adjustments, [10,11] and double JPEG compression. [12,13] While these methods have demonstrated utility, they are often labor-intensive and lack guaranteed accuracy. [14] Given that each algorithm is tailored to a specific manipulation trace, a comprehensive image assessment requires multiple tests, leading to a cumbersome and error-prone process, exacerbated by the advent of novel editing techniques that further complicate detection.
[17][18][19] At first glance, localizing tampered regions may appear similar to a semantic segmentation task, which has been extensively researched to outline and classify every pixel of a target within images. [20] While semantic segmentation networks excel at extracting semantic information, this capability alone falls short of accurately identifying tampered regions, presenting a significant challenge. [21] Instead, non-semantic cues such as edge discrepancies, [15] noise patterns, [22] color consistency, [23] and exchangeable image file format (EXIF) data coherence [24] are instrumental in detecting forgeries. Thus, the key to addressing image manipulation lies in training neural networks to discern these subtle forgery indicators beyond mere semantic content. [20] Numerous methods have emerged to address the challenge of detecting image manipulation, which, from our perspective, generally fall into two categories. The first category aims to modulate the input to neural networks. This is achieved through the application of specialized convolutions or filters, such as constrained convolution (also known as Bayar convolution) and the steganalysis rich model (SRM) filter. [14,25] These techniques are designed to diminish the semantic information within the input images, thereby steering the networks toward learning forgery features rather than semantic content. [16,26] For example, Zhou et al. employed the SRM filter to distill noise patterns, which were then processed by a Faster region-based convolutional neural network (R-CNN) to exploit discrepancies in noise between authentic and tampered regions. [25] Nevertheless, the effectiveness of such methods is limited by the sheer diversity of forgery features, making it challenging for any single, manually tuned convolution or filter to detect all types of manipulations. [16] For instance, copy-move manipulations, which do not introduce new elements into the image, elude detection by noise analysis.
The second category modifies the network architecture itself. [15][21] Proposals such as RGB-N, [25] edge-supervised branches, [21] and the spatial pyramid attention network (SPAN) [19] have been developed to learn non-semantic features. The dual-stream approach in particular, which processes both the original image and a pre-extracted feature map, has significantly influenced the field, with many studies adopting this technique. However, in such methods, the feature extraction processes within the backbone networks operate independently. [21,25] Inspired by CB-Net, we consider the pursuit of more sophisticated fusion strategies for multi-stream feature extraction a promising research direction. [27] Furthermore, most approaches adopt an encoder-decoder structure, leveraging deep-layer features from established backbone models such as VGG16, [19] ResNet50, [21] and ResNet101 [25] for the image manipulation localization task (IMLT). Dong et al. contend that these deep-layer features, being semantically rich, may not be optimal for IMLT. [21] Therefore, the underutilization or neglect of shallow and middle-layer features can compromise segmentation performance, leading to imprecise edge delineation and the omission of small manipulated regions. [28] Figure 1 showcases typical outputs from several notable networks, [16,21,29] all of which rely on deep-layer features for manipulation detection. Their shortcomings are apparent in the inability to reliably detect smaller manipulated areas and the tendency to yield predictions with blurred, indistinct boundaries.
In this work, we introduce the synergistic dual-stream network (SDS-Net), a novel architecture designed to address the challenges outlined above. Our approach incorporates a dual-stream structure that processes both the original image and a corresponding noise map. We engineer an innovative fusion strategy within the backbones: the features from blocks 1 and 2 of both backbones are concatenated, while the outputs of blocks 3 and 4 from the noise-map backbone are element-wise added to the features of blocks 2 and 3 from the image backbone, respectively, with subsequent processing by the latter. Furthermore, we devise and integrate a multi-scale spatial pyramid pooling (MSPP) module, which significantly enlarges the receptive fields of the shallow features in blocks 1 and 2. The pyramid structure first produces feature maps at varying scales, which are then convolved to facilitate multi-scale feature representation, thereby enriching the shallow features. Leveraging these enriched features, SDS-Net adeptly identifies forged regions across a comprehensive range of feature depths, resulting in more accurate predictions. [16,21] SDS-Net, as a fully trainable end-to-end system, is engineered for cohesive optimization. Our contributions are as follows: 1) A novel SDS-Net for image manipulation segmentation with a unique dual-stream fusion strategy is presented. It synergistically merges features across the dual streams to enhance feature extraction capabilities. Its structure is shown in Figure 2. 2) A specially designed MSPP module augments the receptive fields of the shallow features, enabling the network to produce feature maps at various scales and improve multi-scale feature representation. 3) Extensive experiments on different public datasets demonstrate the superiority of SDS-Net over state-of-the-art methods.

Related Work
Numerous networks designed for localizing image manipulations adopt an encoder-decoder structure. [14,16,19,21,25,30,31] These frameworks utilize backbone networks to extract features, which are subsequently fed into a segmentation network. Within the encoder, a variety of feature extraction streams are deployed, each tailored to a particular type of data, ranging from original RGB images [30] to more specialized inputs such as the noise-revealing SRM filter, [19] the pattern-detecting Bayar convolution, [14] and the artifact-tracing discrete cosine transform (DCT). [32] Except for the RGB stream, all of these aim to extract non-semantic cues. For instance, the SRM filter reveals noise patterns that highlight inconsistencies between authentic and altered regions, [25] while quantized DCT coefficients can be pivotal in uncovering compression anomalies. [32] Bayar convolution, introduced by Belhassen Bayar, likewise contributes to detecting noise inconsistencies. [14] Single-stream approaches [14,19,30,31] may overlook critical data, prompting some researchers to adopt multi-stream strategies that promise a more comprehensive and robust analysis. [16,21,25,29,33] In these multi-stream approaches, however, each stream operates autonomously to compute and extract features, leaving the potential for cross-backbone feature fusion largely unexplored.
Many current methods in IMLT favor deep-layer features, particularly those derived from the terminal layers of backbone models. [16,19,20,25,32,33] This preference can lead to a neglect of valuable data, given that shallow and mid-layer features often carry a wealth of indicators critical for identifying forgeries. [28] While recent studies have started to acknowledge the importance of shallow features, [31] their utilization remains basic, with little in-depth exploration of methods for enhancing these essential features. As a result, such methods tend to generate predictions with unclear edges and to miss small manipulated areas.

Proposed Model
We introduce SDS-Net, a framework for manipulation localization that leverages a dual-stream approach. The system incorporates a novel fusion strategy within its streams to improve feature extraction, and integrates a custom-designed module to refine the representational capability of shallow features. An overview of our scheme is depicted in Figure 2. SDS-Net consists of two primary components: feature extraction and prediction. For feature extraction, we utilize two DenseNet169 [34] backbones, complemented by Bayar [14] convolution, to capture both visual anomalies and noise patterns. The output features of the first and second blocks of each backbone are merged via concatenation. Furthermore, the 3rd and 4th block outputs from one backbone are combined with the 2nd and 3rd block features of the other through element-wise addition, and the integrated features are subsequently processed by the corresponding blocks of the second backbone. To further enhance the shallow features, we introduce a specially designed MSPP module, engineered to broaden the receptive field and capture spatial relationships across various scales, thereby enriching the shallow features. Prior to the final prediction, we employ an adaptive spatial feature fusion (ASFF) block to amalgamate features of varying levels. [35] The ASFF block spatially filters conflicting information, resolving discrepancies between semantic content, forgery signatures, and other elements. Our approach demonstrates strong resilience against splicing, copy-move, and removal manipulations.
In the following four subsections, we explain these aspects of SDS-Net: first, the rationale behind selecting DenseNet169 as our backbone for feature extraction; second, how we determined the feature fusion architecture within our framework; third, the mechanism by which the MSPP block improves the network's capacity to learn forgery features; and finally, why the ASFF block is effective in integrating features of various levels and resolving potential conflicts among them.

Selection and Analysis of Backbone Networks
We evaluated a range of candidate backbone networks. [36][37][38][39] Although these networks excel in semantic segmentation tasks, their efficacy in IMLT with multi-level features remains to be evaluated. The reasons we chose these networks are as follows: 1) VGG16, ResNet50, and ResNet101 were chosen based on their established success as backbone networks in SPAN, [19] MVSS-Net, [21] and RGB-N, [25] respectively; we aimed to assess their capabilities as multi-level feature extractors. 2) MobileNetV2 and V3 were included to investigate the impact of dataset volume. Given that many image forgery datasets are relatively small (e.g., CASIAv1 with only 920 forged images and COVERAGE with 100), and considering that the required dataset volume typically scales with network parameter count, lightweight networks might yield better performance on smaller datasets; however, this does not necessarily reflect their ability to learn forgery features accurately, so we examined these typical lightweight networks to determine their training efficacy. 3) DenseNet excels at exploring new features in each layer, [40] suggesting it might do well in IMLT, where a diverse array of forgery features exists. [16] Detailed structures and illustrations of these backbones can be found in their seminal publications: VGG16, [37] ResNet50 and 101, [38] MobileNetV2, [39] MobileNetV3, [36] DenseNet121 and 169. [34] To ensure a balanced evaluation, we conducted experiments on SDS-Net, replacing only the backbone architectures, each pre-trained on ImageNet. We utilized the CASIAv1 dataset, partitioning it into training, validation, and testing sets in an 8:1:1 ratio.
[41] We adopted the pixel-level area under the receiver operating characteristic curve (AUC) as our primary metric. The comparative outcomes are presented in Table 1, which highlights discernible performance differences across the backbone models. VGG16 lagged with the lowest AUC of 0.721, while MobileNetV2 and V3 performed slightly better, scoring 0.753 and 0.731, respectively. ResNet50, ResNet101, DenseNet121, and DenseNet169 all demonstrated good performance, with DenseNet169 achieving the highest AUC at 0.826. Evidently, within SDS-Net, VGG16 was the least effective. Moreover, despite outperforming VGG16, the lightweight MobileNetV2 and V3 did not match the efficacy of their more complex counterparts. This suggests that a reduced parameter count does not, in itself, improve the learning of forged features, even on small datasets.
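The pixel-level AUC used above can be computed by ranking every pixel's predicted score against the binary ground-truth mask. Below is a minimal sketch over flattened score and mask lists, using the Mann-Whitney rank statistic (which equals the ROC AUC); the function name `pixel_auc` is ours, not from the paper.

```python
def pixel_auc(scores, labels):
    """ROC AUC over flattened pixel scores vs. a binary tamper mask.

    Equals the probability that a randomly chosen tampered pixel is
    scored higher than a randomly chosen authentic pixel (ties = 0.5).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]  # tampered pixels
    neg = [s for s, y in zip(scores, labels) if y == 0]  # authentic pixels
    if not pos or not neg:
        raise ValueError("need both tampered and authentic pixels")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

In practice this is computed with an O(n log n) rank-based routine (e.g., `sklearn.metrics.roc_auc_score`); the quadratic loop here is only for clarity.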
In our tests, DenseNet169 performed best. Its superiority over the ResNet models may be attributed to the observation that "residual networks are indeed a special case of densely connected networks". [40] Theoretically, the diversity of features DenseNet provides may encompass those offered by ResNets. Consequently, DenseNet169, with its top-scoring performance, was selected as the backbone network for subsequent experiments.

Dual-Stream Fusion Strategy
The multi-stream approach has been widely utilized in IMLT, but many methods simply combine the features of the two streams before using them. For instance, RGB-N employs bilinear pooling to merge features, [25] while ManTra-Net, SPAN, and CAT-Net utilize concatenation for this purpose. [16,19,32] We believe there is significant potential in crafting a fusion strategy that operates within the backbone networks of the different streams, potentially unlocking more sophisticated feature integration. The feature extraction module of our system is built upon a dual DenseNet169 backbone structure, as depicted in Figure 2. Each backbone consists of four blocks, each containing several convolutional layers that produce feature maps of the same size. Backbone B1 extracts features from the noise pattern, while B2 focuses on the original RGB image. The transformation executed by the L-th block of a backbone is denoted F_L. The feature maps generated by the L-th block of B1 are denoted x_1^L, those from B2 as x_2^L, and the fused feature maps as x^L. In our design, we concatenate the features from the 1st and 2nd blocks of both backbones to maximize the retention of shallow features. Conversely, the deeper features from the 3rd and 4th blocks of the noise-pattern stream are combined with the features of the 2nd and 3rd blocks of the RGB stream through element-wise addition, allowing the deeper features of the noise stream to bolster those of the RGB stream. The fusion process is articulated as follows:

x^L = x_1^L || x_2^L, L = 1, 2
x_2^3 = F_3(x_2^2 + x_1^3)
x_2^4 = F_4(x_2^3 + x_1^4)

In these formulas, "||" denotes concatenation and "+" represents element-wise addition. To evaluate the effectiveness of our fusion strategy, we conducted comparative experiments against a range of alternative backbone fusion structures, depicted in Figure 3.
Figure 3a illustrates the common concatenation (CC) method used in many works, where features are concatenated after extraction without any inter-backbone fusion computation:

x^L = x_1^L || x_2^L, L = 1, 2, 3, 4

Figure 3b presents another fusion approach, in which the feature maps from each block of B1 are element-wise added to the corresponding feature maps of B2 before further processing. This method differs from the previous one in that it intertwines the computational processes of the two streams, allowing the features from B1 to directly influence the computations in B2. We name this method same-level series connection (SSC):

x_2^L = F_L(x_2^{L-1} + x_1^{L-1}), L = 2, 3, 4

In the SSC configuration, B1 serves as an auxiliary stream supplying noise-pattern information, and the feature maps for subsequent tasks are computed by backbone B2. Figure 3c illustrates a variation on the SSC method, in which the output from a higher-level block of B1 is input into a lower-level block of B2. This method is designated descending-level series connection (DSC):

x_2^L = F_L(x_2^{L-1} + x_1^L), L = 2, 3, 4

DSC likewise positions B1 as an auxiliary component, introducing a strategic interplay of features across different levels. Figure 3d illustrates the fusion strategy employed in our SDS-Net, where only the deep-layer features of B1 are integrated into the computations of B2 while the shallow-layer features are concatenated. This method synthesizes the principles of CC and DSC, and is thus termed integrated concatenation-descent fusion (ICDF). To evaluate the efficacy of ICDF, experiments comparing these four schemes are presented in Section 4.2.2.

Table 1. AUC comparison on the CASIAv1 dataset. The left column lists the backbone models and the right column gives the AUC score after training on CASIAv1 for 20 epochs. We used the SDS-Net structure shown in Figure 2 to test each backbone network.
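The ICDF data flow can be sketched at shape level. The following is an illustration under simplifying assumptions, not the authors' code: each backbone "block" is stubbed as a fixed channel mix that preserves spatial size, whereas the real DenseNet169 blocks downsample and require resolution/channel alignment before the element-wise additions.

```python
import numpy as np

def block(x, seed):
    # Stand-in for a backbone block: a fixed random pointwise channel mix.
    # (Real DenseNet blocks also change resolution; omitted for clarity.)
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((x.shape[1], x.shape[1]))
    return np.einsum('oc,bchw->bohw', w, x)

def icdf_forward(rgb, noise):
    """ICDF fusion sketch: concatenate shallow features of both streams,
    add deep noise-stream features into the RGB stream before its next block."""
    # B1 (noise stream) runs to completion independently.
    x1 = [noise]
    for L in range(4):
        x1.append(block(x1[-1], seed=L))
    x1 = x1[1:]                                  # x1[L] = output of B1 block L+1
    # B2 (RGB stream) with deep-level injection from B1.
    x2_1 = block(rgb, seed=10)
    x2_2 = block(x2_1, seed=11)
    x2_3 = block(x2_2 + x1[2], seed=12)          # x_2^3 = F_3(x_2^2 + x_1^3)
    x2_4 = block(x2_3 + x1[3], seed=13)          # x_2^4 = F_4(x_2^3 + x_1^4)
    # Shallow features: channel-wise concatenation ("||") of blocks 1 and 2.
    shallow_1 = np.concatenate([x1[0], x2_1], axis=1)
    shallow_2 = np.concatenate([x1[1], x2_2], axis=1)
    return shallow_1, shallow_2, x2_3, x2_4
```

The four returned maps correspond to the multi-level features that the MSPP and ASFF stages consume in Figure 2.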

Enhancing Shallow Feature Expression Through MSPP
In the previous section, we integrated feature computation within the backbones, whereas the shallow features were simply concatenated. To maximize the potential of these shallow features, we developed a module specifically aimed at enhancing their expression. Ideally, shallow features should provide more forgery clues; however, they are inefficient at learning forgery features. The features provided by the shallow layers of a backbone network cover only a limited receptive field over a small image patch and lack multi-scale analysis. [42] This leads to two drawbacks. First, throughout training, the likelihood that each pixel in the feature map encounters forgery regions is diminished. Second, the limited receptive field, lacking contextual information, may lead to feature redundancy. To address these issues and improve the detection of forgery traces in the initial layers, we integrate MSPP blocks, specifically designed to handle the features extracted by the 1st and 2nd blocks of the backbones, as depicted in Figure 2. The fundamental principle behind MSPP is to inspect image patches of varying positions and dimensions to uncover anomalies.
The architecture of MSPP is illustrated in Figure 4a. To capture information across multiple scales, we employ standard convolution, pooling, and a trio of dilated convolutions with varying dilation rates (1, 2, 5), arranged in sequence. The feature maps obtained from these convolutions are merged through concatenation followed by a convolution to produce the final output. This concept resembles atrous spatial pyramid pooling (ASPP), [43] which similarly utilizes dilated convolutions to achieve an extensive receptive field. However, despite ASPP's expansive receptive field, its internal structure is discrete, as shown in Figure 4b. Such discreteness is not ideal for tasks like IMLT that necessitate the comparison of adjacent regions. In contrast, MSPP not only significantly enlarges the receptive field without sacrificing resolution or coverage but also enriches the feature representation with multi-scale information. This is particularly beneficial for learning forgery features from shallow-layer representations. Within SDS-Net, MSPP is applied only to shallow-layer features, as deeper layers naturally possess a broad receptive field and abundant contextual information, obviating the need for enhancement.
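The receptive-field gain from chaining the dilated 3×3 convolutions (rates 1, 2, 5) can be checked with simple arithmetic. A back-of-envelope sketch, assuming stride 1 and 3×3 kernels (the function names are ours):

```python
def rf_sequential(dilations, k=3):
    """Receptive field of stacked dilated k x k convolutions (stride 1).
    Each layer adds (k - 1) * dilation pixels to the receptive field."""
    rf = 1
    for d in dilations:
        rf += (k - 1) * d
    return rf

def rf_parallel(dilations, k=3):
    """Widest single branch of parallel dilated convolutions (ASPP-style)."""
    return max(1 + (k - 1) * d for d in dilations)

seq = rf_sequential([1, 2, 5])   # sequential MSPP-style chain
par = rf_parallel([1, 2, 5])     # parallel ASPP-style branches
```

The sequential arrangement yields a 17-pixel field versus 11 for the widest parallel branch, and, because lower rates fill in the gaps left by the rate-5 kernel, it avoids the "discrete" sampling pattern criticized above.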

Features Fusion by ASFF
In SDS-Net, before making the final decision on tampered regions, we employ an ASFF module. [35] Originally introduced to filter out conflicting information across spatial dimensions, ASFF adeptly sieves through multi-level features, retaining only those useful for fusion. This enables SDS-Net to utilize rich features for accurate tampering prediction. Not all information, such as certain semantic details, contributes positively to this process; by implementing ASFF at the final decision stage, we can effectively remove the interference of nonessential information, ensuring that the critical forgery features are preserved and highlighted. The specific implementation of ASFF utilized in our network is depicted in Figure 5.
As its name implies, ASFF dynamically determines the spatial fusion weights for the feature maps. Denoting by x_n^m the feature vector at position n of the m-th feature map, the fusion is

Y_n = α_n · x_n^1 + β_n · x_n^2 + γ_n · x_n^3 + δ_n · x_n^4

where Y_n represents the vector at position n of the fused feature map Y. The terms α_n, β_n, γ_n, and δ_n are the weights assigned to the four feature maps and are adaptively learned by the network. Each weight is a scalar variable, constrained such that

α_n + β_n + γ_n + δ_n = 1, with α_n, β_n, γ_n, δ_n ∈ [0, 1]

This adaptive weighting scheme ensures that features from all levels are intelligently fused into a single feature map.
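A minimal sketch of this weighted fusion at a single spatial position, assuming the weights are produced by a softmax so that the sum-to-one constraint holds (a common ASFF implementation choice; the helper names are ours):

```python
import math

def softmax(logits):
    # Numerically stable softmax; outputs are in [0, 1] and sum to 1,
    # satisfying the constraint on alpha_n, beta_n, gamma_n, delta_n.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def asff_fuse(feature_vectors, weight_logits):
    """Fuse one feature vector per pyramid level at position n.

    feature_vectors: list of equal-length vectors [x_n^1, ..., x_n^4]
    weight_logits:   raw (learned) scores, normalized here via softmax
    """
    w = softmax(weight_logits)
    dim = len(feature_vectors[0])
    return [sum(w[m] * feature_vectors[m][i] for m in range(len(w)))
            for i in range(dim)]
```

With equal logits the fusion degenerates to a plain average; training moves the logits so that, per position, the most informative level dominates.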

Experimental Section
We evaluate SDS-Net on four standard image manipulation datasets, perform ablation experiments to assess its components, and compare the results with state-of-the-art methods.

Datasets
We established a synthetic dataset for the initial pre-training of our models, following the methodology outlined in ref. [2]. These models were then fine-tuned on various datasets for evaluation. We benchmarked our model against several state-of-the-art (SOTA) methods on the CASIA, [41] COVERAGE, NIST Nimble 2016 (NIST16) (https://www.nist.gov/itl/iad/mig/nimble-challenge-2017-evaluation/), and Columbia [45] datasets. CASIA consists of CASIAv1 and CASIAv2 and provides various spliced and copy-moved images; the tampered regions are carefully selected, post-processing techniques such as filtering and blurring are applied, and binary ground-truth masks are provided. Following prior work, we use CASIAv2 for training and CASIAv1 for testing. [19,21,25] COVERAGE is a small dataset containing only 100 images generated by copy-move techniques; ground-truth masks are also available.
NIST16 is a challenging dataset containing all three tampering techniques. The manipulations are post-processed to conceal visible traces, and ground-truth tampering masks are provided for evaluation.
Columbia dataset focuses on splicing based on uncompressed images.Ground-truth masks are provided.
To ensure a fair comparison with the SOTA methods, we adhere to the most common training and testing splits for the CASIA, NIST16, and COVERAGE datasets. [19,25] For the Columbia dataset, we utilize the entire collection exclusively for validation, following the model's pre-training on the synthetic dataset. The specifics of these datasets are detailed in Table 2.

Evaluation Metric
The metrics we use for evaluation and comparison are the pixel-level F1 score and AUC. Unlike many previous studies that report F1 scores optimized using the test set's decision threshold, we employ a fixed threshold of 0.5, because the optimal threshold for tampered images in real-world scenarios is unpredictable. A fixed threshold offers a more objective reflection of performance.
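The fixed-threshold pixel-level F1 can be sketched as follows, over flattened score and mask lists and the 0.5 threshold stated above (the helper name is ours, not from the paper):

```python
def pixel_f1(scores, mask, threshold=0.5):
    """Pixel-level F1 at a fixed decision threshold.

    scores: per-pixel tamper probabilities (flattened)
    mask:   per-pixel binary ground truth (1 = tampered)
    """
    pred = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(pred, mask) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(pred, mask) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(pred, mask) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because the threshold is held at 0.5 rather than tuned per test set, two methods can be compared without either benefiting from an oracle threshold.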

Baseline Models
We compare our method with the following baseline models. ManTra-Net (https://github.com/ISICV/ManTraNet) [16] uses a manipulation-trace feature extractor to create a unified feature representation, and a local anomaly detection network to localize the forgery regions.
SPAN (https://github.com/ZhiHanZ/IRIS0-SPAN) [19] models the relationships between patches at multiple scales through a pyramid structure of local self-attention blocks to detect image manipulation.
CAT-Net (https://github.com/mjkwon2021/CAT-Net) [31] is a modified network based on HR-Net that considers the RGB and DCT domains simultaneously to effectively learn forensic features.
MVSS-Net (https://github.com/dong03/MVSS-Net) [21] consists of an edge-supervised branch and a noise-sensitive branch: the former exploits subtle boundary artifacts around tampered regions, while the latter captures the noise inconsistency between tampered and authentic regions.
We retrained the above models on the same datasets using their released source code. During training, hyperparameters were either optimally chosen or configured according to the specifications in their publications.

Implementation Detail
The proposed network is trained end-to-end. The input image and the extracted Bayar features are resized to 512 × 512 pixels. The backbone DenseNet169 is ImageNet-pre-trained. For the other standard and dilated convolutional layers, the kernel weights were initialized with He initialization [46] and the biases with zero. We use the Adam optimizer with a learning rate of 10^-4 without decay. [47] The validation loss is monitored at each epoch, and the learning rate is halved whenever the validation loss fails to decrease for 10 epochs, until it reaches 10^-7. We train the models for 100 epochs with a batch size of 18. For the loss function, we use the dice loss, which can tackle data imbalance, [48] as forgery datasets are often highly imbalanced. [49] Our model is implemented in PyTorch and trained on 6 NVIDIA RTX 3090 graphics processing units (GPUs). To demonstrate the objective and independent performance of our model, no data augmentation is used.
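As a minimal illustration of the dice loss named above, written with plain lists rather than PyTorch tensors (the soft-dice form with epsilon smoothing is a common implementation choice, not necessarily the paper's exact variant):

```python
def dice_loss(probs, target, eps=1e-6):
    """Soft dice loss over flattened per-pixel probabilities.

    Near 0 when predictions match the mask; near 1 when they overlap not
    at all. Unlike per-pixel cross-entropy, it is driven by region overlap,
    so a small tampered region is not swamped by the authentic background.
    """
    inter = sum(p * t for p, t in zip(probs, target))
    total = sum(probs) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (total + eps)
```

This overlap-based behavior is why dice loss suits forgery masks, where tampered pixels are typically a small minority of the image.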

Comparison with the State-of-the-Art
We benchmarked the performance of our method against the baseline models. All models were pre-trained on the synthetic dataset and subsequently fine-tuned on the CASIA, COVERAGE, and NIST16 datasets, while Columbia was used directly for validation. As shown in Table 3, our approach surpasses the other models in pixel-level localization accuracy on most datasets, falling only marginally behind MVSS-Net in AUC on the COVERAGE dataset. SDS-Net carefully constructs the fusion of the noise and RGB streams while simultaneously improving the expression of shallow features; this strategy equips our model to deliver exceptional precision in IMLT.
Our approach demonstrates a marked improvement in F1 score over MVSS-Net on the CASIA and NIST16 datasets, which can be attributed to two primary factors. First, CASIA and NIST16 contain a higher proportion of images with small tampered areas, and our emphasis on dual-stream fusion and reinforced shallow features makes these subtle manipulations easier to detect. Second, our model generates prediction masks with pixel values predominantly at the extremes, close to 0 or 255, resulting in more decisive and less ambiguous localizations, as clearly depicted in Figure 6. Consequently, SDS-Net does not depend on an optimal threshold for achieving impressive F1 scores, unlike other methods. [21] The enhanced performance can be attributed to the cooperation between the dual-stream fusion and the ASFF module: the fusion strategy enriches the features, while the ASFF module selectively filters out irrelevant information, sharpening the focus during decision-making.

Ablation Study
The ablation study was conducted in two parts. The first part was dedicated to ascertaining the efficacy of the stream fusion strategy introduced in Section 3.2; the second examined the impact of MSPP and ASFF on the enhancement and selection of features.
Firstly, we performed comparative experiments on CASIA to assess the effectiveness of the stream fusion strategy.The outcomes of these experiments are comprehensively detailed in Table 4.
The ICDF method outperforms all other structures, achieving the highest AUC score, while DSC lags behind with the lowest score of 78.5%. Notably, both SSC and DSC yield inferior results compared to CC, which does not engage in inter-backbone fusion computations. This suggests that the fusion strategies employed by SSC and DSC may actually be detrimental, potentially due to conflicts between the noise and RGB features during fusion.
Furthermore, DSC underperforms SSC, which could be attributed to the aggravation of feature conflicts by its descending-level fusion operations. We hypothesize that such conflicts are more severe within the shallow features; therefore, by systematically eliminating fusion calculations from the shallow layers, we arrived at the ICDF method, which achieves an AUC score of 81.7%. We infer that the higher-level layers, being abundant in semantic features, are more amenable to fusion computation.
Another set of experiments was designed to evaluate the contributions of MSPP and ASFF, quantifying their roles in feature enhancement and selection relative to the model's overall performance. Specifically, we investigate: 1) the degree to which MSPP boosts performance, by applying it exclusively to the 1st or 2nd block of the backbones and by testing the model without MSPP for comparison; and 2) the influence of ASFF, by comparing it with simple element-wise addition.
The comparative analysis of MSPP and ASFF is summarized in Table 5. Our findings indicate that MSPP enhances performance whether it is applied to one block or both. Employing MSPP solely in the 1st block yields a more substantial improvement than in the 2nd block, which may be because the feature map from the 1st block has a smaller receptive field and therefore gains richer features once that field is expanded.
When comparing the effects of ASFF, as shown in the last two columns of the table, we observe that ASFF yields better performance in terms of the F1 score, although with a marginal decrease in AUC.This supports the earlier assertion that ASFF is capable of filtering out irrelevant information, thereby facilitating more precise decisions.The improvement in the F1 score is not only a result of feature refinement but also comes from a reduced reliance on the optimal decision threshold.However, it is important to note that ASFF might lead to a slight loss of information, which is reflected in the small dip in AUC.Nevertheless, we consider this minor loss of 0.1% to be an acceptable trade-off.
To provide a clearer understanding of our approach's effectiveness, we present heat-map experiments in Figure 7. Figure 7a displays the input tampered image alongside its corresponding tampered-region mask. Figure 7b-i show heat maps generated from the 1st to 4th blocks of the backbone: Figure 7b-e illustrate the feature heat maps produced with ICDF, MSPP, and ASFF, while Figure 7f-i depict those generated without these components. Notably, the shallow-layer features in Figure 7b,c exhibit pronounced highlighting in the forged area, in marked contrast to Figure 7f,g. Interestingly, the feature map from the 4th block in Figure 7e shows distinctly low brightness with a clear boundary around the tampered region, the opposite of the highlighting in the shallow features. Although the mechanism behind this reversal awaits further investigation, it nonetheless offers a valuable indicator for identifying forged regions. Thus, our method effectively enhances feature representation across layers, providing robust support for accurate localization.

Robustness Evaluation
We conducted experiments to assess the robustness of SDS-Net against various image perturbations, as detailed in Table 6. We applied Resize, Gaussian Blur, Gaussian Noise, and JPEG Compression to the NIST16 dataset using Python's OpenCV library. Our model's performance was benchmarked against SPAN [19] and PSCCNet [50] on the perturbed dataset, with their results taken from the respective original papers. SDS-Net exhibited strong robustness, outperforming SPAN and PSCCNet under Gaussian Blur and Gaussian Noise, although it showed greater sensitivity to Resize and JPEG Compression.
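Two of the degradations used in this robustness test can be reproduced with a few lines of code. The paper applies them via OpenCV; below is a minimal dependency-light numpy sketch of additive Gaussian noise and nearest-neighbour resizing, with our own illustrative parameter values (the sigma and scale used in Table 6 are not reproduced here).

```python
import numpy as np

def add_gaussian_noise(img, sigma=5.0, seed=0):
    # Additive zero-mean Gaussian noise, clipped back to uint8 range.
    rng = np.random.default_rng(seed)
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def resize_nearest(img, scale=0.5):
    # Nearest-neighbour resize by index sampling (cv2.resize offers
    # higher-quality interpolation; this only illustrates the effect).
    H, W = img.shape[:2]
    h, w = max(1, int(H * scale)), max(1, int(W * scale))
    rows = (np.arange(h) / scale).astype(int).clip(0, H - 1)
    cols = (np.arange(w) / scale).astype(int).clip(0, W - 1)
    return img[rows][:, cols]

img = np.full((64, 64, 3), 128, dtype=np.uint8)
print(add_gaussian_noise(img).shape, resize_nearest(img).shape)
```

Gaussian Blur and JPEG Compression would be applied analogously (e.g. `cv2.GaussianBlur` and `cv2.imencode` with a quality flag) before re-running the detector on the degraded images.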

Qualitative Result
We present a comparison of prediction results from various methods in Figure 6. Our method exhibits three key advantages: 1) it achieves more precise and distinct boundaries in detected regions, as highlighted by the red circles; 2) smaller tampered areas are localized more effectively, as indicated by the green markers; and 3) the prediction outputs consist predominantly of stark black or white pixels with minimal gray areas, as shown by the blue annotations. This characteristic of SDS-Net ensures clear-cut decisions, reducing ambiguity in practical applications.

Conclusions
In this study, we introduced SDS-Net, a novel approach for image manipulation localization. Our method innovatively fuses features from dual streams, employing a unique ICDF strategy that enhances feature extraction. In addition, the MSPP module further enriches the shallow features, enabling a more robust and detailed analysis of potential manipulations. Extensive experiments across the CASIA, COVERAGE, NIST16, and Columbia datasets demonstrate the superiority of SDS-Net over current state-of-the-art methods. Notably, our approach excels at detecting small tampered regions and maintaining clear boundaries, which is crucial for practical applications. The robustness of SDS-Net was also validated through tests against different perturbation methods.
In conclusion, SDS-Net represents a significant advancement in the field of image manipulation detection. Its ability to effectively utilize multi-level features and precisely localize tampered regions sets a new benchmark for future research in this domain. We believe that our contributions will pave the way for more sophisticated and reliable image manipulation detection techniques, enhancing the integrity and trustworthiness of digital media.

Figure 1 .
Figure 1. Typical outcomes from several previous methods. These methods tend to generate predictions with unclear edges and to miss small manipulated areas.

Figure 2 .
Figure 2. Detailed architecture of SDS-Net. This illustration presents the comprehensive structure of the synergistic dual-stream network (SDS-Net), showcasing its dual-stream fusion strategy and multi-scale feature processing for enhanced prediction accuracy in image forgery detection.

Figure 5 .
Figure 5. Adaptive spatial feature fusion (ASFF) schematic. This figure illustrates the modified ASFF approach tailored to the integration of four feature maps at the same scale.

Table 4 .
AUC performance comparison on CASIA. The first column lists the fusion methods tested; the second column presents the corresponding AUC scores achieved after 20 training epochs with the SDS-Net framework.

Figure 7 .
Figure 7. MSPP enhances the shallow features. a) The tampered image and ground-truth mask. b-e) Heat maps computed on the backbone network when the MSPP blocks are applied. f-i) Heat maps when the MSPP blocks are removed.

Table 2 .
Training and testing split (number of images) for four standard datasets.

Table 3 .
AUC and F1 (%) performance comparison of our method with four SOTA methods on the validation sets of CASIA, NIST16, Columbia, and COVERAGE.

Table 6 .
Robustness analysis on NIST16. # denotes the AUC decrease relative to no manipulation applied.