Extended IMD2020: a large-scale annotated dataset tailored for detecting manipulated images

Image forensic datasets need to accommodate a complex diversity of systematic noise and intrinsic image artefacts to prevent learning methods from overfitting to a small set of camera types or manipulation techniques. Such artefacts are created during image acquisition as well as by the manipulation process itself (e.g. sensor noise, interpolation artefacts, etc.). Here, the authors introduce three datasets. First, they identified the majority of camera models on the market and collected a dataset of 35,000 real images captured by these cameras, together with the same number of digitally manipulated images. Additionally, they collected a dataset of 2,000 'real-life' (uncontrolled) manipulated images, made by unknown people and downloaded from the Internet; the real versions of these images are also provided, along with manually created binary masks localising the exact manipulated areas. Moreover, the authors captured a set of 2,759 real images taken with 32 unique cameras (19 different camera models) in a controlled way, so the processing history of all images is guaranteed. This set includes categorised images of uniform areas as well as natural images that can be used effectively for analysis of the sensor noise.


| INTRODUCTION
The histories of visual content manipulation and photography run practically in parallel [1]. In modern times, we face a plethora of manipulated images that create significant problems in our society. Advanced image editing techniques have become increasingly accessible in the form of user-friendly editing software and have resulted in manipulated visual content that appears convincingly realistic. Both classic image editing programs and an abundance of apps and software tools now use the latest advances in computer vision, for example generative adversarial networks (GANs) [2]. GAN methods can be used to create fake but realistic visual content in no time at all. Deepfakes (artificial-intelligence-generated videos of people doing and saying fictional things) are a popular form of such manipulations.
Clearly, we require technologies that permit us to assess the integrity of digital visual media to a reliable degree, yet the limitations of our current forensic technology result in low accuracy in real-life situations. The image forensic community seeks to apply the successes of deep nets in computer vision to the difficult problem of detecting manipulated imagery. However, several obstacles stand in the way of this objective.
A major obstacle is that deep nets require large-scale datasets for training. For image classification, ImageNet [3], released in 2009, provided a large-scale annotated dataset containing 1,000 distinct object categories. Fei-Fei Li et al. [3] employed Google Image Search to pre-filter large candidate sets for each category, and an Amazon Mechanical Turk crowdsourcing pipeline [4] to manually validate that each image belonged to its assigned category. This large dataset has advanced computer vision and machine learning research and improved the performance of classification models in relation to earlier methods. Today, the computer vision community benefits from several such publicly available datasets: UCID [5] and ImageCLEF [6] for image retrieval; PASCAL [7], ImageNet [3], and Microsoft COCO [8] for tasks such as object detection, segmentation and recognition.
The above-mentioned datasets cannot serve the purposes of the image forensic community directly because they were not gathered with forensic research in mind and therefore lack the desired diversity and annotations. To date, most image forensic authors have worked with small datasets that fail to capture the wide, complex image artefacts that appear in the lifecycle of real-life images. As a result, these methods fail in cross-data testing and generalisation. Some authors have tried to solve this problem by training their methods using only real images (e.g. [9]); others have tried to build limited internal datasets (e.g. [10]) and focus on domain adaptation.
The aim of the authors is to introduce a large, annotated dataset for detecting manipulated visual content. Inspired by the semi-automatic way in which ImageNet was built, we build, in a semi-automatic way, a dataset that captures a large diversity of image and manipulation artefacts. This is a challenging task. Each camera brings different kinds of artefacts into the image. Some artefacts are unique to a particular camera device and some are unique to a camera model. A range of compression levels brings a range of quantization noise into the visual content. Different manipulation techniques yield different editing traces. In general, we can categorise intrinsic artefacts in visual content into three groups: (i) acquisition artefacts, see Figure 1 (e.g. sensor noise, demosaicking algorithms or gamma correction); (ii) format artefacts (e.g. JPEG and quantization noise); and (iii) manipulation artefacts (e.g. artefacts left in the image by a GAN).
The artefacts mentioned above are essential for creating image/video forensic methods. In fact, forensic methods based on high-pass filters and their resulting noise residuals often seek to eliminate the image content to emphasise these intrinsic artefacts and so expose traces of image manipulation. Although the above-mentioned artefacts are often invisible to the naked eye, a dataset lacking a high variety of them might result in learning methods overfitting to a narrow set of cameras or types of manipulations, causing those methods to perform poorly on new and unseen manipulations (e.g. [10]).

| Contribution
Extended IMD2020 introduces three datasets. The first dataset consists of 35,000 real images captured by 2,322 different camera models, which form the majority of existing cameras on the market. The dataset provides a rich and diverse set of sensor noise artefacts, artefacts that the various imaging software embedded in cameras brings into images, and compression artefacts. Moreover, we also synthetically created a set of manipulated images using a large variety of manipulation operations, including core image processing techniques as well as advanced methods based on GANs or inpainting. This resulted in 70,000 images in total. In addition to this dataset, we also downloaded 2,000 'real-life' (uncontrolled) manipulated images created by random people from the Internet. Real versions of these images are also provided. Binary masks localising the manipulated areas have been created manually. The last part of the datasets consists of 2,759 real images captured with 19 camera models in a controlled way by ourselves. To this end, we used 32 different cameras. The processing history of all images is guaranteed. This set also includes images of uniform areas that can be used for analysis of sensor noise as well as other camera-dependent artefacts.

F I G U R E 1 Individual steps and components forming a typical digital image [11]
The dataset contributes to facilitating future research in: (i) classification of manipulated images and localisation of the manipulated area; (ii) source camera identification and sensor noise (e.g. PRNU (photo response non-uniformity)) analysis; and (iii) reverse search of visual content (the dataset includes tens of thousands of near-duplicates in the form of real and manipulated versions of the same image that can serve the training and testing needs of image search engines).
In addition to the dataset, the authors study intrinsic artefacts in images and empirically demonstrate their presence. They also provide a comprehensive review of existing image forensic datasets, as well as a survey of existing CNN (convolutional neural network)-based methods for detecting image manipulation.
The work is organised as follows. Section 2 breaks digital manipulation down into different categories. The subsequent section summarises artefacts brought into images during their lifecycle. After this, we introduce previously published datasets and papers related to the topic. In Section 5, the dataset is introduced in detail. The following section presents experimental results and the last section summarises the work performed.

| TYPES OF MANIPULATION
Any operation applied to an image or video that causes the visual content to differ from its authentic version is a digital manipulation. However, some types of image processing, such as rotation, down-sizing, or the application of global filters, manipulate the information represented by the visual content in a very limited way. Therefore, today's image forensic methods are mainly interested in detecting visual content manipulated in a malicious way.
There are three major types of malicious manipulation of digital images: (i) copy-paste (copying an area from the same image and pasting it to a different area of the same image); (ii) splicing (the manipulated image is created by combining two or more images); and (iii) re-touching (locally editing an area of the image). Different types of malicious manipulations that can be applied to an image are shown in Figure 2. Such manipulations can be achieved using basic image processing techniques as well as advanced methods based, for instance, on GANs.

| ARTEFACTS BROUGHT INTO IMAGES IN THEIR LIFECYCLE
The journey of a digital image can be represented as a composition of several steps: (i) acquisition; and (ii) coding and digital editing [11]. For the sake of simplicity, we model the image acquisition process in the following way:

$$I_{i,j} = I^{o}_{i,j}\,\Gamma_{i,j} + \Upsilon_{i,j} \qquad (1)$$

Here, $I_{i,j}$ denotes the image pixel at position $(i, j)$ produced by the camera, $I^{o}_{i,j}$ denotes the noise-free image (a perfect image of the scene), $\Gamma_{i,j}$ is the multiplicative noise, such as PRNU, and $\Upsilon_{i,j}$ stands for all additive noise components.
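As an illustration (ours, not code from the dataset pipeline), the following minimal numpy sketch simulates Equation (1): a smooth synthetic scene is modulated by a fixed PRNU-like multiplicative pattern and corrupted by additive noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise-free "scene" I^o: a smooth synthetic gradient image.
h, w = 256, 256
I_o = np.tile(np.linspace(0.2, 0.8, w), (h, 1))

# Multiplicative component Gamma (PRNU-like): a fixed per-pixel gain
# pattern close to 1 that is unique to the simulated sensor.
gamma = 1.0 + 0.02 * rng.standard_normal((h, w))

# Additive components Upsilon (shot/readout noise), modelled here
# simply as zero-mean Gaussian noise.
upsilon = 0.01 * rng.standard_normal((h, w))

# Equation (1): I = I^o * Gamma + Upsilon
I = I_o * gamma + upsilon
```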
The following sections briefly describe the major types of artefacts brought into images during the acquisition process and in their later stages of the lifecycle.

| Artefacts associated with acquisition devices
Digital image acquisition devices introduce intrinsic artefacts or fingerprints in the final visual content output through their various components.
When an image is captured, the light from the actual scene is focused through the camera's optical system onto its sensor (usually CCD or CMOS). The sensor's pixels collect photons and convert them into voltages, which are then sampled into a digital signal by an A/D converter. Before reaching the sensor, however, the light is usually filtered by the colour filter array (CFA), a mosaic of tiny colour filters placed over the pixels of the image sensor to capture particular colour information. The CFA is necessary because typical consumer cameras have a single sensor, which is not capable of separating colour information; each pixel captures only one main colour (red, green, or blue). During the demosaicking process, the sensor output is interpolated to produce the digital colour image [11]. The subsequent signal is then processed further for colour correction and white balance adjustment. Additional processing includes gamma correction to adjust for the linear response of the imaging sensor, noise reduction and filtering operations to visually enhance the image.
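To make the interpolation step concrete, here is a minimal sketch of bilinear demosaicking for an RGGB Bayer mosaic. This is illustrative only; real cameras use proprietary, more sophisticated algorithms, but any such interpolation introduces the periodic inter-pixel correlations that forensic methods exploit.

```python
import numpy as np
from scipy.ndimage import convolve

def demosaic_bilinear(raw):
    """Bilinear demosaicking of an RGGB Bayer mosaic (2-D float array).

    Each colour plane is reconstructed by interpolating its sparse CFA
    samples; this interpolation is what correlates neighbouring pixels.
    """
    h, w = raw.shape
    r_mask = np.zeros((h, w)); r_mask[0::2, 0::2] = 1
    b_mask = np.zeros((h, w)); b_mask[1::2, 1::2] = 1
    g_mask = 1 - r_mask - b_mask

    kernel = np.array([[0.25, 0.5, 0.25],
                       [0.5,  1.0, 0.5 ],
                       [0.25, 0.5, 0.25]])

    def interpolate(mask):
        # Normalised convolution: weighted average of available samples.
        return convolve(raw * mask, kernel, mode='mirror') / \
               convolve(mask, kernel, mode='mirror')

    return np.dstack([interpolate(r_mask),
                      interpolate(g_mask),
                      interpolate(b_mask)])

raw = np.random.default_rng(0).random((64, 64))  # stand-in RGGB mosaic
rgb = demosaic_bilinear(raw)
```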
Among the artefacts we have mentioned, some are unique to the specific camera sensor, and others are common to all cameras sharing a model number or brand by virtue of the embedded software they share. For example, a specific image sensor will produce a unique pattern noise. As stated in [12], taking a photo of a uniform scene will still produce a digital image that exhibits variations in the intensity of the individual pixels, which is partly due to the pattern, readout or shot noise. Authors have used sensor pattern noise to identify the exact camera that captured an image [13]. To this end, typically the PRNU, a unique part of the sensor pattern noise, has been used (the multiplicative component of Equation (1)). Figure 3 shows the sensor pattern noise of two different cameras capturing the same scene; the apparent sensor noise of these two cameras differs. A light uniform scene with a minimal number of edges, which enables a more accurate extraction and modelling of the sensor noise, has been used [13].
If, on the other hand, we examine the demosaicking process, we will find it is typically identical for all cameras belonging to the same model (since they share common embedded software and the same demosaicking algorithm). For example, Mahdian et al. [14] show that these interpolation techniques often bring invisible periodic artefacts into the image.

| Artefacts associated with lossy compression
The output of the camera is typically compressed and stored in JPEG, which is the most commonly used image format. In JPEG, the image is first converted from RGB to YCbCr, consisting of one luminance component (Y) and two chrominance components (Cb and Cr). Mostly, the resolution of the chroma components is reduced (usually by a factor of two). Then each component is split into adjacent blocks of 8 × 8 pixels. Each block of each of the Y, Cb and Cr components undergoes a discrete cosine transform (DCT). Let f(x, y) denote a pixel (x, y) of an 8 × 8 block. Its DCT is:

$$F(u,v) = \frac{1}{4}\,C(u)\,C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y)\,\cos\frac{(2x+1)u\pi}{16}\,\cos\frac{(2y+1)v\pi}{16}$$

where $u, v \in \{0, \dots, 7\}$; $C(u), C(v) = 1/\sqrt{2}$ for $u, v = 0$; otherwise $C(u), C(v) = 1$.

F I G U R E 2 Types of image manipulation. On the left copy-paste is shown, in the middle splicing is shown and on the right an example of a re-touching operation is demonstrated

F I G U R E 3 (a) The extracted sensor pattern noise of a Nikon Coolpix L23 device is shown and (b) shows the same for a Canon Powershot A495. Note that the apparent sensor noise of these two cameras differs

In the next step, all 64 F(u, v) coefficients are quantized. The quantization step is given by a 64-element quantization table (QT):

$$F_Q(u,v) = \operatorname{round}\!\left(\frac{F(u,v)}{QT(u,v)}\right)$$

where QT(u, v) defines the quantization step for each DCT frequency u and v. Commonly, there is one QT for Y and another single QT for both Cb and Cr. Quantization tables determine the quantization rate (compression rate). They bring into the image quantization noise and blocking artefacts that are typical of JPEG-compressed images. Therefore, an image forensic dataset should ideally cover a wide range of quantization tables (compression rates) to avoid overfitting of learning methods to specific kinds of JPEG artefacts and compression levels.
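The following Python sketch walks one 8 × 8 block through the DCT, quantization and dequantization steps described above. It uses the standard JPEG luminance quantization table as an example; real cameras and editors ship many different tables, which is exactly the variety a forensic dataset must cover.

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(block):
    """Orthonormal 2-D type-II DCT (the 8x8 JPEG transform)."""
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(coeffs):
    return idct(idct(coeffs, axis=0, norm='ortho'), axis=1, norm='ortho')

# Standard JPEG luminance quantization table (one example of many).
QT = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99]])

block = np.random.default_rng(0).integers(0, 256, (8, 8)).astype(float) - 128
F = dct2(block)                    # forward DCT of one 8x8 block
F_q = np.round(F / QT)             # quantization: the lossy step
block_rec = idct2(F_q * QT) + 128  # dequantize + inverse DCT
# block_rec - block - 128 is the quantization noise left in the image.
```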

| Artefacts associated with various types of manipulation
Different image editing operations can be applied to an image during its life. These include simple operations such as geometric transformations (rotation, scaling, etc.), blurring and sharpening, as well as more advanced and possibly malicious changes such as image splicing, cloning (copy-move), inpainting operations (e.g. [15,16]), or GANs (e.g. Cycle-GAN [17] or Style-GAN [18]). The image forensic community is mainly focused on detecting the malicious types of manipulation: (i) copy-paste (copying an area from the same image and pasting it to a different area of the same image); (ii) splicing (the manipulated image is created by combining two or more images); and (iii) re-touching (locally editing an area of the image).
All such manipulations leave characteristic traces in the image. For instance, authors have noticed that GAN-based methods also leave distinct invisible artefacts in the image (e.g. [19]). There are two main components in a GAN: the discriminator and the generator. The discriminator tries to distinguish real images of the target category from those generated by the generator. The generator, on the other hand, takes an image of the source category as input and tries to generate an image similar to images of the target category, making them indistinguishable to the discriminator. Looking at the GAN pipeline in more detail (e.g. Figure 4), we can notice that the generator typically contains two components: an encoder and a decoder.
The encoder contains a few down-sampling layers which aim to extract high-level information from the input image and generate a low-resolution feature tensor. The decoder, on the other hand, contains a few up-sampling layers which take the low-resolution feature tensor as input and output a high-resolution image. According to Zhang et al. [19], although the structures of GAN models are quite diverse, the up-sampling modules used in different GAN models are consistent. The up-sampling brings specific artefacts into the image (e.g. interpolation-based artefacts [14]). Zhang et al. [19] addressed these up-sampling-related artefacts and used them to detect GAN-based images, showing that they are present in most of the commonly used GAN methods.
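As a rough illustration of how such traces can be made visible (our sketch of a reasonable analysis, not the method of [19]), the snippet below computes the log-magnitude spectrum of a high-pass residual; periodic up-sampling artefacts tend to appear as off-centre peaks in such a spectrum.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def residual_spectrum(img_gray):
    """Log-magnitude spectrum of a high-pass residual of a greyscale image.

    Image content is suppressed by subtracting a local average; periodic
    traces left by GAN up-sampling layers then show up as off-centre
    peaks in the Fourier spectrum.
    """
    img = img_gray.astype(float)
    residual = img - uniform_filter(img, size=3)
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(residual)))
    return np.log1p(spectrum)
```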
To summarise, a well-designed forensic dataset should capture the changes brought into images by a variety of acquisition devices, compression levels and types of manipulation. As pointed out, some of these artefacts are unique to each particular camera (i.e. sensor), and some are unique to a camera brand, model or software editor (e.g. the demosaicking algorithm or JPEG compression parameters).

| RELATED WORKS
This section focuses on reviewing existing datasets as well as CNN-based methods dealing with detection of image and video manipulation.

| Related datasets
The work performed here is an extended version of [20]. In addition to [20], the authors introduce a new part of the dataset consisting of 2,759 real images captured with 32 unique cameras (19 different camera models). They have been captured manually in a controlled way by ourselves, so their processing history is guaranteed. Both images of uniform areas and natural images have been captured, which enables effective analysis of sensor noise as well as other camera-dependent artefacts. In the experimental part of our work, this part of the dataset is used to demonstrate the presence of hidden camera-dependent artefacts in images.
The CoMoFoD dataset [21] has been designed for copy-move forgery detection. It consists of 260 forged images in two categories: small (512 × 512 pixels) and large (3000 × 2000 pixels). Each set includes a forged image, a mask of the manipulated area and the original image. Images are divided into five groups according to the applied manipulation: translation, rotation, scaling, combination and distortion. MICC-F220 and MICC-F2000 [22] are further datasets focused on copy-paste. MICC-F220 is formed by 220 images: 110 are tampered images and 110 are originals. The resolution varies from 722 × 480 to 800 × 600 pixels. The Columbia spliced image database [23] has two parts: a grayscale dataset with 933 authentic and 912 spliced grayscale image blocks, and a colour dataset with 183 authentic and 180 spliced uncompressed colour image blocks.
The CASIA Image Tampering Detection Evaluation Database [24] is an image forensics dataset focused on splicing. CASIA v1.0 has 800 authentic and 921 spliced 384 × 256 images. CASIA v2.0 contains 7,491 authentic and 5,123 tampered images. The First Image Forensics Challenge [25] collected thousands of images of various scenes, both indoors and outdoors. The dataset served an international competition organised by the IEEE Information Forensics and Security Technical Committee and comprises a total of 1,176 forged images. Wen et al. [26] introduced a small dataset called Coverage designed for copy-paste detection. The REWIND (REVerse engineering of audio-VIsual coNtent Data) [27] dataset contains 142 hand-made manipulated images for the evaluation of image tampering detectors; half of the images are original and the other half is a set of hand-made forgeries. There are also 4,800 automatically manipulated images. Barni et al. [28] created a small dataset for detecting cut-and-paste splicing (ISCAS). Zhou et al. created a dataset of manipulated faces [29] by using FaceSwap [30] and SwapMe [31]. There are 1,005 tampered images for each tampering technique (2,010 tampered images in total) and 1,400 authentic images for each subset. The Realistic Tampering Dataset [32] is a dataset of realistic forgeries created manually using editors such as GIMP and Affinity Photo. The National Institute of Standards and Technology (NIST) presented a large benchmark dataset, the Nimble Challenge 2017 [33], containing a total of 2,520 manipulated images. NIST has also published the additional datasets MFC2018 and MFC2019 [33] in subsequent years.
Most currently published datasets (see Table 1) are limited in size, acquisition device variety, content, attack types and compression/post-processing variety. Typically, they were created in a controlled environment.

| State-of-the-art methods
Early image forensic methods used hand-crafted features to detect individual types of manipulation. These traditional methods typically aim to detect targeted inconsistencies among pixels. For example, Farid et al. [34] designed a method to detect composites created from JPEG images of varying quality. This method determines whether a section of the image was initially compressed more heavily, to a lower quality, than the rest of the image. In [35], Hany Farid described the specific correlations brought into the image by CFA interpolation and proposed a method capable of detecting their inconsistency across the image.
Mahdian et al. [36] used estimates of local noise variance based on the wavelet transform to detect local image noise inconsistencies. Weiqi Luo et al. [37] used JPEG blocking artefact characteristics to detect recompressed image blocks. Wei Wang et al. [38] utilised the grey level co-occurrence matrix (GLCM) of the thresholded edge image of the image chroma as an image splicing detection method. Sevinc Bayram et al. [39] used the Fourier-Mellin transform to propose a clone detector. The Fourier-Mellin transform is invariant to scale and rotation, which permits stronger performance when confronted with cloned areas that have been resized and rotated. A range of classic image forensic methods is reviewed in [40].
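As an example of the noise-inconsistency idea, here is a minimal sketch in the spirit of Mahdian et al. [36] (not their exact algorithm): a robust median-absolute-deviation (MAD) estimate of the noise standard deviation is computed block-wise on the diagonal wavelet subband, so regions with a different noise level stand out in the map.

```python
import numpy as np
import pywt  # PyWavelets

def local_noise_map(img_gray, block=32):
    """Block-wise noise sigma estimate from the diagonal wavelet subband.

    The HH subband is dominated by noise; the MAD estimator
    sigma = median(|cD|) / 0.6745 gives a robust local estimate.
    """
    _, (_, _, cD) = pywt.dwt2(img_gray.astype(float), 'db8')
    h, w = cD.shape
    rows, cols = h // block, w // block
    sigma = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            tile = cD[i*block:(i+1)*block, j*block:(j+1)*block]
            sigma[i, j] = np.median(np.abs(tile)) / 0.6745
    return sigma  # spliced-in regions may appear as outlier blocks
```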

| CNN-based image forensic methods
Deep neural networks have proven very effective in various image processing and computer vision tasks, so it is no surprise that the image forensic community has also shifted towards utilising the achievements of deep learning. In [41], Ghosh et al. assume that the spliced and host regions come from different camera models and segment these regions using a Gaussian mixture model. They learn high-pass rich filters using constrained CNNs that compute residuals, highlighting low-level information over the semantics of the image. In [42], Bunk et al. used resampling features computed on overlapping image patches that are passed through a long short-term memory (LSTM) based network for classification and localisation of manipulation. In [43], Wu et al. introduced a novel deep neural architecture for image copy-move forgery detection. The method is based on a two-branch architecture followed by a fusion module; the two branches localise potential manipulation areas via visual discontinuities and copy-move regions via visual similarities, respectively. In [44], Zhang et al. used information from the chrominance and saturation channels to develop a shallow convolutional neural network (SCNN) that learns to detect doctored areas in low-resolution images. To this end, boundaries of modified areas have been used. In [10], Cozzolino et al. demonstrate the limited generalisation capability of the underlying CNNs. They showed that CNNs learn features that are highly discriminative for the given dataset but lack generalisation, resulting in inaccurate results for today's CNN-based methods in cross-dataset test scenarios. To prevent the underlying CNN from overfitting to manipulation-specific traces, they introduced forensic transfer (FT). They learn a forensic embedding based on an autoencoder architecture [45] that can be used to distinguish between real and fake imagery. An unseen manipulated image will be detected as fake if it is mapped sufficiently far away from the cluster of real images. The authors show that only a few training samples of the target manipulation domain are enough to fine-tune their model to achieve high accuracies.
To detect GAN-generated images, Yu et al. [46] used GAN fingerprints to classify an image as real or GAN-generated. Their experiments show that even a small difference in GAN training (e.g. a difference in initialisation) can leave a distinct fingerprint that exists across all its generated images. To avoid learning the semantic information in the image, Kim et al. [47] used a deep learning approach that utilises a high-pass filter to acquire hidden features in the image. In [48], Mazaheri et al. developed an encoder-decoder based network. They assume that manipulated images commonly leave traces near the boundaries of manipulated areas, such as blurred edges, and use representations from early layers of the encoder to detect forgeries. In [49], Bappy et al. used a manipulation localisation architecture which utilises resampling features, LSTM cells, and an encoder-decoder network to segment manipulated areas of the image. Resampling features are used to capture artefacts such as JPEG quality loss, up-sampling, down-sampling, rotation and shearing. In another work [50], Bappy et al. assumed that manipulated areas often exhibit discriminative features at boundaries shared with neighbouring non-manipulated pixels. They focused on these characteristics and developed a unified framework for joint patch classification and segmentation to localise manipulated regions in an image. The proposed method learns the boundary discrepancy, that is, the spatial structure between manipulated and non-manipulated regions, with a combination of LSTM and convolution layers.
In [51], Zhou et al. realised that they can use multiple modalities as input to their CNN to increase accuracy. They proposed a network using RGB information and, in addition, added a noise stream to the architecture. The authors observed that the fusion of the two streams leads to learning effective and rich features and higher accuracy. In [52], Rao et al. focused on eliminating the complex image content to detect manipulation, which enables a faster time to accuracy when training the underlying CNN. Specifically, the weights of the first layer of their network are initialised with the 30 basic high-pass filters used in the spatial rich model for image steganalysis. The results obtained are promising. In [53], instead of classifying the spliced region by a local patch, Cun et al. leveraged features from the whole image and the local patch together, calling this structure a semi-global network. Furthermore, the work of Cozzolino et al. [54] focused on eliminating the image content. Here, the authors proposed a deep learning method to extract a noise residual, called a noiseprint, in which the image content is suppressed. The results shown in the paper signify that this direction is promising for forgery localisation.
In [55], Bondi et al. proposed a method leveraging the characteristic footprints left on images by different camera models. The rationale behind the method is that all pixels of a pristine image should be detected as being shot with a single device; by contrast, if a picture is obtained through image composition, traces of multiple devices can be detected. In [56], Bayar et al. developed a new type of CNN layer, called a constrained convolutional layer, that is able to jointly suppress an image's content and adaptively learn manipulation detection features. Through a series of experiments, they show that the proposed constrained CNN is able to learn manipulation detection features directly from data and outperforms existing state-of-the-art general-purpose manipulation detectors. In [57], Liu et al. proposed to utilise CNNs and segmentation-based multi-scale analysis to locate tampered areas in digital images. The authors observed that, by exploiting the benefits of small-scale and large-scale analyses, the segmentation-based multi-scale analysis can lead to a performance leap in the forgery localisation of CNNs.
In [58], Salloum et al. proposed a method based on a fully convolutional network (FCN) derived from the VGG-16 architecture. The authors introduced several modifications, such as batch normalisation layers and class weighting, to train the network to localise image splicing attacks, and demonstrated improvements over state-of-the-art methods. In [9], Huh et al. proposed an algorithm that uses automatically recorded photo EXIF metadata. EXIF stands for Exchangeable Image File Format and is typically embedded into JPEG files by the camera; it can include date, time, camera settings, etc. The authors used EXIF data to train a model to determine whether an image is self-consistent, in other words, whether its content could have been produced by a single imaging pipeline. The method demonstrated superior results in comparison to other existing ones.
In [59], Le-Tien et al. proposed a low-computational-cost, fully connected neural network to address the problem of image forgery detection. In [60], Bayar et al. tried to prevent the CNN from learning features that represent an image's content and proposed a new convolutional layer specifically designed to suppress the content and learn manipulation detection features. In [61], Wu et al. showed that image splicing detection and localisation can be jointly solved using a multitask network in an end-to-end manner. In [62], Marra et al. attempt to avoid downsizing images before analysing them with CNNs. They propose a CNN-based image forgery detection framework which makes decisions based on full-resolution information gathered from the whole image.
In [19], Zhang et al. proposed a GAN simulator which can simulate the artefacts produced by the common pipeline shared by several popular GAN models. They identified a unique artefact caused by the up-sampling component included in the common GAN pipeline. Without seeing the fake images produced by the targeted GAN models during training, the approach achieves state-of-the-art performance in detecting fake images generated by popular GAN models. In [63], Marra et al. observed that XceptionNet is capable of achieving superior accuracy in detecting image manipulation. For instance, the authors demonstrate that this network accurately detects GAN-generated fake images published on social networks. To reach this conclusion, the authors studied the performance of various image forgery detectors against image-to-image translation, both in ideal conditions and in the presence of the heavy compression routinely performed upon uploading to social networks; the winning architecture was XceptionNet. Another promising CNN-based image manipulation detector, called ManTra-Net, was proposed by Wu et al. in [64]. ManTra-Net performs both detection and localisation. The network handles images of arbitrary size and various types of manipulation such as splicing, copy-move, removal and enhancement (it learns robust image manipulation traces from 385 image manipulation types). The authors formulated forgery localisation as a local anomaly detection problem: the method extracts image manipulation trace features from a test image and identifies anomalous regions by assessing how different a local feature is from its reference features. They demonstrated a good improvement over existing methods.

| THE EXTENDED IMD2020 DATASET
Image forensic methods often eliminate the image content and analyse the underlying (hidden) noise/artefact component of the image to find inconsistencies. As pointed out earlier, some of the intrinsic artefacts are unique to the sensor/camera and others are shared by images captured by cameras of the same brand/model.

| Flickr-based images
To prevent possible overfitting to a narrow range of camera models, we compiled a list covering the majority of camera models on the market. Subsequently, we searched Flickr for images captured by these devices (Flickr enables a search based on camera information included in the metadata). Where available, 30 real images per camera model were downloaded.
Most Flickr users are unlikely to publish maliciously manipulated visual content, but Flickr itself cannot guarantee or exactly identify the source of its images; the processing history of these images remains unknown. To reduce this risk, we manually reviewed ('cleaned') all the images and eliminated those with obvious signs of digital manipulation. We were left with a set of 35,000 real images, some of which are shown in Figure 5. We also generated the same number of synthetically manipulated images using various methods. As pointed out earlier, advanced techniques such as GANs often bring characteristic artefacts into images [46]. Such artefacts might lead to overfitting of learning methods. This has also been empirically confirmed by Cozzolino et al. [10], who experimentally demonstrated that CNN-based approaches for image forgery detection tend to overfit to the source training data and perform poorly on new and unseen manipulations. Therefore, to manipulate images we also used a high variety of core image processing techniques.
Specifically, a random area of random shape in each image has been manipulated using one of the following types of manipulation: copy-paste, splicing or re-touching. The size of the manipulated area was randomly selected to be from 5% to 30% of the image. Additionally, a random combination of image processing operations was applied to the manipulated area. These operations are based on JPEG compression (random compression level), blurring (various kernels), contrast manipulation, various types of noise, and resampling and interpolation using bilinear and bicubic kernels. About half of the images have been manipulated in this way; a simplified sketch of such a generation step is given below. Some examples of such manipulated images are shown in Figure 6.
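The following sketch is illustrative only (not the actual generation code used for the dataset, which is more varied): a random region is copy-pasted, randomly post-processed, and the whole image is recompressed with a random JPEG quality, yielding the image and its binary mask.

```python
import cv2
import numpy as np

rng = np.random.default_rng()

def random_copy_paste(img):
    """Simplified copy-paste manipulation with random post-processing."""
    h, w = img.shape[:2]
    # Random rectangular region covering roughly 5-30% of the image area.
    scale = np.sqrt(rng.uniform(0.05, 0.30))
    rh, rw = int(h * scale), int(w * scale)
    y1, x1 = rng.integers(0, h - rh), rng.integers(0, w - rw)  # source
    y2, x2 = rng.integers(0, h - rh), rng.integers(0, w - rw)  # target
    patch = img[y1:y1+rh, x1:x1+rw].copy()

    # Random post-processing of the pasted patch: blur and/or noise.
    if rng.random() < 0.5:
        patch = cv2.GaussianBlur(patch, (5, 5), 0)
    if rng.random() < 0.5:
        patch = np.clip(patch + rng.normal(0, 5, patch.shape),
                        0, 255).astype(img.dtype)

    out = img.copy()
    out[y2:y2+rh, x2:x2+rw] = patch
    mask = np.zeros((h, w), np.uint8)
    mask[y2:y2+rh, x2:x2+rw] = 255  # binary mask of the manipulated area

    # Random JPEG recompression of the whole manipulated image.
    q = int(rng.integers(50, 100))
    ok, buf = cv2.imencode('.jpg', out, [cv2.IMWRITE_JPEG_QUALITY, q])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR), mask
```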
To synthetically manipulate the second half, we used advanced methods such as GANs and inpainting. Specifically, the following methods have been used: the built-in OpenCV inpainting function, the inpainting method proposed in [16], and FaceApp [65], currently one of the most popular GAN-based face manipulation mobile applications on iOS and Android. Some examples of such manipulated images are shown in Figures 7 and 8.
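For the built-in OpenCV inpainting function, the usage is as follows (the file names are placeholders):

```python
import cv2

# img: BGR image; mask: uint8, non-zero where pixels should be synthesised.
img = cv2.imread('real.jpg')
mask = cv2.imread('mask.png', cv2.IMREAD_GRAYSCALE)

# OpenCV inpainting with Telea's fast marching method; the alternative
# flag cv2.INPAINT_NS uses a Navier-Stokes based approach.
manipulated = cv2.inpaint(img, mask, 3, cv2.INPAINT_TELEA)
cv2.imwrite('manipulated.jpg', manipulated)
```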
To summarise, this dataset is formed by 70,000 images. Half of them are real and the second half has been manipulated in a controlled manner. Binary masks of all manipulated images, localising the manipulated areas, are also provided.

| Real-life manipulated images
We also collected a large set of real-life (uncontrolled) manipulated images from the Internet (for example, see Figure 9). Specifically, 2,000 manipulated images created by random people have been downloaded (the URLs of most images were obtained from [66]). For all of the manipulated images, we also downloaded their real versions. Binary masks localising the manipulated areas of all manipulated images have been created manually. Some examples from this dataset are shown in Figures 9 and 10.

| Guaranteed set of real images
In addition to the above-mentioned data, we also created a set of real images captured by ourselves, so their processing history is guaranteed. To collect this set, we used 32 unique cameras (19 different camera models). Table 2 shows the cameras used and the corresponding number of images acquired by each camera.
Using each camera, we captured images of natural scenes (for example, see Figure 11(a)) as well as images of a uniform light scene with a minimal number of edges (for example, see Figure 11(b)). Images of uniform scenes enable an easier and more accurate estimation of the sensor noise and PRNU [13].

| Estimating camera sensor noise
As pointed out earlier, cameras bring different kinds of artefacts into images. Some artefacts are unique to a particular camera device and some are unique to a camera model. For example, the demosaicking process, which brings specific hidden changes into the image [14], is typically identical for all cameras of the same model (assuming these cameras use the same embedded software and demosaicking algorithm). On the other hand, the sensor pattern noise that has been widely used by authors to identify the exact camera that captured an image is unique per camera.
To design an experiment that demonstrates the presence of artefacts unique to each camera as well as artefacts unique to each camera model, let us briefly outline the typical procedure for examining whether a digital image under investigation has been captured by a specific camera.

F I G U R E 7 On the left the real image is shown, in the middle the manipulated image (using an inpainting method [16]) and on the right the binary mask localising the manipulated area

F I G U R E 8 On the left the real image is shown, in the middle the manipulated image (using FaceApp [65]) and on the right the binary mask localising the manipulated area. It is interesting to note that although the visible area of manipulation of FaceApp is typically inside the face area, pixels of a larger rectangular area around the face get modified as a result of the face transform
To link a digital image to a specific camera, the camera sensor fingerprint first needs to be constructed. Specifically, for a given camera, the corresponding sensor noise fingerprint is estimated by averaging over multiple camera reference images $I_k$, $k = 1, \dots, N$. Camera reference images are photos captured by the camera under examination; it is recommended to use photos of a uniformly illuminated surface.
The process is often sped up by suppressing the scene content from the images prior to averaging. This is achieved by using a denoising filter $F$ and averaging the noise residuals instead. $I^o$ is approximated by denoising $I$, which yields the mentioned residuals:

$$W = I - F(I) \qquad (2)$$

In the above equation, we omitted the pixel indexes $(i, j)$ in our notation. Now, $\Gamma$ can be approximated in the following way:

$$\Gamma \approx \Gamma_N = \frac{1}{N} \sum_{k=1}^{N} \left( I_k - F(I_k) \right) \qquad (3)$$

Testing whether an image has been captured by a particular camera is typically carried out by performing a similarity measure on two sensor fingerprints, $\Gamma_{s_1}$ and $\Gamma_{s_2}$. Here, $\Gamma_{s_1}$ is obtained from the image under investigation and $\Gamma_{s_2}$ corresponds to the camera and is obtained using the set of camera reference images.
Typically, a normalised correlation (a black-box method) is used to compare two estimated sensor fingerprints. Having $\Gamma_{s_1}$ and $\Gamma_{s_2}$ available, we measure their similarity by employing a normalised correlation:

$$corr(\Gamma_{s_1}, \Gamma_{s_2}) = \frac{\left(\Gamma_{s_1} - \bar{\Gamma}_{s_1}\right) \odot \left(\Gamma_{s_2} - \bar{\Gamma}_{s_2}\right)}{\left\|\Gamma_{s_1} - \bar{\Gamma}_{s_1}\right\| \cdot \left\|\Gamma_{s_2} - \bar{\Gamma}_{s_2}\right\|} \qquad (4)$$

where $\bar{X}$ denotes the mean of the vector $X$, $\odot$ stands for the dot product of vectors, defined as $X \odot Y = \sum_{k=1}^{N} X(k)\,Y(k)$, and $\|X\|$ denotes the $L_2$ norm of $X$, defined as $\|X\| = \sqrt{X \odot X}$. The estimated $\Gamma_N$ is the basic version of the camera sensor fingerprint and is not usable in practice for identifying the exact source camera. The reason is the strong presence of components non-unique to the sensor in the estimated $\Gamma_N$. They are caused by operations performed by the embedded software in cameras, such as gamma correction, CFA interpolation, colour enhancement, geometric deformation corrections, JPEG compression, invisible watermarks, etc.
To minimise this problem, the sensor fingerprint can, for example, be enhanced by Wiener filtering in the frequency domain to remove traces of periodic artefacts [13]. This is not used in the experiments carried out in the next section.
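A minimal sketch of the fingerprint construction and comparison described by Equations (2)-(4), using the same choices as the experiments below (a 3 × 3 median filter and a 976 × 976 central crop), might look like this:

```python
import numpy as np
from scipy.ndimage import median_filter

def fingerprint(images, size=976):
    """Basic sensor fingerprint Gamma_N: average of denoising residuals
    (Equations (2)-(3)). `images` are greyscale float arrays, each at
    least `size` x `size` pixels; only the central crop is used.
    """
    residuals = []
    for img in images:
        h, w = img.shape
        y, x = (h - size) // 2, (w - size) // 2
        crop = img[y:y+size, x:x+size].astype(float)
        residuals.append(crop - median_filter(crop, size=3))  # W = I - F(I)
    return np.mean(residuals, axis=0)

def corr(f1, f2):
    """Normalised correlation of two fingerprints (Equation (4))."""
    a, b = f1 - f1.mean(), f2 - f2.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b)))
```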

| EXPERIMENTS
Here, we demonstrate results of a few popular image forensic methods on the collected real-life dataset. Moreover, we perform an experiment using the guaranteed set of images to demonstrate the presence of camera-dependent artefacts.

| Methods detecting manipulation
We applied the following methods to our dataset: NOI1 [36], CFA1 [67], BLK [68], ADQ1 [69] and ManTraNet [64]. To evaluate the methods, all images were first resized to 480 × 480 pixels. We computed false and true positive rates (FPR and TPR) as a function of the detection threshold, going from 0 to 1, and obtained the corresponding receiver operating characteristic (ROC) curve. Moreover, we calculated the area under the ROC curve (AUC) [58]. Results are shown in Figure 12 and Table 3.
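The pixel-level evaluation can be sketched as follows; the arrays below are synthetic placeholders standing in for a detector's output, and scikit-learn's ROC utilities compute the curve and AUC.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)

# Placeholder data: per-pixel scores for one 480x480 test image and its
# binary ground-truth mask (in practice, a detector's output map).
gt = np.zeros((480, 480), bool)
gt[100:200, 150:300] = True                 # "manipulated" region
scores = rng.uniform(0, 0.6, gt.shape)
scores[gt] += 0.3                           # noisy detector response inside it

fpr, tpr, thr = roc_curve(gt.ravel(), scores.ravel())  # TPR/FPR per threshold
print('AUC =', roc_auc_score(gt.ravel(), scores.ravel()))
```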
As the results suggest, current methods have considerable limitations in their accuracy when applied to real-life (unseen) image forgeries. Typical undetected types of manipulation are small manipulated areas, heavily compressed images, images degraded with correlated noise, images with multiple areas manipulated differently, etc.

| Camera-dependent artefacts
Here, we experimentally validate the presence of artefacts unique to particular cameras and unique to each camera model. To this end, the camera sensor noise fingerprint, Γ, of each camera listed in Table 2 was estimated.
For the sake of simplicity, Γ was constructed using residuals based on a simple median denoising filter of size 3 × 3. Only the central part of the images, of size 976 × 976, has been used. For each camera, two different fingerprints have been constructed: (i) using images of uniform areas, Γ_uniform, and (ii) using images of natural scenes captured by the camera, Γ_natural. Next, the camera fingerprints have been compared with each other using Equation (4). Specifically, for each camera, we first measured the similarity of the fingerprint formed from images of uniform areas and the fingerprint of the same camera formed from images of natural scenes, corr(Γ_uniform, Γ_natural). Then, we calculated the similarity of the uniform-area fingerprints of all cameras with each other. Results are shown in Table 4; Figure 13 provides a high-level view of the results obtained.
As is apparent, the highest correlation values are obtained when comparing fingerprints of the same camera estimated using two different sets of images, Γ_uniform and Γ_natural. This signifies a strong presence of artefacts unique to each camera sensor in images. On the other hand, the lowest values correspond to comparing fingerprints of entirely different camera makes and models. It is also interesting to note that comparing fingerprints of different cameras of the same model results in higher correlation values than doing the same for cameras of different models, which signifies the presence of artefacts unique to the camera model. Analogously, correlation values obtained by comparing cameras of the same manufacturer (without considering camera models) are still slightly higher than those obtained by comparing cameras produced by different manufacturers.

| CONCLUSION
In order to enable deep nets to learn discriminative features that generalise well to unseen data, large and diverse datasets must be available. Such datasets need to be designed to capture wide and complex types of systematic noise and intrinsic image artefacts in order to avoid overfitting of learning methods to just a narrow set of camera types or types of manipulation. These artefacts are brought into visual content by various components of the image acquisition process as well as the manipulation process (e.g. sensor noise, JPEG quantization noise, demosaicking and interpolation-related artefacts, image enhancement, etc.). In this work, we collected three large-scale and diverse datasets with a high variety of artefacts. We have demonstrated the results of a few popular image forensic methods on them. Moreover, we empirically demonstrated the existence of different types of artefacts in the dataset. We hope that the dataset will contribute to facilitating future research on training and testing methods for the detection of manipulated visual content as well as source camera identification (PRNU and sensor noise analysis).

F I G U R E 1 1 (a) and (d) show two real-life images captured by a Nikon Coolpix L23 and a Canon Powershot A495, respectively; (b) and (e) show two images of a uniform scene captured by these cameras; and (c) and (f) show visualisations of the sensor noise of these two cameras extracted from the images shown in (b) and (e), respectively. Note that the apparent sensor noise of these two cameras differs

TA B L E 4 Similarity of camera fingerprints obtained using Equation (4). Shown are (a) results of comparing fingerprints obtained using images captured by the same camera (corr(Γ_uniform, Γ_natural)); (b) comparison of uniform fingerprints (Γ_uniform) of cameras of the same make and model; (c) comparison of uniform fingerprints of cameras of the same make; and (d) fingerprints of cameras of different make and model

F I G U R E 1 3 A high-level view of the results given in Table 4. The highest correlation values are obtained when comparing fingerprints of the same exact camera, estimated using two different sets of images, Γ_uniform and Γ_natural. Comparing fingerprints of different cameras of the same model still results in higher correlation values than doing the same for cameras of different models (but the same make). Analogously, correlation values obtained by comparing cameras of the same make but different models are still slightly higher than those obtained by comparing cameras produced by different manufacturers