The dataset is internally separated in one additional level: for each sub-case, we have isolated images where an obvious second splicing or content alteration has taken place (e.g. by adding a watermark), or where a large amount of cropping has occurred. The former case is misleading, as most algorithms will detect the second modification, which is often placed on top of the first, thus accidentally giving a “correctly” localized output. The latter case means that the ground truth mask will not fit properly over the image, and is thus useless. With these images set aside, the main corpus of the dataset contains 10,870 images.
Ground truth masks
We have manually created ground truth masks for each sub-case. Because in many sub-cases the forgery concerned multiple areas of the image, we were unsure which alteration took place first and which area's trace might have remained detectable at the end of the forging process. For these sub-cases we therefore created multiple ground-truth masks to reflect this uncertainty. We believe that, for each such sub-case, if an algorithm produces an output that matches any of the masks, this should be considered a successful detection (and should suggest that the corresponding mask is the one that best reflects reality).
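The "match any of the masks" rule can be sketched as a best-of comparison over all masks of a sub-case. This is only an illustration: the choice of intersection over union as the matching score and the function names `iou` and `best_mask_match` are assumptions, not part of the dataset or of any prescribed evaluation protocol.

```python
import numpy as np


def iou(pred, mask):
    """Intersection over union of two binary arrays of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    mask = np.asarray(mask, dtype=bool)
    union = np.logical_or(pred, mask).sum()
    if union == 0:
        return 0.0  # both arrays are empty, so there is nothing to match
    return float(np.logical_and(pred, mask).sum() / union)


def best_mask_match(pred, masks):
    """Return (index, score) of the ground-truth mask that best matches pred.

    Assumption: the score is IoU; any other localization metric could be
    substituted without changing the best-of logic.
    """
    scores = [iou(pred, m) for m in masks]
    best = int(np.argmax(scores))
    return best, scores[best]
```

The returned index identifies the mask that the algorithm output agrees with most, which, following the reasoning above, may hint at which alteration left a detectable trace.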
A second consideration when using the ground-truth masks is that the images we have downloaded do not all have the same dimensions within each sub-case. Many resized versions exist, and it is nearly impossible to identify which one is the oldest. However, since we have removed the cropped versions, the binary mask can in each case be resized to match the image size. Even in cases where the image has been rescaled nonuniformly, this resizing operation will still produce a valid ground-truth mask for the image.
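A minimal sketch of this resizing step follows, assuming the mask has already been loaded as a 2-D NumPy array of 0/255 values. Nearest-neighbour sampling is our choice here because it keeps the mask strictly binary; the function name `resize_mask` is hypothetical.

```python
import numpy as np


def resize_mask(mask, height, width):
    """Resize a binary mask to (height, width) by nearest-neighbour sampling.

    Rows and columns are scaled independently, so the result stays aligned
    with the image even when the image was rescaled nonuniformly.
    """
    mask = np.asarray(mask)
    src_h, src_w = mask.shape[:2]
    # For every target row/column, pick the source row/column it maps to.
    rows = np.arange(height) * src_h // height
    cols = np.arange(width) * src_w // width
    return mask[rows[:, None], cols]
```

For example, `resize_mask(mask, image.shape[0], image.shape[1])` would bring a mask to the size of a given image array before comparing it with an algorithm's output.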
The dataset root contains two folders: WildWeb and UnsplicedSources. The former contains 90 subfolders, each containing one sub-case. The naming convention is, in all cases, the name of the case, followed by a number if multiple sub-cases exist. Thus, KerryLaVey is the only sub-case of KerryLaVey, while ObamaSmoking1 is the first sub-case of ObamaSmoking. Within each such folder are the images, plus two subdirectories. The first subdirectory, called Mask, contains all the mask files for the sub-case, in the form of PNG images, with white (255) corresponding to the tampered region and black (0) to the rest of the image pixels. The second subdirectory, called Crops – PostSplices, contains all cropped and re-spliced versions of the sub-case.
Finally, the UnsplicedSources folder contains 71 subfolders, each named after a case, containing the source images from which the splice was performed and some text explaining the history of the forgery and the articles confirming the forgery. There are fewer sources (71) than cases (80) because certain cases originate from the same sources but were conducted at different times and with different localizations; we have registered these as entirely separate cases.
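The WildWeb folder layout described above could be traversed as in the following sketch. The function name `list_subcases` and the set of image file extensions are assumptions made for illustration; the actual extensions present in the dataset may differ.

```python
from pathlib import Path

# Assumption: images use one of these common extensions.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tif", ".tiff"}


def list_subcases(root):
    """Map each sub-case name to its main-corpus images and its mask files."""
    subcases = {}
    for folder in sorted(Path(root, "WildWeb").iterdir()):
        if not folder.is_dir():
            continue
        # Only files directly inside the sub-case folder belong to the main
        # corpus; the "Crops – PostSplices" subdirectory is not descended into.
        images = sorted(
            p for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
        )
        masks = sorted((folder / "Mask").glob("*.png"))
        subcases[folder.name] = {"images": images, "masks": masks}
    return subcases
```

Each entry pairs the images of one sub-case with all of its candidate masks, which is the grouping needed when an output is scored against every mask of the sub-case.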
Acquiring the Wild Web dataset
Due to copyright considerations, we cannot make the Wild Web dataset publicly available. However, if you would like to use it for research purposes, we will be happy to share the entire dataset. Email firstname.lastname@example.org and we will provide you with a download link (~910MB).
Citing the dataset
If you use the dataset, please cite the following publication: