Fake Video Corpus

This is the first, to our knowledge, annotated dataset of debunked and verified user-generated videos (UGVs), along with multiple near-duplicate reposted versions of them.

Download

Initial dataset

The dataset comprises videos from a variety of event categories, such as politics, sports, natural disasters, accidents, wars, etc. Currently, it consists of 200 unique debunked videos (for simplicity also referred to as fake) and 180 unique verified videos (also referred to as real). The following types of fake video are included:

Staged videos where actors perform scripted actions under direction.
Videos where contextual information is false (e.g. the claimed video location is wrong).
Past videos presented as UGV from breaking events.
Videos of which the visual or audio content has been altered through editing.
Computer-generated Imagery (CGI) posing as real.

Indicative cases of real and fake User-Generated Videos. Top: four real videos. a) A Greek army helicopter crashing into the sea in front of a beach; b) US Airways Flight 1549 ditched in the Hudson River; c) A group of musicians playing in an Istanbul park while bombs explode outside the stadium behind them; d) A giant alligator crossing a Florida golf course. Bottom: four fake videos. a) “A man taking a selfie with a tornado” (CGI); b) “The artist Banksy caught in action” (staged); c) “Muslims destroying a Christmas tree in Italy” (out of context: there is no indication that the men are Muslim); d) “Bomb attack on Brussels airport” (out of context: the footage is from Moscow Domodedovo airport).

The dataset was initially formed after investigating important and viral events. Fact-checking sites such as snopes.com and others were consulted both to help with the annotation of the videos and to discover more debunked videos. Another source of content was the Context Aggregation and Analysis service developed within the InVID project as a tool for video verification. The service, being one of the few publicly available tools for video metadata analysis, generally attracts traffic from verification experts who submit suspicious videos for verification, often as part of using the InVID verification plug-in. All videos submitted to the service between November 2017 and January 2018 resulted in an initial pool of approximately 1600 videos. This set was filtered to remove non-UGV and other irrelevant content, and was subsequently annotated as real or fake. The dataset contains videos from three major video sharing platforms: YouTube, Facebook, and Twitter.

Extended dataset

The extended dataset was formed through a systematic, largely automatic process that combines text search and near-duplicate video retrieval, followed by manual annotation according to a set of guidelines. More specifically:

For each video in the original set, the video title was used as input.
The title was reformulated to a more general form (called the “event title”). For example, a video with title “Video Tornado IRMA en Florida EEUU Video impactante” was assigned to event “Tornado IRMA at Florida”.
Using Google Translate, the event title was translated from English into four major languages: Russian, Arabic, French, and German. These languages were selected after preliminary tests indicated that near-duplicate videos appear with increased frequency in these languages.
The video title, event title, and the four translations were used as separate queries to the three target platforms: YouTube, Facebook, Twitter. All returned videos were aggregated in a common pool.
A near-duplicate retrieval algorithm was used to search within this pool for near-duplicates of the video.
After manual inspection, erroneous results were removed and only actual near-duplicates were retained.
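
The query-construction part of the steps above can be sketched roughly as follows. This is a minimal illustration, not the code used to build the dataset: the function name build_queries, the language codes, and the placeholder translation strings are all assumptions made for the example, and the actual search calls to each platform are left out. It shows how one seed video yields six query texts (video title, event title, four translations), each sent to the three platforms.

```python
from itertools import product

# Target platforms and languages as listed in the text.
# The two-letter language codes are an assumption for this sketch.
PLATFORMS = ("YouTube", "Facebook", "Twitter")
LANGUAGES = ("ru", "ar", "fr", "de")


def build_queries(video_title, event_title, translations):
    """Return (platform, query_text) pairs for one seed video.

    `translations` maps each language code to the translated event title.
    Producing the translations is outside the scope of this sketch.
    """
    missing = [lang for lang in LANGUAGES if lang not in translations]
    if missing:
        raise ValueError(f"missing translations for: {missing}")
    # Six separate query texts: original title, event title, four translations.
    texts = [video_title, event_title] + [translations[lang] for lang in LANGUAGES]
    # Every query text is issued to every platform.
    return list(product(PLATFORMS, texts))


# Placeholder strings stand in for the real translations.
queries = build_queries(
    "Video Tornado IRMA en Florida EEUU Video impactante",
    "Tornado IRMA at Florida",
    {lang: f"<event title in {lang}>" for lang in LANGUAGES},
)
print(len(queries))
```

The results returned for all of these queries would then be merged into the common pool that the near-duplicate retrieval step searches.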

Total dataset

The overall dataset consists of 3957 videos annotated as fake and 2458 annotated as real.

Categories for near-duplicates of fake videos include:

Fake: those that reproduce the same false claims
Uncertain: those that express doubts on the veracity of the claim
Debunk: those that attempt to debunk the original claim
Parody: those that use the content for fun/entertainment
Real: those that contain the earlier, original source from which the fake was made.

Categories for near-duplicates of real videos include:

Real: those that reproduce the same factual claims
Uncertain: those that express doubts on the veracity of the claim
Debunk: those that attempt to debunk their claims as false
Parody: those that use the content for fun/entertainment.
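
When working with the annotations, it may help to check that each near-duplicate label is one of the categories listed above for its seed video. The following is a small sketch under assumptions: the lowercase label spellings and the helper name is_valid_category are hypothetical and may not match the strings used in the distributed files.

```python
# Allowed near-duplicate categories, keyed by the label of the seed video.
# Lowercase spellings are an assumption of this sketch.
NEAR_DUPLICATE_CATEGORIES = {
    "fake": {"fake", "uncertain", "debunk", "parody", "real"},
    "real": {"real", "uncertain", "debunk", "parody"},
}


def is_valid_category(seed_label, category):
    """Return True if `category` is listed for near-duplicates of `seed_label`."""
    allowed = NEAR_DUPLICATE_CATEGORIES[seed_label.strip().lower()]
    return category.strip().lower() in allowed


# A near-duplicate of a fake video may be the earlier, original source ("Real").
print(is_valid_category("fake", "Real"))
# "Fake" is not among the categories listed for near-duplicates of real videos.
print(is_valid_category("real", "Fake"))
```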

Facebook videos that were relevant to the dataset but were published by individual users (and thus could not be accessed through the API) were excluded from this dataset.

How to access the dataset

The video URLs, the associated annotations (fake/real), and the near-duplicate video URLs are contained in CSV files.

To obtain metadata about videos, you may use the respective platform API.
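
As a starting point, the CSV files can be read with Python's standard csv module. The sketch below is illustrative only: the column names video_url and label, the example.org URLs, and the in-memory sample are assumptions, so adjust them to the header row of the actual files.

```python
import csv
import io

# In-memory sample standing in for one of the dataset CSV files.
# Column names and URLs are placeholders, not taken from the real files.
SAMPLE_CSV = """video_url,label
https://example.org/video/1,fake
https://example.org/video/2,real
"""


def load_annotations(handle):
    """Read (url, label) pairs from an open CSV file or file-like object."""
    rows = []
    for row in csv.DictReader(handle):
        label = row["label"].strip().lower()
        if label not in {"fake", "real"}:
            raise ValueError(f"unexpected label: {label!r}")
        rows.append((row["video_url"].strip(), label))
    return rows


annotations = load_annotations(io.StringIO(SAMPLE_CSV))
print(annotations)
```

For a file on disk, pass open(path, newline="", encoding="utf-8") instead of the io.StringIO object; the resulting URLs can then be used with the respective platform API.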

If you encounter any issues in this process, please get in touch with Olga Papadopoulou.

License and acknowledgement

The video dataset is provided under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

The video dataset is supported by the InVID project, which is funded by the European Commission under contract number 687786.

If you use this video dataset for your research, please include a citation to the following paper: Papadopoulou, O., Zampoglou, M., Papadopoulos, S., & Kompatsiaris, Y. (2018). A Corpus of Debunked and Verified User-Generated Videos. Online Information Review. Accepted for publication.