This is, to our knowledge, the first annotated dataset of debunked and verified user-generated videos (UGVs), along with multiple near-duplicate reposted versions of them.

Initial dataset

The dataset comprises videos from a variety of event categories, such as politics, sports, natural disasters, accidents, wars, etc. Currently, it consists of 200 unique debunked videos (for simplicity also referred to as fake) and 180 unique verified videos (also referred to as real). In particular, different types of fake video are included:

Indicative cases of real and fake user-generated videos. Top: four real videos. a) A Greek army helicopter crashing into the sea in front of a beach; b) US Airways Flight 1549 ditched in the Hudson River; c) A group of musicians playing in an Istanbul park while bombs explode outside the stadium behind them; d) A giant alligator crossing a Florida golf course. Bottom: four fake videos. a) “A man taking a selfie with a tornado” (CGI); b) “The artist Banksy caught in action” (staged); c) “Muslims destroying a Christmas tree in Italy” (out of context: there is no indication that the men are Muslim); d) “Bomb attack on Brussels airport” (out of context: the footage is from Moscow Domodedovo airport).

The dataset was initially formed after investigating important and viral events. Fact-checking sites such as snopes.com were consulted both to help annotate the videos and to discover more debunked videos. Another source of content was the Context Aggregation and Analysis service, developed within the InVID project as a tool for video verification. The service, being one of the few publicly available tools for video metadata analysis, generally attracts traffic from verification experts who submit suspicious videos for verification, often as part of using the InVID verification plug-in. The videos submitted to the service between November 2017 and January 2018 formed an initial pool of approximately 1600 videos. This pool was filtered to remove non-UGV and other irrelevant content, and the remaining videos were subsequently annotated as real or fake. The dataset contains videos from three major video sharing platforms: YouTube, Facebook, and Twitter.

Extended dataset

The extended dataset was formed through a largely automatic, systematic process that combines text search and near-duplicate video retrieval, followed by manual annotation using a set of guidelines. More specifically:

Total dataset

The overall dataset consists of 3957 videos annotated as fake and 2458 annotated as real.

Facebook videos that were relevant to the dataset but were published by individual users (and thus could not be accessed through the API) were excluded from this dataset.

How to access the dataset

The video URLs, the associated annotations (fake/real), and the near-duplicate video URLs are contained in CSV files.
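
As a rough illustration of working with such annotation files, the following Python sketch reads URL/label pairs and counts the labels. The column names `video_url` and `label` are assumptions made for this example, not the documented layout of the dataset files, and the sample rows are made up; check the header row of the actual CSV files before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; the real CSV headers may differ.
URL_COLUMN = "video_url"
LABEL_COLUMN = "label"


def load_annotations(csv_file):
    """Return a list of (url, label) pairs read from an open CSV file object.

    Rows with an empty URL or a label other than fake/real are skipped.
    """
    reader = csv.DictReader(csv_file)
    rows = []
    for row in reader:
        url = (row.get(URL_COLUMN) or "").strip()
        label = (row.get(LABEL_COLUMN) or "").strip().lower()
        if url and label in {"fake", "real"}:
            rows.append((url, label))
    return rows


# Made-up sample standing in for one of the dataset files.
sample = io.StringIO(
    "video_url,label\n"
    "https://example.com/video/1,fake\n"
    "https://example.com/video/2,real\n"
    "https://example.com/video/3,FAKE\n"
)
annotations = load_annotations(sample)
label_counts = Counter(label for _, label in annotations)
print(label_counts)
```

To read a file from disk instead of the in-memory sample, pass `open(path, newline="", encoding="utf-8")` to `load_annotations`.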

To obtain metadata about videos, you may use the respective platform API.
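
Since the dataset spans YouTube, Facebook, and Twitter, a first step before calling any API is deciding which platform a URL belongs to. The sketch below does this by hostname. The host list and the `detect_platform` helper are assumptions made for illustration; they are not part of the dataset or of any platform API, and the sketch makes no API calls itself.

```python
from urllib.parse import urlparse

# Assumed mapping from hostnames to platforms; extend as needed.
PLATFORM_HOSTS = {
    "youtube.com": "youtube",
    "youtu.be": "youtube",
    "facebook.com": "facebook",
    "fb.watch": "facebook",
    "twitter.com": "twitter",
}


def detect_platform(url):
    """Return 'youtube', 'facebook', 'twitter', or None for an unknown host."""
    host = (urlparse(url).hostname or "").lower()
    for domain, platform in PLATFORM_HOSTS.items():
        # Match the bare domain and any subdomain such as www. or m.
        if host == domain or host.endswith("." + domain):
            return platform
    return None


# Placeholder URLs, not taken from the dataset.
examples = [
    "https://www.youtube.com/watch?v=VIDEO_ID",
    "https://www.facebook.com/somepage/videos/123/",
    "https://twitter.com/someuser/status/123",
    "https://example.com/clip",
]
platforms = [detect_platform(u) for u in examples]
print(platforms)
```

Each group of URLs can then be passed to the corresponding platform's API client, subject to that platform's access terms.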

If you encounter any issues in this process, please get in touch with Olga Papadopoulou.

License and acknowledgement

The video dataset is provided under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

The video dataset is supported by the InVID project, which is funded by the European Commission under contract number 687786.

If you use this video dataset for your research, please include a citation to the following paper: Papadopoulou, O., Zampoglou, M., Papadopoulos, S., & Kompatsiaris, Y. (2018). A Corpus of Debunked and Verified User-Generated Videos. Online Information Review. Accepted for publication.