Datasets
This page contains the open-source datasets that have been created and used in the MULTISENSOR Project.
Feel free to download and use the datasets as you like. We’d love to get your feedback.
No. | Title | Description | Creator |
---|---|---|---|
01 | WikiRef220 | 220 news articles, which are references to specific Wikipedia pages. The selected topics of the WikiRef220 dataset (and the number of articles per topic) are:
Paris Attacks November 2015 (36), Barack Obama (5), Premier League (37), Cypriot Financial Crisis 2012-2013 (5), Rolling Stones (1), Debt Crisis in Greece (5), Samsung Galaxy S5 (35), Greek Elections June 2012 (5), smartphone (5), Malaysia Airlines Flight 370 (39), Stephen Hawking (1), Michelle Obama (38), Tohoku earthquake and tsunami (5), NBA draft (1), U2 (1), Wall Street (1). The topics Barack Obama, Cypriot Financial Crisis 2012-2013, Rolling Stones, Debt Crisis in Greece, Greek Elections June 2012, smartphone, Stephen Hawking, Tohoku earthquake and tsunami, NBA draft, U2 and Wall Street appear no more than 5 times each and are therefore regarded as noise. The remaining 5 topics of WikiRef220 are: Paris Attacks November 2015 (36), Premier League (37), Samsung Galaxy S5 (35), Malaysia Airlines Flight 370 (39) and Michelle Obama (38).
The WikiRef186 dataset (4 topics) is WikiRef220 without 34 documents related to “Malaysia Airlines Flight 370”, and the WikiRef150 dataset (3 topics) is WikiRef186 without the 36 documents related to “Paris Attacks”. If you use this dataset, please cite: Gialampoukidis, I., Vrochidis, S., & Kompatsiaris, I. (2016). A Hybrid Framework for News Clustering Based on the DBSCAN-Martingale and LDA. In Machine Learning and Data Mining in Pattern Recognition (pp. 170-184). Springer International Publishing. |
CERTH |
02 | WikiRef150 | 150 web news articles, which are references to specific Wikipedia pages, so as to ensure reliable ground-truth. The selected topics and the corresponding number of articles per topic are:
If you use this dataset, please cite: Gialampoukidis, I., Vrochidis, S., & Kompatsiaris, I. (2016). A Hybrid Framework for News Clustering Based on the DBSCAN-Martingale and LDA. In Machine Learning and Data Mining in Pattern Recognition (pp. 170-184). Springer International Publishing |
CERTH |
03 | ArticlesNewsSitesData_1043 | 1043 web pages/articles retrieved from three well-known news sites (BBC, The Guardian and Reuters), annotated with the following four topics found in the IPTC news codes taxonomy:
It should be noted that the articles are classified to a single topic. If you use this dataset in your research, please cite the following article: D. Liparas, Y. Hacohen-Kerner, A. Moumtzidou, S. Vrochidis and I. Kompatsiaris, “News articles classification using Random Forests and weighted multimodal features”, 3rd Open Interdisciplinary MUMIA Conference and 7th Information Retrieval Facility Conference (IRFC2014), Copenhagen, Denmark, November 10-12, 2014. |
CERTH |
04 | ArticlesNewsSitesData_2382 | 2382 web pages/articles retrieved from several sites. The web pages were annotated with the following six topics found in the IPTC news codes taxonomy:
It should be noted that the articles are classified to a single topic. |
CERTH |
05 | NewsArticlesData_12073 | 12073 news articles retrieved from several sites. The news articles were annotated with the following six topics found in the IPTC news codes taxonomy:
It should be noted that the articles are classified to a single topic. |
CERTH |
06 | YahooNewsQualityDataset | The News Quality Dataset provides over 500 news articles annotated with 14 editorial quality aspects. | EURECAT |
07 | Event_Detection_Dataset_MS | This dataset is the example set for the Multimedia concept and event detection available on the code-page.
The dataset contains 106 videos from news reports, categorised into nine concepts/events. Keyframes are extracted for concept and event detection; the dataset contains 2826 keyframes in total. DCNN features are extracted from the keyframes using the Caffe models trained in the work of Markatopoulou et al. (2016). For each concept/event, a random balanced split divides the videos into three chunks, and three-fold cross-validation (CV) is performed, using two chunks for training and the remaining chunk for testing. The classification algorithm used in this code is an SVM, whose “c” parameter is tuned using grid search. The output of this module is the per-concept/event evaluation on videos in terms of accuracy and F-score. |
CERTH |
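
The evaluation protocol described for Event_Detection_Dataset_MS (a balanced split into three chunks, training on two and testing on the third, then reporting accuracy and F-score) can be illustrated with a minimal, standard-library-only sketch. This is not the project's code: all function names are hypothetical, a single concept/event is assumed to be encoded as binary labels (1 = present, 0 = absent), and `majority_class` is a placeholder that stands in for the SVM with grid search, which is left out here.

```python
import random
from collections import defaultdict


def balanced_chunks(labels, n_chunks=3, seed=0):
    """Split sample indices into n_chunks, spreading each class evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chunks = [[] for _ in range(n_chunks)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the chunks.
        for position, index in enumerate(indices):
            chunks[position % n_chunks].append(index)
    return chunks


def accuracy_and_f_score(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    denominator = 2 * tp + fp + fn
    f_score = 2 * tp / denominator if denominator else 0.0
    return accuracy, f_score


def three_fold_cv(features, labels, fit_predict, seed=0):
    """Train on two chunks and test on the third, once per chunk."""
    chunks = balanced_chunks(labels, n_chunks=3, seed=seed)
    scores = []
    for held_out in range(3):
        test_idx = chunks[held_out]
        train_idx = [i for k, chunk in enumerate(chunks)
                     if k != held_out for i in chunk]
        predictions = fit_predict(
            [features[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [features[i] for i in test_idx],
        )
        truth = [labels[i] for i in test_idx]
        scores.append(accuracy_and_f_score(truth, predictions))
    return scores


def majority_class(train_x, train_y, test_x):
    """Placeholder classifier (NOT an SVM): predicts the most common label."""
    majority = 1 if 2 * sum(train_y) >= len(train_y) else 0
    return [majority] * len(test_x)


if __name__ == "__main__":
    # Toy data, invented for illustration: 6 positive and 12 negative videos.
    toy_labels = [1] * 6 + [0] * 12
    toy_features = [[float(i)] for i in range(len(toy_labels))]
    for fold, (acc, f1) in enumerate(
            three_fold_cv(toy_features, toy_labels, majority_class)):
        print(f"fold {fold}: accuracy={acc:.3f}, F-score={f1:.3f}")
```

To approximate the described setup more closely, `fit_predict` could be swapped for a function that trains an SVM and selects its “c” parameter by grid search on the training chunks only, and the loop could be repeated once per concept/event.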