Datasets | Multisensor Project EU

Datasets

This page contains the open-source datasets that have been created and used in the MULTISENSOR Project.

Feel free to download and use the datasets as you like. We’d love to get your feedback.

No.	Title	Description	Creator
01	WikiRef220	220 news articles that are references to specific Wikipedia pages. The selected topics of the WikiRef220 dataset (and the number of articles per topic) are: Paris Attacks November 2015 (36), Barack Obama (5), Premier League (37), Cypriot Financial Crisis 2012-2013 (5), Rolling Stones (1), Debt Crisis in Greece (5), Samsung Galaxy S5 (35), Greek Elections June 2012 (5), smartphone (5), Malaysia Airlines Flight 370 (39), Stephen Hawking (1), Michelle Obama (38), Tohoku earthquake and tsunami (5), NBA draft (1), U2 (1), Wall Street (1). The topics Barack Obama, Cypriot Financial Crisis 2012-2013, Rolling Stones, Debt Crisis in Greece, Greek Elections June 2012, smartphone, Stephen Hawking, Tohoku earthquake and tsunami, NBA draft, U2 and Wall Street each appear no more than 5 times and are therefore regarded as noise. The remaining 5 topics of WikiRef220 are: Paris Attacks November 2015, Premier League, Malaysia Airlines Flight 370, Samsung Galaxy S5 and Michelle Obama. The WikiRef186 dataset (4 topics) is WikiRef220 without 34 documents related to “Malaysia Airlines Flight 370”, and the WikiRef150 dataset (3 topics) is WikiRef186 without the 36 documents related to “Paris Attacks”. If you use this dataset, please cite: Gialampoukidis, I., Vrochidis, S., & Kompatsiaris, I. (2016). A Hybrid Framework for News Clustering Based on the DBSCAN-Martingale and LDA. In Machine Learning and Data Mining in Pattern Recognition (pp. 170-184). Springer International Publishing.	CERTH
02	WikiRef150	150 web news articles, which are references to specific Wikipedia pages, so as to ensure reliable ground truth. The selected topics and the corresponding number of articles per topic are: Barack Obama (5), Premier League (37), Cypriot Financial Crisis 2013 (5), Rolling Stones (1), Debt Crisis in Greece (5), Samsung Galaxy S5 (35), Greek Elections June 2012 (5), smartphone (5), Malaysia Airlines Flight 370 (5), Stephen Hawking (1), Michelle Obama (38), Tohoku earthquake and tsunami (5), NBA draft (1), U2 (1), Wall Street (1). If you use this dataset, please cite: Gialampoukidis, I., Vrochidis, S., & Kompatsiaris, I. (2016). A Hybrid Framework for News Clustering Based on the DBSCAN-Martingale and LDA. In Machine Learning and Data Mining in Pattern Recognition (pp. 170-184). Springer International Publishing.	CERTH
03	ArticlesNewsSitesData_1043	1043 web pages/articles retrieved from three well-known news sites (BBC, The Guardian and Reuters), annotated with the following four topics found in the IPTC news codes taxonomy: Economy-Business-Finance, Lifestyle-Leisure, Science-Technology and Sports. Each article is assigned to a single topic. If you use this dataset in your research, please cite the following article: D. Liparas, Y. Hacohen-Kerner, A. Moumtzidou, S. Vrochidis and I. Kompatsiaris, “News articles classification using Random Forests and weighted multimodal features”, 3rd Open Interdisciplinary MUMIA Conference and 7th Information Retrieval Facility Conference (IRFC2014), Copenhagen, Denmark, November 10-12, 2014.	CERTH
04	ArticlesNewsSitesData_2382	2382 web pages/articles retrieved from several sites. The web pages were annotated with the following six topics found in the IPTC news codes taxonomy: Nature_Environment, Politics, Science_Technology, Economy_Business_Finance, Health and Lifestyle_leisure. Each article is assigned to a single topic.	CERTH
05	NewsArticlesData_12073	12073 news articles retrieved from several sites. The news articles were annotated with the following six topics found in the IPTC news codes taxonomy: Nature_Environment, Politics, Science_Technology, Economy_Business_Finance, Health and Lifestyle_Leisure. Each article is assigned to a single topic.	CERTH
06	YahooNewsQualityDataset	The News Quality Dataset provides over 500 news articles annotated with 14 editorial quality aspects.	EURECAT
07	Event_Detection_Dataset_MS	This dataset is the example set for the multimedia concept and event detection code available on the code page. The dataset contains 106 videos from news reports, categorised into nine concepts/events. Keyframes are extracted for the concept and event detection; the dataset contains 2826 keyframes in total. DCNN features are extracted from the keyframes based on the Caffe models trained in the work of (Markatopoulou et al., 2016). For each concept/event, the videos are divided into three chunks by a random balanced split, and three-fold cross-validation is performed, using two chunks for training and the remaining chunk for testing. The classification algorithm used in this code is an SVM, whose “c” parameter is tuned using grid search. The output of this module is the evaluation per concept/event on the videos in terms of accuracy and F-score.	CERTH
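
The relationship between WikiRef220, WikiRef186 and WikiRef150 (datasets 01 and 02) can be checked with a little arithmetic. The following is a minimal Python sketch, not part of the datasets themselves: the topic counts are copied from the description of dataset 01, while the names `WIKIREF220_COUNTS`, `split_topics` and `remove_documents` are hypothetical and chosen only for illustration.

```python
# Topic counts for WikiRef220, copied from the description of dataset 01.
WIKIREF220_COUNTS = {
    "Paris Attacks November 2015": 36,
    "Barack Obama": 5,
    "Premier League": 37,
    "Cypriot Financial Crisis 2012-2013": 5,
    "Rolling Stones": 1,
    "Debt Crisis in Greece": 5,
    "Samsung Galaxy S5": 35,
    "Greek Elections June 2012": 5,
    "smartphone": 5,
    "Malaysia Airlines Flight 370": 39,
    "Stephen Hawking": 1,
    "Michelle Obama": 38,
    "Tohoku earthquake and tsunami": 5,
    "NBA draft": 1,
    "U2": 1,
    "Wall Street": 1,
}

# Topics with at most this many articles are regarded as noise.
NOISE_THRESHOLD = 5


def split_topics(counts, threshold=NOISE_THRESHOLD):
    """Return (main_topics, noise_topics) as two dicts (hypothetical helper)."""
    main = {topic: n for topic, n in counts.items() if n > threshold}
    noise = {topic: n for topic, n in counts.items() if n <= threshold}
    return main, noise


def remove_documents(counts, topic, n_removed):
    """Return a copy of counts with n_removed fewer documents for topic."""
    if n_removed > counts[topic]:
        raise ValueError("cannot remove more documents than the topic has")
    reduced = dict(counts)
    reduced[topic] -= n_removed
    if reduced[topic] == 0:
        del reduced[topic]  # the topic disappears entirely
    return reduced


# WikiRef186: WikiRef220 without 34 Malaysia Airlines Flight 370 documents.
wikiref186 = remove_documents(WIKIREF220_COUNTS, "Malaysia Airlines Flight 370", 34)
# WikiRef150: WikiRef186 without the 36 Paris Attacks documents.
wikiref150 = remove_documents(wikiref186, "Paris Attacks November 2015", 36)

for name, counts in [("WikiRef220", WIKIREF220_COUNTS),
                     ("WikiRef186", wikiref186),
                     ("WikiRef150", wikiref150)]:
    main, _ = split_topics(counts)
    print(name, sum(counts.values()), "articles,", len(main), "main topics")
# WikiRef220 220 articles, 5 main topics
# WikiRef186 186 articles, 4 main topics
# WikiRef150 150 articles, 3 main topics
```

After the 34 documents are removed, 5 articles on Malaysia Airlines Flight 370 remain, which matches the count listed for WikiRef150 and moves that topic into the noise group.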
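
The evaluation protocol described for dataset 07 (three chunks, two for training and one for testing, scored by accuracy and F-score) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the project's code: it assumes that "balanced" means each chunk receives a similar share of positive and negative videos, that each concept/event is evaluated as a binary problem, and that the F-score is F1. The SVM and the grid search over "c" are deliberately left out, and the function names `balanced_three_fold` and `accuracy_and_f_score` are hypothetical.

```python
import random
from collections import defaultdict


def balanced_three_fold(labels, seed=0):
    """Yield (train_indices, test_indices) for a balanced three-fold split.

    labels holds one binary label per video (1 = concept/event present).
    Assumption: "balanced" means each chunk gets a similar share of each class.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    chunks = [[], [], []]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal each class round-robin so the chunks stay balanced.
        for position, index in enumerate(indices):
            chunks[position % 3].append(index)

    for held_out in range(3):
        test = sorted(chunks[held_out])
        train = sorted(
            index
            for k, chunk in enumerate(chunks)
            if k != held_out
            for index in chunk
        )
        yield train, test


def accuracy_and_f_score(y_true, y_pred):
    """Return (accuracy, F1) for binary labels; F1 is 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denominator = 2 * tp + fp + fn
    f_score = 2 * tp / denominator if denominator else 0.0
    return accuracy, f_score


# Toy labels for one concept: 6 positive and 12 negative videos.
toy_labels = [1] * 6 + [0] * 12
for train, test in balanced_three_fold(toy_labels):
    positives_in_test = sum(toy_labels[i] for i in test)
    print(len(train), len(test), positives_in_test)
```

In a full pipeline, a classifier would be trained on the `train` indices of each fold and its predictions on the `test` indices passed to `accuracy_and_f_score`; the per-fold scores would then be averaged per concept/event.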

Legal & Funding Information

For further information regarding responsibility for the content of this website please click here

This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no 610411.