Web News Article Dataset
There are three datasets of annotated 'News Articles' web pages.
The first dataset (ArticlesNewsSitesData.7z) contains the URLs and annotations for 1043 web pages/articles retrieved from three well-known news sites (BBC, The Guardian, and Reuters). The web pages were annotated with the following four topics from the IPTC news codes taxonomy: Economy-Business-Finance, Lifestyle-Leisure, Science-Technology, and Sports. Each article is assigned to exactly one topic. If you use this dataset in your research, please cite the following article:
Liparas, D., HaCohen-Kerner, Y., Moumtzidou, A., Vrochidis, S., & Kompatsiaris, I. (2014). News articles classification using Random Forests and weighted multimodal features. In Information Retrieval Facility Conference (pp. 63-75). Springer International Publishing.
The second dataset (ArticlesNewsSitesData_2382.7z) contains the URLs and annotations for 2382 web pages/articles retrieved from several sites. The web pages were annotated with the following six topics from the IPTC news codes taxonomy: Nature_Environment, Politics, Science_Technology, Economy-Business_Finance, Health, and Lifestyle_leisure. Each article is assigned to exactly one topic. If you use this dataset in your research, please cite the following article:
Liparas, D., HaCohen-Kerner, Y., Moumtzidou, A., Vrochidis, S., & Kompatsiaris, I. (2014). News articles classification using Random Forests and weighted multimodal features. In Information Retrieval Facility Conference (pp. 63-75). Springer International Publishing.
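The topic names in the two descriptions above are spelled with different separators and capitalisation (for example, Economy-Business-Finance versus Economy-Business_Finance). If the two datasets are combined, it may help to normalise the labels first. The following is a minimal sketch, assuming the annotation files use the spellings given here; normalize_topic is a hypothetical helper, not part of the datasets.

```python
import re

# Topic labels exactly as spelled in the two dataset descriptions.
FIRST_DATASET_TOPICS = [
    "Economy-Business-Finance", "Lifestyle-Leisure",
    "Science-Technology", "Sports",
]
SECOND_DATASET_TOPICS = [
    "Nature_Environment", "Politics", "Science_Technology",
    "Economy-Business_Finance", "Health", "Lifestyle_leisure",
]

def normalize_topic(label: str) -> str:
    """Hypothetical helper: collapse hyphens, underscores and spaces
    into a single underscore and lower-case the result."""
    return re.sub(r"[-_\s]+", "_", label.strip()).lower()

# Topics that appear in both datasets once the spelling is normalised.
shared_topics = sorted(
    {normalize_topic(t) for t in FIRST_DATASET_TOPICS}
    & {normalize_topic(t) for t in SECOND_DATASET_TOPICS}
)
print(shared_topics)
```

Under this normalisation, three topics (economy/business/finance, lifestyle/leisure, and science/technology) are common to both datasets.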
The third dataset (WikiRef_dataset.zip) covers more specific topics, each of which refers to a particular Wikipedia page. The selected topics of the WikiRef220 dataset (with the number of articles per topic) are:
Paris Attacks November 2015 (36), Barack Obama (5), Premier League (37), Cypriot Financial Crisis 2012-2013 (5), Rolling Stones (1), Debt Crisis in Greece (5), Samsung Galaxy S5 (35), Greek Elections June 2012 (5), smartphone (5), Malaysia Airlines Flight 370 (39), Stephen Hawking (1), Michelle Obama (38), Tohoku earthquake and tsunami (5), NBA draft (1), U2 (1), Wall Street (1). The topics Barack Obama, Cypriot Financial Crisis 2012-2013, Rolling Stones, Debt Crisis in Greece, Greek Elections June 2012, smartphone, Stephen Hawking, Tohoku earthquake and tsunami, NBA draft, U2, and Wall Street each appear no more than 5 times and are therefore regarded as noise.
The remaining 5 topics of WikiRef220 are:
• Paris Attacks November 2015 [1]
• Premier League [2]
• Malaysia Airlines Flight 370 [3]
• Samsung Galaxy S5 [4]
• Michelle Obama [5]
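The split between the five main topics and the noise topics can be reproduced from the per-topic counts listed above. This is a minimal sketch: the counts are copied from this description, while NOISE_MAX and split_topics are illustrative names, not part of the dataset.

```python
# Article counts per topic, copied from the WikiRef220 description.
TOPIC_COUNTS = {
    "Paris Attacks November 2015": 36,
    "Barack Obama": 5,
    "Premier League": 37,
    "Cypriot Financial Crisis 2012-2013": 5,
    "Rolling Stones": 1,
    "Debt Crisis in Greece": 5,
    "Samsung Galaxy S5": 35,
    "Greek Elections June 2012": 5,
    "smartphone": 5,
    "Malaysia Airlines Flight 370": 39,
    "Stephen Hawking": 1,
    "Michelle Obama": 38,
    "Tohoku earthquake and tsunami": 5,
    "NBA draft": 1,
    "U2": 1,
    "Wall Street": 1,
}

NOISE_MAX = 5  # topics with at most this many articles count as noise

def split_topics(counts, noise_max=NOISE_MAX):
    """Return (main, noise) dictionaries according to the threshold."""
    main = {t: n for t, n in counts.items() if n > noise_max}
    noise = {t: n for t, n in counts.items() if n <= noise_max}
    return main, noise

main_topics, noise_topics = split_topics(TOPIC_COUNTS)
total_articles = sum(TOPIC_COUNTS.values())

print(total_articles)             # all articles in WikiRef220
print(sorted(main_topics))        # the five non-noise topics
print(sum(noise_topics.values())) # articles regarded as noise
```

With these counts, the 16 topics sum to 220 articles, of which 185 belong to the five main topics and 35 to the 11 noise topics.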
The WikiRef186 dataset (4 topics) is WikiRef220 without 34 documents related to “Malaysia Airlines Flight 370”, and the WikiRef150 dataset (3 topics) is WikiRef186 without the 36 documents related to “Paris Attacks”.
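The names of the two subsets follow from their sizes, which can be checked with a short calculation. This sketch uses only the figures stated above; the variable names are illustrative.

```python
# Size of the full dataset and the removals described in the text.
WIKIREF220_SIZE = 220
REMOVALS = [
    # (resulting dataset, topic whose documents are removed, how many)
    ("WikiRef186", "Malaysia Airlines Flight 370", 34),
    ("WikiRef150", "Paris Attacks", 36),
]

sizes = {"WikiRef220": WIKIREF220_SIZE}
current = WIKIREF220_SIZE
for name, topic, removed in REMOVALS:
    current -= removed      # each subset is derived from the previous one
    sizes[name] = current

print(sizes)
```

The results are 220 - 34 = 186 and 186 - 36 = 150 documents, matching the dataset names.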
References
1. https://en.wikipedia.org/wiki/November_2015_Paris_attacks
2. https://en.wikipedia.org/wiki/Premier_League
3. https://en.wikipedia.org/wiki/Malaysia_Airlines_Flight_370
4. https://en.wikipedia.org/wiki/Samsung_Galaxy_S5
5. https://en.wikipedia.org/wiki/Michelle_Obama
If you use the third dataset, please cite:
Gialampoukidis, I., Vrochidis, S., & Kompatsiaris, I. (2016). A Hybrid Framework for News Clustering Based on the DBSCAN-Martingale and LDA. In Machine Learning and Data Mining in Pattern Recognition (pp. 170-184). Springer International Publishing.