Lecturer: Dr. Fabrizio Sebastiani

Text Classification (TC) is a basic enabling technology in today's information retrieval (IR), since many text-related prediction tasks can be framed in terms of classification. As a result, scores of applications (ranging from webpage/website classification under folksonomies to author identification for texts of uncertain authorship) have a TC engine under the hood. Modern text classification methods rely on supervised machine learning; according to this paradigm, a general-purpose learning algorithm learns the characteristics a text should have in order to be classified under class X, by analysing a set of texts that a human has previously classified as belonging or not belonging to X. This tutorial will discuss the main steps in the construction of a text classifier, from the generation of vectorial representations of the texts, to training a classifier from examples, to evaluating its accuracy on benchmark datasets.
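
The three steps named above (representing texts as vectors, training a classifier from labelled examples, and evaluating its accuracy) can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the choice of a multinomial Naive Bayes learner with add-one smoothing, the whitespace tokenizer, the function names, and the tiny "sport" vs. "politics" dataset are all invented for this example and are not taken from the tutorial itself.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace: a deliberately crude bag-of-words tokenizer."""
    return text.lower().split()

def train_naive_bayes(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(text, model):
    """Return the class with the highest log-probability under the model."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][tok] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, docs, labels):
    """Fraction of documents whose predicted class matches the true class."""
    correct = sum(classify(d, model) == y for d, y in zip(docs, labels))
    return correct / len(docs)

# Toy data, invented for illustration only (not a benchmark dataset).
train_docs = [
    "the match ended with a late goal",
    "the striker scored a goal",
    "the parliament passed the budget",
    "the minister presented the budget vote",
]
train_labels = ["sport", "sport", "politics", "politics"]
test_docs = ["a late goal decided the match", "the parliament vote on the budget"]
test_labels = ["sport", "politics"]

model = train_naive_bayes(train_docs, train_labels)
test_accuracy = accuracy(model, test_docs, test_labels)
print(test_accuracy)
```

In a realistic setting the test set would be a held-out portion of a benchmark collection, and accuracy would typically be complemented by measures such as precision and recall.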

Until 15 years ago, text classification was almost synonymous with “classification by topic”, i.e., classifying textual documents according to what they are about. More recently, the classification of texts according to dimensions other than topic (e.g., by language, as in language identification; by author, as in authorship attribution) has also been investigated. The most important among these dimensions is certainly sentiment, as when classifying a product review according to whether it expresses a positive or a negative opinion towards its topic. Sentiment classification is an instance of a more general task called “opinion mining”, which encompasses all tasks having to do with the analysis of text according to the sentiments and opinions expressed therein. The key difference between classification by topic and classification by sentiment lies in the way vectorial representations of the texts are generated. This tutorial will explore this difference by discussing the text representation techniques adopted in state-of-the-art sentiment classification systems, with particular emphasis on systems that tackle text arising within social media.
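
To make the point about representation concrete, here is a small sketch of one feature-generation step often used for sentiment but rarely needed for topic: marking words that fall within the scope of a negation, so that "like" and "not like" yield different features. This is an assumption chosen for illustration, not necessarily a technique covered in the tutorial; the negator list, the `_NEG` suffix, and the function name are hypothetical.

```python
# Hypothetical word lists, kept tiny for illustration.
NEGATORS = {"not", "no", "never"}
CLAUSE_END = {".", ",", ";", "!", "?"}

def mark_negation(tokens):
    """Append '_NEG' to each token between a negation word and the next punctuation mark."""
    marked, negated = [], False
    for tok in tokens:
        if tok in CLAUSE_END:
            negated = False          # punctuation closes the negation scope
            marked.append(tok)
        elif tok in NEGATORS:
            negated = True           # open a negation scope
            marked.append(tok)
        else:
            marked.append(tok + "_NEG" if negated else tok)
    return marked

# Pre-tokenized toy review, invented for this example.
review = "i did not like this phone , but the battery is good .".split()
features = mark_negation(review)
print(features)
```

With a plain bag-of-words representation, "like" would count as the same feature whether or not it is negated; for topic classification that loss hardly matters, whereas for sentiment it can flip the polarity of the text.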