The Bag-of-words model has been the dominant approach for IR and Text mining for many years assuming the word independence and the frequencies as the main feature for feature selection and for query to document similarity. Although the long and successful usage, bag-of-words ignores words’ order and distance within the document – weakening thus the expressive power of the distance metrics. We propose graph-of-word, an alternative approach that capitalizes on a graph representation of documents and challenges the word independence assumption by taking into account words’ order and distance. We applied graph-of-word in various tasks such as ad-hoc Information Retrieval, Single-Document Keyword Extraction, Text Categorization and Sub-event Detection in Textual Streams. In all cases the the graph of word approach, assisted by degeneracy at times, outperforms the state of the art base lines in all cases.
Lecturer: PROFESSOR MICHALIS VAZIRGIANNIS