This tutorial covers recent advances in the semantic processing of language, specifically entity and event recognition. We then go deeper into the joint processing of language and visual data, where language data serve as weak annotations for recognizing information in images and video (e.g., recognition of persons and events). We also consider examples where visual data provide context for disambiguating language. The methods covered include unsupervised alignment techniques as well as weakly supervised graphical models, structured support vector machines, and neural networks. A substantial focus is on representation learning and the use of topic models and neural embeddings. We study how the recognized semantics can be integrated into retrieval models (such as vector space and language models for retrieval). We discuss tasks, algorithms, and evaluation.

The tutorial consists of five parts:
1. Introduction and problem setting
2. Semantic processing of language
3. Semantic processing of multimedia
4. Integration of semantic recognitions in retrieval models
5. Ideas for future research

Lecturer: Professor Marie-Francine Moens