To help structure the lectures, I will take the canonical architecture of a modern search engine as my starting point and organize the important concepts, results, and key research issues around it. A fundamental distinction that we will encounter is between the offline and online stages of this architecture. Stages that receive their input sequentially are said to operate in an online modality. The key difference from the offline (or “batch”) stages is that in online stages we update our knowledge after the arrival of every new datapoint, whereas offline techniques are used when we have access to all training examples at once. Online approaches can be used for a process that unfolds over time, for example the evolving search session of an individual user, in which case a ranker might update as time goes on and we observe more and more of the user’s queries and interactions.
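
To make the offline/online distinction concrete, here is a minimal sketch in Python. It is not taken from the lectures: the names `batch_mean`, `OnlineMean`, and `run_online`, and the toy click data, are illustrative assumptions. It estimates a click-through rate once in batch, with all observations available, and once online, updating the estimate after every new observation.

```python
from typing import Iterable, List


def batch_mean(values: List[float]) -> float:
    """Offline ("batch"): all datapoints are available at once."""
    return sum(values) / len(values)


class OnlineMean:
    """Online: the estimate is updated after every new datapoint."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def update(self, value: float) -> float:
        self.count += 1
        # Incremental update: move the current estimate towards the new value.
        self.mean += (value - self.mean) / self.count
        return self.mean


def run_online(stream: Iterable[float]) -> List[float]:
    """Return the sequence of estimates produced as the stream arrives."""
    estimator = OnlineMean()
    return [estimator.update(v) for v in stream]


# Hypothetical click signals (1.0 = click, 0.0 = no click) for one result.
clicks = [1.0, 0.0, 0.0, 1.0]
estimates = run_online(clicks)
```

In this sketch the online estimator has a usable, if noisy, estimate after every interaction, and once the whole stream has been seen its final value agrees with the batch result up to floating-point error.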

The first lecture will focus on the offline stages of the canonical information retrieval architecture. These include crawling, document enrichment, and the aggregation of external sources related to a given document (anchor texts, click features, …).

In the second lecture, I will focus on the online stages of the canonical information retrieval architecture. These include query auto-completion, query understanding, ranking and retrieval, and result page generation. Because the search engine does not know the whole input in advance, during the online stages we are forced to make decisions that may later turn out not to be optimal; a solid experimental framework is essential for informing and controlling this decision-making process.

Lecturer: Professor Maarten de Rijke