10. February 2016
Who am I? And if so, why?
Named Entity Recognition in MULTISENSOR
What is it?
‘Named Entity’ is a funny name, invented by computational scientists, basically referring to proper names. Proper names differ from other words, or concepts, of a language in that they have a unique referent in the real world: There is just one ‘Eiffel Tower’, just one ‘Barack Obama’, just one ‘third symphony of Beethoven’ etc. (Details are left to philosophers.) And while normal words of a language are collected in dictionaries (like Collins or Oxford), proper names are collected, and described, in encyclopaedias (like Encyclopaedia Britannica or Wikipedia).
Because they provide a point of reference to the real world in a discourse, Named Entities are interesting for the computational treatment of texts. The challenge, however, is that there are so many of them.
In a multilingual context, it can be observed that many names are invariant across languages (except for different scripts of course): ‘Los Angeles’ is found in English, Spanish, French or German texts. Others, however, do change (e.g. ‘Nürnberg’ in German; ‘Nuremberg’ in English; ‘Norimberga’ in Italian; ‘Núremberg’ in Spanish).
What types of named-entities are there?
Named Entities can be grouped into classes of semantic types. Some texts are interested in chemical substances, others look for genomes. Some look for bank transfer details, others for airports, others for temporal expressions; all this depends on a profile of interest under which a text is looked at.
A frequently used set of Named Entity types, also used in MULTISENSOR, is designed for the analysis of news articles. It covers persons (person names), locations (place names), organisations (be it private or public: ‘EU Commission’, ‘British Telecom’, ‘Camorra’), dates, prices (like ‘250M€’), and the like.
The task of a Named Entity recogniser is not just to find out that something is a name but also to assign a semantic type to it.
How to identify Named Entities?
For decades, computational linguists have tried to identify Named Entities in texts, as they interfere with the standard text analysis. The standard procedure for text analysis is to split a given text into sentences and tokens (single words), then to assign information items to these tokens (like: part of speech, semantic type), mostly by lookup in dictionaries (called ‘gazetteers’ by these computational scientists), and then to try to build more complex structures from these elements. For Named Entities, several strategies are used:
- Lexicon lookup: Special proper name lexicons contain e.g.: the most important person names, the names of the countries and biggest cities of the world, the most important companies, government institutions, football teams etc.; often supplemented by Named Entities of the special domain which the recogniser intends to cover (like in MULTISENSOR: all off-shore windparks of Europe). Such lexicons are several thousand entries strong. However, their intrinsic weakness is that they can never be complete.
- Contextual analysis: Like humans, machines can make use of certain indicators found in the context of an unknown string (i.e. a NE candidate): In the context of ‘Peter Skiggs’ or ‘Dr. Skiggs’, the string ‘Skiggs’ denotes a person, while it denotes a company in ‘Peter Skiggs Ltd.’, and a place in the context of ‘He lives in Skiggs’. Indicators like ‘Institute of ___’ tend to denote institutions; ‘west of ___’ often denotes a place name.
- Such contextual indicators can be collected by corpus analysis, and represented in a recognition grammar, containing rules like ‘if there is an unknown string after an academic title then this unknown string is likely to be a person name’ (e.g. ‘Dr. Skiggs’). The list of possible academic titles, like all indicators, is then stored in the proper name lexicon.
- Recent technologies apply machine learning techniques, whereby the machines are supposed to find ‘rules’ like the ones just mentioned by themselves, analysing (partly) annotated corpus data. They report comparable recognition results; the effort is moved from writing the grammars to annotating the corpora. Once such corpora are available, the remaining effort is significantly lower for learning techniques than for hand-written grammars. The MULTISENSOR Named Entity Recogniser is a hybrid of those two approaches. Moreover, it is designed in such a way that it can be extended into a part of a more sophisticated text analysis, defining relations between such entities and other text elements.
- A special phenomenon to be considered in NE recognition is coreference: Different wordings refer to the same object. A language offers several means of referring to an object which was previously introduced in the discourse, like: ‘We met Peter Skiggs …. Skiggs had been previously known … He was the father of Donald Skaggs, the later US president …’. We know that the second ‘Skiggs’ must be a person, following the first occurrence. We also know that the ‘father_of’ relation to ‘Donald Skaggs’ refers to ‘Skiggs’, because of the pronoun. It is important for building a good representation of the discourse structure to understand that in all three sentences, the same unique referent is addressed, namely ‘Peter Skiggs’.
Coreference analysis, including pronoun resolution, is one of the most difficult analysis topics; many systems restrict their scope to the recognition of coreference based on common name parts, for example reducing ‘Ford Motor Company Inc.’, ‘Ford Motors’ and ‘Ford’ to just ‘Ford’, or linking ‘International Energy Agency’ to ‘IEA’.
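To make the idea of coreference by common name parts more concrete, here is a minimal sketch in Python. It is not the MULTISENSOR implementation; the function names, the list of legal suffixes and the matching rule are assumptions chosen only for illustration.

```python
# Hypothetical list of legal-form words that are ignored when comparing names.
LEGAL_SUFFIXES = {"inc", "ltd", "company", "corp", "gmbh"}


def name_tokens(mention):
    """Lower-cased tokens of a mention, without punctuation and legal suffixes."""
    tokens = [t.strip(".,").lower() for t in mention.split()]
    return [t for t in tokens if t and t not in LEGAL_SUFFIXES]


def acronym(mention):
    """Initial letters of all capitalised words, e.g. for matching 'IEA'."""
    return "".join(word[0] for word in mention.split() if word[0].isupper())


def corefer(full, short):
    """Heuristic: does `short` look like a shorter variant of `full`?"""
    if short == acronym(full):
        return True
    full_tokens, short_tokens = name_tokens(full), name_tokens(short)
    if not short_tokens or len(short_tokens) > len(full_tokens):
        return False
    # Each token of the short form must agree with the token at the same
    # position in the full form, allowing variants like 'motor'/'motors'.
    return all(
        s.startswith(f) or f.startswith(s)
        for s, f in zip(short_tokens, full_tokens)
    )


print(corefer("Ford Motor Company Inc.", "Ford Motors"))     # True
print(corefer("Ford Motor Company Inc.", "Ford"))            # True
print(corefer("International Energy Agency", "IEA"))         # True
print(corefer("Ford Motor Company Inc.", "Ford Foundation")) # False
```

Like every heuristic of this kind, the sketch can be fooled: two different organisations that share their leading name parts would be merged.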
It should be mentioned that all these recognition strategies are just heuristics, and run the risk of failing in some contexts: ‘He lived in Napoleonic times’ meets the location-indicating pattern ‘[live in ____]’ but ‘Napoleonic’ is not a location.
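The lexicon lookup and the contextual rules described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the MULTISENSOR recogniser: the tiny lexicon, the three regular-expression rules and the type labels are made up for the example.

```python
import re

# Hypothetical mini-lexicon (gazetteer); real ones hold thousands of entries.
LEXICON = {
    "Los Angeles": "LOCATION",
    "British Telecom": "ORGANISATION",
}

# A candidate is approximated here as one capitalised word.
NAME = r"([A-Z][a-z]+)"

# Hypothetical contextual rules: the indicator around the candidate
# decides which semantic type is assigned.
RULES = [
    (re.compile(r"\b(?:Dr|Prof)\.\s+" + NAME), "PERSON"),
    (re.compile(r"\b" + NAME + r"\s+Ltd\."), "ORGANISATION"),
    (re.compile(r"\bliv(?:e|es|ed)\s+in\s+" + NAME), "LOCATION"),
]


def recognise(text):
    """Return (string, type) pairs found by lexicon lookup and rules."""
    found = []
    for name, entity_type in LEXICON.items():
        if name in text:
            found.append((name, entity_type))
    for pattern, entity_type in RULES:
        for match in pattern.finditer(text):
            found.append((match.group(1), entity_type))
    return found


print(recognise("Dr. Skiggs moved to Los Angeles."))
# [('Los Angeles', 'LOCATION'), ('Skiggs', 'PERSON')]
print(recognise("Peter Skiggs Ltd. is growing."))
# [('Skiggs', 'ORGANISATION')]
print(recognise("He lived in Napoleonic times"))
# [('Napoleonic', 'LOCATION')]  <- a false hit, as discussed above
```

The last call shows the heuristic failing in exactly the way described: the pattern fires although ‘Napoleonic’ is not a place.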
Recognition quality
Named Entity recognisers fight against two types of errors: not identifying candidates which in fact are named entities, and assigning a Named Entity status to candidates which in fact are not (like the ‘Napoleonic’ above). The first factor is measured by recall (How many of all the true named entities have been found?), the second one by precision (How many of the found ones are correct?).
The challenge is to increase both scores. Often, relaxation of recognition heuristics increases recall (as more candidates are found) but lowers precision (as many false hits are produced). This would not count as a step ahead.
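As a small worked example, the two scores can be computed as follows. The entity sets are invented for illustration and are not MULTISENSOR results; `gold` stands for the names a human annotator marked, `strict` and `relaxed` for the output of two hypothetical recogniser settings.

```python
def precision_recall(found, gold):
    """Precision and recall of a set of found names against a gold standard."""
    found, gold = set(found), set(gold)
    correct = len(found & gold)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented example data.
gold = {"Peter Skiggs", "Los Angeles", "British Telecom", "EU Commission"}
strict = {"Peter Skiggs", "Los Angeles"}
relaxed = {"Peter Skiggs", "Los Angeles", "British Telecom", "Napoleonic", "West"}

print(precision_recall(strict, gold))   # (1.0, 0.5)
print(precision_recall(relaxed, gold))  # (0.6, 0.75)
```

Relaxing the heuristics finds one more true entity (recall rises from 0.5 to 0.75) but also produces two false hits (precision drops from 1.0 to 0.6), which is the trade-off described above.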
While in research, recognition rates of 0.9 and higher for recall and precision are reported, in industrial setups, rates between 0.75 and 0.85 are more common. Such results are difficult to generalise and depend on many factors: language, in-domain vs. out-of-domain texts, text types, training data size, test corpus selection, etc.
Applications
There are a number of applications for Named Entity recognisers. Some of them are the following:
Information Extraction and Opinion Mining: Press clippings, whereby all media articles on specific persons, companies, products etc. are collected, rely strongly on Named Entity recognisers. ‘Electrolux in China’, ‘development of offshore windparks’ and ‘dairy products in Thailand’ are examples of topics in MULTISENSOR which need support by Named Entity recognisers. Opinion Mining would inspect the relevant contexts to find out which opinion a given text expresses with regard to a specific entity: ‘Electrolux fails to save jobs in Orange’.
Text Understanding and Context Linking: Based on such results, information fusion would take the results of such analysis steps, merge them into a coherent picture, and generate abstracts and reports from there. This is a direction towards which MULTISENSOR wants to make a contribution.
Summary
So in future texts, all candidates for Named Entities will have to ask themselves:
Who am I, and if so, why?
Photo Credit:
- Cover image “Who am I” from iStockPhoto, licence for publication by Linguatec GmbH
- All other images are taken from Pixabay.com and have been published under CC0 Public Domain