MSc Thesis Defense - Department of Computer Science - Svanhvít Lilja Ingólfsdóttir
Named Entity Recognition for Icelandic: Annotated Corpus and Neural Models
On Monday, 15 June 2020, Svanhvít Lilja Ingólfsdóttir will defend her 60 ECTS thesis in Language Technology.
Candidate: Svanhvít Lilja Ingólfsdóttir
Supervisor: Dr. Hrafn Loftsson, Associate Professor, Department of Computer Science
Title: Named Entity Recognition for Icelandic: Annotated Corpus and Neural Models
Date and Time: June 15th at 10:00 in Room M104
Abstract: Named entity recognition (NER) is the task of automatically extracting and classifying the names of people, places, companies, etc. from text. NER is an important preprocessing step in various language technology tasks, such as question answering, speech recognition, search engine optimization and data anonymization, but it can prove difficult, especially in highly inflected languages like Icelandic. Named entity recognizers are usually trained on text corpora in which the named entities have been annotated, but no such corpus has been available for Icelandic.
In this thesis, we present the first annotated NER corpus for Icelandic, along with neural models trained on the data. The corpus, containing over 48,000 named entities in one million tokens, was annotated with eight named entity types using a semi-automatic approach, and then manually reviewed. A bidirectional LSTM recurrent neural network was trained on the annotated corpus, using pre-trained word embeddings as external input. We report an F1 score of 83.65% over all eight entity types for the model trained on the whole corpus.
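
For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision P and recall R, F1 = 2PR / (P + R). The abstract does not state the tagging scheme or the exact evaluation protocol, so the following is only a minimal sketch under two assumptions: BIO-style tags and exact-match scoring at the entity level. The function names and the labels "Person" and "Location" are illustrative and are not taken from the thesis.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type)
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (assumed scheme)."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        if tag.startswith("B-") or tag.startswith("I-"):
            # Lenient choice: an "I-" tag without a matching "B-" opens a span.
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match entity-level F1: a prediction counts only if both
    its boundaries and its type agree with a gold entity."""
    gold_spans = extract_spans(gold)
    pred_spans = extract_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example with hypothetical labels: the prediction finds one of the
# two gold entities, so precision is 1.0 and recall is 0.5.
gold_tags = ["B-Person", "I-Person", "O", "B-Location"]
pred_tags = ["B-Person", "I-Person", "O", "O"]
score = entity_f1(gold_tags, pred_tags)
```

In this toy case the score works out to 2/3. Scoring at the entity level, unlike per-token accuracy, does not reward a model for the many tokens that lie outside any name.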