MSc Thesis Defense - Department of Computer Science - Ásmundur Alma Guðjónsson

Named Entity Recognition for Icelandic: Comparing and combining different Machine Learning methods

  • 15.1.2021, 13:00 - 14:00

On Friday, January 15th 2021, Ásmundur Alma Guðjónsson will defend his 60 ECTS thesis in Language Technology.

Candidate: Ásmundur Alma Guðjónsson
Title: Named Entity Recognition for Icelandic: Comparing and combining different Machine Learning methods
Supervisor: Dr. Hrafn Loftsson, Associate Professor, Reykjavík University, Iceland
Date and Time: January 15th 2021 at 13:00 on Zoom: https://eu01web.zoom.us/j/64296349550

Abstract: Named Entity Recognition (NER) is the task of identifying person names, places, organizations, and other named entities in text. It can also cover numerical entities such as dates, amounts of money, and percentages. NER is a subtask of Information Extraction and is often an important step in other Natural Language Processing tasks, such as question answering or machine translation.

A neural model for NER has already been implemented for Icelandic (NeuroNER), but this is, as far as we know, the only previous machine learning model for the task in Icelandic. The goal of this project was to develop other machine learning methods that could then be compared with the neural model, in order to give a clearer picture of the state of NER for Icelandic and to help the task move forward.

The first model chosen was a semi-supervised model that combines shallow language features with unsupervised word clusters (ixa-pipes). The second was a Conditional Random Field model (CRF) that uses word features as well as gazetteers. These two models, together with the neural model, were then combined in a single NER system in which a vote between the three decides the output (CombiTagger). We trained these methods on training sets of varying sizes, but evaluated them on the same fixed set in all experiments.
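
As a rough illustration of the voting step, the sketch below combines per-token tags from three taggers by simple majority vote. It is not CombiTagger's actual code: the function name `majority_vote`, the rule of falling back to one designated model when no tag wins outright, and the example sentence and tags are all assumptions made for this example.

```python
from collections import Counter


def majority_vote(predictions, fallback=0):
    """Combine tag sequences from several taggers by simple voting.

    predictions: list of tag sequences, one per model, all of equal length.
    fallback: index of the model whose tag is used when the top tags tie
              (an assumed tie-breaking rule, not taken from the thesis).
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all models must tag the same number of tokens")

    combined = []
    for tags in zip(*predictions):  # the tags proposed for one token
        counts = Counter(tags).most_common()
        top_tag, top_count = counts[0]
        # A tie occurs when the runner-up has as many votes as the leader.
        tied = len(counts) > 1 and counts[1][1] == top_count
        combined.append(tags[fallback] if tied else top_tag)
    return combined


# Invented example: three hypothetical outputs for one five-token sentence.
tokens = ["Anna", "works", "at", "Nasdaq", "Iceland"]
neural = ["B-PER", "O", "O", "B-ORG", "I-ORG"]
crf = ["B-PER", "O", "O", "B-ORG", "O"]
ixa = ["O", "O", "O", "B-LOC", "I-ORG"]

voted = majority_vote([neural, crf, ixa])
for token, tag in zip(tokens, voted):
    print(f"{token}\t{tag}")
```

With three voters a tie can only arise when all three disagree, so the fallback matters rarely; placing the strongest single model at the fallback index is one plausible choice.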

These methods were then tested on a dataset we created from texts provided by Nasdaq Iceland. These texts consist mostly of news announcements and corporate reports, and serve to test how the models perform in a real-world scenario. Because this data differs considerably from the training data, it also measures how well the models generalize what they have learned.

Our evaluation shows that non-neural models such as the CRF and ixa-pipes models can come very close to the performance of a neural model like NeuroNER when tested on a dataset from the same corpus as the training data. However, when tested on the Nasdaq data, the non-neural models fell behind; the neural model seems to generalize better. We also showed that a system like CombiTagger, which combines models through a simple voting scheme, can perform better than the individual models it combines: CombiTagger obtained an F1-score of 86.18 on our test set, which at this time would be the best published result of any NER system for Icelandic.



Please note that at events hosted by Reykjavik University (RU), photographs and videos are taken which may be used for RU marketing purposes. Read more about this on ru.is or send an e-mail to personuvernd@ru.is