Disentangling homonyms- using artificial neural networks to separate the cream from the crop in large text corpora

Uri Roll, Ricardo Correia, Oded Berger-Tal

Research output: Contribution to conference, Paper, peer-review


Recent years have seen a great influx of scientific publications, as well as other text corpora, that are used for conservation research. This surge holds much promise for advancing science, but also presents new challenges. A major challenge in using this wealth of information is how to sort through it efficiently and retain only the relevant parts. Homonyms - terms that share spelling but differ in meaning - present a particular challenge in this respect, as they carry no inherent information that can aid in their classification across narratives. This issue is relevant to a wide range of conservation culturomics studies, as homonyms add considerable noise to results, and this noise cannot be easily identified. In this work we constructed a semi-automated approach that can aid in classifying homonyms between narratives. We used a combination of automated content analysis and artificial neural networks to quickly and accurately sift through large corpora of academic texts and classify them into distinct topics. As an example, we explore the use of the word 'reintroduction' in academic texts. In the conservation context, reintroduction denotes the release of organisms into their former native habitat; however, an 'ISI' search using this word returns thousands of publications that use the term with other meanings and in other contexts. Using our method, we were able to quickly and correctly classify thousands of academic texts as conservation-related or unrelated, with more than 99% accuracy. Our approach can be easily applied to any other homonym and can greatly facilitate sorting data in cases where homonyms hinder the harnessing of large text corpora. Beyond homonyms, we see great promise in combining automated content analysis and machine learning methods for handling and screening big data for relevant information.
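The abstract does not specify the text features, network architecture, or software used, so the following is only a minimal sketch of the general idea: represent each text by the words it contains and train a small artificial neural network to separate conservation-related from unrelated uses of 'reintroduction'. The binary bag-of-words features, the single hidden layer, the hyperparameters, all function names, and the toy sentences are assumptions made for illustration; none of them come from the study.

```python
import math
import random
import re

# Hypothetical toy corpus (invented for illustration, not data from the study).
# Label 1 = conservation sense of "reintroduction", 0 = any other sense.
TRAIN = [
    ("reintroduction of wolves to their native habitat", 1),
    ("species reintroduction after captive breeding and release", 1),
    ("reintroduction success of endangered birds in the wild", 1),
    ("habitat quality affects reintroduction of native species", 1),
    ("reintroduction of the drug after withdrawal in patients", 0),
    ("gluten reintroduction in the diet of patients", 0),
    ("reintroduction of the tax policy by parliament", 0),
    ("drug reintroduction and patients treatment outcome", 0),
]


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocab(texts):
    """Map every distinct token in the training texts to a feature index."""
    words = sorted({w for t in texts for w in tokenize(t)})
    return {w: i for i, w in enumerate(words)}


def vectorize(text, vocab):
    """Binary bag-of-words vector: 1.0 if the word occurs in the text."""
    vec = [0.0] * len(vocab)
    for w in tokenize(text):
        if w in vocab:
            vec[vocab[w]] = 1.0
    return vec


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(model, x):
    """Return hidden activations and the output probability for one input."""
    W1, b1, W2, b2 = model
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    out = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, out


def train(X, y, hidden=4, epochs=500, lr=0.5, seed=0):
    """Train a one-hidden-layer network with plain stochastic gradient descent."""
    rng = random.Random(seed)
    n = len(X[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            h, out = forward((W1, b1, W2, b2), x)
            d_out = out - t  # cross-entropy gradient at the output unit
            for j in range(hidden):
                # Back-propagate through hidden unit j before updating W2[j].
                d_h = d_out * W2[j] * h[j] * (1.0 - h[j])
                W2[j] -= lr * d_out * h[j]
                for i in range(n):
                    W1[j][i] -= lr * d_h * x[i]
                b1[j] -= lr * d_h
            b2 -= lr * d_out
    return W1, b1, W2, b2


def classify(model, vocab, text):
    """Return 1 for the conservation sense, 0 otherwise."""
    _, out = forward(model, vectorize(text, vocab))
    return 1 if out >= 0.5 else 0


vocab = build_vocab([text for text, _ in TRAIN])
X = [vectorize(text, vocab) for text, _ in TRAIN]
y = [label for _, label in TRAIN]
model = train(X, y)

for text in ["reintroduction of native species to habitat",
             "reintroduction of the drug in patients"]:
    print(classify(model, vocab, text), text)
```

A real application would need a labelled sample of actual abstracts, a held-out test set to estimate accuracy, and most likely an established machine learning library rather than hand-written training code.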
Original language: English
State: Published - 12 Jun 2018
Event: 5th European Congress of Conservation Biology - University of Jyväskylä, Finland
Duration: 12 Jun 2018 to 15 Jun 2018


Conference: 5th European Congress of Conservation Biology

