Will it unblend?

Yuval Pinter, Cassandra L. Jacobs, Jacob Eisenstein

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

2 Scopus citations

Abstract

Natural language processing systems often struggle with out-of-vocabulary (OOV) terms, which do not appear in training data. Blends, such as innoventor, are one particularly challenging class of OOV, as they are formed by fusing together two or more bases that relate to the intended meaning in unpredictable manners and degrees. In this work, we run experiments on a novel dataset of English OOV blends to quantify the difficulty of interpreting the meanings of blends by large-scale contextual language models such as BERT. We first show that BERT’s processing of these blends does not fully access the component meanings, leaving their contextual representations semantically impoverished. We find this is mostly due to the loss of characters resulting from blend formation. Then, we assess how easily different models can recognize the structure and recover the origin of blends, and find that context-aware embedding systems outperform character-level and context-free embeddings, although their results are still far from satisfactory.

Original languageEnglish
Title of host publicationFindings of the Association for Computational Linguistics Findings of ACL
Subtitle of host publicationEMNLP 2020
PublisherAssociation for Computational Linguistics (ACL)
Pages1525-1535
Number of pages11
ISBN (Electronic)9781952148903
StatePublished - 1 Jan 2020
Externally publishedYes
EventFindings of the Association for Computational Linguistics, ACL 2020: EMNLP 2020 - Virtual, Online
Duration: 16 Nov 202020 Nov 2020

Publication series

NameFindings of the Association for Computational Linguistics Findings of ACL: EMNLP 2020

Conference

ConferenceFindings of the Association for Computational Linguistics, ACL 2020: EMNLP 2020
CityVirtual, Online
Period16/11/2020/11/20

Fingerprint

Dive into the research topics of 'Will it unblend?'. Together they form a unique fingerprint.

Cite this