Lost in Space Marking

Research output: Working paper/Preprint

Abstract

We look at a decision taken early in training a subword tokenizer, namely whether it should be the word-initial token that carries a special mark, or the word-final one. Based on surface-level considerations of efficiency and cohesion, as well as morphological coverage, we find that a Unigram LM tokenizer trained on pre-tokenized English text is better off marking the word-initial token, while one trained on raw text benefits from marking word ends. Our findings generalize across domains.
Original language: English
Publisher: arXiv
State: Published - 2 Aug 2022

Keywords

  • cs.CL
