TY - UNPB
T1 - Lost in Space Marking
AU - Jacobs, Cassandra L.
AU - Pinter, Yuval
N1 - Submission to SIGMORPHON 2021
PY - 2022/8/2
Y1 - 2022/8/2
N2 - We look at a decision taken early in training a subword tokenizer, namely whether it should be the word-initial token that carries a special mark, or the word-final one. Based on surface-level considerations of efficiency and cohesion, as well as morphological coverage, we find that a Unigram LM tokenizer trained on pre-tokenized English text is better off marking the word-initial token, while one trained on raw text benefits from marking word ends. Our findings generalize across domains.
KW - cs.CL
M3 - Preprint
BT - Lost in Space Marking
PB - arXiv
ER -