TY - GEN
T1 - Greed is All You Need: An Evaluation of Tokenizer Inference Methods
T2 - 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
AU - Uzan, Omri
AU - Schmidt, Craig W.
AU - Tanner, Chris
AU - Pinter, Yuval
N1 - Publisher Copyright: © 2024 Association for Computational Linguistics.
PY - 2024/1/1
Y1 - 2024/1/1
AB - While subword tokenizers such as BPE and WordPiece are typically used to build vocabularies for NLP models, the method of decoding text into a sequence of tokens from these vocabularies is often left unspecified, or ill-suited to the method in which they were constructed. We provide a controlled analysis of seven tokenizer inference methods across four different algorithms and three vocabulary sizes, performed on a novel intrinsic evaluation suite we curated for English, combining measures rooted in morphology, cognition, and information theory. We show that for the most commonly used tokenizers, greedy inference performs surprisingly well; and that SaGe, a recently-introduced contextually-informed tokenizer, outperforms all others on morphological alignment.
UR - http://www.scopus.com/inward/record.url?scp=85203816588&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85203816588
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 813
EP - 822
BT - Short Papers
A2 - Ku, Lun-Wei
A2 - Martins, Andre F. T.
A2 - Srikumar, Vivek
PB - Association for Computational Linguistics (ACL)
Y2 - 11 August 2024 through 16 August 2024
ER -