TY - GEN
T1 - Incorporating Context into Subword Vocabularies
AU - Yehezkel, Shaked
AU - Pinter, Yuval
N1 - Publisher Copyright: © 2023 Association for Computational Linguistics.
PY - 2023/1/1
Y1 - 2023/1/1
N2 - Most current popular subword tokenizers are trained based on word frequency statistics over a corpus, without considering information about co-occurrence or context. Nevertheless, the resulting vocabularies are used in language models' highly contextualized settings. We present SAGE, a tokenizer that tailors subwords for their downstream use by baking in the contextualized signal at the vocabulary creation phase. We show that SAGE does a better job than current widespread tokenizers in keeping token contexts cohesive, while not incurring a large price in terms of encoding efficiency or domain robustness. SAGE improves performance on English GLUE classification tasks as well as on NER, and on Inference and NER in Turkish, demonstrating its robustness to language properties such as morphological exponence and agglutination.
UR - http://www.scopus.com/inward/record.url?scp=85159853566&partnerID=8YFLogxK
U2 - 10.18653/v1/2023.eacl-main.45
DO - 10.18653/v1/2023.eacl-main.45
M3 - Conference contribution
AN - SCOPUS:85159853566
T3 - EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference
SP - 623
EP - 635
BT - EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference
PB - Association for Computational Linguistics (ACL)
T2 - 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023
Y2 - 2 May 2023 through 6 May 2023
ER -