TY - GEN
T1 - Enhancing Data Annotation for Student Models
T2 - 34th ACM Web Conference, WWW Companion 2025
AU - Fuchs, Gilad
AU - Nus, Alex
AU - Eshel, Yotam
AU - Shapira, Bracha
N1 - Publisher Copyright:
© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2025/5/23
Y1 - 2025/5/23
N2 - In this paper we introduce a methodology that utilizes Large Language Models (LLMs) for efficient data annotation, which is essential for training student models. We focus specifically on classification tasks, assessing the impact of various annotation methods. We employ LLMs, such as GPT-4 and 7B-parameter open-source models, as teacher models to facilitate the annotation process. We then evaluate the efficiency of smaller, more deployment-efficient student models such as BERT in production environments. We present two alternative methods: one using a cascade of LLM distillation, starting from GPT-4 and followed by a 7B-parameter LLM and BERT; and a second method demonstrating the feasibility of annotating data without relying on a proprietary model such as GPT-4, by exclusively using a 7B-parameter open-source model and leveraging self-training methods. This strategic use of smaller LLMs in conjunction with smaller student models presents an efficient and cost-effective solution for enhancing classification tasks, offering insights into the potential of utilizing large-scale language models in E-Commerce production settings.
AB - In this paper we introduce a methodology that utilizes Large Language Models (LLMs) for efficient data annotation, which is essential for training student models. We focus specifically on classification tasks, assessing the impact of various annotation methods. We employ LLMs, such as GPT-4 and 7B-parameter open-source models, as teacher models to facilitate the annotation process. We then evaluate the efficiency of smaller, more deployment-efficient student models such as BERT in production environments. We present two alternative methods: one using a cascade of LLM distillation, starting from GPT-4 and followed by a 7B-parameter LLM and BERT; and a second method demonstrating the feasibility of annotating data without relying on a proprietary model such as GPT-4, by exclusively using a 7B-parameter open-source model and leveraging self-training methods. This strategic use of smaller LLMs in conjunction with smaller student models presents an efficient and cost-effective solution for enhancing classification tasks, offering insights into the potential of utilizing large-scale language models in E-Commerce production settings.
KW - Data Annotation
KW - Large Language Models
KW - Model Distillation
UR - https://www.scopus.com/pages/publications/105009236249
U2 - 10.1145/3701716.3717862
DO - 10.1145/3701716.3717862
M3 - Conference contribution
AN - SCOPUS:105009236249
T3 - WWW Companion 2025 - Companion Proceedings of the ACM Web Conference 2025
SP - 2706
EP - 2712
BT - WWW Companion 2025 - Companion Proceedings of the ACM Web Conference 2025
PB - Association for Computing Machinery, Inc
Y2 - 28 April 2025 through 2 May 2025
ER -