Enhancing Data Annotation for Student Models: A Self-Training Approach with Large Language Models

Gilad Fuchs, Alex Nus, Yotam Eshel, Bracha Shapira

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In this paper we introduce a methodology that utilizes Large Language Models (LLMs) for efficient data annotation, which is essential for training student models. We focus specifically on classification tasks and assess the impact of various annotation methods. We employ LLMs, such as GPT-4 and 7B-parameter open-source models, as teacher models to facilitate the annotation process, and we then evaluate the efficiency of smaller, more deployment-efficient student models such as BERT in production environments. We present two alternative methods: the first uses a cascade of LLM distillation, starting from GPT-4 and followed by a 7B-parameter LLM and BERT; the second demonstrates the feasibility of annotating data without relying on a proprietary model like GPT-4, by exclusively using a 7B-parameter open-source model together with self-training. This strategic use of smaller LLMs in conjunction with smaller student models presents an efficient and cost-effective solution for enhancing classification tasks, offering insights into the potential of utilizing large-scale language models in E-Commerce production settings.
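The following is a minimal, illustrative sketch of the teacher-student annotation idea described in the abstract: an LLM "teacher" pseudo-labels unlabeled text, and a BERT "student" is fine-tuned on those labels. It is not the authors' code; the model names, label set, and the llm_annotate() stand-in are assumptions made for the example.

```python
# Minimal sketch (assumptions noted): an LLM teacher pseudo-labels text,
# and a BERT student is fine-tuned on the pseudo-labels.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

LABELS = ["not_defective", "defective"]  # assumed binary e-commerce task


def llm_annotate(text: str) -> int:
    # Stand-in for a real teacher-LLM call (e.g. GPT-4 or a 7B open model).
    # A trivial keyword rule keeps the sketch self-contained and runnable.
    return 1 if "broken" in text.lower() else 0


# Unlabeled product texts pseudo-labeled by the teacher.
unlabeled_texts = ["great phone case, fits perfectly",
                   "arrived broken and scratched"]
records = [{"text": t, "label": llm_annotate(t)} for t in unlabeled_texts]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS))

# Tokenize the pseudo-labeled data for the student.
train_ds = Dataset.from_list(records).map(
    lambda r: tokenizer(r["text"], truncation=True, padding="max_length",
                        max_length=64))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="student-bert", num_train_epochs=1,
                           per_device_train_batch_size=8, report_to="none"),
    train_dataset=train_ds,
)
trainer.train()  # the student now approximates the teacher's annotations
```

In the paper's cascade variant, the teacher role above would first be played by GPT-4 and then by a 7B-parameter LLM before training the BERT student; in the self-training variant, only the 7B open-source model provides the annotations.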

Original language: English
Title of host publication: WWW Companion 2025 - Companion Proceedings of the ACM Web Conference 2025
Publisher: Association for Computing Machinery, Inc.
Pages: 2706-2712
Number of pages: 7
ISBN (Electronic): 9798400713316
DOIs
State: Published - 23 May 2025
Event: 34th ACM Web Conference, WWW Companion 2025 - Sydney, Australia
Duration: 28 Apr 2025 - 2 May 2025

Publication series

Name: WWW Companion 2025 - Companion Proceedings of the ACM Web Conference 2025

Conference

Conference: 34th ACM Web Conference, WWW Companion 2025
Country/Territory: Australia
City: Sydney
Period: 28/04/25 - 02/05/25

Keywords

  • Data Annotation
  • Large Language Models
  • Model Distillation

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software
