
Text2Topic: Multi-Label Text Classification System for Efficient Topic Detection in User Generated Content with Zero-Shot Capabilities

  • Fengjun Wang
  • Moran Beladev
  • Ofri Kleinfeld
  • Elina Frayerman
  • Tal Shachar
  • Eran Fainman
  • Karen Lastmann Assaraf
  • Sarai Mizrachi
  • Benjamin Wang

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

10 Scopus citations

Abstract

Multi-label text classification is a critical task in the industry. It helps to extract structured information from large amounts of textual data. We propose Text to Topic (Text2Topic), which achieves high multi-label classification performance by employing a Bi-Encoder Transformer architecture that utilizes concatenation, subtraction, and multiplication of embeddings on both text and topic. Text2Topic also supports zero-shot predictions, produces domain-specific text embeddings, and enables production-scale batch inference with high throughput. The final model achieves accurate and comprehensive results compared to state-of-the-art baselines, including large language models (LLMs). In this study, a total of 239 topics are defined, and around 1.6 million text-topic pair annotations (of which 200K are positive) are collected on approximately 120K texts from 3 main data sources on Booking.com. The data is collected with optimized smart sampling and partial labeling. The final Text2Topic model is deployed on a real-world stream processing platform, and it outperforms other models with a 92.9% micro mAP score and a 75.8% macro mAP score. We summarize the modeling choices, which are extensively tested through ablation studies, and share detailed in-production decision-making steps.
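
As a rough illustration of the embedding combination named in the abstract, the sketch below builds a single feature vector from a text embedding and a topic embedding by concatenating the two vectors with their element-wise difference and product. The function name `combine_embeddings`, the ordering of the four parts, the use of a signed (rather than absolute) difference, and the toy values are all assumptions made for this example; the abstract does not specify them, and this is not the authors' implementation.

```python
from typing import List


def combine_embeddings(text_vec: List[float], topic_vec: List[float]) -> List[float]:
    """Return [u, v, u - v, u * v] for two embeddings of equal dimension.

    Hypothetical helper: the layout of the combined vector is an assumption.
    """
    if len(text_vec) != len(topic_vec):
        raise ValueError("embeddings must have the same dimension")
    # Element-wise subtraction and multiplication of the two embeddings.
    diff = [u - v for u, v in zip(text_vec, topic_vec)]
    prod = [u * v for u, v in zip(text_vec, topic_vec)]
    # Concatenate the originals with the two interaction vectors.
    return list(text_vec) + list(topic_vec) + diff + prod


# Toy 3-dimensional embeddings (made-up values, not taken from the paper).
text_embedding = [0.5, -1.0, 2.0]
topic_embedding = [1.5, 1.0, 0.5]

features = combine_embeddings(text_embedding, topic_embedding)
print(len(features))  # four parts of dimension 3 each
```

In a setup like this, the combined vector would typically be passed to a small classification head that scores one text-topic pair at a time, so a text can receive several topics independently. That downstream step is likewise an assumption here, not something stated in the abstract.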

Original language: English
Title of host publication: EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Industry Track
Editors: Mingxuan Wang, Imed Zitouni
Publisher: Association for Computational Linguistics (ACL)
Pages: 93-103
Number of pages: 11
ISBN (Electronic): 9788891760684
State: Published - 1 Jan 2023
Externally published: Yes
Event: 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, EMNLP 2023 - Singapore, Singapore
Duration: 6 Dec 2023 to 10 Dec 2023

Publication series

Name: EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Industry Track

Conference

Conference: 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, EMNLP 2023
Country/Territory: Singapore
City: Singapore
Period: 6/12/23 to 10/12/23

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems
