PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data

Roei Herzig, Ofir Abramovich, Elad Ben Avraham, Assaf Arbelle, Leonid Karlinsky, Ariel Shamir, Trevor Darrell, Amir Globerson

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

Action recognition models have achieved impressive results by incorporating scene-level annotations, such as objects, their relations, 3D structure, and more. However, obtaining annotations of scene structure for videos requires a significant amount of effort to gather and annotate, making these methods expensive to train. In contrast, synthetic datasets generated by graphics engines provide powerful alternatives for generating scene-level annotations across multiple tasks. In this work, we propose an approach to leverage synthetic scene data for improving video understanding. We present a multi-task prompt learning approach for video transformers, where a shared video transformer backbone is enhanced by a small set of specialized parameters for each task. Specifically, we add a set of "task prompts", each corresponding to a different task, and let each prompt predict task-related annotations. This design allows the model to capture information shared among synthetic scene tasks as well as information shared between synthetic scene tasks and a real video downstream task throughout the entire network. We refer to this approach as "Promptonomy", since the prompts model task-related structure. We propose the PromptonomyViT model (PViT), a video transformer that incorporates various types of scene-level information from synthetic data using the "Promptonomy" approach. PViT shows strong performance improvements on multiple video understanding tasks and datasets. Project page: https://ofir1080.github.io/PromptonomyViT
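
The task-prompt idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single self-attention layer standing in for the full video transformer, and the class name `TinyPromptedEncoder`, the task names `depth` and `normals`, and all dimensions are hypothetical choices made for illustration. It shows only the structural point: one learnable prompt token per task is processed together with the video patch tokens by shared weights, and each prompt's output is decoded by its own small head.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class TinyPromptedEncoder:
    """One shared self-attention layer plus one prompt and one head per task.

    Hypothetical stand-in for a video transformer backbone (illustration only).
    """

    def __init__(self, dim, task_out_dims, seed=0):
        rng = np.random.default_rng(seed)
        scale = dim ** -0.5
        self.dim = dim
        # Shared backbone weights, applied to every token (prompt or patch).
        self.wq = rng.normal(0.0, scale, (dim, dim))
        self.wk = rng.normal(0.0, scale, (dim, dim))
        self.wv = rng.normal(0.0, scale, (dim, dim))
        # Task-specific parameters: one prompt vector and one linear head each.
        self.prompts = {t: rng.normal(0.0, 0.02, (dim,)) for t in task_out_dims}
        self.heads = {t: rng.normal(0.0, scale, (dim, out))
                      for t, out in task_out_dims.items()}

    def forward(self, patch_tokens):
        names = list(self.prompts)
        prompt_tokens = np.stack([self.prompts[t] for t in names])      # (T, D)
        tokens = np.concatenate([prompt_tokens, patch_tokens], axis=0)  # (T+N, D)
        q, k, v = tokens @ self.wq, tokens @ self.wk, tokens @ self.wv
        attn = softmax(q @ k.T / np.sqrt(self.dim))  # prompts and patches attend to each other
        out = tokens + attn @ v                      # residual connection
        # Each prompt's output token is decoded by that task's own head.
        preds = {t: out[i] @ self.heads[t] for i, t in enumerate(names)}
        video_feats = out[len(names):]               # features for the downstream video task
        return preds, video_feats


# Usage: 8 patch tokens of width 16, two hypothetical auxiliary tasks.
enc = TinyPromptedEncoder(dim=16, task_out_dims={"depth": 4, "normals": 3})
patches = np.random.default_rng(1).normal(size=(8, 16))
preds, feats = enc.forward(patches)
print({t: p.shape for t, p in preds.items()}, feats.shape)
```

In a training setup along the lines the abstract describes, the per-task predictions would be supervised with synthetic scene annotations while the video features feed the real downstream task; the losses and training schedule are left out of this sketch.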

Original language: English
Title of host publication: Proceedings - 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024
Publisher: Institute of Electrical and Electronics Engineers
Pages: 6789-6801
Number of pages: 13
ISBN (Electronic): 9798350318920
State: Published - 3 Jan 2024
Externally published: Yes
Event: 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024 - Waikoloa, United States
Duration: 4 Jan 2024 → 8 Jan 2024

Publication series

Name: Proceedings - 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024

Conference

Conference: 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024
Country/Territory: United States
City: Waikoloa
Period: 4/01/24 → 8/01/24

Keywords

  • Algorithms
  • Image recognition and understanding
  • Video recognition and understanding

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Computer Vision and Pattern Recognition
