TY - GEN
T1 - PromptonomyViT
T2 - 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024
AU - Herzig, Roei
AU - Abramovich, Ofir
AU - Ben Avraham, Elad
AU - Arbelle, Assaf
AU - Karlinsky, Leonid
AU - Shamir, Ariel
AU - Darrell, Trevor
AU - Globerson, Amir
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024/1/3
Y1 - 2024/1/3
N2 - Action recognition models have achieved impressive results by incorporating scene-level annotations, such as objects, their relations, 3D structure, and more. However, obtaining annotations of scene structure for videos requires a significant amount of effort to gather and annotate, making these methods expensive to train. In contrast, synthetic datasets generated by graphics engines provide powerful alternatives for generating scene-level annotations across multiple tasks. In this work, we propose an approach to leverage synthetic scene data for improving video understanding. We present a multi-task prompt learning approach for video transformers, where a shared video transformer backbone is enhanced by a small set of specialized parameters for each task. Specifically, we add a set of "task prompts", each corresponding to a different task, and let each prompt predict task-related annotations. This design allows the model to capture information shared among synthetic scene tasks as well as information shared between synthetic scene tasks and a real video downstream task throughout the entire network. We refer to this approach as "Promptonomy", since the prompts model task-related structure. We propose the PromptonomyViT model (PViT), a video transformer that incorporates various types of scene-level information from synthetic data using the "Promptonomy" approach. PViT shows strong performance improvements on multiple video understanding tasks and datasets. Project page: https://ofir1080.github.io/PromptonomyViT
AB - Action recognition models have achieved impressive results by incorporating scene-level annotations, such as objects, their relations, 3D structure, and more. However, obtaining annotations of scene structure for videos requires a significant amount of effort to gather and annotate, making these methods expensive to train. In contrast, synthetic datasets generated by graphics engines provide powerful alternatives for generating scene-level annotations across multiple tasks. In this work, we propose an approach to leverage synthetic scene data for improving video understanding. We present a multi-task prompt learning approach for video transformers, where a shared video transformer backbone is enhanced by a small set of specialized parameters for each task. Specifically, we add a set of "task prompts", each corresponding to a different task, and let each prompt predict task-related annotations. This design allows the model to capture information shared among synthetic scene tasks as well as information shared between synthetic scene tasks and a real video downstream task throughout the entire network. We refer to this approach as "Promptonomy", since the prompts model task-related structure. We propose the PromptonomyViT model (PViT), a video transformer that incorporates various types of scene-level information from synthetic data using the "Promptonomy" approach. PViT shows strong performance improvements on multiple video understanding tasks and datasets. Project page: https://ofir1080.github.io/PromptonomyViT
KW - Algorithms
KW - Image recognition and understanding
KW - Video recognition and understanding
UR - http://www.scopus.com/inward/record.url?scp=85191944123&partnerID=8YFLogxK
U2 - 10.1109/WACV57701.2024.00666
DO - 10.1109/WACV57701.2024.00666
M3 - Conference contribution
AN - SCOPUS:85191944123
T3 - Proceedings - 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024
SP - 6789
EP - 6801
BT - Proceedings - 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024
PB - Institute of Electrical and Electronics Engineers
Y2 - 4 January 2024 through 8 January 2024
ER -