TY - GEN
T1 - REFVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation
T2 - 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025
AU - Slobodkin, Aviv
AU - Taitelbaum, Hagai
AU - Bitton, Yonatan
AU - Gordon, Brian
AU - Sokolik, Michal
AU - Guetta, Nitzan Bitton
AU - Gueta, Almog
AU - Rassin, Royi
AU - Lischinski, Dani
AU - Szpektor, Idan
N1 - Publisher Copyright:
© 2025 Association for Computational Linguistics.
PY - 2025/1/1
Y1 - 2025/1/1
N2 - Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a referenced subject image. Despite its broad downstream applicability—ranging from enhanced personalization in image generation to consistent character representation in video rendering—progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this gap, we introduce REFVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single run. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, REFVNLI outperforms or statistically matches existing baselines across multiple benchmarks and subject categories (e.g., Animal, Object), achieving up to 6.4-point gains in textual alignment and 5.9-point gains in subject preservation.
AB - Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a referenced subject image. Despite its broad downstream applicability—ranging from enhanced personalization in image generation to consistent character representation in video rendering—progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this gap, we introduce REFVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single run. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, REFVNLI outperforms or statistically matches existing baselines across multiple benchmarks and subject categories (e.g., Animal, Object), achieving up to 6.4-point gains in textual alignment and 5.9-point gains in subject preservation.
UR - https://www.scopus.com/pages/publications/105028965851
U2 - 10.18653/v1/2025.findings-emnlp.447
DO - 10.18653/v1/2025.findings-emnlp.447
M3 - Conference contribution
AN - SCOPUS:105028965851
T3 - EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
SP - 8420
EP - 8438
BT - EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
A2 - Christodoulopoulos, Christos
A2 - Chakraborty, Tanmoy
A2 - Rosé, Carolyn
A2 - Peng, Violet
PB - Association for Computational Linguistics (ACL)
Y2 - 4 November 2025 through 9 November 2025
ER -