TY - GEN
T1 - WorldMedQA-V
T2 - 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, NAACL 2025
AU - Matos, João
AU - Chen, Shan
AU - Placino, Siena
AU - Li, Yingya
AU - Climent Pardo, Juan Carlos
AU - Idan, Daphna
AU - Tohyama, Takeshi
AU - Restrepo, David
AU - Nakayama, Luis F.
AU - Pascual-Leone, Jose M.M.
AU - Savova, Guergana
AU - Aerts, Hugo
AU - Celi, Leo A.
AU - Wong, A. Ian
AU - Bitterman, Danielle S.
AU - Gallifant, Jack
N1 - Publisher Copyright:
©2025 Association for Computational Linguistics.
PY - 2025/1/1
Y1 - 2025/1/1
N2 - Multimodal/vision language models (VLMs) are increasingly being deployed in healthcare settings worldwide, necessitating robust benchmarks to ensure their safety, efficacy, and fairness. Multiple-choice question and answer (QA) datasets derived from national medical examinations have long served as valuable evaluation tools, but existing datasets are largely text-only and available in only a limited subset of languages and countries. To address these challenges, we present WorldMedQA-V, an updated multilingual, multimodal benchmarking dataset designed to evaluate VLMs in healthcare. WorldMedQA-V includes 568 labeled multiple-choice QAs, each paired with a medical image, from four countries (Brazil, Israel, Japan, and Spain), covering the original languages and English translations validated by native clinicians. Baseline performance is provided for common open- and closed-source models in both the local language and English translation, and with and without images supplied to the model. The WorldMedQA-V benchmark aims to better match AI systems to the diverse healthcare environments in which they are deployed, fostering more equitable, effective, and representative applications.
UR - https://www.scopus.com/pages/publications/105028725264
U2 - 10.18653/v1/2025.findings-naacl.402
DO - 10.18653/v1/2025.findings-naacl.402
M3 - Conference contribution
AN - SCOPUS:105028725264
T3 - 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Proceedings of the Conference Findings, NAACL 2025
SP - 7218
EP - 7231
BT - 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics
A2 - Chiruzzo, Luis
A2 - Ritter, Alan
A2 - Wang, Lu
PB - Association for Computational Linguistics (ACL)
Y2 - 29 April 2025 through 4 May 2025
ER -