MedConceptsQA: Open source medical concepts QA benchmark

    Research output: Contribution to journal › Article › peer-review

    7 Scopus citations

    Abstract

    Background: Clinical data often includes both standardized medical codes and natural language texts. This highlights the need for Clinical Large Language Models to understand these codes and their differences. We introduce a benchmark for evaluating the understanding of medical codes by various Large Language Models. Methods: We present MedConceptsQA, a dedicated open source benchmark for medical concepts question answering. The benchmark comprises questions covering medical concepts from several vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard. We evaluate various Large Language Models on the benchmark. Results: Our findings show that most of the pre-trained clinical Large Language Models achieved accuracy levels close to random guessing on this benchmark, despite being pre-trained on medical data. However, GPT-4 achieves an absolute average improvement of 9-11% (9% for few-shot learning and 11% for zero-shot learning) over Llama3-OpenBioLLM-70B, the best-performing clinical Large Language Model. Conclusion: Our benchmark serves as a valuable resource for evaluating the abilities of Large Language Models to interpret medical codes and distinguish between medical concepts. We demonstrate that most of the current state-of-the-art clinical Large Language Models achieve random-guess performance, whereas GPT-3.5, GPT-4, and Llama3-70B outperform these clinical models, despite their pre-training not being focused primarily on the medical domain. Our benchmark is available at https://huggingface.co/datasets/ofir408/MedConceptsQA.
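    The evaluation described above — multiple-choice questions about medical codes, scored against a random-guess baseline — can be sketched as follows. This is a minimal illustration, not the authors' released code: the item schema (the `vocab`, `level`, `options`, and `answer` fields) and the example codes are assumptions for demonstration; the actual dataset schema on Hugging Face may differ.

    ```python
    import random

    # Hypothetical MedConceptsQA-style items. The fields and ICD-10 code
    # descriptions below are illustrative assumptions, not dataset contents.
    ITEMS = [
        {"vocab": "diagnoses", "level": "easy",
         "question": "What is the description of ICD-10 code E11.9?",
         "options": {"A": "Type 2 diabetes mellitus without complications",
                     "B": "Essential (primary) hypertension",
                     "C": "Acute bronchitis, unspecified",
                     "D": "Iron deficiency anemia, unspecified"},
         "answer": "A"},
        {"vocab": "diagnoses", "level": "easy",
         "question": "What is the description of ICD-10 code I10?",
         "options": {"A": "Type 2 diabetes mellitus without complications",
                     "B": "Essential (primary) hypertension",
                     "C": "Acute bronchitis, unspecified",
                     "D": "Iron deficiency anemia, unspecified"},
         "answer": "B"},
    ]

    def evaluate(answer_fn, items):
        """Accuracy of answer_fn, which maps one item to an option key."""
        correct = sum(answer_fn(item) == item["answer"] for item in items)
        return correct / len(items)

    def random_guesser(item, rng=random.Random(0)):
        # Picks uniformly among the option keys; with 4 options this
        # converges to 25% accuracy, the random-guess baseline that many
        # clinical LLMs barely exceed on this benchmark.
        return rng.choice(list(item["options"]))
    ```

    In practice, a model under test replaces `random_guesser` with a function that prompts the LLM and parses its chosen option; the published dataset itself can be loaded from the URL above with the Hugging Face `datasets` library.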

    Original language: English
    Article number: 109089
    Journal: Computers in Biology and Medicine
    Volume: 182
    State: Published - 1 Nov 2024

    Keywords

    • Benchmark
    • Clinical knowledge
    • Health care
    • LLM
    • Large Language Models
    • Machine learning

    ASJC Scopus subject areas

    • Health Informatics
    • Computer Science Applications
