Abstract
Background: Clinical data often include both standardized medical codes and natural language text. This highlights the need for clinical Large Language Models to understand these codes and the differences between them. We introduce a benchmark for evaluating how well various Large Language Models understand medical codes.

Methods: We present MedConceptsQA, a dedicated open-source benchmark for medical-concept question answering. The benchmark comprises questions about medical concepts across different vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard. We evaluate various Large Language Models on the benchmark.

Results: Most of the pre-trained clinical Large Language Models achieved accuracy close to random guessing on this benchmark, despite being pre-trained on medical data. GPT-4, however, achieves an absolute average improvement of 9-11% (9% for few-shot learning and 11% for zero-shot learning) over Llama3-OpenBioLLM-70B, the best-performing clinical Large Language Model.

Conclusion: Our benchmark serves as a valuable resource for evaluating the abilities of Large Language Models to interpret medical codes and distinguish between medical concepts. We demonstrate that most current state-of-the-art clinical Large Language Models perform at the level of random guessing, whereas GPT-3.5, GPT-4, and Llama3-70B outperform these clinical models, even though the medical domain was not the primary focus of their pre-training. Our benchmark is available at https://huggingface.co/datasets/ofir408/MedConceptsQA.
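Since the benchmark is hosted on the Hugging Face Hub, it can presumably be loaded with the `datasets` library. The sketch below is a minimal, illustrative example only: the configuration name (`icd9cm`), split name (`test`), and field names (`question`, `options`, `answer_id`) are assumptions and may differ from the dataset's actual schema.

```python
# Minimal sketch of loading MedConceptsQA from the Hugging Face Hub.
# NOTE: the config name "icd9cm" and the field names ("question",
# "options", "answer_id") are assumptions for illustration and may not
# match the published dataset schema.
from datasets import load_dataset

# Load one vocabulary configuration of the benchmark (config name assumed).
ds = load_dataset("ofir408/MedConceptsQA", name="icd9cm", split="test")

# Inspect a single multiple-choice question.
example = ds[0]
print(example["question"])
for i, option in enumerate(example["options"]):
    print(f"({chr(ord('A') + i)}) {option}")
print("Gold answer index:", example["answer_id"])
```

A zero-shot or few-shot evaluation loop would then format each question with its answer options into a prompt and compare the model's chosen option against the gold answer.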
| Original language | English |
|---|---|
| Article number | 109089 |
| Journal | Computers in Biology and Medicine |
| Volume | 182 |
| State | Published - 1 Nov 2024 |
Keywords
- Benchmark
- Clinical knowledge
- Health care
- LLM
- Large Language Models
- Machine learning
ASJC Scopus subject areas
- Health Informatics
- Computer Science Applications