Abstract
This study evaluates the performance of Large Language Models (LLMs) against that of experienced medical teachers. The analysis examines three prominent LLMs: ChatGPT, Gemini, and Copilot. Fleiss’ Kappa was used to statistically assess concordance between the LLM and human responses. Where responses were discordant, Cohen’s Kappa was used to measure agreement between the three generative AI tools and a medical teacher. The results reveal a significant difference in performance between the LLMs and the medical teachers, highlighting potential limitations of relying on AI alone in medical education.
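For readers unfamiliar with the two agreement statistics named in the abstract, the following is a minimal sketch of how Cohen's Kappa (two raters) and Fleiss' Kappa (several raters) are conventionally computed. It is not the study's own analysis code: the function names, the rater labels, and the toy answer data are all hypothetical and serve only to illustrate the formulas.

```python
from collections import Counter
import math


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if math.isclose(p_e, 1.0):
        return float("nan")  # kappa is undefined when chance agreement is 1
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(ratings):
    """Fleiss' kappa; `ratings` holds one list of labels per item,
    with the same number of raters (at least two) for every item."""
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("at least one item is required")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Per-item agreement: proportion of agreeing rater pairs.
        agreement_sum += (
            sum(c * c for c in counts.values()) - n_raters
        ) / (n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_items
    # Chance agreement from the overall share of each label.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if math.isclose(p_e, 1.0):
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical toy data: answers to five multiple-choice items.
# Column order (assumed for illustration): three LLMs, then one teacher.
toy_answers = [
    ["A", "A", "A", "A"],
    ["B", "B", "C", "B"],
    ["C", "C", "C", "D"],
    ["D", "A", "D", "D"],
    ["B", "B", "B", "B"],
]
print("Fleiss' kappa (all four raters):", fleiss_kappa(toy_answers))

first_llm = [row[0] for row in toy_answers]
teacher = [row[3] for row in toy_answers]
print("Cohen's kappa (first LLM vs teacher):", cohen_kappa(first_llm, teacher))
```

Both statistics share the form (observed agreement minus chance agreement) divided by (1 minus chance agreement); they differ only in how the two agreement terms are estimated. In practice, established implementations such as those in scikit-learn or statsmodels would normally be preferred over hand-written versions.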
Original language | English |
---|---|
Article number | 443 |
Journal | BMC Medical Education |
Volume | 25 |
Issue number | 1 |
State | Published - 1 Dec 2025 |
Keywords
- Generative AI
- LLM
- Machine learning
- Medical education
ASJC Scopus subject areas
- Education