Accuracy of LLMs in medical education: evidence from a concordance test with medical teacher

Vinaytosh Mishra, Yotam Lurie, Shlomo Mark

Research output: Contribution to journal › Article › peer-review

Abstract

This study evaluates Large Language Models (LLMs) against experienced medical teachers. The analysis examines the performance of three prominent LLMs—ChatGPT, Gemini, and Copilot. The study employs Fleiss' Kappa to statistically analyze the concordance among the LLMs and the human responses. Where responses were discordant, Cohen's Kappa was used to measure pairwise agreement between each of the three generative AI tools and a medical teacher. Results reveal a significant difference in performance between LLMs and medical teachers, highlighting potential limitations of using AI alone in medical education.
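The two agreement statistics named in the abstract can be computed directly from rating data. The sketch below is illustrative only—it is not the paper's actual analysis pipeline, and the rating data shown in the usage comments are hypothetical. Fleiss' Kappa takes an N×k count matrix (subjects × categories, each row summing to the number of raters); Cohen's Kappa compares two raters item by item.

```python
from typing import List, Sequence


def cohen_kappa(a: Sequence, b: Sequence) -> float:
    """Cohen's kappa between two raters labeling the same items."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    labels = sorted(set(a) | set(b))
    # observed agreement
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # chance agreement from each rater's marginal label frequencies
    p_e = sum((list(a).count(l) / n) * (list(b).count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa from an N x k matrix where counts[i][j] is the
    number of raters assigning subject i to category j
    (every row sums to the same number of raters n)."""
    N = len(counts)
    n = sum(counts[0])
    # mean per-subject agreement
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts
    ) / N
    # chance agreement from overall category proportions
    total = N * n
    p_j = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)


# Hypothetical usage: two raters' answers to four MCQ items
# cohen_kappa(["A", "A", "B", "A"], ["A", "A", "B", "B"])  -> 0.5
# Hypothetical usage: 3 raters, 2 subjects, 2 answer categories
# fleiss_kappa([[2, 1], [1, 2]])
```

Values range from -1 to 1, with 1 indicating perfect agreement and 0 indicating agreement no better than chance; published interpretation bands (e.g. Landis and Koch) are conventionally used to label the strength of agreement.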

Original language: English
Article number: 443
Journal: BMC Medical Education
Volume: 25
Issue number: 1
DOIs
State: Published - 1 Dec 2025

Keywords

  • Generative AI
  • LLM
  • Machine learning
  • Medical education

ASJC Scopus subject areas

  • Education
