Introduction
Artificial intelligence (AI) and large language models (LLMs) have significantly transformed numerous fields, including medical education. However, their application in specialized areas such as otolaryngology presents unique challenges. This study evaluates the performance of advanced LLMs from OpenAI, Google, and Anthropic in answering domain-specific otolaryngology board examination questions.
Material and methods
A total of 2,576 questions from a German otolaryngology board exam question bank were used to test 11 different LLMs, including GPT variants, Gemini models, and Claude models. The questions were divided into multiple-choice and single-choice formats. Python scripts interacted with the LLMs via their application programming interfaces (APIs), collecting responses for subsequent statistical analysis; a minimal sketch of this querying step follows.
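The study's actual scripts are not reproduced in this abstract; the following is a minimal sketch of the querying step, assuming the official OpenAI Python client. The prompt wording, the `questions` list, and its field names are illustrative placeholders, and analogous clients would be used for the Gemini and Claude models.

```python
# Minimal sketch of the API querying step (illustrative; the study's actual
# scripts, prompts, and question-bank format are not shown in the abstract).
import csv

from openai import OpenAI  # analogous clients exist for Gemini and Claude

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical question-bank entries; the real bank holds 2,576 questions.
questions = [
    {"id": 1,
     "text": "Which nerve runs through the parotid gland? "
             "A) Hypoglossal B) Facial C) Vagus D) Trigeminal",
     "answer": "B"},
]

def ask_model(question_text: str, model: str = "gpt-4o") -> str:
    """Send one exam question to the model and return its raw answer."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic answers for reproducible scoring
        messages=[
            {"role": "system",
             "content": "Answer the exam question with the letter(s) "
                        "of the correct option(s) only."},
            {"role": "user", "content": question_text},
        ],
    )
    return response.choices[0].message.content.strip()

# Collect the model's answers alongside the key for later scoring.
with open("responses.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["question_id", "model_answer", "correct_answer"])
    for q in questions:
        writer.writerow([q["id"], ask_model(q["text"]), q["answer"]])
```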
Results
The GPT-4o model achieved the highest accuracy among all models at 55.6%, excelling in categories such as allergology and head and neck tumors. Notably, GPT-3.5 Turbo's accuracy declined significantly over the past year, from 57% to 52.6%. All models performed consistently better on single-choice questions than on multiple-choice questions. The GPT-4 models showed significant improvements in handling negated questions, while other models showed mixed results.
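The abstract does not state which statistical test underlies the significance claim for the GPT-3.5 Turbo decline; as a hedged illustration, a two-proportion z-test on the two reported accuracies (with n = 2,576 questions per run) does yield a significant difference.

```python
# Hedged sketch: a two-proportion z-test comparing GPT-3.5 Turbo's accuracy
# across the two evaluation dates. The study's actual statistical method is
# not specified in the abstract; this is one plausible analysis, not its own.
from statsmodels.stats.proportion import proportions_ztest

n = 2576                          # questions per evaluation run
correct_then = round(0.570 * n)   # ~1,468 correct answers one year earlier
correct_now = round(0.526 * n)    # ~1,355 correct answers now

stat, pval = proportions_ztest([correct_then, correct_now], [n, n])
print(f"z = {stat:.2f}, p = {pval:.4f}")  # p < 0.05: significant decline
```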
Discussion
This study demonstrates the advanced capabilities of LLMs, particularly the GPT-4 variants, in specialized medical fields such as otolaryngology. However, the variability in performance across models and question types shows that LLMs, while promising, still require refinement to fully meet the demands of specialized medical education and certification.