Title: Assessing GPT-4o and GPT-4 in answering and explaining ophthalmology exam questions from Taiwan's medical licensing test
Abstract:
Purpose: This study aims to evaluate and compare the performance of GPT-4o and GPT-4 in answering ophthalmology questions from Taiwan's National Medical Licensing Examination (NMLE) from 2014 to 2023, focusing on both answer accuracy and explanation quality.
Materials and Methods: A total of 169 ophthalmology questions from Taiwan's NMLE over the past decade (2014–2023) were selected. GPT-4o and GPT-4 were tested on each question, and their performance was assessed by answer accuracy and explanation quality. The results were categorized by ophthalmologic subspecialty and analyzed statistically to determine whether differences between the two models were significant.
Results: GPT-4o achieved a significantly higher overall correct answer rate (92.9%) than GPT-4 (69.2%) across all ophthalmology questions from 2014 to 2023 (p < 0.01). GPT-4o outperformed GPT-4 in most subspecialties, including Retina (95.8% vs. 58.3%, p < 0.01), External Disease and Cornea (96.3% vs. 77.8%, p = 0.04), and Neuro-Ophthalmology (87.5% vs. 50.0%, p = 0.02). The two models performed similarly in Glaucoma and Uveitis, with no significant differences observed. In terms of explanation quality, GPT-4o provided accurate explanations for 90.7% of the questions, with the highest accuracy in Pediatric Ophthalmology and Strabismus (100%) and the lowest in Uveitis (83.3%).
Conclusion: GPT-4o exhibited superior performance to GPT-4 in both answering and explaining ophthalmology questions from Taiwan's NMLE. These results suggest that GPT-4o may be a more reliable tool for educational and diagnostic support purposes in ophthalmology.