HYBRID EVENT: Join us in person in Singapore or attend virtually from anywhere.

5th Edition of

International Ophthalmology Conference

Accuracy of multimodal interpretation by large language models for ophthalmology clinical vignettes

Dhruva Drew Gupta
Harvard Medical School, United States
Title: Accuracy of multimodal interpretation by large language models for ophthalmology clinical vignettes

Abstract:

Purpose: Large language models (LLMs) can accept text and image prompts as multimodal input. Because this capability has implications for ophthalmology, we evaluated the performance of several LLMs on relevant clinical vignette questions.
Methods: We collected 100 cases of JAMA Ophthalmology Clinical Challenges from July 2021 to August 2024, where each case had a case description, a medical image, and associated multiple-choice question options. To assess whether clinical images improve LLM performance, we compared the accuracy of multimodal LLMs (GPT-4o, GPT-4o Mini, GPT-4 Turbo, and Gemini 1.5) with text-only vs. text- and image-based input. We also determined the accuracy of unimodal LLMs (GPT-3.5 and LLaMA 3) with text-only input. For accuracy, we calculated the percentage of clinical vignette questions each LLM answered correctly and compared accuracy across LLMs using chi-square tests.
Results: For text-only input, GPT-4o had an accuracy of 69.5% (95% CI 64.0%-74.5%) vs. 50.7% (95% CI 45.0%-56.3%) for GPT-4o Mini, 56.9% (95% CI 51.2%-62.4%) for Gemini 1.5, 63.1% (95% CI 57.5%-68.5%) for GPT-4 Turbo, 56.9% (95% CI 51.2%-62.4%) for GPT-3.5, and 52.5% (95% CI 46.6%-58.4%) for LLaMA 3. For image- and text-based input, GPT-4o had an accuracy of 66.3% (95% CI 60.8%-71.5%) vs. 45.9% (95% CI 40.4%-51.6%) for GPT-4o Mini, 55.6% (95% CI 49.9%-61.1%) for Gemini 1.5, and 56.0% (95% CI 50.3%-61.6%) for GPT-4 Turbo. Comparing text-only vs. image- and text-based input within each multimodal LLM, there was no significant difference for GPT-4o (69.5% vs. 66.3%, p=0.46), GPT-4o Mini (50.7% vs. 45.9%, p=0.29), Gemini 1.5 (56.9% vs. 55.6%, p=0.80), or GPT-4 Turbo (63.1% vs. 56.0%, p=0.095). Comparing multimodal and unimodal LLMs for text-only input, GPT-4o performed significantly better than GPT-3.5 (p=0.03), LLaMA 3 (p<0.001), GPT-4o Mini (p<0.001), and Gemini 1.5 (p=0.03), but did not differ significantly from GPT-4 Turbo (p=1.00). Comparing multimodal LLMs for text- and image-based input, GPT-4o performed significantly better than GPT-4o Mini (p<0.001) and Gemini 1.5 (p=0.049), but again did not differ significantly from GPT-4 Turbo (p=0.07).
Conclusions: Multimodal LLMs did not improve substantially with the inclusion of imaging data when answering ophthalmology clinical vignettes. GPT-4o outperformed all other LLMs except GPT-4 Turbo for both text-only and text- and image-based inputs. These results suggest that off-the-shelf LLMs can reasonably assess clinical presentations in ophthalmology, even without the inclusion of clinical images. This performance highlights the possible utility of LLMs as adjuncts for clinical decision-making in ophthalmology.
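
Illustrative sketch (not part of the abstract): the analysis described in the Methods, accuracy as the percentage of correct responses followed by a chi-square comparison between two models, could be computed roughly as below. The abstract does not report per-model counts, the confidence-interval method, or whether a continuity correction or multiple-comparison adjustment was applied, so the Wilson interval, the uncorrected Pearson chi-square test, the function names, and the counts in the usage example are all assumptions chosen for illustration and are not the study's data.

```python
import math


def wilson_ci(correct, total, z=1.96):
    """95% Wilson score interval for a proportion (assumed CI method)."""
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half


def chi_square_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square test (no continuity correction) on a 2x2 table
    of correct/incorrect counts for two models. Returns (statistic, p_value)."""
    table = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_sums)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            if expected == 0:
                raise ValueError("expected count of zero; test is undefined")
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


if __name__ == "__main__":
    # Hypothetical counts for two models, NOT taken from the study.
    correct_a, total_a = 70, 100
    correct_b, total_b = 55, 100
    low, high = wilson_ci(correct_a, total_a)
    stat, p = chi_square_2x2(correct_a, total_a, correct_b, total_b)
    print(f"Model A accuracy: {correct_a / total_a:.1%} (95% CI {low:.1%}-{high:.1%})")
    print(f"Chi-square = {stat:.2f}, p = {p:.3f}")
```

When text-only and image-plus-text responses come from the same set of questions, a paired test such as McNemar's may be more appropriate than an unpaired chi-square test; the abstract does not say which form was used.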

Biography:

Dhruva (Drew) Gupta is an MD candidate at Harvard who will graduate in May 2025. He earned his BS in Neuroscience from Yale College and served as a Gliklich Healthcare Innovation Fellow at Massachusetts Eye and Ear, where he led a project on identifying reversal of vision loss in glaucoma. He currently works at Mass General Brigham to understand the applications of large language models in ophthalmology.
