Title: Accuracy of multimodal interpretation by large language models for ophthalmology clinical vignettes
Abstract:
Purpose: Large language models (LLMs) can accept text and image prompts as multimodal input. Given the implications for ophthalmology as a field, we evaluated the performance of several LLMs on relevant clinical vignette questions.
Methods: We collected 100 cases of JAMA Ophthalmology Clinical Challenges published from July 2021 to August 2024; each case comprised a case description, a medical image, and associated multiple-choice questions with answer options. To assess the contribution of clinical images to LLM performance, we compared the accuracy of multimodal LLMs (GPT-4o, GPT-4o Mini, GPT-4 Turbo, and Gemini 1.5) given text-only vs. text- and image-based input. We also determined the accuracy of unimodal LLMs (GPT-3.5 and LLaMA 3) given text-only input. Accuracy was calculated as the percentage of correct responses to the clinical vignettes for each LLM and compared across LLMs using chi-square analysis.
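Below is a minimal Python sketch of the accuracy and chi-square comparison described above, assuming per-model correct/incorrect counts and a normal-approximation (Wald) confidence interval. The abstract does not specify the study's exact scoring pipeline or CI method, and all counts in the example are hypothetical.

```python
# Sketch of the accuracy + chi-square comparison described in Methods.
# Assumptions: Wald-style 95% CI; 2x2 contingency chi-square test.
import math
from scipy.stats import chi2_contingency

def accuracy_with_ci(n_correct: int, n_total: int, z: float = 1.96):
    """Return accuracy with a normal-approximation (Wald) 95% CI."""
    p = n_correct / n_total
    half_width = z * math.sqrt(p * (1 - p) / n_total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

def compare_models(correct_a: int, total_a: int, correct_b: int, total_b: int):
    """Chi-square test on a 2x2 table of correct/incorrect counts per model."""
    table = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    chi2, p_value, _, _ = chi2_contingency(table)
    return chi2, p_value

# Hypothetical counts for illustration only (not the study's raw data):
p, lo, hi = accuracy_with_ci(70, 100)
chi2, p_value = compare_models(70, 100, 55, 100)
print(f"accuracy {p:.1%} (95% CI {lo:.1%}-{hi:.1%}), chi-square p={p_value:.3f}")
```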
Results: For text-only input, GPT-4o had an accuracy of 69.5% (95% CI 64.0%-74.5%) vs. 50.7% (95% CI 45.0%-56.3%) for GPT-4o Mini, 56.9% (95% CI 51.2%-62.4%) for Gemini 1.5, 63.1% (95% CI 57.5%-68.5%) for GPT-4 Turbo, 56.9% (95% CI 51.2%-62.4%) for GPT-3.5, and 52.5% (95% CI 46.6%-58.4%) for LLaMA 3. For text- and image-based input, GPT-4o had an accuracy of 66.3% (95% CI 60.8%-71.5%) vs. 45.9% (95% CI 40.4%-51.6%) for GPT-4o Mini, 55.6% (95% CI 49.9%-61.1%) for Gemini 1.5, and 56.0% (95% CI 50.3%-61.6%) for GPT-4 Turbo. Comparing text-only vs. text- and image-based input within the multimodal LLMs, there was no significant difference for GPT-4o (69.5% vs. 66.3%, p=0.46), GPT-4o Mini (50.7% vs. 45.9%, p=0.29), Gemini 1.5 (56.9% vs. 55.6%, p=0.80), or GPT-4 Turbo (63.1% vs. 56.0%, p=0.095). Comparing multimodal and unimodal LLMs on text-only input, GPT-4o performed significantly better than GPT-3.5 (p=0.03), LLaMA 3 (p<0.001), GPT-4o Mini (p<0.001), and Gemini 1.5 (p=0.03), but was equivalent to GPT-4 Turbo (p=1.00). Comparing multimodal LLMs on text- and image-based input, GPT-4o performed significantly better than GPT-4o Mini (p<0.001) and Gemini 1.5 (p=0.049) but was again equivalent to GPT-4 Turbo (p=0.07).
Conclusions: Multimodal LLMs did not improve substantially with the inclusion of imaging data when answering ophthalmology clinical vignettes. GPT-4o outperformed all other LLMs except GPT-4 Turbo for both text-only and text- and image-based inputs. These results suggest that off-the-shelf LLMs can reasonably assess clinical presentations in ophthalmology, even without the inclusion of clinical images. This performance highlights the potential utility of LLMs as adjuncts to clinical decision-making in ophthalmology.