Title: Evaluating the quality and readability of AI-generated ophthalmic surgery education: A four-model comparison
Abstract:
Background: Artificial intelligence (AI) tools, particularly large language models (LLMs), are increasingly utilised to provide health information, with patients seeking simplified explanations of surgical procedures. In ophthalmology, the readability and reliability of AI-generated content remain under-explored. This study evaluates the quality and readability of educational materials produced by four LLMs—ChatGPT-4 (OpenAI), Grok 3 (xAI), DeepSeek R1 (DeepSeek Inc.), and Gemini 2.5 Flash (Google)—for three common eye operations: cataract surgery, LASIK, and vitrectomy.
Methods: The four LLMs were each queried with three patient-oriented prompts requesting simplified explanations of each procedure. Responses were assessed for quality with the DISCERN instrument and for readability with Flesch-Kincaid metrics (Flesch-Kincaid Grade Level and Flesch Reading Ease). Two independent reviewers scored each response, and results were analysed with descriptive statistics and visualised in RStudio.
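For context, both Flesch-Kincaid metrics are derived from two ratios: words per sentence and syllables per word. The minimal Python sketch below illustrates the standard formulas only; it is not the study's RStudio workflow, the syllable counter is a naive vowel-group heuristic assumed here for illustration (dedicated readability tools use dictionaries or more robust rules), and the sample sentence is invented.

```python
import re

def count_syllables(word: str) -> int:
    """Naive vowel-group syllable heuristic (an approximation, not a dictionary-based count)."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1  # drop a likely silent trailing 'e'
    return max(n, 1)

def flesch_metrics(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for a block of text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / max(len(sentences), 1)   # words per sentence
    spw = syllables / max(len(words), 1)        # syllables per word
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade_level = 0.39 * wps + 11.8 * spw - 15.59
    return reading_ease, grade_level

if __name__ == "__main__":
    sample = ("Cataract surgery replaces the cloudy lens in your eye with a clear artificial lens. "
              "The operation is quick and most people go home the same day.")
    ease, grade = flesch_metrics(sample)
    print(f"Reading Ease: {ease:.1f}, Grade Level: {grade:.1f}")
```

Higher Reading Ease scores indicate easier text, whereas higher Grade Levels indicate more advanced text, which is why the two metrics move in opposite directions in the Results below.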
Results: ChatGPT-4 produced the most readable content, with Flesch-Kincaid Grade Levels of 5.0–6.5 and Reading Ease Scores of 68.5–77.7, corresponding to secondary school reading levels. DeepSeek performed similarly, while Grok and Gemini generated more complex outputs, often at A-level or early university reading levels. Gemini's "simplified" segments paradoxically yielded poorer readability scores. DISCERN scores were comparable across models (56–58.7), indicating moderate reliability. However, all models lacked source citations, undermining credibility and transparency.