Title: Evaluation of artificial intelligence chatbots' responses to questions on common ophthalmic conditions
Abstract:
Purpose: While AI chatbots are increasingly used for patient education, their effectiveness in providing accurate, comprehensive, and understandable information about ophthalmologic conditions remains understudied. We performed an observational, cross-sectional study to evaluate the ability of five AI chatbots (ChatGPT 3.5, Bing Chat, Google Gemini, Perplexity AI, and YouChat) to educate patients on common ophthalmologic conditions by assessing the accuracy, quality, and comprehensiveness of their responses as rated by participants with varying levels of ophthalmic knowledge.
Methods: Fifteen participants were stratified by ophthalmic knowledge, ranging from college-educated adults to practicing ophthalmologists. Ten questions were submitted to each AI chatbot, and de-identified chatbot responses were sent to the respondents. Using a weighted scale, respondents evaluated the overall quality and five metrics of each chatbot's response: scientific accuracy, comprehensiveness, balanced explanation, financial considerations, and understandability. Scores from 150 evaluations were averaged, and comparative statistics were performed with mixed-effects models to test for significant differences among chatbots.
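The sketch below is a minimal, hypothetical illustration (not the authors' actual analysis code) of how a mixed-effects comparison of chatbot ratings might be run, assuming the evaluations are stored in a long-format table with illustrative column names (rater, chatbot, metric, score).

```python
# Hypothetical sketch of the mixed-effects comparison described in Methods.
# Column names and the input file are assumptions for illustration only.
import pandas as pd
import statsmodels.formula.api as smf

# Long-format ratings: one row per rater x chatbot x metric evaluation
df = pd.read_csv("chatbot_ratings.csv")  # hypothetical file

# Fit one model per metric: chatbot as a fixed effect,
# with a random intercept for each rater to account for repeated ratings.
for metric, sub in df.groupby("metric"):
    model = smf.mixedlm("score ~ C(chatbot)", data=sub, groups=sub["rater"])
    result = model.fit()
    print(f"--- {metric} ---")
    print(result.summary())
```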
Results: ChatGPT 3.5 received the highest overall quality score, while Bing Chat received the lowest (p<0.0001). No significant difference was found among chatbots in scientific accuracy. ChatGPT 3.5 received the highest comprehensiveness (4.2; p=0.0002) and understandability scores (4.3; p=0.004), while Bing Chat received the lowest scores of 3.4 and 2.7, respectively. ChatGPT 3.5, Perplexity AI, and YouChat had higher scores for balanced explanation than Bing Chat (p<0.0001). For financial considerations, ChatGPT 3.5, Perplexity AI, and YouChat had higher scores than Bing Chat and Google Gemini (p<0.0001). Only ophthalmology residents, optometrists, and ophthalmologists could distinguish scientific accuracy among the chatbots.
Conclusion: Participants rated certain chatbots (e.g., ChatGPT 3.5) higher than others on several of the studied metrics for questions about common ophthalmologic diagnoses. However, because the quality of these responses varies across chatbots, eye care professionals remain an authoritative source for patient education.