Comparison of ChatGPT-4, Copilot, Bard and Gemini Ultra on an Otolaryngology Question Bank.

Rashi Ramchandani, Eddie Guo, Michael Mostowy, Jason Kreutz, Nick Sahlollbey, Michele M Carr, Janet Chung, Lisa Caulley
Author Information
  1. Rashi Ramchandani: Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada.
  2. Eddie Guo: Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada.
  3. Michael Mostowy: Department of Otolaryngology, Jacobs School of Medicine and Biomedical Sciences at the University of Buffalo, Buffalo, New York, USA.
  4. Jason Kreutz: Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada.
  5. Nick Sahlollbey: Department of Otolaryngology-Head and Neck Surgery, University of Calgary, Calgary, Alberta, Canada.
  6. Michele M Carr: Department of Otolaryngology, Jacobs School of Medicine and Biomedical Sciences at the University of Buffalo, Buffalo, New York, USA.
  7. Janet Chung: Department of Otolaryngology-Head and Neck Surgery, University of Toronto, Toronto, Ontario, Canada.
  8. Lisa Caulley: Department of Otolaryngology-Head and Neck Surgery, University of Ottawa, Ottawa, Ontario, Canada.

Abstract

OBJECTIVE: To compare the performance of Google Bard, Microsoft Copilot, GPT-4 with vision (GPT-4) and Gemini Ultra on the OTO Chautauqua, a student-created, faculty-reviewed otolaryngology question bank.
STUDY DESIGN: Comparative performance evaluation of different LLMs.
SETTING: N/A.
PARTICIPANTS: N/A.
METHODS: Large language models (LLMs) are being extensively tested in medical education; however, their accuracy and effectiveness remain understudied, particularly in otolaryngology. In this study, 350 single-best-answer multiple-choice questions, including 18 image-based questions, were input into the four LLMs. Questions were sourced from six independent question banks covering (a) rhinology, (b) head and neck oncology, (c) endocrinology, (d) general otolaryngology, (e) paediatrics, (f) otology, (g) facial plastics and reconstruction, and (h) trauma. Each LLM was instructed to provide reasoning for its answers, the length of which was recorded (a workflow sketch follows the abstract).
RESULTS: Aggregate and subgroup analyses showed that Gemini (79.8%) was the most accurate LLM, followed by GPT-4 (71.1%), Copilot (68.0%), and Bard (65.1%). Average response lengths differed significantly across the LLMs: Bard produced the longest responses (x̄ = 1685.24), whereas GPT-4 (x̄ = 827.34) and Copilot (x̄ = 904.12) did not differ from each other. Gemini's longer responses (x̄ = 1291.68) included explanatory images and links. Gemini and GPT-4 correctly answered the image-based questions (n = 18), unlike Copilot and Bard, highlighting their adaptability and multimodal capabilities (an analysis sketch follows the abstract).
CONCLUSION: Gemini outperformed the other LLMs in accuracy, followed by GPT-4, Copilot, and Bard. Although GPT-4 ranked second in accuracy, it provided concise and relevant explanations. Despite the promising performance of these LLMs, medical learners should cautiously assess their accuracy and decision-making reliability.
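
For readers who want to see the evaluation workflow from the Methods in concrete terms, the Python sketch below illustrates how each multiple-choice question could be submitted to a model and the chosen answer, correctness, and reasoning length recorded. It is a minimal illustration under stated assumptions: the query_model helper, the Result fields, and the question dictionary keys are hypothetical placeholders, not the study's actual pipeline.

    # Minimal sketch of the evaluation loop (hypothetical helpers; not the study's code).
    from dataclasses import dataclass

    @dataclass
    class Result:
        question_id: str
        subspecialty: str      # e.g., rhinology, otology, paediatrics
        model: str             # e.g., "GPT-4", "Gemini Ultra"
        chosen_option: str     # single best answer returned by the model
        correct: bool
        reasoning_length: int  # character count of the model's explanation

    def query_model(model: str, stem: str, options: dict[str, str]) -> tuple[str, str]:
        """Hypothetical wrapper around each LLM's chat interface.
        A real implementation would submit the question and parse the reply;
        this placeholder lets the sketch run end to end."""
        return "A", "placeholder reasoning"

    def evaluate(questions: list[dict], models: list[str]) -> list[Result]:
        results = []
        for q in questions:
            for model in models:
                choice, reasoning = query_model(model, q["stem"], q["options"])
                results.append(Result(
                    question_id=q["id"],
                    subspecialty=q["subspecialty"],
                    model=model,
                    chosen_option=choice,
                    correct=(choice == q["answer_key"]),
                    reasoning_length=len(reasoning),
                ))
        return results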
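
The aggregate comparison reported in the Results could be reproduced along the lines below. The tests shown (a chi-squared test for accuracy, a one-way ANOVA for response length) are illustrative assumptions, since the abstract does not name the statistical methods used, and the correct/incorrect counts are approximations back-calculated from the reported percentages rather than the study's raw data.

    # Illustrative aggregate analysis (tests and counts are assumptions, not the study's methods).
    from scipy import stats

    # Approximate (correct, incorrect) counts out of 350, back-calculated from reported accuracies.
    counts = {
        "Gemini Ultra": (279, 71),   # ~79.8%
        "GPT-4":        (249, 101),  # ~71.1%
        "Copilot":      (238, 112),  # ~68.0%
        "Bard":         (228, 122),  # ~65.1%
    }

    # Chi-squared test of independence on the 4 x 2 model-by-correctness table.
    chi2, p_acc, dof, _ = stats.chi2_contingency([list(v) for v in counts.values()])
    print(f"Accuracy comparison: chi2 = {chi2:.2f}, dof = {dof}, p = {p_acc:.4f}")

    # One-way ANOVA on per-question reasoning lengths (fill each list with observed lengths).
    lengths = {model: [] for model in counts}
    # f_stat, p_len = stats.f_oneway(*lengths.values())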

Keywords

Artificial intelligence; large language models; machine learning
