Evaluating the Performance of ChatGPT, Gemini, and Bing Compared with Resident Surgeons in the Otorhinolaryngology In-service Training Examination.

Utku Mete
Author Information
  1. Utku Mete: Bursa Uludağ University Faculty of Medicine, Department of Otorhinolaryngology, Bursa, Türkiye.

Abstract

Objective: Large language models (LLMs) are used in various fields for their ability to produce human-like text. They are particularly useful in medical education, where they can support residents' clinical management skills and examination preparation. This study aimed to evaluate and compare the performance of ChatGPT (GPT-4), Gemini, and Bing with one another and with otorhinolaryngology residents in answering in-service training examination questions, and to provide insights into the usefulness of these models in medical education and healthcare.
Methods: Eight otorhinolaryngology in-service training examinations were used for the comparison. A total of 316 questions were prepared from the Resident Training Textbook of the Turkish Society of Otorhinolaryngology Head and Neck Surgery and presented to the three artificial intelligence models. The examination results were evaluated to determine the accuracy of the models and of the residents.
Results: GPT-4 achieved the highest accuracy among the LLMs at 54.75% (GPT-4 vs. Gemini p=0.002, GPT-4 vs. Bing p<0.001), followed by Gemini at 40.50% and Bing at 37.00% (Gemini vs. Bing p=0.327). However, senior residents outperformed all LLMs and the other resident groups with an accuracy of 75.5% (p<0.001). The LLMs could compete only with junior residents: GPT-4 and Gemini performed similarly to juniors, whose accuracy was 46.90% (p=0.058 and p=0.120, respectively), whereas juniors still outperformed Bing (p=0.019).
Conclusion: The LLMs currently fall short of the medical accuracy achieved by senior and mid-level residents. However, they perform well in specific subspecialties, indicating their potential usefulness in certain medical fields.
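
For context on how such pairwise accuracy comparisons can be computed, the sketch below assumes a chi-square test on a 2x2 correct/incorrect contingency table; the abstract does not state which statistical test the study used, and the per-group counts are illustrative reconstructions from the reported percentages, so the output will not necessarily match the reported p-values.

    from scipy.stats import chi2_contingency

    def accuracy_p_value(correct_a: int, correct_b: int, total: int) -> float:
        # Build a 2x2 table of correct/incorrect counts for two answerers of the
        # same question set and return the chi-square p-value for the difference.
        table = [
            [correct_a, total - correct_a],  # answerer A: correct, incorrect
            [correct_b, total - correct_b],  # answerer B: correct, incorrect
        ]
        _, p_value, _, _ = chi2_contingency(table)
        return p_value

    # Hypothetical counts derived from the reported percentages on a 316-question set.
    total_questions = 316
    gpt4_correct = round(0.5475 * total_questions)    # ~173
    gemini_correct = round(0.4050 * total_questions)  # ~128
    print(f"GPT-4 vs. Gemini: p = {accuracy_p_value(gpt4_correct, gemini_correct, total_questions):.3f}")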
