Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4.

Adi Lahat, Kassem Sharif, Narmin Zoabi, Yonatan Shneor Patt, Yousra Sharif, Lior Fisher, Uria Shani, Mohamad Arow, Roni Levin, Eyal Klang
Author Information
  1. Adi Lahat: Department of Gastroenterology, Chaim Sheba Medical Center, affiliated with Tel Aviv University, Ramat Gan, Israel.
  2. Kassem Sharif: Department of Gastroenterology, Chaim Sheba Medical Center, affiliated with Tel Aviv University, Ramat Gan, Israel.
  3. Narmin Zoabi: Department of Gastroenterology, Chaim Sheba Medical Center, affiliated with Tel Aviv University, Ramat Gan, Israel.
  4. Yonatan Shneor Patt: Department of Internal Medicine B, Sheba Medical Centre, Tel Aviv, Israel.
  5. Yousra Sharif: Department of Internal Medicine C, Hadassah Medical Center, Jerusalem, Israel.
  6. Lior Fisher: Department of Internal Medicine B, Sheba Medical Centre, Tel Aviv, Israel.
  7. Uria Shani: Department of Internal Medicine B, Sheba Medical Centre, Tel Aviv, Israel.
  8. Mohamad Arow: Department of Internal Medicine B, Sheba Medical Centre, Tel Aviv, Israel.
  9. Roni Levin: Department of Internal Medicine B, Sheba Medical Centre, Tel Aviv, Israel.
  10. Eyal Klang: Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, United States.

Abstract

BACKGROUND: Artificial intelligence, particularly chatbot systems, is becoming an instrumental tool in health care, aiding clinical decision-making and patient engagement.
OBJECTIVE: This study aims to analyze the performance of ChatGPT-3.5 and ChatGPT-4 in addressing complex clinical and ethical dilemmas and to illustrate their potential role in health care decision-making, comparing ratings between senior physicians and residents and across question types.
METHODS: Four specialized physicians formulated 176 real-world clinical questions. Eight senior physicians and residents then rated the GPT-3.5 and GPT-4 responses on a 1-5 scale across 5 categories: accuracy, relevance, clarity, utility, and comprehensiveness. The questions spanned internal medicine, emergency medicine, and ethics. Comparisons were made globally, between seniors and residents, and across question categories.
RESULTS: Both models received high mean scores (4.4, SD 0.8 for GPT-4 and 4.1, SD 1.0 for GPT-3.5). GPT-4 outperformed GPT-3.5 across all rating dimensions, and seniors consistently rated responses higher than residents did for both models. Specifically, seniors rated GPT-4 responses as more beneficial and complete than residents did (mean 4.6 vs 4.0 and 4.6 vs 4.1, respectively; P<.001), with a similar pattern for GPT-3.5 (mean 4.1 vs 3.7 and 3.9 vs 3.5, respectively; P<.001). Ethical queries received the highest ratings for both models, with mean scores showing consistency across the accuracy and completeness criteria. Differences among question types were significant, particularly for GPT-4's mean completeness scores across emergency medicine, internal medicine, and ethics questions (4.2, SD 1.0; 4.3, SD 0.8; and 4.5, SD 0.7, respectively; P<.001), and for GPT-3.5's accuracy, beneficial, and completeness dimensions.
CONCLUSIONS: ChatGPT shows promise in assisting physicians with medical issues, with prospects to enhance diagnostics, treatment, and ethical decision-making. While integration into clinical workflows may be valuable, it must complement, not replace, human expertise. Continued research is essential to ensure safe and effective implementation in clinical environments.
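
The abstract reports group comparisons with P values but does not state which statistical test was used. The following is a minimal, hypothetical sketch of how such a comparison could be run, assuming ordinal 1-5 ratings and a Mann-Whitney U test (an assumption, not the authors' reported method); the rating arrays are synthetic.

```python
# Illustrative sketch only: this is NOT the authors' analysis code.
# It assumes ordinal 1-5 ratings and uses a Mann-Whitney U test to compare
# senior vs. resident scores for one rating dimension; the data are synthetic.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(seed=0)
senior_scores = rng.integers(3, 6, size=176)    # hypothetical 1-5 ratings by senior physicians
resident_scores = rng.integers(2, 6, size=176)  # hypothetical 1-5 ratings by residents

stat, p_value = mannwhitneyu(senior_scores, resident_scores, alternative="two-sided")
print(f"seniors: mean {senior_scores.mean():.1f} (SD {senior_scores.std():.1f}); "
      f"residents: mean {resident_scores.mean():.1f} (SD {resident_scores.std():.1f}); "
      f"Mann-Whitney U = {stat:.0f}, P = {p_value:.3f}")
```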

MeSH Terms

Humans
Clinical Decision-Making
Artificial Intelligence
