Evaluating Text-to-Image Generated Photorealistic Images of Human Anatomy.

Paula Muhr, Yating Pan, Charlotte Tumescheit, Ann-Kathrin Kübler, Hatice Kübra Parmaksiz, Cheng Chen, Pablo Sebastián Bolaños Orozco, Soeren S Lienkamp, Janna Hastings
Author Information
  1. Paula Muhr: Faculty of Medicine, Institute for Implementation Science in Health Care, University of Zurich, Zurich, CHE.
  2. Yating Pan: Digital Society Initiative, University of Zurich, Zurich, CHE.
  3. Charlotte Tumescheit: Faculty of Medicine, Institute for Implementation Science in Health Care, University of Zurich, Zurich, CHE.
  4. Ann-Kathrin Kübler: Digital Society Initiative, University of Zurich, Zurich, CHE.
  5. Hatice Kübra Parmaksiz: Digital Society Initiative, University of Zurich, Zurich, CHE.
  6. Cheng Chen: Digital Society Initiative, University of Zurich, Zurich, CHE.
  7. Pablo Sebastián Bolaños Orozco: Digital Society Initiative, University of Zurich, Zurich, CHE.
  8. Soeren S Lienkamp: Faculty of Medicine, Institute for Anatomy, University of Zurich, Zurich, CHE.
  9. Janna Hastings: Faculty of Medicine, Institute for Implementation Science in Health Care, University of Zurich, Zurich, CHE.

Abstract

BACKGROUND: Generative artificial intelligence (AI) models that can produce photorealistic images from text descriptions have many applications in medicine, including medical education and the generation of synthetic data. However, their outputs are heterogeneous, which makes it challenging to evaluate them and to compare different models. There is a need for a systematic approach that enables comparisons between images and between models.
METHOD: To address this gap, we developed an error classification system for annotating errors in AI-generated photorealistic images of humans and applied our method to a corpus of 240 images generated with three different models (DALL-E 3, Stable Diffusion XL, and Stable Cascade) using 10 prompts with eight images per prompt.
RESULTS: The error classification system identifies five error types with three severity levels across five anatomical regions and specifies an associated quantitative scoring method based on aggregated proportions of errors per expected count of anatomical components in the generated image. We assessed inter-rater agreement by double-annotating 25% of the images and calculating Krippendorff's alpha, and we compared results across the three models and 10 prompts quantitatively using a cumulative score per image. The error classification system, accompanying training manual, generated image collection, annotations, and all associated scripts are available from our GitHub repository at https://github.com/hastingslab-org/ai-human-images. Inter-rater agreement was relatively poor, reflecting the subjectivity of the error classification task. Model comparisons revealed that DALL-E 3 performed consistently better than Stable Diffusion; however, the latter generated images reflecting more diversity in personal attributes. Images of groups of people were more challenging for all models than images of individuals or pairs, and some prompts were challenging for all models.
CONCLUSION: Our method enables systematic comparison of AI-generated photorealistic images of humans; our results can serve to catalyse improvements in these models for medical applications.
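
The two quantitative steps named in the abstract, a cumulative per-image score built from errors per expected count of anatomical components, and Krippendorff's alpha for the double-annotated images, can be illustrated with a minimal Python sketch. It is not the authors' implementation (their scripts are in the repository linked above). The severity labels and weights, the region names, and both function names are hypothetical, and the alpha shown is the nominal, two-rater, no-missing-data variant, which may differ from the variant used in the study.

```python
from collections import Counter
from itertools import permutations

# Assumed weights: the abstract names three severities but not their labels or weights.
SEVERITY_WEIGHT = {"minor": 1, "moderate": 2, "severe": 3}


def image_score(errors, expected_counts):
    """Sum severity-weighted errors, each divided by the expected number of
    anatomical components in its region (a hypothetical reading of the
    'proportion of errors per expected count' idea)."""
    return sum(SEVERITY_WEIGHT[severity] / expected_counts[region]
               for region, severity in errors)


def krippendorff_alpha_nominal(rater_a, rater_b):
    """Krippendorff's alpha for nominal labels from two raters, no missing values."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of units")
    # Coincidence matrix: each unit contributes its label pair in both orders.
    coincidences = Counter()
    for a, b in zip(rater_a, rater_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1
    n = 2 * len(rater_a)  # total number of pairable values
    totals = Counter()
    for (label, _), count in coincidences.items():
        totals[label] += count
    observed = sum(c for (a, b), c in coincidences.items() if a != b) / n
    expected = sum(totals[a] * totals[b]
                   for a, b in permutations(totals, 2)) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1 - observed / expected


# Hypothetical annotations for one image showing two people.
errors = [("hands", "severe"), ("hands", "minor"), ("face", "moderate")]
expected_counts = {"hands": 4, "face": 2}
score = image_score(errors, expected_counts)

# Hypothetical labels from two annotators on four double-annotated items.
alpha = krippendorff_alpha_nominal(["a", "a", "b", "b"], ["a", "a", "b", "a"])
```

Dividing by the expected component count keeps images of groups comparable with images of individuals, since a group image offers more opportunities for error. An alpha of 1 indicates perfect agreement, and values near 0 indicate agreement no better than chance.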
