Vision-language models for medical report generation and visual question answering: a review.

Iryna Hartsock, Ghulam Rasool
Author Information
  1. Iryna Hartsock: Department of Machine Learning, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, United States.
  2. Ghulam Rasool: Department of Machine Learning, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, United States.

Abstract

Medical vision-language models (VLMs) combine computer vision (CV) and natural language processing (NLP) to analyze visual and textual medical data. Our paper reviews recent advancements in developing VLMs specialized for healthcare, focusing on publicly available models designed for medical report generation and visual question answering (VQA). We provide background on NLP and CV, explaining how techniques from both fields are integrated into VLMs, with visual and language data often fused using Transformer-based architectures to enable effective learning from multimodal data. Key areas we address include an exploration of 18 public medical vision-language datasets, in-depth analyses of the architectures and pre-training strategies of 16 recent noteworthy medical VLMs, and a comprehensive discussion of evaluation metrics for assessing VLMs' performance in medical report generation and VQA. We also highlight current challenges facing medical VLM development, including limited data availability, data privacy concerns, and a lack of proper evaluation metrics, among others, and propose future directions to address these obstacles. Overall, our review summarizes the recent progress in developing VLMs to harness multimodal medical data for improved healthcare applications.
