The Data Artifacts Glossary: a community-based repository for bias on health datasets.

Rodrigo R Gameiro, Naira Link Woite, Christopher M Sauer, Sicheng Hao, Chrystinne Oliveira Fernandes, Anna E Premo, Alice Rangel Teixeira, Isabelle Resli, An-Kwok Ian Wong, Leo Anthony Celi
Author Information
  1. Rodrigo R Gameiro: Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA.
  2. Naira Link Woite: Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA.
  3. Christopher M Sauer: Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA.
  4. Sicheng Hao: Division of Pulmonary, Allergy, and Critical Care Medicine, Duke University, Durham, NC, USA.
  5. Chrystinne Oliveira Fernandes: Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA.
  6. Anna E Premo: Learning Research and Development Center, University of Pittsburgh, Pittsburgh, PA, USA.
  7. Alice Rangel Teixeira: Department of Philosophy, Universitat Autònoma de Barcelona, Barcelona, Spain.
  8. Isabelle Resli: School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA.
  9. An-Kwok Ian Wong: Division of Pulmonary, Allergy, and Critical Care Medicine, Duke University, Durham, NC, USA.
  10. Leo Anthony Celi: Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA. lceli@mit.edu.

Abstract

BACKGROUND: The deployment of Artificial Intelligence (AI) in healthcare has the potential to transform patient care through improved diagnostics, personalized treatment plans, and more efficient resource management. However, the effectiveness and fairness of AI are critically dependent on the data it learns from. Biased datasets can lead to AI outputs that perpetuate disparities, particularly affecting social minorities and marginalized groups.
OBJECTIVE: This paper introduces the "Data Artifacts Glossary", a dynamic, open-source framework designed to systematically document and update potential biases in healthcare datasets. The aim is to provide a comprehensive tool that enhances the transparency and accuracy of AI applications in healthcare and contributes to understanding and addressing health inequities.
METHODS: Utilizing a methodology inspired by the Delphi method, a diverse team of experts conducted iterative rounds of discussions and literature reviews. The team synthesized insights to develop a comprehensive list of bias categories and designed the glossary's structure. The Data Artifacts Glossary was piloted using the MIMIC-IV dataset to validate its utility and structure.
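To make the MIMIC-IV pilot more concrete, the following is a minimal sketch of the kind of dataset check a glossary entry might document, for example, whether a demographic field is disproportionately missing or recorded as unknown for some patient groups. It is an illustrative assumption rather than code from the project: the local file path and the column names ("race", "insurance") are hypothetical and can differ across MIMIC-IV versions, and the real data require credentialed PhysioNet access.

```python
import pandas as pd

# Hypothetical local export of a MIMIC-IV-style admissions table;
# the real dataset requires credentialed access via PhysioNet.
admissions = pd.read_csv("admissions.csv")

# Flag rows where the race field is absent or explicitly "UNKNOWN", then
# compare the rate of such gaps across insurance groups. Systematic gaps
# of this kind are one example of an artifact a glossary entry could record.
unknown_race = admissions["race"].isna() | admissions["race"].str.upper().eq("UNKNOWN")
gap_rate = unknown_race.groupby(admissions["insurance"]).mean().sort_values(ascending=False)
print(gap_rate)
```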
RESULTS: The Data Artifacts Glossary adopts a collaborative approach modeled on successful open-source projects like Linux and Python. Hosted on GitHub, it utilizes robust version control and collaborative features, allowing stakeholders from diverse backgrounds to contribute. Through a rigorous peer review process managed by community members, the glossary ensures the continual refinement and accuracy of its contents. The implementation of the Data Artifacts Glossary with the MIMIC-IV dataset illustrates its utility: it categorizes biases and facilitates their identification and understanding.
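As a tangible illustration of what a community-contributed record could look like, the sketch below represents a glossary entry as a small Python structure; the field names, the example artifact, and the placeholder reference are assumptions for illustration, not the schema actually defined in the project's GitHub repository.

```python
from dataclasses import dataclass, field

@dataclass
class GlossaryEntry:
    dataset: str                  # e.g., "MIMIC-IV"
    bias_category: str            # e.g., "measurement bias"
    title: str                    # short name of the data artifact
    description: str              # how the artifact arises and whom it affects
    affected_variables: list[str] = field(default_factory=list)
    references: list[str] = field(default_factory=list)  # PMIDs, DOIs, or URLs

# Example entry describing a well-documented measurement artifact; the details
# are illustrative and would be refined through community peer review.
entry = GlossaryEntry(
    dataset="MIMIC-IV",
    bias_category="measurement bias",
    title="Pulse oximetry discrepancies across skin tones",
    description=("SpO2 readings may overestimate arterial oxygen saturation in "
                 "patients with darker skin, so recorded values can understate hypoxemia."),
    affected_variables=["SpO2"],
    references=["<PMID or DOI of a supporting study>"],
)
print(entry.title)
```

In practice, an entry of this kind would be proposed, versioned, and refined through the repository's ordinary GitHub collaboration workflow before acceptance by community reviewers, consistent with the peer review process described above.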
CONCLUSION: The Data Artifacts Glossary serves as a vital resource for enhancing the integrity of AI applications in healthcare by providing a mechanism to recognize and mitigate dataset biases before they impact AI outputs. It not only aids in avoiding bias in model development but also contributes to understanding and addressing the root causes of health disparities.

Keywords

Artificial intelligence; Bias; Dataset; Health equity; Machine learning

MeSH Terms

Humans
Artificial Intelligence
Bias
Datasets as Topic
