A harmonized public resource of deeply sequenced diverse human genomes.

Zan Koenig, Mary T Yohannes, Lethukuthula L Nkambule, Xuefang Zhao, Julia K Goodrich, Heesu Ally Kim, Michael W Wilson, Grace Tiao, Stephanie P Hao, Nareh Sahakian, Katherine R Chao, Mark A Walker, Yunfei Lyu, gnomAD Project Consortium, Heidi L Rehm, Benjamin M Neale, Michael E Talkowski, Mark J Daly, Harrison Brand, Konrad J Karczewski, Elizabeth G Atkinson, Alicia R Martin
Author Information
  1. Zan Koenig: Stanley Center for Psychiatric Research, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  2. Mary T Yohannes: Stanley Center for Psychiatric Research, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  3. Lethukuthula L Nkambule: Stanley Center for Psychiatric Research, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  4. Xuefang Zhao: Program in Medical and Population Genetics, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  5. Julia K Goodrich: Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA.
  6. Heesu Ally Kim: Stanley Center for Psychiatric Research, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  7. Michael W Wilson: Program in Medical and Population Genetics, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  8. Grace Tiao: Program in Medical and Population Genetics, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  9. Stephanie P Hao: Program in Medical and Population Genetics, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  10. Nareh Sahakian: Broad Genomics, The Broad Institute of MIT and Harvard, 320 Charles Street, Cambridge, MA, 02141, USA.
  11. Katherine R Chao: Program in Medical and Population Genetics, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  12. Mark A Walker: Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114, USA.
  13. Yunfei Lyu: Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
  14. Heidi L Rehm: Program in Medical and Population Genetics, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  15. Benjamin M Neale: Stanley Center for Psychiatric Research, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  16. Michael E Talkowski: Stanley Center for Psychiatric Research, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  17. Mark J Daly: Stanley Center for Psychiatric Research, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  18. Harrison Brand: Program in Medical and Population Genetics, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  19. Konrad J Karczewski: Stanley Center for Psychiatric Research, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  20. Elizabeth G Atkinson: Stanley Center for Psychiatric Research, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  21. Alicia R Martin: Stanley Center for Psychiatric Research, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.

Abstract

Underrepresented populations are often excluded from genomic studies due in part to a lack of resources supporting their analyses. The 1000 Genomes Project (1kGP) and Human Genome Diversity Project (HGDP), which have recently been sequenced to high coverage, are valuable genomic resources because of the global diversity they capture and their open data sharing policies. Here, we harmonized a high-quality set of 4,094 whole genomes from HGDP and 1kGP with data from the Genome Aggregation Database (gnomAD) and identified over 153 million high-quality SNVs, indels, and SVs. We performed a detailed ancestry analysis of this cohort, characterizing population structure and patterns of admixture across populations, analyzing site frequency spectra, and measuring variant counts at global and subcontinental levels. We also demonstrate substantial added value from this dataset compared to the prior versions of the component resources, which are typically combined via liftover and variant intersection; for example, we catalog millions of new genetic variants, mostly rare, compared to previous releases. In addition to unrestricted individual-level public release, we provide detailed tutorials for conducting many of the most common quality control steps and analyses with these data in a scalable cloud-computing environment, and we publicly release this new phased joint callset for use as a haplotype resource in phasing and imputation pipelines. This jointly called reference panel will serve as a key resource to support research on populations of diverse ancestry.

Grants

  1. P30 DK043351/NIDDK NIH HHS
  2. R00 MH117229/NIMH NIH HHS
  3. R01 DE031261/NIDCR NIH HHS
  4. R01 MH115957/NIMH NIH HHS
