Novel sequences (NSs), not present in the human reference genome, are abundant and remain largely unexplored. Here, we use de novo assembly to study NSs in 1,000 Swedish individuals first sequenced as part of the SweGen project, revealing a total of 46 Mb in 61,044 distinct contigs of sequence not present in GRCh38. The contigs were aligned to recently published catalogs of Icelandic and Pan-African NSs, as well as to the chimpanzee genome, revealing a great diversity of shared sequences. Analyzing the positions of NSs across the chimpanzee genome, we find that 2,807 NSs align confidently within 143 chimpanzee orthologs of human genes. By aligning the whole-genome sequencing data to the chimpanzee genome, we discover ancestral NSs that are common throughout the Swedish population. The NSs were searched for repeats and repeat elements, revealing a majority of repetitive sequence (56%) and an enrichment of simple repeats (28%) and satellites (15%). Lastly, we align the unmappable reads of a subset of the 1000 Genomes Project data to our collection of NSs, as well as to the previously published Pan-African NSs, revealing that both the Swedish and Pan-African NSs are widespread and that the Swedish NSs are largely a subset of the Pan-African NSs. Overall, these results highlight the importance of creating a more diverse reference genome and illustrate that a significant amount of the NSs may be of ancestral origin.
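To make the scale of the assembly concrete, the following is a minimal sketch of how summary statistics such as the contig count and total length could be computed from a FASTA file of contigs. It is not the pipeline used in this study; the function names `contig_lengths` and `summarize` and the toy sequences are hypothetical and chosen only for illustration.

```python
from io import StringIO
from typing import Dict, Iterable, Iterator, List


def contig_lengths(fasta_lines: Iterable[str]) -> Iterator[int]:
    """Yield the sequence length of each record in FASTA-formatted lines."""
    length = None  # None until the first header line is seen
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


def summarize(lengths: List[int]) -> Dict[str, float]:
    """Return contig count, total bases, mean length, and N50."""
    total = sum(lengths)
    running = 0
    n50 = 0
    # N50: length of the contig at which the cumulative sum of
    # lengths (longest first) reaches half of the total.
    for contig_len in sorted(lengths, reverse=True):
        running += contig_len
        if running * 2 >= total:
            n50 = contig_len
            break
    mean = total / len(lengths) if lengths else 0.0
    return {
        "contigs": len(lengths),
        "total_bp": total,
        "mean_bp": mean,
        "n50_bp": n50,
    }


if __name__ == "__main__":
    # Toy input (hypothetical sequences), standing in for a contig FASTA file.
    toy = StringIO(">a\nACGT\n>b\nACGTACGT\n>c\nACGTAC\nGTACGT\n")
    print(summarize(list(contig_lengths(toy))))
```

Applied to the figures reported above, 46 Mb spread over 61,044 contigs corresponds to a mean contig length of roughly 750 bp.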