Klumpy: A tool to evaluate the integrity of long-read genome assemblies and illusive sequence motifs.

Giovanni Madrigal, Bushra Fazal Minhas, Julian Catchen
Author Information
  1. Giovanni Madrigal: Department of Evolution, Ecology, and Behavior, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA.
  2. Bushra Fazal Minhas: Informatics Program, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA.
  3. Julian Catchen: Department of Evolution, Ecology, and Behavior, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA.

Abstract

The improvement and decreasing costs of third-generation sequencing technologies have widened the scope of biological questions researchers can address with de novo genome assemblies. With the increasing number of reference genomes, validating their integrity with minimal overhead is vital for establishing confident results in their applications. Here, we present Klumpy, a tool for detecting and visualizing both misassembled regions in a genome assembly and genetic elements (e.g., genes) of interest in a set of sequences. Klumpy leverages the initial raw reads in combination with their respective genome assembly. We illustrate its utility by investigating antifreeze glycoprotein (afgp) loci across two icefishes, by searching for a reportedly absent gene in the northern snakehead fish, and by scanning the reference genomes of a mudskipper and a bumblebee for misassembled regions. In the two former cases, we were able to provide support for the noncanonical placement of an afgp locus in the icefishes and to locate the missing snakehead gene. Furthermore, our genome scans were able to identify an unmappable locus in the mudskipper reference genome and a putative repetitive element shared among several species of bees.

Keywords

Associated Data

RefSeq | GCA_004786185.1; GCF_026225935.1; GCF_024542735.1; GCF_024516045.1; GCA_905332935.1; GCA_911387925.2; GCA_930367275.1; GCA_911622165.2; GCF_910591885.1

Grants

  1. 1645087/Division of Antarctic Sciences

MeSH Term

Animals
Bees
Computational Biology
Sequence Analysis, DNA
Antifreeze Proteins
High-Throughput Nucleotide Sequencing
Genome
Genomics
Software
Fishes

Chemicals

Antifreeze Proteins
