rCRUX: A Rapid and Versatile Tool for Generating Metabarcoding Reference libraries in R.

Emily E Curd, Luna Gal, Ramon Gallego, Katherine Silliman, Shaun Nielsen, Zachary Gold
Author Information
  1. Emily E Curd: Vermont Biomedical Research Network, University of Vermont, VT, USA.
  2. Luna Gal: Landmark College, VT, USA.
  3. Ramon Gallego: Departamento de Biología, Universidad Autónoma de Madrid, Cantoblanco, Madrid, Spain.
  4. Katherine Silliman: Northern Gulf Institute, Mississippi State University, Starkville, MS, USA.
  5. Shaun Nielsen: Independent Researcher.
  6. Zachary Gold: California Cooperative Oceanic Fisheries Investigations (CalCOFI), Scripps Institution of Oceanography, University of California San Diego (UCSD), La Jolla, CA, USA.

Abstract

The sequencing revolution requires accurate taxonomic classification of DNA sequences. Key to making accurate taxonomic assignments are curated, comprehensive reference barcode databases. However, generating and curating such databases remains challenging given the large and continuously growing volumes of both DNA sequence data and novel reference barcode targets. Monitoring and research applications require a greater diversity of specialized gene regions and targeted taxa than are currently curated by professional staff. Thus, there is a growing need for an easy-to-implement computational tool that can generate comprehensive metabarcoding reference libraries for any bespoke locus. We address this need by reimagining CRUX from the Anacapa Toolkit and present the rCRUX package in R which, like its predecessor, relies on sequence homology and PCR primer compatibility instead of keyword searches to avoid the limitations of user-defined metadata. The typical workflow first identifies plausible seed amplicons by simulating PCR, which yields a set of sequences analogous to PCR products containing a user-defined set of primer sequences. Next, these seeds are iteratively searched with BLAST against a local copy of the National Center for Biotechnology Information (NCBI)-formatted database using a taxonomic-rank-based stratified random sampling approach, yielding a comprehensive set of sequence matches. This database is then dereplicated and cleaned by identifying identical reference sequences and collapsing the taxonomic path to the lowest taxonomic agreement across all matching reads. The result is a curated, comprehensive database of primer-specific reference barcode sequences from NCBI. Databases can then be compared to determine read and taxonomic overlap.
We demonstrate that rCRUX provides more comprehensive reference databases for the MiFish Universal Teleost 12S, Taberlet trnL, fungal ITS, and Leray CO1 loci than CRABS, MetaCurator, RESCRIPt, and ecoPCR reference databases. We then further demonstrate the utility of rCRUX by generating 24 reference databases for 20 metabarcoding loci, many of which lack dedicated reference database curation efforts. The rCRUX package provides a simple-to-use tool for the generation of curated, comprehensive reference databases for user-defined loci, facilitating accurate and effective taxonomic classification in metabarcoding and in DNA sequencing efforts more broadly.
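
The dereplication step described in the abstract (group identical reference sequences, then collapse their taxonomic paths to the lowest rank on which all copies agree) can be illustrated with a minimal Python sketch. This is not the rCRUX implementation or its R API. The function names, the seven-rank layout, the "NA" placeholder for unresolved ranks, and the toy records are all assumptions made for illustration only.

```python
from collections import defaultdict

# Assumed rank layout for this sketch; every path must list these ranks in order.
RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus", "species")


def collapse_taxonomy(paths):
    """Keep ranks shared by every path; mark the first disagreement and all lower ranks 'NA'."""
    consensus = []
    agreed = True
    for values in zip(*paths):  # walk the paths rank by rank, highest rank first
        if agreed and len(set(values)) == 1:
            consensus.append(values[0])
        else:
            agreed = False
            consensus.append("NA")
    return tuple(consensus)


def dereplicate(records):
    """Group (sequence, path) pairs by identical sequence and collapse each group's taxonomy."""
    by_sequence = defaultdict(list)
    for sequence, path in records:
        if len(path) != len(RANKS):
            raise ValueError("each path must have one entry per rank")
        by_sequence[sequence.upper()].append(tuple(path))
    return {seq: collapse_taxonomy(paths) for seq, paths in by_sequence.items()}


# Hypothetical toy records: two identical sequences from different species of one genus,
# plus one unrelated sequence.
records = [
    ("ACGTACGT", ("Eukaryota", "PhylumA", "ClassA", "OrderA", "FamilyA", "GenusA", "GenusA sp1")),
    ("acgtacgt", ("Eukaryota", "PhylumA", "ClassA", "OrderA", "FamilyA", "GenusA", "GenusA sp2")),
    ("TTGACCAA", ("Eukaryota", "PhylumA", "ClassB", "OrderB", "FamilyB", "GenusB", "GenusB sp1")),
]

collapsed = dereplicate(records)
for seq, path in collapsed.items():
    print(seq, ";".join(path))
```

In this toy input the two identical sequences collapse into a single entry resolved only to genus, while the unique sequence keeps its full species-level path.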

Grants

  1. P20 GM103449/NIGMS NIH HHS
