Disambiguating a Soft Metagenomic Clustering

Rahul Nihalani, Jaroslaw Zola, Srinivas Aluru
Author Information
  1. Rahul Nihalani: Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA.
  2. Jaroslaw Zola: Department of Computer Science and Engineering, University at Buffalo, Buffalo, New York, USA.
  3. Srinivas Aluru: Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA.

Abstract

Clustering is a popular technique for analyzing amplicon sequencing data in metagenomics. Specifically, it is used to assign sequences (reads) to clusters, each cluster representing a species or a higher-level taxonomic unit. Because reads from multiple species often share subsequences, and no perfect similarity measure exists, it is difficult to assign reads to clusters correctly. Metagenomic clustering methods must therefore either resort to ambiguity or make the best available choice at each read assignment stage, which can lead to incorrect clusters and potentially cascading errors. In this article, we argue for first generating an ambiguous clustering and then resolving the ambiguities collectively by analyzing the ambiguous clusters. We propose a rigorous formulation of this problem and show that it is NP-Hard. We then propose an efficient heuristic to solve it in practice. We validate our approach on several synthetically generated datasets and on two datasets consisting of 16S rDNA sequences from the microbiome of rat guts.
