Systematic bias in malaria parasite relatedness estimation.

Somya Mehra, Daniel E Neafsey, Michael White, Aimee R Taylor
Author Information
  1. Somya Mehra: Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Boston, Massachusetts 02115, USA.
  2. Daniel E Neafsey: Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Boston, Massachusetts 02115, USA.
  3. Michael White: Infectious Disease Epidemiology and Analytics G5 Unit, Institut Pasteur, Université Paris Cité, Paris 75015, France.
  4. Aimee R Taylor: Infectious Disease Epidemiology and Analytics G5 Unit, Institut Pasteur, Université Paris Cité, Paris 75015, France.

Abstract

Genetic studies of Plasmodium parasites increasingly feature relatedness estimates. However, various aspects of malaria parasite relatedness estimation are not fully understood. For example, relatedness estimates based on whole-genome-sequence (WGS) data often exceed those based on sparser data types. Systematic bias in relatedness estimation is well documented in the literature geared towards diploid organisms, but largely unknown within the malaria community. We characterise systematic bias in malaria parasite relatedness estimation using three complementary approaches: theoretically, under a non-ancestral statistical model of pairwise relatedness; numerically, under a simulation model of ancestry; and empirically, using data on parasites sampled from Guyana and Colombia. We show that allele frequency estimates encode, locus-by-locus, relatedness averaged over the set of sampled parasites used to compute them. Plugging sample allele frequencies into models of pairwise relatedness can lead to systematic underestimation. However, systematic underestimation can be viewed as population-relatedness calibration, i.e., a way of generating measures of relative relatedness. Systematic underestimation is unavoidable when relatedness is estimated assuming independence between genetic markers. It is mitigated when relatedness is estimated using WGS data under a hidden Markov model (HMM) that exploits linkage between proximal markers. The extent of mitigation is unknowable when a HMM is fit to sparser data, but downstream analyses that use high relatedness thresholds are relatively robust regardless. In summary, practitioners can either resolve to use relative relatedness estimated under independence, or try to estimate absolute relatedness under a HMM. We propose various tools to help practitioners evaluate their situation on a case-by-case basis.
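
The underestimation that arises when sample allele frequencies are plugged into a pairwise model under independence can be illustrated with a toy example. The sketch below is not the authors' model or software. It assumes biallelic markers, haploid genotypes, a population allele frequency of 0.5 at every locus, a grid-search maximum-likelihood estimate, and a hypothetical simulation scheme in which every sampled parasite copies a shared "family" allele with a fixed probability. All function and parameter names are illustrative.

```python
import math
import random


def log_likelihood(r, x, y, freqs):
    """Log-likelihood of relatedness r for two haploid genotypes x and y
    (lists of 0/1 alleles), treating loci as independent.
    freqs[t] is the assumed frequency of allele 1 at locus t."""
    total = 0.0
    for a, b, f1 in zip(x, y, freqs):
        fa = f1 if a == 1 else 1.0 - f1
        fb = f1 if b == 1 else 1.0 - f1
        # IBD with probability r (alleles must match), otherwise independent draws.
        prob = r * (fa if a == b else 0.0) + (1.0 - r) * fa * fb
        if prob <= 0.0:
            return float("-inf")
        total += math.log(prob)
    return total


def estimate_r(x, y, freqs, grid_size=101):
    """Grid-search maximum-likelihood estimate of r on [0, 1]."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda r: log_likelihood(r, x, y, freqs))


def simulate_related_sample(n_parasites, n_loci, share_prob, rng):
    """Hypothetical scheme: at each locus every parasite copies a shared
    family allele with probability share_prob, else draws from the
    population (allele frequency 0.5). Any pair is then IBD at a locus
    with probability share_prob ** 2."""
    sample = [[] for _ in range(n_parasites)]
    for _ in range(n_loci):
        family_allele = rng.randint(0, 1)
        for genotype in sample:
            if rng.random() < share_prob:
                genotype.append(family_allele)
            else:
                genotype.append(rng.randint(0, 1))
    return sample


def demo(seed=1, n_parasites=10, n_loci=1000, pair_relatedness=0.5):
    """Estimate r for one pair using population vs sample allele frequencies."""
    rng = random.Random(seed)
    sample = simulate_related_sample(
        n_parasites, n_loci, math.sqrt(pair_relatedness), rng
    )
    population_freqs = [0.5] * n_loci
    # Sample frequencies are computed from the same related parasites.
    sample_freqs = [
        sum(genotype[t] for genotype in sample) / n_parasites
        for t in range(n_loci)
    ]
    x, y = sample[0], sample[1]
    r_population = estimate_r(x, y, population_freqs)
    r_sample = estimate_r(x, y, sample_freqs)
    return r_population, r_sample


if __name__ == "__main__":
    r_population, r_sample = demo()
    print("estimate with population frequencies:", r_population)
    print("estimate with sample frequencies:", r_sample)
```

In this toy setting every pair in the sample is equally related, so the sample allele frequencies already absorb the shared ancestry and the estimate based on them falls well below the estimate based on the population frequencies. This mirrors the reading of such estimates as relatedness relative to the sample average rather than as absolute relatedness.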
