Documentation for GWH Tools

In this tutorial, we walk through a case study (Niu, J., Wang, W., Wang, Z. et al. Tagging large CNV blocks in wheat boosts digitalization of germplasm resources by ultra-low-coverage sequencing. Genome Biol 25, 171 (2024). https://doi.org/10.1186/s13059-024-03315-6) that combines GWH data with other wheat genome sequence data. The study presents a cost-effective genotyping approach that uses large copy number variation blocks (CNVbs) to digitalize and manage wheat germplasm resources with ultra-low-coverage sequencing, providing a platform for variety identification and modern breeding applications.

The authors used previously published de novo assembled wheat genome sequences, including GWHANRF00000000, to construct a pan-genome reference by iteratively mapping whole-genome resequencing data against 16 assembled reference genomes. This approach identified genomic regions absent from the Chinese Spring reference genome, making the pan-genome more representative. These reference sequences were essential for tagging CNVbs across wheat accessions, enabling accurate profiling even at low sequencing coverage. The step-by-step guide below shows how to make the most of GWH wheat genome sequences:


Step 1: Download wheat reference genomes (from NCBI or Ensembl Plants) and collect additional genome assemblies like GWHANRF00000000 from GWH.


Step 2: Construct a wheat pan-genome. Start with the core reference (e.g., Chinese Spring), then add sequences that are missing from it but present in other assemblies. These sequences are found by mapping and iteratively identifying new genomic blocks. The resulting pan-genome makes it possible to identify structural variations across accessions and represents the diversity of the collection more completely than a single reference.
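
The iterative "add what is missing" logic of Step 2 can be illustrated with a toy sketch. Note that this is not the authors' pipeline: the study identified novel regions by iteratively mapping resequencing data, whereas the sketch below swaps in a simple k-mer sharing test as a stand-in for "already represented". The function names, the block dictionaries, and the `k` and `max_shared` parameters are all hypothetical.

```python
def kmers(seq, k=5):
    """Return the set of all k-mers in a sequence (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_pangenome(core_blocks, other_assemblies, k=5, max_shared=0.5):
    """Start from the core reference and add blocks that look novel.

    core_blocks: dict of block id -> sequence for the core reference.
    other_assemblies: list of (assembly name, dict of block id -> sequence),
        processed in order, so earlier assemblies take priority.
    A block counts as novel when at most `max_shared` of its k-mers are
    already present in the growing pan-genome (an assumed criterion).
    """
    pan_blocks = dict(core_blocks)
    seen = set()
    for seq in pan_blocks.values():
        seen |= kmers(seq, k)

    for name, blocks in other_assemblies:
        for block_id, seq in blocks.items():
            block_kmers = kmers(seq, k)
            if not block_kmers:
                continue  # block too short to evaluate
            shared = len(block_kmers & seen) / len(block_kmers)
            if shared <= max_shared:
                pan_blocks[f"{name}:{block_id}"] = seq
                seen |= block_kmers  # later assemblies are compared against this too
    return pan_blocks


core = {"cs_1": "ACGTACGTACGT"}
others = [
    ("asmA", {"b1": "ACGTACGTAC", "b2": "TTTTGGGGCCCC"}),
    ("asmB", {"b1": "TTTTGGGGCCCC"}),
]
pan = build_pangenome(core, others)
print(list(pan))
```

Because the set of seen k-mers grows after every accepted block, a sequence contributed by one assembly is not added again when it reappears in a later one, which mirrors the iterative nature of the construction.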


Step 3: Map whole-genome resequencing data against the pan-genome. Once the pan-genome is ready, map the resequencing data of each wheat accession against it to identify CNVs and other variations. In the study, the authors calculated read depth in 100 kb windows for CNV identification.
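
As a rough illustration of windowed depth calculation, the sketch below counts reads per 100 kb window from a list of read start positions and normalizes by the median window. It uses read counts as a proxy for depth, and the median normalization is an assumption, since the study's exact normalization is not described here. In practice the positions would come from an alignment file, and the function names are hypothetical.

```python
WINDOW_SIZE = 100_000  # 100 kb, the window size reported in the study


def window_read_counts(read_starts, chrom_length, window_size=WINDOW_SIZE):
    """Count reads per fixed-size window from 0-based read start positions."""
    n_windows = (chrom_length + window_size - 1) // window_size  # ceiling division
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < chrom_length:  # ignore positions outside the chromosome
            counts[pos // window_size] += 1
    return counts


def normalized_depth(counts):
    """Divide each window count by the median count, so 1.0 means typical depth."""
    ordered = sorted(counts)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2
    if median == 0:
        raise ValueError("median count is zero; coverage too sparse to normalize")
    return [c / median for c in counts]


starts = [10, 50_000, 150_000, 160_000, 170_000, 250_000]
counts = window_read_counts(starts, chrom_length=300_000)
print(counts)                    # [2, 3, 1]
print(normalized_depth(counts))  # [1.0, 1.5, 0.5]
```

Windows with normalized depth well below or above 1.0 are candidates for copy number loss or gain, which is the signal refined in the next step.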


Step 4: Filter and tag CNV blocks. To make the identified CNVs useful for research, refine the raw CNVs into markers. Apply Hidden Markov Models (HMMs) to reduce noise and identify consistent CNV blocks across varieties. Group closely linked CNVs to form CNV markers. Confirm that these markers are stable by validating their detection at low sequencing depths (ultra-low-coverage sequencing). The authors merged overlapping CNV blocks and used a filtering criterion based on the proportion of overlapping regions to refine the markers.
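
To show how an HMM can smooth noisy window depths into blocks, here is a minimal Viterbi sketch. It is not the authors' model: the three states, the Gaussian emission means and `SIGMA`, and the `STAY_PROB` transition value are all assumed for illustration, and the function names are hypothetical.

```python
import math

STATES = ("loss", "normal", "gain")
MEANS = {"loss": 0.0, "normal": 1.0, "gain": 2.0}  # assumed normalized depths
SIGMA = 0.35      # assumed emission noise
STAY_PROB = 0.9   # assumed probability of staying in the same state


def log_emission(state, depth):
    """Gaussian log density, up to a constant shared by all states."""
    z = (depth - MEANS[state]) / SIGMA
    return -0.5 * z * z


def log_transition(prev, cur):
    if prev == cur:
        return math.log(STAY_PROB)
    return math.log((1.0 - STAY_PROB) / (len(STATES) - 1))


def viterbi(depths):
    """Most likely state per window, assuming a uniform prior on the first window."""
    if not depths:
        return []
    score = {s: log_emission(s, depths[0]) for s in STATES}
    backpointers = []
    for depth in depths[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda prev: score[prev] + log_transition(prev, cur))
            pointers[cur] = best_prev
            new_score[cur] = (score[best_prev] + log_transition(best_prev, cur)
                              + log_emission(cur, depth))
        score = new_score
        backpointers.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):  # trace back from the last window
        state = pointers[state]
        path.append(state)
    return path[::-1]


def path_to_blocks(path, window_size=100_000):
    """Collapse runs of non-normal states into (state, start, end) blocks."""
    blocks, run_start = [], 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[run_start]:
            if path[run_start] != "normal":
                blocks.append((path[run_start], run_start * window_size, i * window_size))
            run_start = i
    return blocks


depths = [1.0, 1.1, 0.9, 0.4, 1.0, 0.1, 0.0, 0.1, 1.0, 1.0]
print(path_to_blocks(viterbi(depths)))
```

With these assumed parameters, the isolated dip at the fourth window is treated as noise, while the three consecutive low-depth windows are reported as one loss block. Raising `STAY_PROB` makes the model more reluctant to open short blocks.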


Step 5: Link CNV markers to traits. Identify regions associated with traits such as disease resistance or yield. Use data from trait studies and in silico analyses to associate specific CNV markers with beneficial alleles. Validate these findings with traditional wet-lab techniques such as PCR, which can confirm the presence of markers linked to specific traits.
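
One simple way to screen a marker for association with a binary trait is a one-sided Fisher's exact test on a 2x2 table of marker presence versus trait status. This is an assumed choice of test for illustration, not necessarily the analysis used in the study, and the function name and the counts in the example are hypothetical.

```python
from math import comb


def fisher_enrichment_p(marker_trait, marker_no_trait, no_marker_trait, no_marker_no_trait):
    """One-sided Fisher's exact p-value for the marker being enriched in the trait group.

    Arguments are the four cell counts of the 2x2 table. The p-value is the
    hypergeometric probability of seeing at least `marker_trait` carriers
    among the accessions that show the trait.
    """
    n = marker_trait + marker_no_trait + no_marker_trait + no_marker_no_trait
    carriers = marker_trait + marker_no_trait
    with_trait = marker_trait + no_marker_trait
    total = comb(n, with_trait)
    upper = min(carriers, with_trait)
    favorable = sum(
        comb(carriers, k) * comb(n - carriers, with_trait - k)
        for k in range(marker_trait, upper + 1)
    )
    return favorable / total


# Hypothetical counts: 8 of 10 carriers are resistant, 1 of 10 non-carriers is.
p = fisher_enrichment_p(8, 2, 1, 9)
print(f"{p:.4f}")
```

A small p-value only flags a candidate marker. When many markers are screened, a multiple-testing correction is needed, and wet-lab validation remains the deciding step.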


Step 6: Develop an interactive database. The WheatCNVb platform provides a QR-code-like representation of each wheat variety’s CNV fingerprint.
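
To give a feel for what a QR-code-like fingerprint could look like, the sketch below lays out a variety's presence/absence marker calls on a square grid. The actual encoding used by the WheatCNVb platform is not described here, so the layout, the function name, and the example calls are purely illustrative assumptions.

```python
import math


def fingerprint_grid(marker_calls, filled="#", empty="."):
    """Render presence/absence calls as a square text grid, row by row.

    Unused cells in the last row are padded with the `empty` symbol.
    """
    n = len(marker_calls)
    if n == 0:
        return ""
    side = math.isqrt(n - 1) + 1  # smallest side with side * side >= n
    padded = list(marker_calls) + [0] * (side * side - n)
    rows = []
    for r in range(side):
        row = padded[r * side:(r + 1) * side]
        rows.append("".join(filled if call else empty for call in row))
    return "\n".join(rows)


calls = [1, 0, 1, 1, 0, 0, 1]  # hypothetical calls for seven markers
print(fingerprint_grid(calls))
```

Keeping the marker order fixed across varieties is what makes two grids directly comparable, since each cell then always refers to the same CNV marker.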


Step 7: Utilize ultra-low-coverage sequencing for cost-effective analysis. A key point of the case study is that ultra-low-coverage sequencing is sufficient for quick, cost-effective CNV analysis. The study validated CNV markers from ultra-low-coverage data with 99.3% accuracy in recalling markers, demonstrating that the approach is reliable as well as inexpensive.
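
A recall figure of this kind can be computed by comparing marker calls from ultra-low-coverage data against calls from a trusted reference set. The sketch below uses the standard definition of recall; how the study defined its accuracy metric is not detailed here, and the function name and example calls are hypothetical.

```python
def marker_recall(reference_calls, low_cov_calls):
    """Fraction of markers present in the reference calls that are also
    called present in the low-coverage data.

    Both arguments map marker id -> bool (present or absent). Markers missing
    from the low-coverage calls count as not recovered.
    """
    truth = {m for m, present in reference_calls.items() if present}
    if not truth:
        raise ValueError("reference set contains no present markers")
    recovered = sum(1 for m in truth if low_cov_calls.get(m, False))
    return recovered / len(truth)


reference = {"m1": True, "m2": True, "m3": False, "m4": True}
low_cov = {"m1": True, "m2": False, "m3": True, "m4": True}
print(f"recall = {marker_recall(reference, low_cov):.3f}")
```

Recall ignores false positives such as `m3` in this example, so reporting precision alongside it gives a fuller picture of marker stability at low depth.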