| URL: | http://csbio.unc.edu/CCstatus/index.py?run=Pseudo |
| Full name: | |
| Description: | Here we provide genomic sequences for the Collaborative Cross (CC) mouse strains and the eight CC founder strains in the form of FASTA files for the 19 autosomes, sex chromosomes (X and Y), and mitochondria (M). These sequences can be used as reference sequences for high-throughput short-read alignments, or for any other comparative genomic analyses. Each genome comes with a companion MOD file, which can be used to remap coordinates from the FASTA sequences back to reference coordinates. This is necessary since, in general, all gene and genomic annotations are specified relative to the reference. MOD files are genome and version specific, and therefore should always be downloaded together as a set with their associated FASTA sequence. We supply two types of genomes, sequenced and imputed. Sequenced genomes result from direct DNA sequencing at a minimum of 30x coverage and an iterative alignment process. Imputed genomes are derived from genotype data, where we first construct a haplotype mosaic using MegaMUGA genotypes and then assemble an imputed genome using segments of DNA sequence from the inferred founders. |
| Year founded: | 2014 |
| Last update: | |
| Version: | |
| Accessibility: | Accessible |
| Country/Region: | United States |
| Data type: | |
| Data object: | |
| Database category: | |
| Major species: | |
| Keywords: | |
| University/Institution: | University of California Los Angeles |
| Address: | Department of Computer Science, University of California, Los Angeles, CA 90095, USA |
| City: | |
| Province/State: | |
| Country/Region: | United States |
| Contact name (PI/Team): | Wang W |
| Contact email (PI/Helpdesk): | weiwang@cs.ucla.edu |
A novel multi-alignment pipeline for high-throughput sequencing data. [PMID: 24948510]
Mapping reads to a reference sequence is a common step when analyzing allele effects in high-throughput sequencing data. The choice of reference is critical because its effect on quantitative sequence analysis is non-negligible. Recent studies suggest aligning to a single standard reference sequence, as is common practice, can lead to an underlying bias depending on the genetic distances of the target sequences from the reference. To avoid this bias, researchers have resorted to using modified reference sequences. Even with this improvement, various limitations and problems remain unsolved, which include reduced mapping ratios, shifts in read mappings and the selection of which variants to include to remove biases. To address these issues, we propose a novel and generic multi-alignment pipeline. Our pipeline integrates the genomic variations from known or suspected founders into separate reference sequences and performs alignments to each one. By mapping reads to multiple reference sequences and merging them afterward, we are able to rescue more reads and diminish the bias caused by using a single common reference. Moreover, the genomic origin of each read is determined and annotated during the merging process, providing a better source of information to assess differential expression than simple allele queries at known variant positions. Using RNA-seq of a diallel cross, we compare our pipeline with the single-reference pipeline and demonstrate our advantages of more aligned reads and a higher percentage of reads with assigned origins. Database URL: http://csbio.unc.edu/CCstatus/index.py?run=Pseudo.
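
The merging step described in the abstract can be illustrated with a minimal Python sketch. It is not the published pipeline: the `Hit` record, the `merge_hits` function, the founder labels, and the rule "keep the founder(s) with the fewest mismatches" are all assumptions made for illustration. The sketch also assumes that each alignment position has already been remapped to standard reference coordinates (for example with the companion MOD file), which is not shown here.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    """One alignment of a read to one founder-specific reference (hypothetical record)."""
    read_id: str
    founder: str     # label of the reference sequence the read was aligned to
    ref_pos: int     # position, assumed already remapped to standard reference coordinates
    mismatches: int  # number of mismatches reported for this alignment


def merge_hits(hits: List[Hit]) -> Dict[str, dict]:
    """Merge per-founder alignments and annotate each read with its likely origin(s).

    Assumed rule: a read is assigned to the founder(s) whose alignment has the
    fewest mismatches; ties keep every tied founder, meaning the origin is ambiguous.
    """
    by_read = defaultdict(list)
    for hit in hits:
        by_read[hit.read_id].append(hit)

    merged = {}
    for read_id, group in by_read.items():
        best = min(h.mismatches for h in group)
        winners = [h for h in group if h.mismatches == best]
        merged[read_id] = {
            # Simplification: take the position of the first best alignment.
            "ref_pos": winners[0].ref_pos,
            "mismatches": best,
            "origins": sorted({h.founder for h in winners}),
        }
    return merged


# Example with two hypothetical founder references, "A" and "B".
hits = [
    Hit("read1", "A", 100, 0),
    Hit("read1", "B", 100, 2),  # read1 matches founder A better
    Hit("read2", "A", 250, 1),
    Hit("read2", "B", 250, 1),  # tie: origin of read2 is ambiguous
    Hit("read3", "B", 400, 0),  # aligned only to B: rescued by the second reference
]
for read_id, info in merge_hits(hits).items():
    print(read_id, info)
```

In this toy input, `read3` aligns only to the second reference, which mirrors the "rescued reads" idea: a single-reference alignment against `A` alone would have left it unmapped.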