GWAS Atlas is a manually curated resource of genome-wide genotype-phenotype (G2P) associations for a wide range of species.
The current release of GWAS Atlas features a comprehensive collection of 278,109 curated G2Ps for 1,444 traits across 10 plant and 5 animal species, drawn from 830 publications.
More importantly, all traits were annotated and organized using terms from a suite of reference ontologies (PTO, Plant Trait Ontology; ATOL, Animal Trait Ontology for Livestock) and our customized ontologies (PPTO, Plant Phenotype and Trait Ontology; APTO, Animal Phenotype and Trait Ontology).
Taken together, GWAS Atlas integrates high-quality curated GWAS associations for animals and plants and provides user-friendly web interfaces for data browsing and downloading, thereby serving as a valuable resource for genetic research on important traits and for breeding applications.
To provide high-quality information curated from GWAS publications, we set up a standardized curation process involving literature search, information retrieval, integration and annotation, and database construction.
Figure 1: Overview of GWAS Atlas Curation Processes
We perform a literature search in PubMed using the species name and "GWAS" as keywords. Publications are eligible for inclusion in GWAS Atlas if they report significant GWAS associations together with the necessary description of the biological traits.
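The literature search step can be sketched as a query against the NCBI E-utilities ESearch endpoint. This is a minimal sketch under assumptions: the source does not give the exact query string, so the way the species name and "GWAS" are combined here, and the helper name `build_pubmed_query`, are illustrative only. The function builds the request URL without sending it.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Public NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(species: str, retmax: int = 100) -> str:
    """Return an ESearch URL that looks for GWAS papers on one species.

    Assumption: the two keywords are joined with AND; the curators'
    actual query is not documented in the text.
    """
    term = f'"{species}" AND GWAS'
    params = {
        "db": "pubmed",      # search the PubMed database
        "term": term,        # keyword query
        "retmax": retmax,    # maximum number of PMIDs to return
        "retmode": "json",   # ask for a JSON response
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_pubmed_query("Glycine max")
```

The returned PMIDs would still need manual screening against the eligibility criteria above; the search only produces candidates.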
We manually curate the study and G2P information from publications.
As one publication may contain multiple studies with different experimental designs, we record the species name, sampling spot, year, condition, population, sample size, genotyping technology, association model, number of associations, and PMID for each study. For each GWAS association, we collect the species name, genome version, genomic position, variant ID, trait, GWAS association P-value, R2 and mapped genes.
Table 1: The curation model for genome-wide genotype-phenotype associations
Data type | Examples
Genomic Location | chr1:129845
Reference Genome | Wm82.a2.v1
Environment | field
Sampling Spot | Beijing, China
Sampling Year | 2016
Condition | salt stress
Population | 319 landrace, 245 cultivated soybean accessions
Sample Size | 127
Tissue | leaf
Trait | plant height
Genotype Technology | Controlled vocabulary in Table 3
Association Model | Controlled vocabulary in Table 4
P-value | 0.00000077
R2(%) | 18.1
Gene ID | Glyma.01G003300
Gene Symbol | ET2
PMID | 29081789
Journal | Frontiers in Plant Science
Title | Genetic architecture of natural variation in rice nonphotochemical quenching capacity revealed by genome-wide association study
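As an illustration of the curation model in Table 1, a single association could be held in a small typed record. This is only a sketch: the class name, field names, and validation rules (P-value in (0, 1], R2 between 0 and 100 percent) are assumptions made for this example, not the database's actual schema, and only a subset of the Table 1 fields is shown.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class G2PAssociation:
    """Subset of the Table 1 fields (hypothetical field names)."""
    chrom: str
    position: int
    reference_genome: str
    trait: str
    p_value: float
    r2_percent: Optional[float]
    gene_id: str
    pmid: int


def parse_location(location: str) -> Tuple[str, int]:
    """Split a 'chr1:129845' style location into (chromosome, position)."""
    chrom, sep, pos = location.partition(":")
    if not sep or not chrom or not pos.isdigit():
        raise ValueError(f"malformed genomic location: {location!r}")
    return chrom, int(pos)


def make_association(location: str, reference_genome: str, trait: str,
                     p_value: float, r2_percent: Optional[float],
                     gene_id: str, pmid: int) -> G2PAssociation:
    """Build a record after basic sanity checks (assumed rules)."""
    chrom, position = parse_location(location)
    if not 0.0 < p_value <= 1.0:
        raise ValueError("P-value must lie in (0, 1]")
    if r2_percent is not None and not 0.0 <= r2_percent <= 100.0:
        raise ValueError("R2 is a percentage and must lie in [0, 100]")
    return G2PAssociation(chrom, position, reference_genome, trait,
                          p_value, r2_percent, gene_id, pmid)


# The example values from Table 1.
record = make_association("chr1:129845", "Wm82.a2.v1", "plant height",
                          0.00000077, 18.1, "Glyma.01G003300", 29081789)
```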
Table 2: The curation model for causal variants
Data type | Examples
Genomic Location | chr3:45301350
Reference Genome | Wm82.a2.v1
Gene Symbol | MYB4
Gene ID | Glyma.03G258700
Reference allele | A, T, C, G
Alternative allele | A, T, C, G
Area | Controlled vocabulary in Table 5
Trait | flavone content
Trait Impact | Controlled vocabulary in Table 6
Allele Impact | Controlled vocabulary in Table 7
Causal Type | Controlled vocabulary in Table 8
PMID | 32082354
Table 3: Controlled vocabulary for genotyping technology
Tech ID | Genotyping Technology | Abbreviation Name
1 | Whole Genome Sequencing | WGS
2 | Genotyping by Sequencing | GBS
3 | Genotyping by Array | Array
4 | Specific-Locus Amplified Fragment Sequencing | SLAF-seq
5 | Whole Exome Sequencing | WES
6 | RNA Sequencing | RNA-seq
7 | Restriction-site Associated DNA Sequencing | RAD-seq
8 | Droplet-assisted RNA Targeting by Single-cell Sequencing | DART-seq
9 | Polymerase Chain Reaction | PCR
10 | Unclassified | other
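A controlled vocabulary such as Table 3 is typically applied by normalizing the free-text label found in a publication to one of the fixed IDs. The sketch below shows one way this could look; the lookup function `tech_id` and its fallback to ID 10 ("other") for unrecognized labels are assumptions for illustration, while the vocabulary entries are copied from Table 3.

```python
# Tech ID -> (genotyping technology, abbreviation), as listed in Table 3.
GENOTYPING_TECH = {
    1: ("Whole Genome Sequencing", "WGS"),
    2: ("Genotyping by Sequencing", "GBS"),
    3: ("Genotyping by Array", "Array"),
    4: ("Specific-Locus Amplified Fragment Sequencing", "SLAF-seq"),
    5: ("Whole Exome Sequencing", "WES"),
    6: ("RNA Sequencing", "RNA-seq"),
    7: ("Restriction-site Associated DNA Sequencing", "RAD-seq"),
    8: ("Droplet-assisted RNA Targeting by Single-cell Sequencing", "DART-seq"),
    9: ("Polymerase Chain Reaction", "PCR"),
    10: ("Unclassified", "other"),
}

# Case-insensitive reverse lookups by abbreviation and by full name.
_ABBR_TO_ID = {abbr.lower(): tid for tid, (_, abbr) in GENOTYPING_TECH.items()}
_NAME_TO_ID = {name.lower(): tid for tid, (name, _) in GENOTYPING_TECH.items()}

UNCLASSIFIED_ID = 10


def tech_id(label: str) -> int:
    """Map a free-text label to a Tech ID; unknown labels become 'other'."""
    key = label.strip().lower()
    if key in _ABBR_TO_ID:
        return _ABBR_TO_ID[key]
    return _NAME_TO_ID.get(key, UNCLASSIFIED_ID)
```

The same pattern would apply to the association models in Table 4.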
Table 4: Controlled vocabulary for GWAS association model
Model ID | Model Name | Abbreviation Name
1 | Mixed Linear Model | MLM
2 | General Linear Model | GLM
3 | Logistic Regression Model | LRM
4 | Compressed Mixed Linear Model | CMLM
5 | Unified Mixed Linear Model | UMLM
6 | Efficient Mixed Model | EMMAX
7 | Multi-Locus Mixed Model | MLMM
8 | Bayesian Sparse Linear Mixed Model | BSLMM
9 | Factored Spectrally Transformed Linear Mixed Model | FaST-LMM
10 | Fixed and random model Circulating Probability Unification | FarmCPU
11 | Joint-Linkage Model | JLM
12 | Additive Inherence Model | Additive model
13 | Fisher's Exact Test | Fisher's exact test
14 | Least Squares Regression Model | LSR
15 | Chi Square test | X² test
16 | Case Control Model | Case Control Model
17 | Multi-Locus Random-SNP-Effect Mixed Linear Model | mrMLM
18 | Fast-Multi-Locus Random-SNP-Effect Mixed Linear Model | FASTmrEMMA
19 | Unclassified | other
Table 5: Controlled vocabulary for gene area
Area ID | Area Name
1 | 3_prime_UTR
2 | 5_prime_UTR
3 | exon
4 | intron
5 | CDS
6 | promoter
7 | upstream
8 | downstream
Table 6: Controlled vocabulary for trait impact
Vocabulary ID | Vocabulary Name
1 | increasing
2 | decreasing
3 | early
4 | delaying
5 | other
Table 7: Controlled vocabulary for allele impact
Vocabulary ID | Vocabulary Name
1 | inferior
2 | superior
3 | other
Table 8: Controlled vocabulary for causal type
Vocabulary ID | Vocabulary Name
1 | causal
2 | potential causal
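The causal-variant model in Table 2 draws four of its fields from the vocabularies in Tables 5 to 8, so a curated record can be checked mechanically before it is loaded. The following is a minimal sketch, assuming a plain dictionary record with hypothetical key names; the extra rule that the reference and alternative alleles must differ is also an assumption, not something the text states. The allele, area, and impact values in the example record are illustrative, not curated data.

```python
from typing import Dict, List

# Allowed values, copied from Tables 2 and 5-8.
ALLELES = {"A", "T", "C", "G"}
AREA = {"3_prime_UTR", "5_prime_UTR", "exon", "intron",
        "CDS", "promoter", "upstream", "downstream"}
TRAIT_IMPACT = {"increasing", "decreasing", "early", "delaying", "other"}
ALLELE_IMPACT = {"inferior", "superior", "other"}
CAUSAL_TYPE = {"causal", "potential causal"}

# Hypothetical record keys mapped to their controlled vocabulary.
CHECKS = {
    "ref_allele": ALLELES,
    "alt_allele": ALLELES,
    "area": AREA,
    "trait_impact": TRAIT_IMPACT,
    "allele_impact": ALLELE_IMPACT,
    "causal_type": CAUSAL_TYPE,
}


def validate_causal_variant(record: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    errors = []
    for field, allowed in CHECKS.items():
        value = record.get(field)
        if value not in allowed:
            errors.append(f"{field}: {value!r} is not in the controlled vocabulary")
    # Assumed rule: a variant needs two distinct alleles.
    if record.get("ref_allele") == record.get("alt_allele"):
        errors.append("ref_allele and alt_allele must differ")
    return errors


example = {
    "location": "chr3:45301350", "gene_symbol": "MYB4",
    "ref_allele": "A", "alt_allele": "G", "area": "promoter",
    "trait": "flavone content", "trait_impact": "increasing",
    "allele_impact": "superior", "causal_type": "potential causal",
}
problems = validate_causal_variant(example)
```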
As reference genome sequences are continuously updated, we unify the genomic locations of variants collected from different publications to the latest version of the reference genome in GVM using sequence-based searching. If a variant already has a record in the GVM database, we use its reference identifier as the VarID and redirect the user to the variant view in GVM.
All variants were annotated by VEP.
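The idea behind sequence-based coordinate unification can be sketched with plain string matching: take the sequence flanking a variant in the old assembly, locate it in the new assembly, and read off the new position. This is a toy illustration under stated assumptions. The source does not say which search tool is used, a real pipeline would most likely rely on a sequence aligner rather than exact matching, and the function name, flank size, and example sequences are invented for this sketch.

```python
from typing import Optional


def lift_position(old_seq: str, new_seq: str, pos: int,
                  flank: int = 10) -> Optional[int]:
    """Map a 1-based position from old_seq to new_seq via its flanking sequence.

    Returns the 1-based position in new_seq, or None when the flanking
    probe is missing or matches more than once (ambiguous placement).
    """
    idx = pos - 1
    if not 0 <= idx < len(old_seq):
        raise ValueError("position lies outside the old sequence")
    # Probe = variant base plus up to `flank` bases on each side.
    start = max(0, idx - flank)
    end = min(len(old_seq), idx + flank + 1)
    probe = old_seq[start:end]
    first = new_seq.find(probe)
    if first == -1:
        return None  # probe not present in the new assembly
    if new_seq.find(probe, first + 1) != -1:
        return None  # probe is not unique, so the mapping is ambiguous
    # Offset of the variant inside the probe, converted back to 1-based.
    return first + (idx - start) + 1


# Toy example: the new assembly gains three bases upstream of the variant.
old_assembly = "ACGTACGTTTGACCAGTAGGCATCGATCGGA"
new_assembly = "GGG" + old_assembly
new_pos = lift_position(old_assembly, new_assembly, 15)
```

In this toy case the variant base is unchanged and its coordinate shifts by the length of the upstream insertion.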
To unify the representation of biological traits, trait entities are mapped to a suite of reference ontologies (PTO; ATOL) and species-specific ontologies (CO) using the ‘term search’ function of the Planteome API and Livestock Ontologies.
Since not all curated traits are covered by existing ontologies, we additionally established PPTO and APTO by integrating a more comprehensive set of terms in the Open Biological and Biomedical Ontologies (OBO) format.
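For readers unfamiliar with the OBO flat-file format, a term is written as a `[Term]` stanza of `tag: value` lines. The sketch below emits such a stanza; the helper name and the identifiers `PPTO:0000000` and `PPTO:0000001` are hypothetical placeholders chosen for illustration and are not actual PPTO accessions.

```python
from typing import Iterable, Tuple


def obo_term(term_id: str, name: str, definition: str = "",
             parents: Iterable[Tuple[str, str]] = ()) -> str:
    """Format one ontology term as an OBO [Term] stanza.

    parents is an iterable of (parent_id, parent_name) pairs, each
    written as an is_a relationship with a trailing '!' comment.
    """
    lines = ["[Term]", f"id: {term_id}", f"name: {name}"]
    if definition:
        escaped = definition.replace('"', '\\"')  # escape embedded quotes
        lines.append(f'def: "{escaped}" []')       # empty dbxref list
    for parent_id, parent_name in parents:
        lines.append(f"is_a: {parent_id} ! {parent_name}")
    return "\n".join(lines)


# Placeholder IDs, for illustration only.
stanza = obo_term(
    "PPTO:0000001",
    "plant height",
    definition="Height of a whole plant.",
    parents=[("PPTO:0000000", "plant trait")],
)
print(stanza)
```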