SR4R (SnpReady for Rice) is a database of four reference SNP panels for rice: 2,097,405 hapmapSNPs, 156,502 tagSNPs, 1,180 fixedSNPs and 38 barcodeSNPs. Together they serve different needs in population genetics, evolutionary analysis, association studies and genomic breeding.
SR4R adopts machine learning algorithms to process massive raw SNPs and features the lowest SNP redundancy and highest genetic diversity of rice populations, significantly improving the efficiency of using the rice variation map for different purposes.
We used the following steps to obtain the hapmapSNP panel. First, 2,556 accessions with a genotype missing rate below 20% were selected. Second, SNPs with a genotype missing rate ≥ 0.1 and minor allele frequency (MAF) ≤ 0.05 were removed. Third, genotype imputation was performed on the 2,556 accessions using Beagle. Finally, a high-quality HapMap containing 2,097,405 SNPs without any missing genotypes was generated.
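The two filtering steps can be illustrated with a minimal Python sketch. It is not the pipeline used to build SR4R: the function names, the 0/1/2 genotype coding and the use of None for a missing call are assumptions made for illustration, and the sketch assumes a SNP is dropped when either of the two conditions holds. Imputation with Beagle is not shown.

```python
def missing_rate(calls):
    """Fraction of missing genotype calls (None) in a list of calls."""
    return sum(c is None for c in calls) / len(calls)


def minor_allele_frequency(calls):
    """MAF of a biallelic SNP coded as 0/1/2 alternate-allele counts."""
    called = [c for c in calls if c is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def filter_panel(genotypes, max_sample_missing=0.2,
                 max_snp_missing=0.1, min_maf=0.05):
    """genotypes: dict mapping accession name -> list of calls, one per SNP.

    Returns (kept accession names, kept SNP indices).
    """
    # Step 1: keep accessions whose missing rate is below the threshold.
    kept_samples = [name for name, row in genotypes.items()
                    if missing_rate(row) < max_sample_missing]
    # Step 2: among the kept accessions, drop SNPs with too many missing
    # calls or too low a MAF (assumption: either condition removes the SNP).
    n_snps = len(next(iter(genotypes.values())))
    kept_snps = []
    for j in range(n_snps):
        column = [genotypes[name][j] for name in kept_samples]
        if missing_rate(column) >= max_snp_missing:
            continue
        if minor_allele_frequency(column) <= min_maf:
            continue
        kept_snps.append(j)
    return kept_samples, kept_snps


# Hypothetical toy data: four accessions, six SNPs.
toy = {
    "acc1": [0, 2, 0, 1, 2, 0],
    "acc2": [0, 0, None, 2, 2, 1],
    "acc3": [0, 2, 0, 0, 2, 2],
    "acc4": [None, None, 0, 1, 2, 1],
}
samples, snps = filter_panel(toy)
```

In this toy input, acc4 is dropped for missingness, and the SNPs that are monomorphic or have a missing call among the remaining accessions are dropped in the second step.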
Genomic annotation of the hapmapSNPs was performed using ANNOVAR (version 20160201) against the rice IRGSP (International Rice Genome Sequencing Project) gene annotation.
We adopted an LD-based SNP pruning procedure to infer haplotype-tagging SNPs (tagSNPs) from the hapmapSNPs. Given that the reported LD length in rice ranges from 40 to 500 Kb, the tagSNP panel was constructed using PLINK with the --indep option. The PLINK parameters were selected based on the variance inflation factor (VIF): SNPs were removed recursively within a sliding window of 50 SNPs, and the window was shifted in steps of 5 SNPs.
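The sliding-window idea can be sketched in a few lines of Python. This is a simplification and not PLINK's implementation: PLINK's --indep computes the VIF from a regression of each SNP on all other SNPs in the window, whereas the sketch below only compares SNPs pairwise, using the relation VIF = 1 / (1 - R²) to turn a VIF threshold into an r² threshold. The VIF threshold of 2 is a hypothetical value chosen for illustration (the threshold used for SR4R is not stated here), and the function names are assumptions.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def prune(snps, window=50, step=5, vif_threshold=2.0):
    """Greedy pairwise pruning; snps is a list of genotype vectors in
    genomic order. Returns the indices of the SNPs that are kept."""
    r2_threshold = 1 - 1 / vif_threshold  # from VIF = 1 / (1 - R^2)
    removed = set()
    start = 0
    while start < len(snps):
        end = min(start + window, len(snps))
        in_window = [i for i in range(start, end) if i not in removed]
        for pos, i in enumerate(in_window):
            if i in removed:
                continue
            for j in in_window[pos + 1:]:
                if j not in removed and r_squared(snps[i], snps[j]) > r2_threshold:
                    removed.add(j)  # keep the earlier SNP, drop the later one
        start += step  # shift the window by the step size
    return [i for i in range(len(snps)) if i not in removed]


# Hypothetical toy data: SNPs 1 and 2 are in complete LD with SNP 0.
toy_snps = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [2, 1, 0, 2, 1, 0],
    [0, 0, 1, 2, 2, 1],
]
kept = prune(toy_snps)
```

On the toy input the two SNPs that are perfectly correlated with the first one are pruned, while the uncorrelated fourth SNP is retained.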
The following steps were used to generate the fixedSNP panel. First, selective sweep regions, both those specific to each subpopulation and those common to the six subpopulations, were identified by combining the ratio of Fst versus θπ, based on the comparison of the cultivated subpopulation against the wild rice population. Second, using 100 Kb and 10 Kb windows, large and small genomic regions showing selective sweep signals were identified, respectively. In total, 227 (cultivated vs. wild), 381 (Ind vs. wild), 333 (Aus vs. wild), 296 (Aro vs. wild), 256 (TrJ vs. wild) and 269 (TeJ vs. wild) identified regions showed significantly smaller Tajima's D values than other regions. Third, genes located in the selective sweep regions were identified, and a total of 1,180 SNPs occurring within these genes were selected to generate the fixedSNP panel.
The MinimalMarker algorithm was applied to the fixedSNP panel to exhaustively traverse all possible genotype combinations that distinguish the 2,556 accessions. The algorithm generated three sets of minimum marker combinations, each containing 28 SNPs. After merging the three sets, 38 barcodeSNPs were selected to form the panel.
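The idea of a minimum marker combination can be shown with a small brute-force sketch in Python. It is not the MinimalMarker software itself: the function names and the toy genotypes are assumptions, and plain enumeration like this is only feasible for very small inputs. The sketch finds every smallest set of SNPs whose combined genotypes are unique for each accession, then merges those sets, mirroring the merging step described above.

```python
from itertools import combinations


def distinguishes_all(genotypes, snp_subset):
    """True if every accession has a unique profile over snp_subset."""
    profiles = {tuple(row[j] for j in snp_subset) for row in genotypes}
    return len(profiles) == len(genotypes)


def minimal_marker_sets(genotypes):
    """Return all smallest SNP index combinations that tell every
    accession apart (empty list if no combination can)."""
    n_snps = len(genotypes[0])
    for size in range(1, n_snps + 1):
        hits = [combo for combo in combinations(range(n_snps), size)
                if distinguishes_all(genotypes, combo)]
        if hits:
            return hits  # stop at the first size that works
    return []


# Hypothetical toy data: four accessions (rows) by four SNPs (columns).
toy_genotypes = [
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
]
marker_sets = minimal_marker_sets(toy_genotypes)
# Merge the equally small sets into one barcode panel.
barcode = sorted(set().union(*marker_sets))
```

Here no single SNP separates the four accessions, two different pairs of SNPs do, and their union gives a three-SNP panel.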
Yes. Annotation files and genotype files of all these panels are downloadable at http://sr4r.ic4r.org/download.
Yes. SR4R provides online analysis tools and allows users to perform online analysis simply by uploading user-defined data. The online analysis tools for subpopulation classification and DNA fingerprint analysis are publicly available at http://sr4r.ic4r.org/onlineTools/ml and http://sr4r.ic4r.org/onlineTools/match, respectively.
To identify commercialized rice varieties using the combination of 38 barcodeSNPs, seven machine learning models were used: decision tree, k-nearest neighbors, naïve Bayes, artificial neural network, random forest, multinomial logistic regression and one-vs-rest logistic regression, all as implemented in the Python scikit-learn library (https://scikit-learn.org/stable/). The performance of each model was assessed by ten-fold cross-validation. In detail, the original sample set was randomly partitioned into ten subsets, of which nine were used to train the model and the remaining one was used to test it. This procedure was repeated ten times, and the average prediction accuracy was taken as the overall performance of the tested model.
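The ten-fold procedure can be sketched without any dependencies as follows. This is an illustration only: a 1-nearest-neighbor classifier based on the number of mismatching SNP calls stands in for the scikit-learn models named above, and the function names, toy genotypes and variety labels are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def predict_1nn(train_x, train_y, query):
    """Label of the training sample with the fewest mismatching SNP calls."""
    distances = [sum(a != b for a, b in zip(row, query)) for row in train_x]
    return train_y[distances.index(min(distances))]


def cross_validate(x, y, k=10, seed=0):
    """Average accuracy over k folds; each fold is the test set once."""
    folds = k_fold_indices(len(x), k, seed)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(predict_1nn(train_x, train_y, x[i]) == y[i]
                      for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Hypothetical toy data: 20 samples, two varieties with distinct barcodes.
toy_x = [[0, 0, 0, 0]] * 10 + [[1, 1, 1, 1]] * 10
toy_y = ["varietyA"] * 10 + ["varietyB"] * 10
mean_accuracy = cross_validate(toy_x, toy_y)
```

Because each sample appears in exactly one fold, every sample is used for testing once and for training in the other nine rounds.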
This work is a collaboration between China Agricultural University and Beijing Institute of Genomics, CAS. If you have any questions, suggestions or comments, or would like to report a bug, please contact us by email at xwang@cau.edu.cn, songshh@big.ac.cn or zhangzhang@big.ac.cn.
You are welcome to visit us to explore possible collaboration or to learn more about our work. Our physical address is 1 Beichen West Road, Chaoyang District, Beijing 100101, China.