1. Datasets

In total, 29,069 chloroplast genome assemblies covering 16,436 species (626 families, 4,019 genera) are deposited in CGIR. Of these, 9,785 assemblies covering 6,628 species were downloaded from NCBI, 16 assemblies covering 16 species were downloaded from the NGDC Genome Warehouse, and 1,170 assemblies covering 718 species were sequenced by the National Resource Center for Chinese Materia Medica, China Academy of Chinese Medical Sciences.

CGIR integrates 3,287,171 gene records, 10,914,100 simple sequence repeats (SSRs), 606,695 DNA barcodes, and 27,400,816 DNA signature sequences (DSSs) from chloroplast genomes. The taxonomic classification of each chloroplast genome (family, genus, and species) and the gene locus names were standardized. SSRs and DNA barcodes were systematically analyzed. DSSs for species identification were investigated in 1,849 seed plants with more than two chloroplast genomes in CGIR.

2. Curation Model and Software

2.1 Taxonomy

Taxonomic information was standardized based on Species 2000. Briefly, a taxonomic name was curated if its taxonomic status in Species 2000 was "synonym". Species names not recorded in Species 2000 were curated based on the NCBI Taxonomy Database and the other references detailed below.

| Plant group | Reference database |
| --- | --- |
| Angiosperm | Angiosperm Phylogeny Group (APG IV) |
| Gymnosperm | The Plant List (http://www.theplantlist.org) |
| Bryophytes | The Plant List (http://www.theplantlist.org) |
| Pteridophyte | Pteridophyte Phylogeny Group (PPG I) |
| Phycophyta | AlgaeBase (https://www.algaebase.org) |
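
The lookup order described in this subsection can be sketched in Python as follows. This is only an illustration under stated assumptions, not CGIR's curation code: the two dictionaries are hypothetical stand-ins for Species 2000 and for the fallback references, and the species names in them are made-up placeholders.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for Species 2000: name -> (status, accepted name).
# The names are invented placeholders, not real taxa.
SPECIES_2000: Dict[str, Tuple[str, str]] = {
    "Exemplum novum": ("accepted", "Exemplum novum"),
    "Exemplum vetus": ("synonym", "Exemplum novum"),
}

# Hypothetical stand-in for the fallback references (NCBI Taxonomy, APG IV, etc.).
FALLBACK: Dict[str, str] = {
    "Ignotum rarum": "Ignotum rarum",
}


def standardize_name(name: str) -> Optional[str]:
    """Return the curated name, or None if no reference knows the name."""
    record = SPECIES_2000.get(name)
    if record is not None:
        status, accepted = record
        # Only names flagged as synonyms are replaced by the accepted name.
        return accepted if status == "synonym" else name
    # Names absent from Species 2000 fall back to the other references.
    return FALLBACK.get(name)
```

A name that appears in neither table returns `None`, which in a real workflow would presumably be flagged for manual review.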

2.2 Featured plants

Featured plants in CGIR were curated based on the World Checklist of Useful Plant Species (2020) and divided into six categories: environmental, food, forage, material, medicine, and poison. The categories are described below.

| Category | Description |
| --- | --- |
| environmental | Examples include intercrops and nurse crops, ornamentals, barrier hedges, shade plants, windbreaks, soil improvers, plants for revegetation and erosion control, wastewater purifiers, and indicators of the presence of metals, pollution, or underground water. |
| food | Food, including beverages, for humans only. |
| forage | Forage and fodder for vertebrate animals only. |
| material | Woods, fibres, cork, cane, tannins, latex, resins, gums, waxes, oils, lipids, etc. and their derived products, including charcoal, petroleum substitutes, and fuel alcohols. |
| medicine | Both human and veterinary. |
| poison | Plants which are poisonous to vertebrates and invertebrates, both accidentally and usefully, e.g., for hunting and fishing. |

For medicinal plants recorded in the Chinese Pharmacopoeia and the National Compilation of Chinese Herbal Medicines, the medicinal organs were also curated. The curation model for medicinal organs is listed below.

| Medicinal organ | Plant tissue |
| --- | --- |
| Radix | Root |
| Rhizoma | Subterranean stem (including rhizome, tuber, bulb, corm, etc.) |
| Caulis | Rattan stem |
| Cortex | Phloem |
| Lignum | Xylem |
| Folium | Leaf |
| Flos | Flower |
| Fructus | Fruit |
| Semen | Seed |
| Herba | Whole herb |
| Resina | Resin |
| Others | Not in the above categories (e.g., algae) |

2.3 Simple sequence repeats (SSR)

SSRs can be divided into three types: perfect, imperfect, and compound. Perfect and compound SSRs were identified using the MIcroSAtellite identification tool (MISA); imperfect SSRs were identified using IMEx. All primers were designed with Primer3.
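
To make the notion of a perfect SSR concrete, the following Python sketch finds uninterrupted tandem repeats with a regular expression. It is not MISA and does not reproduce CGIR's settings: the minimum repeat counts in `MIN_REPEATS` are assumed values chosen for illustration, the function name is hypothetical, and compound and imperfect SSRs are not handled.

```python
import re
from typing import Dict, List, Tuple

# Assumed minimum repeat counts per motif length (illustrative only,
# not the thresholds used by CGIR).
MIN_REPEATS: Dict[int, int] = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_perfect_ssrs(seq: str) -> List[Tuple[int, str, int]]:
    """Return (0-based start, motif, repeat count) for each perfect SSR."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(MIN_REPEATS.items()):
        # A motif of unit_len bases followed by at least min_rep - 1 exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "AA" or "ATAT"), which belong to the shorter motif class.
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // unit_len))
    return sorted(hits)


# Toy sequence: a 12-base A run and seven AT repeats.
print(find_perfect_ssrs("GGC" + "A" * 12 + "CGC" + "AT" * 7 + "GCC"))
# → [(3, 'A', 12), (18, 'AT', 7)]
```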

2.4 DNA barcodes

DNA barcodes were identified using an in silico approach. For each DNA barcoding region, the selected forward and reverse primers were aligned to the chloroplast genomes using BLAST. Based on the alignment positions, the nucleotides between the aligned primers were taken as the DNA barcode.
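
The extraction step can be illustrated with a small Python sketch. It deliberately swaps BLAST for exact matching with IUPAC degenerate bases expanded into a regular expression, searches only the forward strand, and tolerates no mismatches, so it is a simplification of the approach described above rather than the CGIR pipeline. All function names are hypothetical; the primer pair in the usage example is the rbcL pair listed in this section.

```python
import re
from typing import Optional

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(primer: str) -> str:
    """Reverse-complement a primer, keeping degenerate codes consistent."""
    return primer.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer: str) -> str:
    """Turn a (possibly degenerate) primer into a regex pattern."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_barcode(genome: str, forward: str, reverse: str) -> Optional[str]:
    """Return the bases between the two primer sites, or None if either is missing."""
    genome = genome.upper()
    fwd = re.search(primer_regex(forward), genome)
    if fwd is None:
        return None
    downstream = genome[fwd.end():]
    # The reverse primer anneals to the opposite strand, so its reverse
    # complement is what appears on the forward strand.
    rev = re.search(primer_regex(reverse_complement(reverse)), downstream)
    if rev is None:
        return None
    return downstream[:rev.start()]


# Toy "genome" built around the rbcL primer pair.
fwd_primer = "ATGTCACCACAAACAGAAAC"
rev_primer = "TCGCATGTACCTGCAGTAGC"
toy = "TTT" + fwd_primer + "GATTACA" + reverse_complement(rev_primer) + "AAA"
print(extract_barcode(toy, fwd_primer, rev_primer))  # → GATTACA
```

A BLAST-based search would additionally handle mismatches and hits on either strand, which this sketch does not attempt.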

The primers used for DNA barcode identification are listed below.

| Barcode region | Forward primer name | Forward primer sequence | Reverse primer name | Reverse primer sequence | Reference |
| --- | --- | --- | --- | --- | --- |
| atpI-atpH | atpI | TATTTACAAGYGGTATTCAAGCT | atpH | CCAAYCCAGCAGCAATAAC | Shaw J, Lickey EB, Schilling EE, Small RL. Comparison of whole chloroplast genome sequences to choose noncoding regions for phylogenetic studies in angiosperms: the tortoise and the hare III. Am J Bot. 2007;94(3):275-288. |
| ndhF-rpl32 | ndhF | GAAAGGTATKATCCAYGMATATT | rpl32 | CCAATATCCCTTYYTTTTCCAA | |
| ndhJ-trnL | ndhJ | ATGCCYGAAAGTTGGATAGG | trnL' | GGTTCAAGTCCCTCTATCCC | |
| petL-psbE | petL | AGTAGAAAACCGAAATAACTAGTTA | psbE | TATCGAATACTGGTAATAATATCAGC | |
| psaI-accD | psaI | AATYGTACCACGTAATCYTTTAAA | accD | AGAAGCCATTGCAATTGCCGGAAA | |
| psbB-psbH | psbB | TCCAAAAANKKGGAGATCCAAC | psbF | TCAAYRGTYTGTGTAGCCAT | |
| psbD-trnT | psbD | CTCCGTARCCAGTCATCCATA | trnT | CCCTTTTAACTCAGTGGTAG | |
| psbJ-petA | psbJ | ATAGGTACTGTARCYGGTATT | petA | AACARTTYGARAAGGTTCAATT | |
| rpl14-rpl36 | rpl14 | AAGGAAATCCAAAAGGAACTCG | rpl36 | GGRTTGGAACAAATTACTATAATTCG | |
| rpl32-trnL | rpl32 | CAGTTCCAAAAAAACGTACTTC | trnL | CTGCTTCCTAAGAGCAGCGT | |
| rps12-rpl20 | rps12 | ATTAGAAANRCAAGACAGCCAAT | rpl20 | CGYYAYCGAGCTATATATCC | |
| rps16 | rps16F | AAACGATGTGGTARAAAGCAAC | rps16R | AACATCWATTGCAASGATTCGATA | |
| rps16-trnK | rpS16x2F2 | AAAGTGGGTTTTTATGATCC | trnK | TTAAAAGCCGAGTACTCTACC | |
| trnC-trnD | trnC | CCAGTTCRAATCYGGGTG | trnD | GGGATTGTAGYTCAATTGGT | |
| trnD-trnT | trnD | ACCAATTGAACTACAATCCC | trnT | CTACCACTGAGTTAAAAGGG | |
| trnL-trnF | trnL-c | CGAAATCGGTAGACGCTACG | trnL-f | ATTTGAACTGGTGACACGAG | |
| trnV-ndhC | trnV | GTCTACGGTTCGARTCCGTA | ndhC | TATTATTAGAAATGYCCARAAAATATCATATTC | |
| atpF-atpH | atpF | ACTCGCACACACTCCCTTTCC | atpH | GCTTTTATGGAAGCTTTAACAAT | CBOL Plant Working Group. A DNA barcode for land plants. Proc Natl Acad Sci U S A. 2009;106(31):12794-12797. |
| matK | matK 3F | CGTACAGTACTTTTGTGTTTACGAG | matK 1R | ACCCAGTCCATCTGGAAATCTTGGTTC | |
| rbcL | rbcL 1F | ATGTCACCACAAACAGAAAC | rbcL 724R | TCGCATGTACCTGCAGTAGC | |
| trnH-psbA | trnH2 | CGCGCATGGTGGATTCACAATCC | psbAF | GTTATGCATGAACGTAATGCTC | |
| psbK-psbI | psbK | TTAGCCTTTGTTTGGCAAG | psbI | AGAGTTTGAGAGTAAGCAT | Hollingsworth ML, Andra Clark A, Forrest LL, et al. Selecting barcoding loci for plants: evaluation of seven candidate loci with species-level sampling in three divergent groups of land plants. Mol Ecol Resour. 2009;9(2):439-457. |
| rpoB | rpoB1 | AAGTGCATTGTTGGAACTGG | rpoB3 | CCGTATGTGAAAAGAAGTATA | |
| rpoC1 | rpoC1-1 | GTGGATACACTTCTTGATAATGG | rpoC1-3 | TGAGAAAACATAAGTAAACGGGC | |
| accD | accD-1F | AGTATGGGATCCGTAGTAGG | accD-4R | TCTTTTACCCGCAAATGCAAT | Kress WJ, Erickson DL. A two-locus global DNA barcode for land plants: the coding rbcL gene complements the non-coding trnH-psbA spacer region. PLoS One. 2007;2(6):e508. |
| ndhJ | ndhJ-2F | TTGGGCTTCGATTACCAAGG | ndhJ-3R | ATAATCCTTACGTAAGGGCC | |
| atpB | ESATPB172F | AATGTTACTTGTGAAGTWCAACAAT | ESATPE45R | ATTCCAAACWATTCGATTWGGAG | Schuettpelz E, Pryer KM. Fern phylogeny inferred from 400 leptosporangiate species and three plastid genes. Taxon. 2007;56(4):1037-1050. |
| ycf5 | ycf5-1 | GGATTATTAGTCACTCGTTGG | ycf5-4 | CCCAATACCATCATACTTAC | Zhang X, Zhou T, Yang J, et al. Comparative Analyses of Chloroplast Genomes of Cucurbitaceae Species: Lights into Selective Pressures and Phylogenetic Relationships. Molecules. 2018;23(9):2165. |

2.5 DNA signature sequence (DSS)

A DSS is a fixed-length nucleotide sequence that can detect the presence of an organism (the target species) and distinguish it from other species (the background species). We used BLAST with in-house Python scripts to identify DSSs. Briefly, for a target species, the first step was to generate k-mers (e.g., 20-mers) from one of its chloroplast assemblies using a sliding window: a fixed-length (e.g., 20 bp) window moves from the first base of the selected assembly in 1 bp steps to generate all possible k-mers, which were then de-duplicated. Second, the non-redundant k-mers were searched with BLAST against the other assemblies of the target species to identify k-mers conserved within that species, and the conserved k-mers were then searched against the chloroplast assemblies of the background species. Finally, k-mers present in any background species assembly were removed, and the remainder were considered DSSs.

The DSSs deposited in CGIR were calculated using a k-mer length of 40 bp, with the other species in the same family serving as background species. For example, to calculate DSSs for Oryza ridleyi, the other species of Poaceae were used as background species.
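
The three steps (k-mer generation, conservation within the target species, removal of background k-mers) can be sketched with Python sets. This is a simplified illustration, not the in-house scripts: exact string matching on one strand stands in for the BLAST searches, and the function names are hypothetical. The default `k=40` follows the k-mer length stated above.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct k-mers from a 1 bp sliding window (empty if seq is shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def find_dss(
    target_assemblies: Iterable[str],
    background_assemblies: Iterable[str],
    k: int = 40,
) -> Set[str]:
    """Return k-mers shared by every target assembly and absent from all backgrounds."""
    targets = list(target_assemblies)
    if not targets:
        return set()
    # Step 1: de-duplicated k-mers from one assembly of the target species.
    candidates = kmers(targets[0], k)
    # Step 2: keep only k-mers conserved in the other target assemblies.
    for assembly in targets[1:]:
        candidates &= kmers(assembly, k)
    # Step 3: drop k-mers that occur in any background species assembly.
    for assembly in background_assemblies:
        candidates -= kmers(assembly, k)
    return candidates


# Toy example with k=4 so the result is easy to check by hand.
print(sorted(find_dss(["ACGTTGCA", "GGACGTTG"], ["TTACGTAA"], k=4)))
# → ['CGTT', 'GTTG']
```

For whole families of chloroplast genomes the background k-mer sets become large, which is one reason an indexed search tool such as BLAST is a more practical choice than in-memory sets.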

3. Database Usage

A. How can I get the chloroplast genome information?

If you are looking for chloroplast genome information for a taxon of interest, use the Search/Advanced Search tool on the Genome page. The search results include species name, synonym species names, genome size, GC content, accession number, etc. Click an assembly accession in the results to see detailed information for that assembly, such as gene characteristics.

B. How can I get DSS information?

If you are interested in developing a method for species identification, visit the DSSs page and search for the taxa you are interested in. All DSSs can be downloaded for further exploration. Because DSSs were calculated at the species level, there are no DSSs for taxonomic ranks below species, such as subspecies or varieties.

C. How can I identify plants using my barcode sequence in CGIR?

If you have a batch of barcode sequences, visit the Tools page and submit them to the BarcodeBLAST tool. The results of aligning your sequences against the DNA barcodes deposited in CGIR will be sent to your e-mail account.

D. Can I identify barcodes for my own chloroplasts in CGIR?

Yes. Visit the Tools page and submit your chloroplast genome sequences to the BarcodeFinder tool. The identified barcodes will be sent to your e-mail account.

4. Support

A. Funding Support

- Special Funds for Basic Resources Investigation Research of the Ministry of Science and Technology (2018FY10080002)
- National Natural Science Foundation of China (81891013)

B. Comments & Collaborations

We look forward to worldwide comments, suggestions and guidance from colleagues and peers with common research interests.

5. Contact us

We would love to hear from you with any questions or comments. Our contact information is listed below.

National Resource Center for Chinese Materia Medica

       China Academy of Chinese Medical Sciences (CACMS)
       16 Dongzhimen South Road, Dongcheng District
       Beijing 100700, China

       Telephone: +86 (10) 8402-7175
       Fax: +86 (10) 8402-7175
       E-mail: y_yuan0732@163.com

Beijing Institute of Genomics, Chinese Academy of Sciences (BIG) / National Genomics Data Center (NGDC)

       Beijing Institute of Genomics, Chinese Academy of Sciences
       1 Beichen West Road, Chaoyang District
       Beijing 100101, China

       Telephone: +86 (10) 8409-7340
       Fax: +86 (10) 8409-7200
       E-mail: songshh@big.ac.cn