Help - CGIR - NGDC/CNCB

1. Datasets

In total, 42,784 chloroplast genome assemblies covering 19,950 species (692 families, 4,602 genera) are deposited in CGIR. 9,785 assemblies covering 6,628 species were downloaded from NCBI, 16 assemblies covering 16 species were downloaded from the NGDC Genome Warehouse, and 1,170 assemblies covering 718 species were sequenced by the National Resource Center for Chinese Materia Medica, China Academy of Chinese Medical Sciences.

CGIR integrates 4,909,479 gene records, 16,276,837 simple sequence repeats (SSRs), 1,048,784 DNA barcodes, and 37,866,351 DNA signature sequences (DSSs) from chloroplast genomes. The taxonomic classification of each chloroplast genome (family, genus, and species) and the gene locus names were standardized. SSRs and DNA barcodes were systematically analyzed. DSSs for species identification were investigated in 1,849 seed plants, each with more than two chloroplast genomes in CGIR.

2. Curation Model and Software

2.1 Taxonomy

Taxonomic information was standardized based on Species 2000. Briefly, a taxonomic name was curated if its taxonomic status in Species 2000 was “synonym”. Species names not recorded in Species 2000 were curated based on the NCBI Taxonomy Database and the other references listed below.

Plant group	Reference database
Angiosperm	Angiosperm Phylogeny Group (APG IV)
Gymnosperm	The Plant List (http://www.theplantlist.org)
Bryophyte	The Plant List (http://www.theplantlist.org)
Pteridophyte	Pteridophyte Phylogeny Group (PPG I)
Phycophyta	AlgaeBase (https://www.algaebase.org)

2.2 Featured plants

Featured plants in CGIR were curated based on the World Checklist of Useful Plant Species (2020) and divided into six categories: environmental, food, forage, material, medicine, and poison. The categories are detailed below.

Category	Description
environmental	Examples include intercrops and nurse crops, ornamentals, barrier hedges, shade plants, windbreaks, soil improvers, plants for revegetation and erosion control, wastewater purifiers, indicators of the presence of metals, pollution, or underground water.
food	Food, including beverages, for humans only.
forage	Forage and fodder for vertebrate animals only.
material	Woods, fibres, cork, cane, tannins, latex, resins, gums, waxes, oils, lipids, etc. and their derived products, including charcoal, petroleum substitutes, and fuel alcohols.
medicine	Both human and veterinary.
poison	Plants which are poisonous to vertebrates and invertebrates, both accidentally and usefully, e.g., for hunting and fishing.

For medicinal plants recorded in the Chinese Pharmacopoeia and the National Compilation of Chinese Herbal Medicines, the medicinal organs were also curated. The curation model for medicinal organs is listed below.

Medicinal organ	Plant tissue
Radix	Root
Rhizoma	Subterraneous stem (including rhizome, tuber, bulb, corm, etc.)
Caulis	Rattan stem
Lignum	Xylem (wood)
Folium	Leaf
Flos	Flower
Fructus	Fruit
Semen	Seed
Herba	Whole herb
Resina	Resin
Others	Not in the above categories (e.g., algae)

2.3 Simple sequence repeats (SSR)

SSRs are divided into three types: perfect, imperfect, and compound. Perfect and compound SSRs were identified using the MIcroSAtellite identification tool (MISA); imperfect SSRs were identified using IMEx. All primers were designed with Primer3.
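To illustrate what a perfect SSR is, the following minimal Python sketch scans a sequence for uninterrupted tandem repeats with a regular expression. It is not MISA and does not handle imperfect or compound SSRs. The function name `find_perfect_ssrs` and the minimum repeat counts in `MIN_REPEATS` are assumptions chosen for illustration; the thresholds CGIR actually used are not stated here.

```python
import re

# Hypothetical minimum repeat counts per motif length (1-6 bp).
# These are illustrative values, not the settings used by CGIR.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_perfect_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, end, motif, repeat_count) tuples with 1-based coordinates."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # A motif of unit_len bases followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "AA" or "ATAT"); those are reported at the shorter length.
            if any(motif == motif[:d] * (unit_len // d)
                   for d in range(1, unit_len) if unit_len % d == 0):
                continue
            repeats = len(m.group(0)) // unit_len
            hits.append((m.start() + 1, m.end(), motif, repeats))
    return sorted(hits)


if __name__ == "__main__":
    toy = "GC" + "A" * 12 + "GCGC" + "AT" * 7 + "CCG"
    for hit in find_perfect_ssrs(toy):
        print(hit)
```

The reported motif depends on where the leftmost match starts, so the same repeat may appear as a rotated motif (for example TA instead of AT) depending on the flanking bases.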

2.4 DNA barcodes

DNA barcodes were identified using an in-silico approach. For each DNA barcoding region, the selected forward and reverse primers were aligned to the chloroplast genomes using BLAST. The nucleotides between the aligned primer positions were taken as the DNA barcode.
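The idea of taking the nucleotides between two primer sites can be sketched in Python as follows. This is a simplified illustration, not the CGIR pipeline: it swaps BLAST for exact regular-expression matching with IUPAC degenerate bases, searches only one strand, and tolerates no mismatches. The function names (`extract_barcode`, `primer_regex`, `reverse_complement`) are hypothetical.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement that also handles IUPAC degenerate codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a (possibly degenerate) primer into a regex pattern."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_barcode(genome, forward, reverse):
    """Return the sequence between the two primer sites, or None if either is missing."""
    genome = genome.upper()
    fwd = re.search(primer_regex(forward), genome)
    if fwd is None:
        return None
    # The reverse primer anneals to the opposite strand, so its reverse
    # complement is searched downstream of the forward primer site.
    downstream = genome[fwd.end():]
    rev = re.search(primer_regex(reverse_complement(reverse)), downstream)
    if rev is None:
        return None
    return downstream[:rev.start()]


if __name__ == "__main__":
    # Toy sequence built around the trnC-trnD primer pair from the primer table.
    fwd, rev = "CCAGTTCRAATCYGGGTG", "GGGATTGTAGYTCAATTGGT"
    toy = "AA" + "CCAGTTCAAATCTGGGTG" + "TTGACC" + "ACCAATTGAACTACAATCCC" + "AA"
    print(extract_barcode(toy, fwd, rev))
```

An alignment-based search such as BLAST additionally tolerates mismatches and reports hits on either strand, which this sketch does not attempt.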

The primers used for DNA barcode identification are listed below. Rows without a reference share the reference of the nearest preceding row that lists one.

Barcode region	Forward primer name	Forward primer sequence	Reverse primer name	Reverse primer sequence	Reference
atpI-atpH	atpI	TATTTACAAGYGGTATTCAAGCT	atpH	CCAAYCCAGCAGCAATAAC	Shaw J, Lickey EB, Schilling EE, Small RL. Comparison of whole chloroplast genome sequences to choose noncoding regions for phylogenetic studies in angiosperms: the tortoise and the hare III. Am J Bot. 2007;94(3):275-288.
ndhF-rpl32	ndhF	GAAAGGTATKATCCAYGMATATT	rpl32	CCAATATCCCTTYYTTTTCCAA
ndhJ-trnL	ndhJ	ATGCCYGAAAGTTGGATAGG	trnL'	GGTTCAAGTCCCTCTATCCC
petL-psbE	petL	AGTAGAAAACCGAAATAACTAGTTA	psbE	TATCGAATACTGGTAATAATATCAGC
psaI-accD	psaI	AATYGTACCACGTAATCYTTTAAA	accD	AGAAGCCATTGCAATTGCCGGAAA
psbB-psbH	psbB	TCCAAAAANKKGGAGATCCAAC	psbH	TCAAYRGTYTGTGTAGCCAT
psbD-trnT	psbD	CTCCGTARCCAGTCATCCATA	trnT	CCCTTTTAACTCAGTGGTAG
psbJ-petA	psbJ	ATAGGTACTGTARCYGGTATT	petA	AACARTTYGARAAGGTTCAATT
rpl14-rpl36	rpl14	AAGGAAATCCAAAAGGAACTCG	rpl36	GGRTTGGAACAAATTACTATAATTCG
rpl32-trnL	rpl32	CAGTTCCAAAAAAACGTACTTC	trnL	CTGCTTCCTAAGAGCAGCGT
rps12-rpl20	rps12	ATTAGAAANRCAAGACAGCCAAT	rpl20	CGYYAYCGAGCTATATATCC
rps16	rps16F	AAACGATGTGGTARAAAGCAAC	rps16R	AACATCWATTGCAASGATTCGATA
rps16-trnK	rpS16x2F2	AAAGTGGGTTTTTATGATCC	trnK	TTAAAAGCCGAGTACTCTACC
trnC-trnD	trnC	CCAGTTCRAATCYGGGTG	trnD	GGGATTGTAGYTCAATTGGT
trnD-trnT	trnD	ACCAATTGAACTACAATCCC	trnT	CTACCACTGAGTTAAAAGGG
trnL-trnF	trnL-c	CGAAATCGGTAGACGCTACG	trnL-f	ATTTGAACTGGTGACACGAG
trnV-ndhC	trnV	GTCTACGGTTCGARTCCGTA	ndhC	TATTATTAGAAATGYCCARAAAATATCATATTC
atpF-atpH	atpF	ACTCGCACACACTCCCTTTCC	atpH	GCTTTTATGGAAGCTTTAACAAT	CBOL Plant Working Group. A DNA barcode for land plants. Proc Natl Acad Sci U S A. 2009;106(31):12794-12797.
matK	matK 3F	CGTACAGTACTTTTGTGTTTACGAG	matK 1R	ACCCAGTCCATCTGGAAATCTTGGTTC
rbcL	rbcL 1F	ATGTCACCACAAACAGAAAC	rbcL 724R	TCGCATGTACCTGCAGTAGC
trnH-psbA	trnH2	CGCGCATGGTGGATTCACAATCC	psbAF	GTTATGCATGAACGTAATGCTC
psbK-psbI	psbK	TTAGCCTTTGTTTGGCAAG	psbI	AGAGTTTGAGAGTAAGCAT	Hollingsworth ML, Andra Clark A, Forrest LL, et al. Selecting barcoding loci for plants: evaluation of seven candidate loci with species-level sampling in three divergent groups of land plants. Mol Ecol Resour. 2009;9(2):439-457.
rpoB	rpoB1	AAGTGCATTGTTGGAACTGG	rpoB3	CCGTATGTGAAAAGAAGTATA
rpoC1	rpoC1-1	GTGGATACACTTCTTGATAATGG	rpoC1-3	TGAGAAAACATAAGTAAACGGGC
accD	accD-1F	AGTATGGGATCCGTAGTAGG	accD-4R	TCTTTTACCCGCAAATGCAAT	Kress WJ, Erickson DL. A two-locus global DNA barcode for land plants: the coding rbcL gene complements the non-coding trnH-psbA spacer region. PLoS One. 2007;2(6):e508.
ndhJ	ndhJ-2F	TTGGGCTTCGATTACCAAGG	ndhJ-3R	ATAATCCTTACGTAAGGGCC
atpB	ESATPB172F	AATGTTACTTGTGAAGTWCAACAAT	ESATPE45R	ATTCCAAACWATTCGATTWGGAG	Schuettpelz E, Pryer KM. Fern phylogeny inferred from 400 leptosporangiate species and three plastid genes. Taxon. 2007;56(4):1037-1050.
ycf5	ycf5-1	GGATTATTAGTCACTCGTTGG	ycf5-4	CCCAATACCATCATACTTAC	Zhang X, Zhou T, Yang J, et al. Comparative Analyses of Chloroplast Genomes of Cucurbitaceae Species: Lights into Selective Pressures and Phylogenetic Relationships. Molecules. 2018;23(9):2165.

2.5 DNA signature sequence (DSS)

A DSS is a fixed-length nucleotide sequence that can detect the presence of an organism (the target species) and distinguish it from other species (the background species). We used BLAST with in-house Python scripts to identify DSSs. Briefly, for a target species, the first step was to generate k-mers (e.g., 20-mers) from one of its chloroplast assemblies using a sliding window: a fixed-length (e.g., 20 bp) window moves from the first base of the selected assembly in 1 bp steps to generate all possible k-mers, which were then de-duplicated. Second, the non-redundant k-mers were aligned with BLAST against the other assemblies of the target species to identify k-mers conserved within the target species, and the conserved k-mers were then aligned against the chloroplast assemblies of the background species. Finally, k-mers present in background species assemblies were removed, and the remaining k-mers were considered DSSs.
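The three steps above can be sketched in Python with plain set operations. This is a minimal illustration under stated assumptions, not the in-house scripts: exact k-mer matching on both strands stands in for BLAST, and the names `kmers`, `kmers_both_strands`, and `find_dss` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """Slide a k-long window in 1 bp steps; the set removes duplicates."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmers_both_strands(seq, k):
    """k-mers of a sequence and of its reverse complement."""
    return kmers(seq, k) | kmers(reverse_complement(seq), k)


def find_dss(target_assemblies, background_assemblies, k=20):
    """Return k-mers conserved in all target assemblies and absent from the background."""
    # Step 1: non-redundant k-mers from one assembly of the target species.
    candidates = kmers(target_assemblies[0], k)
    # Step 2: keep k-mers found in every other assembly of the target species.
    for assembly in target_assemblies[1:]:
        candidates &= kmers_both_strands(assembly, k)
    # Step 3: drop k-mers that occur in any background species assembly.
    for assembly in background_assemblies:
        candidates -= kmers_both_strands(assembly, k)
    return candidates


if __name__ == "__main__":
    # Toy sequences with k=4 so the result is easy to inspect by hand.
    targets = ["ACGTTGCA", "TTACGTTG"]
    background = ["CGTTAAAA"]
    print(sorted(find_dss(targets, background, k=4)))
```

Holding every k-mer of every assembly in memory is fine for a toy example but would likely need indexing or an alignment tool at the scale of a whole plant family, which is one reason an approach based on BLAST is a natural fit.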

The DSSs deposited in CGIR were calculated with a k-mer length of 40 bp. When calculating DSSs for a target species, the other species in the same family are used as background species. For example, to calculate DSSs for Oryza ridleyi, the other species of Poaceae serve as background species.

3. Database Usage

A. How can I get the chloroplast genome information?

If you are looking for chloroplast genome information for a taxon of interest, use the Search/Advanced Search tool on the Genome page. The search results include species name, synonyms, genome size, GC content, accession number, etc. Click an assembly accession in the results to see detailed information for that assembly, such as gene characteristics.

B. How can I get DSSs information?

If you are interested in developing a method for species identification, visit the DSSs page and search for the taxa you are interested in. All DSSs can be downloaded for further exploration. Because DSSs were calculated at the species level, no DSSs are available for taxonomic ranks below species, such as subspecies or varieties.

C. How can I identify plants using my barcode sequence in CGIR?

If you have a batch of barcode sequences, visit the Tools page and submit them to the BarcodeBLAST tool. The results of aligning your sequences against the DNA barcodes deposited in CGIR will be sent to your e-mail address.

D. Can I identify barcodes for my own chloroplast genomes in CGIR?

Visit the Tools page and submit your chloroplast genome sequences to the BarcodeFinder tool. The identified barcodes will be sent to your e-mail address.

4. Support

A. Funding Support

- Special Funds for Basic Resources Investigation Research of the Ministry of Science and Technology (2018FY10080002)
- National Natural Science Foundation of China (81891013)

B. Comments & Collaborations

We welcome comments, suggestions, and guidance from colleagues and peers worldwide with common research interests.

If you have any questions or comments, we would love to hear from you. Our contact information is listed below.

5. Contact us

National Resource Center for Chinese Materia Medica

       China Academy of Chinese Medical Sciences (CACMS)
       16 Dongzhimen South Road, Dongcheng District
       Beijing 100700, China

       Telephone: +86 (10) 8402-7175
       Fax: +86 (10) 8402-7175
       E-mail: y_yuan0732@163.com

Beijing Institute of Genomics, Chinese Academy of Sciences (BIG) / National Genomics Data Center (NGDC)

       Beijing Institute of Genomics, Chinese Academy of Sciences
       1 Beichen West Road, Chaoyang District
       Beijing 100101, China

       Telephone: +86 (10) 8409-7340
       Fax: +86 (10) 8409-7200
       E-mail: songshh@big.ac.cn
