Help - CompoDynamics

III. Molecular compositions & features

I. Introduction

CompoDynamics ( https://ngdc.cncb.ac.cn/compodynamics) is a comprehensive database for characterizing molecular characteristics of coding sequences (CDS). Taking advantage of the exponential growth of RefSeq data, CompoDynamics presents rich molecular compositions and features for each CDS and genome, and provides a portal to compare these properties in different species, family domains and functional genes.

CompoDynamics currently contains over 118 million () high-quality CDSs of genes and proteins in genomes (complete/reference/representative) of species (see statistics page for more information). CompoDynamics provides extensive annotation of molecular compositions & features for both genomes and CDSs, including

Nucleotide Content (four individual nucleotides (A, T, G, C), pairwise combinations (GC, AG, GT, AT, AC, CT) and their positional contents in the 1st/2nd/3rd codon positions)
Codon (codon usage, RSCU, ENC, CDC)
Amino acid usage
Coding potential (coding potential score, coding label, Fickett score)
Physicochemical properties (molecular weight, surface area, residue volume, hydrophobicity, charge, structure)
Phase separation (phase separation potential score, disorder region)

Importantly, CompoDynamics provides multi-dimensional tools in exploration of molecular compositions & features, including comparison of multiple molecular compositions & features across species, protein family domains and genes of different function. In addition, CompoAnalyzer facilitates automatic analysis of the molecular compositions & features for user-input sequences.

2. What is the data source of CompoDynamics?

All coding sequences (*_cds_from_genomic.fna.gz) in archaea, bacteria, fungi, invertebrates, plants, vertebrates (mammals and other), protozoa and viruses from RefSeq were retrieved in May 27, 2020. CDSs from latest and complete/reference/representative genome assemblies were retained.

3. How did we perform quality control and analyze the data?

Sequences of length not a multiple of 3 are removed. CDSs are grouped into different categories, and most of them belong to canonical CDS. The remaining CDSs are labeled accordingly if they (i) lack start codon, (ii) lack stop codon or (iii) contain in-frame stop codon. The various genetic codes of species are considered during this process. Six types of features are calculated for CDS:

Classes	Calculations
Nucleotide	A, G, C, T, GC, AG, GT, AT, AC, CT and their positional contents in the 1st/2nd/3rd codon positions
Codon	Codon frequency (observed aaa, observed aag...), RSCU (rscu aag, rscu aaa...), ENC, CDC
Amino Acid	Amino acid number and frequency (observed ARG, observed CYS...)
Coding Potential	LGC coding label, CPC2 coding label, LGC coding potential score, CPC2 coding probability, Fickett score
Physicochemical Property	MW, SA, RV, kdHydrophobicity, wwHydrophobicity, hhHydrophobicity, Positive, Negative, Neutral, Alpha, Beta
Phase Separation	disorder percentage, disorder segment, disorder 30aa, disorder 50aa, LLR, COREscore, PAPAprop, PAPAfi, PScore

Among them, RSCU (Relative Synonymous Codon Usage) values, CDC (Codon Deviation Coefficient) values are computed by CAT (Zhang et al., 2012); ENC values are computed by CodonW (John Peden); Fickett scores, CPC2 labels are computed by CPC2 (Kang et al., 2017); LGC labels are computed by LGC (Wang et al., 2019); Phase separation features are calculated by ESpritz (I Walsh et al., 2012), PLAAC (AK Lancaster et al. 2014) and Pi-Pi (RMC Vernon et al., 2018).

Molecular compositions & features in genome scale are calculated by one of the three ways: 1) averaging across all CDSs of that genome; 2) weighted averaging on CDS lengths; 3) calculation based on other genome scale features. For instance, GC frequencies are averaged by weights of each CDS length to yield genome-level frequencies; the average value of hydrophobicity represents the value of genome and RSCU AAA values are calculated based on genome codon frequencies.

4. How is CompoDynamics organized?

The logical relationship of CompoDynamics is represented below:

II. How to use CompoDynamics?

1. Homepage:

In homepage, CompoDynamics offers a fast and case-insensitive search function. In the quick search box you search for various terms of interest, including:

Species category, e.g., fungi;
Taxon, e.g., Cosavirus;
Organism name, e.g. homo sapiens;
Assembly accession, e.g. GCF_900128725.1;
RefSeq item, e.g. NW_003943604.1, NZ_LT667500.1;
Gene name, e.g. 4a, ACE2;
Protein name, e.g. Eukaryotic translation initiation factor 2A;
Protein id, e.g. NP_114414.2;

Advanced search offers more specific filtering options:

By clicking “Search” it will direct you to browse page.

2. Browse page:

In this page the browse of the genome-level and CDS-level information are provided. It should be noted that the two tables are always associated. It means that the results in CDSs table belong to the genomes in Genomes table under any circumstances.

For more detailed search, click “Show Filters” button, and two kinds of filters (genomes filter and CDSs filter) are provided. The input filters will affect both results in Genomes table and CDSs table.

1. Genome-level browse

The Genomes table shows basic information about each genome, including species category, lineage, species, alias, assembly accession and CDS counts. You can get a quick search by clicking a taxonomy or more specific, one of the popular species we provided on this page. After selecting the genomes of interest and then clicking the "compare" button, it will compare all the molecular compositions & features of the genomes selected.

To view interested features of a genome, click “+”. For example, after clicking “+” for “Amino Acid”, it shows the statistics of amino acid usage of the genome.

Clicking a white-on-green term in Genomes table will direct you to the individual page of the genome. In this page, meta information and statistics of molecular compositions & features including nucleotide content, codon, amino acid usage, coding potential, physicochemical property and phase separation are presented in detail for the genome. Most of the graphs can be downloaded.

2. CDS-level browse

CDSs table shows basic information about each CDS, including Gene Name, Protein ID, Protein Name and CDS length.

To view interested features of a CDS, click “+”.

Clicking a white-on-green term will direct you to the individual page of the CDS. In this page, meta information and statistics of molecular compositions & features including nucleotide content, codon, amino acid usage, coding potential, physicochemical property and phase separation are presented in detail for the CDS. Most of the graphs can be downloaded.

3. Compositions & Features page

The Compositions & Features page together present six types of molecular compositions & features: nucleotide content, codon, amino acid usage, coding potential, physicochemical properties and phase separation features.

For example, if you are interested in the most codon-biased genomes in virus, you can go to Compositions-Codon page and sort by either CDC or ENC while applying virus as species category. Select the genomes of interest, and click the “compare” button, it will compare all the molecular compositions & features of the genomes for you.

4. Tools

1. SpeciesComparator

SpeciesComparator compares multiple molecular compositions & features between multiple species you select (up to 5). For example, if you select human and mouse, it will return you the available genomes.

Select from these genomes and click the “Compare” button, and it will compare multiple molecular compositions & features between the two genomes.

2. FamilyComparator

FamilyComparator compares multiple molecular compositions & features between CDSs of multiple Pfam families/domains. For example, if you select human as species and Trypsin and SH2 as family names, it will return you the available Pfam families/domains. The corresponding CDSs are listed in the CDS table.

Select from these Pfam families/domains and click the “Compare” button, it will compare multiple molecular compositions & features between CDSs of these families/domains.

3. GOComparator

GOComparator compares multiple molecular compositions & features between CDSs of multiple GO terms. For example, if you select mouse as species and cytoplasm and nucleus as function terms, it will return you the available GO terms. The corresponding CDSs are listed in the CDS table.

Click Compare button to compare.

4. CompoAnalyzer

CompoAnalyzer automatically analyzes the molecular compositions & features for user-input sequences. You can input your sequences in the box or upload your fasta file. Then select species or genetic code and click the “Run” button.

The molecular compositions & features of the sequences you input will be returned to you, including nucleotide content, codon, amino acid usage, coding potential, physicochemical property and phase separation features. All the data can be downloaded.

5. Statistics

Data statistics is provided to help users obtain an overview of compositions and features, Chargaff’s rules and the relationship between GC content and other compositions/features across various species categories.

III. Molecular compositions & features

1. Nucleotide Content

Four individual nucleotides (A, T, G, C)
Pairwise combinations (GC, AG, GT, AT, AC, CT)
Positional contents in the 1^st/2^nd/3^rd codon positions.

2. Codon

Codon usage (AAA, AAG, …): Percentage of 64 codons used by a CDS (CDS level) or a the coding region of a genome (Genome level).
RSCU (RSCU AAG, RSCU AAA, …): RSCU values are computed by CAT (Zhang et al., 2012). RSCU stands for relative synonymous codon usage, it is the ratio of the observed frequency of codons to the expected frequency given that all the synonymous codons for the same amino acids are used equally. For instance, Threonine is encoded by 4 synonymous codons: ACA, ACG, ACC, ACT. Thus RSCU ACG is calculated as (observed_ACG * 4 )/(observed_ACN), where N stands for all 4 nucleotides. For genome level, RSCUs are computed based on the genome codon usage.
ENC: ENC values are computed by CodonW (John Peden). ENC stands for effective number of codons (Wright F., 1990). The value is between 20 (when one codon is used effectively for each aa) and 61 (when codons are used randomly for each aa). For genome level, ENC is computed as the mean of all CDSs of the genome.
CDC:CDC values are computed by CAT (Zhang et al., 2012). CDC stands for codon deviation coefficient. CDC effectively accounts for background nucleotide composition in estimating codon usage bias and utilizes a bootstrap assessment of the statistical significance of codon usage bias. For genome level, CDC is computed as the mean of all CDSs of the genome.

3. Amino Acid

Amino Acid Usage (Ala, Cys, ...): Percentage of 20 codons used by a CDS (CDS level) or the coding region of a genome (genome level).

4. Coding Potential

CPC2 label, LGC label: Coding/noncoding classification provided by CPC2 (Kang et al., 2017) and LGC (Wang et al., 2019). For genome level, numbers (freqs) of CDSs predicted as CPC2 coding/CPC2 noncoding/LGC coding/LGC noncoding/coding in both/noncoding in both/TUCP (Transcripts of Uncertain Coding Potential, meaning predicted as coding by one method but noncoding by the other) are provided.
CPC2 coding probability, LGC coding potential score: The coding potential estimates provided by CPC2 and LGC. For genome level, the values are computed as the mean of all CDSs of the genome.
Fickett Score: Fickett Scores are computed by CPC2 (Kang et al. 2017). Fickett Score is used to examine the coding probability by accounting both combinational effect of nucleotide composition and codon usage bias (Fickett J.W., 1982). For genome level, Fickett Score is computed as the mean of all CDSs of the genome.

5. Physicochemical Property

Physicochemical properties, including molecular weight, residue volume, surface area, structure and charge, are scaled from 0 to 1. For each genome, the grand average of hydropathicity (GRAVY) is calculated as a sum of each amino acid frequency multiplied by its hydrophobicity value based on the Kyte-Doolittle method. Likewise, genome-wide averages of other physicochemical properties are also calculated in a similar way as GRAVY (Kyte J et al., 1982).

MW: Relative molecular weight of the protein product. Note that the weight is normalized by protein length. MW is computed as the mean of relative molecular weights of all the amino acids weighted by aa frequency. The table below shows the weights of each aa:
SA: Relative surface area of the protein product. The value is normalized by protein length. SA is computed as the mean of surface areas of all the amino acids weighted by aa frequency. The table below shows the relative surface areas of each aa:
For genome level, SA is computed as the mean of all CDSs of the genome.
RV: Relative residue volume of the protein product. The value is normalized by protein length. RV is computed as the mean of relative residue volumes of all the amino acids weighted by aa frequency. The table below shows the residue volume of each aa:
For genome level, RV is computed as the mean of all CDSs of the genome.
kdHydrophobicity, wwHydrophobicity, hhHydrophobicity: These are hydrophobicity values based on different hydrophobicity scales: kdHydrophobicity (Kyte J et al., 1982), wwHydrophobicity (Wimley W.C. et al., 1996), and hhHydrophobicity (Hessa T et al., 2005). Hydrophobicity is computed as the mean of hydrophobicities of all the amino acids weighted by aa frequency. The hydrophobicity values for amino acids in the three scales are listed in the table below.
For genome level, hydrophobicity is computed as the mean of all CDSs of the genome.
Positive, Negative, Neutral: These are the fractions of positively/negatively/neutrally charged amino acids. Among the 20 amino acids, five have a side chain which can be charged. Lys, His and Arg are positively charged; Glu and Asp are negatively charged. The other 15 amino acids are neutrally charged.
For genome level, these three values are computed as the mean of all CDSs of the genome.

Alpha, Beta: These are the fractions of amino acids which promote α-helix/β-sheet secondary structures. 8 amino acids promote α-helix (Ala, Cys, Glu, Gln, His, Leu, Lys, Met). 6 β-sheet (promote Ile, Phe, Thr, Trp, Tyr, Val).
For genome level, these two values are computed as the mean of all CDSs of the genome.

Amino acid	3-letter	1-letter	MW	RV [1]	SA [2]	kdHydrophobicity [3]	wwHydrophobicity [4]	hhHydrophobicity [5]	Alpha [6]	Beta [6]
Alanine	Ala	A	89.09318	88.6	115	1.80	-0.17	0.11	1.29	0.90
Arginine	Arg	R	174.20096	173.4	225	-4.50	-0.81	2.58	0.96	0.99
Asparagine	Asn	N	132.11792	114.1	160	-3.50	-0.42	2.05	0.90	0.76
Aspartic acid	Asp	D	133.10268	111.1	150	-3.50	-1.23	3.49	1.04	0.72
Cysteine	Cys	C	121.15818	108.5	135	2.50	0.24	-0.13	1.11	0.74
Glutamine	Gln	Q	146.1445	143.8	180	-3.5	-0.58	2.36	1.27	0.80
Glutamic acid	Glu	E	147.12926	138.4	190	-3.50	-2.02	2.68	1.44	0.75
Glycine	Gly	G	75.06660	60.1	75	-0.40	-0.01	0.74	0.56	0.92
Histidine	His	H	155.15456	153.2	195	-3.20	-0.96	2.06	1.22	1.08
Isoleucine	Ile	I	131.17292	166.7	175	4.50	0.31	-0.60	0.97	1.45
Leucine	Leu	L	131.17292	166.7	170	3.80	0.56	-0.55	1.30	1.02
Lysine	Lys	K	146.18756	168.6	200	-3.90	-0.99	2.71	1.23	0.77
Methionine	Met	M	149.21134	162.9	185	1.9	0.23	-0.1	1.47	0.97
Phenylalanine	Phe	F	165.18914	189.9	210	2.8	1.13	-0.32	1.07	1.32
Proline	Pro	P	115.13046	112.7	145	-1.6	-0.45	2.23	0.52	0.64
Serine	Ser	S	105.09258	89.0	115	-0.80	-0.13	0.84	0.82	0.95
Threonine	Thr	T	119.11916	116.1	140	0.70	-0.14	0.52	0.82	1.21
Tryptophan	Trp	W	204.22518	227.8	255	-0.90	1.85	0.30	0.99	1.14
Tyrosine	Tyr	Y	181.18854	193.6	230	-1.3	0.94	0.68	0.72	1.25
Valine	Val	V	117.14634	140.0	155	4.20	-0.07	-0.31	0.91	1.49

[1]. RV: Zamyatnin, A.A. 1972. Prog Biophys Mol Biol. 24: 107-123.

[2]. SA: Chothia, C. 1976. J Mol Biol. 105: 1-12.

[3]. kdHydrophobicity: A simple method for displaying the hydropathic character of a protein. Kyte J, Doolittle RF. J Mol Biol. 1982 May 5;157(1):105-32.

[4]. wwHydrophobicity: Experimentally determined hydrophobicity scale for proteins at membrane interfaces. Wimley WC, White SH. Nat Struct Biol. 1996 Oct;3(10):842-8.

[5]. hhHydrophobicity: Recognition of transmembrane helices by the endoplasmic reticulum translocon. Hessa T, Kim H, Bihlmaier K, Lundin C, Boekel J, Andersson H, Nilsson I, White SH, von Heijne G. Nature. 2005 Jan 27;433(7024):377-81

[6]. Alpha, Beta: Chou, P.Y. and Fasman, G.D. 1974.. Biochemistry. 13: 222-245.

6. Phase Separation

Total % Disorder, Disorder Segment, Disorder 30AA, Disorder 50AA: ESpritz (I Walsh et al., 2012) calculates the proportion of IDR (intrinsically disordered regions) regions in each protein sequence (Total % Disorder), the number and location of IDRs greater than 30aa (Disorder 30AA), the number and location of IDRs greater than 50aa (Disorder 50AA), and the number and length of total IDR regions (Disorder Segment).
For genome level, the numbers and percentages of CDSs containing Disorder Segment/Disorder 30AA/Disorder 50AA are computed.
COREscore, LLR, PAPAprop, PAPAfi: These values are computed by PLAAC (AK Lancaster et al. 2014). COREscore value (larger value indicates higher phase separation potential), LLR (for log-likelihood ratio), PAPAprop and PAPAfi scores are calculated by PLAAC for each coding sequence.
PScore: Phase separation propensity score (PScore) is computed by Pi-Pi (RMC Vernon et al., 2018).
It is the number of π-π electrostatic interactions indicative of potential for phase separation. If the PScore is greater than 4, it represents a very strong phase separation potential.