KaKs_Calculator calculating selective pressure on coding and non-coding sequences

Manual

Installation

For high efficiency and compatibility with more platforms, the kernel codes of KaKs_Calculator are written in standard C++. For Windows version, we use Visual C++ 6.0 for GUI (Graphics User Interface) application.

Command Line for Linux/Unix/Mac/Windows

  • Unpack the package of KaKs_CalculatorXXX.tar.gz by the following commands.

    uzip KaKs_CalculatorXXX.zip

  • Compile the programs in the source codes folder with the help of g++/gcc compiler.

    cd KaKs_CalculatorXXX/src
    make

  • Run "KaKs" for coding sequences and "KnKs" for non-coding sequences.

For Windows, there are two compiled executables in the folder named "bin".

GUI Application for Windows

  • Enter the folder named "GUI".
  • Click the setup application and install KaKs_Calculator 3.0.
  • Please find KaKs_Calculator in "Start Menu".

Of note, two example files (coding.axt and noncoding.axt), as tested on Windows 10, are availlable in the following folder.

C:\Program Files (x86)\CNCB-NGDC\KaKs_Calculator

Coding Sequences: Ka and Ks

Calculating Ka and Ks normally involves three steps. Let us assume that the number of lengths between two DNA sequences compared is n and the number of substitutions between them is m. To calculate Ka and Ks, we need to count the numbers of synonymous (S) and nonsynonymous (N) sites (S + N = n) and the numbers of synonymous (Sd) and nonsynonymous (Nd) substitutions (Sd + Nd = m). Then it is after correcting multiple substitutions that (Nd/N) and (Sd/S) could represent Ka and Ks, respectively, since the observed number of substitutions underestimates the real number of substitutions as sequences diverge over time. Therefore, we can conclude from mentioned above that these methods normally involve three steps to estimate Ka and Ks: counting S and N, counting Sd and Nd, and correction for multiple substitutions.

Methods for calculating Ka and Ks adopt different substitution models with subtle yet significant differences. They can be classified as approximate methods and maximum-likelihood methods. Different from approximate methods, maximum-likelihood methods adopt the probability theory to finish all three steps mentioned above in one go.

Approximate Methods

There are several approximate methods incorporated into KaKs_Calculator, and we list their abbreviations in the program and their corresponding reference(s) as follows.

  • NG: Nei, M. and Gojobori, T. (1986)
  • LWL: Li, W.H., et al. (1985)
  • LPB: Li, W.H. (1993) and Pamilo, P. and Bianchi, N.O. (1993)
  • MLWL (Modified LWL), MLPB (Modified LPB): Tzeng, Y.H., et al. (2004)
  • YN: Yang, Z. and Nielsen, R. (2000)
  • MYN (Modified YN): Zhang, Z., et al. (2006)

Maximum-Likelihood Methods

The method of GY takes account of sequence evolutionary features, such as transition/transversion rate ratio and nucleotide frequencies (reflected in the HKY Model) and incorporates these features into a codon-based model. We extend this method to a set of candidate models in a maximum likelihood framework and use the AICc for model selection and model averaging.

  • GY: Goldman, N. and Yang, Z. (1994)
  • MS (Model Selection), MA (Model Averaging): based on a set of candidate models defined by Posada, D. (2003) as follows.
Model Substitution Rates Nucleotide Frequency
JC
F81
rTC=rAG=rTA=rCG=rTG=rCA Equal
Unequal
K2P
HKY
rTC=rAG ≠ rTA=rCG=rTG=rCA Equal
Unequal
TrNEF
TrN
rTC ≠ rAG ≠ rTA=rCG=rTG=rCA Equal
Unequal
K3P
K3PUF
rTC=rAG ≠ rTA=rCG ≠ rTG=rCA Equal
Unequal
TIMEF
TIM
rTC ≠ rAG ≠ rTA=rCG ≠ rTG=rCA Equal
Unequal
TVMEF
TVM
rTC=rAG ≠ rTA ≠ rCG ≠ rTG ≠ rCA Equal
Unequal
SYM
GTR
rTC≠rAG≠rTA≠rCG≠rTG≠rCA Equal
Unequal

rij: substitution rate between i and j, where i ≠ j and i, j∈{A, C, G, T}

Non-Coding Sequences: Kn and Ks

For calculating selective pressure on non-coding sequences, synonymous substitutions of adjacent coding sequences are used as a reference baseline, which do not provoke amino acid changes and thus thought to be invisible to selection and reflect the neutral rate of evolution. Therefore, selective pressure on non-coding sequences (ξ) can be quantified as non-coding nucleotide substitution rate (Kn) normalized by neutral substitution rate (assumed as Ks), viz., ξ = Kn/Ks, where Ks is inferred from adjacent coding sequences.

As the number of observed substitutions is less than the number of real substitutions, similar to coding sequences, correction of multiple substitutions for non-coding sequences is also needed. In KaKs_Calculator, the HKY model is used for this correction. Therefore, Kn can be deduced from the observed transitional and transversional substitutions (S and V, respectively) as well as four nucleotide frequencies (πA, πT, πG and πC).

KaKs_Calculator provides users with two ways to obtain the value of neutral mutation rate or Ks, which is either calculated from adjacent coding sequences uploaded by users or just specified in a straightforward manner by users. It should be noted that one non-coding sequence may have multiple adjacent coding genes, which are completely specified by users and thus can lead to different estimates of Ks and ξ.

Input Sequence Format

KaKs_Calculator accepts quasi-AXT sequence format as follows. Before calculation, gaps and stop codons between compared sequences will be removed. You can also see “example.axt” in the folder of “KaKs_CalculatorXXX/examples/”.

For example:

NP_000026
ATGCTCCTGTG-CCACTGGCC
ATCCCC-TGCGCTCACTGGAC

NP_000053
ACAGaTtCTACCc-GCCcACTA--GgtGtt
---ggTTCTCCtACCcA-G-CACTACTggg

Each pair of sequences in an AXT file contains three lines: one sequence name line and two sequence lines. Any pairwise sequence is separated from one another by one blank line.

  • Sequence name line

    NP_000026

  • Pairwise sequences lines

    ATGCTCCTGTG-CCACTGGCC
    ATCCCC-TGCGCTCACTGGAC

Parameters setting

KaKs_Calculator is more suitable and efficient for large-scale data analysis. It reads a pair of sequences and computes corresponding estimates one by one, so that it requires memory proportional to the maximum length among pairwise sequences.

In KaKs_Calculator, there are two programs, named "KaKs" and "KnKs", that are capable for calculating selective pressure on coding and non-coding sequences, respectively.

KaKs for coding sequences

KaKs allows user to choose more than one method to calculate Ka and Ks at one running time. The following is the parameters’ setting in Linux/Unix/Mac.

  • -i    AXT sequence file name for calculating Ka and Ks
  • -o   File name for outputting results
  • -c   Genetic code (Default = 1-Standard Code). See more genetic codes at http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c
  • -m  Methods for calculating Ka and Ks (Default = MA): NG, LWL, LPB, MLWL, MLPB, YN, MYN, GY, MS, MA, ALL (including all above methods)
  • -d   File name for details about each candidate model only when using the method of MS or MA
  • -h   Show help information

For example:

  • use MA method and standard code

    KaKs -i test.axt -o test.axt.kaks

  • use MA method and vertebrate mitochondrial code

    KaKs -i test.axt -o test.axt.kaks -c 2

  • use MA method and standard code and output details of model selection on each candidate model

    KaKs -i test.axt -o test.axt.kaks -d test.axt.details

  • use LWL, YN and MYN and standard Code

    KaKs -i test.axt -o test.axt.kaks -m LWL -m YN -m MYN

KnKs for non-coding sequences

KnKs is able to estimate Kn and Ks for non-coding sequences. The following is the parameters’ setting in Linux/Unix/Mac.

  • -i      Input axt file name of non-coding aligned sequences [string]
  • -j      Input axt file name of adjacent coding aligned sequences [string] || neutral mutation rate or Ks [numeric]
  • -o     Output file name for noncoding estimates [string]
  • -d     Whether to output details of coding estimates [int, 0 or 1; default = 1)
  • -c     Genetic code table [int, default = 1-Standard Code).
  • -h     Help information

    For example:

KnKs -i test.axt -j adj.axt -o test.axt.knks    //use 'adj.axt' to deduce neutral mutation rate
KnKs -i test.axt -j 0.618   -o test.axt.knks    //use 0.618 as netural mutation rate

Windows

The Windows version provides users with a friendly interface to select input sequences’ file, genetic code and method(s). During calculating, you can minimize the application window and send it to tray. After finishing calculation, KaKs_Calculator allows users to export results to file or clipboard at will.

Output Format

KaKs_Calculator provides comprehensive information estimated from compared sequences, including numbers of synonymous and nonsynonymous sites, numbers of synonymous and nonsynonymous substitutions, GC contents, maximum-likelihood score, and AICC, in addition to synonymous and nonsynonymous substitution rates and their ratio.

KaKs for coding sequences

  • Sequence: Name of Pairwise sequence
  • Method: Name of method for calculation of Ka and Ks
  • Ka: Nonsynonymous substitution rate
  • Ks: Synonymous substitution rate
  • Ka/Ks: Selective strength
  • P-Value(Fisher): The value computed by Fisher exact test, based on synonymous and nonsynonymous sites (S-Sites and N-Sites) and synonymous and nonsynonymous sites (S-Substitutions and N-Substitutions) 
  • Length: Sequence length (after removing gaps and stop codon(s))
  • S-Sites: Synonymous sites, S
  • N-Sites: Nonsynonymous sites, N
  • Fold-Sites(0:2:4): 0,2,4-fold degenerate sites
  • Substitutions: Substitutions between sequences
  • S-Substitutions: Synonymous substitutions, Sd
  • N-Substitutions: Nonsynonymous substitutions, Nd
  • Fold-S-Substitutions(0:2:4): Synonymous substitutions at 0,2,4-fold
  • Fold-N-Substitutions(0:2:4): Nonsynonymous substitutions at 0,2,4-fold
  • Divergence-Distance: Nucleotide substitutions per codon, viz., a weighted average of Ka and Ks,  3 * (Ka * N + Ks * S) / (S + N)
  • Substitution-Rate-Ratio(rTC:rAG:rTA:rCG:rTG:rCA/rCA): Ratios of six substitution rates to the substitution rate between C and A
  • GC(1:2:3): GC content of entire sequences and of three codon positions
  • ML-Score: Maximum likelihood score
  • AICc: Value of AICc
  • Akaike-Weight: Value of Akaike weight for model selection
  • Model: Selected model for the method of MS

KnKs for non-coding sequences

  • Sequence: Name of Pairwise sequence
  • Kn: Nucleotide substitution rate
  • Ks: Synonymous substitution rate of adjacent coding sequences
  • Kn/Ks: Selective strength
  • Length: Sequence length (after removing gaps and unrecognized characters [not A/T/G/C])
  • Substitutions: Substitutions between non-coding sequences
  • Kappa: Transition/transversion rate ratio
  • GC: GC content of non-coding sequences

FAQ

Which method(s) is recommended for calculating Ka and Ks?

In KaKs_Calculator, there are multiple computational methods for detecting selection on coding sequences, which fall into approximate methods and maximum-likelihood methods. Not surprisingly, approximate methods are more time-efficient than maximum-likelihood methods. Considering that different users may have different preferences, it should be noted, however, that maximum-likelihood methods are believed to achieve higher accuracy and that different methods adopt different models and strategies and thus can lead to different estimates.

See an evaluation paper on these methods:

Evaluation of six methods for estimating synonymous and nonsynonymous substitution ratesGenomics Proteomics Bioinformatics 2006. https://doi.org/10.1016/S1672-0229(06)60030-2

See an example in which contradictory findings are produced by different methods:

Correlation between Ka/Ks and Ks is related to substitution model and evolutionary lineageJ Mol Evol 2009. https://doi.org/10.1007/s00239-009-9222-9

How to generate input sequences for KaKs_Calculator, viz., codon-based alignments with AXT file format?

Codon-based alignments are obtained through two steps: first construct an alignment based on protein sequences and then back-translate it into the corresponding coding DNA alignment. Thus, we developed ParaAT (Parallel Alignment and back-Translation), a parallel tool that is capable of constructing protein-coding DNA alignments for a large number of homologs.

Please visit the ParaAT page: https://ngdc.cncb.ac.cn/biocode/tools/BT000003

See the following paper for details:

ParaAT: a parallel tool for constructing multiple protein-coding DNA alignmentsBiochem Biophys Res Commun 2012. https://doi.org/10.1016/j.bbrc.2012.02.101

How is MYA (Million Years Ago) deduced?

To deduce MYA, a neutral mutation rate is required. Although there is a debate on non-neutral evolution at synonymous sites (http://www.nature.com/nrg/journal/v7/n2/abs/nrg1770.html), Ks is widely believed to be a approximation of neutral mutation rate and its unit is substitutions per site.

Case 1: The commonly adopted estimate of mutational rate is 1.5×10−8 substitutions per site per year in Arabidopsis (http://www.ncbi.nlm.nih.gov/pubmed/16632643; http://www.ncbi.nlm.nih.gov/pubmed/11018155). Since Ks estimated by KaKs_Calculator is a divergence between a pair of sequences, so that MYA = Ks / 2 / (1.5×10−8).

Case 2: “Estimates of the times since divergence of …… were made by calculation of synonymous base substitution rates (Ks values) and assumption of a mutation rate of 1.4 × 10−8 substitutions per synonymous site per year (Koch et al., 2000). The median Ks values between orthologous genes …… (0.49 to 0.51) are representative of the Brassica–Arabidopsis divergence, which we estimate at 17 to 18 MYA.

Divergence time = (Ks / 2) / (mutation rate) = (0.25 substitutions/site) / (1.4×10−8 substitutions/site/year) = 0.1785×108 year = 17.85×106 year = 17.85 MYA

Here is a paper related to this estimation at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1475497/pdf/tpc1801339.pdf.

What is divergence distance estimated by KaKs_Calculator?

Divergence distance estimated by KaKs_Calculator is a weighted average of synonymous substitution rate and non-synonymous substitution rate. It should be noted that it can not be used as neutral mutation rate.

What values are used for the fisher exact test and what is the null hypothesis?

The fisher exact test is to built a 2 × 2 contingency table using the numbers of synonymous and nonsynonymous substitutions (Sd and Nd) and the numbers of synonymous and nonsynonymous sites (S and N) and to examine the significance of the substitution and site between the two classifications (synonymous and nonsynonymous).

  Synonymous Nonsynonymous
Site S N
Substitution Sd Nd

Therefore, its null hypothesis is Sd/S = Nd/N. Since Sd/S approximates Ks and Nd/N approximates Ka (the observed substitutions are less than the real ones, so it is only after correcting multiple substitutions that Nd/N and Sd/S could represent Ka and Ks, respectively), its null hypothesis also means neutral mutation. Reject the null hypothesis if Sd/S is singificantly greater (negative selection) or smaller (positive selection) than Nd/N, as indicated by small P-value (e.g., <0.05).

Acknowledgements

I would like to extend special thanks to Lina Ma for constructive suggestions and discussions on this work and Zhao Li for valuable help on data collection. I also thank Zhuojing Fan for designing the logo as well as Qing Guo and Lin Dai for fixing a bug on Windows GUI. I am extremely grateful to a number of users for reporting bugs and sending comments since the first release of KaKs_Calculator in 2006. (Version 3.0, 2021)

We thank Professor Wen-Hsiung Li for providing us with his computer program and Professor Ziheng Yang for his invaluable source codes in PAML. We are grateful to Heng Li for his advice and Yafeng Hu for his help in software designing. We also thank all anonymous users for reporting bugs and sending suggestions. (Version 1.0, 2006)