ParaAT (Parallel Alignment and back-Translation) is capable of manipulating a large number of homologous groups. ParaAT features parallel construction of multiple protein-coding DNA alignments and thus is well suited for large-scale data analysis in the high-throughput era.
ParaAT involves two parallel regions, in which the master thread creates a team of slave threads running concurrently. In the first parallel region, ParaAT aligns protein sequences for multiple homologs by assigning each homolog to one of the slave threads. In the second region, ParaAT parallelly back-translates protein sequence alignments into the corresponding DNA alignments.
ParaAT is well suited for large-scale data analysis in the high-throughput era, providing good scalability and exhibiting high parallel efficiency for computationally demanding tasks.
To ease the input of multiple homologous groups, ParaAT accepts a tab-delimited text file with each row representing a homologous group. An example data file containing 9,712 human-chimp-mouse homologs is available at here. For instance,
NP_000005 NP_783327 XP_001139819
NP_000006 NP_032699 XP_001146758
NP_000008 NP_031409 XP_001162935
.......
There is no restriction on the number of homologous groups (dependent on the memory size of computer used). In addition, different rows (groups) can have different numbers of homologous genes.
-h, -homolog | Homolog group file [string, required] |
-a, -aminoacid | File containing multiple amino acid sequences [string, required] |
-n, -nuc | File containing multiple nucleotide sequences [string, required] |
-p, -processor | File containing the number of processes, changeable during running [string, required] |
-o, -output | Output folder [string, required] |
-m, -msa | Multiple sequence aligner or specify with its full path (clustalw2 | t_coffee | mafft | muscle, optional) |
-f, -format | Output file format (axt | fasta | paml | codon | clustal, optional), default = fasta |
-c, -code | Genetic Code used [integer, optional], default = 1-The Standard Code
|
-g, -nogap | Remove aligned codons with gaps |
-t, -nomismatch | Remove mismatched codons |
-k, -kaks | Enable using KaKs_Calculator for Ka and Ks estimation (requiring axt format) |
-v, -verbose | Verbose output |
-?, -help | Display help information |
Before running, please make sure ParaAT.pl, Epal2nal.pl and multiple sequence aligner (clustalw2 | t_coffee | mafft | muscle) are put into the global directory and can be accessible from any working directory.
Supposing the following files.
File Name | Description |
---|---|
hmc.homolog | 9,712 human-mouse-chimp orthologous groups |
hmc.pep | A file containing all amino acids sequences for human, mouse, chimp |
hmc.cds | A file containing all nucleotide sequences for human, mouse, chimp |
hmc.proc | A file containing a number, viz., the number of processor to be used |
Thus, the command to run ParaAT is below and the results are outputted into a folder named "hmc-output":
ParaAT.pl -homolog hmc.homolog -aminoacid hmc.pep -nuc hmc.cds -processor hmc.proc -output hmc-output
To output the axt-formatted sequences, the command is:
ParaAT.pl -homolog hmc.homolog -aminoacid hmc.pep -nuc hmc.cds -processor hmc.proc -output
hmc-output
-format axt
To output the axt-formatted sequences and perform Ka and Ks calculations, the command is:
ParaAT.pl -homolog hmc.homolog -aminoacid hmc.pep -nuc hmc.cds -processor hmc.proc -output
hmc-output
-format axt -kaks
The outputted sequences with AXT format are below.
NP_000006-NP_032699
ATGGACATTGAAGCATATTTTGAAAGAATTGGCTATAAGAACTCTAGGAACAAATTGGACTTGGAAACATTAACTGACATTCTTGAGCACCAGATCCGGGCTGTTCCCTTTGAGAACCTTAACATGCATTGTGGGCAAGCCATGGAGTTGGGCTTAGAGGCTATTTTTGATCACATTGTAAGAAGAAACCGGGGTGGGTGGTGTCTCCAGGTCAATCAACTTCTGTACTGGGCTCTGACCACAATCGGTTTTCAGACCACAATGTTAGGAGGGTATTTTTACATCCCTCCAGTTAACAAATACAGCACTGGCATGGTTCACCTTCTCCTGCAGGTGACCATTGACGGCAGGAATTACATTGTCGATGCTGGGTCTGGAAGCTCCTCCCAGATGTGGCAGCCTCTAGAATTAATTTCTGGGAAGGATCAGCCTCAGGTGCCTTGCATTTTCTGCTTGACAGAAGAGAGAGGAATCTGGTACCTGGACCAAATCAGGAGAGAGCAGTATATTACAAACAAAGAATTTCTTAATTCTCATCTCCTGCCAAAGAAGAAACACCAAAAAATATACTTATTTACGCTTGAACCTCGAACAATTGAAGATTTTGAGTCTATGAATACATACCTGCAGACGTCTCCAACATCTTCATTTATAACCACATCATTTTGTTCCTTGCAGACCCCAGAAGGGGTTTACTGTTTGGTGGGCTTCATCCTCACCTATAGAAAATTCAATTATAAAGACAATACAGATCTGGTCGAGTTTAAAACTCTCACTGAGGAAGAGGTTGAAGAAGTGCTGAGAAATATATTTAAGATTTCCTTGGGGAGAAATCTCGTGCCCAAACCTGGTGATGGATCCCTTACTATT
ATGGACATCGAAGCATACTTTGAAAGGATTGGTTACAAGAACTCAGTGAATAAATTGGACTTAGCCACATTAACTGAAGTTCTTCAGCACCAGATGCGAGCAGTTCCTTTTGAGAATCTTAACATGCATTGTGGAGAAGCCATGCATCTGGATTTACAGGACATTTTTGACCACATAGTAAGGAAGAAGAGAGGTGGATGGTGTCTCCAGGTTAATCATCTGCTGTACTGGGCTCTGACCAAAATGGGCTTTGAAACCACAATGTTGGGAGGATATGTTTACATAACTCCAGTCAGCAAATATAGCAGTGAAATGGTCCACCTTCTAGTACAGGTGACCATCAGTGACAGGAAGTACATTGTGGATTCCGCCTATGGAGGCTCCTACCAGATGTGGGAGCCTCTGGAATTAACATCTGGGAAGGATCAGCCTCAGGTGCCTGCCATCTTCCTTTTGACAGAGGAGAATGGAACCTGGTACTTGGACCAAATCAGAAGAGAGCAGTATGTTCCAAATGAAGAATTTGTTAACTCAGACCTCCTTGAAAAGAACAAATATCGAAAAATCTACTCCTTTACTCTTGAGCCCCGAGTTATCGAGGATTTTGAATATGTGAATAGCTATCTTCAGACATCGCCAGCATCTGTGTTTGTAAGCACATCGTTCTGTTCCTTGCAGACCTCGGAAGGGGTTCACTGTTTAGTGGGCTCCACCTTTACAAGTAGGAGATTCAGCTATAAGGACGATGTAGATCTGGTTGAGTTTAAATATGTGAATGAGGAAGAAATAGAAGATGTACTGAAAACCGCATTTGGCATTTCTTTGGAGAGAAAGTTTGTGCCCAAACATGGTGAACTAGTTTTTACTATT
NP_000008-NP_031409
ATGGCCGCCGCGCTGCTCGCCCGGGCCTCGGGCCCTGCCCGCAGAGCTCTCTGTCCTAGGGCCTGGCGGCAGTTACACACCATCTACCAGTCTGTGGAACTGCCCGAGACACACCAGATGTTGCTCCAGACATGCCGGGACTTTGCCGAGAAGGAGTTGTTTCCCATTGCAGCCCAGGTGGATAAGGAACATCTCTTCCCAGCGGCTCAGGTGAAGAAGATGGGCGGGCTTGGGCTTCTGGCCATGGACGTGCCCGAGGAGCTTGGCGGTGCTGGCCTCGATTACCTGGCCTACGCCATCGCCATGGAGGAGATCAGCCGTGGCTGCGCCTCCACCGGAGTCATCATGAGTGTCAACAACTCTCTCTACCTGGGGCCCATCTTGAAGTTTGGCTCCAAGGAGCAGAAGCAGGCGTGGGTCACGCCTTTCACCAGTGGTGACAAAATTGGCTGCTTTGCCCTCAGCGAACCAGGGAACGGCAGTGATGCAGGAGCTGCGTCCACCACCGCCCGGGCCGAGGGCGACTCATGGGTTCTGAATGGAACCAAAGCCTGGATCACCAATGCCTGGGAGGCTTCGGCTGCCGTGGTCTTTGCCAGCACGGACAGAGCCCTGCAAAACAAGGGCATCAGTGCCTTCCTGGTCCCCATGCCAACGCCTGGGCTCACGTTGGGGAAGAAAGAAGACAAGCTGGGCATCCGGGGCTCATCCACGGCCAACCTCATCTTTGAGGACTGTCGCATCCCCAAGGACAGCATCCTGGGGGAGCCAGGGATGGGCTTCAAGATAGCCATGCAAACCCTGGACATGGGCCGCATCGGCATCGCCTCCCAGGCCCTGGGCATTGCCCAGACCGCCCTCGATTGTGCTGTGAACTACGCTGAGAATCGCATGGCCTTCGGGGCGCCCCTCACCAAGCTCCAGGTCATCCAGTTCAAGTTGGCAGACATGGCCCTGGCCCTGGAGAGTGCCCGGCTGCTGACCTGGCGCGCTGCCATGCTGAAGGATAACAAGAAGCCTTTCATCAAGGAGGCAGCCATGGCCAAGCTGGCCGCCTCGGAGGCCGCGACCGCCATCAGCCACCAGGCCATCCAGATCCTGGGCGGCATGGGCTACGTGACAGAGATGCCGGCAGAGCGGCACTACCGCGACGCCCGCATCACTGAGATCTACGAGGGCACCAGCGAAATCCAGCGGCTGGTGATCGCCGGGCATCTGCTCAGGAGCTACCGGAGC
ATGGCTGCCGCCTTGCTCGCCCGGGCCCGTGGCCCTCTCCGTAGAGCTCTCGGTGTTCGGGACTGGCGACGGTTACACACTGTTTACCAGTCTGTGGAGCTGCCTGAGACACACCAGATGTTGCGTCAGACATGCCGTGACTTTGCCGAGAAGGAGTTGGTCCCCATTGCGGCCCAGCTGGACAGGGAGCATCTCTTCCCCACAGCTCAGGTTAAGAAGATGGGTGAGCTCGGGCTGCTGGCCATGGATGTGCCAGAGGAGCTGAGTGGTGCAGGCTTGGATTACCTGGCCTACTCCATCGCCCTGGAGGAGATCAGCCGTGCCTGCGCCTCCACGGGAGTTATCATGAGCGTCAACAATTCTCTCTACTTGGGACCCATTCTGAAGTTTGGATCCGCACAGCAGAAGCAACAGTGGATCACCCCTTTCACCAATGGTGACAAAATCGGCTGTTTTGCCCTCAGTGAGCCAGGCAATGGCAGTGATGCTGGAGCCGCTTCCACCACTGCCCGGGAAGAGGGTGACTCATGGGTCCTCAACGGCACCAAAGCTTGGATCACCAACTCCTGGGAGGCTTCCGCCACGGTGGTATTTGCCAGCACAGACAGGTCCCGGCAGAACAAGGGTATCAGTGCCTTCCTGGTTCCCATGCCAACTCCTGGGCTCACGCTGGGCAAGAAGGAAGACAAGCTGGGCATCCGGGCCTCCTCCACAGCTAACCTCATCTTTGAGGACTGCCGGATCCCCAAGGAGAACCTGCTTGGGGAGCCTGGGATGGGCTTCAAAATAGCCATGCAAACCCTGGACATGGGTCGCATTGGCATCGCCTCCCAGGCCCTGGGCATCGCCCAGGCCTCCCTGGATTGTGCTGTGAAGTATGCCGAGAACCGCAATGCCTTTGGGGCACCGCTCACCAAGCTCCAAAATATCCAGTTCAAGCTGGCAGACATGGCCCTGGCCCTGGAGAGTGCCCGCCTGCTGACCTGGCGTGCTGCCATGTTGAAAGACAACAAGAAACCTTTCACCAAGGAGTCCGCCATGGCCAAACTGGCTGCATCGGAGGCTGCAACCGCCATTAGCCACCAGGCCATCCAGATCCTGGGCGGCATGGGGTATGTGACAGAGATGCCGGCTGAGCGGTACTACCGAGATGCCCGCATCACTGAGATCTACGAAGGGACCAGCGAAATCCAGAGACTGGTGATCGCTGGGCATCTGCTCCGGAGCTACCGGAGC