Yesterday I went looking for a well-behaved codon-aware aligner. A question that has been brought up on both Stack Overflow and BioStars.
What I discovered is that many software projects that claim to be codon aligners are actually just (1) translating the nucleotide sequence to an amino acid seqeunce, (2) aligning the amino acid sequencing, and (3) mapping the a.a. alignment structure to the nucleotide sequences, in a codon-by-codon fashion. Well, that's fine as far as it goes, but (a) I can do that myself, and (b) this strategy fails in certain cases (e.g., a polymeric stretch of amino acids that have different, synonymous codons.)
Here are the software programs I tried. Sources are available for all of them, and most also have web app interfaces (not ultimately what I want but good for a test drive.)
Here are details of my test and the results I got:
>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>two
AGTAAGAAAAAGGAACAGCAGGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>three
AGTAAGAAAAAGGAACAGCAGGCAGCAAACACAGGAAACAGCGGTCAGGTTAGC
>one
SKKAAQQAAANTGNSGQVS
>two
SKKKEQQ-AANTGNSGQVS
>three
SKKKEQQ-AANTGNSGQVS
>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>three
AGTAAGAAAAAGGAACAGCAG---GCAGCAAACACAGGAAACAGCGGTCAGGTTAGC
>two
AGTAAGAAAAAGGAACAGCAG---GCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
CLUSTAL 2.1 multiple sequence alignment
tAlign_2 SKKKEQQ-AANTGNSGQVS
tAlign_3 SKKKEQQ-AANTGNSGQVS
tAlign_1 SKKAAQQAAANTGNSGQVS
*** ** ***********
(with "align translated codons" option checked)
Curiously, with all three test sequences, it doesn't put the GCT where I want it...
>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>two
AGTAAGAAAAAGGAACAGCAGGCAGCT---AACACAGGAAACAGCGGTCAGGTTAGC
>three
AGTAAGAAAAAGGAACAGCAGGCAGCA---AACACAGGAAACAGCGGTCAGGTTAGC
But with seqs 1 and 3, it works correctly...
>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>three
AGTAAGAAAAAGGAACAGCAGGCAGCA---AACACAGGAAACAGCGGTCAGGTTAGC
As it does with just seqs 1 and 2...
>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>two
TGGAAGAAAAAGGAACAGCAGGCA---GCTAACACAGGAAACAGCGGTCAGGTTAGC
I think the behavior with all 3 seqs is not a glitch but is rather a result of the probabilistic model that PRANK uses for multiple sequence alignments (ie, the alignments inform each other.)
What I discovered is that many software projects that claim to be codon aligners are actually just (1) translating the nucleotide sequence to an amino acid seqeunce, (2) aligning the amino acid sequencing, and (3) mapping the a.a. alignment structure to the nucleotide sequences, in a codon-by-codon fashion. Well, that's fine as far as it goes, but (a) I can do that myself, and (b) this strategy fails in certain cases (e.g., a polymeric stretch of amino acids that have different, synonymous codons.)
Here are the software programs I tried. Sources are available for all of them, and most also have web app interfaces (not ultimately what I want but good for a test drive.)
- TranslatorX [cite] - Perl script and web app by Abascal F, Zardoya R, Telford MJ
- MACSE [cite] - Java jar archive by
- transAlign [cite] - Perl script by Olaf R.P. Bininda-Emonds at University of Oldenburg
- PRANK [code|cite] - C++ program by Ari Löytynoja developed in the Goldman lab at EMBL-EBI and now at the University of Helsinki
- Bio.CodonAlign [docs] - fork of BioPython by Zheng Ruan written in a Google Summer of Code 2013
Wish-list: What I really want is something like PRANK but that is streamlined for fast pairwise alignment. Ideally callable as a Python module! Or at least be able to communicate with the binary via stdin and stout pipes. Wish I knew some C++ but I don't :/
Here are details of my test and the results I got:
Test Sequences
>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>two
AGTAAGAAAAAGGAACAGCAGGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>three
AGTAAGAAAAAGGAACAGCAGGCAGCAAACACAGGAAACAGCGGTCAGGTTAGC
Results from translatorX
one | AGT | AAG | AAA | GCA | GCA | CAG | CAA | GCA | GCA | GCT | AAC | ACA | GGA | AAC | AGC | GGT | CAG | GTT | AGC |
two | AGT | AAG | AAA | AAG | GAA | CAG | CAG | --- | GCA | GCT | AAC | ACA | GGA | AAC | AGC | GGT | CAG | GTT | AGC |
three | AGT | AAG | AAA | AAG | GAA | CAG | CAG | --- | GCA | GCA | AAC | ACA | GGA | AAC | AGC | GGT | CAG | GTT | AGC |
>one
SKKAAQQAAANTGNSGQVS
>two
SKKKEQQ-AANTGNSGQVS
>three
SKKKEQQ-AANTGNSGQVS
Results from MACSE
>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>two
AGTAAGAAAAAGGAACAGCAGGCAGCT---AACACAGGAAACAGCGGTCAGGTTAGC
>three
AGTAAGAAAAAGGAACAGCAGGCAGCA---AACACAGGAAACAGCGGTCAGGTTAGC
>one
SKKAAQQAAANTGNSGQVS
>two
SKKKEQQAA-NTGNSGQVS
>three
SKKKEQQAA-NTGNSGQVS
Results from transAlign
>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>three
AGTAAGAAAAAGGAACAGCAG---GCAGCAAACACAGGAAACAGCGGTCAGGTTAGC
>two
AGTAAGAAAAAGGAACAGCAG---GCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
CLUSTAL 2.1 multiple sequence alignment
tAlign_2 SKKKEQQ-AANTGNSGQVS
tAlign_3 SKKKEQQ-AANTGNSGQVS
tAlign_1 SKKAAQQAAANTGNSGQVS
*** ** ***********
Results from PRANK
(with "align translated codons" option checked)
Curiously, with all three test sequences, it doesn't put the GCT where I want it...
>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>two
AGTAAGAAAAAGGAACAGCAGGCAGCT---AACACAGGAAACAGCGGTCAGGTTAGC
>three
AGTAAGAAAAAGGAACAGCAGGCAGCA---AACACAGGAAACAGCGGTCAGGTTAGC
But with seqs 1 and 3, it works correctly...
>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>three
AGTAAGAAAAAGGAACAGCAGGCAGCA---AACACAGGAAACAGCGGTCAGGTTAGC
As it does with just seqs 1 and 2...
>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>two
TGGAAGAAAAAGGAACAGCAGGCA---GCTAACACAGGAAACAGCGGTCAGGTTAGC
I think the behavior with all 3 seqs is not a glitch but is rather a result of the probabilistic model that PRANK uses for multiple sequence alignments (ie, the alignments inform each other.)