BioCozy: 2014

Yesterday I went looking for a well-behaved codon-aware aligner. A question that has been brought up on both Stack Overflow and BioStars.

What I discovered is that many software projects that claim to be codon aligners are actually just (1) translating the nucleotide sequence to an amino acid seqeunce, (2) aligning the amino acid sequencing, and (3) mapping the a.a. alignment structure to the nucleotide sequences, in a codon-by-codon fashion. Well, that's fine as far as it goes, but (a) I can do that myself, and (b) this strategy fails in certain cases (e.g., a polymeric stretch of amino acids that have different, synonymous codons.)

Here are the software programs I tried. Sources are available for all of them, and most also have web app interfaces (not ultimately what I want but good for a test drive.)

TranslatorX [cite] - Perl script and web app by Abascal F, Zardoya R, Telford MJ
MACSE [cite] - Java jar archive by Vincent Ranwez at Université Montpellier 2
transAlign [cite] - Perl script by Olaf R.P. Bininda-Emonds at University of Oldenburg
PRANK [code|cite] - C++ program by Ari Löytynoja developed in the Goldman lab at EMBL-EBI and now at the University of Helsinki
Bio.CodonAlign [docs] - fork of BioPython by Zheng Ruan written in a Google Summer of Code 2013

The upshot: only PRANK seems to be a true codon-aware aligner.

Wish-list: What I really want is something like PRANK but that is streamlined for fast pairwise alignment. Ideally callable as a Python module! Or at least be able to communicate with the binary via stdin and stout pipes. Wish I knew some C++ but I don't :/

Here are details of my test and the results I got:

Test Sequences

>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>two
AGTAAGAAAAAGGAACAGCAGGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>three
AGTAAGAAAAAGGAACAGCAGGCAGCAAACACAGGAAACAGCGGTCAGGTTAGC

Results from translatorX

one

AGT

AAG

AAA

GCA

CAG

CAA

GCA

GCT

AAC

ACA

GGA

AAC

AGC

GGT

CAG

GTT

AGC

two

AGT

AAG

AAA

AAG

GAA

CAG

---

GCA

GCT

AAC

ACA

GGA

AAC

AGC

GGT

CAG

GTT

AGC

three

AGT

AAG

AAA

AAG

GAA

CAG

---

GCA

GCA

AAC

ACA

GGA

AAC

AGC

GGT

CAG

GTT

AGC

>one
SKKAAQQAAANTGNSGQVS
>two
SKKKEQQ-AANTGNSGQVS
>three
SKKKEQQ-AANTGNSGQVS

Results from MACSE

>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>two
AGTAAGAAAAAGGAACAGCAGGCAGCT---AACACAGGAAACAGCGGTCAGGTTAGC
>three
AGTAAGAAAAAGGAACAGCAGGCAGCA---AACACAGGAAACAGCGGTCAGGTTAGC

>one
SKKAAQQAAANTGNSGQVS
>two
SKKKEQQAA-NTGNSGQVS
>three
SKKKEQQAA-NTGNSGQVS

Results from transAlign

>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>three
AGTAAGAAAAAGGAACAGCAG---GCAGCAAACACAGGAAACAGCGGTCAGGTTAGC
>two
AGTAAGAAAAAGGAACAGCAG---GCAGCTAACACAGGAAACAGCGGTCAGGTTAGC

CLUSTAL 2.1 multiple sequence alignment

tAlign_2        SKKKEQQ-AANTGNSGQVS
tAlign_3        SKKKEQQ-AANTGNSGQVS
tAlign_1        SKKAAQQAAANTGNSGQVS
                *** ** ***********

Results from PRANK

(with "align translated codons" option checked)

Curiously, with all three test sequences, it doesn't put the GCT where I want it...

>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>two
AGTAAGAAAAAGGAACAGCAGGCAGCT---AACACAGGAAACAGCGGTCAGGTTAGC
>three
AGTAAGAAAAAGGAACAGCAGGCAGCA---AACACAGGAAACAGCGGTCAGGTTAGC

But with seqs 1 and 3, it works correctly...

>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>three
AGTAAGAAAAAGGAACAGCAGGCAGCA---AACACAGGAAACAGCGGTCAGGTTAGC

As it does with just seqs 1 and 2...

>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>two
TGGAAGAAAAAGGAACAGCAGGCA---GCTAACACAGGAAACAGCGGTCAGGTTAGC

I think the behavior with all 3 seqs is not a glitch but is rather a result of the probabilistic model that PRANK uses for multiple sequence alignments (ie, the alignments inform each other.)

How big is this news? Well, bigger than 4 GB, that's for sure.

A while back I brought up this issue on BioStars.

From the Bowtie2 news page:

Version 2.2.0 - February 10, 2014

Improved index querying efficiency using "population count" instructions available since SSE4.2
Added support for large and small indexes, removing 4-billion-nucleotide barrier. Bowtie 2 can now be used with reference genomes of any size.
Fixed bug that could cause bowtie2-build to crash when reference length is close to 4 billion.
Added a CL: string to the @PG SAM header to preserve information about the aligner binary and paramteres.
Fixed bug that could cause bowtie2-build to crash when reference length is close to 4 billion.
No longer releasing 32-bit binaries. Simplified manual and Makefile accordingly.
Credits to the Intel® enabling team for performance optimizations included in this release. Thank you!
Phased out CygWin support. MinGW can still be used for Windows building.
Added the .bat generation for Windows.
Fixed some issues with some uncommon chars in fasta files.
Fixed wrappers so bowtie can now be used with symlinks.

BioCozy

Friday, May 16, 2014

Hunting for the perfect codon aware aligner

Test Sequences

Results from translatorX

Results from MACSE

Results from transAlign

Results from PRANK

Wednesday, February 19, 2014

Color palettes in R (wrt ggplot2)

Wednesday, February 12, 2014

Bowtie2 Version 2.2.0 released -- HUGE news!

Version 2.2.0 - February 10, 2014