Friday, May 16, 2014

Hunting for the perfect codon aware aligner

Yesterday I went looking for a well-behaved codon-aware aligner.   A question that has been brought up on both Stack Overflow and BioStars.


What I discovered is that many software projects that claim to be codon aligners are actually just (1) translating the nucleotide sequence to an amino acid seqeunce, (2) aligning the amino acid sequencing, and (3) mapping the a.a. alignment structure to the nucleotide sequences, in a codon-by-codon fashion.  Well, that's fine as far as it goes, but (a) I can do that myself, and (b) this strategy fails in certain cases (e.g., a polymeric stretch of amino acids that have different, synonymous codons.)

Here are the software programs I tried. Sources are available for all of them, and most also have web app interfaces (not ultimately what I want but good for a test drive.)
  • TranslatorX [cite] - Perl script and web app by Abascal F, Zardoya R, Telford MJ
  • MACSE [cite] - Java jar archive by Vincent Ranwez at Université Montpellier 2
  • transAlign  [cite] - Perl script by Olaf R.P. Bininda-Emonds at University of Oldenburg
  • PRANK [code|cite] - C++ program by Ari Löytynoja developed in the Goldman lab at EMBL-EBI and now at the University of Helsinki
  • Bio.CodonAlign [docs] - fork of BioPython by Zheng Ruan written in a Google Summer of Code 2013


http://code.google.com/p/prank-msa/wiki/PRANKThe upshot:  only PRANK seems to be a true codon-aware aligner. 


Wish-list:  What I really want is something like PRANK but that is streamlined for fast pairwise alignment.  Ideally callable as a Python module!  Or at least be able to communicate with the binary via stdin and stout pipes.  Wish I knew some C++ but I don't :/


Here are details of my test and the results I got:

Test Sequences


>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>two
AGTAAGAAAAAGGAACAGCAGGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>three
AGTAAGAAAAAGGAACAGCAGGCAGCAAACACAGGAAACAGCGGTCAGGTTAGC



Results from translatorX



one AGT AAG AAA GCA GCA CAG CAA GCA GCA GCT AAC ACA GGA AAC AGC GGT CAG GTT AGC
two AGT AAG AAA AAG GAA CAG CAG --- GCA GCT AAC ACA GGA AAC AGC GGT CAG GTT AGC
three AGT AAG AAA AAG GAA CAG CAG --- GCA GCA AAC ACA GGA AAC AGC GGT CAG GTT AGC

>one
SKKAAQQAAANTGNSGQVS
>two
SKKKEQQ-AANTGNSGQVS
>three
SKKKEQQ-AANTGNSGQVS



Results from MACSE



>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>two
AGTAAGAAAAAGGAACAGCAGGCAGCT---AACACAGGAAACAGCGGTCAGGTTAGC
>three
AGTAAGAAAAAGGAACAGCAGGCAGCA---AACACAGGAAACAGCGGTCAGGTTAGC 
 
>one
SKKAAQQAAANTGNSGQVS
>two
SKKKEQQAA-NTGNSGQVS
>three
SKKKEQQAA-NTGNSGQVS 
 


Results from transAlign


>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>three
AGTAAGAAAAAGGAACAGCAG---GCAGCAAACACAGGAAACAGCGGTCAGGTTAGC
>two
AGTAAGAAAAAGGAACAGCAG---GCAGCTAACACAGGAAACAGCGGTCAGGTTAGC



CLUSTAL 2.1 multiple sequence alignment

tAlign_2        SKKKEQQ-AANTGNSGQVS
tAlign_3        SKKKEQQ-AANTGNSGQVS
tAlign_1        SKKAAQQAAANTGNSGQVS
                ***  ** ***********




Results from PRANK


(with "align translated codons" option checked)

Curiously, with all three test sequences, it doesn't put the GCT where I want it...

>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>two
AGTAAGAAAAAGGAACAGCAGGCAGCT---AACACAGGAAACAGCGGTCAGGTTAGC
>three
AGTAAGAAAAAGGAACAGCAGGCAGCA---AACACAGGAAACAGCGGTCAGGTTAGC



But with seqs 1 and 3, it works correctly...

>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>three
AGTAAGAAAAAGGAACAGCAGGCAGCA---AACACAGGAAACAGCGGTCAGGTTAGC

 
As it does with just seqs 1 and 2...

>one
AGTAAGAAAGCAGCACAGCAAGCAGCAGCTAACACAGGAAACAGCGGTCAGGTTAGC
>two
TGGAAGAAAAAGGAACAGCAGGCA---GCTAACACAGGAAACAGCGGTCAGGTTAGC



I think the behavior with all 3 seqs is not a glitch but is rather a result of the probabilistic model that PRANK uses for multiple sequence alignments (ie, the alignments inform each other.) 

Wednesday, February 19, 2014

Color palettes in R (wrt ggplot2)

I just found an outstanding rundown of color issues in R (with a nod to ggplot2) on cookbook-r.com.

Mostly just a collection of resources from around the web.

I really like their colorblind friendly palette:

 
cbPalette <- span=""> c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")


Which is better than the ggplot2 defaults:


Wednesday, February 12, 2014

Bowtie2 Version 2.2.0 released -- HUGE news!

 How big is this news?  Well, bigger than 4 GB, that's for sure.


From the Bowtie2 news page:

Version 2.2.0 - February 10, 2014

  • Improved index querying efficiency using "population count" instructions available since SSE4.2
  • Added support for large and small indexes, removing 4-billion-nucleotide barrier. Bowtie 2 can now be used with reference genomes of any size.
  • Fixed bug that could cause bowtie2-build to crash when reference length is close to 4 billion.
  • Added a CL: string to the @PG SAM header to preserve information about the aligner binary and paramteres.
  • Fixed bug that could cause bowtie2-build to crash when reference length is close to 4 billion.
  • No longer releasing 32-bit binaries. Simplified manual and Makefile accordingly.
  • Credits to the Intel® enabling team for performance optimizations included in this release. Thank you!
  • Phased out CygWin support. MinGW can still be used for Windows building.
  • Added the .bat generation for Windows.
  • Fixed some issues with some uncommon chars in fasta files.
  • Fixed wrappers so bowtie can now be used with symlinks.

Monday, December 2, 2013

pip install from github at a specific branch

I had difficulties figuring out how to use pip to install a specific branch from a github repo.  Because of my corporate firewall, all of the other methods that you can find around on various websites did not work for me.  This is the method that worked:

pip install git+ssh://git@github.com/h3/django-dajaxice.git@django-1.5+

Of course, I have http_proxy and https_proxy set and exported.

Wednesday, August 28, 2013

Making sense of Python version madness and pip/setuptools/virtualenv game

This posting is motivated by cases where you are stuck using an older server OS (CentOS 5.8 in this case) and want to install an up-to-date python 2.7 along with a raft of necessary packages.  What is the best way?

distribute? pip? setuptools? virtualenv?  What to make of all this?

Bottom line: use pip.  To install pip, you either install it from source, or else install virtualenv (which includes pip).  Installing from source allows you to manage packages with pip globally on your system (and not in a virtual environment), then you have to install pip from source.  But to install pip yourself, you first need to install setuptools.  All of these installs should be done using the target python interpreter.   Got that?

I wrote the following notes after digesting the helpful hints given at a variety of places, such as these two gist pages.  Note that although there are many tutorials that still guide you to install virtualenv with easy_install (such as here or here), we're actually not supposed to do it this way any longer.  My notes that follow will not make use of virtualenv, but in some situations this might be desirable -- this is just for a system-wide fresh installation of Python 2.7 and pip to install packages. (Here is a nice overview of setting up virtualenv if you want to go that route.)

First you want to be sure your CentOS machine is ready to built something like Python:

yum groupinstall "Development tools"
yum install zlib-devel bzip2-devel openssl-devel ncurses-devel
Then you will fetch and unzip Python:
wget http://www.python.org/ftp/python/2.7.5/Python-2.7.5.tar.bz2
tar xf Python-2.7.5.tar.bz2
cd Python-2.7.5
Do the usual "configure, make, make install" waltz: 
 
./configure --prefix=/opt/python2.7.5 --with-threads --enable-shared
make
make install
At this point, it seems like you're done but not quite.  
$ /opt/python2.7.5/bin/python /opt/python2.7.5/bin/python: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory
/opt/python2.7.5/
 
The trick is to tell ldconfig where to find the new python shared libraries:  
echo "/opt/python2.7.5/lib/" > /etc/ld.so.conf.d/opt-python2.7.5.conf
ldconfig 
 
Now we're cooking with gas: 
$ /opt/python2.7.5/bin/python 
Python 2.7.5 (default, Aug 21 2013, 09:50:18) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux2

Get setuptools from the setuptools project page.  You have to click on the version you want and then to find the download link, scroll to the very bottom of the page ("In the basement, down a long hallway behind a door marked 'danger-TIGER!'")

Aside: I am having bad luck these days trying to wget from https servers -- something about SSL certificates and my corporate proxy and the old version of wget that is on CentOS 5.8.

# wget "https://pypi.python.org/packages/source/s/setuptools/setuptools-1.0.tar.gz"
--2013-08-21 11:49:47--  https://pypi.python.org/packages/source/s/setuptools/setuptools-1.0.tar.gz
Resolving proxy.company.com... 

Connecting to proxy.company.com:8080... connected.
ERROR: certificate common name `*.a.ssl.proxy.net' doesn't match requested host name `pypi.python.org'.
To connect to pypi.python.org insecurely, use `--no-check-certificate'.
Unable to establish SSL connection.


So curl seems to work better.  Maybe because it is not checking SSL certs?
# curl -O "https://pypi.python.org/packages/source/s/setuptools/setuptools-1.0.tar.gz"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  663k  100  663k    0     0   229k      0  0:00:02  0:00:02 --:--:--  359k


Once you have downloaded and untar'ed setuptools, you want to install it.  But it seems the setuptools instructions appear flawed:
# /opt/python2.7.5/bin/python setup.py --prefix=/opt/setuptools
error: option --prefix not recognized


So just do the usual "python setup.py build and install" two-step.  Be sure you are calling the desired target python version.
/opt/python2.7.5/bin/python setup.py build
/opt/python2.7.5/bin/python setup.py install

Now you can install pip:
curl -O https://pypi.python.org/packages/source/p/pip/pip-1.4.1.tar.gz
tar xvzf pip-1.4.1.tar.gz
cd pip-1.4.1
/opt/python2.7.5/bin/python setup.py build
/opt/python2.7.5/bin/python setup.py install

And now... now... NOW!  Praise the Almighty, you are ready to use pip to install packages!!!

/opt/python2.7.5/bin/pip install numpy   # install this before biopython
/opt/python2.7.5/bin/pip install scipy
/opt/python2.7.5/bin/pip install biopython
/opt/python2.7.5/bin/pip install pysam


Tuesday, August 20, 2013

Roll call of my preferred bioinformatic tools (NGS focus)

A roll call of my preferred bioinformatic tools

A somewhat useless post, except that it helps me to remember a few of the lesser used (but occasionally very, very handy) pieces of software, like FLASH of Flexbar.



Thursday, July 25, 2013

SSHFS: How to Mount Remote Partition via SSH on CentOS

This is hugely helpful for anyone wanting mount-point accessible files over SSH. 

Don't bother with fstab or the mount command.  Full instructions here, and the executive summary is:

yum install fuse sshfs
sshfs user@remote_host:/remote_directory /local_mount_partition