Tuesday, March 27, 2012

GNU parallel -- the best thing since sliced bread


When I first found GNU parallel last fall, I thought it was the bee's knees. So well conceived, so well executed, so well documented... so generous to the user, and most of all, so darn handy.


I've been using it extensively with the ::: notation to feed data files into a pipeline in parallel, and it does exactly what I want it to do.

But today it reached a whole new level, as I realized there is another side to the tool: the ability to feed parts of a single input file into a pipeline in parallel.

For example, BLAST can be effectively (embarrassingly) parallelized as follows:
cat temp.fa | parallel -j 10 --block 100k --recstart '>' --pipe blastn -evalue 0.01 -outfmt 6 -db refseq_genomic -query - > result.txt

The magic is in the --pipe option. Of course, we've all written BLAST parallelizers, but I've never seen a solution as neat and simple as this one-liner!

This is from Michael Kuhn's excellent post on the subject.  Thanks!
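To see what --recstart '>' buys you, here's a toy sketch of my own (the file and sequence names are made up): chunks may only break at FASTA headers, so no record is ever split between workers. Here awk stands in for parallel's internal splitter:

```shell
# Toy FASTA input (hypothetical sequences).
printf '>seq1\nACGT\n>seq2\nGGCC\nTTAA\n>seq3\nAT\n' > toy.fa

# Split only at lines starting with '>', the record boundary that
# --recstart '>' tells parallel to respect; every record lands whole
# in its own chunk, just as parallel hands whole records to each job.
awk '/^>/{n++} {print > ("chunk" n ".fa")}' toy.fa

head -1 chunk2.fa    # >seq2 -- the record stayed intact
```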

Monday, March 19, 2012

Of bash and streams and pipes - One Tee To Rule Them All

There is a great answer on Stack Overflow showing how to feed one output stream into multiple processes. It is a bit of a hack of the excellent tool tee, which I've been using for some time. Normally, tee reads stdin and writes it to a file, while passing it along to stdout. The solution provided on S.O. shows how to exploit the bash trick >(process) -- so as far as tee knows, it is writing to two or more files, but the "files" are actually background processes.
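A minimal sketch of the trick (the file names are mine): one stream feeds two consumers at once, and as far as tee can tell it is writing to ordinary files:

```shell
# tee writes to the two >(...) "files", each really a background process.
printf 'a\nb\nc\n' \
  | tee >(wc -l > line_count.txt) >(tr 'a-z' 'A-Z' > upper.txt) \
  > /dev/null

# The substituted processes finish asynchronously; crude sync for the demo.
sleep 1

cat upper.txt    # A, B, C on separate lines; line_count.txt holds 3
```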

This syntax is bash-dependent, so it won't work in sh, and this will likely come up if you're using system() in, say, perl. There is another excellent answer on Stack Overflow on how to deal with that.
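A sketch of the workaround (file names are mine): since system() typically hands the command to /bin/sh -c, name bash explicitly so the >(...) syntax is understood:

```shell
# Under plain sh (e.g. dash) the >(...) below is a syntax error;
# invoking bash -c makes the process substitution available anywhere
# you can name the shell, including perl's system("bash", "-c", ...).
bash -c 'printf "hello\n" | tee >(cat > copy.txt) > tee_out.txt'
sleep 1    # the substituted cat finishes asynchronously
```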

Side note:  Using the <(command) trick, my post from March 16 can now be written:

paste <(grep -v "#" hsa.gff | cut -f 1,4,5,9) <(grep -v "#" hsa.gff | cut -f 6,7) | sed 's/^/chr/' >mirbase18.bed

 

Friday, March 16, 2012

mirbase gff annotations in bed format

The mirbase.org folks in Manchester are doing a great job, I think, but they only release their annotations in GFF format. I think the following will convert that into an acceptable BED-like file (one caveat: GFF coordinates are 1-based while BED starts are 0-based, so strict BED consumers would expect the start column decremented by one):
## get mirbase18 annotation and convert to a bed like format
wget ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff
grep -v "#" hsa.gff | cut -f 1,4,5,9 >mirbase18.temp1
grep -v "#" hsa.gff | cut -f 6,7 >mirbase18.temp2
paste mirbase18.temp1 mirbase18.temp2 | sed 's/^/chr/' >mirbase18.bed

Extra credit: Is there a clever shell trick for piping two separate streams into paste, thus avoiding the temp files?

Thursday, March 8, 2012

2012 NAR Database Summary (Category List)

Bioinformatic databases grow like kudzu, so I'll no doubt be referring to this directory in the coming year.

http://www.oxfordjournals.org/nar/database/c/

 

How to get the entire recursive directory structure of an FTP site using wget

How can I use wget to get a recursive directory listing of an entire FTP site?

yes, wget --no-remove-listing ftp://myftpserver/ftpdirectory/ will get and retain a directory listing for a single directory (which was discussed here)

but, wget -r --no-remove-listing ftp://myftpserver/ftpdirectory/ will recurse and download an entire site

so, if you want to get the recursive directory structure, but not download the entire site, try wget -r --no-remove-listing --spider ftp://myftpserver/ftpdirectory/
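Once the spider run finishes, each mirrored directory holds a .listing file. As a sketch (the listing below is fabricated in the classic ls -l style many FTP servers emit), the names can be pulled back out with awk, marking directories with a trailing slash:

```shell
# Fabricated .listing (purely illustrative).
cat > .listing <<'EOF'
drwxr-xr-x   2 ftp ftp     4096 Mar 08  2012 genomes
-rw-r--r--   1 ftp ftp   123456 Mar 08  2012 hsa.gff
EOF

# Field 9 is the name; a leading 'd' in the mode marks a directory.
awk '{ name = $9; if ($1 ~ /^d/) name = name "/"; print name }' .listing
```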