Tuesday, October 23, 2012

Amazed by RStudio!

RStudio (http://www.rstudio.com/) has to be one of the coolest pieces of software I have installed in recent memory.

RStudio provides a very well-planned and well-executed IDE for R.  This alone would be a reason to sing Hosanna.


But it gets better!!!  The Server package (deb and rpm packages available) allows you access to this same interface via a webserver, so you can run R on a beefy computational server from any web browser.  And it seems to work every bit as nicely as the locally installed IDE.

The user accounts are integrated with the linux user accounts, very nice.  I even got it to work (after reading this post) in a kerberos situation by overwriting the "rstudio" pam file that the RStudio package provides with the login one.  (You might want to back up  the rstudio one first).

# cd /etc/pam.d/
# cp login rstudio

Well done and thank you to the folks at the RStudio company.

Tuesday, October 16, 2012

How to demultiplex fastq files with a dedicated, separate barcode file

Oh, Barcodes, How do I obtain thee?  Let me count the ways:
1) Internal to one of the reads
2) Printed in the header:
3) Encoded in the header
4) In a separate file

There are many tools out there to deal with multiplexed read data.
fastx_barcode_splitter.pl (from FASTX_Toolkit)
fastq-multx (from ea-utils)
split_libraries_fastq.py (from QIIME)

As far as I know, none of the above adequately deal with case (4) above.  The QIIME tools does, but appears to turn the fastq into a fasta file in the process (?).

Here is a one-liner homebrew solution to this situation:  just tack the index reads onto the front of read1, and then use an off-the-shelf tool (I like the FASTX_Toolkit one above) to demultiplex.  Like this:

paste -d '' <(echo; sed -n '1,${n;p;}' R2.fq | sed G) R1.fq | sed '/^$/d' | fastx_barcode_splitter.pl --bol --bcfile barcodes.txt --prefix debarcoded

I am assuming that R2.fq is the barcode fastq file, and R1.fq is the is the Read 1 file.  The commands above take the 2nd and 4th lines from the barcode fastq file, spaces them out correctly, pastes them together (with no delimiter - note: '' is a double single-quote, not a single double-quote), and then uses another paste command to pre-pend THAT result to the R1 file.  Also there is a final sed command to remove the trailing blank lines.  It ain't pretty, but it works.

The fancy <(  ) constructs is a bash thing called "process substitution".  If you're not using bash, it won't work.

Hope this helps someone.

(edit:  I simplified the initial sed command on 2/1/2013, so let me know if it somehow broke, but it should still work fine.)

Thursday, October 11, 2012

1000 Genomes Project Accessibility Mask (BED file)


The 1000 genomes project publishes a .bed file mask ("Strict Accessibility Mask") for regions of the human genome where it is difficult to place reads, due to repetitivity.  Follow the link: