Wednesday, February 27, 2013

Setting up JBrowse 1.8.1 on CentOS 5.8

Some important details to successfully set up JBrowse on a CentOS 5.8 machine.

Install and compile samtools from source.  I built it with the -fPIC flag, like this:

make CXXFLAGS=-fPIC CFLAGS=-fPIC CPPFLAGS=-fPIC

Be sure that the libbam.a you just built with -fPIC is also present in the lib directory.  I had to rename the old one and copy the new one there.


# ls -al /usr/local/samtools/lib*
-rw-rw-r-- 1 odsolbe odsolbe 309986 Feb 27 09:51 /usr/local/samtools/libbam.a

/usr/local/samtools/lib:
total 1316
drwxr-xr-x 2 odsolbe odsolbe    4096 Feb 27 09:54 .
drwxr-xr-x 7 odsolbe odsolbe    4096 Feb 27 09:51 ..
-rw-rw-r-- 1 odsolbe odsolbe  309986 Feb 27 09:54 libbam.a
-rw-r--r-- 1 odsolbe odsolbe 1000346 Oct 24  2011 libbam.a_orig


Set two environment variables, SAMTOOLS and PERL5LIB:

export PERL5LIB=/path/to/...../jbrowse/extlib/lib/perl5
export SAMTOOLS=/usr/local/samtools

This is what worked for me.  Your mileage may vary.

Friday, February 1, 2013

What your father never told you about tophat

Following up on yesterday's posting about the "right" way to run cufflinks, I am here recording my experiences running tophat with and without the --no-novel-juncs setting.

Same dataset as described in the previous posting, still using the Illumina iGenomes Ensembl annotation files. The command I used looked like this (note: the transcriptome index was pre-built during a previous run):

tophat [--no-novel-juncs] -G genes.gtf --transcriptome-index=transcriptome -p 8 -o tophat_out file_1.fq.gz file_2.fq.gz

Mean run time for the three replicates (which have very nearly the same number of reads, about 19 million each):
Default (with novel junction finding): 160 minutes
With --no-novel-juncs option:  150 minutes

No big time savings from turning off novel junction finding.  No difference in the mean correlation of genes.fpkm_tracking from the resultant cufflinks run.

Also, I tried --no-novel-juncs with 8, 16, and 24 processors, on a server with 32 cores (not hyperthreaded).

8 procs = 150 minutes
16 procs = 134 minutes
24 procs = 121 minutes

Tophat basically parallelizes the bowtie2 alignment stage, but there is a lot of time spent reading and writing files, which is single-threaded.
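
As a rough sanity check on that claim, here is a back-of-the-envelope two-point fit of T(p) = serial + work/p to the 8- and 24-core timings above (my own crude model, not anything tophat reports), in R:

# Crude Amdahl-style fit of T(p) = serial + work/p to the timings above
t8 <- 150; t24 <- 121                  # minutes at 8 and 24 processors
work   <- (t8 - t24) / (1/8 - 1/24)    # 348 min of parallelizable work
serial <- t8 - work / 8                # 106.5 min effectively single-threaded
serial + work / 16                     # predicts ~128 min at 16 procs (observed: 134)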

I have, in the past, had problems with tophat dying when running with too many processors (24, for instance), but the failure is not predictable:


[2013-02-05 08:08:09] Mapping right_kept_reads.m2g_um_seg2 to genome segment_juncs with Bowtie2 (2/4)
[2013-02-05 08:08:34] Mapping right_kept_reads.m2g_um_seg3 to genome segment_juncs with Bowtie2 (3/4)
[2013-02-05 08:09:01] Mapping right_kept_reads.m2g_um_seg4 to genome segment_juncs with Bowtie2 (4/4)
[2013-02-05 08:09:25] Joining segment hits
[2013-02-05 08:11:22] Reporting output tracks
        [FAILED]
Error running /usr/local/tophat-2.0.7.Linux_x86_64/tophat_reports --min-anchor 8 --splice-mismatches 0 --min-report-intron 50 --max-report-intron 500000 --min-isoform-fraction 0.15 --output-dir Sample_04_L005/ --max-multihits 20 --max-seg-multihits 40 --segment-length 25 --segment-mismatches 2 --min-closure-exon 100 --min-closure-intron 50 --max-closure-intron 5000 --min-coverage-intron 50 --max-coverage-intron 20000 --min-segment-intron 50 --max-segment-intron 500000 --read-mismatches 2 --read-gap-length 2 --read-edit-dist 2 --read-realign-edit-dist 3 --max-insertion-length 3 --max-deletion-length 3 -z gzip -p24 --inner-dist-mean 50 --inner-dist-std-dev 20 --gtf-annotations /data/genomes/iGenomes/Ensembl/transcriptome.gff --gtf-juncs Sample_04_L005/tmp/transcriptome.juncs --no-closure-search --no-coverage-search --no-microexon-search --sam-header Sample_04_L005/tmp/genome_genome.bwt.samheader.sam --report-discordant-pair-alignments --report-mixed-alignments --samtools=/usr/local/samtools/samtools --bowtie2-max-penalty 6 --bowtie2-min-penalty 2 --bowtie2-penalty-for-N 1 --bowtie2-read-gap-open 5 --bowtie2-read-gap-cont 3 --bowtie2-ref-gap-open 5 --bowtie2-ref-gap-cont 3 /data/genomes/iGenomes/Ensembl/genome.fa Sample_04_L005/junctions.bed Sample_04_L005/insertions.bed Sample_04_L005/deletions.bed Sample_04_L005/fusions.out Sample_04_L005/tmp/accepted_hits Sample_04_L005/tmp/left_kept_reads.m2g.bam,Sample_04_L005/tmp/left_kept_reads.m2g_um.mapped.bam,Sample_04_L005/tmp/left_kept_reads.m2g_um.candidates Sample_04_L005/tmp/left_kept_reads.bam Sample_04_L005/tmp/right_kept_reads.m2g.bam,Sample_04_L005/tmp/right_kept_reads.m2g_um.mapped.bam,Sample_04_L005/tmp/right_kept_reads.m2g_um.candidates Sample_04_L005/tmp/right_kept_reads.bam
Error: bam_merge failed to open BAM file Sample_04_L005/tmp/right_kept_reads.m2g_um.candidates15.bam



Thursday, January 31, 2013

data.table - my new favorite R package

I love the plyr package and use ddply very often.  But it becomes ungainly with big data.

I found out about data.table today.  Basically, it's indexed data frames, allowing binary search operations.
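
To make that concrete, here is a minimal sketch (the table and its size are made up for illustration) contrasting a data.frame-style vector scan with a keyed binary-search lookup:

library(data.table)

# A made-up table: 10 million rows of (id, value) pairs
dt <- data.table(id = sample(1e6, 1e7, replace = TRUE), value = rnorm(1e7))

# Vector scan, the data.frame way: examines every row
dt[dt$id == 42, ]

# Key the table (sorts it by id), then look up with binary search
setkey(dt, id)
dt[J(42)]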

Their quick start guide has this footnote:

"We wonder how many people are deploying parallel techniques to code that is vector scanning"

(Yep, that was me.  And yes, crazy that I am only discovering data.table in 2013.)

But I see the light now!  Take me to the water and set me down!

Wednesday, January 30, 2013

What your mother never told you about cufflinks

In this post, I am trying to nail down the "best" way to run cufflinks, part of the so-called "tuxedo suite" of RNA-Seq tools (tophat, cufflinks, etc.)

I have a dataset composed of 3 technical replicates: exactly the same library, run in 3 separate HiSeq lanes (multiplexed along with other samples, but I am ignoring those for now). They were run on an Illumina HiSeq 2000, with 101 bp paired-end reads.

Number of reads, before and after quality trimming and preprocessing:

Replicate   Before      After
1           19,939,455  19,074,425
2           19,929,909  19,120,424
3           19,950,587  19,101,188

You can see that the lanes were very uniform in the number of reads.

For all of the following commands, I am using the iGenomes files (for the bowtie2 index, the GTF file, and the genome.fa file), available here: 
http://ccb.jhu.edu/software/tophat/igenomes.shtml

I ran tophat like this:
tophat -G genes.gtf --transcriptome-index=transcriptome -p 8 -o out genome rep1_1.fq.gz rep1_2.fq.gz

Cufflinks was run like this:
cufflinks -p 24 --no-effective-length-correction -b genome.fa -G genes.gtf -o out out/accepted_hits.bam

The two experimental variables are the effective length correction (toggled off with --no-effective-length-correction) and the fragment bias correction (enabled with -b genome.fa); the command above shows both options in use.

The cufflinks online manual says (note: "Cuffdiff" in the first entry is a typo in the manual and should read "Cufflinks"):


--no-effective-length-correction    Cuffdiff will not employ its "effective" length normalization to transcript FPKM.

and:

-b/--frag-bias-correct Providing Cufflinks with the multifasta file your reads were mapped to via this option instructs it to run our bias detection and correction algorithm which can significantly improve accuracy of transcript abundance estimates. See How Cufflinks Works for more details.
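
For intuition (my gloss, not the manual's): FPKM is fragments per kilobase of transcript per million mapped fragments, and the effective length correction replaces the annotated transcript length with a shorter "effective" length that accounts for fragment size. A toy calculation in R, using made-up numbers and the simplest effective-length formula:

# Toy FPKM arithmetic; all numbers are made up for illustration
frags   <- 500      # fragments assigned to one transcript
total   <- 19e6     # total mapped fragments in the library
tlen    <- 1500     # annotated transcript length (bp)
fraglen <- 200      # mean fragment length (bp)

fpkm_plain <- frags / (tlen / 1000) / (total / 1e6)      # ~17.5
eff_len    <- tlen - fraglen + 1                         # simplistic effective length
fpkm_eff   <- frags / (eff_len / 1000) / (total / 1e6)   # ~20.2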

For each run, I sorted the resulting genes.fpkm_tracking files, pulled out the FPKM column, and calculated the mean of the pairwise Pearson correlations for the 3 replicates (a sketch of this step appears at the end of this post). Here are the results:


Effective length corr   Frag bias corr   Time     Correlation
Yes                     Yes              43 min   0.9790755
Yes                     No               16 min   0.9749819
No                      Yes              43 min   0.9999465
No                      No               16 min   0.9999626


Note: the effective length correction is on by default, whereas the frag bias correction is off by default.

The best way to run cufflinks is with --no-effective-length-correction!!!

frag-bias-correction makes the analysis much slower, without any discernible benefit in this trial run.
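
For the record, the correlation step went roughly like this (a minimal sketch; the paths are illustrative, and it assumes the sorted genes.fpkm_tracking files list the genes in the same order):

# Illustrative paths; each genes.fpkm_tracking already sorted so rows line up
files <- c("run1/genes.fpkm_tracking",
           "run2/genes.fpkm_tracking",
           "run3/genes.fpkm_tracking")
fpkm <- sapply(files, function(f) read.delim(f)$FPKM)

cm <- cor(fpkm)           # 3x3 matrix of pairwise Pearson correlations
mean(cm[upper.tri(cm)])   # mean of the 3 pairwise correlations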

Tuesday, October 23, 2012

Amazed by RStudio!

RStudio (http://www.rstudio.com/) has to be one of the coolest pieces of software I have installed in recent memory.

RStudio provides a very well-planned and well-executed IDE for R.  This alone would be a reason to sing Hosanna.


But it gets better!!!  The Server package (deb and rpm packages available) gives you access to this same interface via a web server, so you can run R on a beefy computational server from any web browser.  And it seems to work every bit as nicely as the locally installed IDE.

The user accounts are integrated with the Linux user accounts, which is very nice.  I even got it to work (after reading this post) in a Kerberos situation by overwriting the "rstudio" PAM file that the RStudio package provides with the login one.  (You might want to back up the rstudio one first.)

# cd /etc/pam.d/
# cp login rstudio

Well done and thank you to the folks at the RStudio company.