Monday, December 2, 2013

pip install from github at a specific branch

I had difficulties figuring out how to use pip to install a specific branch from a github repo.  Because of my corporate firewall, all of the other methods that you can find around on various websites did not work for me.  This is the method that worked:

pip install git+ssh://git@github.com/h3/django-dajaxice.git@django-1.5+

Of course, I have http_proxy and https_proxy set and exported.

Wednesday, August 28, 2013

Making sense of Python version madness and pip/setuptools/virtualenv game

This posting is motivated by cases where you are stuck using an older server OS (CentOS 5.8 in this case) and want to install an up-to-date python 2.7 along with a raft of necessary packages.  What is the best way?

distribute? pip? setuptools? virtualenv?  What to make of all this?

Bottom line: use pip.  To install pip, you either install it from source, or else install virtualenv (which includes pip).  Installing from source allows you to manage packages with pip globally on your system (and not in a virtual environment), then you have to install pip from source.  But to install pip yourself, you first need to install setuptools.  All of these installs should be done using the target python interpreter.   Got that?

I wrote the following notes after digesting the helpful hints given at a variety of places, such as these two gist pages.  Note that although there are many tutorials that still guide you to install virtualenv with easy_install (such as here or here), we're actually not supposed to do it this way any longer.  My notes that follow will not make use of virtualenv, but in some situations this might be desirable -- this is just for a system-wide fresh installation of Python 2.7 and pip to install packages. (Here is a nice overview of setting up virtualenv if you want to go that route.)

First you want to be sure your CentOS machine is ready to built something like Python:

yum groupinstall "Development tools"
yum install zlib-devel bzip2-devel openssl-devel ncurses-devel
Then you will fetch and unzip Python:
wget http://www.python.org/ftp/python/2.7.5/Python-2.7.5.tar.bz2
tar xf Python-2.7.5.tar.bz2
cd Python-2.7.5
Do the usual "configure, make, make install" waltz: 
 
./configure --prefix=/opt/python2.7.5 --with-threads --enable-shared
make
make install
At this point, it seems like you're done but not quite.  
$ /opt/python2.7.5/bin/python /opt/python2.7.5/bin/python: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory
/opt/python2.7.5/
 
The trick is to tell ldconfig where to find the new python shared libraries:  
echo "/opt/python2.7.5/lib/" > /etc/ld.so.conf.d/opt-python2.7.5.conf
ldconfig 
 
Now we're cooking with gas: 
$ /opt/python2.7.5/bin/python 
Python 2.7.5 (default, Aug 21 2013, 09:50:18) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux2

Get setuptools from the setuptools project page.  You have to click on the version you want and then to find the download link, scroll to the very bottom of the page ("In the basement, down a long hallway behind a door marked 'danger-TIGER!'")

Aside: I am having bad luck these days trying to wget from https servers -- something about SSL certificates and my corporate proxy and the old version of wget that is on CentOS 5.8.

# wget "https://pypi.python.org/packages/source/s/setuptools/setuptools-1.0.tar.gz"
--2013-08-21 11:49:47--  https://pypi.python.org/packages/source/s/setuptools/setuptools-1.0.tar.gz
Resolving proxy.company.com... 

Connecting to proxy.company.com:8080... connected.
ERROR: certificate common name `*.a.ssl.proxy.net' doesn't match requested host name `pypi.python.org'.
To connect to pypi.python.org insecurely, use `--no-check-certificate'.
Unable to establish SSL connection.


So curl seems to work better.  Maybe because it is not checking SSL certs?
# curl -O "https://pypi.python.org/packages/source/s/setuptools/setuptools-1.0.tar.gz"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  663k  100  663k    0     0   229k      0  0:00:02  0:00:02 --:--:--  359k


Once you have downloaded and untar'ed setuptools, you want to install it.  But it seems the setuptools instructions appear flawed:
# /opt/python2.7.5/bin/python setup.py --prefix=/opt/setuptools
error: option --prefix not recognized


So just do the usual "python setup.py build and install" two-step.  Be sure you are calling the desired target python version.
/opt/python2.7.5/bin/python setup.py build
/opt/python2.7.5/bin/python setup.py install

Now you can install pip:
curl -O https://pypi.python.org/packages/source/p/pip/pip-1.4.1.tar.gz
tar xvzf pip-1.4.1.tar.gz
cd pip-1.4.1
/opt/python2.7.5/bin/python setup.py build
/opt/python2.7.5/bin/python setup.py install

And now... now... NOW!  Praise the Almighty, you are ready to use pip to install packages!!!

/opt/python2.7.5/bin/pip install numpy   # install this before biopython
/opt/python2.7.5/bin/pip install scipy
/opt/python2.7.5/bin/pip install biopython
/opt/python2.7.5/bin/pip install pysam


Tuesday, August 20, 2013

Roll call of my preferred bioinformatic tools (NGS focus)

A roll call of my preferred bioinformatic tools

A somewhat useless post, except that it helps me to remember a few of the lesser used (but occasionally very, very handy) pieces of software, like FLASH of Flexbar.



Thursday, July 25, 2013

SSHFS: How to Mount Remote Partition via SSH on CentOS

This is hugely helpful for anyone wanting mount-point accessible files over SSH. 

Don't bother with fstab or the mount command.  Full instructions here, and the executive summary is:

yum install fuse sshfs
sshfs user@remote_host:/remote_directory /local_mount_partition

Thursday, May 23, 2013

RNA-Seq analysis: pitfalls to avoid


Here I am writing out some of the details of my tophat/cufflinks RNA-Seq workflow.  I will try to concisely summarize what works and what pitfalls to avoid.  None of what follows is particularly cutting edge, but since these analyses often involve days of computation, each and every trial-and-error glitch adds up and sets you back.  This document is intended to help you avoid such setbacks.

I. Get reference genome and annotation

The first step is to obtain the necessary reference files: the reference genome sequence (FASTA file) and the transcriptome annotation (GFF or GTF file.)  In the past, I have found these files from one of the main purveyors of public human genome data: NCBI, UCSC, and Ensembl.  In principle, it should be possible to use the GFF (or GTF) files from any of these sources, but in practice, there are subtleties about the formatting of these "standard" files that can make them not play well with tophat/cufflinks (more on this in the next paragraph).  So I have found it is better to use Illumina's "iGenome" downloads, neatly indexed on the cufflinks site.  Note that iGenome is not offering new annotations -- they are just repackaging the annotations created by the big public projects in a standardized format.  Also, you are getting not only the annotations, but also the genome sequence and even a pre-built bowtie2 index -- which may save you a couple hours!

A note here about the annotation formats.  GFF/GTF files are available from pretty much all the sources listed above, but the formats are loose, not well adhered to, and thus the annotations created by these institutions are not all interchangeable.  The GFF2 format was defined by Sanger and is also embraced by UCSC.  Two attempts to improve GFF2 include GTF (defined by the Brent Lab at WUSTL) and GFF3 (described by SequenceOntology).  Of these two updates, GTF is the more incremental (the difference is summarized by UCSC here and has also been referred to as "GFF2.5").  GFF3 is more of a revamping (The SequenceOntology site states "The GFF format, although widely used, has fragmented into multiple incompatible dialects... The proposed GFF3 format addresses the most common extensions to GFF, while preserving backward compatibility with previous formats.")

In principle, TopHat will accept GTF or GFF3 annotations.  GTF seems to be somewhat more "native" to Tophat and iGenome uses GTF format, so this is a good thing.

NB: be prepared to wait a while on those iGenome downloads, they are big downloads.  I recommend downloading them directly to your server with wget -- just copy the link from the download page.  For example:

wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Homo_sapiens/Ensembl/GRCh37/Homo_sapiens_Ensembl_GRCh37.tar.gz

II. Quality check and filter raw reads (rinse and repeat)

Recent software improvements to Illumina NGS platforms means that data quality control is much better than in the "early" days.  But it is still necessary to check the quality of your incoming data, and (possibly) run a filter to remove low quality reads, low quality ends-of-reads, too short reads, and reads (or portions of reads) that have been "contaminated" by oligos used in the library prep.  These contaminating reads might be due to primer-dimer, inserts shorter than the read-length, etc.

For QC, I typically use FastQC in command-line mode, very simply like this:
fastqc -o fastqc_dir rawdata/Sample*fq.gz

This will generate a webpage with many useful plots, most important of which is the overall mean quality per base position (shown at right).   It is normal for the quality to fall at the end, but be wary if there is very much of the quality distribution down bellow 25-30.

The other really useful piece of info returned by FastQC is a list of "overrepresented sequences."  Such sequences are automatically checked against the included contaminants_list.txt file -- definitely have a look at this -- library sequence artifact, if common, will sink your analysis.

For cleaning data, I typically use cutadapt (removes contaminating sequences) and the FASTX-Toolkit (trims and filters.)

You can (and should) take advantage of Linux's ability to pipe commands together.  Many commands are written to read from (and write to) STDOUT.  You can read the links if you want to learn more, but in practice, this might look like this:

cutadapt -O 5 -b  CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT Sample01_1.fq.gz | fastx_quality_trimmer -Q33 -t 25 -l 50 | fastq_quality_filter -Q33 -q 25 -p 80 -z > Sample01_filtered_1.fq.gz


III. TopHat (map reads)

Now that you have the annotations, the reference genome sequence, and the bowtie2 index, and you have quality-filtered your data, you are ready to run TopHat.  Here is your basic TopHat command:

tophat -G genes.gtf --transcriptome-index=transcriptome -o out -p 6 --no-novel-juncs genome file1_1.fq.gz file1_2.fq.gz

Where:
  • genes.gtf is the GTF annotation (it can be found at Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf in iGenomes)
  • transcriptome is the base name of the bowtie2 index of the transcriptome.  This file does not exist the first time you do an analysis -- it is created by TopHat on the fly (takes about an hour) and subsequent runs will reuse this index and save time.  If you have more than 1 file to process (and surely you do) you definitely want to use this option.  You can put this file anywhere you want, but I usually tuck it in here: Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/transcriptome 
  • genome is the base name of the bowtie2 index of the genome (it can be found at Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome if you use the directory structure that comes from the iGenome download)
  • Your output file is designated by "out"
  • The number of process threads to use is set by -p.  I recommend setting it to somewhat less than the number of threads available (on an 8 processor system, use 6 or 7.  On a 32 processor system, I typically use 24-28, depending on what else is going on.  In my experience, it is not better to use all of the processors -- the jobs will actually run slower!)
One TopHat option you may want to consider is the --no-novel-juncs option.  Detection of novel junctions (ie, genes that are not described in the provided annotations) is a fundamental part of how TopHat works.  You can even run TopHat without any annotation at all!  As I described in a previous posting, turning off novel-junction detection doesn't really save you very much time -- so why would you want to turn it off?  Well, here is one possible reason.  Say you have dozens or even hundreds of samples to process, with 100s of millions of reads for each sample.  If novel junction discovery is used it will be performed for each sample independently.  This creates a problem when it is time to compare samples -- you have to use the cuffmerge utility to merge the annotations back together and re-compute the transcript quantitation estimates.  In my experience, this can be very, very time and memory consuming.  So, unless you really are looking for novel transcripts (maybe you are working with a poorly characterized tissue or a non-model organism), I suggesting turning this function off and just leaning on the annotations provided by Ensemble.

If you have many files to process, consider batch processing them with GNU parallel.  I use parallel all the time, even when I am not parallelizing the task at hand.  (Yes, I realize xargs would be a purer tool for such cases, but I like the bells and whistles that parallel comes with.)  It's just a handy way of processing many files in the same way.  You will want to read the man page for parallel, the syntax has a learning curve, but it is a huge help once you get the hang of it!

Putting it all together:  Your command might look something like this:

parallel -j 1 "tophat -G Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf --no-novel-juncs --transcriptome-index=Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/transcriptome -p 6 -o {/} Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome {/}_1.fq.gz {/}_2.fq.gz" ::: `ls /path/to/data/sample_*_1.fq.gz | sed 's/_1.fq.gz//'`


IV. Cufflink (or, alternatives)

The next step in the analysis pipeline is to take the BAM of aligned reads and estimate transcript abundances.  Cufflinks is the "standard" (as if there was one) next step after TopHat.  Cufflinks takes the alignment BAM files produced by TopHat and computes the transcript abundances -- summarized both at the gene level and at the transcript (ie, alternative splicing isoform) level as "FPKM" - fragments per kilobase per million reads.

Cufflinks is probably the trickiest step in the analysis.  A note about "no-effective-length-correction": This Cufflinks feature has been discussed by many folks all over, most notably this discussion on SeqAnswers involving the author himself.  Previously, I noted that turning off "effective length correction" reduces the variance in the transcript quantitation in a set of technical replicates. Also, in that same post, I note that using the "bias correction" option doesn't seem to help much (and it takes a lot longer.)  Note that "effective length correction" is something more than just lengh-correction.  Everyone corrects for transcript length

A typical cufflinks command looks like this:

cufflinks -p 6 --no-effective-length-correction -G genes.gtf -o out out/accepted_hits.bam

The genes.gtf file is same one used for the TopHat command and the BAM file you provide is the one that was made by the TopHat command.  The two key files produced by Cufflinks are:
genes.fpkm_tracking
isoforms.fpkm_tracking
These are basically long flat files, really mainly consisting of two key columns: the Gene ID and the FPKM.  There are a lot of other columns providing info on the gene symbol, location, the confidence interval, etc.

V. Statistical testing

If all you want out of your RNASeq is a simple DE analysis, the recommended next step is Cuffdiff.  If you are coming to RNA-Seq from microarray analysis, like I was, you may be tempted to use these FPKMs just like array spot intensities, and choose Fisher's exact test to see if two populations have a significantly different level of gene expression.  However, there are fundamental statistical reasons why this is not okay.  RNA-Seq gene abundance is count data, and the distribution of count data is fundamentally different from that of spot intensities.  There is a lot written about this on the internet:  Simon Anders has been tirelessly preaching the gospel -- if you are new to this, you should definitely read these posts:
Calculating p-values from FPKM?  (particularly post #6)
Biological replicates for RNA-seq
Multiple DGE libraries comparison. (EdgeR baySeq DESeq)
Differential gene expression: Can Cufflinks/Cuffcompare handle biological replicates?
RNA-seq output
DEG for paired samples, biological replicates

VI. Alternatives to Cufflinks

You do have alternatives to Cufflinks and it's smart to check them out.  Options include HTSeq2 (with is an R/Bioconductor based update to the Python based HTSeq) and eXpress.  Both of these will accept the BAM file produced by TopHat as input.  In my hands, with the dataset of technical & biological replicates, I have found HTSeq2 produces estimates with slightly less variance than Cufflinks.  I don't have enough experience with eXpress but I will check it out and update this post asap.

Wednesday, May 1, 2013

Indispensable bookmarks for fixing computing annoyances

This list will no doubt grow...

Understanding GNU Emacs and Tabs
Executive summary: put this in your .emacs.el file:

(setq-default major-mode 'text-mode)
(define-key text-mode-map (kbd "TAB") 'self-insert-command); 


ggplot2: Changing the Default Order of Legend Labels and Stacking of Data
Executive summary:

ggplot(diamonds, aes(clarity, fill=cut, order= -as.numeric(cut))) 

actually, this is better, available in recent versions of ggplot2:
guides(color = guide_legend(reverse = TRUE))



Tuesday, April 9, 2013

Getting yum to work with RHEL / RHN behind a proxy


/etc/sysconfig/rhn/up2date

enableProxy=0
httpProxy=proxyserver.company.com:8080


~/.bash_profile

export http_proxy=http://proxyserver.company.com:8080
export HTTP_PROXY=http://proxyserver.company.com:8080


The thing is, there is no universal agreement among linux command line programs, some seem to use HTTP_PROXY and some use http_proxy.  So, yes, export http_proxy both as lower and UPPER case!

And complicating the picture is that RHN doesn't use either of these -- it pulls the settings out of /etc/sysconfig/rhn/up2date 


Wednesday, March 6, 2013

command line reverse complement sequence in linux shell

Sometimes it seems like BSD / GNU / Linux folks thought of everything.

echo AATGATACGGCGACCACCGG | rev | tr ATGC TACG

I'm a bowtie2 partisan too

I mean, I totally agree with the conclusions, but, um, is there a conflict of interest here?  Just sayin.



"... Our results clearly showed the significant reduction in memory footprint and runtime provided by FM-index based aligners at a precision and recall comparable to the best hash table based aligners. Furthermore, the recently developed Bowtie 2 alignment algorithm shows a remarkable tolerance to both sequencing errors and indels, thus, essentially making hash-based aligners obsolete."

Citation: Lindner R, Friedel CC (2012) A Comprehensive Evaluation of Alignment Algorithms in the Context of RNA-Seq. PLoS ONE 7(12): e52403. doi:10.1371/journal.pone.0052403
Editor: Steven L. Salzberg, Johns Hopkins University, United States of America
Received: August 30, 2012; Accepted: November 16, 2012; Published: December 26, 2012


Wednesday, February 27, 2013

Setting up JBrowse 1.8.1 on a CentOS 5.8

Some important details to successfully set up JBrowse on a Centos 5.8 machine.

Install and compile samtools from source.  I built it with fPIC flags like this:

make CXXFLAGS=-fPIC CFLAGS=-fPIC CPPFLAGS=-fPIC

Be sure that the file libbam.a that you just created with the fPIC flag is (also) present in the lib directory.  I had to rename the old one and copy the new one there.


# ls -al /usr/local/samtools/lib*
-rw-rw-r-- 1 odsolbe odsolbe 309986 Feb 27 09:51 /usr/local/samtools/libbam.a

/usr/local/samtools/lib:
total 1316
drwxr-xr-x 2 odsolbe odsolbe    4096 Feb 27 09:54 .
drwxr-xr-x 7 odsolbe odsolbe    4096 Feb 27 09:51 ..
-rw-rw-r-- 1 odsolbe odsolbe  309986 Feb 27 09:54 libbam.a
-rw-r--r-- 1 odsolbe odsolbe 1000346 Oct 24  2011 libbam.a_orig


Set two env variables: SAMTOOLS and PARL5LIB (see here)

export PERL5LIB=/path/to/...../jbrowse/extlib/lib/perl5 export SAMTOOLS=/usr/local/samtools

This is what worked for me.  Your milage may vary. 

Friday, February 1, 2013

What your father never told you about tophat

Following up on yesterday's posting about the "right" way to run cufflinks, I am here recording my experiences running tophat with and without the --no-novel-juncs setting.

Same dataset as described in the previous posting.  Still using the Illumina iGenomes Ensembl annotation files. The command I used was like this (note: the transcriptome index was pre-built during a previous run.)

tophat [--no-novel-juncs] -G genes.gtf --transcriptome-index=transcriptome -p 8 -o tophat_out file_1.fq.gz file_2.fq.gz

Mean run time for the three replicates (which have very nearly the same number of reads, about 19 million each):
Default (with novel junction finding): 160 minutes
With --no-novel-juncs option:  150 minutes

No big time savings from turning off novel junction finding.  No difference in the mean correlation of genes.fpkm_tracking from the resultant cufflinks run.

Also, I tried --no-novel-juncs with 8, 16, and 24 processors, on a server with 32 cores (not hyperthreaded).

8 procs = 150 minutes
16 procs = 134 minutes
24 procs = 121 minutes

Tophat basically parallelizes the bowtie2 alignment stage, but there is a lot of time spent reading and writing files, which is single-threaded.

I have, in the past, had problems with tophat dying when running with too many processors (24, for instance) but the failure is not predictable:


[2013-02-05 08:08:09] Mapping right_kept_reads.m2g_um_seg2 to genome segment_juncs with Bowtie2 (2/4)
[2013-02-05 08:08:34] Mapping right_kept_reads.m2g_um_seg3 to genome segment_juncs with Bowtie2 (3/4)
[2013-02-05 08:09:01] Mapping right_kept_reads.m2g_um_seg4 to genome segment_juncs with Bowtie2 (4/4)
[2013-02-05 08:09:25] Joining segment hits
[2013-02-05 08:11:22] Reporting output tracks
        [FAILED]
Error running /usr/local/tophat-2.0.7.Linux_x86_64/tophat_reports --min-anchor 8 --splice-mismatches 0 --min-report-intron 50 --max-report-intron 500000 --min-isoform-fraction 0.15 --output-dir Sample_04_L005/ --max-multihits 20 --max-seg-multihits 40 --segment-length 25 --segment-mismatches 2 --min-closure-exon 100 --min-closure-intron 50 --max-closure-intron 5000 --min-coverage-intron 50 --max-coverage-intron 20000 --min-segment-intron 50 --max-segment-intron 500000 --read-mismatches 2 --read-gap-length 2 --read-edit-dist 2 --read-realign-edit-dist 3 --max-insertion-length 3 --max-deletion-length 3 -z gzip -p24 --inner-dist-mean 50 --inner-dist-std-dev 20 --gtf-annotations /data/genomes/iGenomes/Ensembl/transcriptome.gff --gtf-juncs Sample_04_L005/tmp/transcriptome.juncs --no-closure-search --no-coverage-search --no-microexon-search --sam-header Sample_04_L005/tmp/genome_genome.bwt.samheader.sam --report-discordant-pair-alignments --report-mixed-alignments --samtools=/usr/local/samtools/samtools --bowtie2-max-penalty 6 --bowtie2-min-penalty 2 --bowtie2-penalty-for-N 1 --bowtie2-read-gap-open 5 --bowtie2-read-gap-cont 3 --bowtie2-ref-gap-open 5 --bowtie2-ref-gap-cont 3 /data/genomes/iGenomes/Ensembl/genome.fa Sample_04_L005/junctions.bed Sample_04_L005/insertions.bed Sample_04_L005/deletions.bed Sample_04_L005/fusions.out Sample_04_L005/tmp/accepted_hits Sample_04_L005/tmp/left_kept_reads.m2g.bam,Sample_04_L005/tmp/left_kept_reads.m2g_um.mapped.bam,Sample_04_L005/tmp/left_kept_reads.m2g_um.candidates Sample_04_L005/tmp/left_kept_reads.bam Sample_04_L005/tmp/right_kept_reads.m2g.bam,Sample_04_L005/tmp/right_kept_reads.m2g_um.mapped.bam,Sample_04_L005/tmp/right_kept_reads.m2g_um.candidates Sample_04_L005/tmp/right_kept_reads.bam
Error: bam_merge failed to open BAM file Sample_04_L005/tmp/right_kept_reads.m2g_um.candidates15.bam



Thursday, January 31, 2013

data.table - my new favorite R package

I love the plyr package and use ddply very often.  But it becomes ungainly with big data.

I found out about data.table today.  Basically, it's indexed data frames, allowing binary search operations.

Their quick start guide has this footnote:

"We wonder how many people are deploying parallel techniques to code that is vector scanning"

(Yep, that was me.  And yes, crazy that I am only discovering data.table in 2013.)

But I see the light now!  Take me to the water and set me down!

Wednesday, January 30, 2013

What your mother never told you about cufflinks

In this post, I am trying to nail down the "best" way to run cufflinks, part of the so-called "tuxedo suite" of RNA-Seq tools (tophat, cufflinks, etc.)

I have a dataset composed of 3 technical replicates -- exactly the same library, run in 3 separate hi-seq lanes (multiplexed along with other samples, but I am ignoring those for now.) They were run on an Illumina HiSeq 2000, with 101 bp paired-end reads.

Number of reads, before and after quality trimming and preprocessing:

Replicate   Before      After
1           19,939,455  19,074,425
2           19,929,909  19,120,424
3           19,950,587  19,101,188
You can see, the lanes were very uniform in the number of reads.

For all of the following commands, I am using the iGenomes files (for the bowtie2 index, the GTF file, and the genome.fa file), available here: 
http://ccb.jhu.edu/software/tophat/igenomes.shtml

I ran tophat like this:
tophat -G genes.gtf --transcriptome-index=transcriptome -p 8 -o out genome rep1_1.fq.gz rep1_2.fq.gz

Cufflinks was run like this:
cufflinks -p 24 --no-effective-length-correction -b genome.fa -G genes.gtf -o out out/accepted_hits.bam

The two variables are highlighted.  

The cufflinks online manual says (note, "Cuffdiff" is a typo and should read "Cufflinks")


--no-effective-length-correctionCuffdiff will not employ its "effective" length normalization to transcript FPKM.

and:

-b/--frag-bias-correct Providing Cufflinks with the multifasta file your reads were mapped to via this option instructs it to run our bias detection and correction algorithm which can significantly improve accuracy of transcript abundance estimates. See How Cufflinks Works for more details.

For each run, I sorted the resulting genes.fpkm_tracking files, pulled out the FPKM column, and calculated the mean of the pairwise Pearson correlation for the 3 replicates. Here are the results:


effective length corr
fragment bias corr
Time
Correlation
Yes
Yes
43 min
0.9790755
Yes
No
16 min
0.9749819
No
Yes
43 min
0.9999465
No
No
16 min
0.9999626


Note: the effective length correction is on by default, whereas the frag bias correction is off by default.

The best way to run cufflinks is with --no-effective-length-correction !!!

frag-bias-correction makes the analysis much slower, without any discernable benefit in this trial run.