In this post, I am trying to nail down the "best" way to run cufflinks, part of the so-called "tuxedo suite" of RNA-Seq tools (tophat, cufflinks, etc.)
I have a dataset composed of 3 technical replicates -- exactly the same library, run in 3 separate hi-seq lanes (multiplexed along with other samples, but I am ignoring those for now.) They were run on an Illumina HiSeq 2000, with 101 bp paired-end reads.
Number of reads, before and after quality trimming and preprocessing:
Replicate Before After
1 19,939,455 19,074,425
2 19,929,909 19,120,424
3 19,950,587 19,101,188
You can see, the lanes were very uniform in the number of reads.
For all of the following commands, I am using the iGenomes files (for the bowtie2 index, the GTF file, and the genome.fa file), available here:
http://ccb.jhu.edu/software/tophat/igenomes.shtml
I ran tophat like this:
tophat -G genes.gtf --transcriptome-index=transcriptome -p 8 -o out genome rep1_1.fq.gz rep1_2.fq.gz
Cufflinks was run like this:
cufflinks -p 24 --no-effective-length-correction -b genome.fa -G genes.gtf -o out out/accepted_hits.bam
The two variables are highlighted.
The cufflinks online manual says (note, "Cuffdiff" is a typo and should read "Cufflinks")
and:
For each run, I sorted the resulting genes.fpkm_tracking files, pulled out the FPKM column, and calculated the mean of the pairwise Pearson correlation for the 3 replicates. Here are the results:
Note: the effective length correction is on by default, whereas the frag bias correction is off by default.
The best way to run cufflinks is with --no-effective-length-correction !!!
frag-bias-correction makes the analysis much slower, without any discernable benefit in this trial run.
I have a dataset composed of 3 technical replicates -- exactly the same library, run in 3 separate hi-seq lanes (multiplexed along with other samples, but I am ignoring those for now.) They were run on an Illumina HiSeq 2000, with 101 bp paired-end reads.
Number of reads, before and after quality trimming and preprocessing:
Replicate Before After
1 19,939,455 19,074,425
2 19,929,909 19,120,424
3 19,950,587 19,101,188
You can see, the lanes were very uniform in the number of reads.
For all of the following commands, I am using the iGenomes files (for the bowtie2 index, the GTF file, and the genome.fa file), available here:
http://ccb.jhu.edu/software/tophat/igenomes.shtml
I ran tophat like this:
tophat -G genes.gtf --transcriptome-index=transcriptome -p 8 -o out genome rep1_1.fq.gz rep1_2.fq.gz
Cufflinks was run like this:
cufflinks -p 24 --no-effective-length-correction -b genome.fa -G genes.gtf -o out out/accepted_hits.bam
The two variables are highlighted.
The cufflinks online manual says (note, "Cuffdiff" is a typo and should read "Cufflinks")
--no-effective-length-correction | Cuffdiff will not employ its "effective" length normalization to transcript FPKM. |
and:
-b/--frag-bias-correct | Providing Cufflinks with the multifasta file your reads were mapped to via this option instructs it to run our bias detection and correction algorithm which can significantly improve accuracy of transcript abundance estimates. See How Cufflinks Works for more details. |
For each run, I sorted the resulting genes.fpkm_tracking files, pulled out the FPKM column, and calculated the mean of the pairwise Pearson correlation for the 3 replicates. Here are the results:
effective length corr
|
fragment bias corr
|
Time
|
Correlation
|
Yes
|
Yes
|
43 min
|
0.9790755
|
Yes
|
No
|
16 min
|
0.9749819
|
No
|
Yes
|
43 min
|
0.9999465
|
No
|
No
|
16 min
|
0.9999626
|
Note: the effective length correction is on by default, whereas the frag bias correction is off by default.
The best way to run cufflinks is with --no-effective-length-correction !!!
frag-bias-correction makes the analysis much slower, without any discernable benefit in this trial run.
Thanks! I will give it a try! P.S. your links are down.
ReplyDeleteThanks. Fixed now.
Delete