BioCozy: Using CASAVA to extract raw fastq files from Illumina BCL files

1250 directories

3) Delete the SampleSheet.csv file -- really! In my experience, and for my purposes, I just want to get the raw fastq files. I would actually prefer it if the standard Illumina pipeline ALWAYS produced an unprocessed, undemultiplexed FASTQ file for each read (R1 and the index read if single end, or R1, R2, and the index read if paired end). I find the SampleSheet to be fussy, often this data is not available to me when I am trying to extract the sequences, and anyway, I would like to see empirically what barcodes are present in the raw reads. So that is why I recommend doing this without a SampleSheet file.

3) Run the configureBclToFastq.pl script to generate the make file (this is fast). There are two tricks to getting this to work: you must provide the Data/Intensities/BaseCalls directory, and you have to give it a "bases-mask". The bases mask defines which cycles gets assigned to which reads. The "y" in "y151" means, "Yes, I want the first 151 cycle", then "Yes, I wan the next 6 cyles", and "Yes, I want the next 151 cycles". I think the standard way to run casava would be "y151,I6,y151" which renders the index read to the casava demultiplexing pipeline (in other words, you don't actually get to see the index reads).

$ configureBclToFastq.pl --input-dir 120822_M00609_0035_A000000000-A1CDK/Data/Intensities/BaseCalls/ --output-dir fastq --force --use-bases-mask y151,y6,y151
[2012-09-18 12:06:24] [configureBclToFastq.pl] INFO: Creating directory '/home/username/fastq'
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Basecalling software: RTA
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: version: 1.14 (build 23)
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Original use-bases mask: y151,y6,y151
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Guessed use-bases mask: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy,yyyyyy,yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
Creating directory '/home/username/fastq/Project_000000000-A1CDK/Sample_lane1'
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Read 1: length = 151: mask = yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Read 2: length = 6: mask = yyyyyy
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Read 3: length = 151: mask = yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
[2012-09-18 12:06:26] [configureBclToFastq.pl] INFO: Running self tests: 'make self_test'
[2012-09-18 12:06:27] [configureBclToFastq.pl] INFO: Running self tests on /home/username/fastq completed with no problems

Note that there is something about adding one cycle to the expected number of bases, when working with MiSeq data. It has something to do with the way the cycles are reported back from the machine. Basically, it means that if you are expecting 150 bp reads, you need to put in "y151". Illumina can help you understand this, maybe. Give them a call.

4) Go to the newly create fastq directory and type make (this is slow).

cd fastq
make

5) Your new fastq files should be in the "Project" directory (in my case, "Project_000000000-A1CDK"):

$ ls -l Project_000000000-A1CDK/Sample_lane1/
total 1523748
-rw-rw-r-- 1 username username 427960158 Sep 18 12:08 lane1_NoIndex_L001_R1_001.fastq.gz
-rw-rw-r-- 1 username username 342223846 Sep 18 12:09 lane1_NoIndex_L001_R1_002.fastq.gz
-rw-rw-r-- 1 username username 44789925 Sep 18 12:09 lane1_NoIndex_L001_R2_001.fastq.gz
-rw-rw-r-- 1 username username 35997119 Sep 18 12:10 lane1_NoIndex_L001_R2_002.fastq.gz
-rw-rw-r-- 1 username username 392045984 Sep 18 12:11 lane1_NoIndex_L001_R3_001.fastq.gz
-rw-rw-r-- 1 username username 315694744 Sep 18 12:12 lane1_NoIndex_L001_R3_002.fastq.gz

6) To demultiplex fastq files with a separate fastq file for the barcode read, refer to my recent post: How to demultiplex fastq files with a dedicated, separate barcode file.

3 comments:

AnonymousOctober 14, 2012 at 11:34 PM
Hi,
I am using casava 1.8.2 for bcl to fastq conversion. I have few questions:

The command I use is:

/usr/local/bin/configureBclToFastq.pl --input-dir /usr/local/share/CASAVA-1.8.2/examples/Validation/110120_P20_0993_A805CKABXX/Data/Intensities/BaseCalls --output-dir Unaligned

I get the output and I can identify each sample separately without giving samplesheet.csv file

Questiotion 1) Is the samplesheet.csv necessary for the conversion? If we do not provide samplesheet, will it have any impact on the output?

Question 2) I am getting different numbers of fastq files for different samples like 10 files (5 for read 1 and 5 for read 2) for sample 1, 14 (7 for each read) for sample 2 and so on. Why is it like this? Is there any way that I can get a single file for each sample?
OwenOctober 16, 2012 at 12:18 PM
In my experience, omitting the SampleSheet.csv file will not affect the data itself, it just means that you won't get any demultiplexing as an automated part of the CASAVA pipeline.

As for your differing number of reads, I believe CASAVA likes to "chunk" the reads, capping them at 4 million reads (16 million lines) per file. The files can be simple cat'ed together:

lane1_NoIndex_L001_R1_001.fastq.gz
lane1_NoIndex_L001_R1_002.fastq.gz
lane1_NoIndex_L001_R1_003.fastq.gz
lane1_NoIndex_L001_R1_004.fastq.gz

cat lane1_NoIndex_L001_R1*.gz > lane1_NoIndex_L001_R1.fq.gz

(yes, you can cat together gz files without ungzipping them)

Hope this helps.
UnknownOctober 28, 2013 at 3:41 PM
I am working on data processing at CHLA, and have an issue. We have two servers, how in CASAVA can I output to a different server, aka I do not want to save locally the output directory on the server that has the same software as CASAVA, and wish to include the output directory to a different server. Advice?

Tuesday, September 18, 2012

Using CASAVA to extract raw fastq files from Illumina BCL files

3 comments: