Tuesday, September 18, 2012

Using CASAVA to extract raw fastq files from Illumina BCL files

Here is how you can extract raw fastq files from Illumina BCL data.

This will give you raw fastq files.  No demultiplexing.  Just the reads, and nothin' but the reads.

Read 1 is generally the forward read, Read 2 is generally the index read, and Read 3 is generally the reverse read.

You might want to do this if you want to see the barcode read quality scores, which you may want to consider to do a more accurate demultiplexing.  Examining these quality scores may also help you understand "barcode bleed" (aka, Illumina's dirty little secret.)

Here is what I did:

1) get a copy of CASAVA 1.8.  Illumina technical support can hook you up.  I'm using CASAVA-1.8.2.  [edit: seems you can download it from here.]

2) Get the entire run directory, which should look something like this (this is from MiSeq data):

120822_M00609_0035_A000000000-A1CDK
|-- Config
|-- Data
|   |-- Intensities
|   |   |-- BaseCalls
|   |   |   |-- L001
|   |   |   |   |-- C1.1   #
|   |   |   |   |-- C10.1  # the BCL files live here
|   |   |   |   |-- C100.1 # (one per cycle per tile)
|   |   |   |   |-- and so on
|   |   |   |-- Matrix
|   |   |   `-- Phasing
|   |   |-- L001
|   |   |   |-- C1.1       #
|   |   |   |-- C10.1      # the FWHMMap files live here
|   |   |   |-- C100.1     # (one per cycle per tile per base)
|   |   |   |-- and so on
|   |   `-- Offsets
|   |-- RTALogs
|   `-- TileStatus
|-- Images
|   `-- Focus
|       `-- L001
|           |-- C1.1       #
|           |-- C10.1      # the TIF images live here
|           |-- C100.1     # (one per cycle per tile)
|           |-- and so on
|-- InterOp
|-- Logs
|-- Recipe
`-- Thumbnail_Images
    `-- L001
        |-- C1.1
        |-- C10.1
        |-- C100.1

        |-- and so on

1250 directories

3) Delete the SampleSheet.csv file -- really!  In my experience, and for my purposes, I just want to get the raw fastq files.  I would actually prefer it if the standard Illumina pipeline ALWAYS produced an unprocessed, undemultiplexed FASTQ file for each read (R1 and the index read if single end, or R1, R2, and the index read if paired end).  I find the SampleSheet to be fussy, often this data is not available to me when I am trying to extract the sequences, and anyway, I would like to see empirically what barcodes are present in the raw reads.  So that is why I recommend doing this without a SampleSheet file.

3) Run the configureBclToFastq.pl script to generate the make file (this is fast).  There are two tricks to getting this to work:  you must provide the Data/Intensities/BaseCalls directory, and you have to give it a "bases-mask".  The bases mask defines which cycles gets assigned to which reads.  The "y" in "y151" means, "Yes, I want the first 151 cycle", then "Yes, I wan the next 6 cyles", and "Yes, I want the next 151 cycles".  I think the standard way to run casava would be "y151,I6,y151" which renders the index read to the casava demultiplexing pipeline (in other words, you don't actually get to see the index reads).


$ configureBclToFastq.pl --input-dir 120822_M00609_0035_A000000000-A1CDK/Data/Intensities/BaseCalls/ --output-dir fastq --force --use-bases-mask y151,y6,y151
[2012-09-18 12:06:24]   [configureBclToFastq.pl]        INFO: Creating directory '/home/username/fastq'
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO: Basecalling software: RTA
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO:              version: 1.14 (build 23)
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO: Original use-bases mask: y151,y6,y151
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO: Guessed use-bases mask: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy,yyyyyy,yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
Creating directory '/home/username/fastq/Project_000000000-A1CDK/Sample_lane1'
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO: Read 1: length = 151: mask = yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO: Read 2: length = 6: mask = yyyyyy
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO: Read 3: length = 151: mask = yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
[2012-09-18 12:06:26]   [configureBclToFastq.pl]        INFO: Running self tests: 'make self_test'
[2012-09-18 12:06:27]   [configureBclToFastq.pl]        INFO: Running self tests on /home/username/fastq completed with no problems

Note that there is something about adding one cycle to the expected number of bases, when working with MiSeq data.  It has something to do with the way the cycles are reported back from the machine.  Basically, it means that if you are expecting 150 bp reads, you need to put in "y151".  Illumina can help you understand this, maybe.  Give them a call.


4) Go to the newly create fastq directory and type make (this is slow).

cd fastq
make

5) Your new fastq files should be in the "Project" directory (in my case, "Project_000000000-A1CDK"):


$ ls -l Project_000000000-A1CDK/Sample_lane1/
total 1523748
-rw-rw-r-- 1 username username 427960158 Sep 18 12:08 lane1_NoIndex_L001_R1_001.fastq.gz
-rw-rw-r-- 1 username username 342223846 Sep 18 12:09 lane1_NoIndex_L001_R1_002.fastq.gz
-rw-rw-r-- 1 username username  44789925 Sep 18 12:09 lane1_NoIndex_L001_R2_001.fastq.gz
-rw-rw-r-- 1 username username  35997119 Sep 18 12:10 lane1_NoIndex_L001_R2_002.fastq.gz
-rw-rw-r-- 1 username username 392045984 Sep 18 12:11 lane1_NoIndex_L001_R3_001.fastq.gz
-rw-rw-r-- 1 username username 315694744 Sep 18 12:12 lane1_NoIndex_L001_R3_002.fastq.gz

6) To demultiplex fastq files with a separate fastq file for the barcode read, refer to my recent post:  How to demultiplex fastq files with a dedicated, separate barcode file.

3 comments:

  1. Hi,
    I am using casava 1.8.2 for bcl to fastq conversion. I have few questions:

    The command I use is:

    /usr/local/bin/configureBclToFastq.pl --input-dir /usr/local/share/CASAVA-1.8.2/examples/Validation/110120_P20_0993_A805CKABXX/Data/Intensities/BaseCalls --output-dir Unaligned

    I get the output and I can identify each sample separately without giving samplesheet.csv file

    Questiotion 1) Is the samplesheet.csv necessary for the conversion? If we do not provide samplesheet, will it have any impact on the output?

    Question 2) I am getting different numbers of fastq files for different samples like 10 files (5 for read 1 and 5 for read 2) for sample 1, 14 (7 for each read) for sample 2 and so on. Why is it like this? Is there any way that I can get a single file for each sample?

    ReplyDelete
  2. In my experience, omitting the SampleSheet.csv file will not affect the data itself, it just means that you won't get any demultiplexing as an automated part of the CASAVA pipeline.

    As for your differing number of reads, I believe CASAVA likes to "chunk" the reads, capping them at 4 million reads (16 million lines) per file. The files can be simple cat'ed together:

    lane1_NoIndex_L001_R1_001.fastq.gz
    lane1_NoIndex_L001_R1_002.fastq.gz
    lane1_NoIndex_L001_R1_003.fastq.gz
    lane1_NoIndex_L001_R1_004.fastq.gz

    cat lane1_NoIndex_L001_R1*.gz > lane1_NoIndex_L001_R1.fq.gz

    (yes, you can cat together gz files without ungzipping them)

    Hope this helps.

    ReplyDelete
  3. I am working on data processing at CHLA, and have an issue. We have two servers, how in CASAVA can I output to a different server, aka I do not want to save locally the output directory on the server that has the same software as CASAVA, and wish to include the output directory to a different server. Advice?

    ReplyDelete