Tuesday, September 25, 2012

bwa and the -n option (Maximum number of alignments)

When you ask bwa to report multiple hits, it has a very strange behavior.

I am addressing this behavior, highlighted in bold below, and especially the part in yellow italic bold (this is from the bwa manual):

samsebwa samse [-n maxOcc] >
Generate alignments in the SAM format given single-end reads. Repetitive hits will be randomly chosen.
OPTIONS:
-n INTMaximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]


sampebwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N maxHitDis] [-P] >
Generate alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly.
OPTIONS:
-n INTMaximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]


Wait, what?  So you're saying, if there are 10 good hits in a given index, and I ask for 10, I will get 10 (1 primary plus 9 in the XA tag), but if I ask for 11, I get only the 1 primary?  Even worse, it pretty much seems like you have no way of knowing when multiple hits are being omitted due to this pathological behavior.

I would think that a much more reasonable approach would be to report up to n hits.

I almost couldn't believe this could be true, so I ran the "bwa sampe" command on the same sai file, differing only the "n" number (columns in below table).  The number in the table shows the number of accession numbers after the XA tag of the bwa sam file (rows in table below).  The number in the table shows the number of reads that had exactly that number of reported extra hits.



bwa_n1.sam
bwa_n2.sam
bwa_n3.sam
bwa_n4.sam
bwa_n5.sam
bwa_n6.sam
bwa_n7.sam
bwa_n8.sam
bwa_n9.sam
bwa_n10.sam
bwa_n11.sam
bwa_n12.sam
bwa_n13.sam
bwa_n14.sam
bwa_n15.sam
bwa_n16.sam
bwa_n17.sam
bwa_n18.sam
bwa_n19.sam
bwa_n20.sam
0
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
1
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
2
401
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3
6919
6919
10357
10357
10357
10357
10357
10357
10357
10357
10357
10357
10357
10357
10357
10357
10357
10357
10357
10357
4
503
503
503
1638
1638
1638
1638
1638
1638
1638
1638
1638
1638
1638
1638
1638
1638
1638
1638
1638
5
270
270
270
270
12973
12973
12973
12973
12973
12973
12973
12973
12973
12973
12973
12973
12973
12973
12973
12973
6
617
617
617
617
617
36680
36680
36680
36680
36680
36680
36680
36680
36680
36680
36680
36680
36680
36680
36680
7
85
85
85
85
85
85
515
515
515
515
515
515
515
515
515
515
515
515
515
515
8
105
105
105
105
105
105
105
2281
2281
2281
2281
2281
2281
2281
2281
2281
2281
2281
2281
2281
9
14
14
14
14
14
14
14
14
108
108
108
108
108
108
108
108
108
108
108
108
10
26
26
26
26
26
26
26
26
26
164
164
164
164
164
164
164
164
164
164
164
11
0
0
0
0
0
0
0
0
0
0
125
125
125
125
125
125
125
125
125
125
12
0
0
0
0
0
0
0
0
0
0
0
1002
1002
1002
1002
1002
1002
1002
1002
1002
13
0
0
0
0
0
0
0
0
0
0
0
0
75
75
75
75
75
75
75
75
14
0
0
0
0
0
0
0
0
0
0
0
0
0
866
866
866
866
866
866
866
15
0
0
0
0
0
0
0
0
0
0
0
0
0
0
49
49
49
49
49
49
16
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1072
1072
1072
1072
1072
17
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1321
1321
1321
1321
18
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
161
161
161
19
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
58
58
20
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
163


Sure enough, looking at 10 < n < 20, you can see it works pretty much like the manual says it does... which seems like a shame.  Considering the recently added support for indices with 64-bit addressing, bwa could be really useful in metagenomic environmental sampling... but not if it pathologically erases multiple hits!

There is something else weird going on for n < 10 — my guess is that it ALWAYS reports up to 10 additional perfect hits, if present, regardless of the n number.


Tuesday, September 18, 2012

Using CASAVA to extract raw fastq files from Illumina BCL files

Here is how you can extract raw fastq files from Illumina BCL data.

This will give you raw fastq files.  No demultiplexing.  Just the reads, and nothin' but the reads.

Read 1 is generally the forward read, Read 2 is generally the index read, and Read 3 is generally the reverse read.

You might want to do this if you want to see the barcode read quality scores, which you may want to consider to do a more accurate demultiplexing.  Examining these quality scores may also help you understand "barcode bleed" (aka, Illumina's dirty little secret.)

Here is what I did:

1) get a copy of CASAVA 1.8.  Illumina technical support can hook you up.  I'm using CASAVA-1.8.2.  [edit: seems you can download it from here.]

2) Get the entire run directory, which should look something like this (this is from MiSeq data):

120822_M00609_0035_A000000000-A1CDK
|-- Config
|-- Data
|   |-- Intensities
|   |   |-- BaseCalls
|   |   |   |-- L001
|   |   |   |   |-- C1.1   #
|   |   |   |   |-- C10.1  # the BCL files live here
|   |   |   |   |-- C100.1 # (one per cycle per tile)
|   |   |   |   |-- and so on
|   |   |   |-- Matrix
|   |   |   `-- Phasing
|   |   |-- L001
|   |   |   |-- C1.1       #
|   |   |   |-- C10.1      # the FWHMMap files live here
|   |   |   |-- C100.1     # (one per cycle per tile per base)
|   |   |   |-- and so on
|   |   `-- Offsets
|   |-- RTALogs
|   `-- TileStatus
|-- Images
|   `-- Focus
|       `-- L001
|           |-- C1.1       #
|           |-- C10.1      # the TIF images live here
|           |-- C100.1     # (one per cycle per tile)
|           |-- and so on
|-- InterOp
|-- Logs
|-- Recipe
`-- Thumbnail_Images
    `-- L001
        |-- C1.1
        |-- C10.1
        |-- C100.1

        |-- and so on

1250 directories

3) Delete the SampleSheet.csv file -- really!  In my experience, and for my purposes, I just want to get the raw fastq files.  I would actually prefer it if the standard Illumina pipeline ALWAYS produced an unprocessed, undemultiplexed FASTQ file for each read (R1 and the index read if single end, or R1, R2, and the index read if paired end).  I find the SampleSheet to be fussy, often this data is not available to me when I am trying to extract the sequences, and anyway, I would like to see empirically what barcodes are present in the raw reads.  So that is why I recommend doing this without a SampleSheet file.

3) Run the configureBclToFastq.pl script to generate the make file (this is fast).  There are two tricks to getting this to work:  you must provide the Data/Intensities/BaseCalls directory, and you have to give it a "bases-mask".  The bases mask defines which cycles gets assigned to which reads.  The "y" in "y151" means, "Yes, I want the first 151 cycle", then "Yes, I wan the next 6 cyles", and "Yes, I want the next 151 cycles".  I think the standard way to run casava would be "y151,I6,y151" which renders the index read to the casava demultiplexing pipeline (in other words, you don't actually get to see the index reads).


$ configureBclToFastq.pl --input-dir 120822_M00609_0035_A000000000-A1CDK/Data/Intensities/BaseCalls/ --output-dir fastq --force --use-bases-mask y151,y6,y151
[2012-09-18 12:06:24]   [configureBclToFastq.pl]        INFO: Creating directory '/home/username/fastq'
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO: Basecalling software: RTA
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO:              version: 1.14 (build 23)
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO: Original use-bases mask: y151,y6,y151
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO: Guessed use-bases mask: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy,yyyyyy,yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
Creating directory '/home/username/fastq/Project_000000000-A1CDK/Sample_lane1'
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO: Read 1: length = 151: mask = yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO: Read 2: length = 6: mask = yyyyyy
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO: Read 3: length = 151: mask = yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
[2012-09-18 12:06:26]   [configureBclToFastq.pl]        INFO: Running self tests: 'make self_test'
[2012-09-18 12:06:27]   [configureBclToFastq.pl]        INFO: Running self tests on /home/username/fastq completed with no problems

Note that there is something about adding one cycle to the expected number of bases, when working with MiSeq data.  It has something to do with the way the cycles are reported back from the machine.  Basically, it means that if you are expecting 150 bp reads, you need to put in "y151".  Illumina can help you understand this, maybe.  Give them a call.


4) Go to the newly create fastq directory and type make (this is slow).

cd fastq
make

5) Your new fastq files should be in the "Project" directory (in my case, "Project_000000000-A1CDK"):


$ ls -l Project_000000000-A1CDK/Sample_lane1/
total 1523748
-rw-rw-r-- 1 username username 427960158 Sep 18 12:08 lane1_NoIndex_L001_R1_001.fastq.gz
-rw-rw-r-- 1 username username 342223846 Sep 18 12:09 lane1_NoIndex_L001_R1_002.fastq.gz
-rw-rw-r-- 1 username username  44789925 Sep 18 12:09 lane1_NoIndex_L001_R2_001.fastq.gz
-rw-rw-r-- 1 username username  35997119 Sep 18 12:10 lane1_NoIndex_L001_R2_002.fastq.gz
-rw-rw-r-- 1 username username 392045984 Sep 18 12:11 lane1_NoIndex_L001_R3_001.fastq.gz
-rw-rw-r-- 1 username username 315694744 Sep 18 12:12 lane1_NoIndex_L001_R3_002.fastq.gz

6) To demultiplex fastq files with a separate fastq file for the barcode read, refer to my recent post:  How to demultiplex fastq files with a dedicated, separate barcode file.

Wednesday, September 12, 2012

join command requires alphanumeric sort

You may already know that the linux join command requires sorted fields to work properly -- after all, it tells you so right in the man page:

Important: FILE1 and FILE2 must be sorted on the join fields.

But I just found out the hard way that it must be an alphanumeric sort, not a numeric sort.

Lesson learned.

Tuesday, September 4, 2012

Convert FASTQ to FASTA using sed command

Easiest possible solution, and I believe significantly faster than awk-based methods:

sed -n '1~4s/^@/>/p;2~4p' 


The only assumption made is that each read occupies exactly 4 lines in the FASTQ file, but that seems pretty safe, in my experience.

If you find this helpful, go up-vote my answer in StackOverflow.