BioCozy: September 2012

Tuesday, September 25, 2012

bwa and the -n option (Maximum number of alignments)

When you ask bwa to report multiple hits, it has a very strange behavior.

I am addressing this behavior, highlighted in bold below, and especially the part in yellow italic bold (this is from the bwa manual):

samse

bwa samse [-n maxOcc] >

Generate alignments in the SAM format given single-end reads. Repetitive hits will be randomly chosen.

OPTIONS:

-n INT

Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]

sampe

bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N maxHitDis] [-P] >

Generate alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly.

OPTIONS:


-n INT	Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]

Wait, what? So you're saying, if there are 10 good hits in a given index, and I ask for 10, I will get 10 (1 primary plus 9 in the XA tag), but if I ask for 11, I get only the 1 primary? Even worse, it pretty much seems like you have no way of knowing when multiple hits are being omitted due to this pathological behavior.

I would think that a much more reasonable approach would be to report up to n hits.

I almost couldn't believe this could be true, so I ran the "bwa sampe" command on the same sai file, differing only the "n" number (columns in below table). The number in the table shows the number of accession numbers after the XA tag of the bwa sam file (rows in table below). The number in the table shows the number of reads that had exactly that number of reported extra hits.

	bwa_n1.sam	bwa_n2.sam	bwa_n3.sam	bwa_n4.sam	bwa_n5.sam	bwa_n6.sam	bwa_n7.sam	bwa_n8.sam	bwa_n9.sam	bwa_n10.sam	bwa_n11.sam	bwa_n12.sam	bwa_n13.sam	bwa_n14.sam	bwa_n15.sam	bwa_n16.sam	bwa_n17.sam	bwa_n18.sam	bwa_n19.sam	bwa_n20.sam
0	22	22	22	22	22	22	22	22	22	22	22	22	22	22	22	22	22	22	22	22
1	10511	10511	10511	10511	10511	10511	10511	10511	10511	10511	10511	10511	10511	10511	10511	10511	10511	10511	10511	10511
2	401	3821	3821	3821	3821	3821	3821	3821	3821	3821	3821	3821	3821	3821	3821	3821	3821	3821	3821	3821
3	6919	6919	10357	10357	10357	10357	10357	10357	10357	10357	10357	10357	10357	10357	10357	10357	10357	10357	10357	10357
4	503	503	503	1638	1638	1638	1638	1638	1638	1638	1638	1638	1638	1638	1638	1638	1638	1638	1638	1638
5	270	270	270	270	12973	12973	12973	12973	12973	12973	12973	12973	12973	12973	12973	12973	12973	12973	12973	12973
6	617	617	617	617	617	36680	36680	36680	36680	36680	36680	36680	36680	36680	36680	36680	36680	36680	36680	36680
7	85	85	85	85	85	85	515	515	515	515	515	515	515	515	515	515	515	515	515	515
8	105	105	105	105	105	105	105	2281	2281	2281	2281	2281	2281	2281	2281	2281	2281	2281	2281	2281
9	14	14	14	14	14	14	14	14	108	108	108	108	108	108	108	108	108	108	108	108
10	26	26	26	26	26	26	26	26	26	164	164	164	164	164	164	164	164	164	164	164
11	0	0	0	0	0	0	0	0	0	0	125	125	125	125	125	125	125	125	125	125
12	0	0	0	0	0	0	0	0	0	0	0	1002	1002	1002	1002	1002	1002	1002	1002	1002
13	0	0	0	0	0	0	0	0	0	0	0	0	75	75	75	75	75	75	75	75
14	0	0	0	0	0	0	0	0	0	0	0	0	0	866	866	866	866	866	866	866
15	0	0	0	0	0	0	0	0	0	0	0	0	0	0	49	49	49	49	49	49
16	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1072	1072	1072	1072	1072
17	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1321	1321	1321	1321
18	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	161	161	161
19	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	58	58
20	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	163

Sure enough, looking at 10 < n < 20, you can see it works pretty much like the manual says it does... which seems like a shame. Considering the recently added support for indices with 64-bit addressing, bwa could be really useful in metagenomic environmental sampling... but not if it pathologically erases multiple hits!

There is something else weird going on for n < 10 — my guess is that it ALWAYS reports up to 10 additional perfect hits, if present, regardless of the n number.

Tuesday, September 18, 2012

Using CASAVA to extract raw fastq files from Illumina BCL files

1250 directories

3) Delete the SampleSheet.csv file -- really! In my experience, and for my purposes, I just want to get the raw fastq files. I would actually prefer it if the standard Illumina pipeline ALWAYS produced an unprocessed, undemultiplexed FASTQ file for each read (R1 and the index read if single end, or R1, R2, and the index read if paired end). I find the SampleSheet to be fussy, often this data is not available to me when I am trying to extract the sequences, and anyway, I would like to see empirically what barcodes are present in the raw reads. So that is why I recommend doing this without a SampleSheet file.

3) Run the configureBclToFastq.pl script to generate the make file (this is fast). There are two tricks to getting this to work: you must provide the Data/Intensities/BaseCalls directory, and you have to give it a "bases-mask". The bases mask defines which cycles gets assigned to which reads. The "y" in "y151" means, "Yes, I want the first 151 cycle", then "Yes, I wan the next 6 cyles", and "Yes, I want the next 151 cycles". I think the standard way to run casava would be "y151,I6,y151" which renders the index read to the casava demultiplexing pipeline (in other words, you don't actually get to see the index reads).

$ configureBclToFastq.pl --input-dir 120822_M00609_0035_A000000000-A1CDK/Data/Intensities/BaseCalls/ --output-dir fastq --force --use-bases-mask y151,y6,y151
[2012-09-18 12:06:24] [configureBclToFastq.pl] INFO: Creating directory '/home/username/fastq'
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Basecalling software: RTA
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: version: 1.14 (build 23)
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Original use-bases mask: y151,y6,y151
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Guessed use-bases mask: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy,yyyyyy,yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
Creating directory '/home/username/fastq/Project_000000000-A1CDK/Sample_lane1'
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Read 1: length = 151: mask = yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Read 2: length = 6: mask = yyyyyy
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Read 3: length = 151: mask = yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
[2012-09-18 12:06:26] [configureBclToFastq.pl] INFO: Running self tests: 'make self_test'
[2012-09-18 12:06:27] [configureBclToFastq.pl] INFO: Running self tests on /home/username/fastq completed with no problems

Note that there is something about adding one cycle to the expected number of bases, when working with MiSeq data. It has something to do with the way the cycles are reported back from the machine. Basically, it means that if you are expecting 150 bp reads, you need to put in "y151". Illumina can help you understand this, maybe. Give them a call.

4) Go to the newly create fastq directory and type make (this is slow).

cd fastq
make

5) Your new fastq files should be in the "Project" directory (in my case, "Project_000000000-A1CDK"):

$ ls -l Project_000000000-A1CDK/Sample_lane1/
total 1523748
-rw-rw-r-- 1 username username 427960158 Sep 18 12:08 lane1_NoIndex_L001_R1_001.fastq.gz
-rw-rw-r-- 1 username username 342223846 Sep 18 12:09 lane1_NoIndex_L001_R1_002.fastq.gz
-rw-rw-r-- 1 username username 44789925 Sep 18 12:09 lane1_NoIndex_L001_R2_001.fastq.gz
-rw-rw-r-- 1 username username 35997119 Sep 18 12:10 lane1_NoIndex_L001_R2_002.fastq.gz
-rw-rw-r-- 1 username username 392045984 Sep 18 12:11 lane1_NoIndex_L001_R3_001.fastq.gz
-rw-rw-r-- 1 username username 315694744 Sep 18 12:12 lane1_NoIndex_L001_R3_002.fastq.gz

6) To demultiplex fastq files with a separate fastq file for the barcode read, refer to my recent post: How to demultiplex fastq files with a dedicated, separate barcode file.

Wednesday, September 12, 2012

join command requires alphanumeric sort

You may already know that the linux join command requires sorted fields to work properly -- after all, it tells you so right in the man page:

Important: FILE1 and FILE2 must be sorted on the join fields.

But I just found out the hard way that it must be an alphanumeric sort, not a numeric sort.

Lesson learned.

Tuesday, September 4, 2012

Convert FASTQ to FASTA using sed command

Easiest possible solution, and I believe significantly faster than awk-based methods:

sed -n '1~4s/^@/>/p;2~4p'

The only assumption made is that each read occupies exactly 4 lines in the FASTQ file, but that seems pretty safe, in my experience.

If you find this helpful, go up-vote my answer in StackOverflow.