Tuesday, September 25, 2012

bwa and the -n option (Maximum number of alignments)

When you ask bwa to report multiple hits, it has a very strange behavior.

I am addressing this behavior, highlighted in bold below, and especially the part in yellow italic bold (this is from the bwa manual):

samsebwa samse [-n maxOcc] >
Generate alignments in the SAM format given single-end reads. Repetitive hits will be randomly chosen.
OPTIONS:
-n INTMaximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]


sampebwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N maxHitDis] [-P] >
Generate alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly.
OPTIONS:
-n INTMaximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]


Wait, what?  So you're saying, if there are 10 good hits in a given index, and I ask for 10, I will get 10 (1 primary plus 9 in the XA tag), but if I ask for 11, I get only the 1 primary?  Even worse, it pretty much seems like you have no way of knowing when multiple hits are being omitted due to this pathological behavior.

I would think that a much more reasonable approach would be to report up to n hits.

I almost couldn't believe this could be true, so I ran the "bwa sampe" command on the same sai file, differing only the "n" number (columns in below table).  The number in the table shows the number of accession numbers after the XA tag of the bwa sam file (rows in table below).  The number in the table shows the number of reads that had exactly that number of reported extra hits.



bwa_n1.sam
bwa_n2.sam
bwa_n3.sam
bwa_n4.sam
bwa_n5.sam
bwa_n6.sam
bwa_n7.sam
bwa_n8.sam
bwa_n9.sam
bwa_n10.sam
bwa_n11.sam
bwa_n12.sam
bwa_n13.sam
bwa_n14.sam
bwa_n15.sam
bwa_n16.sam
bwa_n17.sam
bwa_n18.sam
bwa_n19.sam
bwa_n20.sam
0
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
1
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
10511
2
401
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3821
3
6919
6919
10357
10357
10357
10357
10357
10357
10357
10357
10357
10357
10357
10357
10357
10357
10357
10357
10357
10357
4
503
503
503
1638
1638
1638
1638
1638
1638
1638
1638
1638
1638
1638
1638
1638
1638
1638
1638
1638
5
270
270
270
270
12973
12973
12973
12973
12973
12973
12973
12973
12973
12973
12973
12973
12973
12973
12973
12973
6
617
617
617
617
617
36680
36680
36680
36680
36680
36680
36680
36680
36680
36680
36680
36680
36680
36680
36680
7
85
85
85
85
85
85
515
515
515
515
515
515
515
515
515
515
515
515
515
515
8
105
105
105
105
105
105
105
2281
2281
2281
2281
2281
2281
2281
2281
2281
2281
2281
2281
2281
9
14
14
14
14
14
14
14
14
108
108
108
108
108
108
108
108
108
108
108
108
10
26
26
26
26
26
26
26
26
26
164
164
164
164
164
164
164
164
164
164
164
11
0
0
0
0
0
0
0
0
0
0
125
125
125
125
125
125
125
125
125
125
12
0
0
0
0
0
0
0
0
0
0
0
1002
1002
1002
1002
1002
1002
1002
1002
1002
13
0
0
0
0
0
0
0
0
0
0
0
0
75
75
75
75
75
75
75
75
14
0
0
0
0
0
0
0
0
0
0
0
0
0
866
866
866
866
866
866
866
15
0
0
0
0
0
0
0
0
0
0
0
0
0
0
49
49
49
49
49
49
16
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1072
1072
1072
1072
1072
17
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1321
1321
1321
1321
18
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
161
161
161
19
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
58
58
20
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
163


Sure enough, looking at 10 < n < 20, you can see it works pretty much like the manual says it does... which seems like a shame.  Considering the recently added support for indices with 64-bit addressing, bwa could be really useful in metagenomic environmental sampling... but not if it pathologically erases multiple hits!

There is something else weird going on for n < 10 — my guess is that it ALWAYS reports up to 10 additional perfect hits, if present, regardless of the n number.


No comments:

Post a Comment