When you ask bwa to report multiple hits, it has a very strange behavior.
I am addressing this behavior, highlighted in bold below, and especially the part in yellow italic bold (this is from the bwa manual):
Wait, what? So you're saying, if there are 10 good hits in a given index, and I ask for 10, I will get 10 (1 primary plus 9 in the XA tag), but if I ask for 11, I get only the 1 primary? Even worse, it pretty much seems like you have no way of knowing when multiple hits are being omitted due to this pathological behavior.
I would think that a much more reasonable approach would be to report up to n hits.
I almost couldn't believe this could be true, so I ran the "bwa sampe" command on the same sai file, differing only the "n" number (columns in below table). The number in the table shows the number of accession numbers after the XA tag of the bwa sam file (rows in table below). The number in the table shows the number of reads that had exactly that number of reported extra hits.
Sure enough, looking at 10 < n < 20, you can see it works pretty much like the manual says it does... which seems like a shame. Considering the recently added support for indices with 64-bit addressing, bwa could be really useful in metagenomic environmental sampling... but not if it pathologically erases multiple hits!
There is something else weird going on for n < 10 — my guess is that it ALWAYS reports up to 10 additional perfect hits, if present, regardless of the n number.
I am addressing this behavior, highlighted in bold below, and especially the part in yellow italic bold (this is from the bwa manual):
samse | bwa samse [-n maxOcc]
Generate alignments in the SAM format given single-end reads. Repetitive hits will be randomly chosen.
OPTIONS:
|
sampe | bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N maxHitDis] [-P]
Generate alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly.
OPTIONS:
|
Wait, what? So you're saying, if there are 10 good hits in a given index, and I ask for 10, I will get 10 (1 primary plus 9 in the XA tag), but if I ask for 11, I get only the 1 primary? Even worse, it pretty much seems like you have no way of knowing when multiple hits are being omitted due to this pathological behavior.
I would think that a much more reasonable approach would be to report up to n hits.
I almost couldn't believe this could be true, so I ran the "bwa sampe" command on the same sai file, differing only the "n" number (columns in below table). The number in the table shows the number of accession numbers after the XA tag of the bwa sam file (rows in table below). The number in the table shows the number of reads that had exactly that number of reported extra hits.
bwa_n1.sam
|
bwa_n2.sam
|
bwa_n3.sam
|
bwa_n4.sam
|
bwa_n5.sam
|
bwa_n6.sam
|
bwa_n7.sam
|
bwa_n8.sam
|
bwa_n9.sam
|
bwa_n10.sam
|
bwa_n11.sam
|
bwa_n12.sam
|
bwa_n13.sam
|
bwa_n14.sam
|
bwa_n15.sam
|
bwa_n16.sam
|
bwa_n17.sam
|
bwa_n18.sam
|
bwa_n19.sam
|
bwa_n20.sam
|
|
0
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
1
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
2
|
401
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3
|
6919
|
6919
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
4
|
503
|
503
|
503
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
5
|
270
|
270
|
270
|
270
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
6
|
617
|
617
|
617
|
617
|
617
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
7
|
85
|
85
|
85
|
85
|
85
|
85
|
515
|
515
|
515
|
515
|
515
|
515
|
515
|
515
|
515
|
515
|
515
|
515
|
515
|
515
|
8
|
105
|
105
|
105
|
105
|
105
|
105
|
105
|
2281
|
2281
|
2281
|
2281
|
2281
|
2281
|
2281
|
2281
|
2281
|
2281
|
2281
|
2281
|
2281
|
9
|
14
|
14
|
14
|
14
|
14
|
14
|
14
|
14
|
108
|
108
|
108
|
108
|
108
|
108
|
108
|
108
|
108
|
108
|
108
|
108
|
10
|
26
|
26
|
26
|
26
|
26
|
26
|
26
|
26
|
26
|
164
|
164
|
164
|
164
|
164
|
164
|
164
|
164
|
164
|
164
|
164
|
11
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
125
|
125
|
125
|
125
|
125
|
125
|
125
|
125
|
125
|
125
|
12
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
1002
|
1002
|
1002
|
1002
|
1002
|
1002
|
1002
|
1002
|
1002
|
13
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
75
|
75
|
75
|
75
|
75
|
75
|
75
|
75
|
14
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
866
|
866
|
866
|
866
|
866
|
866
|
866
|
15
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
49
|
49
|
49
|
49
|
49
|
49
|
16
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
1072
|
1072
|
1072
|
1072
|
1072
|
17
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
1321
|
1321
|
1321
|
1321
|
18
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
161
|
161
|
161
|
19
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
58
|
58
|
20
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
163
|
Sure enough, looking at 10 < n < 20, you can see it works pretty much like the manual says it does... which seems like a shame. Considering the recently added support for indices with 64-bit addressing, bwa could be really useful in metagenomic environmental sampling... but not if it pathologically erases multiple hits!
There is something else weird going on for n < 10 — my guess is that it ALWAYS reports up to 10 additional perfect hits, if present, regardless of the n number.