The 1000 genomes project publishes a .bed file mask ("Strict Accessibility Mask") for regions of the human genome where it is difficult to place reads, due to repetitivity. Follow the link:
Thursday, October 11, 2012
Tuesday, September 25, 2012
bwa and the -n option (Maximum number of alignments)
When you ask bwa to report multiple hits, it has a very strange behavior.
I am addressing this behavior, highlighted in bold below, and especially the part in yellow italic bold (this is from the bwa manual):
Wait, what? So you're saying, if there are 10 good hits in a given index, and I ask for 10, I will get 10 (1 primary plus 9 in the XA tag), but if I ask for 11, I get only the 1 primary? Even worse, it pretty much seems like you have no way of knowing when multiple hits are being omitted due to this pathological behavior.
I would think that a much more reasonable approach would be to report up to n hits.
I almost couldn't believe this could be true, so I ran the "bwa sampe" command on the same sai file, differing only the "n" number (columns in below table). The number in the table shows the number of accession numbers after the XA tag of the bwa sam file (rows in table below). The number in the table shows the number of reads that had exactly that number of reported extra hits.
Sure enough, looking at 10 < n < 20, you can see it works pretty much like the manual says it does... which seems like a shame. Considering the recently added support for indices with 64-bit addressing, bwa could be really useful in metagenomic environmental sampling... but not if it pathologically erases multiple hits!
There is something else weird going on for n < 10 — my guess is that it ALWAYS reports up to 10 additional perfect hits, if present, regardless of the n number.
I am addressing this behavior, highlighted in bold below, and especially the part in yellow italic bold (this is from the bwa manual):
| samse | bwa samse [-n maxOcc]
Generate alignments in the SAM format given single-end reads. Repetitive hits will be randomly chosen.
OPTIONS:
|
| sampe | bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N maxHitDis] [-P]
Generate alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly.
OPTIONS:
|
Wait, what? So you're saying, if there are 10 good hits in a given index, and I ask for 10, I will get 10 (1 primary plus 9 in the XA tag), but if I ask for 11, I get only the 1 primary? Even worse, it pretty much seems like you have no way of knowing when multiple hits are being omitted due to this pathological behavior.
I would think that a much more reasonable approach would be to report up to n hits.
I almost couldn't believe this could be true, so I ran the "bwa sampe" command on the same sai file, differing only the "n" number (columns in below table). The number in the table shows the number of accession numbers after the XA tag of the bwa sam file (rows in table below). The number in the table shows the number of reads that had exactly that number of reported extra hits.
bwa_n1.sam
|
bwa_n2.sam
|
bwa_n3.sam
|
bwa_n4.sam
|
bwa_n5.sam
|
bwa_n6.sam
|
bwa_n7.sam
|
bwa_n8.sam
|
bwa_n9.sam
|
bwa_n10.sam
|
bwa_n11.sam
|
bwa_n12.sam
|
bwa_n13.sam
|
bwa_n14.sam
|
bwa_n15.sam
|
bwa_n16.sam
|
bwa_n17.sam
|
bwa_n18.sam
|
bwa_n19.sam
|
bwa_n20.sam
|
|
0
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
22
|
1
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
10511
|
2
|
401
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3821
|
3
|
6919
|
6919
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
10357
|
4
|
503
|
503
|
503
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
1638
|
5
|
270
|
270
|
270
|
270
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
12973
|
6
|
617
|
617
|
617
|
617
|
617
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
36680
|
7
|
85
|
85
|
85
|
85
|
85
|
85
|
515
|
515
|
515
|
515
|
515
|
515
|
515
|
515
|
515
|
515
|
515
|
515
|
515
|
515
|
8
|
105
|
105
|
105
|
105
|
105
|
105
|
105
|
2281
|
2281
|
2281
|
2281
|
2281
|
2281
|
2281
|
2281
|
2281
|
2281
|
2281
|
2281
|
2281
|
9
|
14
|
14
|
14
|
14
|
14
|
14
|
14
|
14
|
108
|
108
|
108
|
108
|
108
|
108
|
108
|
108
|
108
|
108
|
108
|
108
|
10
|
26
|
26
|
26
|
26
|
26
|
26
|
26
|
26
|
26
|
164
|
164
|
164
|
164
|
164
|
164
|
164
|
164
|
164
|
164
|
164
|
11
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
125
|
125
|
125
|
125
|
125
|
125
|
125
|
125
|
125
|
125
|
12
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
1002
|
1002
|
1002
|
1002
|
1002
|
1002
|
1002
|
1002
|
1002
|
13
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
75
|
75
|
75
|
75
|
75
|
75
|
75
|
75
|
14
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
866
|
866
|
866
|
866
|
866
|
866
|
866
|
15
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
49
|
49
|
49
|
49
|
49
|
49
|
16
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
1072
|
1072
|
1072
|
1072
|
1072
|
17
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
1321
|
1321
|
1321
|
1321
|
18
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
161
|
161
|
161
|
19
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
58
|
58
|
20
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
163
|
Sure enough, looking at 10 < n < 20, you can see it works pretty much like the manual says it does... which seems like a shame. Considering the recently added support for indices with 64-bit addressing, bwa could be really useful in metagenomic environmental sampling... but not if it pathologically erases multiple hits!
There is something else weird going on for n < 10 — my guess is that it ALWAYS reports up to 10 additional perfect hits, if present, regardless of the n number.
Tuesday, September 18, 2012
Using CASAVA to extract raw fastq files from Illumina BCL files
Here is how you can extract raw fastq files from Illumina BCL data.
This will give you raw fastq files. No demultiplexing. Just the reads, and nothin' but the reads.
Read 1 is generally the forward read, Read 2 is generally the index read, and Read 3 is generally the reverse read.
You might want to do this if you want to see the barcode read quality scores, which you may want to consider to do a more accurate demultiplexing. Examining these quality scores may also help you understand "barcode bleed" (aka, Illumina's dirty little secret.)
Here is what I did:
1) get a copy of CASAVA 1.8. Illumina technical support can hook you up. I'm using CASAVA-1.8.2. [edit: seems you can download it from here.]
2) Get the entire run directory, which should look something like this (this is from MiSeq data):
120822_M00609_0035_A000000000-A1CDK
|-- Config
|-- Data
| |-- Intensities
| | |-- BaseCalls
| | | |-- L001
| | | | |-- C1.1 #
| | | | |-- C10.1 # the BCL files live here
| | | | |-- C100.1 # (one per cycle per tile)
| | | | |-- and so on
| | | |-- Matrix
| | | `-- Phasing
| | |-- L001
| | | |-- C1.1 #
| | | |-- C10.1 # the FWHMMap files live here
| | | |-- C100.1 # (one per cycle per tile per base)
| | | |-- and so on
| | `-- Offsets
| |-- RTALogs
| `-- TileStatus
|-- Images
| `-- Focus
| `-- L001
| |-- C1.1 #
| |-- C10.1 # the TIF images live here
| |-- C100.1 # (one per cycle per tile)
| |-- and so on
|-- InterOp
|-- Logs
|-- Recipe
`-- Thumbnail_Images
`-- L001
|-- C1.1
|-- C10.1
|-- C100.1
|-- and so on
1250 directories
3) Delete the SampleSheet.csv file -- really! In my experience, and for my purposes, I just want to get the raw fastq files. I would actually prefer it if the standard Illumina pipeline ALWAYS produced an unprocessed, undemultiplexed FASTQ file for each read (R1 and the index read if single end, or R1, R2, and the index read if paired end). I find the SampleSheet to be fussy, often this data is not available to me when I am trying to extract the sequences, and anyway, I would like to see empirically what barcodes are present in the raw reads. So that is why I recommend doing this without a SampleSheet file.
This will give you raw fastq files. No demultiplexing. Just the reads, and nothin' but the reads.
Read 1 is generally the forward read, Read 2 is generally the index read, and Read 3 is generally the reverse read.
You might want to do this if you want to see the barcode read quality scores, which you may want to consider to do a more accurate demultiplexing. Examining these quality scores may also help you understand "barcode bleed" (aka, Illumina's dirty little secret.)
Here is what I did:
1) get a copy of CASAVA 1.8. Illumina technical support can hook you up. I'm using CASAVA-1.8.2. [edit: seems you can download it from here.]
2) Get the entire run directory, which should look something like this (this is from MiSeq data):
120822_M00609_0035_A000000000-A1CDK
|-- Config
|-- Data
| |-- Intensities
| | |-- BaseCalls
| | | |-- L001
| | | | |-- C1.1 #
| | | | |-- C10.1 # the BCL files live here
| | | | |-- C100.1 # (one per cycle per tile)
| | | | |-- and so on
| | | |-- Matrix
| | | `-- Phasing
| | |-- L001
| | | |-- C1.1 #
| | | |-- C10.1 # the FWHMMap files live here
| | | |-- C100.1 # (one per cycle per tile per base)
| | | |-- and so on
| | `-- Offsets
| |-- RTALogs
| `-- TileStatus
|-- Images
| `-- Focus
| `-- L001
| |-- C1.1 #
| |-- C10.1 # the TIF images live here
| |-- C100.1 # (one per cycle per tile)
| |-- and so on
|-- InterOp
|-- Logs
|-- Recipe
`-- Thumbnail_Images
`-- L001
|-- C1.1
|-- C10.1
|-- C100.1
|-- and so on
3) Delete the SampleSheet.csv file -- really! In my experience, and for my purposes, I just want to get the raw fastq files. I would actually prefer it if the standard Illumina pipeline ALWAYS produced an unprocessed, undemultiplexed FASTQ file for each read (R1 and the index read if single end, or R1, R2, and the index read if paired end). I find the SampleSheet to be fussy, often this data is not available to me when I am trying to extract the sequences, and anyway, I would like to see empirically what barcodes are present in the raw reads. So that is why I recommend doing this without a SampleSheet file.
3) Run the configureBclToFastq.pl script to generate the make file (this is fast). There are two tricks to getting this to work: you must provide the Data/Intensities/BaseCalls directory, and you have to give it a "bases-mask". The bases mask defines which cycles gets assigned to which reads. The "y" in "y151" means, "Yes, I want the first 151 cycle", then "Yes, I wan the next 6 cyles", and "Yes, I want the next 151 cycles". I think the standard way to run casava would be "y151,I6,y151" which renders the index read to the casava demultiplexing pipeline (in other words, you don't actually get to see the index reads).
$ configureBclToFastq.pl --input-dir 120822_M00609_0035_A000000000-A1CDK/Data/Intensities/BaseCalls/ --output-dir fastq --force --use-bases-mask y151,y6,y151
[2012-09-18 12:06:24] [configureBclToFastq.pl] INFO: Creating directory '/home/username/fastq'
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Basecalling software: RTA
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: version: 1.14 (build 23)
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Original use-bases mask: y151,y6,y151
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Guessed use-bases mask: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy,yyyyyy,yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
Creating directory '/home/username/fastq/Project_000000000-A1CDK/Sample_lane1'
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Read 1: length = 151: mask = yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Read 2: length = 6: mask = yyyyyy
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Read 3: length = 151: mask = yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
[2012-09-18 12:06:26] [configureBclToFastq.pl] INFO: Running self tests: 'make self_test'
[2012-09-18 12:06:27] [configureBclToFastq.pl] INFO: Running self tests on /home/username/fastq completed with no problems
Note that there is something about adding one cycle to the expected number of bases, when working with MiSeq data. It has something to do with the way the cycles are reported back from the machine. Basically, it means that if you are expecting 150 bp reads, you need to put in "y151". Illumina can help you understand this, maybe. Give them a call.
4) Go to the newly create fastq directory and type make (this is slow).
cd fastq
make
5) Your new fastq files should be in the "Project" directory (in my case, "Project_000000000-A1CDK"):
$ ls -l Project_000000000-A1CDK/Sample_lane1/
total 1523748
-rw-rw-r-- 1 username username 427960158 Sep 18 12:08 lane1_NoIndex_L001_R1_001.fastq.gz
-rw-rw-r-- 1 username username 342223846 Sep 18 12:09 lane1_NoIndex_L001_R1_002.fastq.gz
-rw-rw-r-- 1 username username 44789925 Sep 18 12:09 lane1_NoIndex_L001_R2_001.fastq.gz
-rw-rw-r-- 1 username username 35997119 Sep 18 12:10 lane1_NoIndex_L001_R2_002.fastq.gz
-rw-rw-r-- 1 username username 392045984 Sep 18 12:11 lane1_NoIndex_L001_R3_001.fastq.gz
-rw-rw-r-- 1 username username 315694744 Sep 18 12:12 lane1_NoIndex_L001_R3_002.fastq.gz
6) To demultiplex fastq files with a separate fastq file for the barcode read, refer to my recent post: How to demultiplex fastq files with a dedicated, separate barcode file.
$ configureBclToFastq.pl --input-dir 120822_M00609_0035_A000000000-A1CDK/Data/Intensities/BaseCalls/ --output-dir fastq --force --use-bases-mask y151,y6,y151
[2012-09-18 12:06:24] [configureBclToFastq.pl] INFO: Creating directory '/home/username/fastq'
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Basecalling software: RTA
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: version: 1.14 (build 23)
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Original use-bases mask: y151,y6,y151
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Guessed use-bases mask: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy,yyyyyy,yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
Creating directory '/home/username/fastq/Project_000000000-A1CDK/Sample_lane1'
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Read 1: length = 151: mask = yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Read 2: length = 6: mask = yyyyyy
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Read 3: length = 151: mask = yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
[2012-09-18 12:06:26] [configureBclToFastq.pl] INFO: Running self tests: 'make self_test'
[2012-09-18 12:06:27] [configureBclToFastq.pl] INFO: Running self tests on /home/username/fastq completed with no problems
Note that there is something about adding one cycle to the expected number of bases, when working with MiSeq data. It has something to do with the way the cycles are reported back from the machine. Basically, it means that if you are expecting 150 bp reads, you need to put in "y151". Illumina can help you understand this, maybe. Give them a call.
4) Go to the newly create fastq directory and type make (this is slow).
cd fastq
make
5) Your new fastq files should be in the "Project" directory (in my case, "Project_000000000-A1CDK"):
$ ls -l Project_000000000-A1CDK/Sample_lane1/
total 1523748
-rw-rw-r-- 1 username username 427960158 Sep 18 12:08 lane1_NoIndex_L001_R1_001.fastq.gz
-rw-rw-r-- 1 username username 342223846 Sep 18 12:09 lane1_NoIndex_L001_R1_002.fastq.gz
-rw-rw-r-- 1 username username 44789925 Sep 18 12:09 lane1_NoIndex_L001_R2_001.fastq.gz
-rw-rw-r-- 1 username username 35997119 Sep 18 12:10 lane1_NoIndex_L001_R2_002.fastq.gz
-rw-rw-r-- 1 username username 392045984 Sep 18 12:11 lane1_NoIndex_L001_R3_001.fastq.gz
-rw-rw-r-- 1 username username 315694744 Sep 18 12:12 lane1_NoIndex_L001_R3_002.fastq.gz
6) To demultiplex fastq files with a separate fastq file for the barcode read, refer to my recent post: How to demultiplex fastq files with a dedicated, separate barcode file.
Wednesday, September 12, 2012
join command requires alphanumeric sort
You may already know that the linux join command requires sorted fields to work properly -- after all, it tells you so right in the man page:
Important: FILE1 and FILE2 must be sorted on the join fields.
Important: FILE1 and FILE2 must be sorted on the join fields.
But I just found out the hard way that it must be an alphanumeric sort, not a numeric sort.
Lesson learned.
Tuesday, September 4, 2012
Convert FASTQ to FASTA using sed command
Easiest possible solution, and I believe significantly faster than awk-based methods:
sed -n '1~4s/^@/>/p;2~4p'
The only assumption made is that each read occupies exactly 4 lines in the FASTQ file, but that seems pretty safe, in my experience.
If you find this helpful, go up-vote my answer in StackOverflow.
Wednesday, April 11, 2012
Convert FASTA file RNA to DNA using sed on the linux command line
sed '/^[^>]/ y/uU/tT/' uracil.fasta > thymine.fasta
Piece of cake.
Wednesday, April 4, 2012
lftp mirror: best way to download an entire directory from a FTP server
As much as I hate FTP, there situations where one must use it. I finally have settled on the best way to get an entire directory, with recursive file retrieval:
lftp -e "mirror dir1/subdir2/subdir3" -u username,password ftphost.domain.com
Of course, FTP must die, but until then, I am glad there is lftp.
Subscribe to:
Posts (Atom)