Friday, December 7, 2012
Illumina finally publishes their "proprietary" oligo sequences
http://support.illumina.com/downloads/illumina_adapter_sequences_letter.ilmn
Tuesday, October 23, 2012
Amazed by RStudio!
Tuesday, October 16, 2012
How to demultiplex fastq files with a dedicated, separate barcode file
Barcode (index) reads can come to you in one of several ways:
1) Internal to one of the reads
2) Printed in the header
3) Encoded in the header
4) In a separate file
There are many tools out there to deal with multiplexed read data.
fastx_barcode_splitter.pl (from FASTX_Toolkit)
fastq-multx (from ea-utils)
split_libraries_fastq.py (from QIIME)
As far as I know, none of the above adequately deals with case (4). The QIIME tool does handle a separate barcode file, but it appears to turn the FASTQ into a FASTA file in the process.
Here is a one-liner homebrew solution to this situation: just tack the index reads onto the front of read1, and then use an off-the-shelf tool (I like the FASTX_Toolkit one above) to demultiplex. Like this:
paste -d '' <(echo; sed -n '1,${n;p;}' R2.fq | sed G) R1.fq | sed '/^$/d' | fastx_barcode_splitter.pl --bol --bcfile barcodes.txt --prefix debarcoded
The fancy <( ) construct is a bash feature called "process substitution". If you're not using bash, it won't work.
Hope this helps someone.
(edit: I simplified the initial sed command on 2/1/2013, so let me know if it somehow broke, but it should still work fine.)
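To see what the one-liner actually does, here is a toy run on a single hypothetical read pair (file names and sequences invented for illustration). I use temp files and the portable `-d '\0'` empty delimiter instead of process substitution, so this version also works in plain sh:

```shell
# Build a 1-read toy example (hypothetical data).
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' > R1.fq   # the biological read
printf '@read1\nTTAGGC\n+\nJJJJJJ\n'     > R2.fq   # the 6 bp index read

# Keep only the sequence and quality lines of R2, blank-padded with sed G
# so they line up with R1's four-line records, then glue them onto the front.
sed -n '1,${n;p;}' R2.fq | sed G > R2.stream
(echo; cat R2.stream) | paste -d '\0' - R1.fq | sed '/^$/d' > combined.fq
cat combined.fq
```

The result is a normal four-line FASTQ record whose sequence (and quality string) starts with the barcode read, ready for fastx_barcode_splitter.pl with --bol.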
Thursday, October 11, 2012
1000 Genomes Project Accessibility Mask (BED file)
Tuesday, September 25, 2012
bwa and the -n option (Maximum number of alignments)
I am addressing the following behavior of the -n option, described in the bwa manual -- in particular the claim that when a read has more hits than the -n cutoff, the extra hits are dropped entirely:
samse

    bwa samse [-n maxOcc]

    Generate alignments in the SAM format given single-end reads. Repetitive hits will be randomly chosen.

    OPTIONS:

    -n INT    Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]

sampe

    bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N maxHitDis] [-P]

    Generate alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly.

    OPTIONS:

    -n INT    Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]
Wait, what? So you're saying, if there are 10 good hits in a given index, and I ask for 10, I will get 10 (1 primary plus 9 in the XA tag), but if I ask for 11, I get only the 1 primary? Even worse, it pretty much seems like you have no way of knowing when multiple hits are being omitted due to this pathological behavior.
I would think that a much more reasonable approach would be to report up to n hits.
I almost couldn't believe this could be true, so I ran the "bwa sampe" command repeatedly on the same sai files, varying only the -n value (one output SAM per value; these are the columns of the table below). For each read, I counted the number of alternative hits listed after its XA tag (the rows of the table below); each cell shows how many reads had exactly that number of reported extra hits.
The 20 columns of the original table (bwa_n1.sam through bwa_n20.sam) follow one simple pattern: for each row k, every cell holds one value when -n < k and a second, larger value when -n >= k. So the full 21 x 20 table condenses to:

    extra hits in XA (k) | reads when -n < k | reads when -n >= k
                       0 |                 - |                 22
                       1 |                 - |              10511
                       2 |               401 |               3821
                       3 |              6919 |              10357
                       4 |               503 |               1638
                       5 |               270 |              12973
                       6 |               617 |              36680
                       7 |                85 |                515
                       8 |               105 |               2281
                       9 |                14 |                108
                      10 |                26 |                164
                      11 |                 0 |                125
                      12 |                 0 |               1002
                      13 |                 0 |                 75
                      14 |                 0 |                866
                      15 |                 0 |                 49
                      16 |                 0 |               1072
                      17 |                 0 |               1321
                      18 |                 0 |                161
                      19 |                 0 |                 58
                      20 |                 0 |                163
Sure enough, looking at 10 < n < 20, you can see it works pretty much like the manual says it does... which seems like a shame. Considering the recently added support for indices with 64-bit addressing, bwa could be really useful in metagenomic environmental sampling... but not if it pathologically erases multiple hits!
There is something else weird going on for n < 10 — my guess is that it ALWAYS reports up to 10 additional perfect hits, if present, regardless of the n number.
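For reference, here is the kind of per-read tally I used, sketched on a toy SAM fragment (read names, coordinates, and file names invented): XA:Z: entries are semicolon-terminated, so the number of ';'-separated fields minus one gives each read's count of alternative hits.

```shell
# Two fake alignment lines: r1 has 2 alternative hits in its XA:Z: tag, r2 has none.
printf 'r1\t0\tchr1\t100\t37\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tXA:Z:chr2,+200,8M,0;chr3,-300,8M,1;\n'  > toy.sam
printf 'r2\t0\tchr1\t150\t37\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n' >> toy.sam

# Scan the optional-tag fields (column 12 onward) for XA:Z: and count its entries.
awk '{n = 0
      for (i = 12; i <= NF; i++)
          if ($i ~ /^XA:Z:/) n = split($i, a, ";") - 1
      print n}' toy.sam > xa_counts.txt

# Histogram: how many reads had exactly k extra hits.
sort -n xa_counts.txt | uniq -c
```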
Tuesday, September 18, 2012
Using CASAVA to extract raw fastq files from Illumina BCL files
This will give you raw fastq files. No demultiplexing. Just the reads, and nothin' but the reads.
Read 1 is generally the forward read, Read 2 is generally the index read, and Read 3 is generally the reverse read.
You might want to do this if you want to see the barcode-read quality scores, which you can use to do a more accurate demultiplexing. Examining these quality scores may also help you understand "barcode bleed" (aka, Illumina's dirty little secret).
Here is what I did:
1) get a copy of CASAVA 1.8. Illumina technical support can hook you up. I'm using CASAVA-1.8.2. [edit: seems you can download it from here.]
2) Get the entire run directory, which should look something like this (this is from MiSeq data):
120822_M00609_0035_A000000000-A1CDK
|-- Config
|-- Data
| |-- Intensities
| | |-- BaseCalls
| | | |-- L001
| | | | |-- C1.1 #
| | | | |-- C10.1 # the BCL files live here
| | | | |-- C100.1 # (one per cycle per tile)
| | | | |-- and so on
| | | |-- Matrix
| | | `-- Phasing
| | |-- L001
| | | |-- C1.1 #
| | | |-- C10.1 # the FWHMMap files live here
| | | |-- C100.1 # (one per cycle per tile per base)
| | | |-- and so on
| | `-- Offsets
| |-- RTALogs
| `-- TileStatus
|-- Images
| `-- Focus
| `-- L001
| |-- C1.1 #
| |-- C10.1 # the TIF images live here
| |-- C100.1 # (one per cycle per tile)
| |-- and so on
|-- InterOp
|-- Logs
|-- Recipe
`-- Thumbnail_Images
`-- L001
|-- C1.1
|-- C10.1
|-- C100.1
|-- and so on
3) Delete the SampleSheet.csv file -- really! In my experience, and for my purposes, I just want the raw fastq files. I would actually prefer it if the standard Illumina pipeline ALWAYS produced an unprocessed, undemultiplexed FASTQ file for each read (R1 plus the index read if single-end; R1, R2, plus the index read if paired-end). I find the SampleSheet fussy, its information is often not available to me when I am trying to extract the sequences, and in any case I would like to see empirically which barcodes are present in the raw reads. That is why I recommend doing this without a SampleSheet file.
$ configureBclToFastq.pl --input-dir 120822_M00609_0035_A000000000-A1CDK/Data/Intensities/BaseCalls/ --output-dir fastq --force --use-bases-mask y151,y6,y151
[2012-09-18 12:06:24] [configureBclToFastq.pl] INFO: Creating directory '/home/username/fastq'
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Basecalling software: RTA
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: version: 1.14 (build 23)
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Original use-bases mask: y151,y6,y151
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Guessed use-bases mask: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy,yyyyyy,yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
Creating directory '/home/username/fastq/Project_000000000-A1CDK/Sample_lane1'
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Read 1: length = 151: mask = yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Read 2: length = 6: mask = yyyyyy
[2012-09-18 12:06:25] [configureBclToFastq.pl] INFO: Read 3: length = 151: mask = yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
[2012-09-18 12:06:26] [configureBclToFastq.pl] INFO: Running self tests: 'make self_test'
[2012-09-18 12:06:27] [configureBclToFastq.pl] INFO: Running self tests on /home/username/fastq completed with no problems
Note that with MiSeq data you need to add one cycle to the expected number of bases; it has something to do with the way the cycles are reported back from the machine. Basically, if you are expecting 150 bp reads, you need to put in "y151". Illumina can help you understand this, maybe. Give them a call.
4) Go to the newly created fastq directory and type make (this is slow):
cd fastq
make
5) Your new fastq files should be in the "Project" directory (in my case, "Project_000000000-A1CDK"):
$ ls -l Project_000000000-A1CDK/Sample_lane1/
total 1523748
-rw-rw-r-- 1 username username 427960158 Sep 18 12:08 lane1_NoIndex_L001_R1_001.fastq.gz
-rw-rw-r-- 1 username username 342223846 Sep 18 12:09 lane1_NoIndex_L001_R1_002.fastq.gz
-rw-rw-r-- 1 username username 44789925 Sep 18 12:09 lane1_NoIndex_L001_R2_001.fastq.gz
-rw-rw-r-- 1 username username 35997119 Sep 18 12:10 lane1_NoIndex_L001_R2_002.fastq.gz
-rw-rw-r-- 1 username username 392045984 Sep 18 12:11 lane1_NoIndex_L001_R3_001.fastq.gz
-rw-rw-r-- 1 username username 315694744 Sep 18 12:12 lane1_NoIndex_L001_R3_002.fastq.gz
6) To demultiplex fastq files with a separate fastq file for the barcode read, refer to my recent post: How to demultiplex fastq files with a dedicated, separate barcode file.
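As an aside, this is also the point where you can look empirically at which barcodes actually showed up: pull every sequence line out of the index-read file and tally. A sketch, assuming GNU sed's `2~4` step address and the R2 file naming from step 5 -- shown here on a tiny invented stand-in file so it runs anywhere:

```shell
# Toy index-read file standing in for lane1_NoIndex_L001_R2_001.fastq.gz
# (three invented 6 bp index reads, two of them identical).
printf '@a\nACGTAC\n+\nIIIIII\n@b\nACGTAC\n+\nIIIIII\n@c\nTTAGGC\n+\nIIIIII\n' \
    | gzip > R2_toy.fastq.gz

# Sequence lines are lines 2, 6, 10, ...; GNU sed's 2~4 address grabs them.
gzip -dc R2_toy.fastq.gz | sed -n '2~4p' | sort | uniq -c | sort -rn > barcode_tally.txt
cat barcode_tally.txt
```

On real data, the top few lines of this tally should be your expected barcodes; everything below them is sequencing error and bleed.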
Wednesday, September 12, 2012
join command requires alphanumeric sort
Important: FILE1 and FILE2 must be sorted on the join fields.
But I just found out the hard way that this must be a lexicographic ("alphanumeric") sort -- plain sort -- not a numeric sort (sort -n).
Lesson learned.
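A minimal demonstration of the trap, with invented keys: lexicographic order on numeric keys is 1, 10, 2 -- not 1, 2, 10 -- and join only accepts the former.

```shell
# Sort both files lexicographically on the key (the default sort),
# NOT with sort -n, before handing them to join.
printf '10\tten\n2\ttwo\n1\tone\n' | sort -k1,1 > f1.txt
printf '10\tX\n2\tY\n1\tZ\n'       | sort -k1,1 > f2.txt
join f1.txt f2.txt > joined.txt
cat joined.txt     # keys come out as 1, 10, 2 -- lexicographic order
```

Had the files been sorted with sort -n (order 1, 2, 10), join would have complained that the input is not sorted and/or silently dropped pairings.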
Tuesday, September 4, 2012
Convert FASTQ to FASTA using sed command
sed -n '1~4s/^@/>/p;2~4p'
The only assumption made is that each read occupies exactly 4 lines in the FASTQ file, but that seems pretty safe, in my experience.
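A quick check on an invented two-read file (note that the 1~4 and 2~4 step addresses are a GNU sed extension):

```shell
# Two fake FASTQ records, 4 lines each.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGG\n+\nJJJJ\n' > toy.fq

# Line 1 of every record: turn @header into >header and print;
# line 2 of every record: print the sequence.
sed -n '1~4s/^@/>/p;2~4p' toy.fq > toy.fa
cat toy.fa
```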
Wednesday, April 11, 2012
Convert FASTA file RNA to DNA using sed on the linux command line
sed '/^[^>]/ y/uU/tT/' uracil.fasta > thymine.fasta
Piece of cake.
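For example, on an invented one-record file -- the /^[^>]/ address restricts the u-to-t transliteration to sequence lines, so any u's in the header text are left alone:

```shell
# One fake RNA FASTA record; note the 'u' in the header stays untouched.
printf '>uracil_seq\nAUGuucGAU\n' > uracil.fasta
sed '/^[^>]/ y/uU/tT/' uracil.fasta > thymine.fasta
cat thymine.fasta
```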
Wednesday, April 4, 2012
lftp mirror: best way to download an entire directory from a FTP server
As much as I hate FTP, there are situations where one must use it. I finally settled on the best way to get an entire directory, with recursive file retrieval:
lftp -e "mirror dir1/subdir2/subdir3" -u username,password ftphost.domain.com
Of course, FTP must die, but until then, I am glad there is lftp.
Tuesday, March 27, 2012
GNU parallel -- the best thing since sliced bread
When I first found GNU parallel last fall, I thought it was the bee's knees. So well conceived, so well executed, so well documented... so generous to the user, and most of all, so darn handy.
I've been using it extensively with the ::: notation to feed data files into a pipeline in parallel, and it does exactly what I want it to do.
But today it goes to a whole new level, as I realize there is a whole other side to the tool: the ability to feed parts of a single input file into a pipeline in parallel.
For example, BLAST can be effectively (embarrassingly) parallelized as follows:
cat temp.fa | parallel -j 10 --block 100k --recstart '>' --pipe blastn -evalue 0.01 -outfmt 6 -db refseq_genomic -query - > result.txt
The magic is in the --pipe option. Of course, we've all written BLAST parallelizers, but I've never seen a solution as neat and simple as this one-liner!
This is from Michael Kuhn's excellent post on the subject. Thanks!
Monday, March 19, 2012
Of bash and streams and pipes - One Tee To Rule Them All
There is a great answer on Stack Overflow showing how to feed one output stream into multiple processes. It is a little bit of a hack of the excellent tool tee, which I've been using for some time. Normally, tee takes stdin and writes it to a file, while passing it along to stdout. The solution provided on S.O. shows how to hack this with the bash trick >(process) -- so as far as tee knows, it is writing to 2 or more files, but the "files" are actually background processes.
This syntax is bash-dependent, so it won't work in sh -- which will likely come up if you're using system() in, say, Perl. There is another excellent answer on Stack Overflow on how to deal with this.
Side note: Using the <(command) trick, my post from March 16 can now be written:
paste <(grep -v "#" hsa.gff | cut -f 1,4,5,9) <(grep -v "#" hsa.gff | cut -f 6,7) | sed s/^/chr/ >mirbase18.bed
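A tiny self-contained demo of the tee trick (invented commands; requires bash for the >( ) syntax): one stream feeds two "files" that are really processes.

```shell
#!/bin/bash
# One input stream, two consumers: an uppercaser and a byte counter.
printf 'one tee to rule them all\n' \
    | tee >(tr 'a-z' 'A-Z' > upper.txt) >(wc -c > count.txt) > /dev/null

# The substituted processes run in the background; give them a moment
# to finish before reading their output.
sleep 1
cat upper.txt count.txt
```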
Friday, March 16, 2012
mirbase gff annotations in bed format
## get mirbase18 annotation and convert to a bed like format
wget ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff
grep -v "#" hsa.gff | cut -f 1,4,5,9 >mirbase18.temp1
grep -v "#" hsa.gff | cut -f 6,7 >mirbase18.temp2
paste mirbase18.temp1 mirbase18.temp2 | sed s/^/chr/ >mirbase18.bed
Extra credit: Is there a clever shell trick to piping two separate stdins into paste, thus avoiding the temp files?
Thursday, March 8, 2012
2012 NAR Database Summary (Category List)
Bioinformatic databases grow like kudzu, so I'll no doubt be referring to this directory in the coming year.
http://www.oxfordjournals.org/nar/database/c/
How to get the entire recursive directory structure of an FTP site using wget
How can I use wget to get a recursive directory listing of an entire FTP site?
yes, wget --no-remove-listing ftp://myftpserver/ftpdirectory/ will get and retain a directory listing for a single directory (which was discussed here)
but, wget -r --no-remove-listing ftp://myftpserver/ftpdirectory/ will recurse and download an entire site
so, if you want to get the recursive directory structure, but not download the entire site, try wget -r --no-remove-listing --spider ftp://myftpserver/ftpdirectory/
Thursday, February 16, 2012
rRNA genes in the human genome
If you're involved in RNA-Seq, you should read Dr. Moran's post. His discussion makes it clear why it is necessary to map reads to a synthetic rRNA-ome before doing full-genome alignment -- if you want to accurately count, or filter out, rRNA reads (which can predominate in RNA-Seq datasets).