Tuesday, October 23, 2012

Amazed by RStudio!

RStudio (http://www.rstudio.com/) has to be one of the coolest pieces of software I have installed in recent memory.

RStudio provides a very well-planned and well-executed IDE for R.  This alone would be a reason to sing Hosanna.


But it gets better!!!  The Server package (deb and rpm packages available) allows you access to this same interface via a webserver, so you can run R on a beefy computational server from any web browser.  And it seems to work every bit as nicely as the locally installed IDE.

The user accounts are integrated with the linux user accounts, which is very nice.  I even got it to work (after reading this post) in a kerberos situation by overwriting the "rstudio" pam file that the RStudio package provides with the login one.  (You might want to back up the rstudio one first.)

# cd /etc/pam.d/
# cp login rstudio

Well done and thank you to the folks at the RStudio company.

Tuesday, October 16, 2012

How to demultiplex fastq files with a dedicated, separate barcode file

Oh, Barcodes, How do I obtain thee?  Let me count the ways:
1) Internal to one of the reads
2) Printed in the header
3) Encoded in the header
4) In a separate file

There are many tools out there to deal with multiplexed read data.
fastx_barcode_splitter.pl (from FASTX_Toolkit)
fastq-multx (from ea-utils)
split_libraries_fastq.py (from QIIME)

As far as I know, none of the above adequately deals with case (4).  The QIIME tool comes closest, but appears to turn the fastq into a fasta file in the process (?).

Here is a one-liner homebrew solution to this situation:  just tack the index reads onto the front of read1, and then use an off-the-shelf tool (I like the FASTX_Toolkit one above) to demultiplex.  Like this:

paste -d '' <(echo; sed -n '1,${n;p;}' R2.fq | sed G) R1.fq | sed '/^$/d' | fastx_barcode_splitter.pl --bol --bcfile barcodes.txt --prefix debarcoded

I am assuming that R2.fq is the barcode fastq file, and R1.fq is the Read 1 file.  The command above takes the 2nd and 4th lines from the barcode fastq file, spaces them out so they line up with the corresponding lines of R1, and pastes the two streams together with no delimiter (note: '' is two single quotes, not one double quote), prepending the barcode bases and qualities onto Read 1.  A final sed command removes the trailing blank lines.  It ain't pretty, but it works.

The fancy <(  ) construct is a bash feature called "process substitution".  If you're not using bash, it won't work.
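To see what the pipeline actually does, here is a toy run with hypothetical one-read files (GNU sed and bash assumed):

```shell
# R2.fq holds the barcode read; R1.fq holds the biological read.
printf '@r1\nAAA\n+\nIII\n' > R2.fq
printf '@r1\nGGGGGG\n+\nJJJJJJ\n' > R1.fq
# The process substitution yields: blank, AAA, blank, III, blank --
# lining the barcode bases/qualities up with lines 2 and 4 of R1.
paste -d '' <(echo; sed -n '1,${n;p;}' R2.fq | sed G) R1.fq | sed '/^$/d'
# @r1
# AAAGGGGGG
# +
# IIIJJJJJJ
```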

Hope this helps someone.

(edit:  I simplified the initial sed command on 2/1/2013, so let me know if it somehow broke, but it should still work fine.)

Thursday, October 11, 2012

1000 Genomes Project Accessibility Mask (BED file)


The 1000 Genomes Project publishes a .bed file mask ("Strict Accessibility Mask") for regions of the human genome where it is difficult to place reads due to repetitive sequence.  Follow the link:


Tuesday, September 25, 2012

bwa and the -n option (Maximum number of alignments)

When you ask bwa to report multiple hits, it has a very strange behavior.

I am addressing the following behavior, and especially the last sentence of the -n option description (this is from the bwa manual):

samse

bwa samse [-n maxOcc] <prefix> <in.sai> <in.fq> > <out.sam>

Generate alignments in the SAM format given single-end reads. Repetitive hits will be randomly chosen.
OPTIONS:
-n INT   Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]


sampe

bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N maxHitDis] [-P] <prefix> <in1.sai> <in2.sai> <in1.fq> <in2.fq> > <out.sam>

Generate alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly.
OPTIONS:
-n INT   Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]


Wait, what?  So you're saying, if there are 10 good hits in a given index, and I ask for 10, I will get 10 (1 primary plus 9 in the XA tag), but if I ask for 11, I get only the 1 primary?  Even worse, it pretty much seems like you have no way of knowing when multiple hits are being omitted due to this pathological behavior.

I would think that a much more reasonable approach would be to report up to n hits.

I almost couldn't believe this could be true, so I ran "bwa sampe" on the same sai files twenty times, varying only the -n number (producing bwa_n1.sam through bwa_n20.sam).  For each run I then tallied, for every possible count, how many reads had exactly that many hits reported in the XA tag of the resulting sam file.



Every row showed only two distinct counts across the twenty runs (bwa_n1.sam through bwa_n20.sam): one while -n was below the row's XA-hit number, and another once -n reached it, so the table condenses to:

XA hits    reads (runs with -n < XA hits)    reads (runs with -n >= XA hits)
 0                   22                                  22
 1                10511                               10511
 2                  401                                3821
 3                 6919                               10357
 4                  503                                1638
 5                  270                               12973
 6                  617                               36680
 7                   85                                 515
 8                  105                                2281
 9                   14                                 108
10                   26                                 164
11                    0                                 125
12                    0                                1002
13                    0                                  75
14                    0                                 866
15                    0                                  49
16                    0                                1072
17                    0                                1321
18                    0                                 161
19                    0                                  58
20                    0                                 163


Sure enough, looking at 10 < n < 20, you can see it works pretty much like the manual says it does... which seems like a shame.  Considering the recently added support for indices with 64-bit addressing, bwa could be really useful in metagenomic environmental sampling... but not if it pathologically erases multiple hits!

There is something else weird going on for n < 10 — my guess is that it ALWAYS reports up to 10 additional perfect hits, if present, regardless of the n number.
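For reference, the per-read XA tallies above can be produced with a short pipeline like this one (a sketch; the sam file name is hypothetical, and it assumes the standard XA:Z: format of semicolon-terminated entries):

```shell
# Tally how many reads report each number of alternative hits in the XA tag.
# Each XA entry ("chr,pos,CIGAR,NM") ends in ';', so splitting on ';' yields
# one extra empty trailing field -- NF-1 is therefore the entry count.
grep -o 'XA:Z:[^[:space:]]*' bwa_n5.sam \
    | awk -F';' '{print NF-1}' \
    | sort -n | uniq -c
```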


Tuesday, September 18, 2012

Using CASAVA to extract raw fastq files from Illumina BCL files

Here is how you can extract raw fastq files from Illumina BCL data.

This will give you raw fastq files.  No demultiplexing.  Just the reads, and nothin' but the reads.

Read 1 is generally the forward read, Read 2 is generally the index read, and Read 3 is generally the reverse read.

You might want to do this if you want to see the barcode read quality scores, which you may want to consider in order to do more accurate demultiplexing.  Examining these quality scores may also help you understand "barcode bleed" (aka Illumina's dirty little secret).

Here is what I did:

1) get a copy of CASAVA 1.8.  Illumina technical support can hook you up.  I'm using CASAVA-1.8.2.  [edit: seems you can download it from here.]

2) Get the entire run directory, which should look something like this (this is from MiSeq data):

120822_M00609_0035_A000000000-A1CDK
|-- Config
|-- Data
|   |-- Intensities
|   |   |-- BaseCalls
|   |   |   |-- L001
|   |   |   |   |-- C1.1   #
|   |   |   |   |-- C10.1  # the BCL files live here
|   |   |   |   |-- C100.1 # (one per cycle per tile)
|   |   |   |   |-- and so on
|   |   |   |-- Matrix
|   |   |   `-- Phasing
|   |   |-- L001
|   |   |   |-- C1.1       #
|   |   |   |-- C10.1      # the FWHMMap files live here
|   |   |   |-- C100.1     # (one per cycle per tile per base)
|   |   |   |-- and so on
|   |   `-- Offsets
|   |-- RTALogs
|   `-- TileStatus
|-- Images
|   `-- Focus
|       `-- L001
|           |-- C1.1       #
|           |-- C10.1      # the TIF images live here
|           |-- C100.1     # (one per cycle per tile)
|           |-- and so on
|-- InterOp
|-- Logs
|-- Recipe
`-- Thumbnail_Images
    `-- L001
        |-- C1.1
        |-- C10.1
        |-- C100.1

        |-- and so on

1250 directories

3) Delete the SampleSheet.csv file -- really!  In my experience, and for my purposes, I just want to get the raw fastq files.  I would actually prefer it if the standard Illumina pipeline ALWAYS produced an unprocessed, undemultiplexed FASTQ file for each read (R1 and the index read if single end, or R1, R2, and the index read if paired end).  I find the SampleSheet to be fussy, often this data is not available to me when I am trying to extract the sequences, and anyway, I would like to see empirically what barcodes are present in the raw reads.  So that is why I recommend doing this without a SampleSheet file.

4) Run the configureBclToFastq.pl script to generate the make file (this is fast).  There are two tricks to getting this to work:  you must provide the Data/Intensities/BaseCalls directory, and you have to give it a "bases-mask".  The bases mask defines which cycles get assigned to which reads.  The "y" in "y151" means "Yes, I want the first 151 cycles", then "Yes, I want the next 6 cycles", and "Yes, I want the next 151 cycles".  I think the standard way to run CASAVA would be "y151,I6,y151", which hands the index read to the CASAVA demultiplexing pipeline (in other words, you don't actually get to see the index reads).


$ configureBclToFastq.pl --input-dir 120822_M00609_0035_A000000000-A1CDK/Data/Intensities/BaseCalls/ --output-dir fastq --force --use-bases-mask y151,y6,y151
[2012-09-18 12:06:24]   [configureBclToFastq.pl]        INFO: Creating directory '/home/username/fastq'
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO: Basecalling software: RTA
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO:              version: 1.14 (build 23)
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO: Original use-bases mask: y151,y6,y151
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO: Guessed use-bases mask: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy,yyyyyy,yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
Creating directory '/home/username/fastq/Project_000000000-A1CDK/Sample_lane1'
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO: Read 1: length = 151: mask = yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO: Read 2: length = 6: mask = yyyyyy
[2012-09-18 12:06:25]   [configureBclToFastq.pl]        INFO: Read 3: length = 151: mask = yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
[2012-09-18 12:06:26]   [configureBclToFastq.pl]        INFO: Running self tests: 'make self_test'
[2012-09-18 12:06:27]   [configureBclToFastq.pl]        INFO: Running self tests on /home/username/fastq completed with no problems

Note that there is something about adding one cycle to the expected number of bases, when working with MiSeq data.  It has something to do with the way the cycles are reported back from the machine.  Basically, it means that if you are expecting 150 bp reads, you need to put in "y151".  Illumina can help you understand this, maybe.  Give them a call.


5) Go to the newly created fastq directory and type make (this is slow).

cd fastq
make

6) Your new fastq files should be in the "Project" directory (in my case, "Project_000000000-A1CDK"):


$ ls -l Project_000000000-A1CDK/Sample_lane1/
total 1523748
-rw-rw-r-- 1 username username 427960158 Sep 18 12:08 lane1_NoIndex_L001_R1_001.fastq.gz
-rw-rw-r-- 1 username username 342223846 Sep 18 12:09 lane1_NoIndex_L001_R1_002.fastq.gz
-rw-rw-r-- 1 username username  44789925 Sep 18 12:09 lane1_NoIndex_L001_R2_001.fastq.gz
-rw-rw-r-- 1 username username  35997119 Sep 18 12:10 lane1_NoIndex_L001_R2_002.fastq.gz
-rw-rw-r-- 1 username username 392045984 Sep 18 12:11 lane1_NoIndex_L001_R3_001.fastq.gz
-rw-rw-r-- 1 username username 315694744 Sep 18 12:12 lane1_NoIndex_L001_R3_002.fastq.gz

7) To demultiplex fastq files with a separate fastq file for the barcode read, refer to my recent post:  How to demultiplex fastq files with a dedicated, separate barcode file.

Wednesday, September 12, 2012

join command requires alphanumeric sort

You may already know that the linux join command requires sorted fields to work properly -- after all, it tells you so right in the man page:

Important: FILE1 and FILE2 must be sorted on the join fields.

But I just found out the hard way that it must be an alphanumeric sort, not a numeric sort.
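A toy example (hypothetical file names; bash assumed for the process substitution) shows the fix: re-sort both inputs lexicographically on the join field first:

```shell
# Keys sorted numerically: 2 < 10.  But join expects "10" < "2".
printf '2 b\n10 a\n' > nums.txt
printf '10 x\n2 y\n' > other.txt
# Lexicographic re-sort of both inputs makes join behave:
join <(sort nums.txt) <(sort other.txt)
# 10 a x
# 2 b y
```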

Lesson learned.

Tuesday, September 4, 2012

Convert FASTQ to FASTA using sed command

Easiest possible solution, and I believe significantly faster than awk-based methods:

sed -n '1~4s/^@/>/p;2~4p' 


The only assumption made is that each read occupies exactly 4 lines in the FASTQ file, but that seems pretty safe, in my experience.
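For example, on a toy file (hypothetical name; note that the 1~4 step address is a GNU sed extension, so this won't work with BSD/macOS sed):

```shell
printf '@read1\nACGT\n+\nIIII\n@read2\nTTTT\n+\nIIII\n' > reads.fastq
# Print every 1st-of-4 line with its leading @ replaced by >,
# and every 2nd-of-4 (sequence) line; skip the + and quality lines.
sed -n '1~4s/^@/>/p;2~4p' reads.fastq
# >read1
# ACGT
# >read2
# TTTT
```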

If you find this helpful, go up-vote my answer in StackOverflow.

Wednesday, April 11, 2012

Convert FASTA file RNA to DNA using sed on the linux command line

sed '/^[^>]/ y/uU/tT/' uracil.fasta > thymine.fasta 
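A quick sanity check on a toy file (hypothetical name):

```shell
printf '>seq1\nAUGGCU\n' > uracil.fasta
# Transliterate u->t and U->T, but only on lines that don't start with '>',
# so sequence names containing 'u' are left alone.
sed '/^[^>]/ y/uU/tT/' uracil.fasta
# >seq1
# ATGGCT
```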



Piece of cake.

Wednesday, April 4, 2012

lftp mirror: best way to download an entire directory from a FTP server

As much as I hate FTP, there are situations where one must use it.  I have finally settled on the best way to get an entire directory, with recursive file retrieval:

lftp -e "mirror dir1/subdir2/subdir3" -u username,password ftphost.domain.com

Of course, FTP must die, but until then, I am glad there is lftp.

Tuesday, March 27, 2012

GNU parallel -- the best thing since sliced bread


When I first found GNU parallel last fall, I thought it was the bee's knees. So well conceived, so well executed, so well documented... so generous to the user, and most of all, so darn handy.


I've been using it extensively with the ::: notation to feed data files into a pipeline in parallel, and it does exactly what I want it to do.

But today it goes to a whole new level, as I realize there is a whole other side to the tool: the ability to feed parts of a single input file into a pipeline in parallel.

For example, BLAST can be effectively (embarrassingly) parallelized as follows:
cat temp.fa | parallel -j 10 --block 100k --recstart '>' --pipe blastn -evalue 0.01 -outfmt 6 -db refseq_genomic -query - > result.txt

The magic is in the pipe option.  Of course, we've all written blast parallelizers, but I've never seen a solution as neat and simple as this one liner!

This is from Michael Kuhn's excellent post on the subject.  Thanks!

Monday, March 19, 2012

Of bash and streams and pipes - One Tee To Rule Them All

There is a great answer on stack overflow showing how to feed one output stream into multiple processes.  It is a little bit of a hack of the excellent tool tee, which I've been using for some time.  Normally, tee takes stdin and writes it to a file, while passing it along to stdout.  This solution provided on s.o. shows how to hack this with the bash trick >(process) -- so as far as tee knows, it is writing to 2 or more files, but the "files" are actually background processes.

This syntax is bash-dependent, so won't work in sh, and this will likely come up if you're using system() in, say,  perl.  There is another excellent answer on stack overflow on how to deal with this.
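Here is a minimal sketch of the tee trick (bash only; output file names are hypothetical).  One caveat: the >(...) processes run asynchronously, so a script may need to pause before reading the files they write:

```shell
# One stream, two consumers: grep keeps matching lines, wc counts them all.
seq 1 5 | tee >(grep 3 > matched.txt) >(wc -l > count.txt) > /dev/null
sleep 1   # crude: give the background process substitutions time to finish
cat matched.txt         # 3
tr -d ' ' < count.txt   # 5
```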

Side note:  Using the <(command) trick, my post from March 16 can now be written:

paste <(grep -v "#" hsa.gff | cut -f 1,4,5,9) <(grep -v "#" hsa.gff | cut -f 6,7) | sed s/^/chr/ >mirbase18.bed

 

Friday, March 16, 2012

mirbase gff annotations in bed format

The mirbase.org folks in Manchester are doing a great job, I think, but they only release their annotations in gff format.  I think the following will convert that into an acceptable bed-format file:
## get mirbase18 annotation and convert to a bed like format
wget ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff
grep -v "#" hsa.gff | cut -f 1,4,5,9 >mirbase18.temp1
grep -v "#" hsa.gff | cut -f 6,7 >mirbase18.temp2
paste mirbase18.temp1 mirbase18.temp2 | sed s/^/chr/ >mirbase18.bed

Extra credit: Is there a clever shell trick to piping two separate stdins into paste, thus avoiding the temp files?

Thursday, March 8, 2012

2012 NAR Database Summary (Category List)

Bioinformatic databases grow like kudzu, so I'll no doubt be referring to this directory in the coming year.

http://www.oxfordjournals.org/nar/database/c/

 

How to get the entire recursive directory structure of an FTP site using wget

How can I use wget to get a recursive directory listing of an entire FTP site?

yes, wget --no-remove-listing ftp://myftpserver/ftpdirectory/ will get and retain a directory listing for a single directory (which was discussed here)

but, wget -r --no-remove-listing ftp://myftpserver/ftpdirectory/ will recurse and download an entire site

so, if you want to get the recursive directory structure, but not download the entire site, try wget -r --no-remove-listing --spider ftp://myftpserver/ftpdirectory/

 

Thursday, February 16, 2012

rRNA genes in the human genome

Larry Moran has an excellent discussion of the organization of rRNA genes in the human genome.  It's an old post, but a good one.

If you're involved in RNA-Seq, you should read it.  Dr. Moran's discussion makes it clear why it is necessary to map reads to a synthetic rRNA-ome before doing full-genome alignment -- if you want to accurately count, or filter out, rRNA reads (which can predominate in RNA-seq datasets).