BioCozy: 2011

Tuesday, December 20, 2011

rRNA genes in human genome

GRCh37/hg19

http://sandwalk.blogspot.com/2008/01/human-ribosomal-rna-genes.html

Wednesday, October 26, 2011

BWA on FPGA architecture

Convey Computing is offering a FPGA-based computational cluster that has been optimized for BWA.

Monday, October 3, 2011

I wanted to remove all sequences less than n long from a fastq file. This is a solution that uses sed "paragraphs" -- see my earlier post. It works... could probably be done more efficiently... but this works for me, for now.

Save this as a script, and then use in this manner:

./fastqlenfilt.sh 15 input.fastq >output.filtered.fastq

#!/bin/sh
sed -n '                                                                                                                                                     
 # thanks to http://www.grymoire.com/Unix/Sed.html                                                                                                           
 #                                                                                                                                                           
 # if matching description, check the paragraph                                                                                                              
 /^@/ b para                                                                                                                                                 
 # else add it to the hold buffer                                                                                                                            
 H                                                                                                                                                           
 # at end of file, check paragraph                                                                                                                           
 $ b para                                                                                                                                                    
 # now branch to end of script                                                                                                                               
 b                                                                                                                                                           
 # this is where a paragraph is checked for the pattern                                                                                                      
 :para                                                                                                                                                       
 # return the entire paragraph                                                                                                                               
 # into the pattern space                                                                                                                                    
 x                                                                                                                                                           
 # look for the pattern, if there - print                                                                                                                    
 # /'.*\n.{$1,}'/ p                                                                                                                                          
  /'.*\\n[ACGTURYKMSWBDHVNX]\\{$1,\\}'/ p                                                                                                                    
 ' $2

Thursday, September 22, 2011

Mac function keys working with debian / ubuntu terminal and byobu

A minor irritation, but as I have begun running longer jobs, and getting addicted to the excellent byobu adaptation of GNU screen, I've been frustrated that I can't access those easy function keys when I ssh in from my Mac.

I've pieced together a solution by reading this thread and this excellent resource.

Here is what works for me, using OS X 10.6, with Terminal declaring itself as "xterm-color": in Terminal.app, go to preferences, and replace the F1 thru F4 mappings to:

33[11~
33[12~
33[13~
33[14~

EDIT: Now I am using OS X 10.7, and my terminal is set to declare "xterm-256color", and what works for me now is the following:

33OP
33OQ
33OR
33OS

Where 33 is a <ctrl>[

Tuesday, September 13, 2011

Pull desired sequences out of a multiple FASTA file with regex pattern match

This simple problem stumped me for a while.

Say you want just the human sequences ("hsa") from the following multiple FASTA file:

>cel-mir-90 MI0000059 Caenorhabditis elegans miR-90 stem-loop
GGGCGCCAUUUCGAGCGGCUUUCAACGACGAUAUCAACCGACAACUCACACUUUUGCGUG
UUGAUAUGUUGUUUGAAUGCCCCUUGAAUUGGAUGCCA
>hsa-let-7a-1 MI0000060 Homo sapiens let-7a-1 stem-loop
UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCACUGGGAGAUAACUAU
ACAAUCUACUGUCUUUCCUA
>hsa-let-7a-2 MI0000061 Homo sapiens let-7a-2 stem-loop
AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGAUAACUGUACAGCCU
CCUAGCUUUCCU
>dme-mir-13b-2 MI0000135 Drosophila melanogaster miR-13b-2 stem-loop
UAUUAACGCGUCAAAAUGACUGUGAGCUAUGUGGAUUUGACUUCAUAUCACAGCCAUUUU
GACGAGUUUG
>dme-mir-14 MI0000136 Drosophila melanogaster miR-14 stem-loop
UGUGGGAGCGAGACGGGGACUCACUGUGCUUAUUAAAUAGUCAGUCUUUUUCUCUCUCCU
AUA
>mmu-let-7g MI0000137 Mus musculus let-7g stem-loop
CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGA
UAACUGUACAGGCCACUGCCUUGCCAGG
>hsa-mir-30d MI0000255 Homo sapiens miR-30d stem-loop
GUUGUUGUAAACAUCCCCGACUGGAAGCUGUAAGACACAGCUAAGCUUUCAGUCAGAUGU
UUGCUGCUAC
>mmu-mir-122 MI0000256 Mus musculus miR-122 stem-loop
AGCUGUGGAGUGUGACAAUGGUGUUUGUGUCCAAACCAUCAAACGCCAUUAUCACACUAA
AUAGCU

In other words, you want just these sequences:

>hsa-let-7a-1 MI0000060 Homo sapiens let-7a-1 stem-loop
UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCACUGGGAGAUAACUAU
ACAAUCUACUGUCUUUCCUA
>hsa-let-7a-2 MI0000061 Homo sapiens let-7a-2 stem-loop
AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGAUAACUGUACAGCCU
CCUAGCUUUCCU
>hsa-mir-30d MI0000255 Homo sapiens miR-30d stem-loop
GUUGUUGUAAACAUCCCCGACUGGAAGCUGUAAGACACAGCUAAGCUUUCAGUCAGAUGU
UUGCUGCUAC

I knew I could do it with a script, using if/then/else testing. I know some people would say I should make use of BioPerl Bio::SeqIO module. But I wanted to do it in the bash shell, for the sake of simplicity and learning a little more advanced usage of the amazing text and regex tools available in almost any vanilla linux distro.

This posting from Austin Matzko's blog looked really useful, but ultimately The Grymoire awed me with its comprehensiveness and clarity, in this case, for all things sed.

I made a slight modification of the supplied example (see the "Working with Multiple Lines" section) and I came up with a one-line solution:

sed -n '/^>/ b para; H; $ b para; b; :para; x; /'hsa'/ p' inputfile.fasta

But it is probably better read as a script:

#!/bin/sh
 sed -n '
 # thanks to http://www.grymoire.com/Unix/Sed.html
 #
 # if matching description, check the paragraph
 /^>/ b para
 # else add it to the hold buffer
 H
 # at end of file, check paragraph
 $ b para
 # now branch to end of script
 b
 # this is where a paragraph is checked for the pattern
 :para
 # return the entire paragraph
 # into the pattern space
 x
 # look for the pattern, if there - print
 /'$1'/ p
 ' $2

Friday, September 9, 2011

indespensible bioinformatic resources

Essentials tools for working with high-throughput sequencing data:

bowtie -- bowtie-bio.sourceforge.net/
my current NGS aligner of choice

samtools -- samtools.sourceforge.net/
essential. the lingua franca for working with the output of alignment

picard -- picard.sourceforge.net/
also essential for working with alignment files. very nicely complements samtools.

fastx toolkit -- hannonlab.cshl.edu/fastx_toolkit/
this looks like it will be handy... though not yet part of any of the ubuntu software repositories I regularly use, it does look like it is included in the upcoming Oneiric Ocelot 11.10 upgrade.

cutadapt -- code.google.com/p/cutadapt/
may be made obsolete by fastx toolkit

Other useful resources, not necessarily singularly focused on sequencing data:

Bioinformatics manual at UC Riverside -- manuals.bioinformatics.ucr.edu/
Includes lots of good stuff on R and HTS data analysis

Statistical Computing pages at UCLA -- www.ats.ucla.edu/stat/
Provides very useful examples for different stats packages, very well organized.

Monday, August 22, 2011

refactor in R

I use this little R function almost daily. It just takes a factor, drops the unused levels, and otherwise keeps the same ordering.

refactor <- function(x) {
  x <- factor(x, levels=levels(x)[levels(x) %in% x] )
  return(x)
}

File this under "commands that should be included in R."

Monday, August 8, 2011

Package search tools: "provided by" and "what provides"

Ubuntu community help forum has a nice page if you need to be reminded how to search files provided by a particular package.

dpkg -L will show you what files came with an installed package

apt-file, which needs to be installed first, will tell you what package provides a given file. Actually, it can also replicate the same thing that dpkg -L does, see below:

$ dpkg -L picard-tools
/.
/usr
/usr/bin
/usr/bin/picard-tools
/usr/share
/usr/share/java
/usr/share/java/picard-1.27.jar
/usr/share/picard-tools
/usr/share/picard-tools/explain_sam_flags.py
/usr/share/doc
/usr/share/doc/picard-tools
/usr/share/doc/picard-tools/copyright
/usr/share/doc/picard-tools/README.Debian
/usr/share/man
/usr/share/man/man1
/usr/share/man/man1/picard-tools.1.gz
/usr/share/java/picard.jar

$ apt-file search picard.jar
picard-tools: /usr/share/java/picard.jar

$ apt-file list picard-tools
picard-tools: /usr/bin/picard-tools
picard-tools: /usr/share/doc/picard-tools/README.Debian
picard-tools: /usr/share/doc/picard-tools/copyright
picard-tools: /usr/share/java/picard-1.27.jar
picard-tools: /usr/share/java/picard.jar
picard-tools: /usr/share/man/man1/picard-tools.1.gz
picard-tools: /usr/share/picard-tools/explain_sam_flags.py


Addendum:  If you need to know what repository provides a specific package, use apt-cache showpkg:
$ apt-cache showpkg picard-tools
Package: picard-tools
Versions: 
1.27-1 (/var/lib/apt/lists/us.archive.ubuntu.com_ubuntu_dists_natty_universe_binary-amd64_Packages) (/var/lib/dpkg/status)
 Description Language: 
                 File: /var/lib/apt/lists/us.archive.ubuntu.com_ubuntu_dists_natty_universe_binary-amd64_Packages
                  MD5: 5ece67d6a9fa35d5b4adc3567de3557b


Reverse Depends: 
  med-bio,picard-tools
  libsam-java,picard-tools
Dependencies: 
1.27-1 - openjdk-6-jre (16 (null)) java-runtime (0 (null)) libsam-java (2 1.27-1) python (0 (null)) r-base-core (0 (null)) 
Provides: 
1.27-1 - 
Reverse Provides:

Thursday, August 4, 2011

Bio::SeqIO very very slow

While it may be a convenient and flexible solution when reading in 10s or 100s or 1000s of sequences, the SeqIO module of BioPerl is just not a workable solution for reading in Next Gen Sequencing files.

Consider reading in 100,000 short reads from a fastq file:

It takes 30 seconds to do it the bioperl way:

use Bio::SeqIO;
my $seqin = Bio::SeqIO->new(-format => "fastq",-file => "infilename",);
while( $seq = $seqin->next_seq() ) { $seqhash{$seq->seq() } ++; }

But only 1/2 of a second to do it the home-made way (using basic file IO module):

use File::Util;
my $f = File::Util->new();
$ifh = $f->open_handle('file' => $infilename, 'mode' => 'read');
while(<$ifh>)
{
if($_ =~ /\@.*/)
{
$counter++;
$trigger = 1;
}
elsif($trigger == 1)
{
$_ =~ /(\w+)/;
$seq = $1;
$seqhash{$seq} ++;
$trigger = 0;
}
}

Not as elegant, but it works much faster!!

Thursday, May 26, 2011

PLANdbAffy - a resource for mapping Affymetrix probes

I just discovered PLANdbAffy, which is a handy way to see exactly where Affy probes are interrogating a gene.

Nurtdinov RN*, Vasiliev MO, Ershova AS, Lossev IS, Karyagina AS.
PLANdbAffy: probe-level annotation database for Affymetrix expression microarrays.
Nucleic Acids Res., 2010, 38 (Database issue): D726-730. PMID: 19906711

Thanks, to all involved!

How to access Time Machine files from Linux UNIX

Found this very handy script that allows you to access a Time Machine created backup, which Apple engineered (almost) too cleverly.

Handy if you want to copy files off of a backup, directly to a linux computer, for example.