Monday, October 5, 2009

parallel computation using the caret package

The caret library is a wonderful R package for tuning a variety of machine learning classification and regression algorithms. But it can take a long time to run, since model tuning usually involves fitting multiple bootstrapped replicates for each point in your tuning grid.

If you have a multi-core desktop machine, you can speed up your calls to the caret function train by using explicit parallelism.

There were just a couple of hitches to get it flying on my 64-bit quad-core Optiplex 960 running linux kernel 2.6.28-15 x86_64 and R version 2.9.2 (2009-08-24). I present these hitches here in hopes of saving you time.

First, use apt-get to install some necessary dependencies:

sudo apt-get install lam4-dev lam-runtime libopenmpi1 openmpi-common

Then start R with sudo, and use the install.packages() function to install snow and Rmpi. (Do not install Rmpi using apt or synaptic. In general, it is always a better idea to get R packages directly from CRAN using the built-in function.)

> install.packages("Rmpi")
> install.packages("snow")

Anyway, after you have all of the above installed, just follow the framework shown in the manual for train:

library(caret)
library(snow)

## wrapper that tells train() how to dispatch work to the snow cluster
mpiCalcs <- function(X, FUN, ...) {
    theDots <- list(...)
    parLapply(theDots$cl, X, FUN)
}

cl <- makeCluster(4, "MPI")  ## 4 workers because I have a quad-core processor
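
Before handing the cluster to caret, it is worth a quick sanity check that all four workers respond (my own minimal sketch, not from the train manual):

## each worker should report back its hostname
parLapply(cl, 1:4, function(i) Sys.info()[["nodename"]])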

## This is how we inform train() that we will have multiple workers available
mpiControl <- trainControl(workers = 4,
                           number = 25,
                           computeFunction = mpiCalcs,
                           computeArgs = list(cl = cl))

set.seed(1)

## es is an ExpressionSet here, so t(exprs(es)) is the samples-by-probes matrix
tune <- train(method = "rf", x = t(exprs(es)), y = es$dx, ntree = 10000,
              tuneGrid = data.frame(.mtry = (3:7) * 200),
              trControl = mpiControl)

stopCluster(cl)

Hope this helps!

Thursday, October 1, 2009

fork processes in R using package multicore

I just found this wicked cool way to fork processes (for multi-core CPUs) in R.... Works just like you'd expect!

library(multicore)

job1 <- mcparallel(bigMachineLearningFunction1())
job2 <- mcparallel(bigMachineLearningFunction2())
job3 <- mcparallel(bigMachineLearningFunction3())
job4 <- mcparallel(bigMachineLearningFunction4())

###### time goes by .....

results1 <- collect(job1)
results2 <- collect(job2)
results3 <- collect(job3)
results4 <- collect(job4)
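
If you prefer to gather everything in one go, collect() also accepts a list of jobs:

allResults <- collect(list(job1, job2, job3, job4))  ## one result per job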


Note to emacs ESS users -- just be sure all your libraries are already loaded, and set silent=TRUE in the mcparallel call.... that will keep your ESS buffer from going read-only.

Monday, September 21, 2009

Super sweet emacs keybindings and startup variables for ESS / R

I just stumbled upon Felipe Csaszar's blog and found his key bindings to be very, very nice if you use emacs with R a lot. Check it out:

http://www.emacswiki.org/emacs/EmacsSpeaksStatistics

Thursday, September 17, 2009

How to use the sqlite connection to AnnotationDbi packages

It is very fast and very straightforward:

library(hgu133plus2.db)
hgu133plus2_dbschema()

sql <- "
  select *
  from probes p, gene_info g, chromosome_locations c
  where g._id = c._id and g._id = p._id
  limit 10
"

dbGetQuery(hgu133plus2_dbconn(), sql)
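
If you want to explore the schema interactively rather than reading the hgu133plus2_dbschema() dump, the standard DBI functions work on the same connection:

library(DBI)
con <- hgu133plus2_dbconn()
dbListTables(con)            ## every table in the annotation database
dbListFields(con, "probes")  ## the columns of a single table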


If you look at the last few pages of the manual for AnnotationDbi [site, PDF], you will see some tips about joining different databases together.

Wednesday, August 5, 2009

Multiple lattice plots

Here's Owen's trick for displaying multiple lattice dotplots:

In the reshape package, the function melt() allows you to reshape data into variable/value columns, keyed on one or more "id" columns:


> library(reshape)
> library(lattice)
> d <- data.frame(a=rnorm(10), b=rnorm(10), c=rnorm(10), cond=factor(sample(1:2,10,replace=TRUE)))

> d
             a           b          c cond
1   0.06927895  0.98947264  1.7256042    2
2  -2.02604891 -0.33003778 -0.6595201    1
3   0.38082836 -0.98114426  1.5448757    2
4  -2.73939190 -1.30930261  0.1118557    1
5   0.37405184 -0.45263384  1.0484490    1
6  -0.68747094 -0.78446749  0.3484745    2
7   0.13011011  1.01275070 -0.5005770    1
8  -0.33158390  1.38201954  0.3096848    1
9   0.07920194  0.05604219 -1.8579733    2
10 -0.49055146 -1.17780285  0.6850589    1

> melt(d, id="cond")
   cond variable       value
1     2        a  0.06927895
2     1        a -2.02604891
3     2        a  0.38082836
4     1        a -2.73939190
5     1        a  0.37405184
6     2        a -0.68747094
7     1        a  0.13011011
8     1        a -0.33158390
9     2        a  0.07920194
10    1        a -0.49055146
11    2        b  0.98947264
12    1        b -0.33003778
13    2        b -0.98114426
14    1        b -1.30930261
15    1        b -0.45263384
16    2        b -0.78446749
17    1        b  1.01275070
18    1        b  1.38201954
19    2        b  0.05604219
20    1        b -1.17780285
21    2        c  1.72560425
22    1        c -0.65952015
23    2        c  1.54487566
24    1        c  0.11185573
25    1        c  1.04844895
26    2        c  0.34847452
27    1        c -0.50057699
28    1        c  0.30968476
29    2        c -1.85797331
30    1        c  0.68505890

> stripplot(value ~ cond | variable, melt(d, id="cond"), scales = list(relation = "free"))


Setting the parameter as.table=TRUE will draw the panels from top left to bottom right.
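
For example, the same call as above with the panel order flipped:

stripplot(value ~ cond | variable, melt(d, id="cond"),
          scales = list(relation = "free"), as.table = TRUE)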

Wednesday, April 15, 2009

Another workaround for Memory Issues in R / Bioconductor

This post could be entitled "Error: cannot allocate vector of size 256.0 Mb". R provides this maddening response for even the most trivial-seeming tasks. I have a 4 GB linux system, and I have encountered this error for a "vector of size" in the tens of Mb. And system-wide, there appears to be plenty of memory still available, and the swap space hasn't been touched.

What's going on with R memory management??

There are many issues and many variables -- the overall amount of memory in your system, whether you have a 32-bit or 64-bit system, how much memory is allocated to the R process (especially under Windows), and of course, the size of the "vector" that R is trying to allocate.

But for me, the most salient topic is memory fragmentation, aka the swiss-cheese effect. Matthew Keller has an excellent discussion of the topic here, and I won't try to go into all the details. But basically, even if you have plenty of "free memory," you can run into this memory allocation error if the "free" memory is not contiguous. The above link suggests some workarounds, but here is one that I have not seen posted anywhere:

Quit R. Yes, just quit! But don't worry, we will come back again!

> q()
Save workspace image? [y/n/c]: y

Then, immediately start R again, and allow it to load all the objects you had before. I have been delighted to see that this actually gets me around the vector allocation error. I guess what is happening is that the objects are all being read back in in a neatly packed manner, leaving more contiguous free memory.
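
If you would rather not depend on the interactive prompt, the same trick can be scripted with base R's save.image() and load() (a minimal sketch of the same workaround):

save.image("workspace.RData")  ## write every object in the workspace to disk
q(save = "no")                 ## quit without prompting (we already saved)

## ...then, in a fresh R session:
load("workspace.RData")        ## objects are read back in, densely packed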

Thursday, April 9, 2009

The Girke R Bioconductor manual

I came across Thomas Girke's wonderful R/Bioconductor manual today. If I had had this UC Riverside professor's notes when I started, I would have saved a lot of time.

Wednesday, March 25, 2009

CRAN Task Views: Your guide to R packages

If you're like me, you've at times been frustrated by the looseness of the R package system. These are some of the facts of life in the R world:
  • There are often many different functions with overlapping goals [for example, princomp() and prcomp()].
  • The name of a package may have little or no relationship to the functions in that package (for example, package "e1071" provides the tune() function for parameter tuning).
  • There doesn't seem to be any organized, curated guide to the packages.
Well, at least the last point on this list of frustrations has been remedied. Be sure to check out the CRAN Task Views pages, which have exactly that: lists of packages, each with a brief blurb, organized by topic (more than 20 topics at this writing) and curated by (I assume) expert R users/developers.

My favorite guides at the moment are Machine Learning, and Multivariate Statistics.

Keep in mind, the above guides do not include BioConductor packages -- for those, you will still need to go to the BioConductor website.

R variable class (data types)

R does some amazing things, and often seems to know just what you want to do. But data type "coercion" can also backfire.

I just discovered a great way to check to see how R has interpreted the columns of a file you have just loaded.

> sapply(dataframeobject, class)
  gender     type     score
"factor" "factor" "numeric"

Integrating BioMart queries into BioConductor with the biomaRt package

I've just realized the considerable power and convenience of a web-based bioinformatic resource I have overlooked in the past: biomart.org.

What gets my attention now is the ability to integrate BioMart queries right into your BioConductor pipeline, with a package called biomaRt.

The Seattle 2009 BioConductor workshop looks like it was great -- sadly, I missed it. But you can still access some presentation materials on biomaRt.

Installation of biomaRt is accomplished like any other BioConductor package:

    source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")
I am looking forward to learning more about these resources.

Thursday, March 19, 2009

Pulling data from GEO into R Bioconductor expression sets (eSet)

I just discovered how easy Bioconductor makes it to import data from the Gene Expression Omnibus (GEO).

First, be sure you have the GEOquery package from Bioconductor. You will probably need to install the curl4 devel library using apt-get or whatever you use to do such things. On my Ubuntu 8.10 distro, the requisite package is called libcurl4-gnutls-dev. After you have this, you should be able to install GEOquery like any other Bioconductor package, i.e.:

source("http://bioconductor.org/biocLite.R")
biocLite("GEOquery")

Then, pulling a GEO series (GSE) into an eSet object is as simple as:

gseObj <- getGEO("GSE10667")
eSet <- gseObj[[1]]

Note the [[1]] -- necessary because the GSE comes in as a list of eSets, one per platform in the series.
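
Once you have the eSet, the usual Biobase accessors apply (GEOquery loads Biobase for you):

library(Biobase)
dim(exprs(eSet))   ## expression matrix: probes x samples
head(pData(eSet))  ## sample-level phenotype data pulled from GEO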

See the GEOquery vignette for more details, and for conversions from GDS format.