BioCozy: Bioconductor


Monday, October 5, 2009

Parallel computation using the caret package

The caret library is a wonderful R package for tuning a variety of machine learning classification and regression algorithms. But it can take a long time to run, since model tuning usually means fitting multiple bootstrapped replicates for each point in your tuning grid.
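To get a feel for how quickly the work adds up, here is a rough count of the resampled model fits, assuming the grid of 5 mtry values and the 25 bootstrap replicates used later in this post. The variable names nResamples and nFits are just for illustration.

```r
# Tuning grid with 5 candidate mtry values (same grid as used further down)
tuneGrid <- data.frame(.mtry = c(3:7) * 200)

# Number of bootstrap replicates per grid point
nResamples <- 25

# One model fit per (grid point, replicate) pair
nFits <- nrow(tuneGrid) * nResamples
print(nFits)  # 125
```

Each of those fits is independent of the others, which is why spreading them over several cores pays off.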

If you have a multi-core desktop machine, you can speed up your calls to the caret function train by using explicit parallelism.

There were just a couple of hitches in getting it flying on my 64-bit quad-core Optiplex 960 running Linux kernel 2.6.28-15 x86_64 and R version 2.9.2 (2009-08-24). I present these hitches here in hopes of saving you time.

First, use apt-get to install some necessary dependencies:

sudo apt-get install lam4-dev lam-runtime libopenmpi1 openmpi-common

Then start R with sudo, and use the install.packages() function to install snow and Rmpi. (Do not install Rmpi using apt or synaptic. In general, it is a better idea to get R packages directly from CRAN using the built-in function.)

> install.packages("Rmpi")
> install.packages("snow")

Once you have all of the above installed, just follow the framework shown in the manual for train:

library(caret)
library(snow)    ## Rmpi must also be installed for the "MPI" cluster type

## Compute function handed to train(): runs FUN over X on the cluster passed in via ...
mpiCalcs <- function(X, FUN, ...) {
  theDots <- list(...)
  parLapply(theDots$cl, X, FUN)
}

cl <- makeCluster(4, "MPI")  ## I am using 4 b/c I have a quad-core processor

## This is how we inform "train" that we will have multiple processors available
mpiControl <- trainControl(workers = 4,
                           number = 25,
                           computeFunction = mpiCalcs,
                           computeArgs = list(cl = cl))

set.seed(1)

## es is an expression set already loaded in the workspace
tune <- train(method = "rf", x = t(exprs(es)), y = es$dx, ntree = 10000,
              tuneGrid = data.frame(.mtry = c(3:7) * 200),
              trControl = mpiControl)

stopCluster(cl)
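If you want to check the wrapper pattern on its own before involving train or MPI, here is a minimal sketch. It is an assumption on my part that R's bundled parallel package (local socket workers) is an acceptable stand-in for snow plus Rmpi; the name clusterCalcs and the worker count of 2 are hypothetical.

```r
library(parallel)  # assumption: base 'parallel' package instead of snow + Rmpi

# Same shape as the wrapper above: the cluster object arrives through '...'
clusterCalcs <- function(X, FUN, ...) {
  theDots <- list(...)
  parLapply(theDots$cl, X, FUN)
}

cl <- makeCluster(2)  # two local socket workers (hypothetical count)

# Apply a toy function to each element of X, one task per element
squares <- clusterCalcs(1:4, function(i) i^2, cl = cl)

stopCluster(cl)

print(unlist(squares))
```

If this runs cleanly, the cluster plumbing works, and any remaining trouble is on the MPI or caret side.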

Hope this helps!

Wednesday, April 15, 2009

Another workaround for Memory Issues in R / Bioconductor

This post could be entitled "Error: cannot allocate vector of size 256.0 Mb". R provides this maddening response for even the most trivial-seeming tasks. I have a 4 GB Linux system, and I have encountered this error message for a "vector of size" in the tens of MB, even though system-wide there appears to be plenty of memory still available and the swap space hasn't been touched.

What's going on with R memory management??

There are many issues and many variables: the overall amount of memory in your system, whether you have a 32-bit or 64-bit system, how much memory is allocated to the R process (especially under Windows), and of course, the size of the "vector" that R is trying to allocate.

But for me, the most salient topic is memory fragmentation, aka the swiss-cheese effect. Matthew Keller has an excellent discussion of the topic here, and I won't try to go into all the details. But basically, even if you have plenty of "free memory," you can run into this memory allocation error if the "free" memory is not contiguous, because R needs a single unbroken block for each vector. The above link suggests some workarounds, but here is one that I have not seen posted anywhere:

Quit R. Yes, just quit! But don't worry, we will come back again!

> q()
Save workspace image? [y/n/c]: y

Then, immediately start R again, and allow it to load in all the objects you had before. I have been delighted to see that this actually gets me around the vector allocation error. My guess is that the objects are read back in a neatly packed manner, leaving more contiguous free memory.
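The same round trip can be done explicitly, which is handy if you want the image in a specific file. This is only a sketch: bigMatrix is a hypothetical stand-in for whatever is in your workspace, and the file path is made up for the example.

```r
# Hypothetical object standing in for the contents of your workspace
bigMatrix <- matrix(rnorm(1e4), nrow = 100)

# Hypothetical location for the saved image
imageFile <- file.path(tempdir(), "workspace.RData")

# Write the global workspace to a file of your choosing
save.image(file = imageFile)

# ... quit R, start a fresh session, then read everything back in:
load(imageFile)

# Run a garbage collection and report current memory use
gc()
```

Calling gc() after the reload prints how much memory R is using, so you can compare it with the figure from before you quit.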

Thursday, April 9, 2009

The Girke R Bioconductor manual

I came across Thomas Girke's wonderful R/Bioconductor manual today. If I had had this UC Riverside professor's notes when I started, I would have saved a lot of time.

Thursday, March 19, 2009

Pulling data from GEO into R Bioconductor expression sets (eSet)

I just discovered how easy Bioconductor makes it to import data from the Gene Expression Omnibus (GEO).

First, be sure you have the GEOquery package from Bioconductor. You will probably need to install the libcurl development package using apt-get or whatever you use to do such things. On my Ubuntu 8.10 distro, the requisite package is called libcurl4-gnutls-dev. After you have this, you should be able to install GEOquery like any other Bioconductor package, i.e.:

source("http://bioconductor.org/biocLite.R")
biocLite("GEOquery")

Then, pulling a GEO series (GSE) into an eSet object is as simple as:

gseObj <- getGEO("GSE10667")
eSet <- gseObj[[1]]

Note the [[1]] -- necessary because the GSE comes in as a list of eSets, apparently.
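The difference between single and double brackets is easy to check on a toy list. In this sketch gseLike is a hypothetical stand-in for what getGEO() returns (a plain list holding a small matrix rather than a real eSet), so it runs without a network connection.

```r
# Stand-in for the list returned by getGEO(); the element name is hypothetical
gseLike <- list(series_matrix = matrix(1:6, nrow = 2))

# Single brackets keep the list wrapper
stillAList <- gseLike[1]
print(is.list(stillAList))

# Double brackets pull out the element itself
theElement <- gseLike[[1]]
print(is.matrix(theElement))

# Worth checking how many elements came back before picking one
print(length(gseLike))
```

On a real GSE it is worth running length() on the result first, since the list can hold more than one element.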

See the GEOquery vignette for more details, and for conversions from GDS format.