The caret library is a wonderful R package for tuning a variety of machine learning classification and regression algorithms. But it can take a long time to run, since model tuning usually involves running multiple bootstrapped replicates for each point in your tuning grid.
If you have a multi-core desktop machine, you can speed up your calls to the caret function train by using explicit parallelism.
There were just a couple of hitches to get it flying on my 64-bit quad-core Optiplex 960 running Linux kernel 2.6.28-15 x86_64 and R version 2.9.2 (2009-08-24). I present these hitches here in hopes of saving you time.
First, use apt-get to install some necessary dependencies:
sudo apt-get install lam4-dev lam-runtime libopenmpi1 openmpi-common
Then start R with sudo, and use the install.packages() function to install snow and Rmpi. (Do not install Rmpi using apt or Synaptic. In general, it is always a better idea to get R packages directly from CRAN using the built-in function.)
> install.packages("Rmpi")
> install.packages("snow")
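As a quick sanity check (my addition, not part of the original recipe), you can verify that Rmpi built against your MPI install before going any further:

```r
## If library(Rmpi) fails here, the MPI dev packages above
## were probably not picked up at install time.
library(Rmpi)

## Reports how many slots the MPI universe knows about.
mpi.universe.size()
```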
After you have all of the above installed, just follow the framework shown in the manual for train:
mpiCalcs <- function(X, FUN, ...) {
    theDots <- list(...)
    parLapply(theDots$cl, X, FUN)
}
cl <- makeCluster(4, "MPI")  ## 4 because I have a quad-core processor
## This is how we inform "train" that we will have multiple processors available
mpiControl <- trainControl(workers = 4,
                           number = 25,
                           computeFunction = mpiCalcs,
                           computeArgs = list(cl = cl))
set.seed(1)
tune <- train(method="rf", x = t(exprs(es)), y = es$dx, ntree = 10000,
tuneGrid=data.frame(.mtry=c(3:7)*200),
trControl = mpiControl)
stopCluster(cl)
Hope this helps!
Monday, October 5, 2009
Thursday, October 1, 2009
fork processes in R using package multicore
I just found this wicked cool way to fork processes (for multi-core CPUs) in R.... Works just like you'd expect!
library(multicore)
job1 <- mcparallel(bigMachineLearningFunction1())
job2 <- mcparallel(bigMachineLearningFunction2())
job3 <- mcparallel(bigMachineLearningFunction3())
job4 <- mcparallel(bigMachineLearningFunction4())
###### time goes by .....
results1 <- collect(job1)
results2 <- collect(job2)
results3 <- collect(job3)
results4 <- collect(job4)
Note to Emacs ESS users -- just be sure all your libraries are already loaded, and set silent=TRUE in the mcparallel call; that will keep your ESS buffer from going read-only.
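For the ESS case, that might look like this (a sketch; silent=TRUE suppresses the forked child's stdout, which is what confuses the buffer):

```r
library(multicore)

## silent = TRUE keeps the child's output out of the ESS buffer
job1 <- mcparallel(bigMachineLearningFunction1(), silent = TRUE)

## Block until the job finishes and fetch its result
results1 <- collect(job1)
```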
Monday, September 21, 2009
Super sweet emacs keybindings and startup variables for ESS / R
I just stumbled upon Felipe Csaszar's blog and found his key bindings to be very, very nice if you use Emacs with R a lot. Check it out:
http://www.emacswiki.org/emacs/EmacsSpeaksStatistics
Thursday, September 17, 2009
How to use the sqlite connection to AnnotationDbi packages
It is very fast and very straightforward:
library(hgu133plus2.db)
hgu133plus2_dbschema()
sql <- "
  select * from probes p, gene_info g, chromosome_locations c
  where g._id = c._id and g._id = p._id limit 10
"
dbGetQuery(hgu133plus2_dbconn(), sql)
If you look at the last few pages of the manual for AnnotationDbi [site, PDF], you will see some tips about joining different databases together.
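The general pattern for those cross-database joins is SQLite's ATTACH DATABASE. A sketch of the idea (the org.Hs.eg table names below are assumptions on my part; check them with org.Hs.eg_dbschema() before relying on them):

```r
library(hgu133plus2.db)
library(org.Hs.eg.db)

## Attach the organism db's sqlite file to the chip db connection
dbGetQuery(hgu133plus2_dbconn(),
           sprintf("ATTACH DATABASE '%s' AS orgdb", org.Hs.eg_dbfile()))

## Now tables from both databases can appear in one query, along the lines of:
##   select * from probes p, orgdb.gene_info g where ... limit 10
```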
Wednesday, August 5, 2009
Multiple lattice plots
Here's Owen's trick for displaying multiple lattice dotplots:
In the reshape package, the function melt() allows you to reshape data into variable/value columns, on one or more "id" column(s):
> library(reshape)
> library(lattice)
> d <- data.frame(a=rnorm(10), b=rnorm(10), c=rnorm(10), cond=factor(sample(1:2,10,replace=TRUE)))
> d
a b c cond
1 0.06927895 0.98947264 1.7256042 2
2 -2.02604891 -0.33003778 -0.6595201 1
3 0.38082836 -0.98114426 1.5448757 2
4 -2.73939190 -1.30930261 0.1118557 1
5 0.37405184 -0.45263384 1.0484490 1
6 -0.68747094 -0.78446749 0.3484745 2
7 0.13011011 1.01275070 -0.5005770 1
8 -0.33158390 1.38201954 0.3096848 1
9 0.07920194 0.05604219 -1.8579733 2
10 -0.49055146 -1.17780285 0.6850589 1
> melt(d, id="cond")
cond variable value
1 2 a 0.06927895
2 1 a -2.02604891
3 2 a 0.38082836
4 1 a -2.73939190
5 1 a 0.37405184
6 2 a -0.68747094
7 1 a 0.13011011
8 1 a -0.33158390
9 2 a 0.07920194
10 1 a -0.49055146
11 2 b 0.98947264
12 1 b -0.33003778
13 2 b -0.98114426
14 1 b -1.30930261
15 1 b -0.45263384
16 2 b -0.78446749
17 1 b 1.01275070
18 1 b 1.38201954
19 2 b 0.05604219
20 1 b -1.17780285
21 2 c 1.72560425
22 1 c -0.65952015
23 2 c 1.54487566
24 1 c 0.11185573
25 1 c 1.04844895
26 2 c 0.34847452
27 1 c -0.50057699
28 1 c 0.30968476
29 2 c -1.85797331
30 1 c 0.68505890
> stripplot(value~cond|variable,melt(d, id="cond"),scales=list(relation="free"))
Setting parameter as.table=TRUE will plot from top left to bottom right.
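So the last call above would become (same data; only the panel ordering changes):

```r
## as.table = TRUE draws panels top-left to bottom-right
stripplot(value ~ cond | variable, melt(d, id = "cond"),
          scales = list(relation = "free"), as.table = TRUE)
```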
Tuesday, July 21, 2009
Secret link for the annotation files on Bioconductor
http://bioconductor.org/packages/2.4/data/annotation/
enjoy!
PS. of course you can update this for newer releases of Bioconductor
http://bioconductor.org/packages/2.6/data/annotation/
Wednesday, April 15, 2009
Another workaround for Memory Issues in R / Bioconductor
This post could be entitled "Error: cannot allocate vector of size 256.0 Mb". R provides this maddening response to even the most trivial-seeming tasks. I have a 4 Gb Linux system, and I have encountered this "vector of size" error message in the tens of Mb. And system-wide, there appears to be plenty of memory still available, and the swap space hasn't been touched.
What's going on with R memory management??
There are many issues and many variables -- the overall amount of memory in your system, whether you have a 32-bit or 64-bit system, how much memory is allocated to the R process (especially under Windows), and of course, the size of the "vector" that R is trying to allocate.
But for me, the most salient topic is memory fragmentation, aka the swiss-cheese effect. Matthew Keller has an excellent discussion of the topic here, and I won't try to go into all the details. But basically, even if you have plenty of "free memory," you can run into this memory allocation error if the "free" memory is discontinuous. The above link suggests some workarounds, but here is one that I have not seen posted anywhere:
Quit R. Yes, just quit! But don't worry we will come back again!
> q()
Save workspace image? [y/n/c]: y
Then, immediately start R again, and allow it to load in all the objects you had before. I have been delighted to see that this actually gets me around the vector allocation error. My guess is that the objects are all read back in a neatly packed manner, leaving larger contiguous blocks of free memory.
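Before quitting, it can help to see what is actually eating the memory. This snippet (my own addition, plain base R) lists your biggest objects:

```r
## List objects in the workspace with their sizes, largest first
objs  <- ls()
sizes <- sapply(objs, function(x) object.size(get(x)))
rev(sort(sizes))

## gc() reports current memory use and triggers a garbage collection
gc()
```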
Thursday, April 9, 2009
The Girke R Bioconductor manual
I came across Thomas Girke's wonderful R/Bioconductor manual today. If I had had this UC Riverside professor's notes when I started, I would have saved a lot of time.
Wednesday, March 25, 2009
CRAN Task Views: Your guide to R packages
If you're like me, you've at times been frustrated by the looseness of the R package system. These are some of the facts of life in the R world:
- There are often many different functions with overlapping goals [for example, princomp() and prcomp()].
- The name of a package may have little or no relationship to the functions in that package (for example, package "e1071" provides the tune() function for parameter tuning).
- There doesn't seem to be any organized, curated guide to the packages.
My favorite guides at the moment are Machine Learning and Multivariate Statistics.
Keep in mind, the above guides do not include BioConductor packages -- for that, you will still go to the BioConductor website.
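If you want everything a task view recommends in one shot, the ctv package can install a whole view at once (view names are case-sensitive; "MachineLearning" is the one I mean here):

```r
install.packages("ctv")
library(ctv)

## Installs every package listed in the Machine Learning task view
install.views("MachineLearning")
```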
R variable class (data types)
R does some amazing things, and often seems to know just what you want to do. But data type "coercion" can also backfire.
I just discovered a great way to check to see how R has interpreted the columns of a file you have just loaded.
> sapply(dataframeobject, class)
gender type score
"factor" "factor" "numeric"
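When the coercion guesses wrong, you can override it at load time. A sketch using read.table's colClasses argument (the file name and column names here are made up for illustration):

```r
## Force each column to a specific class instead of letting R guess
d <- read.table("scores.txt", header = TRUE,
                colClasses = c(gender = "factor",
                               type   = "factor",
                               score  = "numeric"))
sapply(d, class)
```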
Integrating BioMart queries into BioConductor with the biomaRt package
I've just realized the considerable power and convenience of a web-based bioinformatic resource I have overlooked in the past: biomart.org.
What gets my attention now is the ability to integrate BioMart queries right into your BioConductor pipeline, with a package called biomaRt.
The Seattle 2009 BioConductor workshop looks like it was great -- sadly, I missed it. But you can still access some presentation materials on biomaRt.
Installation of biomaRt is accomplished like any other BioConductor package:
source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")
I am looking forward to learning more about these resources.
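A minimal query sketch, assuming the Ensembl human dataset (the attribute and filter names come from biomaRt's listAttributes()/listFilters(); double-check them against your mart):

```r
library(biomaRt)

## Connect to the Ensembl BioMart, human gene dataset
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

## Map a gene symbol to its chromosome and Entrez ID
getBM(attributes = c("hgnc_symbol", "chromosome_name", "entrezgene"),
      filters = "hgnc_symbol", values = "TP53", mart = mart)
```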
Thursday, March 19, 2009
Pulling data from GEO into R Bioconductor expression sets (eSet)
I just discovered how easy Bioconductor makes it to import data from the Gene Expression Omnibus (GEO).
First, be sure you have the GEOquery package from Bioconductor. You will probably need to install the curl4 devel library package using apt-get or whatever you use to do such things. On my Ubuntu 8.10 distro, the requisite package is called libcurl4-gnutls-dev. After you have this, you should be able to install GEOquery like any other Bioconductor package, i.e.:
source("http://bioconductor.org/biocLite.R")
biocLite("GEOquery")
Then, pulling a GEO series (GSE) into an eSet object is as simple as:
gseObj <- getGEO("GSE10667")
eSet <- gseObj[[1]]
Note the [[1]] -- necessary because the GSE comes in as a list of eSets, apparently.
See the GEOquery vignette for more details, and for conversions from GDS format.
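For the GDS case, the conversion the vignette covers looks roughly like this (GDS2eSet and its do.log2 argument are from GEOquery; the accession is just an example):

```r
library(GEOquery)

## A GDS accession comes in as a GDS object, not an eSet
gds <- getGEO("GDS507")

## Convert to an ExpressionSet, optionally log2-transforming the values
eset <- GDS2eSet(gds, do.log2 = TRUE)
```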