Wednesday, March 25, 2009

CRAN Task Views: Your guide to R packages

If you're like me, you've at times been frustrated by the looseness of the R package system. These are some of the facts of life in the R world:
  • There are often many different functions with overlapping goals [for example, princomp() and prcomp()].
  • The name of a package may have little or no relationship to the functions in that package (for example, package "e1071" provides the tune() function for parameter tuning).
  • There doesn't seem to be any organized, curated guide to the packages.
Well, at least the last point on this list of frustrations has been remedied. Be sure to check out the CRAN Task Views pages, which offer exactly that: lists of packages with a brief blurb about each one, organized by topic (more than 20 at this writing) and curated by (I assume) expert R users/developers.
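To make the first frustration on that list concrete, here is a minimal sketch comparing princomp() and prcomp() on the built-in USArrests data set (my choice of data, purely for illustration). As far as I understand it, the two functions differ mainly in method (eigendecomposition vs. SVD) and in the variance divisor (n vs. n - 1), so their component standard deviations should agree only after rescaling:

```r
# Built-in data set, so the sketch runs without extra packages
x <- USArrests
n <- nrow(x)

pc_svd <- prcomp(x)     # SVD-based; variances use divisor n - 1
pc_eig <- princomp(x)   # eigen-based; variances use divisor n

# Compare the component standard deviations as returned
raw_equal <- isTRUE(all.equal(pc_svd$sdev, unname(pc_eig$sdev)))

# Compare again after correcting for the different divisor
rescaled_equal <- isTRUE(all.equal(pc_svd$sdev,
                                   unname(pc_eig$sdev) * sqrt(n / (n - 1))))

print(c(raw = raw_equal, rescaled = rescaled_equal))
```

The signs of the loadings can also differ between the two, so it is worth picking one function and sticking with it within an analysis.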

My favorite guides at the moment are Machine Learning and Multivariate Statistics.
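The task views can also be browsed and installed from inside R with the ctv package from CRAN. A minimal sketch, assuming a working internet connection and a configured CRAN mirror, and assuming the view names are spelled "MachineLearning" and "Multivariate" (check the output of available.views() if they are not found):

```r
# Install and load the task-view helper package from CRAN
install.packages("ctv")
library(ctv)

# List the task views currently published on CRAN
views <- available.views()
print(views)

# Install every package listed in a view.
# Left commented out because a view can pull in a very large number of packages.
# install.views("MachineLearning")
# install.views("Multivariate")
```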

Keep in mind that the above guides do not include BioConductor packages -- for those, you still need to go to the BioConductor website.

R variable class (data types)

R does some amazing things, and often seems to know just what you want to do. But its automatic data type "coercion" can also backfire.

I just discovered a great way to check to see how R has interpreted the columns of a file you have just loaded.

> sapply(dataframeobject, class)
   gender      type     score
 "factor"  "factor" "numeric"

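Here is a small sketch of how the coercion can backfire and how the check above catches it. The CSV text and the "n/a" marker are made up for illustration; a single non-numeric entry is enough to keep a whole column from being read as numeric:

```r
# Hypothetical file contents, with one stray "n/a" in the score column
csv_text <- "gender,type,score
female,a,1.5
male,b,n/a
female,a,3.0"

# Read it the naive way and check how each column was interpreted
df <- read.csv(text = csv_text, stringsAsFactors = TRUE)
col_classes <- sapply(df, class)
print(col_classes)

# Tell R that "n/a" means missing, then check again
df_fixed <- read.csv(text = csv_text, stringsAsFactors = TRUE,
                     na.strings = "n/a")
fixed_classes <- sapply(df_fixed, class)
print(fixed_classes)
```

In the first read, score is no longer numeric; in the second, it is numeric again, with a missing value where the "n/a" was.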
Integrating BioMart queries into BioConductor with the biomaRt package

I've just realized the considerable power and convenience of a web-based bioinformatic resource I have overlooked in the past: biomart.org.

What gets my attention now is the ability to integrate BioMart queries right into your BioConductor pipeline, with a package called biomaRt.

The Seattle 2009 BioConductor workshop looks like it was great -- sadly, I missed it. But you can still access some presentation materials on biomaRt.

Installation of biomaRt is accomplished like any other BioConductor package:

source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")

I am looking forward to learning more about these resources.
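For a flavor of what a query looks like, here is a minimal sketch using useMart() and getBM() from biomaRt. It assumes a working internet connection, and the mart name, dataset name, attribute names, and gene symbols below are my own illustrative choices rather than anything from the workshop materials. listMarts(), listDatasets(), and listAttributes() show what is actually available:

```r
library(biomaRt)

# Connect to the Ensembl mart and select the human gene dataset
# (assumed names; confirm with listMarts() and listDatasets())
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# Hypothetical input: a few HGNC gene symbols of interest
symbols <- c("TP53", "BRCA1")

# Ask BioMart for the Ensembl gene ID and chromosome of each symbol
annot <- getBM(
  attributes = c("hgnc_symbol", "ensembl_gene_id", "chromosome_name"),
  filters    = "hgnc_symbol",
  values     = symbols,
  mart       = ensembl
)

head(annot)
```

The result comes back as an ordinary data frame, so it can be merged with other annotation tables in the usual way.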

Thursday, March 19, 2009

Pulling data from GEO into R Bioconductor expression sets (eSet)

I just discovered how easy Bioconductor makes it to import data from the Gene Expression Omnibus (GEO).

First, be sure you have the GEOquery package from Bioconductor. You will probably need to install the libcurl development library first, using apt-get or whatever package manager you use. On my Ubuntu 8.10 distro, the requisite package is called libcurl4-gnutls-dev. After you have this, you should be able to install GEOquery like any other Bioconductor package, i.e.:

source("http://bioconductor.org/biocLite.R")
biocLite("GEOquery")

Then, pulling a GEO series (GSE) into an eSet object is as simple as:

gseObj <- getGEO("GSE10667")
eSet <- gseObj[[1]]

Note the [[1]] -- necessary because the GSE comes in as a list of eSets, apparently.
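Once the eSet is in hand, the usual Biobase accessors should apply. A minimal sketch, assuming a working internet connection and that Biobase is installed alongside GEOquery. I have not checked what this particular series contains, so the code only inspects sizes and the first few rows:

```r
library(GEOquery)
library(Biobase)

# Download the series; the result is a list of eSets
gseObj <- getGEO("GSE10667")
length(gseObj)            # how many eSets came back

eSet <- gseObj[[1]]

# Expression matrix: one row per probe, one column per sample
exprMat <- exprs(eSet)
dim(exprMat)

# Sample annotation (phenotype data) as a data frame
pheno <- pData(eSet)
head(pheno)
```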

See the GEOquery vignette for more details, and for conversions from GDS format.