Skip to content.

bioconductor.org

Bioconductor is an open source and open development software project
for the analysis and comprehension of genomic data.

Sections

Lab12.rnw

% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{Lab 12} %\VignetteDepends{wavethresh,ts,Rwave} %\VignetteKeywords{Microarray} \documentclass[12pt]{article}

\usepackage{amsmath,pstricks} \usepackage[authoryear,round]{natbib} \usepackage{hyperref}

\textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle}

\newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}}

\bibliographystyle{plainnat}

\title{Lab 12: Dimension reduction in R}

\begin{document}

\maketitle

In this lab, we present some dimension reduction algorithms that are available in R and that are useful for classification of microarray data. The first example in this section will rely on the data reported in Hedenfalk et al. (2001) where one of the goals is to find genes that are differentially expressed between BRCA1-mutation-positive tumors and BRCA2-mutation-positive tumors by obtaining several microarrays from each cell type.

We first load the neccessary libraries for this lab.

<>= library(mda) library(dr) library(Milan) @

Expression measures are stored in an \Robject{hedenfalk} object available through the \Rpackage{brca} package.

<>= data(brca) names(brca)=c("plate", "cid","BC11","BC15","BC13","BC17","BC12","BC14","BC210a","BC29","BC28","BC210b","SP16","SP17","SP15","SP18","SP19","SP21","SP20","BC16","BC213","BC214","BC211","BC212") brca <- (brca[,c(-1,-2)]) brca.class <- c(rep(1,6),rep(2,4),rep(0,7),1,rep(2,4)) @

We formulate the effects of gene expression on class type using the {\em multinomial logistic regression model}: $$ \log \frac{\PP (Y_{i}=r)}{\PP (Y_{i}= 0)} = \mathbf{X}_{i\cdot} \bbeta_{r0} ,\quad r=1,\dots, G-1,$$ where $\bbeta_{r0}$ is a $p$-dimensional vector of unknown regression coefficients. Since $p >> n$ it is not possible to estimate the parameters of the above model using standard statistical methods. A principal component study becomes then a suitable first step to reduce the dimension of $\bbeta_{r0}$.

We will first apply a singular value decomposition based logistic regression method to classify the cells. It is much like principal component regression analysis (PCR). In order to setup the model we define first the design matrix for discrimination.

<>= brcam=matrix(brca) brda=as.vector(brca$BC11[[1]]) for (i in 2:22) brda=cbind(brda,brcam[[i]][[1]]) @

We now apply the SVD logistic procedure, made by Ghosh (2001) and available in the \Rpackage{Milan}.

<>= da1 <- svdrfda(factor(brca.class)~t(brda),K=3) plot.fda(da1) confusion(predict(da1),factor(brca.class)) @

SVD produces orthogonal class descriptors that reduce the high dimensional data (supergenes). This is achieved without regards to the response variation and may be inefficient. This way of reducing the regressor dimensionality is totally independent of the output variable. Another method is PLS, where the PLS components are chosen so that the sample covariance between the response and a linear combination of the $p$ predictors (genes) is maximum.

<>= da2 <- svdpls1fda(factor(brca.class)~t(brda),K=3) plot.fda(da2) confusion(predict(da1),factor(brca.class)) @

However, PLS is really designed to handle continuous responses and especially for models that do not really suffer from conditional heteroscedasticity as it is the case for binary or multinomial data, as it is the case here. An extension of the standard PLS algorithm to Generalized linear Modelsis also available from Marx. Here is an example on how to use such a procedure.

<>= y <- c(rep(1,6),rep(0,11),1,rep(0,4)) fit<-gpls(y,t(brda), components = 4,family = "binomial", link = "logit") # plot(fit$glm.details) z=fit$mu>0.5 z=as.integer(z) confusion(z,y) @

The purpose of the following commands is to apply other type of dimension reduction methods to the BRCA data. They are based on sliced inverse regression methods such as those implemented in \Rpackage{dr}. The following commands estimate the central dimension reduction subspace and perform a two-class discrimination. Note that since we are dealing with binary responses the number of slices must be two.

<>= fitdr=dr(y~t(brda),method="Sir",nslices=2) plot(fitdr,mark.by.y=T) @

When X are normally distributed, SIR is equivalent to Linear Discriminant Analysis in the sense that they both estimate the same discriminant linear combinations of the predictors and SAVE is equivalent to Quadratic Discriminant Analysis.

<>= fitdr=dr(y~t(brda),method="Save",nslices=2) dr.permutation.test(fitdr,npermute=100) plot(fitdr,mark.by.y=T) @

We have applied sliced inverse regression methods to binary data but note that SIR and SAVE can also be applied to problems with multinomial or multi-valued responses.

\end{document}

News
2009-10-26

BioC 2.5, consisting of 352 packages and designed to work with R 2.10.z, was released today.

2009-01-07

R, the open source platform used by Bioconductor, featured in a series of articles in the New York Times.