Skip to content.

Bioconductor is an open source and open development software project
for the analysis and comprehension of genomic data.



% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{Lab 9} %\VignetteDepends{wavethresh,ts,Rwave} %\VignetteKeywords{Microarray} \documentclass[12pt]{article}

\usepackage{amsmath,pstricks} \usepackage[authoryear,round]{natbib} \usepackage{hyperref}

\textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle}

\newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}}


\title{Lab 11: Penalized Logistic Regression}



In this lab we introduce you to penalized logistic regression algorithms that are available in R. The examples in this section will rely on the data reported in \citet{Golub99}. The basic comparisons here are between ALL to AML.

<>= library(Biobase) library(annotate) library(golubEsets) library(genefilter) library(Design) library(Hmisc) @

We will follow similar steps as in Lab4 to prefilter the data, i.e. we transform the data in much the same way that \citet{Golub99} did. The sequence of commands needed are given below.

<>= data(golubTrain) data(golubTest) LS<-exprs(golubTrain) cl<-golubTrain$ALL.AML TS<-exprs(golubTest) clts<-golubTest$ALL.AML

LS[LS<100]<-100 TS[TS<100]<-100 LS[LS>16000]<-16000 TS[TS>16000]<-16000

mmfilt <- function(r=5, d=500, na.rm=TRUE) { function(x) { minval <- min(x, na.rm=na.rm) maxval <- max(x, na.rm=na.rm) (maxval/minval > r) && (maxval-minval > d) } }

mmfun <- mmfilt()

ffun <- filterfun(mmfun) good <- genefilter(cbind(LS, TS), ffun ) sum(good) ## should get 3571

LSsub <- log10(LS[good,]) # log transform of the data TSsub <- log10(TS[good,]) # log transform of the data LTsub <- cbind(LSsub, TSsub) # concatenating the two sets


Once we have subset the data to the appropriate cases we next perform some specific filtering to remove any genes that show little variation across samples. We explicitly remove the non-informative ones before applying logistic regression, using either $t$-test statistics or using an ANOVA like F-test on the learning sample (beware of bias selection).

<>= gf <- gapFilter(900, 1500, .1) ff <- filterfun(gf) xx <- 10^LSsub ##to get back to original units good <- genefilter(xx, gf) sum(good) # should be 200

# Reducing the number of genes

LS2 <- LSsub[good,] TS2 <- TSsub[good,] @

To illustrate how to fit a simple logistic regression model we select one gene form the list and plot the resulting fit in what follows. Notice the lack of discrimination power.

<>= y=as.integer(cl)-1 x=LS2[99,] aux=sort(x) fit=glm(y~x,family="binomial") auxy=fit$coef[1]+fit$coef[2]*sort(x) auxy=plogis(auxy) plot(x,y) lines(sort(x),auxy) abline(0.5,0) @

We will now perform a penalized logistic regression fit on the training sample usig the appropriate procedures in the \Rpackage{Design}. The penalty parameter is chosen by cross validation and the next commands may take some time for their execution.

<>= X = t(LS2) Y =as.integer(cl)-1 n =nrow(X) p = ncol(X) X.val=t(TS2) Y.val=as.integer(clts)-1 pm <- diag(diag(var(X))) # penalty matrix Penalty <- seq(10,10^5,length=10) reps <- length(Penalty) effective.df <- effective.df2 <- aic <- aic2 <- deviance.val <- Lpenalty <- single(reps) n.t <- round(n^.75) ncv <- 10 # # try 10 reps in cross-val. deviance <- matrix(NA,nrow=reps,ncol=length(ncv)) for(i in 1:reps) { pen <- log10(Penalty[i]) cat(format(pen),"") f.full <-, Y, penalty.matrix=pen*pm) Lpenalty[i] <- pen t(f.full$coef[-1]) %% pm %% f.full$coef[-1] f.full.nopenalty <-, Y, penalty.matrix=0.05pm, maxit=1) info.matrix.unpenalized <- solve(f.full.nopenalty$var) effective.df[i] <- sum(diag(info.matrix.unpenalized %*% f.full$var)) - 1 lrchisq <- f.full.nopenalty$stats["Model L.R."] # lrm does all this penalty adjustment automatically (for var, d.f., # chi-square) aic[i] <- lrchisq - 2*effective.df[i] # pred <- plogis(f.full$linear.predictors) score.matrix <- cbind(1,X) (Y - pred) sum.u.uprime <- t(score.matrix) %% score.matrix effective.df2[i] <- sum(diag(f.full$var %*% sum.u.uprime)) aic2[i] <- lrchisq - 2*effective.df2[i] # #Shao suggested averaging 2*n cross-validations, but let's do only 40 #and stop along the way to see if fewer is OK dev <- 0 for(j in 1:max(ncv)) { s <- sample(1:n, n.t) cof <-[s,],Y[s],penalty.matrix=pen*pm)$coef pred <- cof[1] + (X[-s,] %% cof[-1]) dev <- dev -2sum(Y[-s]*pred + log(1-plogis(pred))) for(k in 1:length(ncv)) if(j==ncv[k]) deviance[i,k] <- dev/j } # pred.val <- f.full$coef[1] + (X.val %*% f.full$coef[-1]) prob.val <- plogis(pred.val) deviance.val[i] <- -2sum(Y.valpred.val + log(1-prob.val)) } par(mfrow=c(1,3)) # to be printed as they are finished plot(Penalty, effective.df, type="l") lines(Penalty, effective.df2, lty=2) plot(Penalty, Lpenalty, type="l") title("Penalty on -2 log L") plot(Penalty, aic, type="l") lines(Penalty, aic2, lty=2) @

We can now exploit the previous result and adjust a penalized regression model with the appropriate amount of penalization. The resulting fit is then used to predict the cells in the test sample.

<>= par(mfrow=c(2,1)) pen=200 f.full <-, Y, penalty.matrix=pen*pm) predLS <- plogis(f.full$linear.predictors) yLS=as.integer(cl)-1 yTS=as.integer(clts)-1 val.prob(predLS,yLS,m=1,cex=.5) # plotting the result pred.val <- f.full$coef[1] + (X.val %*% f.full$coef[-1]) predTS <- plogis(pred.val) val.prob(predTS,yTS,m=1,cex=.5) # plotting the result @

We can summarize the results from the following concordance table.

<>= con <- function(x,y) { tab <- table(x,y) print(tab) diag(tab) <- 0 cat("error rate = ", round(100*sum(tab)/length(x),2),"%\n") invisible() }

predProbLS=(predLS>0.5) predProbTS=(predTS>0.5) con(as.integer(predProbLS),yLS) # concordance table in the LS con(as.integer(predProbTS),yTS) # concordance table in the TS @

Finally to get a feeling on how the procedue penalized the coeffcients we display them in several plots.

<>= par(mfrow=c(2,2)) plot.ts(abs(f.full$coef[-1])) plot.ts(sqrt(diag(var(X)))abs(f.full$coef[-1])) hist(f.full$coef[-1],breaks=30) hist(sqrt(diag(var(X)))f.full$coef[-1],breaks=30) @



BioC 2.5, consisting of 352 packages and designed to work with R 2.10.z, was released today.


R, the open source platform used by Bioconductor, featured in a series of articles in the New York Times.