--- title: "Integrated pathway analysis of multiple 'omics datasets" author: "Aedin Culhane" email: "aedin@jimmy.harvard.edu" affiliation: "Dana-Farber Cancer Institute / Harvard School of Public Health" date: "Friday August 1, 2014" output: html_document --- ```{r setup, echo=FALSE,error=FALSE,message=FALSE} datadir= "/data/IntPathAnalysis/BRCA/" library(knitcitations) library(bibtex) library(knitr) library(AnnotationDbi) library(org.Hs.eg.db) allbib = read.bibtex(file.path(".", "bioc.bib")) opts_chunk$set(cache=TRUE, fig.path='figure/', cache.path = 'cache-local/') ``` Introduction -------------- Rapid advances and reducing costs have enabled laboratories to apply a multi-omic approach to biological systems. Studies frequently measure several biological molecules, including mRNA, proteins, metabolites, lipids and phosphoproteins using different technological platforms generating multiple datasets on the same set of biological samples. We have described the following different exploratory data analysis methods that enable one to identify co-relationships between high dimensional datasets. - Multiple Co-Intertia Analysis (MCIA) simultaneously projects several datasets into the same dimensional space, transforming diverse sets of features onto the same scale and allows one identify the co-structure between datasets (`r citet(allbib[["meng_multivariate_2014"]])`). - iBBiG is a biclustering algorithm which we wrote for analysis of noisy binary data (`r citet(allbib[["gusenleitner_ibbig:_2012"]])`). We developed it to identify biclusters in results from GSEA/pathway analysis from >1,000 different studies. Taking several thousand vectors which each contain p-values enrichment scores for many thousands of pathways, we discretize these to create a binary matrix and apply iBBiG to find pathways associated with covariates across many studies. Today, I will provide an overview of the first MCIA. TCGA Breast Cancer Dataset ---------------- In this workshop we will examine data from the the Cancer Genome Atlas (TCGA) Breast Cancer project (`r citet(allbib[["tcga_comprehensive_2012"]])`). We prepared a subset of the TCGA breast cancer study from the Nature 2012 publication. These are contained in a list and are multiple molecular data of 79 primary breast tumors. These data were downloaded from https://tcga-data.nci.nih.gov/docs/publications/brca_2012/. The data are miRNA, miRNAprecursor, RNAseq, Methylation, proteins measures on an RPPA array, and GISTIC SNP calls. I split the latter GISTIC data into 2 datasets; CNA and LOH (loss of heterzygousity), as these may be dervived by different biological mechanism and represent distinct molecular events in the cell (`r citet(allbib[["wang2012profiles"]])`). The eighth data.frame is the clinical data on these 79 cases. These data are in matrix or data.frame format. (Note: more recently I have been using TCGA Assembler (`r citet(allbib[["Tcga-Assembler2014"]])`)) to download and process TCGA data. I am happy to describe this if there is interest. I am also happy to review/demo how we reproduced the figures from our recent paper (`r citet(allbib[["meng_multivariate_2014"]])`) and have uploaded code to regenerate the figures on the Bioconductor website. ```{r loadBRCA} load(file.path(datadir, "BRCA_TCGA_79_filtered_matchedNames_Basal+LuminalA.RData")) summary(BRCA_TCGA_79) ``` ```{r codeToCreateDataset, echo=FALSE, eval=FALSE, message=FALSE} # Files downloaded using wget from https://tcga-data.nci.nih.gov/docs/publications/brca_2012/ # Read Each file miRNA<-read.table("BRCA.348.mimat.txt", header=TRUE, sep=",", row.names=1) miRNAprecursor<-read.table("BRCA.348.precursor.txt", header=TRUE, sep=",", row.n ames=1) RNAseq<-read.table("BRCA.exp.348.med.txt", sep="\t", header=TRUE, row.names=1, q uote="\"") Methyl<-read.table("BRCA.Methylation.574probes.802.txt", sep="\t", header=TRUE) GISTIC<-read.table("brca_scna_all_thresholded.by_genes.txt", sep="\t", header=TR UE) RPPA<-read.table("rppaData-403Samp-171Ab-Trimmed.txt", sep="\t", header=TRUE, ro w.names=1) # Create a vector of the names of these data matrices rawData<-ls() # In GISTIC data, the first 3 columns are gene annotation. Omit these. GISTIC_fData= GISTIC[,1:3] GISTIC= GISTIC[,-c(1:3)] # Create a list of sampleNames for each Dataset colNames<-lapply(rawData, function(x) cbind(colnames(get(x)), substr(colnames(ge t(x)), 1, 12))) names(colNames) = rawData # Get a list of intersecting sampleNames BRCA_348<-list(intersect(colNames[["RPPA"]][,2], colNames[["miRNAprecursor"]][,2])) # Filter each dataset to 348 intersecting samples for (i in rawData) { colNames[[i]]<-colNames[[i]][colNames[[i]][,2]%in%BRCA_348[[1]],] rownames(colNames[[i]]) <-colNames[[i]][,2] colNames[[i]]<-colNames[[i]][BRCA_348[[1]],] } rm(i) # Filter Datasets BRCA_TCGA<- list(GISTIC=GISTIC[, colNames[["GISTIC"]][,1]], miRNA=miRNA[, colNames[["miRNA"]][,1]], miRNAprecursor=miRNAprecursor[, colNames[["miRNAprecursor"]][,1]], RNAseq=RNAseq[, colNames[["RNAseq"]][,1]], Methyl=Methyl[, colNames[["Methyl"]][,1]], RPPA=RPPA[, colNames[["RPPA"]][,1]]) BRCA_TCGA$LOH= BRCA_TCGA$GISTIC BRCA_TCGA$LOH= apply(BRCA_TCGA$LOH, 2, function(x) ifelse(x== -1, 1,0)) BRCA_TCGA$CNA= BRCA_TCGA$GISTIC BRCA_TCGA$CNA = apply(BRCA_TCGA$CNA,2, function(x) ifelse(x==-1, 0,x)) lapply(BRCA_TCGA, dim) lapply(BRCA_TCGA, function(x) sum(is.na(x))) Thres=0.9 xx<-apply(BRCA_TCGA$LOH, 1, function(x) sum(x==0)quantile(xx)[2] BRCA_TCGA$RNAseq<-BRCA_TCGA$RNAseq[xx,] # Get rid of NA and 0 tt<-apply(BRCA_TCGA$RNAseq, 2, function(x)ifelse(is.na(x), 10^-6,x)) tt<-apply(tt, 2, function(x)ifelse(x==0, 10^-6,x)) BRCA_TCGA$RNAseq<-tt lapply(BRCA_TCGA, function(x) sum(is.na(x))) lapply(BRCA_TCGA, dim) save(BRCA_TCGA, file="BRCA_TCGA_348_filtered.RData") system("wget http://www.nature.com/nature/journal/v490/n7418/extref/nature11412-s2.zip") system("unzip nature11412-s2.zip") system("xls2csv nature11412-s2/Supplementary Tables 1-4.xls") clin= read.csv("Supplementary Tables 1.csv", header=TRUE, row.names=1, skip=1) gsub("\\.", "-", BRCA_348[[1]]) %in% rownames(clin) BRCA_TCGA$clin<-clin[gsub("\\.", "-", BRCA_348[[1]]),] save(BRCA_TCGA, file="BRCA_TCGA_348_filtered.RData") ## Make all of the sample names the same (for Omicade4) and save the sampleNameData lapply(BRCA_TCGA[1:7], function(x) all(substr(colnames(x),1,12) == BRCA_348[[1]] )) for (i in 1:7) colnames(BRCA_TCGA[[i]])<- gsub("\\.", "-", BRCA_348[[1]]) colNames<-lapply(colNames, function(x) cbind(x, gsub("\\.", "-", BRCA_348[[1]])) ) for (i in colNames) rownames(i) <- rownames(BRCA_TCGA$clin) BRCA_TCGA$sampleNameInfo<-colNames save(BRCA_TCGA, file="BRCA_TCGA_348_filtered_matchedNames.RData") # Small subset for faster computation lum<-BRCA_TCGA$clin$RPPA.Clusters=="LumA" & BRCA_TCGA$clin$PAM50.mRNA=="Luminal A" & BRCA_TCGA$clin$ER.Status=="Positive" basal<- BRCA_TCGA$clin$RPPA.Clusters=="Basal" & BRCA_TCGA$clin$PAM50.mRNA=="Basal-like" & BRCA_TCGA$clin$ER.Status=="Negative" table(basal) table(lum) table(basal | lum) BRCA_TCGA_79<-lapply(BRCA_TCGA[1:7], function(x) return(x[,rownames(BRCA_TCGA$cl in)[basal|lum]])) BRCA_TCGA_79$clin<- BRCA_TCGA$clin[colnames(BRCA_TCGA_79[[1]]),] save(BRCA_TCGA_79, file="BRCA_TCGA_79_filtered_matchedNames_Basal+LuminalA.RData") ``` Brief Background to ordination (PCA/COA) --------------------------------------- MCIA can be computed using principcal component analysis (pca), correspondence analysis (coa) or one of several other ordination approaches. There are many functions in Bioconductor and R for dimension reduction analysis. Within ade4 these are computed using the duality diagram (dudi) framework (`r citet(allbib[c("Dray:Dufour:2007:JSSOBK:v22i04","delacruz2011")])`) and hence are called dudi.pca(), dudi.coa() etc. ```{r pca} library(ade4) BRCApca<-dudi.pca(BRCA_TCGA_79$RNAseq, scannf=FALSE, nf=5) print(BRCApca) ``` (In ade4, \$co are columns and \$li are lines or rows) ```{r pcasumm} summary(BRCApca) ``` We have a BioC packages called made4 which is a wrapper around the ade4 that utilizes Bioconductor data classes such as Expression Set. In made4 the simplest way to view results from the function ord is to use the function plot. This will draw a plot of the eigenvalues, along with plots of the variables (genes) and a plot of the cases (microarray samples). ```{r plotcoa,message=FALSE, warning=FALSE} library(made4) cl<-as.character(BRCA_TCGA_79$clin$PAM50.mRNA) BRCAord= ord(BRCA_TCGA_79$RNAseq, classvec=factor(cl)) summary(BRCAord$ord) plot(BRCAord, nlab=3, arraylabels=rep("T", 79)) ``` figure 1: Correspondence analysis plots. A. plot of the eigenvalues, B. projection of microarray samples C. projection of genes (gray filled circles) and D. biplot showing both genes and samples. Samples and genes with a strong association are projected in the same direction from the origin. The greater the distance from the origin the stronger the association . The distinction between molecular subtypes is captured on the first eigenvector. The case and gene projections can be visualised with functions plotarrays and plotgenes (sorry the names of these functions steam from the olden days of microarrays). These are wrappers around ade4 plotting functions. The number of genes that are labelled at the end of the axis can be defined. The default is 10. These plots are simplistic and could be rendered more useful with shiny/ggplots/etc ```{r plotarrays2, message=FALSE} par(mfrow=c(2,1)) plotarrays(BRCAord$ord$co, classvec=BRCA_TCGA_79$clin$PAM50.mRNA) plotgenes(BRCAord, n=5, col="red") ``` Figure 2. Plot of A) sample coordinates b) feature coordinates To extract a list of variables with greatest loadings or weights on an axes, (ie those at the end of an axes), use function topgenes. For example, to get a list of the 5 genes at the negative and postive ends of axes 1. ```{r topgenes} ax1<- topgenes(BRCAord, axis=1, n=5) ``` CIA: Two datasets ---------------------- Coinertia analysis (CIA) identifies trends or co-relationships in two datasets which contain the same samples. Either the rows or the columns of a matrix must be "matchable". CIA can be applied to datasets where the number of variables exceed the number of samples. CIA is related to canonical correlation analysis, but CIA optimizes the squared covariance between the eigenvectors whereas CCA optimizes the correlation. There was a nice comparison of sparse CCA and CIA by LeCao et al., LeCao2009 (`r citet(allbib[["LeCao2009"]])`) The function cia calls the function coinertia in the R package ade4 . ```{r BRCAciaplot, echo=FALSE, warning=FALSE} BRCAcia<-cia(BRCA_TCGA_79$RNAseq, BRCA_TCGA_79$RPPA) BRCAcia$coinertia plot(BRCAcia, classvec=BRCA_TCGA_79$clin$PAM50.mRNA, nlab=3, clab=0, cpoint=3 ) ``` figure 4. Plot of coinertia analysis of gene and protein expression datasets. The correlation between the coa axes and the coinertia axes are in $aX and $aY: ```{r ciaMoreCocircle} par(mfrow=c(1,2)) s.corcircle(BRCAcia$coinertia$aX) s.corcircle(BRCAcia$coinertia$aY) ``` fig 5 Scatter diagram of a correlation circle of the coa and coinertia axes. The angle between two vectors gives the degree of correlation (adjacent = highly correlated, orthogonal (90°) = uncorrelated, and opposite (180°) = negatively correlated). The RV coefficient is a measure of global similarity between the datasets. The RV coefficient of this coinertia analysis is: `r BRCAcia$coinertia$RV`. MCIA ------- Multiple Co-Interia Analysis (MCIA) simultaneously projects several datasets into the same dimensional space, transforming diverse sets of features onto the same scale, enabling one to extract the most variant from each dataset (`r citet(allbib[["meng_multivariate_2014"]])`). ```{r mcia, cache=TRUE, eval=FALSE} # Check that there are no zero or low variant low count rows. for (i in 1:7) BRCA_TCGA_79[[i]]<-BRCA_TCGA_79[[i]][!apply(BRCA_TCGA_79[[i]],1, sum)==min(BRCA_TCGA_79[[i]]),] library(omicade4) mcia79<-mcia(BRCA_TCGA_79[1:7]) save(mcia79, file="mciaRes.RData") ``` ```{r loadmCia, echo=FALSE} require(omicade4) load(file.path(datadir,"mciaRes.RData")) ``` ```{r plotmcia, warning=FALSE, message=FALSE} plot(mcia79, axes=1:2, sample.lab=FALSE, sample.legend=FALSE, phenovec=as.numeric(BRCA_TCGA_79$clin$PAM50.mRNA), gene.nlab=2, df.color=c("navy", "cyan", "magenta", "red4", "brown","yellow", "orange"),df.pch=2:8) ``` It is useful to examine the RV coefficient, first examine the reference structure, what is the joint co-structure between each dataset and the reference, or each pair of datasets. Not surprisingly the proteomics and gene expression data share high similarity ```{r rv, echo=FALSE} # RV to reference. Computes the RV coefficient between the representations of individual Cases ($Tl1) with the synthetic variables (reference, $SynVar). RV.mcoa <- function(m,...){ # see RV.rtest # Thanks Pierre Bady (Lyon) # require(ade4) if (!inherits(m, "mcoa")) stop("non convenient data") blo <- sort(unique(m$TL[, 1])) nblo <- length(blo) res <- NULL for(i in 1:nblo){ X <- scale(m$SynVar, scale = FALSE) Y <- scale(m$Tl1[m$TL[,1]==i,], scale = FALSE) X <- X/(sum(svd(X)$d^4)^0.25) Y <- Y/(sum(svd(Y)$d^4)^0.25) X <- as.matrix(X) Y <- as.matrix(Y) w <- sum(svd(t(X) %*% Y)$d^2) res <- c(res,w) } names(res)<- row.names(m$cov2) return(res) } ``` ```{r rvm} RV.mcoa(mcia79$mcoa) ``` # Between each pair of datasets ```{r rvm2} mcia79$mcoa$RV ``` PseduoEigen values ```{r assesingMCIAcov} mcia79$mcoa$cov2 plot(mcia79$mcoa$cov2, xlab = "pseudoeig 1", ylab = "pseudoeig 2", pch=19, col="red") text(mcia79$mcoa$cov2, labels=rownames(mcia79$mcoa$cov2), cex=0.8, adj=0) ``` Eigenvalues (computed on the separate analysis) are in $lambda. These will either be weighted by the first eigenvalues (option="lambda1"), or weighted by the total inertia (option="inertia") depending on the parameter "option". See more in the help for the function ade4:::mcoa. ```{r assessingMCIA, echo=FALSE, eval=FALSE} mcia79$mcoa$lambda plot(mcia79$mcoa$lambda) ``` Examine the projects of the axes using the correlation circle ```{r mciaAxes, message=FALSE, warning=FALSE,fig.keep='all'} mcia79$mcoa$Tax dev.off() par(mfrow=c(4,2)) xx<-by(mcia79$mcoa$Tax, substr(rownames(mcia79$mcoa$Tax),1,3), s.corcircle) ``` Coordinates of reference samples (scores against which the sq covariance is maximized) are in mcia79$mcoa$SynVar. The reference and each data projection can be visualized using kplot. Again the graphics could be so much better. ```{r samplescores} #plotarrays(mcia79$mcoa$SynVar, classvec=BRCA_TCGA_79$clin$PAM50.mRNA) kplot(mcia79$mcoa, mfrow = c(3,4), clab = .8, csub = 3, cpoi = 3) ``` Extracting tranformed features from MCIA results ----------------------------- All of the data have been transformed onto the same scale and the coordinates of genes tranformed in this space are avaialble in mcia79$mcoa$axis ```{r genescores} summary(mcia79$mcoa$axis) par(mfrow=c(1,2)) plot(mcia79$mcoa$axis[,1]~factor(mcia79$mcoa$TC[,1]), col=1:7, names=names(mcia79$coa), ylab="Gene Scores PC1", xlab="", las=2) plot(mcia79$mcoa$axis[,2]~factor(mcia79$mcoa$TC[,1]), col=1:7, names=names(mcia79$coa), ylab="Gene Scores PC2", xlab="", las=2) ``` For example if we look at the top 10 scores of features on the negative end of PC1, it contains features from the miRNA (#1), miRNAprecursor (#2) and RNAseq (#3) datasets. In mcia79$mcoa$axis, each feature ID has a suffix number which refers to the dataset from which it originated. ```{r features} mcia79$mcoa$axis[order(mcia79$mcoa$axis[,1]),][1:10,1, drop=FALSE] ## Dataset suffix cbind(1:7,rownames(mcia79$mcoa$cov2)) ``` Project annotation on plot --------------------------------------------------------------------------------------------------------- ```{r catData} ## To "concatentate"" data, mm<-function(x) substr(x, 1, nchar(x)-2) ids<-mm(rownames(mcia79$mcoa$axis)) # Whilst it would be great, to have alll of our data mapped to genome co-ordinates and really take the union of everything, to keep it simple, I will only look at the Gene Symbols (RNAseq, RPPA) library(HGNChelper) library(Biobase) idsFix<- checkGeneSymbols(ids) ## Get PC1 idsPC<-cbind(idsFix, PC=mcia79$mcoa$axis) GOs<-select(org.Hs.eg.db, columns="GO",keytype="SYMBOL",keys=idsPC$Suggested.Symbol) #Get Coordinates for GOs terms res<-tapply(GOs$SYMBOL, GOs$GO, function(Syms) colMeans(idsPC[idsPC$Suggested.Symbol%in%Syms,c("PC.Axis1","PC.Axis2")])) res<-do.call(rbind,res) res[1:2,] plot(mcia79$mcoa$axis, col =as.numeric(mcia79$mcoa$TC[,1]), pch=as.numeric(mcia79$mcoa$TC[,1])) tt<-topgenes(res, n=5) points(res[tt,], pch=19, col="gray") text(res[tt,], labels=tt, cex=0.8, adj=0) ``` Pathway ---------- Or we can perform pathway analysis on the features. There are many tools within BioC for performing pathway enrichments, for examples we could perform ssGSVA pcs ```{r gsva, eval=FALSE} #Exclude NA idsPC<-idsPC[!is.na(idsPC$Suggested.Symbol),] #Reduce to GeneSymbol and MCIA score, taking max idsPC1<-tapply(idsPC$PC.Axis1, idsPC$Suggested.Symbol,max) idsPC2<-tapply(idsPC$PC.Axis2, idsPC$Suggested.Symbol,max) if( !all(names(idsPC1)==names(idsPC2))) stop() idsPCs<-cbind(idsPC1, idsPC2) # The "built-in" gsva library are mapped to entrezIDs so map symbols entrezPC1<-select(org.Hs.eg.db, columns="ENTREZID",keytype="SYMBOL",keys=rownames(idsPCs)) entrezPC1<-entrezPC1[!duplicated(entrezPC1[,1]),] table(rownames(idsPCs)==entrezPC1[,1]) rownames(idsPCs)= entrezPC1[,2] # Run any enrichment test with gsva require(GSVA) require(GSVAdata) gsva(idsPCs, c2BroadSets) ``` Any Questions? ---------------------------------------------------------------------- Aedin@jimmy.harvard.edu