Contents

1 Purpose

Coordinate Covariation Analysis (COCOA) is a method to understand variation among samples and can be used with data that includes genomic coordinates such as DNA methylation. To describe the method on a high level, COCOA uses a database of “region sets” and principal component analysis (PCA) of the data to identify sources of variation among samples. A region set is a set of genomic regions that share a biological annotation, for instance transcription factor (TF) binding regions, histone modification regions, or open chromatin regions. COCOA can identify region sets where there is intersample variation, giving you biologically meaningful insight into variation in your data. For a more in-depth description of the method, see the “Method details” section of this vignette. In contrast to some other common techniques, COCOA is unsupervised, meaning that samples do not have to be divided into groups such as case/control or healthy/disease, although COCOA works in those situations as well. Also, COCOA focuses on continuous variation between samples instead of having cutoffs. Because of this, COCOA can be used as a complementary method alongside “differential” methods that find discrete differences between groups of samples and it can also be used in situations where there are no groups.

2 Our test case

In this vignette, we will see how COCOA can find meaningful sources of intersample variation in breast cancer patients. We will use data from The Cancer Genome Atlas: 450k DNA methylation microarrays for 657 breast cancer patients (TCGA-BRCA, https://portal.gdc.cancer.gov/). This vignette begins after PCA has been performed on the DNA methylation data. As you can see below, PC1 and PC4 cluster patients based on estrogen receptor status which is an important prognostic marker for breast cancer:

Based on this, we might expect that the variation captured by these PCs would be associated with estrogen receptor (consistent with interpatient variation in estrogen receptor status). We are using an easy example for this vignette but COCOA will still work even if there are not distinct clusters in your PCA plot and if you do not have known groups of samples. Outside of this vignette, COCOA was performed with a region set database of over 2000 region sets. To keep things simple for this vignette, we will only use two of the highest scoring region sets and two of the lowest scoring region sets from that analysis to demonstrate how COCOA works.

3 Inputs

There are three main inputs for COCOA:

  1. PCA loadings (prcomp()$rotation).
  2. Genomic coordinates associated with the raw data used for the PCA and with the PCA loadings.
  3. Region sets (normally a database of region sets).

We did PCA with DNA methylation data for all autosomal chromosomes but only the DNA methylation data and loadings for cytosines in chromosome 1 are included in this vignette to minimize memory requirements. brcaLoadings1 has the chr1 loadings. brcaMCoord1 has the coordinates for the chr1 DNA methylation data, which are also the coordinates for the chr1 loadings. For more on the relationship between the loadings and the original data see the “Method details” section. In this vignette, we will use only a few region sets to demonstrate how to use the COCOA functions. For actual use, COCOA should be done with many region sets (ie hundreds or > 1000). Publicly available collections of region sets can be found online (eg http://databio.org/regiondb).

4 Running COCOA

First, we load the example data and required packages:

library(COCOA)
library(data.table)
library(ggplot2)
data("esr1_chr1")
data("gata3_chr1")
data("nrf1_chr1")
data("atf3_chr1")
data("brcaLoadings1")
data("brcaMCoord1")
data("brcaPCScores")
data("brcaMethylData1")

We have region sets for four transcription factors from ChIP-seq experiments (with the same reference genome as our breast cancer data): Esr1 in MCF-7 cells, Gata3 in MCF-7 cells, Nrf1 in HepG2 cells, and Atf1 in K562 cells.

# prepare data
GRList <- GRangesList(esr1_chr1, gata3_chr1, nrf1_chr1, atf3_chr1) 
regionSetName <- c("esr1_chr1", "gata3_chr1", "nrf1_chr1", "atf3_chr1")

Now let’s give each region set a score with runCOCOA() to quantify how much it is associated with each principal component:

PCsToAnnotate <- paste0("PC", 1:4)
regionSetScores <- runCOCOA(loadingMat=brcaLoadings1, 
                            signalCoord=brcaMCoord1, 
                            GRList=GRList, 
                            PCsToAnnotate=PCsToAnnotate, 
                            scoringMetric="regionMean")
regionSetScores$regionSetName <- regionSetName

As an easy way to visualize the results, we can see how the region sets are ranked for each PC:

rsScoreHeatmap(regionSetScores, 
               PCsToAnnotate=paste0("PC", 1:4), 
               rsNameCol = "regionSetName",
               orderByPC = "PC1", 
               column_title = "Region sets ordered by score for PC1")

We can see that Gata3 had the highest score for PCs 1, 3, and 4 but that Esr1 had a higher score for PC2. While you might expect that Esr1 would be ranked first for all PCs, Gata3 is known to be associated with breast cancer and Esr1 activity so these results still make sense. As mentioned before, this vignette is showing two of the top scoring and bottom scoring region sets from a previous COCOA analysis with over 2200 region sets. So despite Esr1 being ranked second out of four for some PCs in this vignette which may not seem very impressive, it had one of the best scores in the previous larger analysis. On a practical note, if you want to arrange the heatmap by region set scores for another PC, you can just change the orderByPC parameter like so:

rsScoreHeatmap(regionSetScores, 
               PCsToAnnotate=paste0("PC", 1:4),
               rsNameCol = "regionSetName",
               orderByPC = "PC2", 
               column_title = "Region sets ordered by score for PC2")

5 Understanding the results

We can further understand the variability in these region sets in several ways:

  1. Look at whether variability is specific to the regions of interest compared to the genome around these regions.
  2. Look at the genomic signal in these regions, in our case DNA methylation, and whether it follows the same trends as the PC scores.
  3. Look at the loadings in each region of a given region set to see whether all regions have high loadings or just a subset the regions. Also we can see if the same regions have high loadings for multiple PCs.

5.1 Specificity of variation to the regions of interest

We can see whether variability along the PC is specific to the region of interest by comparing the region of interest to the surrounding genome. To do this, we will calculate the average loadings of a wide area surrounding the regions of interest.

wideGRList <- lapply(GRList, resize, width=14000, fix="center")
loadProfile <- lapply(wideGRList, function(x) getLoadingProfile(loadingMat=brcaLoadings1,
                                                                signalCoord=brcaMCoord1,
                                                                regionSet=x, 
                                                                PCsToAnnotate=PCsToAnnotate,
                                                                binNum=21))

We will normalize the result for each PC so we can better compare them. Here we normalize by subtracting the mean absolute loading of each PC from the region set profiles for the corresponding PC. Then we get the plot scale so we can easily compare the different profiles. These normalization steps are helpful for comparing the loading profiles but not necessarily required so it’s not essential that you understand the below code.

# average loading value from each PC to normalize so PCs can be compared with each other
avLoad <- apply(X=brcaLoadings1[, PCsToAnnotate], 
                MARGIN=2, 
                FUN=function(x) mean(abs(x)))

# normalize
loadProfile <- lapply(loadProfile, 
                      FUN=function(x) as.data.frame(mapply(FUN = function(y, z) x[, y] - z, 
                                                           y=PCsToAnnotate, z=avLoad)))
binID = 1:nrow(loadProfile[[1]])
loadProfile <- lapply(loadProfile, FUN=function(x) cbind(binID, x))

# for the plot scale
maxVal <- max(sapply(loadProfile, FUN=function(x) max(x[, PCsToAnnotate])))
minVal <- min(sapply(loadProfile, FUN=function(x) min(x[, PCsToAnnotate])))

# convert to long format for plots
loadProfile <- lapply(X=loadProfile, FUN=function(x) tidyr::gather(data=x, key="PC", value="loading_value", PCsToAnnotate))
loadProfile <- lapply(loadProfile, 
                      function(x){x$PC <- factor(x$PC, levels=PCsToAnnotate); return(x)})

Let’s look at the plots!

wrapper <- function(x, ...) paste(strwrap(x, ...), collapse="\n") 
profilePList <- list()
for (i in seq_along(loadProfile)) {
    
    thisRS <- loadProfile[[i]]
    
    profilePList[[i]] <- ggplot(data=thisRS, 
                                mapping=aes(x=binID , y=loading_value)) + 
        geom_line() + ylim(c(minVal, maxVal)) + facet_wrap(facets="PC") + 
        ggtitle(label=wrapper(regionSetName[i], width=30)) + 
        xlab("Genome around region set, 14 kb") + 
        ylab("Normalized loading value") + 
        theme(panel.grid.major.x=element_blank(), 
              panel.grid.minor.x=element_blank(), 
              axis.text.x=element_blank(), 
              axis.ticks.x=element_blank())
    profilePList[[i]]

}
profilePList[[1]]

profilePList[[2]]