Contents

1 Introduction

This vignette demonstrates how to use the package sSNAPPY to compute directional single sample pathway perturbation scores by incorporating pathway topologies, utilize sample permutation to test the significance of individual scores and compare average pathway activities across treatments.

The package also provides a function to visualize overlap between pathway genes contained in perturbed biological pathways as network plots.

2 Installation

The package sSNAPPY can be installed using the package BiocManager, along with other packages required for this vignette (tidyverse, magrittr, ggplot2, cowplot, and DT).

if (!"BiocManager" %in% rownames(installed.packages()))
  install.packages("BiocManager")
pkg <- c("tidyverse", "magrittr", "ggplot2", "cowplot", "DT")
BiocManager::install(pkg)
BiocManager::install("sSNAPPY")

3 Load packages

library(sSNAPPY)
library(tidyverse)
library(magrittr)
library(ggplot2)
library(cowplot)
library(DT)

4 load data

The example dataset used for this tutorial can loaded with data() as shown below. It’s a subset of data retrieved from Singhal H et al. 2016, where ER-positive primary breast cancer tumor tissues collected from 12 patients were split into fragments of equal sizes for different treatments. For the purpose of this tutorial, we only took the RNA-seq data from samples that were treated with vehicle, E2 OR E2 + R5020 for 48 hrs. They were from 5 different patients, giving rise to 15 samples in total. More detailed description of the dataset can be assessed through the help page (?logCPM_example and ?metadata_example).

data(logCPM_example)
data(metadata_example)
# check if samples included in the logCPM matrix and metadata dataframe are identical
setequal(colnames(logCPM_example), metadata_example$sample)
## [1] TRUE
# View sample metadata
datatable(metadata_example,  filter = "top")

5 Compute weighted single-sample logFCs (ssLogFCs)

It is expected that the logCPM matrix will be filtered to remove undetectable genes and normalised to correct for library sizes or other systematic artefacts, such as gene length or GC contents, prior to applying this method. Filtration and normalisation has been performed on the example dataset.

Before single-sample logFCs (ssLogFCs) can be computed, row names of the logCPM matrix need to be converted to entrez ID. This is because all the pathway topology information retrieved will be in entrez ID. The conversion can be achieved through bioconductor packages AnnotationHub and ensembldb.

head(logCPM_example)
##       Vehicle_N2_48 E2+R5020_N2_48 R5020_N2_48 Vehicle_N3_48 E2+R5020_N3_48
## 7105       4.081669       5.109147    4.695391      4.673548       5.217774
## 57147      5.211258       3.844185    4.343678      3.601125       3.427482
## 55732      2.111865       1.366069    2.471200      1.975589       1.713835
## 2268       6.930587       4.341791    4.388130      5.309135       5.066294
## 90529      1.725384       2.997631    3.035200      3.161472       3.320000
## 22875      4.099909       3.635736    3.845961      4.701901       4.886946
##       R5020_N3_48 Vehicle_P4_48 E2+R5020_P4_48 R5020_P4_48 Vehicle_P5_48
## 7105     4.296947      4.667555       4.741936    4.621193      5.484910
## 57147    4.277160      5.128271       4.486209    4.624768      3.549639
## 55732    2.222948      2.437893       3.147235    3.327889      2.368816
## 2268     3.775364      2.912324       3.911222    4.258518      3.029545
## 90529    2.719585      4.578478       3.624418    3.660976      3.477681
## 22875    4.624176      5.453882       4.616053    5.200638      4.080996
##       E2+R5020_P5_48 R5020_P5_48 Vehicle_P6_48 E2+R5020_P6_48 R5020_P6_48
## 7105        5.678579    5.118092      5.636142       5.434636    5.753488
## 57147       4.113434    3.802290      3.815933       3.965867    3.914067
## 55732       2.438389    2.594323      1.028753       1.493969    2.365653
## 2268        1.530079    3.044646      2.904204       3.796048    3.336565
## 90529       3.307986    3.073607      3.577813       3.450174    3.613297
## 22875       3.696430    3.267462      4.436550       4.363616    4.402518

To compute the ssLogFCs, samples must be in matching pairs. In the example, treated samples are matched to the corresponding control sample that were derived from the same patients. So the factor parameter of the weight_ss_fc() functions needs to be set to be “patient”. The function also requires the control treatment level to be specified, which was “Vehicle” in this case.

weight_ss_fc() requires both the logCPM matrix and sample metadata as input. The column names of the logCPM matrix should be sample name, matching to a column called sample in the metadata. The metadata must also contain a treatment column, and a column corresponding to the factor parameter (i.e.. patient in this case).

#compute weighted single sample logFCs
weightedFC <- weight_ss_fc(logCPM_example, metadata = metadata_example,
factor = "patient", control = "Vehicle")

The weight_ss_fc() function firstly computes raw ssLogFCs for each gene by subtracting logCPM values of control sample from the logCPM values of treated samples for each patient.

It has been demonstrated previously that in RNA-seq data, lowly expressed genes turn to have larger variance, which is also demonstrated by the plots below. To reduce the impact of this artefact, weight_ss_fc also weight each ssLogFCs by estimating the relationship between the variance in ssLogFCs and mean logCPM, and defining the gene-wise weight to be the inverse of the predicted variance of that gene’s mean logCPM value.

perSample_FC <- lapply(levels(metadata_example$patient), function(x){
    temp <- logCPM_example[seq_len(1000),str_detect(colnames(logCPM_example), x)] 
    ratio <- temp[, str_detect(colnames(temp), "Vehicle", negate = TRUE)] - temp[, str_detect(colnames(temp), "Vehicle")] 
}) %>%
    do.call(cbind,.)
aveCPM <- logCPM_example[seq_len(1000),] %>%
    rowMeans() %>%
    enframe(name = "gene_id", 
            value = "aveCPM")
p1 <- perSample_FC %>%
    as.data.frame() %>%
    rownames_to_column("gene_id") %>%
    pivot_longer(cols = -"gene_id",
                 names_to = "name",
                 values_to = "logFC") %>%
    left_join(aveCPM) %>%
    ggplot(aes(aveCPM, logFC)) +
    geom_point() +
    labs(y = "sslogFC", 
         x = "Average logCPM") +
    theme(
        panel.background = element_blank()
    )
p2 <- data.frame(
    gene_id = rownames(perSample_FC),
    variance = perSample_FC %>%
        apply(1,var)) %>%
    left_join(aveCPM) %>%
    ggplot(aes(aveCPM, variance)) +
    geom_point() +
    geom_smooth(method = "loess") +
    labs(y = "Variance in ssLogFCs", 
         x = "Average logCPM") +
    theme(
        panel.background = element_blank()
    )
plot_grid(p1, p2)