Contents

1 Introduction

This vignette demonstrates how to use the package sSNAPPY to compute directional single sample pathway perturbation scores by incorporating pathway topologies, utilize sample permutation to test the significance of individual scores and compare average pathway activities across treatments.

The package also provides a function to visualise overlap between pathway genes contained in perturbed biological pathways as network plots.

2 To get ready

2.1 Installation

The package sSNAPPY can be installed using the package BiocManager

if (!"BiocManager" %in% rownames(installed.packages()))
  install.packages("BiocManager")
# Other packages required in this vignette
pkg <- c("tidyverse", "magrittr", "ggplot2", "cowplot", "DT")
BiocManager::install(pkg)
BiocManager::install("sSNAPPY")
install.packages("htmltools")

2.2 Load packages

library(sSNAPPY)
library(tidyverse)
library(magrittr)
library(ggplot2)
library(cowplot)
library(DT)
library(htmltools)

2.3 Load data

The example dataset used for this tutorial can be loaded with data() as shown below. It’s a subset of data retrieved from Singhal H et al. 2016, where ER-positive primary breast cancer tumour tissues collected from 12 patients were split into tissue fragments of equal sizes for different treatments.

For this tutorial, we are only looking at the RNA-seq data from samples that were treated with vehicle, R5020(progesterone) OR E2(estrogen) + R5020 for 48 hrs. They were from 5 different patients, giving rise to 15 samples in total. A more detailed description of the dataset can be assessed through the help page (?logCPM_example and ?metadata_example).

data(logCPM_example)
data(metadata_example)
# check if samples included in the logCPM matrix and metadata dataframe are identical
setequal(colnames(logCPM_example), metadata_example$sample)
## [1] TRUE
# View sample metadata
datatable(metadata_example,  filter = "top")

3 sSNAPPY workflow

3.1 Compute weighted single-sample logFCs (ssLogFCs)

It is expected that the logCPM matrix will be filtered to remove undetectable genes and normalised to correct for library sizes or other systematic artefacts, such as gene length or GC contents, prior to applying the sSNAPPY workflow. Filtration and normalisation have been performed on the example dataset.

Before single-sample logFCs (ssLogFCs) can be computed, row names of the logCPM matrix need to be converted to entrez ID. This is because all the pathway topology information retrieved will be in entrez ID. The conversion can be achieved through bioconductor packages AnnotationHub and ensembldb.

head(logCPM_example)

To compute the ssLogFCs, samples must be in matching pairs. In our example data, treated samples were matched to the corresponding control samples derived from the same patients. Therefore the factor parameter of the weight_ss_fc() functions will be set to be “patient”. The function also requires the control treatment level to be specified, which was “Vehicle” in this case.

weight_ss_fc() requires both the logCPM matrix and sample metadata as input. The column names of the logCPM matrix should be sample names, matching a column called sample in the metadata. The metadata must also contain a treatment column, and a column corresponding to the factor parameter (i.e.patient in this case).

#compute weighted single sample logFCs
weightedFC <- weight_ss_fc(logCPM_example, metadata = metadata_example,
factor = "patient", control = "Vehicle")

The weight_ss_fc() function firstly computes raw ssLogFCs for each gene by subtracting logCPM values of the control sample from the logCPM values of treated samples for each patient.

It has been demonstrated previously that in RNA-seq data, lowly expressed genes turn to have a larger variance, which is also demonstrated by the plots below. To reduce the impact of this artefact, weight_ss_fc also weights each ssLogFCs by estimating the relationship between the variance in ssLogFCs and mean logCPM, and defining the gene-wise weight to be the inverse of the predicted variance of that gene’s mean logCPM value.

perSample_FC <- lapply(levels(metadata_example$patient), function(x){
    temp <- logCPM_example[seq_len(1000),str_detect(colnames(logCPM_example), x)] 
    ratio <- temp[, str_detect(colnames(temp), "Vehicle", negate = TRUE)] - temp[, str_detect(colnames(temp), "Vehicle")] 
}) %>%
    do.call(cbind,.)
aveCPM <- logCPM_example[seq_len(1000),] %>%
    rowMeans() %>%
    enframe(name = "gene_id", 
            value = "aveCPM")
p1 <- perSample_FC %>%
    as.data.frame() %>%
    rownames_to_column("gene_id") %>%
    pivot_longer(cols = -"gene_id",
                 names_to = "name",
                 values_to = "logFC") %>%
    left_join(aveCPM) %>%
    ggplot(aes(aveCPM, logFC)) +
    geom_point() +
    labs(y = "sslogFC", 
         x = "Average logCPM") +
    theme(
        panel.background = element_blank()
    )
p2 <- data.frame(
    gene_id = rownames(perSample_FC),
    variance = perSample_FC %>%
        apply(1,var)) %>%
    left_join(aveCPM) %>%
    ggplot(aes(aveCPM, variance)) +
    geom_point() +
    geom_smooth(method = "loess") +
    labs(y = "Variance in ssLogFCs", 
         x = "Average logCPM") +
    theme(
        panel.background = element_blank()
    )
plot_grid(p1, p2)