Authors: Diego Morais [aut, cre], Rodrigo Dalmolin [aut]
Version: 1.0.0

# 1 Overview

The transcriptogramer package is designed for transcriptional analysis based on transcriptograms, a method for analyzing transcriptomes that projects expression values onto a set of ordered proteins. The proteins are arranged so that the probability that two gene products participate in the same metabolic pathway decreases exponentially as the distance between them in the ordering increases. Transcriptograms are therefore genome-wide gene expression profiles that provide a global view of cellular metabolism while indicating gene sets whose expression is altered (Silva et al. 2014; Rybarczyk-Filho et al. 2011).

Methods are provided to analyze topological properties of an interactome, to generate transcriptograms, to detect and to display differentially expressed gene clusters, and to perform a functional enrichment analysis on these clusters.

As the methods require a set of ordered proteins, datasets are available for four species (Homo sapiens, Mus musculus, Saccharomyces cerevisiae and Rattus norvegicus). Each species has three datasets, derived from STRINGdb release 10.5 protein network data, with combined scores greater than or equal to 700, 800 and 900 (see the Hs900, Hs800, Hs700, Mm900, Mm800, Mm700, Sc900, Sc800, Sc700, Rn900, Rn800 and Rn700 datasets). Custom sets of ordered proteins can be generated from protein network data using The transcriptogramer on Windows, or the Seriation package on Mac and Linux.

# 2 Quick start

The first step is to create a Transcriptogram object by running the transcriptogramPreprocess() function. This example uses a subset of the Homo sapiens protein network data, from STRINGdb release 10.5, containing only associations between proteins with a combined score greater than or equal to 900 (see the Hs900 and association datasets).

library(transcriptogramer)
t <- transcriptogramPreprocess(association = association, ordering = Hs900)

## 2.1 Topological analysis

There are two methods to perform topological analysis: connectivityProperties() calculates average graph properties as a function of the node connectivity, and orderingProperties() calculates graph properties projected on the ordered proteins. Some methods, such as orderingProperties(), use a window, a region of n (radius * 2 + 1) proteins centered at a protein, whose radius changes the output. The Transcriptogram object has a radius slot that can be set during or after its preprocessing (see the Transcriptogram-class documentation).

## during the preprocessing

## creating the object and setting the radius as 0
t <- transcriptogramPreprocess(association = association, ordering = Hs900)

## creating the object and setting the radius as 50
t <- transcriptogramPreprocess(association = association, ordering = Hs900,
                               radius = 50)

## after the preprocessing

## modifying the radius of an existing Transcriptogram object
radius(object = t) <- 25

## getting the radius of an existing Transcriptogram object
r <- radius(object = t)
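The relation between the radius and the window size can be checked with a few lines of base R. This is only an illustration of the formula n = radius * 2 + 1; the helper name `window_size` is hypothetical and is not part of the package.

```r
# Number of proteins in a window: the central protein plus
# `radius` neighbours on each side.
window_size <- function(radius) 2 * radius + 1

# Window sizes for the radii used in this section.
radii <- c(0, 25, 30, 50)
sizes <- sapply(radii, window_size)
sizes  # 1 51 61 101
```

A radius of 0 therefore yields a window that contains only the central protein.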

The output of the orderingProperties() method is partially affected by the radius slot.

oPropertiesR25 <- orderingProperties(object = t, nCores = 1)

## slightly changing the radius
radius(object = t) <- 30

## this output differs partially from oPropertiesR25
oPropertiesR30 <- orderingProperties(object = t, nCores = 1)

As the connectivityProperties() method does not use a window, its output is not affected by the radius slot.

cProperties <- connectivityProperties(object = t)
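To make the idea of "a graph property averaged as a function of node connectivity" concrete, the following base R sketch computes, on a toy network, the mean connectivity of each node's neighbours and then averages it over nodes that share the same connectivity. The network, the chosen property and all variable names are assumptions made for illustration; the properties actually returned by connectivityProperties() are described in the package documentation.

```r
# Toy undirected protein network as an edge list (hypothetical identifiers).
edges <- data.frame(
  p1 = c("A", "A", "A", "B", "C"),
  p2 = c("B", "C", "D", "C", "D"),
  stringsAsFactors = FALSE
)

# Connectivity (degree) of each node: number of edges it takes part in.
degree <- table(c(edges$p1, edges$p2))

# List every edge in both directions so each node sees all its neighbours.
neighbours <- rbind(
  edges,
  data.frame(p1 = edges$p2, p2 = edges$p1, stringsAsFactors = FALSE)
)

# Mean connectivity of the neighbours of each node.
neighbour_k <- tapply(as.numeric(degree[neighbours$p2]), neighbours$p1, mean)

# Average of that property over all nodes sharing the same connectivity k.
by_k <- tapply(neighbour_k[names(degree)], as.vector(degree), mean)
by_k
```

Because the calculation depends only on the edges, no window and no radius are involved.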

## 2.2 Transcriptogram

A transcriptogram is generated in two steps and requires expression values, from microarray or RNA-Seq assays, and a dictionary. This example uses the datasets GSE9988, which contains normalized expression values of 3 cases and 3 controls referring to the innate immune responses to TREM-1 activation, and GPL570, a mapping between ENSEMBL Peptide ID and Affymetrix Human Genome U133 Plus 2.0 Array probe identifier.

The methods to generate a transcriptogram are transcriptogramStep1() and transcriptogramStep2(). For each transcriptome sample, the transcriptogramStep1() method assigns to each protein the average of the expression values of all the identifiers related to it.

t <- transcriptogramStep1(object = t, expression = GSE9988,
                          dictionary = GPL570, nCores = 1)
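The averaging performed in this step can be sketched in base R on toy data. The dictionary, the expression matrix and every name below are hypothetical, and the code illustrates the idea only; it is not the package's implementation.

```r
# Toy dictionary mapping proteins to probe identifiers (hypothetical values).
dictionary <- data.frame(
  protein = c("P1", "P1", "P2"),
  probe   = c("probe_a", "probe_b", "probe_c"),
  stringsAsFactors = FALSE
)

# Toy expression matrix: one row per probe, one column per sample.
expression <- matrix(
  c(2, 4, 10,    # sample_1
    6, 8, 20),   # sample_2
  ncol = 2,
  dimnames = list(c("probe_a", "probe_b", "probe_c"),
                  c("sample_1", "sample_2"))
)

# For each protein and each sample, average the values of all probes
# mapped to that protein: sum per protein, then divide by the probe count.
probe_values <- expression[dictionary$probe, , drop = FALSE]
sums   <- rowsum(probe_values, group = dictionary$protein)
counts <- as.vector(table(dictionary$protein)[rownames(sums)])
step1  <- sums / counts
step1
```

Here P1 is mapped to two probes, so its value in each sample is the mean of both, while P2 keeps the value of its single probe.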

To reduce random noise, the transcriptogramStep2() method assigns to each position of the ordering the average of the expression values inside a window centered at that position. Periodic boundary conditions are used to deal with proteins near the ends of the ordering.

t <- transcriptogramStep2(object = t, nCores = 1)
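A window average with periodic boundary conditions can be sketched as follows for a single sample. The function name `periodic_window_mean` and the toy values are assumptions for illustration; the package may implement this step differently.

```r
# Average the values inside a window of the given radius around every
# position, wrapping around the ends (periodic boundary conditions).
periodic_window_mean <- function(values, radius) {
  n <- length(values)
  offsets <- seq(-radius, radius)
  vapply(seq_len(n), function(i) {
    idx <- ((i - 1 + offsets) %% n) + 1   # wrap indices into 1..n
    mean(values[idx])
  }, numeric(1))
}

# Expression values of one sample, already placed along the ordering.
values <- c(1, 2, 3, 4, 5, 6)
smoothed <- periodic_window_mean(values, radius = 1)
smoothed  # 3 2 3 4 5 4
```

The first position averages the last, first and second values (6, 1, 2), which is how the wrap-around treats proteins near the ends of the ordering.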

The Transcriptogram object has slots to store the outputs of the transcriptogramStep1() and transcriptogramStep2() methods, called transcriptogramS1 and transcriptogramS2 respectively. Because the output of some methods depends on the content of the transcriptogramS2 slot, that slot can be recalculated from the content of the transcriptogramS1 slot, for example after changing the radius.

radius(object = t) <- 50
t <- transcriptogramStep2(object = t, nCores = 1)

## 2.3 Functional enrichment analysis

As nearby genes of a transcriptogram have a high probability of interacting with each other, gene sets whose expression is altered can be identified using the limma package. The differentiallyExpressed() method uses the limma package to identify differentially expressed genes, for the contrast “case-control”, grouping as a cluster a set of genes whose positions are within a radius range specified by the content of the radius slot.

For this example, the p-value threshold for the false discovery rate is set to 0.005. If the name of a species is provided as input, the biomaRt package is used to translate the ENSEMBL Peptide IDs to Symbols (gene names); alternatively, a data frame can be provided and used instead. The levels argument classifies the columns of the transcriptogramS2 slot that refer to samples. As there are 6 such columns (see the GSE9988 dataset), a logical vector is created that uses TRUE to label the columns referring to control samples and FALSE to label the columns referring to case samples.

levels <- c(rep(FALSE, 3), rep(TRUE, 3))
t <- differentiallyExpressed(object = t, levels = levels, pValue = 0.005)

## translating ENSEMBL Peptide IDs to Symbols using the biomaRt package
## Internet connection is required for this command
t <- differentiallyExpressed(object = t, levels = levels, pValue = 0.005,
                             species = "Homo sapiens")

## translating ENSEMBL Peptide IDs to Symbols using the DEsymbols dataset
t <- differentiallyExpressed(object = t, levels = levels, pValue = 0.005,
                             species = DEsymbols)

This method also produces a plot referring to its output. In this case, four clusters were detected, and each one is represented by a color. It is important to mention that not all the colored genes were detected as differentially expressed, but, as they were within the radius specified by the content of the radius slot, they were included in a cluster. The genes that are above the horizontal blue line are upregulated, and the genes that are below are downregulated.
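The following base R sketch shows one way in which positions can end up in a cluster without being differentially expressed themselves. It assumes a simple rule, not confirmed by the package documentation: each differentially expressed position claims the window of the given radius around it, and overlapping or adjacent windows are merged. The function name and the positions are hypothetical.

```r
# Group positions of differentially expressed genes into clusters.
# Assumed rule for this sketch: each position claims the window
# [position - radius, position + radius], clipped to 1..n, and windows
# that overlap or touch are merged into one cluster.
cluster_positions <- function(de_positions, radius, n) {
  de_positions <- sort(unique(de_positions))
  starts <- pmax(1, de_positions - radius)
  ends   <- pmin(n, de_positions + radius)
  # A new cluster starts whenever a window begins after the previous one ends.
  new_cluster <- c(TRUE, starts[-1] > ends[-length(ends)] + 1)
  id <- cumsum(new_cluster)
  data.frame(
    cluster = unique(id),
    start   = as.vector(tapply(starts, id, min)),
    end     = as.vector(tapply(ends, id, max))
  )
}

clusters <- cluster_positions(c(10, 14, 40), radius = 3, n = 100)
clusters
```

With these toy values, positions 10 and 14 merge into one cluster spanning 7 to 17, so positions such as 11 to 13 are included even though only 10 and 14 were flagged, while position 40 forms a separate cluster spanning 37 to 43.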

The differentially expressed genes identified by this method are stored in the DE slot of the Transcriptogram object, and its content can be obtained using the DE method. By default, the p-values are adjusted by the Benjamini-Hochberg procedure.

DE <- DE(object = t)
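The effect of the Benjamini-Hochberg adjustment can be seen with base R's p.adjust() function on a few made-up p-values. The values are hypothetical and the snippet does not reproduce the call made inside the package.

```r
# Hypothetical raw p-values, one per tested gene.
p_raw <- c(0.001, 0.004, 0.03, 0.2)

# Benjamini-Hochberg adjustment, as implemented in base R's stats package.
p_adj <- p.adjust(p_raw, method = "BH")
p_adj  # 0.004 0.008 0.040 0.200

# Keep the tests whose adjusted p-value is below the threshold of 0.005.
significant <- p_adj < 0.005
significant  # TRUE FALSE FALSE FALSE
```

The second raw p-value (0.004) is below the threshold, but its adjusted value (0.008) is not, so only the first test is retained.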

The clusterVisualization() method uses the RedeR package to display graphs of the differentially expressed clusters and returns an object of the RedPort class, allowing interaction through functions of the RedeR package. This method may take some time, depending on the number of clusters and of nodes per cluster, and requires the Java Runtime Environment (>= 6). If the DE slot of the Transcriptogram object has a column named Symbol, its contents will be used as node aliases.

rdp <- clusterVisualization(object = t)

The clusterEnrichment() method performs a functional enrichment analysis using the topGO package. By default, the universe is composed of all the proteins present in the transcriptogramS2 slot, the ontology is set to biological process, the algorithm is set to classic, the statistic is set to fisher, and the p-values are adjusted by the Benjamini-Hochberg procedure. For this example, the p-value threshold for the false discovery rate is set to 0.005. This method uses the biomaRt package to build a gene2GO list if the name of a species is provided as input; alternatively, a data frame can be provided and used instead.

## using the HsBPTerms dataset to create the gene2GO list
terms <- clusterEnrichment(object = t, species = HsBPTerms,
                           pValue = 0.005, nCores = 1)

## using the biomaRt package to create the gene2GO list
## Internet connection is required for this command
terms <- clusterEnrichment(object = t, species = "Homo sapiens",
                           pValue = 0.005, nCores = 1)
head(terms)
##        GO.ID                                        Term Annotated
## 1 GO:0006355 regulation of transcription, DNA-templat...      2202
## 2 GO:1903506 regulation of nucleic acid-templated tra...      2219
## 3 GO:2001141      regulation of RNA biosynthetic process      2231
## 4 GO:0051252         regulation of RNA metabolic process      2308
## 5 GO:0006351                transcription, DNA-templated      2336
## 6 GO:0097659        nucleic acid-templated transcription      2348
##   Significant Expected  pValue ClusterNumber
## 1          30    10.48 8.9e-10             3
## 2          30    10.56 1.1e-09             3
## 3          30    10.62 1.2e-09             3
## 4          30    10.98 2.9e-09             3
## 5          30    11.11 4.0e-09             3
## 6          30    11.17 4.6e-09             3
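As a rough guide to reading the Annotated, Significant and Expected columns, the sketch below runs a one-sided Fisher's exact test on hypothetical counts for one GO term and one cluster, using base R's fisher.test(). All counts are invented, and the snippet is a simplified stand-in for the classic Fisher test, not a reproduction of the topGO call made by the package.

```r
# Hypothetical counts for one GO term and one cluster.
universe_size   <- 1000   # proteins in the universe
cluster_size    <- 50     # proteins in the cluster
annotated       <- 100    # universe proteins annotated to the term
annotated_in_cl <- 20     # cluster proteins annotated to the term

# Number of annotated proteins expected in the cluster by chance.
expected <- annotated * cluster_size / universe_size

# 2 x 2 contingency table: cluster membership versus annotation status.
counts <- matrix(
  c(annotated_in_cl, annotated - annotated_in_cl,
    cluster_size - annotated_in_cl,
    universe_size - annotated - cluster_size + annotated_in_cl),
  nrow = 2,
  dimnames = list(cluster = c("in", "out"), term = c("annotated", "other"))
)

# One-sided Fisher's exact test for over-representation of the term.
p_value <- fisher.test(counts, alternative = "greater")$p.value
```

In this toy case 20 annotated proteins are observed in the cluster against 5 expected by chance, which gives a small p-value for over-representation.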

# 3 Session info

sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] transcriptogramer_1.0.0 BiocStyle_2.6.0
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.13         compiler_3.4.2       iterators_1.0.8
##  [4] prettyunits_1.0.2    bitops_1.0-6         tools_3.4.2
##  [7] progress_1.1.2       biomaRt_2.34.0       digest_0.6.12
## [10] bit_1.1-12           lattice_0.20-35      RSQLite_2.0
## [13] evaluate_0.10.1      memoise_1.1.0        tibble_1.3.4
## [16] doSNOW_1.0.15        pkgconfig_2.0.1      rlang_0.1.2
## [19] graph_1.56.0         foreach_1.4.3        igraph_1.1.2
## [22] DBI_0.7              yaml_2.1.14          parallel_3.4.2
## [25] SparseM_1.77         topGO_2.30.0         stringr_1.2.0
## [28] knitr_1.17           S4Vectors_0.16.0     IRanges_2.12.0
## [31] grid_3.4.2           stats4_3.4.2         rprojroot_1.2
## [34] bit64_0.9-7          data.table_1.10.4-3  Biobase_2.38.0
## [37] R6_2.2.2             snow_0.4-2           AnnotationDbi_1.40.0
## [40] XML_3.98-1.9         rmarkdown_1.6        bookdown_0.5
## [43] limma_3.34.0         GO.db_3.4.2          RedeR_1.26.0
## [46] blob_1.1.0           magrittr_1.5         matrixStats_0.52.2
## [49] codetools_0.2-15     backports_1.1.1      htmltools_0.3.6
## [52] BiocGenerics_0.24.0  assertthat_0.2.0     stringi_1.1.5
## [55] RCurl_1.95-4.8
warnings()
## NULL

# References

Rybarczyk-Filho, José Luiz, Mauro A A Castro, Rodrigo J S Dalmolin, José C F Moreira, Leonardo G Brunnet, and Rita M C De Almeida. 2011. “Towards a genome-wide transcriptogram: The Saccharomyces cerevisiae case.” Nucleic Acids Research 39 (8): 3005–16. doi:10.1093/nar/gkq1269.

Silva, Samoel RM da, Gabriel C Perrone, João M Dinis, and Rita MC de Almeida. 2014. “Reproducibility enhancement and differential expression of non predefined functional gene sets in human genome.” BMC Genomics 15 (1): 1181. doi:10.1186/1471-2164-15-1181.