Contents

1 Read

1.1 Platforms

Autonomics offers platform-specific readers to import omics data into an analysis-ready SummarizedExperiment:

require(autonomics)
  rnafile <- system.file( 'extdata/billing19.rnacounts.txt',     package = 'autonomics' )
  profile <- system.file( 'extdata/billing19.proteingroups.txt', package = 'autonomics' )
  fosfile <- system.file( 'extdata/billing19.phosphosites.txt' , package = 'autonomics' )
diannfile <- download_data( 'dilution.report.tsv' )
 somafile <- system.file('extdata/atkin.somascan.adat',          package = 'autonomics')
 metafile <- system.file('extdata/atkin.metabolon.xlsx',         package = 'autonomics')

  rnaobj <- read_rnaseq_counts(         file = rnafile)
  proobj <- read_maxquant_proteingroups(file = profile)
  fosobj <- read_maxquant_phosphosites( fosfile = fosfile, profile = profile)
diannobj <- read_diann_proteingroups(   file = diannfile)
 somaobj <- read_somascan(              file = somafile)
 metaobj <- read_metabolon(             file = metafile)

1.2 SummarizedExperiment

Accessing SummarizedExperiment content is easy, as illsutrated for the proteomics dataset of Fukuda et al. (2020). This study compared the proteome of Zebrafish Embryos (30 days post-fertilization) with that of Adults. Mass Spectra were processed using MaxQuant (Cox and Mann 2008), and a proteinGroups file with expression values obtained. Import into autonomics and access to the Expression Matrix is easy:

file <- system.file('extdata/fukuda20.proteingroups.txt', package = 'autonomics')
object <- read_maxquant_proteingroups(file)
values(object)[1:3, 1:5]
##                  30dpt.R1 30dpt.R2 30dpt.R3 Adult.R1 Adult.R2
## Q4L216_DANRE     32.55979 31.91666 32.60497 32.22717 31.87129
## A0A0R4IKT8_DANRE 35.81691 35.51418 35.38075 35.55024 35.76000
## A0A0R4IQW7_DANRE 29.56772 28.23717 29.08924 28.58896 29.34068

Access to the sample data.table is also easy:

sdt(object)[1:3, ]
##    sample_id subgroup replicate                  mqcol
##       <char>   <fctr>    <char>                 <char>
## 1:  30dpt.R1   X30dpt        R1 LFQ intensity 30dpt.R1
## 2:  30dpt.R2   X30dpt        R2 LFQ intensity 30dpt.R2
## 3:  30dpt.R3   X30dpt        R3 LFQ intensity 30dpt.R3

as is access to the feature data.table:

fdt(object)[1:3, c(1, 2, 8)]
##     proId       feature_id reverse
##    <char>           <char>  <char>
## 1:     45     Q4L216_DANRE        
## 2:    471 A0A0R4IKT8_DANRE        
## 3:    599 A0A0R4IQW7_DANRE

1.3 Filter Features

Non-informative features are filtered out during reading, with details recorded in the object:

n <- analysis(object)$nfeature

In this case 18 proteingroups (out of 20 identified) were retained for analysis, after applying the following filters:

##               Filter     n
##               <char> <int>
## 1:                      20
## 2:     reverse == ""    19
## 3: contaminant == ""    18

2 Explore

2.1 Features

A feature (e.g. protein) distribution shows how the values of a single feature are distributed across all samples. Feature distributions can be visualized with density, violin, or box plots, as shown below.

require(ggplot2)
d <- plot_feature_densities(object, n = 4) + guides(fill = 'none')
v <- plot_feature_violins(  object, n = 4) + guides(fill = 'none')
b <- plot_feature_boxplots( object, n = 4) + guides(fill = 'none')
gridExtra::grid.arrange(d, v, b, nrow = 1)
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.

2.2 Samples

A sample distribution shows how the values of a single sample are distributed across all features (proteins). Sample distributions can also be visualized with density, violin, or box plots:

require(ggplot2)
d <- plot_sample_densities(object) + guides(fill = 'none')
v <- plot_sample_violins(object)   + guides(fill = 'none')
b <- plot_sample_boxplots(object)  + guides(fill = 'none')
gridExtra::grid.arrange(d, v, b, nrow = 1)
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.

2.3 Principle Component Analysis

In an experiment with p quantified features (here proteingroups), each sample can be thought of as a data point in a p-dimensional space. Principal Component Analysis (Pearson 1901) projects these sample points onto that lower (e.g. 2) dimensional space which maximizes the variance between samples. This two-dimensional biplot greatly aids in comprehending the overall sample similarity structure, as shown below for the dataset under consideration, where subgroup is reassuringly observed to be the major source of variation.

biplot(pca(object))

3 Impute

3.1 Systematic versus Random

systematic <- sum(systematic_nas(object))
random     <-     sum(random_nas(object))
no         <-         sum(no_nas(object))

5 proteingroups have systematic NAs: missing completely in some subgroups but detected in others (for at least half of the samples). These represent potential switch-like responses. They require prior imputation for statistical analysis to return p (rather than NA) values. Note that the apparent systematic nature of these NAs can arise due to chance. Increasing sample size gives greater confidence into the systematic nature of these NA values.

6 proteingroups have random NAs. They are missing in some samples, but the missingness is unrelated to subgroup. These samples do not require require imputation for statistical analyis to return pvalues.

7 proteingroups have no NAs. They are present in all samples.

3.2 Plotting NA structure

The NA structure can also be summarized graphically with either of the two functions below.

p1 <- plot_sample_nas(object) + ggplot2::guides(fill = 'none')
p2 <- plot_subgroup_nas(object)
gridExtra::grid.arrange(p1, p2, nrow = 1)

plot_sample_nas shows NA structure at sample resolution, plotting systematic and random NAs (white) as well as full detections (bright color).
plot_subgroup_nas summarizes NA structure at the subgroup level, differentiating systematic NA values (white) from random NA values and full detections (color).

3.3 Imputing systematic NAs

Proteingroups with systematic misses require prior imputation for statistical analysis to return pvalues (rather than missing values).

require(magrittr)
object %<>% impute(plot = TRUE)
Left: sample distributions, Right: detections

Figure 1: Left: sample distributions, Right: detections

The sample distributions (left) show how imputed values are drawn from a normal distribution, 2.5 standard deviations away from the sample mean, 0.3 standard deviations wide.

The detection plot (right) shows imputed values with a lighter color.

4 Analyze

4.1 fit_lm: General Linear Model

\(t\) statistic and \(p\) value

Diffferential Expression Analysis quantifies whether subgroup differences are significant. The current example dataset has two subgroups (X30dpt and Adults), each with three replicates.

table(object$subgroup)
## 
## X30dpt  Adult 
##      3      3

The t-statistic expresses the difference between two subgroups in standard errors (SE) (i.e. standard devation, normalized for sample size):

\[t = \frac{\textrm{difference}}{\frac{\textrm{sd}}{\sqrt n}}\] When samples from two subgroups are many standard errors away from each other, the \(t\) value will be large, and the difference likely arose due to true subgroup differences.
When samples from two different subgroups are close to each other, on the other hand, the \(t\) value will be small, and the probability that the difference arose due to random sampling is high. This probability (that the difference arose due to random sampling) is known as the p value. The p value expresses a signal (difference) to noise (standard error) ratio, and is very useful for feature (protein) prioritization. A general convention is to call \(p\) < 0.05 differences significant.

General Linear Model and lm

The General Linear Model generalizes the (two-subgroup) t-test to multiple subgroups (e.g. \(t_0\), \(t_1\), \(t_2\)), multiple factors (e.g. time and concentration), as well as numerical covariates (e.g. age and bmi) in a unified modeling framework. In R its classical implementation is the lm modeling engine, to which autonomics offers direct access:

  require(magrittr)
  object %<>% fit_lm()
##     LinMod
##               Code subgroup
##                                level
##                     coefficient X30dpt Adult
##                       Intercept  1      .   
##                       Adult     -1      1
##               Filter
##                       Keep 17/18 features with 3+ values per subgroup
##               lm(~subgroup)
##                          coefficient    fit downfdr upfdr downp   upp
##                               <fctr> <fctr>   <int> <int> <int> <int>
##                       1:   Intercept     lm       0    17     0    17
##                       2:       Adult     lm       2     4     2     5

In the example dataset lm found age-associated downregulations and upregulations. Running lm on 18 proteins took no more that 0.443 seconds, a feat achieved through a performant backend that integrates lm into a data.table environment.

Defining the model: from simple to advanced

Autonomics provides three ways to specify the model, aimed at serving the tastes of laymen as well as experts, as well as the level between.

The simplest approach is to rely on the automated defaults, which build a model (with intercept) using the sample variable ‘subgroup’. A more flexible option is to use the formula interface, allowing to drop intercept, include multiple factors, or numeric covariates. These different options are illustrated below.

object %<>% fit_lm()
object %<>% fit_lm(formula = ~ subgroup)

Changing the coefficient meanings

By default R uses treatment coding, which means the Intercept represents the level of the first subgroup, and subsequent coefficients differences to that first subgroup:

plot_design(object)
##     LinMod
##               Code subgroup
##                                level
##                     coefficient X30dpt Adult
##                       Intercept  1      .   
##                       Adult     -1      1

4.2 fit_limma: Generalized Contrasts and Moderated GLM

Alternative coding schemes are a more advanced topic. And though several such alternative coding are available in R (contr.sum compares each subgroup to the global mean, contr.helmert compares each subgroup level to the average of the previous levels, etc.) it is not always straightforward to find the coding scheme that is appropriate for the scientific question under focus. The coding is also a bit verbose. All of that was made much easier with the arrival of limma (Smyth 2004), to which autonomics offers direct access through fit_limma. The development of limma was motivated by the shifting nature of data in Bioinformatics Studies: a sample was no longer associated with a single value but rather with thousands of values for many different features being measured in parallel (genes, transcripts, proteins, …). This brought challenges, such as false discoveries became much more likely. It also brought wonderful opportunities: since most of the features are typically not differential expressed between two samples (and only a minority are), this background can be used to estimate a residual standard deviation, which is then used to ‘moderate’ the t-statistic: adding this residual standard devation sd0 creates a moderated \(t\) statistic less subject to inflation due to small standard deviation (rather than decent effect sizes).

\[t = \frac{\textrm{difference}}{\frac{\textrm{sd + sd0}}{\sqrt n}}\]

This moderated t statistic was then extended into a General Linear Model. Very interestingly this moderated General Linear Model was formulated in terms of generalized contrasts rather than original coefficients, and an interface was offered to express any scientific question as a custom contrasts of model coefficients. Returning to the simpler zebrafish dataset (Fukuda 2020), limma offers an very intuitive way to formulate custom contrasts, in combination with a model with no intercept:

file <- system.file('extdata/fukuda20.proteingroups.txt', package = 'autonomics')
object <- read_maxquant_proteingroups(file)
object %<>% fit_limma(formula   = ~ 0 + subgroup, contrasts = c('Adult - X30dpt'))

References

Cox, Jürgen, and Matthias Mann. 2008. “MaxQuant Enables High Peptide Identification Rates, Individualized Ppb-Range Mass Accuracies and Proteome-Wide Protein Quantification.” Nature Biotechnology 26 (12): 1367–72.

Fukuda, Ryuichi, Rubén Marı́n-Juez, Hadil El-Sammak, Arica Beisaw, Radhan Ramadass, Carsten Kuenne, Stefan Guenther, et al. 2020. “Stimulation of Glycolysis Promotes Cardiomyocyte Proliferation After Injury in Adult Zebrafish.” EMBO Reports 21 (8): e49752.

Pearson, Karl. 1901. “On Lines and Planes of Closest Fit to Systems of Points in Space.” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11): 559–72.

Smyth, Gordon K. 2004. “Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments.” Statistical Applications in Genetics and Molecular Biology 3 (1). https://doi.org/doi:10.2202/1544-6115.1027.