# 1 The scp package

The scp package is used to process and analyse mass spectrometry (MS)-based single cell proteomics (SCP) data. The functions rely on a specific data structure that wraps QFeatures objects (Gatto (2020)) around SingleCellExperiment objects (Amezquita et al. (2019)). This data structure could be seen as Matryoshka dolls were the SingleCellExperiment objects are small dolls contained in the bigger QFeatures doll.

The SingleCellExperiment class provides a dedicated framework for single-cell data. The SingleCellExperiment serves as an interface to many cutting-edge methods for processing, visualizing and analysis single-cell data. More information about the SingleCellExperiment class and associated methods can be found in the OSCA book.

The QFeatures class is a data framework dedicated to manipulate and process MS-based quantitative data. It preserves the relationship between the different levels of information: peptide to spectrum match (PSM) data, peptide data and protein data. The QFeatures package also provides an interface to many utility functions to streamline the processing MS data. More information about MS data analysis tools can be found in the RforMassSpectrometry project.

Before running the vignette we need to load the scp package.

library("scp")

We also load ggplot2, magrittr and dplyr for convenient data manipulation and plotting.

library("ggplot2")
library("magrittr")
library("dplyr")

# 2 Before you start

This vignette will guide you through some common steps of mass spectrometry-based single-cell proteomics (SCP) data analysis. SCP is an emerging field and further research is required to develop a principled analysis workflow. Therefore, we do not guarantee that the steps presented here are the best steps for this type of data analysis. This vignette performs the steps that were described in the SCoPE2 landmark paper (Specht et al. (2021)). We hope to convince the reader that, although the workflow is probably not optimal, scp has the full potential to perform standardized and principled data analysis. All functions presented here are comprehensively documented, highly modular, can easily be extended with new algorithms. Suggestions, feature requests or bug reports are warmly welcome. Feel free to open an issue in the GitHub repository.

# 3 Read in SCP data

The first step is to read in the PSM quantification table generated by, for example, MaxQuant (Tyanova, Temu, and Cox (2016)). We created a small example data by subsetting the MaxQuant evidence.txt table provided in the SCoPE2 landmark paper (Specht et al. (2021)). The mqScpData table is a typical example of what you would get after reading in a CSV file using read.csv or read.table. See ?mqScpData for more information about the table content.

data("mqScpData")
dim(mqScpData)
#> [1] 1361  149
mqScpData[1:10, 1:5]
#>                                                                                 uid
#> 1          _(Acetyl (Protein N-term))ATNFLAHEK_ 2 0.00052636 190321S_LCA10_X_FP97AG
#> 2           _(Acetyl (Protein N-term))ATNFLAHEK_ 2 0.00058789 190222S_LCA9_X_FP94BM
#> 3  _(Acetyl (Protein N-term))SHTILLVQPTK_ 2 4.0315e-24 190914S_LCB3_X_16plex_Set_21
#> 4         _(Acetyl (Protein N-term))SHTILLVQPTK_ 2 4.7622e-06 190222S_LCA9_X_FP94BM
#> 5        _(Acetyl (Protein N-term))SHTILLVQPTK_ 2 6.8709e-09 190321S_LCA10_X_FP97AG
#> 6               _(Acetyl (Protein N-term))SLVIPEK_ 2 0.053705 190222S_LCA9_X_FP94BM
#> 7          _(Acetyl (Protein N-term))SLVIPEK_ 2 0.0595 190914S_LCB3_X_16plex_Set_21
#> 8            _(Acetyl (Protein N-term))VDREQLVQK_ 2 0.004872 190321S_LCA10_X_FP97AG
#> 9                                        _AAGLALK_ 2 0.020042 190222S_LCA9_X_FP94BM
#> 10                                      _AAGLALK_ 2 0.038752 190321S_LCA10_X_FP97AG
#>       Sequence Length           Modifications
#> 1    ATNFLAHEK      9 Acetyl (Protein N-term)
#> 2    ATNFLAHEK      9 Acetyl (Protein N-term)
#> 3  SHTILLVQPTK     11 Acetyl (Protein N-term)
#> 4  SHTILLVQPTK     11 Acetyl (Protein N-term)
#> 5  SHTILLVQPTK     11 Acetyl (Protein N-term)
#> 6      SLVIPEK      7 Acetyl (Protein N-term)
#> 7      SLVIPEK      7 Acetyl (Protein N-term)
#> 8    VDREQLVQK      9 Acetyl (Protein N-term)
#> 9      AAGLALK      7              Unmodified
#> 10     AAGLALK      7              Unmodified
#>                         Modified.sequence
#> 1    _(Acetyl (Protein N-term))ATNFLAHEK_
#> 2    _(Acetyl (Protein N-term))ATNFLAHEK_
#> 3  _(Acetyl (Protein N-term))SHTILLVQPTK_
#> 4  _(Acetyl (Protein N-term))SHTILLVQPTK_
#> 5  _(Acetyl (Protein N-term))SHTILLVQPTK_
#> 6      _(Acetyl (Protein N-term))SLVIPEK_
#> 7      _(Acetyl (Protein N-term))SLVIPEK_
#> 8    _(Acetyl (Protein N-term))VDREQLVQK_
#> 9                               _AAGLALK_
#> 10                              _AAGLALK_

In order to convert this tabular data to a scp-compatible QFeatures object, we need to provide a metadata table where rows contain sample information and columns must contain at least:

• The name of the batch the sample was acquired in
• The name of the channel the sample was acquired in

Any additional information about the samples will be stored in the colData.

We provide an example of such a data frame. It was formatted from the annotation table provided in the SCoPE2 article. See ?sampleAnnotation for more information about the table content.

data("sampleAnnotation")
sampleAnnotation
#>                         Raw.file               Channel SampleType lcbatch
#> 1          190222S_LCA9_X_FP94BM  Reporter.intensity.1    Carrier    LCA9
#> 2          190222S_LCA9_X_FP94BM  Reporter.intensity.2  Reference    LCA9
#> 3          190222S_LCA9_X_FP94BM  Reporter.intensity.3     Unused    LCA9
#> 4          190222S_LCA9_X_FP94BM  Reporter.intensity.4   Monocyte    LCA9
#> 5          190222S_LCA9_X_FP94BM  Reporter.intensity.5      Blank    LCA9
#> 6          190222S_LCA9_X_FP94BM  Reporter.intensity.6   Monocyte    LCA9
#> 7          190222S_LCA9_X_FP94BM  Reporter.intensity.7 Macrophage    LCA9
#> 8          190222S_LCA9_X_FP94BM  Reporter.intensity.8 Macrophage    LCA9
#> 9          190222S_LCA9_X_FP94BM  Reporter.intensity.9 Macrophage    LCA9
#> 10         190222S_LCA9_X_FP94BM Reporter.intensity.10 Macrophage    LCA9
#> 11         190222S_LCA9_X_FP94BM Reporter.intensity.11 Macrophage    LCA9
#> 12         190222S_LCA9_X_FP94BM Reporter.intensity.12     Unused    LCA9
#> 13         190222S_LCA9_X_FP94BM Reporter.intensity.13     Unused    LCA9
#> 14         190222S_LCA9_X_FP94BM Reporter.intensity.14     Unused    LCA9
#> 15         190222S_LCA9_X_FP94BM Reporter.intensity.15     Unused    LCA9
#> 16         190222S_LCA9_X_FP94BM Reporter.intensity.16     Unused    LCA9
#> 17        190321S_LCA10_X_FP97AG  Reporter.intensity.1    Carrier   LCA10
#> 18        190321S_LCA10_X_FP97AG  Reporter.intensity.2  Reference   LCA10
#> 19        190321S_LCA10_X_FP97AG  Reporter.intensity.3     Unused   LCA10
#> 20        190321S_LCA10_X_FP97AG  Reporter.intensity.4 Macrophage   LCA10
#> 21        190321S_LCA10_X_FP97AG  Reporter.intensity.5   Monocyte   LCA10
#> 22        190321S_LCA10_X_FP97AG  Reporter.intensity.6 Macrophage   LCA10
#> 23        190321S_LCA10_X_FP97AG  Reporter.intensity.7 Macrophage   LCA10
#> 24        190321S_LCA10_X_FP97AG  Reporter.intensity.8 Macrophage   LCA10
#> 25        190321S_LCA10_X_FP97AG  Reporter.intensity.9 Macrophage   LCA10
#> 26        190321S_LCA10_X_FP97AG Reporter.intensity.10 Macrophage   LCA10
#> 27        190321S_LCA10_X_FP97AG Reporter.intensity.11 Macrophage   LCA10
#> 28        190321S_LCA10_X_FP97AG Reporter.intensity.12     Unused   LCA10
#> 29        190321S_LCA10_X_FP97AG Reporter.intensity.13     Unused   LCA10
#> 30        190321S_LCA10_X_FP97AG Reporter.intensity.14     Unused   LCA10
#> 31        190321S_LCA10_X_FP97AG Reporter.intensity.15     Unused   LCA10
#> 32        190321S_LCA10_X_FP97AG Reporter.intensity.16     Unused   LCA10
#> 33  190914S_LCB3_X_16plex_Set_21  Reporter.intensity.1    Carrier    LCB3
#> 34  190914S_LCB3_X_16plex_Set_21  Reporter.intensity.2  Reference    LCB3
#> 35  190914S_LCB3_X_16plex_Set_21  Reporter.intensity.3     Unused    LCB3
#> 36  190914S_LCB3_X_16plex_Set_21  Reporter.intensity.4     Unused    LCB3
#> 37  190914S_LCB3_X_16plex_Set_21  Reporter.intensity.5 Macrophage    LCB3
#> 38  190914S_LCB3_X_16plex_Set_21  Reporter.intensity.6 Macrophage    LCB3
#> 39  190914S_LCB3_X_16plex_Set_21  Reporter.intensity.7      Blank    LCB3
#> 40  190914S_LCB3_X_16plex_Set_21  Reporter.intensity.8   Monocyte    LCB3
#> 41  190914S_LCB3_X_16plex_Set_21  Reporter.intensity.9 Macrophage    LCB3
#> 42  190914S_LCB3_X_16plex_Set_21 Reporter.intensity.10   Monocyte    LCB3
#> 43  190914S_LCB3_X_16plex_Set_21 Reporter.intensity.11      Blank    LCB3
#> 44  190914S_LCB3_X_16plex_Set_21 Reporter.intensity.12 Macrophage    LCB3
#> 45  190914S_LCB3_X_16plex_Set_21 Reporter.intensity.13 Macrophage    LCB3
#> 46  190914S_LCB3_X_16plex_Set_21 Reporter.intensity.14 Macrophage    LCB3
#> 47  190914S_LCB3_X_16plex_Set_21 Reporter.intensity.15 Macrophage    LCB3
#> 48  190914S_LCB3_X_16plex_Set_21 Reporter.intensity.16 Macrophage    LCB3
#> 49 190321S_LCA10_X_FP97_blank_01  Reporter.intensity.1      Blank   LCA10
#> 50 190321S_LCA10_X_FP97_blank_01  Reporter.intensity.2      Blank   LCA10
#> 51 190321S_LCA10_X_FP97_blank_01  Reporter.intensity.3      Blank   LCA10
#> 52 190321S_LCA10_X_FP97_blank_01  Reporter.intensity.4      Blank   LCA10
#> 53 190321S_LCA10_X_FP97_blank_01  Reporter.intensity.5      Blank   LCA10
#> 54 190321S_LCA10_X_FP97_blank_01  Reporter.intensity.6      Blank   LCA10
#> 55 190321S_LCA10_X_FP97_blank_01  Reporter.intensity.7      Blank   LCA10
#> 56 190321S_LCA10_X_FP97_blank_01  Reporter.intensity.8      Blank   LCA10
#> 57 190321S_LCA10_X_FP97_blank_01  Reporter.intensity.9      Blank   LCA10
#> 58 190321S_LCA10_X_FP97_blank_01 Reporter.intensity.10      Blank   LCA10
#> 59 190321S_LCA10_X_FP97_blank_01 Reporter.intensity.11      Blank   LCA10
#> 60 190321S_LCA10_X_FP97_blank_01 Reporter.intensity.12      Blank   LCA10
#> 61 190321S_LCA10_X_FP97_blank_01 Reporter.intensity.13      Blank   LCA10
#> 62 190321S_LCA10_X_FP97_blank_01 Reporter.intensity.14      Blank   LCA10
#> 63 190321S_LCA10_X_FP97_blank_01 Reporter.intensity.15      Blank   LCA10
#> 64 190321S_LCA10_X_FP97_blank_01 Reporter.intensity.16      Blank   LCA10
#>    sortday digest
#> 1       s8      N
#> 2       s8      N
#> 3       s8      N
#> 4       s8      N
#> 5       s8      N
#> 6       s8      N
#> 7       s8      N
#> 8       s8      N
#> 9       s8      N
#> 10      s8      N
#> 11      s8      N
#> 12      s8      N
#> 13      s8      N
#> 14      s8      N
#> 15      s8      N
#> 16      s8      N
#> 17      s8      Q
#> 18      s8      Q
#> 19      s8      Q
#> 20      s8      Q
#> 21      s8      Q
#> 22      s8      Q
#> 23      s8      Q
#> 24      s8      Q
#> 25      s8      Q
#> 26      s8      Q
#> 27      s8      Q
#> 28      s8      Q
#> 29      s8      Q
#> 30      s8      Q
#> 31      s8      Q
#> 32      s8      Q
#> 33      s9      R
#> 34      s9      R
#> 35      s9      R
#> 36      s9      R
#> 37      s9      R
#> 38      s9      R
#> 39      s9      R
#> 40      s9      R
#> 41      s9      R
#> 42      s9      R
#> 43      s9      R
#> 44      s9      R
#> 45      s9      R
#> 46      s9      R
#> 47      s9      R
#> 48      s9      R
#> 49      s8   <NA>
#> 50      s8   <NA>
#> 51      s8   <NA>
#> 52      s8   <NA>
#> 53      s8   <NA>
#> 54      s8   <NA>
#> 55      s8   <NA>
#> 56      s8   <NA>
#> 57      s8   <NA>
#> 58      s8   <NA>
#> 59      s8   <NA>
#> 60      s8   <NA>
#> 61      s8   <NA>
#> 62      s8   <NA>
#> 63      s8   <NA>
#> 64      s8   <NA>

The two tables are supplied to the readSCP function.

colData = sampleAnnotation,
channelCol = "Channel",
batchCol = "Raw.file")
#> Splitting data based on 'Raw.file'
#> Formatting data as a 'QFeatures' object

As indicated by the output on the console, readSCP proceeds as follows:

1. If featureData is the path to a CSV file, it reads the file using read.csv. The table is converted to a SingleCellExperiment object. readSCP needs to know in which field(s) the quantitative data is stored. Those field name(s) is/are provided by the channelCol field in the metaData table. So in this example, the column names holding the quantitative data in mqScpData are stored in the column named Channel in sampleAnnotation.

2. The SingleCellExperiment object is then split according to batch. The split is performed depending on the batchCol field in featureData, in this case the data will be split according to the Raw.file column in mqScpData. Raw.file contains the names of the acquisition runs that was used by MaxQuant to retrieve the raw data files.

3. The sample metadata is generated from the supplied colData. Note that in order for readSCP to correctly match the feature data with the metadata, colData must contain the same batchCol field with batch names.

4. Finally, the split feature data and the sample metadata are stored in a single QFeatures object.

Here is a compact overview of the data:

scp
#> An instance of class QFeatures containing 4 assays:
#>  [1] 190222S_LCA9_X_FP94BM: SingleCellExperiment with 395 rows and 16 columns
#>  [2] 190321S_LCA10_X_FP97AG: SingleCellExperiment with 487 rows and 16 columns
#>  [3] 190321S_LCA10_X_FP97_blank_01: SingleCellExperiment with 109 rows and 16 columns
#>  [4] 190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 370 rows and 16 columns

We can see that the scp object we created is a QFeatures object containing 4 assays. Each assay has an associated name, this is the batch name that was used for splitting. We can also see that each assay is a SingleCellExperiment object. The rows represent the peptide to spectrum matches (PSMs), the number vary depending on the batch. Finally, all three assays contains 16 columns that correspond to the 16 TMT channels recorded during the 4 MS runs.

## 3.1 Special case: empty columns

The SCoPE2 data we provide as an example contains runs that were acquired with two TMT protocols. The 3 first assays were acquired using the TMT-11 protocol and the last assay was acquired using a TMT-16 protocol. When exporting the table, the authors combined the data in a single table, were missing channels in the TMT-11 data are filled with NA. This is essential when working in table format, but since scp keeps the runs separated we can allow for different number of channels per run. readSCP can automatically detect and remove columns that contain only NAs, by supplying the argument removeEmptyCols = TRUE.

colData = sampleAnnotation,
channelCol = "Channel",
batchCol = "Raw.file",
removeEmptyCols = TRUE)
#> Splitting data based on 'Raw.file'
#> Formatting data as a 'QFeatures' object
scp
#> An instance of class QFeatures containing 4 assays:
#>  [1] 190222S_LCA9_X_FP94BM: SingleCellExperiment with 395 rows and 11 columns
#>  [2] 190321S_LCA10_X_FP97AG: SingleCellExperiment with 487 rows and 11 columns
#>  [3] 190321S_LCA10_X_FP97_blank_01: SingleCellExperiment with 109 rows and 11 columns
#>  [4] 190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 370 rows and 16 columns

See here that the 3 first assays contain 11 columns that correspond to the TMT-11 labels and the last assay contains 16 columns that correspond to the TMT-16 labels.

# 4 Clean missing data

All single-cell data contain many zeros. The zeros can be biological zeros or technical zeros and differentiating between the two types is not a trivial task. To avoid artefacts in downstream steps, we replace the zeros by the missing value NA. The zeroIsNA function takes the QFeatures object and the name(s) or index/indices of the assay(s) to clean and automatically replaces any zero in the selected quantitative data by NA.

scp <- zeroIsNA(scp, i = 1:4)

# 5 Filter PSMs

A common steps in SCP is to filter out low-confidence PSMs. Each PSM assay contains feature meta-information that are stored in the rowData of the assays. The QFeatures package allows to quickly filter the rows of an assay by using these information. The available variables in the rowData are listed below for each assay.

rowDataNames(scp)
#> CharacterList of length 4
#> [["190222S_LCA9_X_FP94BM"]] uid Sequence Length ... residual participated
#> [["190321S_LCA10_X_FP97AG"]] uid Sequence Length ... residual participated
#> [["190321S_LCA10_X_FP97_blank_01"]] uid Sequence ... residual participated
#> [["190914S_LCB3_X_16plex_Set_21"]] uid Sequence ... residual participated

## 5.1 Filter features based on feature metadata

Below are some examples of criteria that are used to identify low-confidence. The information is readily available since this was computed by MaxQuant:

• Remove PSMs that are matched to contaminants
• Remove PSMs that are matched to the decoy database
• Keep PSMs that exhibit a high PIF (parental ion fraction), indicative of the purity of a spectrum

We can perform this filtering using the filterFeatures function from QFeatures. filterFeatures automatically accesses the feature metadata and selects the rows that meet the provided condition(s). For instance, Reverse != "+" keeps the rows for which the Reverse variable in the rowData is not "+" (i.e. the PSM is not matched to the decoy database).

scp <- filterFeatures(scp,
~ Reverse != "+" &
Potential.contaminant != "+" &
!is.na(PIF) & PIF > 0.8)

## 5.2 Filter assays based on detected features

To avoid proceeding with failed runs, another interesting filter is to remove assays with too few features. If a batch contains less than, for example, 150 features we can then suspect something wrong happened in that batch and it should be removed. Using dims, we can query the dimensions (hence the number of features and the number of samples) of all assays contained in the dataset.

dims(scp)
#>      190222S_LCA9_X_FP94BM 190321S_LCA10_X_FP97AG 190321S_LCA10_X_FP97_blank_01
#> [1,]                   283                    318                            60
#> [2,]                    11                     11                            11
#>      190914S_LCB3_X_16plex_Set_21
#> [1,]                          200
#> [2,]                           16

Actually, a QFeatures object can be seen as a three-order array: $$features \times samples \times assay$$. Hence, QFeatures supports three-order subsetting x[rows, columns, assays]. We first select the assays that have sufficient PSMs (the number of rows is greater than 150), and then subset the scp object for the assays that meet the criterion.

keepAssay <- dims(scp)[1, ] > 150
scp <- scp[, , keepAssay]
#> Warning: 'experiments' dropped; see 'metadata'
#> harmonizing input:
#>   removing 11 sampleMap rows not in names(experiments)
#>   removing 11 colData rownames not in sampleMap 'primary'
scp
#> An instance of class QFeatures containing 3 assays:
#>  [1] 190222S_LCA9_X_FP94BM: SingleCellExperiment with 283 rows and 11 columns
#>  [2] 190321S_LCA10_X_FP97AG: SingleCellExperiment with 318 rows and 11 columns
#>  [3] 190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 200 rows and 16 columns

Notice the 190321S_LCA10_X_FP97_blank_01 sample was removed because it did not contain sufficient features, as expected from a blank run. This could also have been the case for failed runs.

## 5.3 Filter features based on SCP metrics

Another type of filtering is specific to SCP. In the SCoPE2 analysis, the authors suggest a filters based on the sample to carrier ratio (SCR), that is the reporter ion intensity of a single-cell sample divided by the reporter ion intensity of the carrier channel (200 cells) from the same batch. It is expected that the carrier intensities are much higher than the single-cell intensities.

The SCR can be computed using the computeSCR function from scp. The function must be told which channels are the samples that must be divided and which channel contains the carrier. This information is provided in the sample metadata and is accessed using the colData, under the SampleType field.

table(colData(scp)[, "SampleType"])
#>
#>      Blank    Carrier Macrophage   Monocyte  Reference     Unused
#>          3          3         20          5          3          4

In this dataset, SampleType gives the type of sample that is present in each TMT channel. The SCoPE2 protocole includes 5 types of samples:

• The carrier channels (Carrier) contain 200 cell equivalents and are meant to boost the peptide identification rate.
• The normalization channels (Reference) contain 5 cell equivalents and are used to partially correct for between-run variation.
• The unused channels (Unused) are channels that are left empty due to isotopic cross-contamination by the carrier channel.
• The blank channels (Blank) contain samples that do not contain any cell but are processed as single-cell samples.
• The single-cell sample channels contain the single-cell samples of interest, that are macrophage (Macrophage) or monocyte (Monocyte).

The computeSCR function expects the following input:

• The QFeatures dataset
• The assay name(s) or index/indices for which the SCR should be computed
• The sample metadata variable pointing to the channel annotation
• A string pattern (following regular expression syntax) that uniquely identifies the carrier channel in each batch
• A string pattern (following regular expression syntax) that identifies the samples to divide

The function creates a new field in the rowData of the assays. We store the computed SCR in the rowData and call it MeanSCR.

scp <- computeSCR(scp,
i = 1:3,
colDataCol = "SampleType",
carrierPattern = "Carrier",
samplePattern = "Macrophage|Monocyte",
rowDataName = "MeanSCR")

Before applying the filter, we plot the distribution of the average SCR. We can extract the MeanSCR variable from the rowData of several assays using the rowDataToDF. It takes the rowData field(s) of interest and returns a DataFrame table.

scp %>%
rowDataToDF(i = 1:3,
vars = "MeanSCR") %>%
data.frame %>%
ggplot(aes(x = MeanSCR)) +
geom_histogram() +
geom_vline(xintercept = c(1/200, 0.1),
lty = c(2, 1)) +
scale_x_log10()

The expected ratio between single cells and the carrier is 1/200 (dashed line). We can see that the distribution mode is slightly shifted towards higher ratios with a mode around 0.01. However, there are a few PSMs that stand out of the distribution and have a much higher signal than expected, indicating something wrong happened during the quantification of those PSMs. We therefore filter out PSMs with an average SCR higher than 0.1 (solide line). This is again easily performed using the filterFeatures functions.

scp <- filterFeatures(scp,
~ !is.na(MeanSCR) &
MeanSCR < 0.1)

## 5.4 Filter features to control for FDR

Finally, we might also want to control for false discovery rate (FDR). MaxQuant already computes posterior error probabilities (PEP), but filtering on PEPs is too conservative (Käll et al. (2008)) so we provide the pep2qvalue function to convert PEPs to q-values that are directly related to FDR. We here compute the q-values from the PEP (dart_PEP) across all 3 assays. dart_PEP contains the PEP values that have been updated using the DART-ID algorithm (Chen, Franks, and Slavov (2019)). The function will store the results in the rowData, we here asked to name the new column qvalue_PSMs.

scp <- pep2qvalue(scp,
i = 1:3,
PEP = "dart_PEP",
rowDataName = "qvalue_PSMs")

We also allow to compute q-values at peptide or protein level rather than PSM. In this case, you need to supply the groupBy argument. Suppose we want to compute the q-values at protein level, we can fetch the protein information stored under protein in the rowData. This time, we store the q-values in a new field called qvalue_proteins.

scp <- pep2qvalue(scp,
i = 1:3,
PEP = "dart_PEP",
groupBy = "protein",
rowDataName = "qvalue_proteins")

We can now filter the PSM to control, let’s say, the protein FDR at 1%. This can be performed using filterFeatures because the q-values were stored in the rowData.

scp <- filterFeatures(scp,
~ qvalue_proteins < 0.01)

# 6 Process the PSM data

## 6.1 Relative reporter ion intensity

In order to partialy correct for between-run variation, SCoPE2 suggests computing relative reporter ion intensities. This means that intensities measured for single-cells are divided by the reference channel containing 5-cell equivalents. We use the divideByReference function that divides channels of interest by the reference channel. Similarly to computeSCR, we can point to the samples and the reference columns in each assay using the annotation contained in the colData.

We here divide all columns (using the regular expression wildcard .) by the reference channel (Reference).

scp <- divideByReference(scp,
i = 1:3,
colDataCol = "SampleType",
samplePattern = ".",
refPattern = "Reference")

# 7 Aggregate PSM data to peptide data

Now that the PSM assays are processed, we can aggregate them to peptides. This is performed using the aggregateFeaturesOverAssays function. For each assay, the function aggregates several PSMs into a unique peptide. This is best illustrated by the figure below.

Remember there currently are three assays containing the PSM data.

scp
#> An instance of class QFeatures containing 3 assays:
#>  [1] 190222S_LCA9_X_FP94BM: SingleCellExperiment with 239 rows and 11 columns
#>  [2] 190321S_LCA10_X_FP97AG: SingleCellExperiment with 277 rows and 11 columns
#>  [3] 190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 186 rows and 16 columns

The PSMs are aggregated over the fcol feature variable, here peptides. We also need to supply an aggregating function that will tell how to combine the quantitative data of the PSMs to aggregate. We here aggregate the PSM data using the median value per sample thanks to the matrixStats:colMedians function. Other functions can be used and we refer to ?aggregateFeatures for more information about available aggregation functions. The aggregateFeaturesOverAssays function will create a new assay for each aggregated assay. We name the aggregated assays using the original names and appending peptides_ at the start.

scp <- aggregateFeaturesOverAssays(scp,
i = 1:3,
fcol = "peptide",
name = paste0("peptides_", names(scp)),
fun = matrixStats::colMedians, na.rm = TRUE)

Notice that 3 new assays were created in the scp object. Those new assays contain the aggregated features while the three first assays are unchanged. This allows to keep track of the data processing.

scp
#> An instance of class QFeatures containing 6 assays:
#>  [1] 190222S_LCA9_X_FP94BM: SingleCellExperiment with 239 rows and 11 columns
#>  [2] 190321S_LCA10_X_FP97AG: SingleCellExperiment with 277 rows and 11 columns
#>  [3] 190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 186 rows and 16 columns
#>  [4] peptides_190222S_LCA9_X_FP94BM: SingleCellExperiment with 239 rows and 11 columns
#>  [5] peptides_190321S_LCA10_X_FP97AG: SingleCellExperiment with 277 rows and 11 columns
#>  [6] peptides_190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 185 rows and 16 columns

Under the hood, the QFeatures architecture preserves the relationship between the aggregated assays. See ?AssayLinks for more information on relationships between assays.

# 8 Join the SCoPE2 sets in one assay

Up to now, we kept the data belonging to each MS run in separate assays. We now combine all batches into a single assay. This is done using the joinAssays function from the QFeatures package. Note that we now use the aggregated assays, so assay 4 to 6.

scp <- joinAssays(scp,
i = 4:6,
name = "peptides")

In this case, one new assay is created in the scp object that combines the data from assay 4 to 6. The samples are always distinct so the number of column in the new assay (here $$48$$) will always equals the sum of the columns in the assays to join (here $$16 + 16 + 16$$). The feature in the joined assay might contain less features than the sum of the rows of the assays to join since common features between assays are joined in a single row.

scp
#> An instance of class QFeatures containing 7 assays:
#>  [1] 190222S_LCA9_X_FP94BM: SingleCellExperiment with 239 rows and 11 columns
#>  [2] 190321S_LCA10_X_FP97AG: SingleCellExperiment with 277 rows and 11 columns
#>  [3] 190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 186 rows and 16 columns
#>  [4] peptides_190222S_LCA9_X_FP94BM: SingleCellExperiment with 239 rows and 11 columns
#>  [5] peptides_190321S_LCA10_X_FP97AG: SingleCellExperiment with 277 rows and 11 columns
#>  [6] peptides_190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 185 rows and 16 columns
#>  [7] peptides: SingleCellExperiment with 432 rows and 38 columns

# 9 Filter single-cells

Another common step in single-cell data analysis pipelines is to remove low-quality cells. After subsetting for the samples of interest, we will use 2 metrics: the median relative intensities per cell and the median coefficient of variation (CV) per cell.

## 9.1 Filter samples of interest

We can subset the cells of interest, that is the blank samples (sc_0), the macrophages (sc_m0) and the monocytes (sc_u). This can easily be done by taking advantage of the colData and the subsetting operators. Recall that QFeatures objects support three-order subsetting, x[rows, columns, assays], where columns are the samples of interest.

scp
#> An instance of class QFeatures containing 7 assays:
#>  [1] 190222S_LCA9_X_FP94BM: SingleCellExperiment with 239 rows and 11 columns
#>  [2] 190321S_LCA10_X_FP97AG: SingleCellExperiment with 277 rows and 11 columns
#>  [3] 190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 186 rows and 16 columns
#>  [4] peptides_190222S_LCA9_X_FP94BM: SingleCellExperiment with 239 rows and 11 columns
#>  [5] peptides_190321S_LCA10_X_FP97AG: SingleCellExperiment with 277 rows and 11 columns
#>  [6] peptides_190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 185 rows and 16 columns
#>  [7] peptides: SingleCellExperiment with 432 rows and 38 columns
scp <- scp[, scp\$SampleType %in% c("Blank", "Macrophage", "Monocyte"), ]

The subsetting removes unwanted samples from all assays. The filtered data set contains the same number of assays with the same number of features, but the number of columns (hence sampled) decreased.

scp
#> An instance of class QFeatures containing 7 assays:
#>  [1] 190222S_LCA9_X_FP94BM: SingleCellExperiment with 239 rows and 8 columns
#>  [2] 190321S_LCA10_X_FP97AG: SingleCellExperiment with 277 rows and 8 columns
#>  [3] 190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 186 rows and 12 columns
#>  [4] peptides_190222S_LCA9_X_FP94BM: SingleCellExperiment with 239 rows and 8 columns
#>  [5] peptides_190321S_LCA10_X_FP97AG: SingleCellExperiment with 277 rows and 8 columns
#>  [6] peptides_190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 185 rows and 12 columns
#>  [7] peptides: SingleCellExperiment with 432 rows and 28 columns

## 9.2 Filter based on the median relative intensity

We compute the median relative reporter ion intensity for each cell separately and apply a filter based on this statistic. This procedure recalls that of library size filtering commonly performed in scRNA-Seq data analysis, where the library size is the sum of the counts in each single cell. We will store the median intensity in the colData. The medians are computed for every sample from the quantitative data using the apply function. The data matrix can be extracted from a SingleCellExperiment using the assay function. Setting MARGIN = 2 means that we compute the median by column.

scp[["peptides"]] %>%
assay %>%
apply(MARGIN = 2,
FUN = median,
na.rm = TRUE) ->
medians
#> 190222S_LCA9_X_FP94BMReporter.intensity.4
#>                                 0.7572816
#> 190222S_LCA9_X_FP94BMReporter.intensity.5
#>                                 0.4766553
#> 190222S_LCA9_X_FP94BMReporter.intensity.6
#>                                 0.8314786
#> 190222S_LCA9_X_FP94BMReporter.intensity.7
#>                                 0.4127720
#> 190222S_LCA9_X_FP94BMReporter.intensity.8
#>                                 1.0302978
#> 190222S_LCA9_X_FP94BMReporter.intensity.9
#>                                 0.8830915

The computed medians are then inserted in the colData.

colData(scp)[names(medians), "MedianRI"] <- medians

Looking at the distribution of the median per cell can highlight low-quality cells.

scp %>%
colData %>%
data.frame %>%
ggplot() +
aes(x = MedianRI,
y = SampleType,
fill = SampleType) +
geom_boxplot() +
scale_x_log10()