1 Installation of the package

SubCellBarCode can be installed through the BiocManager package as follows:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("SubCellBarCode")

2 Load the package

library(SubCellBarCode)

3 Data preparation and classification

3.1 Example Data

As example data, we provide the publicly available HCC827 (human lung adenocarcinoma cell line) TMT10plex-labelled proteomics dataset (Orre et al. 2019, Molecular Cell). The data.frame consists of 10480 proteins as rows (row names are gene-centric protein IDs) and 5 fractions with duplicates as columns (replicates must be named “.A.” and “.B.”, respectively).

head(hcc827Ctrl)
##        FS1.A.HCC827 FS1.B.HCC827 FS2.A.HCC827 FS2.B.HCC827 FP1.A.HCC827
## A2M          6.6567       4.8238       0.8265       0.8279       0.4475
## A2ML1        1.2876       1.0878       0.6390       0.7828       0.8760
## A4GALT       0.4711       0.4106       0.2742       0.2689       0.8389
## AAAS         0.5108       0.4514       0.4470       0.4752       1.6576
## AACS         4.5593       4.4522       1.5694       1.6417       0.5294
## AAED1        0.8170       0.7031       0.5415       0.5902       0.9429
##        FP1.B.HCC827 FP2.A.HCC827 FP2.B.HCC827 FP3.A.HCC827 FP3.B.HCC827
## A2M          0.5803       0.6414       0.6014       0.6497       0.6216
## A2ML1        0.9138       2.6957       1.3606       0.7715       0.7859
## A4GALT       0.8043       1.9637       1.8340       1.6739       1.6848
## AAAS         1.7316       1.3773       1.4688       1.0071       1.0457
## AACS         0.5197       0.5109       0.5232       0.4197       0.4243
## AAED1        0.9505       1.9035       1.8475       1.1060       1.1636
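The layout requirements described above can be checked before running the pipeline. The following is a minimal base R sketch on a made-up toy data.frame (the column suffix "demo" and all values are illustrative assumptions, not part of the package); it only verifies that replicate columns are tagged with ".A." and ".B." and that row names are unique.

```r
# Toy input in the expected layout (values are made up for illustration).
toy <- data.frame(
    FS1.A.demo = c(6.66, 1.29), FS1.B.demo = c(4.82, 1.09),
    FS2.A.demo = c(0.83, 0.64), FS2.B.demo = c(0.83, 0.78),
    row.names = c("A2M", "A2ML1")
)

# Columns belonging to each replicate are identified by ".A." / ".B.".
repA <- grep(".A.", colnames(toy), fixed = TRUE, value = TRUE)
repB <- grep(".B.", colnames(toy), fixed = TRUE, value = TRUE)

# Every fraction should appear once per replicate, in the same order.
fractionsA <- sub(".A.", ".", repA, fixed = TRUE)
fractionsB <- sub(".B.", ".", repB, fixed = TRUE)
stopifnot(length(repA) == length(repB), identical(fractionsA, fractionsB))

# Row names should be unique gene symbols.
stopifnot(!anyDuplicated(rownames(toy)))
```

If either check fails, the column names or row names of the input should be corrected before the data is loaded.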

3.2 Marker Proteins

The classification of protein localisation using the SubCellBarCode method depends on 3365 marker proteins as defined in Orre et al. The markerProteins data.frame contains protein names (gene symbols), the associated subcellular localization (compartment), a color code for the compartment, and the median normalized fractionation profile (log2) based on five different human cell lines (NCI-H322, HCC827, MCF7, A431 and U251), here called the “5CL marker profile”.

head(markerProteins)
##          Proteins Compartments       Cyto        Nsol       NucI        Horg
## AAAS         AAAS           S4 -1.0033518 -1.15489468  0.9303367  0.48554266
## AACS         AACS           C4  2.1716569  0.13246046 -1.0394634 -1.13265585
## AAK1         AAK1           C3  1.8556445  0.10015281 -0.6605511 -0.36985132
## AARS         AARS           C4  2.1012831  0.05855811 -1.0439250 -1.08451971
## AASDHPPT AASDHPPT           C5  1.7065897  0.51507061 -0.7196594 -0.65267201
## AATF         AATF           N3 -0.8922053 -1.10101864  1.2202070  0.05316807
##                Lorg       Colour
## AAAS      0.1618329      tomato2
## AACS     -1.3217619 deepskyblue2
## AAK1     -0.6741959         cyan
## AARS     -1.2180979 deepskyblue2
## AASDHPPT -1.0059841   turquoise3
## AATF      0.1898676       grey50

3.3 Load and normalize data

The input data.frame is checked for “NA” values and for the correct format. Any row containing an “NA” value is removed. The data frame is then log2-transformed.

df <- loadData(protein.data = hcc827Ctrl)
cat(dim(df))
## 10480 10
head(df)
##        FS1.A.HCC827 FS1.B.HCC827 FS2.A.HCC827 FS2.B.HCC827 FP1.A.HCC827
## A2M       2.7348072    2.2701701   -0.2749133   -0.2724716  -1.16004041
## A2ML1     0.3646845    0.1214133   -0.6461122   -0.3532843  -0.19099723
## A4GALT   -1.0858948   -1.2841945   -1.8666995   -1.8948583  -0.25342925
## AAAS     -0.9691696   -1.1475217   -1.1616533   -1.0733933   0.72909591
## AACS      2.1888123    2.1545184    0.6502131    0.7151905  -0.91756990
## AAED1    -0.2915920   -0.5081982   -0.8849668   -0.7607242  -0.08482332
##        FP1.B.HCC827 FP2.A.HCC827 FP2.B.HCC827 FP3.A.HCC827 FP3.B.HCC827
## A2M     -0.78512917   -0.6407037   -0.7336032  -0.62215439  -0.68594159
## A2ML1   -0.13004965    1.4306600    0.4442430  -0.37426194  -0.34758234
## A4GALT  -0.31419437    0.9735745    0.8749936   0.74321334   0.75257734
## AAAS     0.79210571    0.4618428    0.5546380   0.01020694   0.06446902
## AACS    -0.94424904   -0.9688872   -0.9345656  -1.25256963  -1.23684342
## AAED1   -0.07324147    0.9286546    0.8855744   0.14535139   0.21859520
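To make the two preprocessing steps concrete, here is a minimal base R sketch of NA removal followed by a log2 transformation. It is an illustration on made-up toy values under the description above, not the internal implementation of loadData, which also validates the input format.

```r
# Toy data with one missing value (illustrative numbers only).
raw <- data.frame(
    FS1.A.demo = c(4, NA, 0.5),
    FS1.B.demo = c(2, 1, 0.25),
    row.names = c("P1", "P2", "P3")
)

# Step 1: drop every row that contains at least one NA.
complete <- raw[stats::complete.cases(raw), , drop = FALSE]

# Step 2: log2-transform the remaining values.
logged <- log2(complete)

print(logged)
```

The same transformation explains the example output: the raw A2M value 6.6567 in FS1.A.HCC827 becomes log2(6.6567), which is approximately 2.7348.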

Additional step: SubCellBarCode identifies proteins by gene symbol. If the input data uses another identifier (e.g. UNIPROT, IPI or Entrez ID), you can convert it to gene symbols with the convert2symbol function.
Please be aware that some IDs (most likely few) may be lost during the conversion.

## Run this step only if your identifiers are not gene symbols.
## The call below converts UNIPROT identifiers to gene symbols.
## The default id is "UNIPROT"; change it if you use another identifier.

#df <- convert2symbol(df = df, id = "UNIPROT")
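Why some IDs can be lost during conversion is easiest to see on a toy example. The sketch below uses a hypothetical lookup table with placeholder IDs (ID_1 to ID_4 are not real identifiers, and this is not how convert2symbol is implemented); rows whose ID has no symbol, or whose symbol is already taken, are dropped.

```r
# Hypothetical lookup table: placeholder IDs mapped to gene symbols.
# ID_3 has no known symbol, so it cannot be converted.
id2symbol <- c(ID_1 = "AAAS", ID_2 = "AACS", ID_3 = NA, ID_4 = "AAK1")

# Toy data keyed by the placeholder IDs (values are illustrative).
toy <- data.frame(FS1.A.demo = c(-0.97, 2.19, 0.10, 1.86),
                  row.names = c("ID_1", "ID_2", "ID_3", "ID_4"))

# Keep rows with a symbol, and only the first row for each symbol.
symbols <- id2symbol[rownames(toy)]
keep <- !is.na(symbols) & !duplicated(symbols)
converted <- toy[keep, , drop = FALSE]
rownames(converted) <- symbols[keep]

cat("Proteins lost during conversion:", nrow(toy) - nrow(converted), "\n")
```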

For the downstream analysis, we use a randomly selected subset of 6000 proteins.

set.seed(2)
df <- df[sample(nrow(df), 6000),]

3.4 Calculate covered marker proteins

The overlap between the marker proteins (3365) and the input data.frame is calculated and visualized for each compartment in a bar plot.

Note that we recommend at least 20% coverage of marker proteins for each compartment. If certain compartments are underrepresented, we recommend repeating the cell fractionation. If coverage is low for all compartments, we recommend increasing the analytical depth of the MS analysis.

c.prots <- calculateCoveredProtein(proteinIDs = rownames(df), 
                        markerproteins = markerProteins[,1]) 

## Overall Coverage of marker proteins :  0.58
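The per-compartment coverage behind the bar plot can be approximated by hand. This is a minimal base R sketch, assuming a five-protein excerpt of the marker table shown earlier and a hypothetical list of quantified proteins; it is not the implementation of calculateCoveredProtein.

```r
# Toy marker table: protein names and compartments (excerpt for illustration).
markers <- data.frame(
    Proteins = c("AAAS", "AACS", "AAK1", "AARS", "AATF"),
    Compartments = c("S4", "C4", "C3", "C4", "N3")
)

# Proteins quantified in a hypothetical experiment.
quantified <- c("AACS", "AAK1", "AATF", "A2M")

# Fraction of markers found in the data, per compartment and overall.
covered <- markers$Proteins %in% quantified
coverage <- tapply(covered, markers$Compartments, mean)
print(round(coverage, 2))
cat("Overall coverage:", round(mean(covered), 2), "\n")

# Flag compartments below the recommended 20% coverage.
low <- names(coverage)[coverage < 0.2]
print(low)
```

In this toy case compartment S4 has no covered marker and would be flagged, while the remaining compartments are above the recommended threshold.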

3.5 Quality control of the marker proteins

To avoid reduced classification accuracy, marker proteins with noisy quantification and marker proteins that are not representative of their associated compartment (e.g. due to cell-type-specific localization) are filtered out by a two-step quality control.

  1. Marker proteins with Pearson correlations less than 0.8 between A and B duplicates for each cell line are filtered out (Figure A).

  2. Pairwise correlations between the 5CL marker profile and the input data are calculated for each protein (A and B replicate experiments separately) using both Pearson and Spearman correlation. The lowest value for each method is then used for filtering, with cut-offs set to 0.8 and 0.6 respectively, to exclude non-representative marker proteins (Figure B).

r.markers <- markerQualityControl(coveredProteins = c.prots, protein.data = df)
## Number of removed replicate-wise proteins: 0
## Number of removed sample-wise proteins: 1
## Number of total removed marker proteins: 1
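The two filtering steps can be illustrated with a small base R sketch. It assumes a single cell line, three hypothetical markers (M1 to M3) and made-up log2 profiles over five fractions; the helper names rowCor and minCor are introduced here for illustration only, and this is not the code used by markerQualityControl.

```r
# Toy reference profiles (stand-in for the 5CL marker profile).
fractions <- c("Cyto", "Nsol", "NucI", "Horg", "Lorg")
mk <- c("M1", "M2", "M3")
ref <- matrix(c( 2.0,  0.1, -1.0, -1.1, -1.3,
                -1.0, -1.2,  0.9,  0.5,  0.2,
                -0.9, -1.1,  1.2,  0.1,  0.2),
              nrow = 3, byrow = TRUE, dimnames = list(mk, fractions))

# Replicates A and B: close to the reference by default.
repA <- ref + 0.05
repB <- ref - 0.05
# M2: replicate B disagrees with replicate A (noisy quantification).
repB["M2", ] <- rev(repB["M2", ])
# M3: replicates agree with each other but not with the reference.
repA["M3", ] <- c(1.8, 0.3, -0.9, -0.8, -1.0)
repB["M3", ] <- repA["M3", ] - 0.1

# Row-wise correlation between two matrices with identical row names.
rowCor <- function(x, y, method) {
    vapply(rownames(x), function(p) cor(x[p, ], y[p, ], method = method),
           numeric(1))
}

# Step 1: replicate agreement (Pearson between A and B, cut-off 0.8).
repCor <- rowCor(repA, repB, "pearson")
step1 <- names(repCor)[repCor >= 0.8]

# Step 2: agreement with the reference; take the lowest value over both
# replicates for each method (Pearson cut-off 0.8, Spearman cut-off 0.6).
minCor <- function(method) {
    r <- ref[step1, , drop = FALSE]
    pmin(rowCor(repA[step1, , drop = FALSE], r, method),
         rowCor(repB[step1, , drop = FALSE], r, method))
}
keep <- step1[minCor("pearson") >= 0.8 & minCor("spearman") >= 0.6]

cat("Removed replicate-wise:", length(mk) - length(step1), "\n")
cat("Removed sample-wise:", length(step1) - length(keep), "\n")
```

In this toy example M2 is removed in the first step, M3 in the second, and only M1 is retained as a marker.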