SubCellBarCode can be installed with the BiocManager package as follows:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("SubCellBarCode")
library(SubCellBarCode)
As example data, we provide the publicly available HCC827 (human lung
adenocarcinoma cell line) TMT10plex-labelled proteomics dataset
(Orre et al. 2019, Molecular Cell). The data.frame consists of
10480 proteins as rows (row names are gene-centric protein ids)
and 5 fractions in duplicate as columns, 10 columns in total
(replicate columns must be named “.A.” and “.B.”, respectively).
head(hcc827Ctrl)
## FS1.A.HCC827 FS1.B.HCC827 FS2.A.HCC827 FS2.B.HCC827 FP1.A.HCC827
## A2M 6.6567 4.8238 0.8265 0.8279 0.4475
## A2ML1 1.2876 1.0878 0.6390 0.7828 0.8760
## A4GALT 0.4711 0.4106 0.2742 0.2689 0.8389
## AAAS 0.5108 0.4514 0.4470 0.4752 1.6576
## AACS 4.5593 4.4522 1.5694 1.6417 0.5294
## AAED1 0.8170 0.7031 0.5415 0.5902 0.9429
## FP1.B.HCC827 FP2.A.HCC827 FP2.B.HCC827 FP3.A.HCC827 FP3.B.HCC827
## A2M 0.5803 0.6414 0.6014 0.6497 0.6216
## A2ML1 0.9138 2.6957 1.3606 0.7715 0.7859
## A4GALT 0.8043 1.9637 1.8340 1.6739 1.6848
## AAAS 1.7316 1.3773 1.4688 1.0071 1.0457
## AACS 0.5197 0.5109 0.5232 0.4197 0.4243
## AAED1 0.9505 1.9035 1.8475 1.1060 1.1636
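The naming requirement above can be checked before running the pipeline. The following is a minimal sketch in base R; `checkReplicateColumns` is a hypothetical helper written for this illustration, not a SubCellBarCode function, and the toy data reuses a few values from the output above.

```r
# Hypothetical helper (not part of SubCellBarCode): checks that every
# ".A." replicate column has a matching ".B." column.
checkReplicateColumns <- function(protein.data) {
  cols <- colnames(protein.data)
  a.cols <- grep(".A.", cols, fixed = TRUE, value = TRUE)
  b.cols <- grep(".B.", cols, fixed = TRUE, value = TRUE)
  # Name each ".A." column's expected ".B." partner.
  expected.b <- sub(".A.", ".B.", a.cols, fixed = TRUE)
  list(
    n.fractions = length(a.cols),
    paired = length(a.cols) > 0 &&
      length(a.cols) == length(b.cols) &&
      all(expected.b %in% b.cols)
  )
}

# Toy data: column names and values taken from the head() output above.
toy <- data.frame(
  FS1.A.HCC827 = c(6.6567, 1.2876),
  FS1.B.HCC827 = c(4.8238, 1.0878),
  FS2.A.HCC827 = c(0.8265, 0.6390),
  FS2.B.HCC827 = c(0.8279, 0.7828),
  row.names = c("A2M", "A2ML1")
)
res <- checkReplicateColumns(toy)
print(res)
```

On the full hcc827Ctrl data the same check would be expected to report five paired fractions.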
The classification of protein localisation using the SubCellBarCode
method depends on 3365 marker proteins as defined in Orre et al.
The markerProteins data.frame
contains protein names (gene symbols), the
associated subcellular localization (compartment), a colour code for
the compartment and the median normalized fractionation
profile (log2) based on five different human cell lines
(NCI-H322, HCC827, MCF7, A431 and U251), here called the
“5CL marker profile”.
head(markerProteins)
## Proteins Compartments Cyto Nsol NucI Horg
## AAAS AAAS S4 -1.0033518 -1.15489468 0.9303367 0.48554266
## AACS AACS C4 2.1716569 0.13246046 -1.0394634 -1.13265585
## AAK1 AAK1 C3 1.8556445 0.10015281 -0.6605511 -0.36985132
## AARS AARS C4 2.1012831 0.05855811 -1.0439250 -1.08451971
## AASDHPPT AASDHPPT C5 1.7065897 0.51507061 -0.7196594 -0.65267201
## AATF AATF N3 -0.8922053 -1.10101864 1.2202070 0.05316807
## Lorg Colour
## AAAS 0.1618329 tomato2
## AACS -1.3217619 deepskyblue2
## AAK1 -0.6741959 cyan
## AARS -1.2180979 deepskyblue2
## AASDHPPT -1.0059841 turquoise3
## AATF 0.1898676 grey50
The input data.frame
is checked for “NA” values and for the correct
format. Any row containing an “NA” value is deleted.
The data frame is then log2
transformed.
df <- loadData(protein.data = hcc827Ctrl)
cat(dim(df))
## 10480 10
head(df)
## FS1.A.HCC827 FS1.B.HCC827 FS2.A.HCC827 FS2.B.HCC827 FP1.A.HCC827
## A2M 2.7348072 2.2701701 -0.2749133 -0.2724716 -1.16004041
## A2ML1 0.3646845 0.1214133 -0.6461122 -0.3532843 -0.19099723
## A4GALT -1.0858948 -1.2841945 -1.8666995 -1.8948583 -0.25342925
## AAAS -0.9691696 -1.1475217 -1.1616533 -1.0733933 0.72909591
## AACS 2.1888123 2.1545184 0.6502131 0.7151905 -0.91756990
## AAED1 -0.2915920 -0.5081982 -0.8849668 -0.7607242 -0.08482332
## FP1.B.HCC827 FP2.A.HCC827 FP2.B.HCC827 FP3.A.HCC827 FP3.B.HCC827
## A2M -0.78512917 -0.6407037 -0.7336032 -0.62215439 -0.68594159
## A2ML1 -0.13004965 1.4306600 0.4442430 -0.37426194 -0.34758234
## A4GALT -0.31419437 0.9735745 0.8749936 0.74321334 0.75257734
## AAAS 0.79210571 0.4618428 0.5546380 0.01020694 0.06446902
## AACS -0.94424904 -0.9688872 -0.9345656 -1.25256963 -1.23684342
## AAED1 -0.07324147 0.9286546 0.8855744 0.14535139 0.21859520
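The two operations described for this step, dropping rows with “NA” values and log2 transformation, can be sketched in base R as follows. This only illustrates the description above and is not the implementation of loadData, which also checks the input format; the row named TOY1 is made up to show the NA handling.

```r
# Toy input: values for A2M and A2ML1 come from the hcc827Ctrl output
# shown earlier; the TOY1 row is invented to demonstrate NA removal.
toy <- data.frame(
  FS1.A.HCC827 = c(6.6567, 1.2876, NA),
  FS2.A.HCC827 = c(0.8265, 0.6390, 0.5000),
  row.names = c("A2M", "A2ML1", "TOY1")
)

# Keep only rows without any NA, then log2 transform all values.
toy.complete <- toy[complete.cases(toy), , drop = FALSE]
toy.log2 <- log2(toy.complete)

print(dim(toy.log2))
print(round(toy.log2, 4))
```

The value for A2M in FS1.A.HCC827 matches the corresponding entry in the loadData output above (6.6567 becomes roughly 2.7348).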
Additional step:
We use gene symbols for protein identification, so the row names of the
input data must be gene symbols. If the input data uses another
identifier, e.g. UNIPROT, IPI or Entrez ID, you can convert it to gene
symbols with the convert2symbol function.
Please be aware that some ids (most likely few) may be lost during the
conversion.
## Run this only if your identifiers are not gene symbols.
## The function converts UNIPROT identifiers to gene symbols.
## The default id is "UNIPROT"; change it if you use another identifier.
# df <- convert2symbol(df = df, id = "UNIPROT")
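To illustrate why ids can be lost during conversion, here is a minimal base R sketch. It assumes a hypothetical lookup table with placeholder identifiers (these are not real UniProt accessions), and it does not reproduce how convert2symbol works internally. Ids are lost in two ways: an identifier has no known symbol, or several identifiers map to the same symbol.

```r
# Hypothetical lookup table with placeholder ids and symbols.
id.map <- c(ID001 = "GENE1", ID002 = "GENE2", ID003 = "GENE2")

toy <- data.frame(
  F1 = c(0.5, 1.2, 1.4, 0.9),
  F2 = c(1.5, 0.8, 0.6, 1.1),
  row.names = c("ID001", "ID002", "ID003", "ID999")
)

# Look up a symbol for every row; ids without a mapping give NA.
symbols <- unname(id.map[rownames(toy)])

# Drop unmapped ids and keep the first row for each duplicated symbol.
keep <- !is.na(symbols) & !duplicated(symbols)
converted <- toy[keep, , drop = FALSE]
rownames(converted) <- symbols[keep]

cat("Input ids:", nrow(toy), " Retained symbols:", nrow(converted), "\n")
```

Keeping the first row per symbol is only one possible choice for this sketch; comparing the row count before and after conversion shows how many ids were lost.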
For the downstream analysis in this example, we use a randomly selected subset of 6000 proteins.
set.seed(2)
df <- df[sample(nrow(df), 6000),]
The overlap between the marker proteins (3365) and the input data.frame is calculated and visualized for each compartment in a bar plot.
Note that we recommend at least 20% coverage of marker proteins for each compartment. If certain compartments are underrepresented, we recommend performing the cell fractionation again. If coverage is low for all compartments, we recommend increasing the analytical depth of the MS analysis.
c.prots <- calculateCoveredProtein(proteinIDs = rownames(df),
                                   markerproteins = markerProteins[,1])
## Overall Coverage of marker proteins : 0.58
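The per-compartment coverage and the 20% recommendation can be sketched in base R as follows. This uses a made-up marker table and made-up input ids, and it is an illustration of the idea rather than the implementation of calculateCoveredProtein.

```r
# Toy marker table: the column names follow markerProteins, but the
# proteins and compartment assignments are invented for illustration.
toy.markers <- data.frame(
  Proteins = paste0("P", 1:15),
  Compartments = rep(c("C1", "N1"), times = c(5, 10)),
  stringsAsFactors = FALSE
)

# Toy input ids: four C1 markers, one N1 marker, two non-marker proteins.
input.ids <- c("P1", "P2", "P3", "P4", "P6", "X1", "X2")

# Fraction of marker proteins found in the input, overall and per compartment.
covered <- toy.markers$Proteins %in% input.ids
overall <- mean(covered)
coverage <- tapply(covered, toy.markers$Compartments, mean)
print(round(coverage, 2))

# Compartments below the recommended 20% coverage.
low <- names(coverage)[coverage < 0.2]
cat("Overall coverage:", round(overall, 2), "\n")
cat("Compartments below 20% coverage:", low, "\n")
```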
To avoid reduced classification accuracy, marker proteins with noisy quantification and marker proteins that are not representative of their associated compartment (e.g. due to cell-type-specific localization) are filtered out by a two-step quality control.
First, marker proteins with a Pearson correlation below 0.8 between the A and B duplicates for each cell line are filtered out (Figure A).
Second, pairwise correlations between the 5CL marker profile and the input data are calculated for each protein (A and B replicate experiments separately) using both Pearson and Spearman correlation. The lowest value for each method is then used for filtering, with cut-offs of 0.8 (Pearson) and 0.6 (Spearman), to exclude non-representative marker proteins (Figure B).
r.markers <- markerQualityControl(coveredProteins = c.prots, protein.data = df)
## Number of removed replicate-wise proteins: 0
## Number of removed sample-wise proteins: 1
## Number of total removed marker proteins: 1
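The two filtering steps can be sketched with base R's cor function. The profiles below are made up, `rowCor` is a hypothetical helper, and the code only mirrors the description above; it is not the implementation of markerQualityControl.

```r
# Toy log2 profiles over five fractions for three hypothetical markers.
marker.profile <- rbind(
  good    = c( 2.0,  0.5, -1.0, -1.1, -1.2),
  noisy   = c(-1.0, -1.1,  0.9,  0.5,  0.2),
  shifted = c( 2.0,  0.1, -1.0, -1.1, -1.3)
)
rep.A <- rbind(
  good    = c( 2.1,  0.6, -0.9, -1.1, -1.2),
  noisy   = c(-1.0, -1.2,  1.0,  0.4,  0.1),
  shifted = c(-1.0, -0.9,  1.2,  0.3,  0.2)
)
rep.B <- rbind(
  good    = c( 1.9,  0.4, -1.0, -0.9, -1.3),
  noisy   = c( 0.8,  0.9, -1.0, -0.2,  0.3),
  shifted = c(-1.1, -1.0,  1.1,  0.4,  0.1)
)

# Hypothetical helper: correlation between matching rows of two matrices.
rowCor <- function(x, y, method) {
  out <- vapply(seq_len(nrow(x)),
                function(i) cor(x[i, ], y[i, ], method = method),
                numeric(1))
  names(out) <- rownames(x)
  out
}

# Step 1: replicate-wise filter, Pearson correlation between A and B.
rep.cor <- rowCor(rep.A, rep.B, "pearson")
pass.step1 <- rep.cor >= 0.8

# Step 2: sample-wise filter, lowest correlation of A and B against the
# marker profile for each method, with cut-offs 0.8 and 0.6.
pearson.min <- pmin(rowCor(rep.A, marker.profile, "pearson"),
                    rowCor(rep.B, marker.profile, "pearson"))
spearman.min <- pmin(rowCor(rep.A, marker.profile, "spearman"),
                     rowCor(rep.B, marker.profile, "spearman"))
pass.step2 <- pearson.min >= 0.8 & spearman.min >= 0.6

keep <- pass.step1 & pass.step2
print(data.frame(pass.step1, pass.step2, keep))
```

In this toy example the "noisy" marker fails because its replicates disagree, while the "shifted" marker has consistent replicates that do not follow the marker profile, so it is removed in the second step.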