library(PRONE)
PRONE also offers additional functionality for evaluating normalization techniques on spike-in data sets with a known ground truth. This functionality is described below. In addition, everything described for real-world data sets also applies to the SummarizedExperiment object of a spike-in data set.
The example spike-in data set is taken from Cox et al. (2014). To illustrate the package functionality, a smaller subset of the data consisting of 1500 proteins is used. For the complete data set, please refer to the original publication.
data_path <- readPRONE_example("Ecoli_human_MaxLFQ_protein_intensities.csv")
md_path <- readPRONE_example("Ecoli_human_MaxLFQ_metadata.csv")
data <- read.csv(data_path)
md <- read.csv(md_path)
Before loading the data into a SummarizedExperiment object, it is important to check the organism of the protein groups. Proteins should be uniquely assigned to either the spike-in organism or the background organism. If some protein groups are mixed, they should be removed from the data since these can’t be used for classification into true positives, false positives, etc.
In our example, we can extract the information from the “Fasta.headers” column:
# Check if some protein groups are mixed
mixed <- grepl("Homo sapiens.*Escherichia|Escherichia.*Homo sapiens",data$Fasta.headers)
data <- data[!mixed,]
table(mixed)
#> mixed
#> FALSE
#> 1000
In this case, every protein group was assigned to either the spike-in or the background organism, so a SummarizedExperiment container of the data can be created. Before doing so, however, we need to add a column holding the actual organism, which is later used to calculate true positives, false positives, etc.
data$Spiked <- rep("HUMAN", nrow(data))
data$Spiked[grepl("ECOLI", data$Fasta.headers)] <- "ECOLI"
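The two preparation steps above (removing mixed protein groups, then labeling the organism) can be tried on a small toy example. This is only a sketch in base R: the data frame `toy` and its FASTA headers are invented for illustration and merely mimic the UniProt header style; they are not taken from the example data set.

```r
# Toy data: invented FASTA headers, not taken from the example data set
toy <- data.frame(
  Protein.IDs = c("P1", "P2", "P3;P4"),
  Fasta.headers = c(
    "sp|P1|A_HUMAN Protein A OS=Homo sapiens",
    "sp|P2|B_ECOLI Protein B OS=Escherichia coli",
    "sp|P3|C_HUMAN Protein C OS=Homo sapiens;sp|P4|D_ECOLI Protein D OS=Escherichia coli"
  ),
  stringsAsFactors = FALSE
)

# A group is mixed if both organism names occur in its header, in either order
toy_mixed <- grepl("Homo sapiens.*Escherichia|Escherichia.*Homo sapiens",
                   toy$Fasta.headers)

# Keep only protein groups assigned to a single organism
toy_clean <- toy[!toy_mixed, ]

# Label the organism: background by default, spike-in where "ECOLI" occurs
toy_clean$Spiked <- ifelse(grepl("ECOLI", toy_clean$Fasta.headers),
                           "ECOLI", "HUMAN")
table(toy_clean$Spiked)
```

Running `table()` on the new column of the real data is a quick way to confirm that both organisms are present before the data are loaded.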
In contrast to the real-world data sets, spike-in data sets are loaded with the load_spike_data() function, which additionally requires the “spike_column”, “spike_value”, and “spike_concentration” arguments.
Here, the “spike_column” denotes the column in the protein intensity data file that indicates whether a protein is classified as a spike-in or a background protein. The “spike_value” is the value/identifier that marks the spike-in proteins in the “spike_column”, and the “spike_concentration” identifies the column containing the concentration values of the spike-in proteins (in this case the different conditions that will be tested in DE analysis).
se <- load_spike_data(data,
                      md,
                      spike_column = "Spiked",
                      spike_value = "ECOLI",
                      spike_concentration = "Concentration",
                      protein_column = "Protein.IDs",
                      gene_column = "Gene.names",
                      ref_samples = NULL,
                      batch_column = NULL,
                      condition_column = "Condition",
                      label_column = "Label")
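Mismatched column names are an easy mistake when setting the spike-in arguments, so a quick check of the inputs can be useful. The sketch below uses base R only. `check_spike_inputs()` is a hypothetical helper that is not part of PRONE, the toy data frames are invented, and the check assumes that the concentration column lives in the metadata table.

```r
# Hypothetical helper (not part of PRONE): validates the spike-in arguments
check_spike_inputs <- function(data, md, spike_column, spike_value,
                               spike_concentration) {
  stopifnot(
    spike_column %in% colnames(data),       # column marking spike-in proteins
    spike_value %in% data[[spike_column]],  # at least one spike-in protein
    spike_concentration %in% colnames(md)   # assumed to be a metadata column
  )
  invisible(TRUE)
}

# Invented toy inputs, only to show how the helper is called
toy_data <- data.frame(Protein.IDs = c("P1", "P2"),
                       Spiked = c("HUMAN", "ECOLI"))
toy_md <- data.frame(Column = c("S1", "S2"),
                     Concentration = c(1, 3))

check_spike_inputs(toy_data, toy_md,
                   spike_column = "Spiked",
                   spike_value = "ECOLI",
                   spike_concentration = "Concentration")
```

The helper stops with an error as soon as one condition fails, which points at the offending argument before any data are loaded.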
To get an overview of the number of identified (non-NA) spike-in proteins per sample, you can use the plot_identified_spiked_proteins() function.
plot_identified_spiked_proteins(se)
#> Condition of SummarizedExperiment used!
#> Label of SummarizedExperiment used!
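The quantity summarized in this plot, the number of non-NA spike-in proteins per sample, can also be computed by hand. The following base R sketch uses an invented toy intensity matrix and an invented organism vector; it illustrates the idea and is not how PRONE computes the plot internally.

```r
# Toy intensity matrix: rows are proteins, columns are samples (invented values)
toy_int <- matrix(
  c(10, NA, 12,
    NA, NA, 11,
     9,  8, NA,
    14, 13, 15),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("E1", "E2", "H1", "H2"), c("S1", "S2", "S3"))
)

# Organism label per protein (row), analogous to the "Spiked" column
toy_spiked <- c("ECOLI", "ECOLI", "HUMAN", "HUMAN")

# Number of identified (non-NA) spike-in proteins per sample
n_identified <- colSums(!is.na(toy_int[toy_spiked == "ECOLI", , drop = FALSE]))
n_identified
```

Samples with noticeably fewer identified spike-in proteins than the others are worth a closer look before normalization methods are compared.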
To compare the distributions of spike-in and human proteins in the different sample groups (here high-low), use the plot_histogram_spiked() function. Setting “condition = NULL” means that the condition specified when loading the data is used, but you can also specify any other column of the metadata.
plot_histogram_spiked(se, condition = NULL)
#> Condition of SummarizedExperiment used!
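The comparison behind such a histogram can be sketched in base R by binning the intensities of each organism with the same breaks. The log2 intensity values and the break points below are invented for illustration.

```r
# Invented log2 intensities for spike-in and background proteins
ecoli_int <- c(20, 21, 21, 22)
human_int <- c(24, 25, 25, 26)

# Shared breaks so that both histograms are directly comparable
breaks <- seq(19, 27, by = 2)

# plot = FALSE returns the binned counts without drawing anything
ecoli_counts <- hist(ecoli_int, breaks = breaks, plot = FALSE)$counts
human_counts <- hist(human_int, breaks = breaks, plot = FALSE)$counts

# One row per organism, one column per intensity bin
rbind(ECOLI = ecoli_counts, HUMAN = human_counts)
```

In this toy example the two organisms occupy different bins, so their distributions do not overlap.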
If you want to look at the amount of spike-in that was actually measured, you can use the plot_profiles_spiked() function. It also lets you check whether the intensities of the background proteins, here HUMAN proteins, are constant across the different spike-in concentrations.
plot_profiles_spiked(se, xlab = "Concentration")
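The expectation behind the profile plot is that spike-in intensities rise with concentration while background intensities stay flat. A minimal base R sketch of that check follows. The long-format table `toy_long`, its concentrations, and its log2 intensities are invented for illustration.

```r
# Toy long-format data with invented log2 intensities
toy_long <- data.frame(
  Spiked        = rep(c("ECOLI", "HUMAN"), each = 4),
  Concentration = rep(c(1, 1, 3, 3), times = 2),
  Intensity     = c(20, 21, 22, 23,   # spike-in: rises with concentration
                    25, 25, 25, 25)   # background: constant
)

# Mean intensity per organism and concentration
profile <- aggregate(Intensity ~ Spiked + Concentration,
                     data = toy_long, FUN = mean)

# Spread of the mean intensity across concentrations, per organism
shift <- tapply(profile$Intensity, profile$Spiked,
                function(x) diff(range(x)))
shift
```

A shift close to zero for the background organism and a clearly positive shift for the spike-in organism match the behavior expected of a spike-in design.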