CPSM is an R package that provides a computational pipeline for predicting the survival probability of cancer patients. It encompasses several key steps, including data processing, splitting data into training and test subsets, data normalization, selecting significant features based on univariate survival analysis, generating LASSO PI scores, and developing predictive models for survival probability. Additionally, CPSM visualizes results through survival curves based on predicted probabilities and bar plots depicting the predicted mean and median survival times of patients.
To install this package, start R (version “4.4”) and enter the code provided:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("CPSM")
The example input data object, `Example_TCGA_LGG_FPKM_data`, contains data for 184 LGG cancer samples as rows and various features as columns. Gene expression data is represented in FPKM values. The dataset includes 11 clinical and demographic features, 4 types of survival data (with both time and event information), and 19,978 protein-coding genes. The clinical and demographic features in the dataset include `Age`, `subtype`, `gender`, `race`, `ajcc_pathologic_tumor_stage`, `histological_type`, `histological_grade`, `treatment_outcome_first_course`, `radiation_treatment_adjuvant`, `sample_type`, and `type`. The four types of survival data included are Overall Survival (OS), Progression-Free Survival (PFS), Disease-Specific Survival (DSS), and Disease-Free Survival (DFS). In the dataset, the columns labeled `OS`, `PFS`, `DSS`, and `DFS` represent event occurrences, while the columns `OS.time`, `PFS.time`, `DSS.time`, and `DFS.time` provide survival times (in days).
library(CPSM)
library(SummarizedExperiment)
set.seed(7) # set seed
data(Example_TCGA_LGG_FPKM_data, package = "CPSM")
Example_TCGA_LGG_FPKM_data
#> class: SummarizedExperiment
#> dim: 2005 184
#> metadata(0):
#> assays(1): expression
#> rownames(2005): A1BG A1CF ... BAZ1B BAZ2A
#> rowData names(1): gene
#> colnames(184): TCGA-TM-A7CA-01 TCGA-DU-A6S3-01 ... TCGA-E1-A7YM-01
#> TCGA-DH-5143-01
#> colData names(20): Age subtype ... PFI.time sample
The `data_process_f` function converts OS time (in days) into months and removes samples where OS/OS.time information is missing.

## Required inputs

To use this function, the input data should be provided in TSV format. Additionally, you need to define `col_num` (the column number at which clinical, demographic, and survival information ends, e.g., 20), `surv_time` (the name of the column that contains survival time information, e.g., `OS.time`), and `output` (the desired name for the output, e.g., "New_data").
data(Example_TCGA_LGG_FPKM_data, package = "CPSM")
combined_df <- cbind(
  as.data.frame(colData(Example_TCGA_LGG_FPKM_data))
  [, -ncol(colData(Example_TCGA_LGG_FPKM_data))],
  t(as.data.frame(assay(
    Example_TCGA_LGG_FPKM_data,
    "expression"
  )))
)
New_data <- data_process_f(combined_df, col_num = 20, surv_time = "OS.time")
str(New_data[1:10])
#> 'data.frame': 176 obs. of 10 variables:
#> $ Age : num 44.9 60.3 57.9 45.7 70.7 ...
#> $ subtype : chr "PN" "PN" NA "PN" ...
#> $ gender : chr "Male" "Male" "Female" "Male" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Astrocytoma" "Oligodendroglioma" "Astrocytoma" "Oligodendroglioma" ...
#> $ histological_grade : chr "G2" "G2" "G3" "G3" ...
#> $ treatment_outcome_first_course: chr "Complete Remission/Response" "Stable Disease" NA NA ...
#> $ radiation_treatment_adjuvant : chr "NO" "NO" NA "YES" ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
After data processing, the output object `New_data` is generated, which contains 176 samples. This indicates that the function has removed 8 samples where OS/OS.time information was missing. Moreover, a new 21st column, `OS_month`, is added to the data, containing OS time values in months.
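The two operations attributed to `data_process_f` (dropping samples with missing survival information and converting days to months) can be sketched in base R on a toy table. This is only an illustration, not the package's implementation: the toy values are made up, and the 30.44 days-per-month factor and the rounding are assumptions, since the conversion used internally is not stated here.

```r
# Toy clinical table (made-up values) with one missing event and one missing time
toy <- data.frame(
  OS = c(1, 0, NA, 1),
  OS.time = c(365, 730, 100, NA)
)

# Keep only samples where both the event and the time are present
keep <- !is.na(toy$OS) & !is.na(toy$OS.time)
toy_clean <- toy[keep, , drop = FALSE]

# Convert days to months; 30.44 days per month is an assumed factor
toy_clean$OS_month <- round(toy_clean$OS.time / 30.44)
toy_clean
```

Two of the four toy samples survive the filter, mirroring how 184 input samples become 176 in `New_data`.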
Before proceeding further, we need to split the data into training and test subsets for feature selection and model development.

## Required inputs

The output from the previous step, `New_data`, serves as the input for this process. Next, you need to define the fraction (e.g., 0.9) by which to split the data into training and test sets. For example, setting `fraction = 0.9` will divide the data into 90% for training and 10% for testing. Additionally, you should specify names for the training and test outputs (e.g., `train_FPKM` and `test_FPKM`).
data(New_data, package = "CPSM")
# Call the function
result <- tr_test_f(data = New_data, fraction = 0.9)
# Access the train and test data
train_FPKM <- result$train_data
str(train_FPKM[1:10])
#> 'data.frame': 158 obs. of 10 variables:
#> $ Age : num 53.1 54.3 38.1 25.8 46.2 ...
#> $ subtype : chr "PN" "ME" "NE" "NE" ...
#> $ gender : chr "Female" "Male" "Female" "Female" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "NOT AVAILABLE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Oligodendroglioma" "Oligodendroglioma" "Oligoastrocytoma" "Astrocytoma" ...
#> $ histological_grade : chr "G3" "G3" "G3" "G2" ...
#> $ treatment_outcome_first_course: chr NA "Progressive Disease" NA "Complete Remission/Response" ...
#> $ radiation_treatment_adjuvant : chr "YES" NA "NO" "NO" ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
test_FPKM <- result$test_data
str(test_FPKM[1:10])
#> 'data.frame': 18 obs. of 10 variables:
#> $ Age : num 70.7 34.6 32.4 61 34.4 ...
#> $ subtype : chr "PN" NA "PN" "PN" ...
#> $ gender : chr "Male" "Female" "Male" "Male" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Oligodendroglioma" "Astrocytoma" "Oligodendroglioma" "Oligodendroglioma" ...
#> $ histological_grade : chr "G3" "G2" "G2" "G2" ...
#> $ treatment_outcome_first_course: chr "Stable Disease" "Stable Disease" "Partial Remission/Response" "Complete Remission/Response" ...
#> $ radiation_treatment_adjuvant : chr NA NA "YES" "YES" ...
#> $ sample_type : chr "Recurrent" "Primary" "Primary" "Primary" ...
After the train-test split, two new output objects are generated: `train_FPKM` and `test_FPKM`. The `train_FPKM` object contains 158 samples, while `test_FPKM` contains 18 samples. This indicates that the `tr_test_f` function splits the data in a 90:10 ratio.
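A random 90:10 split of 176 samples can be sketched in base R as follows. This is a minimal illustration under assumptions: `tr_test_f` may sample differently, and rounding the training size down with `floor()` is a choice made here because it reproduces the 158/18 sizes reported above.

```r
set.seed(7)

n <- 176          # number of samples after data processing
fraction <- 0.9   # share of samples assigned to the training set

# Draw training indices at random; the remaining indices form the test set
train_idx <- sample(seq_len(n), size = floor(fraction * n))
test_idx <- setdiff(seq_len(n), train_idx)

length(train_idx)
length(test_idx)
```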
In order to select features and develop ML models, the data must be normalized. Since the expression data is available in terms of FPKM values, the `train_test_normalization_f` function will first convert the FPKM values into a log scale using the formula [log2(FPKM+1)], followed by quantile normalization. The training data will be used as the target matrix for the quantile normalization process.

## Required inputs

For this function, you need to provide the training and test datasets obtained from the previous step (Train/Test Split). Additionally, you must specify the column number where clinical information ends (e.g., 21) in the input datasets. Finally, you need to define output names for the resulting datasets: `train_clin_data` (which contains only clinical information from the training data), `test_clin_data` (which contains only clinical information from the test data), `train_Normalized_data_clin_data` (which contains both clinical information and normalized gene expression values for the training samples), and `test_Normalized_data_clin_data` (which contains both clinical information and normalized gene expression values for the test samples).
# Step 3 - Data Normalization
# Normalize the training and test data sets
data(train_FPKM, package = "CPSM")
data(test_FPKM, package = "CPSM")
Result_N_data <- train_test_normalization_f(
  train_data = train_FPKM,
  test_data = test_FPKM,
  col_num = 21
)
# Access the Normalized train and test data
Train_Clin <- Result_N_data$Train_Clin
Test_Clin <- Result_N_data$Test_Clin
Train_Norm_data <- Result_N_data$Train_Norm_data
Test_Norm_data <- Result_N_data$Test_Norm_data
str(Train_Clin[1:10])
#> 'data.frame': 158 obs. of 10 variables:
#> $ Age : num 30.8 36.5 38.9 65.1 32.3 ...
#> $ subtype : chr "PN" NA "NE" "CL" ...
#> $ gender : chr "Male" "Male" "Male" "Female" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Oligodendroglioma" "Oligoastrocytoma" "Oligoastrocytoma" "Astrocytoma" ...
#> $ histological_grade : chr "G3" "G2" "G2" "G3" ...
#> $ treatment_outcome_first_course: chr NA "Complete Remission/Response" NA "Partial Remission/Response" ...
#> $ radiation_treatment_adjuvant : chr "YES" NA "NO" "YES" ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
str(Train_Norm_data[1:10])
#> 'data.frame': 158 obs. of 10 variables:
#> $ Age : num 30.8 36.5 38.9 65.1 32.3 ...
#> $ subtype : chr "PN" NA "NE" "CL" ...
#> $ gender : chr "Male" "Male" "Male" "Female" ...
#> $ race : chr "WHITE" "WHITE" "WHITE" "WHITE" ...
#> $ ajcc_pathologic_tumor_stage : logi NA NA NA NA NA NA ...
#> $ histological_type : chr "Oligodendroglioma" "Oligoastrocytoma" "Oligoastrocytoma" "Astrocytoma" ...
#> $ histological_grade : chr "G3" "G2" "G2" "G3" ...
#> $ treatment_outcome_first_course: chr NA "Complete Remission/Response" NA "Partial Remission/Response" ...
#> $ radiation_treatment_adjuvant : chr "YES" NA "NO" "YES" ...
#> $ sample_type : chr "Primary" "Primary" "Primary" "Primary" ...
After running the function, four output objects are generated: `Train_Clin` (which contains only clinical features from the training data), `Test_Clin` (which contains only clinical features from the test data), `Train_Norm_data` (which includes clinical features and normalized gene expression values for the training samples), and `Test_Norm_data` (which includes clinical features and normalized gene expression values for the test samples).
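The idea of a log2(FPKM+1) transform followed by quantile normalization against a training-derived target can be sketched in base R. This is a hedged illustration, not the code inside `train_test_normalization_f`: the toy matrices are made up, the helper `to_target()` is hypothetical, and building the target as the rank-wise mean of the sorted training columns is an assumption about how "training data as target" is realized.

```r
# Toy FPKM matrices (made-up values): genes in rows, samples in columns
train_fpkm <- matrix(
  c(0, 3, 7, 1, 15, 3, 7, 0, 1),
  nrow = 3,
  dimnames = list(c("g1", "g2", "g3"), c("s1", "s2", "s3"))
)
test_fpkm <- matrix(
  c(1, 0, 31),
  nrow = 3,
  dimnames = list(c("g1", "g2", "g3"), "t1")
)

# Log transform, as described in the text
train_log <- log2(train_fpkm + 1)
test_log <- log2(test_fpkm + 1)

# Target distribution: mean of the sorted training columns, rank by rank
target <- unname(rowMeans(apply(train_log, 2, sort)))

# Hypothetical helper: replace each value by the target value of the same rank
to_target <- function(m, target) {
  out <- apply(m, 2, function(x) target[rank(x, ties.method = "first")])
  dimnames(out) <- dimnames(m)
  out
}

train_norm <- to_target(train_log, target)
test_norm <- to_target(test_log, target)
```

Because the target is computed from the training samples only, the test samples are mapped onto the training distribution without contributing to it.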
To create a survival model, the next step is to calculate the Prognostic Index (PI) score. The PI score is based on the expression levels of features selected by the LASSO regression model and their corresponding beta coefficients. For example, suppose five features (G1, G2, G3, G4, G5) are selected by the LASSO method, and their associated coefficients are B1, B2, B3, B4, and B5, respectively. The PI score is then computed using the following formula:
PI score = G1 * B1 + G2 * B2 + G3 * B3 + G4 * B4 + G5 * B5
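The formula above is a weighted sum, which in R reduces to a matrix product. The sketch below uses made-up expression values and coefficients for three hypothetical genes; it illustrates the arithmetic only and is not taken from the package.

```r
# Toy expression matrix (made-up values): patients in rows, selected genes in columns
expr <- matrix(
  c(1, 2, 0.5,
    0, 3, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("P1", "P2"), c("G1", "G2", "G3"))
)

# Hypothetical beta coefficients for the selected genes
beta <- c(G1 = 0.4, G2 = -0.2, G3 = 1.0)

# PI score = sum over genes of expression * coefficient
PI <- as.vector(expr[, names(beta)] %*% beta)
names(PI) <- rownames(expr)
PI
```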
For this function, you need to provide the normalized training data object (`Train_Norm_data`) and test data object (`Test_Norm_data`) obtained from the previous step (`train_test_normalization_f`). Additionally, you must specify the column number (`col_num`) where clinical features end (e.g., 21), the number of folds (`nfolds`) for the LASSO regression method (e.g., 5), and the survival time (`surv_time`) and survival event (`surv_event`) columns in the data (e.g., `OS_month` and `OS`, respectively). The LASSO regression is implemented using the `glmnet` package. Finally, you need to define the name of the output object that will store the results, which include the selected LASSO features and their corresponding PI values.
# Step 4 - Lasso PI Score
data(Train_Norm_data, package = "CPSM")
data(Test_Norm_data, package = "CPSM")
Result_PI <- Lasso_PI_scores_f(
  train_data = Train_Norm_data,
  test_data = Test_Norm_data,
  nfolds = 5,
  col_num = 21,
  surv_time = "OS_month",
  surv_event = "OS"
)
Train_Lasso_key_variables <- Result_PI$Train_Lasso_key_variables
Train_PI_data <- Result_PI$Train_PI_data
Test_PI_data <- Result_PI$Test_PI_data
str(Train_PI_data[1:10])
#> 'data.frame': 158 obs. of 10 variables:
#> $ OS : int 1 1 0 1 0 0 0 0 0 0 ...
#> $ OS_month : int 78 27 48 4 56 39 34 15 43 44 ...
#> $ AADACL4 : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ ABCA12 : num 0.212 0.297 0.053 0.358 0.064 0.032 0.09 0.051 0.1 0.014 ...
#> $ ABCC3 : num 0.1 5.476 0.029 0.108 0.037 ...
#> $ ABI1 : num 44.7 15.3 43.7 27.7 31.6 ...
#> $ ABRA : num 0 0.035 0 0 0.015 0.034 0.006 0 0.009 0.005 ...
#> $ AC006059.2: num 0 0 0 0 0 0 0 0 0 0 ...
#> $ AC008676.3: num 0.032 0.034 0.032 0 0.014 0.04 0.011 0.01 0.017 0.042 ...
#> $ AC008764.4: num 0.038 0 0 0 0 0 0.008 0 0 0.015 ...
str(Test_PI_data[1:10])
#> 'data.frame': 18 obs. of 10 variables:
#> $ OS : int 1 1 0 0 0 0 0 0 0 1 ...
#> $ OS_month : int 32 27 2 46 18 26 17 28 64 114 ...
#> $ AADACL4 : num 0.011 0 0.016 0 0 0 0 0 0 0 ...
#> $ ABCA12 : num 0.048 0.165 0.044 0.033 0.053 0.02 0.09 0.038 0.058 0.096 ...
#> $ ABCC3 : num 0.679 1.358 0.184 0.321 0.024 ...
#> $ ABI1 : num 39.4 19.9 31 27.8 56.4 ...
#> $ ABRA : num 0 0.001 0.029 0 0 0 0.026 0.004 0 0.014 ...
#> $ AC006059.2: int 0 0 0 0 0 0 0 0 0 0 ...
#> $ AC008676.3: num 0.002 0.054 0.012 0.092 0.031 0.003 0.004 0.078 0.046 0 ...
#> $ AC008764.4: num 0.036 0.009 0 0 0.015 0 0 0 0 0 ...
plot(Result_PI$cvfit)
## Outputs

The `Lasso_PI_scores_f` function generates the following output objects:

1. `Train_Lasso_key_variables`: A list of features selected by LASSO along with their beta coefficient values.
2. `Train_Cox_Lasso_Regression_lambda_plot`: The LASSO regression lambda plot.
3. `Train_PI_data`: This dataset contains the expression values of genes selected by LASSO along with the PI score in the last column for the training samples.
4. `Test_PI_data`: This dataset contains the expression values of genes selected by LASSO along with the PI score in the last column for the test samples.
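For orientation, a cross-validated Cox LASSO fit with `glmnet` and the resulting PI score can be sketched on simulated data. This is an assumption-laden illustration rather than the internals of `Lasso_PI_scores_f`: the data are simulated, and choosing `lambda.min` as the penalty is an assumption, since the penalty used by the package is not stated here.

```r
library(glmnet)

set.seed(7)
n <- 60
p <- 10

# Simulated expression matrix: samples in rows, genes in columns
x <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("G", seq_len(p))))

# Simulated survival times (gene G1 raises the hazard) and event indicators
time <- rexp(n, rate = exp(0.8 * x[, 1]) / 20)
status <- rbinom(n, 1, 0.7)
y <- cbind(time = time, status = status)

# 5-fold cross-validated LASSO for the Cox model
cvfit <- cv.glmnet(x, y, family = "cox", nfolds = 5)

# Non-zero coefficients at the assumed penalty (lambda.min)
beta <- as.matrix(coef(cvfit, s = "lambda.min"))[, 1]
selected <- beta[beta != 0]

# PI score: expression of the selected genes weighted by their coefficients
PI <- as.vector(x[, names(selected), drop = FALSE] %*% selected)
```

Calling `plot(cvfit)` on such an object draws the cross-validation curve over lambda, which is what `plot(Result_PI$cvfit)` shows for the real data.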
In addition to the Prognostic Index (PI) score, the `Univariate_sig_features_f` function in the CPSM package allows for the selection of significant features based on univariate Cox regression survival analysis. This function identifies features with a p-value less than 0.05, which are able to stratify high-risk and low-risk survival groups. The stratification is done by using the median expression value of each feature as a cutoff.

## Required inputs

To use this function, you need to provide the normalized training (`Train_Norm_data`) and test (`Test_Norm_data`) dataset objects, which were obtained from the previous step (`train_test_normalization_f`). Additionally, you must specify the column number (`col_num`) where the clinical features end (e.g., 21), as well as the names of the columns containing survival time (`surv_time`, e.g., `OS_month`) and survival event information (`surv_event`, e.g., `OS`). Furthermore, you need to define output names for the resulting datasets that will contain the expression values of the selected genes. These outputs will be used to store the significant genes identified through univariate survival analysis.
# Step 4b - Univariate Survival Significant Feature Selection.
data(Train_Norm_data, package = "CPSM")
data(Test_Norm_data, package = "CPSM")
Result_Uni <- Univariate_sig_features_f(
  train_data = Train_Norm_data,
  test_data = Test_Norm_data,
  col_num = 21,
  surv_time = "OS_month",
  surv_event = "OS"
)
Univariate_Suv_Sig_G_L <- Result_Uni$Univariate_Survival_Significant_genes_List
Train_Uni_sig_data <- Result_Uni$Train_Uni_sig_data
Test_Uni_sig_data <- Result_Uni$Test_Uni_sig_data
Uni_Sur_Sig_clin_List <- Result_Uni$Univariate_Survival_Significant_clin_List
Train_Uni_sig_clin_data <- Result_Uni$Train_Uni_sig_clin_data
Test_Uni_sig_clin_data <- Result_Uni$Test_Uni_sig_clin_data
str(Univariate_Suv_Sig_G_L[1:10])
#> chr [1:10] "A2ML1" "AADACL4" "AAMDC" "AAR2" "ABCA12" "ABCB4" "ABCB5" ...
The `Univariate_sig_features_f` function generates the following output objects:

1. `Univariate_Suv_Sig_G_L`: A table of univariate significant genes, along with their corresponding coefficient values, hazard ratio (HR) values, p-values, and C-Index values.
2. `Train_Uni_sig_data`: This dataset contains the expression values of the significant genes selected by univariate survival analysis for the training samples.
3. `Test_Uni_sig_data`: This dataset contains the expression values of the significant genes selected by univariate survival analysis for the test samples.
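The per-feature test described above (split samples at the median expression, then fit a Cox model on the resulting two groups) can be sketched with the `survival` package. The data below are simulated and the variable names are illustrative; this is not the code used by `Univariate_sig_features_f`.

```r
library(survival)

set.seed(7)
n <- 80

# Simulated expression of one gene and survival data (higher expression, higher hazard)
expr <- rexp(n)
time <- rexp(n, rate = ifelse(expr > median(expr), 0.10, 0.03))
event <- rbinom(n, 1, 0.8)

# Stratify samples at the median expression value: 1 = high, 0 = low
group <- ifelse(expr > median(expr), 1, 0)

# Univariate Cox regression on the two groups
fit <- coxph(Surv(time, event) ~ group)
s <- summary(fit)

# Collect coefficient, hazard ratio, p-value, and concordance index
res <- c(
  beta = s$coefficients[1, "coef"],
  HR = s$coefficients[1, "exp(coef)"],
  p_value = s$coefficients[1, "Pr(>|z|)"],
  C_index = unname(s$concordance["C"])
)
res
```

Repeating this for every gene and keeping those with `p_value < 0.05` gives a list of the kind the function is described as returning.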
After selecting significant features using LASSO or univariate survival analysis, the next step is to develop a machine learning (ML) prediction model to estimate the survival probability of patients. The `MTLR_pred_model_f` function in the CPSM package provides several options for building prediction models based on different feature sets. These options include:

- `Model_type = 1`: Model based on only clinical features
- `Model_type = 2`: Model based on PI score
- `Model_type = 3`: Model based on PI score + clinical features
- `Model_type = 4`: Model based on significant univariate features
- `Model_type = 5`: Model based on significant univariate features + clinical features
For this analysis, we are interested in developing a model based on the PI score (i.e., `Model_type = 2`).

## Required inputs

To use this function, the following inputs are required:

1. Training data with only clinical features
2. Test data with only clinical features
3. Model type (e.g., 2 for a model based on PI score)
4. Training data with PI score
5. Test data with PI score
6. `Clin_Feature_List` (e.g., `Key_PI_list`), a list of features to be used for building the model
7. `surv_time`: The name of the column containing survival time in months (e.g., `OS_month`)
8. `surv_event`: The name of the column containing survival event information (e.g., `OS`)

These inputs will allow the `MTLR_pred_model_f` function to generate a prediction model for the survival probability of patients based on the provided data.
data(Train_Clin, package = "CPSM")
data(Test_Clin, package = "CPSM")
data(Key_Clin_feature_list, package = "CPSM")
Result_Model_Type1 <- MTLR_pred_model_f(
  train_clin_data = Train_Clin,
  test_clin_data = Test_Clin,
  Model_type = 1,
  train_features_data = Train_Clin,
  test_features_data = Test_Clin,
  Clin_Feature_List = Key_Clin_feature_list,
  surv_time = "OS_month",
  surv_event = "OS"
)
survCurves_data <- Result_Model_Type1$survCurves_data
mean_median_survival_tim_d <- Result_Model_Type1$mean_median_survival_time_data
survival_result_bas_on_MTLR <- Result_Model_Type1$survival_result_based_on_MTLR
Error_mat_for_Model <- Result_Model_Type1$Error_mat_for_Model
str(survCurves_data)
#> 'data.frame': 13 obs. of 14 variables:
#> $ time_point : num 0 8.31 14 19 24 ...
#> $ TCGA-S9-A7IX-01: num 1 0.956 0.933 0.848 0.802 ...
#> $ TCGA-HT-8010-01: num 1 0.98 0.971 0.938 0.919 ...
#> $ TCGA-DU-5847-01: num 1 0.98 0.968 0.921 0.892 ...
#> $ TCGA-TQ-A7RQ-01: num 1 0.994 0.991 0.978 0.971 ...
#> $ TCGA-HT-7606-01: num 1 0.992 0.987 0.97 0.959 ...
#> $ TCGA-S9-A7QY-01: num 1 0.99 0.985 0.964 0.951 ...
#> $ TCGA-DH-5142-01: num 1 0.987 0.98 0.95 0.932 ...
#> $ TCGA-DU-6408-01: num 1 0.993 0.989 0.974 0.964 ...
#> $ TCGA-FG-8191-01: num 1 0.99 0.984 0.963 0.949 ...
#> $ TCGA-DU-6542-01: num 1 0.986 0.978 0.943 0.922 ...
#> $ TCGA-DB-A4XF-01: num 1 0.98 0.97 0.928 0.904 ...
#> $ TCGA-S9-A6WN-01: num 1 0.977 0.963 0.911 0.878 ...
#> $ TCGA-P5-A5EX-01: num 1 0.981 0.97 0.928 0.901 ...
str(mean_median_survival_tim_d)
#> 'data.frame': 13 obs. of 4 variables:
#> $ ID : chr "TCGA-S9-A7IX-01" "TCGA-HT-8010-01" "TCGA-DU-5847-01" "TCGA-TQ-A7RQ-01" ...
#> $ Mean : num 56.2 83.3 70.6 93.3 85.1 ...
#> $ Median : num 48.4 84.2 60.3 87.2 84.4 ...
#> $ OS_month: num 27 2 18 26 17 28 64 114 33 2 ...
str(survival_result_bas_on_MTLR)
#> num [1:13, 1:5] 56.2 83.3 70.6 93.3 85.1 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:13] "TCGA-S9-A7IX-01" "TCGA-HT-8010-01" "TCGA-DU-5847-01" "TCGA-TQ-A7RQ-01" ...
#> ..$ : chr [1:5] "Mean_Survival" "Median_Survival" "Event_Probability" "Actual_OS_Time" ...
str(Error_mat_for_Model)
#> num [1:2, 1:3] 0.78 1 42.06 50.09 38.66 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:2] "Training_set" "Test_set"
#> ..$ : chr [1:3] "C_index" "Mean_MAE" "Median_MAE"
data(Train_Clin, package = "CPSM")
data(Test_Clin, package = "CPSM")
data(Train_PI_data, package = "CPSM")
data(Test_PI_data, package = "CPSM")
data(Key_PI_list, package = "CPSM")
Result_Model_Type2 <- MTLR_pred_model_f(
  train_clin_data = Train_Clin,
  test_clin_data = Test_Clin,
  Model_type = 2,
  train_features_data = Train_PI_data,
  test_features_data = Test_PI_data,
  Clin_Feature_List = Key_PI_list,
  surv_time = "OS_month",
  surv_event = "OS"
)
survCurves_data <- Result_Model_Type2$survCurves_data
mean_median_surviv_tim_da <- Result_Model_Type2$mean_median_survival_time_data
survival_result_b_on_MTLR <- Result_Model_Type2$survival_result_based_on_MTLR
Error_mat_for_Model <- Result_Model_Type2$Error_mat_for_Model
str(survCurves_data)
#> 'data.frame': 15 obs. of 19 variables:
#> $ time_point : num 0 4 10.9 14.4 18.9 ...
#> $ TCGA-E1-A7Z6-01: num 1 0.998 0.992 0.987 0.975 ...
#> $ TCGA-S9-A7IX-01: num 1 0.987 0.946 0.916 0.854 ...
#> $ TCGA-HT-8010-01: num 1 0.996 0.984 0.975 0.955 ...
#> $ TCGA-VM-A8C8-01: num 1 0.991 0.965 0.946 0.905 ...
#> $ TCGA-DU-5847-01: num 1 0.996 0.983 0.973 0.952 ...
#> $ TCGA-TQ-A7RQ-01: num 1 0.997 0.986 0.978 0.959 ...
#> $ TCGA-HT-7606-01: num 1 0.986 0.944 0.914 0.85 ...
#> $ TCGA-S9-A7QY-01: num 1 0.998 0.991 0.986 0.975 ...
#> $ TCGA-DH-5142-01: num 1 0.999 0.998 0.997 0.993 ...
#> $ TCGA-DU-6408-01: num 1 0.998 0.991 0.986 0.975 ...
#> $ TCGA-FG-8191-01: num 1 0.998 0.99 0.984 0.971 ...
#> $ TCGA-TM-A84S-01: num 1 0.996 0.984 0.975 0.954 ...
#> $ TCGA-DB-A64S-01: num 1 1 0.998 0.997 0.994 ...
#> $ TCGA-DU-6542-01: num 1 0.996 0.982 0.972 0.95 ...
#> $ TCGA-DU-A7T6-01: num 1 0.988 0.952 0.926 0.872 ...
#> $ TCGA-DB-A4XF-01: num 1 0.997 0.988 0.981 0.965 ...
#> $ TCGA-S9-A6WN-01: num 1 0.995 0.98 0.968 0.943 ...
#> $ TCGA-P5-A5EX-01: num 1 0.993 0.97 0.953 0.916 ...
str(mean_median_survival_tim_d)
#> 'data.frame': 13 obs. of 4 variables:
#> $ ID : chr "TCGA-S9-A7IX-01" "TCGA-HT-8010-01" "TCGA-DU-5847-01" "TCGA-TQ-A7RQ-01" ...
#> $ Mean : num 56.2 83.3 70.6 93.3 85.1 ...
#> $ Median : num 48.4 84.2 60.3 87.2 84.4 ...
#> $ OS_month: num 27 2 18 26 17 28 64 114 33 2 ...
str(survival_result_bas_on_MTLR)
#> num [1:13, 1:5] 56.2 83.3 70.6 93.3 85.1 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:13] "TCGA-S9-A7IX-01" "TCGA-HT-8010-01" "TCGA-DU-5847-01" "TCGA-TQ-A7RQ-01" ...
#> ..$ : chr [1:5] "Mean_Survival" "Median_Survival" "Event_Probability" "Actual_OS_Time" ...
str(Error_mat_for_Model)
#> num [1:2, 1:3] 0.97 0.8 45.5 54.59 41.03 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:2] "Training_set" "Test_set"
#> ..$ : chr [1:3] "C_index" "Mean_MAE" "Median_MAE"
data(Train_Clin, package = "CPSM")
data(Test_Clin, package = "CPSM")
data(Train_PI_data, package = "CPSM")
data(Test_PI_data, package = "CPSM")
data(Key_Clin_features_with_PI_list, package = "CPSM")
Result_Model_Type3 <- MTLR_pred_model_f(
  train_clin_data = Train_Clin,
  test_clin_data = Test_Clin,
  Model_type = 3,
  train_features_data = Train_PI_data,
  test_features_data = Test_PI_data,
  Clin_Feature_List = Key_Clin_features_with_PI_list,
  surv_time = "OS_month",
  surv_event = "OS"
)
survCurves_data <- Result_Model_Type3$survCurves_data
mean_median_surv_tim_da <- Result_Model_Type3$mean_median_survival_time_data
survival_result_b_on_MTLR <- Result_Model_Type3$survival_result_based_on_MTLR
Error_mat_for_Model <- Result_Model_Type3$Error_mat_for_Model
str(survCurves_data)
#> 'data.frame': 15 obs. of 19 variables:
#> $ time_point : num 0 4 10.9 14.4 18.9 ...
#> $ TCGA-E1-A7Z6-01: num 1 0.998 0.992 0.987 0.975 ...
#> $ TCGA-S9-A7IX-01: num 1 0.987 0.946 0.916 0.854 ...
#> $ TCGA-HT-8010-01: num 1 0.996 0.984 0.975 0.955 ...
#> $ TCGA-VM-A8C8-01: num 1 0.991 0.965 0.946 0.905 ...
#> $ TCGA-DU-5847-01: num 1 0.996 0.983 0.973 0.952 ...
#> $ TCGA-TQ-A7RQ-01: num 1 0.997 0.986 0.978 0.959 ...
#> $ TCGA-HT-7606-01: num 1 0.986 0.944 0.914 0.85 ...
#> $ TCGA-S9-A7QY-01: num 1 0.998 0.991 0.986 0.975 ...
#> $ TCGA-DH-5142-01: num 1 0.999 0.998 0.997 0.993 ...
#> $ TCGA-DU-6408-01: num 1 0.998 0.991 0.986 0.975 ...
#> $ TCGA-FG-8191-01: num 1 0.998 0.99 0.984 0.971 ...
#> $ TCGA-TM-A84S-01: num 1 0.996 0.984 0.975 0.954 ...
#> $ TCGA-DB-A64S-01: num 1 1 0.998 0.997 0.994 ...
#> $ TCGA-DU-6542-01: num 1 0.996 0.982 0.972 0.95 ...
#> $ TCGA-DU-A7T6-01: num 1 0.988 0.952 0.926 0.872 ...
#> $ TCGA-DB-A4XF-01: num 1 0.997 0.988 0.981 0.965 ...
#> $ TCGA-S9-A6WN-01: num 1 0.995 0.98 0.968 0.943 ...
#> $ TCGA-P5-A5EX-01: num 1 0.993 0.97 0.953 0.916 ...
str(mean_median_survival_tim_d)
#> 'data.frame': 13 obs. of 4 variables:
#> $ ID : chr "TCGA-S9-A7IX-01" "TCGA-HT-8010-01" "TCGA-DU-5847-01" "TCGA-TQ-A7RQ-01" ...
#> $ Mean : num 56.2 83.3 70.6 93.3 85.1 ...
#> $ Median : num 48.4 84.2 60.3 87.2 84.4 ...
#> $ OS_month: num 27 2 18 26 17 28 64 114 33 2 ...
str(survival_result_bas_on_MTLR)
#> num [1:13, 1:5] 56.2 83.3 70.6 93.3 85.1 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:13] "TCGA-S9-A7IX-01" "TCGA-HT-8010-01" "TCGA-DU-5847-01" "TCGA-TQ-A7RQ-01" ...
#> ..$ : chr [1:5] "Mean_Survival" "Median_Survival" "Event_Probability" "Actual_OS_Time" ...
str(Error_mat_for_Model)
#> num [1:2, 1:3] 0.97 0.8 45.5 54.59 41.03 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:2] "Training_set" "Test_set"
#> ..$ : chr [1:3] "C_index" "Mean_MAE" "Median_MAE"
data(Train_Clin, package = "CPSM")
data(Test_Clin, package = "CPSM")
data(Train_Uni_sig_data, package = "CPSM")
data(Test_Uni_sig_data, package = "CPSM")
data(Key_univariate_features_with_Clin_list, package = "CPSM")
Result_Model_Type5 <- MTLR_pred_model_f(
  train_clin_data = Train_Clin,
  test_clin_data = Test_Clin,
  Model_type = 4,
  train_features_data = Train_Uni_sig_data,
  test_features_data = Test_Uni_sig_data,
  Clin_Feature_List = Key_univariate_features_with_Clin_list,
  surv_time = "OS_month",
  surv_event = "OS"
)
survCurves_data <- Result_Model_Type5$survCurves_data
mean_median_surv_tim_da <- Result_Model_Type5$mean_median_survival_time_data
survival_result_b_on_MTLR <- Result_Model_Type5$survival_result_based_on_MTLR
Error_mat_for_Model <- Result_Model_Type5$Error_mat_for_Model
str(survCurves_data)
#> 'data.frame': 15 obs. of 19 variables:
#> $ time_point : num 0 4 10.9 14.4 18.9 ...
#> $ TCGA-E1-A7Z6-01: num 1 1 0.999 0.998 0.996 ...
#> $ TCGA-S9-A7IX-01: num 1 0.983 0.93 0.89 0.801 ...
#> $ TCGA-HT-8010-01: num 1 0.999 0.998 0.997 0.993 ...
#> $ TCGA-VM-A8C8-01: num 1 0.994 0.978 0.966 0.939 ...
#> $ TCGA-DU-5847-01: num 1 0.999 0.995 0.992 0.985 ...
#> $ TCGA-TQ-A7RQ-01: num 1 0.994 0.979 0.967 0.942 ...
#> $ TCGA-HT-7606-01: num 1 0.981 0.921 0.876 0.764 ...
#> $ TCGA-S9-A7QY-01: num 1 0.998 0.993 0.989 0.98 ...
#> $ TCGA-DH-5142-01: num 1 0.999 0.996 0.994 0.989 ...
#> $ TCGA-DU-6408-01: num 1 0.999 0.994 0.991 0.984 ...
#> $ TCGA-FG-8191-01: num 1 0.997 0.99 0.983 0.967 ...
#> $ TCGA-TM-A84S-01: num 1 0.999 0.995 0.991 0.984 ...
#> $ TCGA-DB-A64S-01: num 1 0.999 0.995 0.993 0.986 ...
#> $ TCGA-DU-6542-01: num 1 0.995 0.981 0.971 0.946 ...
#> $ TCGA-DU-A7T6-01: num 1 0.997 0.989 0.981 0.96 ...
#> $ TCGA-DB-A4XF-01: num 1 0.998 0.991 0.986 0.976 ...
#> $ TCGA-S9-A6WN-01: num 1 0.989 0.955 0.93 0.887 ...
#> $ TCGA-P5-A5EX-01: num 1 0.993 0.972 0.956 0.923 ...
str(mean_median_survival_tim_d)
#> 'data.frame': 13 obs. of 4 variables:
#> $ ID : chr "TCGA-S9-A7IX-01" "TCGA-HT-8010-01" "TCGA-DU-5847-01" "TCGA-TQ-A7RQ-01" ...
#> $ Mean : num 56.2 83.3 70.6 93.3 85.1 ...
#> $ Median : num 48.4 84.2 60.3 87.2 84.4 ...
#> $ OS_month: num 27 2 18 26 17 28 64 114 33 2 ...
str(survival_result_bas_on_MTLR)
#> num [1:13, 1:5] 56.2 83.3 70.6 93.3 85.1 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:13] "TCGA-S9-A7IX-01" "TCGA-HT-8010-01" "TCGA-DU-5847-01" "TCGA-TQ-A7RQ-01" ...
#> ..$ : chr [1:5] "Mean_Survival" "Median_Survival" "Event_Probability" "Actual_OS_Time" ...
str(Error_mat_for_Model)
#> num [1:2, 1:3] 0.9 0.76 50.11 58.3 47.88 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:2] "Training_set" "Test_set"
#> ..$ : chr [1:3] "C_index" "Mean_MAE" "Median_MAE"
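After running the `MTLR_pred_model_f` function, the following outputs are generated:

1. `survCurves_data`: Predicted survival probabilities for each patient at the model's time points.
2. `mean_median_survival_time_data`: Predicted mean and median survival times for each patient, alongside the observed `OS_month`.
3. `survival_result_based_on_MTLR`: A matrix of per-patient results, including `Mean_Survival`, `Median_Survival`, `Event_Probability`, and `Actual_OS_Time`.
4. `Error_mat_for_Model`: The `C_index`, `Mean_MAE`, and `Median_MAE` values for the training and test sets.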
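The `Mean_MAE` and `Median_MAE` columns of `Error_mat_for_Model` report mean absolute errors. A plain mean absolute error between predicted and observed survival times can be sketched as below, using the first four `Mean` and `OS_month` values printed for the clinical-features model. This is a simplified illustration: how the package treats censored patients when computing these errors is not stated here, so ignoring censoring is an assumption.

```r
# First four observed survival times (months) and predicted mean survival times
# taken from the mean_median_survival_time_data output shown above
actual_os <- c(27, 2, 18, 26)
pred_mean <- c(56.2, 83.3, 70.6, 93.3)

# Mean absolute error, ignoring censoring (an assumption of this sketch)
mae <- mean(abs(pred_mean - actual_os))
mae
```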
To visualize the survival of patients, we use the `surv_curve_plots_f` function, which generates survival curve plots based on the `survCurves_data` obtained from the previous step (after running the `MTLR_pred_model_f` function). This function also provides the option to highlight the survival curve of a specific patient.

The function requires two inputs:

1. `Surv_curve_data`: The data object containing predicted survival probabilities for all patients.
2. Sample ID: The ID of the specific patient (e.g., `TCGA-TQ-A8XE-01`) whose survival curve you want to highlight.
# Create Survival curves/plots for individual patients
data(survCurves_data, package = "CPSM")
plots <- surv_curve_plots_f(
  Surv_curve_data = survCurves_data,
  selected_sample = "TCGA-TQ-A7RQ-01"
)
# Print the plots
print(plots$all_patients_plot)
print(plots$highlighted_patient_plot)
## Outputs

After running the function, two output plots are generated:

1. Survival curves for all patients in the test data, displayed with different colors for each patient.
2. Survival curves for all patients (in black) with the selected patient highlighted in red.
These plots allow for easy visualization of individual patient survival in the context of the overall test data.
To visualize the predicted survival times for patients, we use the `mean_median_surv_barplot_f` function, which generates bar plots for the mean and median survival times based on the data obtained from Step 5 after running the `MTLR_pred_model_f` function. This function also provides the option to highlight a specific patient on the bar plot.

This function requires two inputs:

1. `surv_mean_med_data`: The data containing the predicted mean and median survival times for all patients.
2. Sample ID: The ID of the specific patient (e.g., `TCGA-TQ-A7RQ-01`) whose bar plot should be highlighted.
data(mean_median_survival_time_data, package = "CPSM")
plots_2 <- mean_median_surv_barplot_f(
  surv_mean_med_data = mean_median_survival_time_data,
  selected_sample = "TCGA-TQ-A7RQ-01"
)
# Print the plots
print(plots_2$mean_med_all_pat)
print(plots_2$highlighted_selected_pat)
After running the function, two output bar plots are generated:

1. Bar plot for all patients in the test data, where the red-colored bars represent the mean survival time, and the cyan/green-colored bars represent the median survival time.
2. Bar plot for all patients with a highlighted patient (indicated by a dashed black outline). This plot shows that the highlighted patient has predicted mean and median survival times of 81.58 and 75.50 months, respectively.
These plots provide a clear comparison of the predicted survival times for all patients and the highlighted individual patient.
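To make the link between a predicted survival curve and its mean and median survival times concrete, the sketch below derives both from a made-up curve in base R. The trapezoid rule for the mean and linear interpolation for the median are assumptions of this illustration; the MTLR-based calculation used by the package may differ, for example in how the curve is extended beyond the last time point.

```r
# Made-up survival curve for one patient: time points (months) and S(t)
time_point <- c(0, 10, 20, 30, 40)
surv_prob <- c(1, 0.9, 0.6, 0.4, 0.1)

# Mean survival time: area under the curve up to the last time point (trapezoid rule)
mean_surv <- sum(
  diff(time_point) * (head(surv_prob, -1) + tail(surv_prob, -1)) / 2
)

# Median survival time: time at which S(t) crosses 0.5 (linear interpolation)
median_surv <- approx(x = surv_prob, y = time_point, xout = 0.5)$y

c(mean = mean_surv, median = median_surv)
```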
To predict the survival-based risk group of test samples (i.e., high-risk with shorter survival or low-risk with longer survival), we use the `predict_survival_risk_group_f()` function provided in the CPSM package. This function implements a randomForestSRC-based prediction approach for survival risk classification. This function first defines actual risk groups in the training data using the median overall survival time.
Multiple Random Survival Forest (RSF) models are then trained using different values for `ntree`: 10, 20, 50, 100, 250, 500, 750, 1000.
The model with the best performance (e.g., highest accuracy) is selected automatically. This best-performing model is used to predict the risk group of test samples, along with prediction probabilities.
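Labelling samples by a median survival cutoff can be sketched in base R as follows. The survival times are made up, and assigning samples at or below the median to the high-risk group is an assumption of this sketch; the exact boundary rule used by `predict_survival_risk_group_f()` is not given here.

```r
# Made-up overall survival times in months for six training samples
os_month <- c(2, 18, 26, 27, 32, 46)

# Median survival time serves as the cutoff
cutoff <- median(os_month)

# Longer survival than the cutoff: low risk; otherwise: high risk (assumed rule)
risk_group <- ifelse(os_month > cutoff, "Low_Risk", "High_Risk")

table(risk_group)
```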
- `selected_train_data`: A data frame with normalized expression values for selected features and survival information (`OS_month`, `OS_event`) for the training set.
- `selected_test_data`: A data frame with normalized expression values for the same features for the test set.
- `Feature_List`: A character vector containing the names of selected features to be used in the model.

# Load example data from CPSM package
data(Train_PI_data, package = "CPSM")
data(Test_PI_data, package = "CPSM")
data(Key_PI_list, package = "CPSM")
# Predict survival-based risk groups for test samples
Results_Risk_group_Prediction <- predict_survival_risk_group_f(
  selected_train_data = Train_PI_data,
  selected_test_data = Test_PI_data,
  Feature_List = Key_PI_list
)
#> Training _with ntree = 10
#> Training _with ntree = 20
#> Training _with ntree = 50
#> Training _with ntree = 100
#> Training _with ntree = 250
#> Training _with ntree = 500
#> Training _with ntree = 750
#> Training _with ntree = 1000
# Performance of the best model on training and test data
Best_model_Prediction_results <- Results_Risk_group_Prediction$misclassification_results
print(head(Best_model_Prediction_results))
#> Best_ntree OOB_Misclassification High_Risk_Error Low_Risk_Error
#> all 10 0.406 0.443 0.392
#> Train_Misclassification_Error Train_Accuracy Train_Sensitivity
#> all 0.038 96.2 96.2
#> Train_Specificity Test_Misclassification_Error Test_Accuracy
#> all 96.2 0.5 50
#> Test_Sensitivity Test_Specificity
#> all 41.67 66.67
# Prediction results of the best model on the test set
Test_results <- Results_Risk_group_Prediction$Test_results
print(head(Test_results))
#> Sample_ID Actual Predicted_Risk_Group High_Risk_Prob
#> TCGA-E1-A7Z6-01 TCGA-E1-A7Z6-01 Low_Risk Low_Risk 0.3
#> TCGA-S9-A7IX-01 TCGA-S9-A7IX-01 High_Risk Low_Risk 0.2
#> TCGA-HT-8010-01 TCGA-HT-8010-01 High_Risk High_Risk 1.0
#> TCGA-VM-A8C8-01 TCGA-VM-A8C8-01 Low_Risk High_Risk 1.0
#> TCGA-DU-5847-01 TCGA-DU-5847-01 High_Risk Low_Risk 0.0
#> TCGA-TQ-A7RQ-01 TCGA-TQ-A7RQ-01 High_Risk Low_Risk 0.0
#> Low_Risk_Prob Prediction_Prob OS_month OS_event
#> TCGA-E1-A7Z6-01 0.7 0.7 32 1
#> TCGA-S9-A7IX-01 0.8 0.8 27 1
#> TCGA-HT-8010-01 0.0 1.0 2 0
#> TCGA-VM-A8C8-01 0.0 1.0 46 0
#> TCGA-DU-5847-01 1.0 1.0 18 0
#> TCGA-TQ-A7RQ-01 1.0 1.0 26 0
The output is a list that includes:
1. The best prediction model
2. Performance metrics (accuracy, sensitivity, specificity, error rate, etc.) for the training and test data
3. Predicted risk groups with prediction probability values for the training samples
4. Predicted risk groups with prediction probability values for the test samples

Users can use these results for further validation and visualization, such as overlaying test sample survival curves on the training KM plot (see next step).
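As a rough guide to how such performance metrics relate to the prediction table, the sketch below recomputes accuracy, sensitivity, and specificity from the Actual and Predicted_Risk_Group columns. It uses only the six rows printed by head(Test_results) above, so the values differ from the full test-set metrics. Treating High_Risk as the positive class is an assumption; the package may define it differently.

```r
# The six rows shown by head(Test_results) in the output above
res <- data.frame(
  Actual = c("Low_Risk", "High_Risk", "High_Risk",
             "Low_Risk", "High_Risk", "High_Risk"),
  Predicted_Risk_Group = c("Low_Risk", "Low_Risk", "High_Risk",
                           "High_Risk", "Low_Risk", "Low_Risk")
)

# Confusion matrix with a fixed level order (rows: actual, columns: predicted)
lv <- c("High_Risk", "Low_Risk")
cm <- table(
  Actual = factor(res$Actual, levels = lv),
  Predicted = factor(res$Predicted_Risk_Group, levels = lv)
)

# Assumption: High_Risk is treated as the positive class
tp <- cm["High_Risk", "High_Risk"]
fn <- cm["High_Risk", "Low_Risk"]
tn <- cm["Low_Risk", "Low_Risk"]
fp <- cm["Low_Risk", "High_Risk"]

accuracy <- 100 * (tp + tn) / sum(cm)
sensitivity <- 100 * tp / (tp + fn)
specificity <- 100 * tn / (tn + fp)

print(cm)
print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 2))
```

Running the same calculation on the complete Test_results data frame, rather than its first six rows, gives figures that can be compared with the test-set metrics reported by the function.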
To visually evaluate how a specific test sample compares to the survival risk groups defined in the training dataset, we use the km_overlay_plot_f() function. This function overlays the predicted survival curve of a selected test sample onto the Kaplan-Meier (KM) survival plot derived from the training data. This visual comparison helps determine how closely the test sample aligns with population-level survival trends.

## Required Inputs
The function requires the following inputs:
- Train_results: A data frame containing predicted risk groups, survival times (OS_month), event status (OS_event), and additional training data. Row names must correspond to sample IDs.
- Test_results: A data frame with predicted risk groups and prediction probabilities for the test dataset. Row names must correspond to sample IDs.
- survcurve_te_data: A data frame with predicted survival probabilities over multiple time points for the test samples (obtained in Step 5).
- selected_sample: The sample ID (matching a row in Test_results) for which the test survival curve should be plotted.

# Load example results
data(Train_results, package = "CPSM")
data(Test_results, package = "CPSM")
data(survCurves_data, package = "CPSM")
# Select a test sample to visualize
sample_id <- "TCGA-TQ-A7RQ-01"
# Generate KM overlay plot
KM_plot <- km_overlay_plot_f(
  Train_results = Train_results,
  Test_results = Test_results,
  survcurve_te_data = survCurves_data,
  selected_sample = sample_id
)
# Display plot
KM_plot
This visualization is useful for:
- Displaying individual patient patterns in the survival context of the training samples
- Verifying predicted risk classifications
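The idea behind the overlay can be illustrated with the survival package and base graphics. This is a minimal sketch, not the implementation of km_overlay_plot_f(): the training data and the predicted curve below are synthetic values invented for illustration, and the object names are hypothetical.

```r
library(survival)

# Synthetic training data (illustrative only, not CPSM output)
train <- data.frame(
  OS_month = c(5, 8, 12, 15, 20, 30, 42, 55, 60, 72),
  OS_event = c(1, 1, 1, 0, 1, 1, 0, 1, 0, 0),
  Risk = rep(c("High_Risk", "Low_Risk"), each = 5)
)

# Kaplan-Meier curves for the two risk groups in the training data
km_fit <- survfit(Surv(OS_month, OS_event) ~ Risk, data = train)

# Hypothetical predicted survival curve for one test sample
pred_time <- c(0, 12, 24, 36, 48, 60)
pred_surv <- c(1.00, 0.90, 0.75, 0.60, 0.50, 0.40)

# Strata are ordered alphabetically: High_Risk (red), then Low_Risk (blue)
plot(km_fit, col = c("red", "blue"),
     xlab = "Time (months)", ylab = "Survival probability")

# Overlay the test sample's predicted curve as a dashed step line
lines(pred_time, pred_surv, type = "s", lty = 2, lwd = 2)
legend("topright",
       legend = c("High_Risk (train)", "Low_Risk (train)", "Test sample"),
       col = c("red", "blue", "black"), lty = c(1, 1, 2), bty = "n")
```

A dashed curve that tracks one of the two group curves suggests that the sample's predicted survival agrees with that risk group.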
The Nomogram_generate_f function in the CPSM package allows you to generate a nomogram plot based on user-defined clinical and other relevant features in the data. For example, we will generate a nomogram using six features: Age, Gender, Race, Histological Type, Sample Type, and PI score.
To create the nomogram, we need to provide the following inputs:
1. Train_Data_Nomogram_input: A dataset containing all the features, where samples are in the rows and features are in the columns.
2. feature_list_for_Nomogram: A list of features (e.g., Age, Gender, etc.) that will be used to generate the nomogram.
3. surv_time: The column name containing survival time in months (e.g., OS_month).
4. surv_event: The column name containing survival event information (e.g., OS).

data(Train_Data_Nomogram_input, package = "CPSM")
data(feature_list_for_Nomogram, package = "CPSM")
Result_Nomogram <- Nomogram_generate_f(
  data = Train_Data_Nomogram_input,
  Feature_List = feature_list_for_Nomogram,
  surv_time = "OS_month",
  surv_event = "OS"
)
C_index_mat <- Result_Nomogram$C_index_mat
After running the function, the output is a nomogram that predicts the risk of an event (e.g., death), as well as the 1-year, 3-year, 5-year, and 10-year survival probabilities for patients based on the selected features. The nomogram provides a visual representation for estimating a patient’s survival outcomes over multiple time points, helping clinicians make more informed decisions.
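A nomogram of this kind is typically a graphical rendering of a fitted survival regression model, and a concordance index (C-index) is a common way to summarize how well such a model ranks patients. The sketch below illustrates both ideas with a Cox model. It is written under assumptions: it uses the lung dataset shipped with the survival package as stand-in data, with age and sex playing the role of nomogram features, and it does not reproduce the internals of Nomogram_generate_f.

```r
library(survival)

# Stand-in data: the `lung` dataset from the survival package (not CPSM data)
# `age` and `sex` play the role of the nomogram features
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Hazard ratio for each feature (a value > 1 means higher risk)
hazard_ratios <- exp(coef(fit))

# Concordance index of the fitted model
# (0.5 = no better than chance, 1 = perfect ranking)
c_index <- concordance(fit)$concordance

print(round(hazard_ratios, 3))
print(round(c_index, 3))
```

With the CPSM example data, the analogous model would use the columns named in the feature list, together with OS_month and OS as the survival time and event.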
As the last part of this document, we call the function “sessionInfo()”, which reports the version numbers of R and all the packages used in this session. It is good practice to always keep such a record, as it helps to trace what has happened if an R script stops working because functions have changed in a newer version of a package.
sessionInfo()
#> R version 4.5.0 Patched (2025-04-21 r88169)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Ventura 13.7.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/New_York
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SummarizedExperiment_1.39.0 Biobase_2.69.0
#> [3] GenomicRanges_1.61.0 GenomeInfoDb_1.45.0
#> [5] IRanges_2.43.0 S4Vectors_0.47.0
#> [7] BiocGenerics_0.55.0 generics_0.1.3
#> [9] MatrixGenerics_1.21.0 matrixStats_1.5.0
#> [11] CPSM_1.1.1
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 rstudioapi_0.17.1 jsonlite_2.0.0
#> [4] shape_1.4.6.1 magrittr_2.0.3 TH.data_1.1-3
#> [7] farver_2.1.2 rmarkdown_2.29 vctrs_0.6.5
#> [10] ROCR_1.0-11 base64enc_0.1-3 rstatix_0.7.2
#> [13] polspline_1.1.25 htmltools_0.5.8.1 S4Arrays_1.9.0
#> [16] broom_1.0.8 SparseArray_1.9.0 Formula_1.2-5
#> [19] pROC_1.18.5 caret_7.0-1 sass_0.4.10
#> [22] parallelly_1.43.0 bslib_0.9.0 htmlwidgets_1.6.4
#> [25] sandwich_3.1-1 plyr_1.8.9 zoo_1.8-14
#> [28] lubridate_1.9.4 cachem_1.1.0 lifecycle_1.0.4
#> [31] iterators_1.0.14 pkgconfig_2.0.3 Matrix_1.7-3
#> [34] R6_2.6.1 fastmap_1.2.0 GenomeInfoDbData_1.2.14
#> [37] future_1.40.0 digest_0.6.37 colorspace_2.1-1
#> [40] Hmisc_5.2-3 ggpubr_0.6.0 labeling_0.4.3
#> [43] km.ci_0.5-6 timechange_0.3.0 httr_1.4.7
#> [46] abind_1.4-8 compiler_4.5.0 proxy_0.4-27
#> [49] withr_3.0.2 htmlTable_2.4.3 backports_1.5.0
#> [52] carData_3.0-5 ggsignif_0.6.4 MASS_7.3-65
#> [55] lava_1.8.1 quantreg_6.1 DelayedArray_0.35.1
#> [58] ModelMetrics_1.2.2.2 tools_4.5.0 foreign_0.8-90
#> [61] future.apply_1.11.3 nnet_7.3-20 glue_1.8.0
#> [64] DiagrammeR_1.0.11 nlme_3.1-168 gridtext_0.1.5
#> [67] grid_4.5.0 checkmate_2.3.2 cluster_2.1.8.1
#> [70] reshape2_1.4.4 recipes_1.3.0 gtable_0.3.6
#> [73] KMsurv_0.1-5 class_7.3-23 preprocessCore_1.71.0
#> [76] tidyr_1.3.1 survminer_0.5.0 data.table_1.17.0
#> [79] xml2_1.3.8 car_3.1-3 XVector_0.49.0
#> [82] foreach_1.5.2 pillar_1.10.2 stringr_1.5.1
#> [85] splines_4.5.0 ggtext_0.1.2 dplyr_1.1.4
#> [88] lattice_0.22-7 survival_3.8-3 SparseM_1.84-2
#> [91] tidyselect_1.2.1 rms_8.0-0 knitr_1.50
#> [94] gridExtra_2.3 svglite_2.1.3 xfun_0.52
#> [97] hardhat_1.4.1 timeDate_4041.110 visNetwork_2.1.2
#> [100] stringi_1.8.7 UCSC.utils_1.5.0 evaluate_1.0.3
#> [103] codetools_0.2-20 data.tree_1.1.0 tibble_3.2.1
#> [106] cli_3.6.5 rpart_4.1.24 xtable_1.8-4
#> [109] randomForestSRC_3.3.3 systemfonts_1.2.2 MTLR_0.2.1
#> [112] jquerylib_0.1.4 survMisc_0.5.6 dichromat_2.0-0.1
#> [115] Rcpp_1.0.14 globals_0.17.0 parallel_4.5.0
#> [118] MatrixModels_0.5-4 ggfortify_0.4.17 gower_1.0.2
#> [121] ggplot2_3.5.2 listenv_0.9.1 glmnet_4.1-8
#> [124] mvtnorm_1.3-3 ipred_0.9-15 e1071_1.7-16
#> [127] scales_1.4.0 prodlim_2024.06.25 purrr_1.0.4
#> [130] crayon_1.5.3 rlang_1.1.6 multcomp_1.4-28