ProteinGymR 1.3.0
This vignette demonstrates how to explore and visualize DMS and model scores
from the ProteinGym database v1.2. Specifically, it walks through the
functionality to generate heatmaps of all pairwise DMS substitution mutants and
projects these scores onto 3D protein structures. ProteinGymR
uses
functionality from the Bioconductor package ComplexHeatmap and
r3dmol from CRAN under the hood to generate the heatmaps and 3D protein
structures, respectively.
Explore the “ACE2_HUMAN” assay from Chan et al. 2020 and create a heatmap of
the DMS scores with plot_dms_heatmap()
. If the argument dms_data
is not
specified, the default will load the most recent DMS substitution data from
ProteinGym with ProteinGymR::dms_substitutions()
. This function only requires
a specific assay name. To obtain all assay names, run:
names(dms_substitutions())
. By default, the function also plots the full range
of positions where DMS data is available for this assay. To plot a specific
region of interest, use the arguments start_pos()
and end_pos()
which takes
in an integer for the first and last residue position to plot in the protein.
ace2_dms <- plot_dms_heatmap(
assay_name = "ACE2_HUMAN_Chan_2020",
start_pos = 1,
end_pos = 100)
## 'dms_data' not provided, using DMS data loaded with dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
print(ace2_dms)
The heatmap shows the DMS score at each position along the given protein (x-axis) where a residue was mutated (alternate amino acid on displayed on the y-axis and the reference allele at the position is shown on top). For this demonstration, we subset to the first 1-100 positions and grouped the amino acids by their physiochemical properties (DE,KRH,NQ,ST,PGAVIL,MC,FYW). See [here][physiochem] for more information. As a note, not all positions along the protein sequence may be subjected to mutation for every DMS assay. This results from the specific research objectives, prioritization choices of the investigators, or technical constraints inherent to the experimental design.
A low DMS score indicates low fitness, while a higher DMS score indicates high fitness. We can think of higher DMS scores as being more benign, while lower DMS score indicates more pathogenic regions.
Based on the “ACE2_HUMAN_Chan_2020” assay, we can see that at positions 90 and 92, fitness remained high despite across amino acid changes; possibly suggestive of a benign region of the protein. However, several mutations at position 48 resulted in low fitness. This could represent an important region for protein function where any perturbation would likely be deleterious.
Let’s plot another assay, specifying a region and invoking the ComplexHeatmap
row clustering under the hood. For more details about this clustering method or
to view more function parameters, read the function documentation with
?plot_dms_heatmap()
.
plot_dms_heatmap(assay_name = "SHOC2_HUMAN_Kwon_2022",
start_pos = 10,
end_pos = 60,
cluster_rows = TRUE)
## 'dms_data' not provided, using DMS data loaded with dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
In this region of the SHOC2_HUMAN protein, mutating to a K (y-axis) seemed to
have the most benign affect across all mutations.
ProteinGymR Bioc 3.21 provides functionality to generate heatmaps of zero-shot
mode scores for 79 variant effect prediction models and 11 semi-supervised
models with the function plot_zeroshot_heatmap()
. The required arguments
for this function are the assay name to plot (same as for the DMS heatmap),
and a model to plot. For a complete list of models, run available_models()
for
zero-shot models, and supervised_available_models()
for the 11 semi-supervised
models. If model_data
is not provided, the default model scores from
ProteinGym will be loaded in from ProteinGymR::zeroshot_substitutions()
.
ace2_model <- plot_zeroshot_heatmap(
assay_name = "ACE2_HUMAN_Chan_2020",
model = "GEMME",
start_pos = 1,
end_pos = 100)
## 'model_data' not provided, using default data loaded with zeroshot_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
print(ace2_model)
As with the DMS scores, we are plotting the GEMME zero-shot scores for each position available in the assay “ACE2_HUMAN_Chan_2020”. Note that the model scores here are mostly negative and thus, the color scale is red. Model scores are not rescaled or normalized across the 79 models, and comparison across of raw model predictions should be cautioned in this context. For more information on model scores and how to interpret them, consult the original ProteinGym publication.
It can be useful to visualize the DMS and model scores side by side for a given assay to compare the experimental DMS scores and predicted zero-shot scores outputted from the model. This is easily done with the following:
ComplexHeatmap::draw(ace2_dms %v% ace2_model)
%v%
stacks the heatmaps in one column, while +
will display them in two
columns, side by side. This can also be done with any output of class
ComplexHeatmap::Heatmap().
This vignette demonstrates how to explore and visualize DMS or model scores on a 3D protein structure using the package r3dmol under the hood. The function requires a file path to a PBD file containing protein structure coordinates and a DMS or model assay to aggregate DMS scores that will be projected onto the 3D structure.
By default, if no dms_data
argument is provided, the DMS substitutions from
ProteinGymR::dms_substitutions()
are loaded in, or if viewing model scores,
model scores from an assay loaded with ProteinGymR::model_substitutions()
will
be used as default. Alternatively, a user can specify their own DMS list to
project. The user may also specify what aggregation method to use for
calculating the summary statistic at each residue position. By default,
the mean DMS score/model prediction score is calculated for each position.
See the function documentation for details: ?plot_structure()
First, let’s use all the default settings. The only required arguments are
the assay_name
and the pdb_file
file path.
Importantly, the function normalizes the aggregated DMS scores so that a score of zero will always be represented as white and the range of scores will be scaled relatively to -1 and 1 in that assay, with 1 being the maximum aggregated DMS score within that assay region.
plot_structure(assay_name = "ACE2_HUMAN_Chan_2020")
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'dms_data' not provided, using DMS data loaded with dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
In this example, we are plotting the 3D structure of the ACE2_HUMAN protein and overlaying the mean DMS score across all mutants in a given position. Chan et al. 2020 who generated the DMS assay data only experimentally assessed a subset of the entire ACE2_HUMAN protein. By default the function only displayed the regions where there is information available in the assay. Red colors represent more pathogenic (lower DMS scores) and blue colors show more benign positions (higher DMS scores). Regions that appear white indicate closer to no change before and after the DMS perturbation. By default, regions of the protein without experiment information have the “ball and stick” representation.
If we want to visualize the complete protein structure, but simply grey out the
regions we do not have DMS assay information for rather than using the
“ball and stick” representation, we can set the argument
full_structure == TRUE
.
plot_structure(assay_name = "ACE2_HUMAN_Chan_2020",
full_structure = TRUE)
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'dms_data' not provided, using DMS data loaded with dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'full_structure' is set to TRUE by default, displaying complete protein structure.
Now we can more clearly see the entire protein structure for ACE2_HUMAN in the ribbon representation.
Some assays extensively assessed nearly every position of the complete protein,
for example: the C6KNH7_9INFA protein from Lee et al. 2018. Let’s visualize
this protein and set the aggregation method to view the minimum DMS score across
all mutants at each position by setting aggregation_fun = min
.
plot_structure(assay_name = "C6KNH7_9INFA_Lee_2018",
aggregate_fun = min)
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'dms_data' not provided, using DMS data loaded with dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
As we might expect, the minimum DMS value (more pathogenic) is almost always a negative number across all positions of this protein. Therefore, there seems to be at least one amino acid mutation that could severely distrupt the fitness at any position of this protein.
dms_data <- dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
df <- dms_data[["C6KNH7_9INFA_Lee_2018"]]
ggplot(df, aes(DMS_score)) +
geom_histogram(alpha = 0.8) +
theme_minimal() +
theme(text = element_text(size = 20))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can see that for this DMS assay, most values of the minimum DMS score at each
position are negative and are thus represented as red.
Finally, the user can also zoom into a specific region of the protein by
defining residue position coordinates with start_pos
and end_pos
.
Let’s just visualize the mean DMS scores of the protein from position 10:50.
plot_structure(assay_name = "ACE2_HUMAN_Chan_2020",
start_pos = 10,
end_pos = 50,
full_structure = FALSE)
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'dms_data' not provided, using DMS data loaded with dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
Finally, it is possible to use the same color scheme as the popEVE mutation portal. We can do this for any of the heatmaps or protein structure plots.
plot_structure(assay_name = "ACE2_HUMAN_Chan_2020",
color_scheme = "EVE")
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'dms_data' not provided, using DMS data loaded with dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
By default, dms_corr_plot()
allows the user to evaluate the correlation
between experimental and model prediction scores. By default, it takes in a
protein UniProt ID and runs a Spearman correlation between the ProteinGym DMS
assay scores and AlphaMissense predicted pathogenicity scores. It returns a
ggplot for visualization.
dms_corr_plot(uniprotId = "Q9NV35")
## 'alphamissense_table' not provided, using default table from `ProteinGymR::am_scores()`
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'dms_table' not provided, using default table from `ProteinGymR::dms_substitutions()`
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## [1] "r = -0.68; Pval = 0"
Be default, the dms_corr_plot()
function gathers any of the 217 DMS assays of
the chosen UniProt ID and correlates the average DMS score across relevant
assays and the AlphaMissense model predictions.
dms_corr_plot(uniprotId = "Q9NV35")
## 'alphamissense_table' not provided, using default table from `ProteinGymR::am_scores()`
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'dms_table' not provided, using default table from `ProteinGymR::dms_substitutions()`
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## [1] "r = -0.68; Pval = 0"
Be default, the dms_corr_plot()
function gathers any of the 217 DMS assays of
the chosen UniProt ID and correlates the average DMS scoree across relevant
assays and the AlphaMissense model predictions.
Although the default uses the AlphaMissense scores, it is also possible to correlate DMS experimental scores with predictions from any of the 79 zero-shot or 11 supervised models. Below is an example of the workflow to accomplish this for a single assay “A0A140D2T1_ZIKV_Sourisseau_2019”.
Load in both the zero-shot and DMS substitution scores for the 217 assays.
## Load model scores
model_data <- zeroshot_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## Load DMS scores
dms_data <- dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
Specify the chosen assay and model to explore. Again, to view all assay names to
choose from, run names()
on the list loaded in by dms_substitutions()
.
## Specify the chosen assay and model to explore.
assay_name <- "A0A140D2T1_ZIKV_Sourisseau_2019"
model <- "GEMME"
## Filter selected assay
model_df <- as.data.frame(model_data[[assay_name]])
dms_df <- as.data.frame(dms_data[[assay_name]])
## Filter out multiple aa sites
model_df <- model_df |>
filter(!str_detect(.data$mutant, ":"))
dms_df <- dms_df |>
filter(!str_detect(.data$mutant, ":"))
## Select chosen model
model_df <- model_df |>
dplyr::select(
mutant,
all_of(model)
)
model_df <- model_df |>
dplyr::select(all_of(model), mutant)
Now we wrangle and merge the scores into a data.frame that we will use for plotting.
## Wrangle the data
dms_df <- dms_df |>
dplyr::select(DMS_score, mutant)
merged_table <-
left_join(
model_df, dms_df,
by = c("mutant"),
relationship = "many-to-many"
) |>
na.omit() |>
select(mutant, DMS_score, model = all_of(model))
## Corr function
pg_correlate <-
function(merged_table)
{
cor_results <-
cor.test(
merged_table$DMS_score, merged_table$model,
method=c("spearman"),
exact = FALSE
)
cor_results
}
cor_results <- pg_correlate(merged_table)
## Correlation density plot
pg_density_plot <-
merged_table |>
ggplot(
aes(x = .data$DMS_score, y = .data$model)
) +
geom_bin2d(bins = 60) +
geom_point(alpha = 0) +
scale_fill_continuous(type = "viridis") +
xlab("DMS score") +
ylab(paste0(model, "\nzero shot score")) +
theme_classic() +
theme(
axis.text.x = element_text(size = 16),
axis.text.y = element_text(size = 16),
axis.title.y = element_text(size = 16, vjust = 2),
axis.title.x = element_text(size = 16, vjust = 0),
legend.title = element_text(size = 16),
legend.text = element_text(size = 16)
)
# Add marginal density plots
pg_density_plot <- ggMarginal(
pg_density_plot,
type = "densigram", # Can also use "histogram"
fill = "#B0C4DE",
color = "black" # Change color as needed
)
pg_density_plot
print(paste0("r = ", format(round(cor_results$estimate, 2)),
"; Pval = ", cor_results$p.value))
## [1] "r = 0.43; Pval = 0"
We can also explore the correlation between two different models.
## Select assay and models to compare
assay_name <- "A0A140D2T1_ZIKV_Sourisseau_2019"
model1 <- "EVE_ensemble"
model2 <- "S3F_MSA"
## Filter selected assay
model_df <- model_data[[assay_name]]
## Filter out multiple aa sites
model_df <- model_df |>
filter(!grepl(":", .data$mutant))
## Select chosen model
model_df <- model_df |>
dplyr::select(
mutant,
all_of(c(model1 = model1, model2 = model2))
)
## Corr function
pg_correlate <-
function(model_df)
{
cor_results <-
cor.test(
model_df$model1, model_df$model2,
method=c("spearman"),
exact = FALSE
)
cor_results
}
cor_results <- pg_correlate(model_df = model_df)
## Correlation density plot
pg_density_plot <-
model_df |>
ggplot(
aes(x = .data$model1, y = .data$model2)
) +
geom_bin2d(bins = 60) +
geom_point(alpha = 0) +
scale_fill_continuous(type = "viridis") +
xlab(paste0(model1, "\nzero shot score")) +
ylab(paste0(model2, "\nzero shot score")) +
theme_classic() +
theme(
axis.text.x = element_text(size = 16),
axis.text.y = element_text(size = 16),
axis.title.y = element_text(size = 16, vjust = 2),
axis.title.x = element_text(size = 16, vjust = 0),
legend.title = element_text(size = 16),
legend.text = element_text(size = 16)
)
## Add marginal density plots
pg_density_plot <- ggMarginal(
pg_density_plot,
type = "densigram", # Can also use "histogram"
fill = "#B0C4DE",
color = "black" # Change color as needed
)
pg_density_plot
print(paste0("r = ", format(round(cor_results$estimate, 2)),
"; Pval = ", cor_results$p.value))
## [1] "r = 0.84; Pval = 0"
Between the EVE_ensemble and S3F_MSA, there seems to be good correlation between the model predictions for the variants in the assay “A0A140D2T1_ZIKV_Sourisseau_2019”.
sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggExtra_0.10.1 ComplexHeatmap_2.25.0 ggplot2_3.5.2
## [4] stringr_1.5.1 dplyr_1.1.4 tidyr_1.3.1
## [7] ProteinGymR_1.3.0 BiocStyle_2.37.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 httr2_1.1.2 rlang_1.1.6
## [4] magrittr_2.0.3 clue_0.3-66 GetoptLong_1.0.5
## [7] matrixStats_1.5.0 compiler_4.5.0 RSQLite_2.3.9
## [10] png_0.1-8 vctrs_0.6.5 pkgconfig_2.0.3
## [13] shape_1.4.6.1 crayon_1.5.3 fastmap_1.2.0
## [16] magick_2.8.6 dbplyr_2.5.0 XVector_0.49.0
## [19] labeling_0.4.3 promises_1.3.2 rmarkdown_2.29
## [22] tinytex_0.57 purrr_1.0.4 bit_4.6.0
## [25] xfun_0.52 cachem_1.1.0 queryup_1.0.5
## [28] GenomeInfoDb_1.45.1 jsonlite_2.0.0 gghalves_0.1.4
## [31] blob_1.2.4 later_1.4.2 parallel_4.5.0
## [34] spdl_0.0.5 cluster_2.1.8.1 R6_2.6.1
## [37] stringi_1.8.7 bslib_0.9.0 RColorBrewer_1.1-3
## [40] jquerylib_0.1.4 Rcpp_1.0.14 bookdown_0.43
## [43] iterators_1.0.14 knitr_1.50 IRanges_2.43.0
## [46] httpuv_1.6.16 tidyselect_1.2.1 dichromat_2.0-0.1
## [49] yaml_2.3.10 doParallel_1.0.17 codetools_0.2-20
## [52] miniUI_0.1.2 curl_6.2.2 tibble_3.2.1
## [55] withr_3.0.2 Biobase_2.69.0 shiny_1.10.0
## [58] bio3d_2.4-5 KEGGREST_1.49.0 evaluate_1.0.3
## [61] r3dmol_0.1.2 BiocFileCache_2.99.0 ggdist_3.3.3
## [64] circlize_0.4.16 ExperimentHub_2.99.0 Biostrings_2.77.0
## [67] pillar_1.10.2 BiocManager_1.30.25 filelock_1.0.3
## [70] foreach_1.5.2 stats4_4.5.0 distributional_0.5.0
## [73] generics_0.1.3 BiocVersion_3.22.0 S4Vectors_0.47.0
## [76] scales_1.4.0 xtable_1.8-4 glue_1.8.0
## [79] tools_4.5.0 AnnotationHub_3.99.0 forcats_1.0.0
## [82] Cairo_1.6-2 AnnotationDbi_1.71.0 colorspace_2.1-1
## [85] GenomeInfoDbData_1.2.14 RcppSpdlog_0.0.21 cli_3.6.5
## [88] rappdirs_0.3.3 viridisLite_0.4.2 gtable_0.3.6
## [91] sass_0.4.10 digest_0.6.37 BiocGenerics_0.55.0
## [94] rjson_0.2.23 htmlwidgets_1.6.4 farver_2.1.2
## [97] memoise_2.0.1 htmltools_0.5.8.1 lifecycle_1.0.4
## [100] httr_1.4.7 GlobalOptions_0.1.2 mime_0.13
## [103] bit64_4.6.0-1