1 Setup

2 Introduction

This vignette demonstrates how to explore and visualize DMS and model scores from the ProteinGym database v1.2. Specifically, it walks through the functionality to generate heatmaps of all pairwise DMS substitution mutants and projects these scores onto 3D protein structures. ProteinGymR uses functionality from the Bioconductor package ComplexHeatmap and r3dmol from CRAN under the hood to generate the heatmaps and 3D protein structures, respectively.

3 Visualize DMS scores along a protein

Explore the “ACE2_HUMAN” assay from Chan et al. 2020 and create a heatmap of the DMS scores with plot_dms_heatmap(). If the argument dms_data is not specified, the default will load the most recent DMS substitution data from ProteinGym with ProteinGymR::dms_substitutions(). This function only requires a specific assay name. To obtain all assay names, run: names(dms_substitutions()). By default, the function also plots the full range of positions where DMS data is available for this assay. To plot a specific region of interest, use the arguments start_pos() and end_pos() which takes in an integer for the first and last residue position to plot in the protein.

ace2_dms <- plot_dms_heatmap(
    assay_name = "ACE2_HUMAN_Chan_2020",
    start_pos = 1, 
    end_pos = 100)
## 'dms_data' not provided, using DMS data loaded with dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
print(ace2_dms)

The heatmap shows the DMS score at each position along the given protein (x-axis) where a residue was mutated (alternate amino acid on displayed on the y-axis and the reference allele at the position is shown on top). For this demonstration, we subset to the first 1-100 positions and grouped the amino acids by their physiochemical properties (DE,KRH,NQ,ST,PGAVIL,MC,FYW). See [here][physiochem] for more information. As a note, not all positions along the protein sequence may be subjected to mutation for every DMS assay. This results from the specific research objectives, prioritization choices of the investigators, or technical constraints inherent to the experimental design.

A low DMS score indicates low fitness, while a higher DMS score indicates high fitness. We can think of higher DMS scores as being more benign, while lower DMS score indicates more pathogenic regions.

Based on the “ACE2_HUMAN_Chan_2020” assay, we can see that at positions 90 and 92, fitness remained high despite across amino acid changes; possibly suggestive of a benign region of the protein. However, several mutations at position 48 resulted in low fitness. This could represent an important region for protein function where any perturbation would likely be deleterious.

Let’s plot another assay, specifying a region and invoking the ComplexHeatmap row clustering under the hood. For more details about this clustering method or to view more function parameters, read the function documentation with ?plot_dms_heatmap().

plot_dms_heatmap(assay_name = "SHOC2_HUMAN_Kwon_2022", 
    start_pos = 10,
    end_pos = 60,
    cluster_rows = TRUE)
## 'dms_data' not provided, using DMS data loaded with dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache

In this region of the SHOC2_HUMAN protein, mutating to a K (y-axis) seemed to have the most benign affect across all mutations.

4 Visualize model scores along a protein

ProteinGymR Bioc 3.21 provides functionality to generate heatmaps of zero-shot mode scores for 79 variant effect prediction models and 11 semi-supervised models with the function plot_zeroshot_heatmap(). The required arguments for this function are the assay name to plot (same as for the DMS heatmap), and a model to plot. For a complete list of models, run available_models() for zero-shot models, and supervised_available_models() for the 11 semi-supervised models. If model_data is not provided, the default model scores from ProteinGym will be loaded in from ProteinGymR::zeroshot_substitutions().

ace2_model <- plot_zeroshot_heatmap(
    assay_name = "ACE2_HUMAN_Chan_2020", 
    model = "GEMME",
    start_pos = 1,
    end_pos = 100)
## 'model_data' not provided, using default data loaded with zeroshot_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
print(ace2_model)

As with the DMS scores, we are plotting the GEMME zero-shot scores for each position available in the assay “ACE2_HUMAN_Chan_2020”. Note that the model scores here are mostly negative and thus, the color scale is red. Model scores are not rescaled or normalized across the 79 models, and comparison across of raw model predictions should be cautioned in this context. For more information on model scores and how to interpret them, consult the original ProteinGym publication.

It can be useful to visualize the DMS and model scores side by side for a given assay to compare the experimental DMS scores and predicted zero-shot scores outputted from the model. This is easily done with the following:

ComplexHeatmap::draw(ace2_dms %v% ace2_model)

%v% stacks the heatmaps in one column, while + will display them in two columns, side by side. This can also be done with any output of class ComplexHeatmap::Heatmap().


5 3D protein structure

This vignette demonstrates how to explore and visualize DMS or model scores on a 3D protein structure using the package r3dmol under the hood. The function requires a file path to a PBD file containing protein structure coordinates and a DMS or model assay to aggregate DMS scores that will be projected onto the 3D structure.

By default, if no dms_data argument is provided, the DMS substitutions from ProteinGymR::dms_substitutions() are loaded in, or if viewing model scores, model scores from an assay loaded with ProteinGymR::model_substitutions() will be used as default. Alternatively, a user can specify their own DMS list to project. The user may also specify what aggregation method to use for calculating the summary statistic at each residue position. By default, the mean DMS score/model prediction score is calculated for each position. See the function documentation for details: ?plot_structure()

First, let’s use all the default settings. The only required arguments are the assay_name and the pdb_file file path.

Importantly, the function normalizes the aggregated DMS scores so that a score of zero will always be represented as white and the range of scores will be scaled relatively to -1 and 1 in that assay, with 1 being the maximum aggregated DMS score within that assay region.

plot_structure(assay_name = "ACE2_HUMAN_Chan_2020")
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'dms_data' not provided, using DMS data loaded with dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache


In this example, we are plotting the 3D structure of the ACE2_HUMAN protein and overlaying the mean DMS score across all mutants in a given position. Chan et al. 2020 who generated the DMS assay data only experimentally assessed a subset of the entire ACE2_HUMAN protein. By default the function only displayed the regions where there is information available in the assay. Red colors represent more pathogenic (lower DMS scores) and blue colors show more benign positions (higher DMS scores). Regions that appear white indicate closer to no change before and after the DMS perturbation. By default, regions of the protein without experiment information have the “ball and stick” representation.

If we want to visualize the complete protein structure, but simply grey out the regions we do not have DMS assay information for rather than using the “ball and stick” representation, we can set the argument full_structure == TRUE.


plot_structure(assay_name = "ACE2_HUMAN_Chan_2020", 
    full_structure = TRUE)
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'dms_data' not provided, using DMS data loaded with dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'full_structure' is set to TRUE by default, displaying complete protein structure.


Now we can more clearly see the entire protein structure for ACE2_HUMAN in the ribbon representation.

Some assays extensively assessed nearly every position of the complete protein, for example: the C6KNH7_9INFA protein from Lee et al. 2018. Let’s visualize this protein and set the aggregation method to view the minimum DMS score across all mutants at each position by setting aggregation_fun = min.

plot_structure(assay_name = "C6KNH7_9INFA_Lee_2018", 
    aggregate_fun = min)
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'dms_data' not provided, using DMS data loaded with dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache


As we might expect, the minimum DMS value (more pathogenic) is almost always a negative number across all positions of this protein. Therefore, there seems to be at least one amino acid mutation that could severely distrupt the fitness at any position of this protein.

dms_data <- dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
df <- dms_data[["C6KNH7_9INFA_Lee_2018"]]

ggplot(df, aes(DMS_score)) +
    geom_histogram(alpha = 0.8) + 
    theme_minimal() +
    theme(text = element_text(size = 20)) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


We can see that for this DMS assay, most values of the minimum DMS score at each position are negative and are thus represented as red.


Finally, the user can also zoom into a specific region of the protein by defining residue position coordinates with start_pos and end_pos.

Let’s just visualize the mean DMS scores of the protein from position 10:50.


plot_structure(assay_name = "ACE2_HUMAN_Chan_2020", 
    start_pos = 10,
    end_pos = 50, 
    full_structure = FALSE)
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'dms_data' not provided, using DMS data loaded with dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache

Finally, it is possible to use the same color scheme as the popEVE mutation portal. We can do this for any of the heatmaps or protein structure plots.

plot_structure(assay_name = "ACE2_HUMAN_Chan_2020",
    color_scheme = "EVE")
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'dms_data' not provided, using DMS data loaded with dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache

6 Correlate DMS scores with model scores

By default, dms_corr_plot() allows the user to evaluate the correlation between experimental and model prediction scores. By default, it takes in a protein UniProt ID and runs a Spearman correlation between the ProteinGym DMS assay scores and AlphaMissense predicted pathogenicity scores. It returns a ggplot for visualization.


dms_corr_plot(uniprotId = "Q9NV35")
## 'alphamissense_table' not provided, using default table from `ProteinGymR::am_scores()`
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'dms_table' not provided, using default table from `ProteinGymR::dms_substitutions()`
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## [1] "r = -0.68; Pval = 0"


Be default, the dms_corr_plot() function gathers any of the 217 DMS assays of the chosen UniProt ID and correlates the average DMS score across relevant assays and the AlphaMissense model predictions.

dms_corr_plot(uniprotId = "Q9NV35")
## 'alphamissense_table' not provided, using default table from `ProteinGymR::am_scores()`
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## 'dms_table' not provided, using default table from `ProteinGymR::dms_substitutions()`
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## [1] "r = -0.68; Pval = 0"

Be default, the dms_corr_plot() function gathers any of the 217 DMS assays of the chosen UniProt ID and correlates the average DMS scoree across relevant assays and the AlphaMissense model predictions.


Although the default uses the AlphaMissense scores, it is also possible to correlate DMS experimental scores with predictions from any of the 79 zero-shot or 11 supervised models. Below is an example of the workflow to accomplish this for a single assay “A0A140D2T1_ZIKV_Sourisseau_2019”.

Load in both the zero-shot and DMS substitution scores for the 217 assays.

## Load model scores
model_data <- zeroshot_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache
## Load DMS scores
dms_data <- dms_substitutions()
## see ?ProteinGymR and browseVignettes('ProteinGymR') for documentation
## loading from cache

Specify the chosen assay and model to explore. Again, to view all assay names to choose from, run names() on the list loaded in by dms_substitutions().

## Specify the chosen assay and model to explore.
assay_name <- "A0A140D2T1_ZIKV_Sourisseau_2019"
model <- "GEMME"

## Filter selected assay
model_df <- as.data.frame(model_data[[assay_name]])
dms_df <- as.data.frame(dms_data[[assay_name]])

## Filter out multiple aa sites
model_df <- model_df |>
    filter(!str_detect(.data$mutant, ":"))

dms_df <- dms_df |>  
    filter(!str_detect(.data$mutant, ":"))

## Select chosen model
model_df <- model_df |>
    dplyr::select(
        mutant,
        all_of(model)
    )

model_df <- model_df |>
    dplyr::select(all_of(model), mutant)

Now we wrangle and merge the scores into a data.frame that we will use for plotting.

## Wrangle the data
dms_df <- dms_df |>
   dplyr::select(DMS_score, mutant)

merged_table <- 
        left_join(
            model_df, dms_df, 
            by = c("mutant"),
            relationship = "many-to-many"
        ) |>
        na.omit() |> 
        select(mutant, DMS_score, model = all_of(model))

## Corr function
pg_correlate <- 
    function(merged_table)
{
    cor_results <- 
        cor.test(
            merged_table$DMS_score, merged_table$model, 
            method=c("spearman"), 
            exact = FALSE
        )
    
    cor_results
}

cor_results <- pg_correlate(merged_table)

## Correlation density plot
pg_density_plot <- 
    merged_table |> 
    ggplot(
        aes(x = .data$DMS_score, y = .data$model)
    ) +
    geom_bin2d(bins = 60) +
    geom_point(alpha = 0) +
    scale_fill_continuous(type = "viridis") +
    xlab("DMS score") +
    ylab(paste0(model, "\nzero shot score")) +
    theme_classic() +
    theme(
        axis.text.x = element_text(size = 16),
        axis.text.y = element_text(size = 16),
        axis.title.y = element_text(size = 16, vjust = 2),
        axis.title.x = element_text(size = 16, vjust = 0),
        legend.title = element_text(size = 16),
        legend.text = element_text(size = 16)
    )

# Add marginal density plots
pg_density_plot <- ggMarginal(
        pg_density_plot,
        type = "densigram", # Can also use "histogram"
        fill = "#B0C4DE", 
        color = "black"  # Change color as needed
    )


pg_density_plot

print(paste0("r = ", format(round(cor_results$estimate, 2)), 
                "; Pval = ", cor_results$p.value))
## [1] "r = 0.43; Pval = 0"


We can also explore the correlation between two different models.

## Select assay and models to compare
assay_name <- "A0A140D2T1_ZIKV_Sourisseau_2019"
model1 <- "EVE_ensemble"
model2 <- "S3F_MSA"

## Filter selected assay
model_df <- model_data[[assay_name]]

## Filter out multiple aa sites
model_df <- model_df |>  
    filter(!grepl(":", .data$mutant))

## Select chosen model
model_df <- model_df |>
    dplyr::select(
        mutant,
        all_of(c(model1 = model1, model2 = model2))
    )

## Corr function
pg_correlate <- 
    function(model_df)
{
    cor_results <- 
        cor.test(
            model_df$model1, model_df$model2, 
            method=c("spearman"), 
            exact = FALSE
        )
    
    cor_results
}

cor_results <- pg_correlate(model_df = model_df)

## Correlation density plot
pg_density_plot <- 
    model_df |> 
    ggplot(
        aes(x = .data$model1, y = .data$model2)
    ) +
    geom_bin2d(bins = 60) +
    geom_point(alpha = 0) +
    scale_fill_continuous(type = "viridis") +
    xlab(paste0(model1, "\nzero shot score")) +
    ylab(paste0(model2, "\nzero shot score")) +
    theme_classic() +
    theme(
        axis.text.x = element_text(size = 16),
        axis.text.y = element_text(size = 16),
        axis.title.y = element_text(size = 16, vjust = 2),
        axis.title.x = element_text(size = 16, vjust = 0),
        legend.title = element_text(size = 16),
        legend.text = element_text(size = 16)
    )

## Add marginal density plots
pg_density_plot <- ggMarginal(
        pg_density_plot,
        type = "densigram", # Can also use "histogram"
        fill = "#B0C4DE", 
        color = "black"  # Change color as needed
        )
    
pg_density_plot

print(paste0("r = ", format(round(cor_results$estimate, 2)), 
                "; Pval = ", cor_results$p.value))
## [1] "r = 0.84; Pval = 0"


Between the EVE_ensemble and S3F_MSA, there seems to be good correlation between the model predictions for the variants in the assay “A0A140D2T1_ZIKV_Sourisseau_2019”.

6.1 Session Info

sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] ggExtra_0.10.1        ComplexHeatmap_2.25.0 ggplot2_3.5.2        
## [4] stringr_1.5.1         dplyr_1.1.4           tidyr_1.3.1          
## [7] ProteinGymR_1.3.0     BiocStyle_2.37.0     
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.2.3               httr2_1.1.2             rlang_1.1.6            
##   [4] magrittr_2.0.3          clue_0.3-66             GetoptLong_1.0.5       
##   [7] matrixStats_1.5.0       compiler_4.5.0          RSQLite_2.3.9          
##  [10] png_0.1-8               vctrs_0.6.5             pkgconfig_2.0.3        
##  [13] shape_1.4.6.1           crayon_1.5.3            fastmap_1.2.0          
##  [16] magick_2.8.6            dbplyr_2.5.0            XVector_0.49.0         
##  [19] labeling_0.4.3          promises_1.3.2          rmarkdown_2.29         
##  [22] tinytex_0.57            purrr_1.0.4             bit_4.6.0              
##  [25] xfun_0.52               cachem_1.1.0            queryup_1.0.5          
##  [28] GenomeInfoDb_1.45.1     jsonlite_2.0.0          gghalves_0.1.4         
##  [31] blob_1.2.4              later_1.4.2             parallel_4.5.0         
##  [34] spdl_0.0.5              cluster_2.1.8.1         R6_2.6.1               
##  [37] stringi_1.8.7           bslib_0.9.0             RColorBrewer_1.1-3     
##  [40] jquerylib_0.1.4         Rcpp_1.0.14             bookdown_0.43          
##  [43] iterators_1.0.14        knitr_1.50              IRanges_2.43.0         
##  [46] httpuv_1.6.16           tidyselect_1.2.1        dichromat_2.0-0.1      
##  [49] yaml_2.3.10             doParallel_1.0.17       codetools_0.2-20       
##  [52] miniUI_0.1.2            curl_6.2.2              tibble_3.2.1           
##  [55] withr_3.0.2             Biobase_2.69.0          shiny_1.10.0           
##  [58] bio3d_2.4-5             KEGGREST_1.49.0         evaluate_1.0.3         
##  [61] r3dmol_0.1.2            BiocFileCache_2.99.0    ggdist_3.3.3           
##  [64] circlize_0.4.16         ExperimentHub_2.99.0    Biostrings_2.77.0      
##  [67] pillar_1.10.2           BiocManager_1.30.25     filelock_1.0.3         
##  [70] foreach_1.5.2           stats4_4.5.0            distributional_0.5.0   
##  [73] generics_0.1.3          BiocVersion_3.22.0      S4Vectors_0.47.0       
##  [76] scales_1.4.0            xtable_1.8-4            glue_1.8.0             
##  [79] tools_4.5.0             AnnotationHub_3.99.0    forcats_1.0.0          
##  [82] Cairo_1.6-2             AnnotationDbi_1.71.0    colorspace_2.1-1       
##  [85] GenomeInfoDbData_1.2.14 RcppSpdlog_0.0.21       cli_3.6.5              
##  [88] rappdirs_0.3.3          viridisLite_0.4.2       gtable_0.3.6           
##  [91] sass_0.4.10             digest_0.6.37           BiocGenerics_0.55.0    
##  [94] rjson_0.2.23            htmlwidgets_1.6.4       farver_2.1.2           
##  [97] memoise_2.0.1           htmltools_0.5.8.1       lifecycle_1.0.4        
## [100] httr_1.4.7              GlobalOptions_0.1.2     mime_0.13              
## [103] bit64_4.6.0-1