Getting Tumour Methylation Data with TumourMethData

library(TumourMethData)
#> Warning: replacing previous import 'HDF5Array::h5ls' by 'rhdf5::h5ls' when
#> loading 'TumourMethData'

Introduction

DNA methylation is a repressive epigenetic modification involving the addition of methyl groups to DNA, and in mammals it occurs almost exclusively at CpG dinucleotides. Altered DNA methylation plays a profound role in the development and progression of cancer. However, much of our knowledge of DNA methylation in cancer comes from methylation microarrays, which measure methylation at only a small subset (generally <1%) of the almost 30 million CpG sites in humans, mostly those located close to gene promoters. Whole genome bisulfite sequencing (WGBS) studies in tumours, which measure DNA methylation across the entire genome, therefore provide an invaluable resource for gaining a comprehensive understanding of DNA methylation changes in cancer, especially at regulatory regions located far from genes.

While packages such as curatedTCGAData provide DNA methylation data generated with microarrays for a range of different cancer types, TumourMethData provides a collection of whole genome DNA methylation datasets for several different cancers (at present primary prostate cancer, prostate cancer metastases, esophageal cancer and rhabdoid tumour), as well as matching normal samples where available.

These whole genome methylation datasets are provided as RangedSummarizedExperiments, facilitating easy download of the data and extraction of methylation values for regions of interest.

Furthermore, RNA-seq transcript counts are also provided for several of the datasets, enabling thorough analysis of how DNA methylation is associated with transcription and how this relationship is perturbed in cancer.

Downloading data

We can view the available datasets with TumourMethDatasets.

# Show available methylation datasets
data("TumourMethDatasets", package = "TumourMethData")
print(TumourMethDatasets)
#>                dataset_name cancer_type technology genome_build
#> 1           cpgea_wgbs_hg38    prostate       WGBS         hg38
#> 2            tcga_wgbs_hg38     various       WGBS         hg38
#> 3           mcrpc_wgbs_hg38    prostate       WGBS         hg38
#> 4     mcrpc_wgbs_hg38_chr11    prostate       WGBS         hg38
#> 5  cao_esophageal_wgbs_hg19  esophageal       WGBS         hg19
#> 6 target_rhabdoid_wgbs_hg19    rhabdoid       WGBS         hg19
#>   number_tumour_samples number_normal_samples wgbs_coverage_available
#> 1                   187                   187                   FALSE
#> 2                    39                     8                   FALSE
#> 3                   100                     0                    TRUE
#> 4                   100                     0                    TRUE
#> 5                    10                     9                   FALSE
#> 6                    69                     0                   FALSE
#>   dataset_size_gb transcript_counts_available
#> 1           40.00                        TRUE
#> 2            5.40                        TRUE
#> 3           16.00                        TRUE
#> 4            0.76                        TRUE
#> 5            2.00                        TRUE
#> 6            4.50                        TRUE
#>                                                                                                                                                                                                                                                                                                                                                                              notes
#> 1                                                                                                                                                                                                                                                                                                                                                                                 
#> 2                                                                                                                                                                                                                                                                                                                                                                                 
#> 3                                                                                                                                                                                                                                                                                                                                                                                 
#> 4                                                                                                                                                                                                                                                                                                     This dataset is a subset of the data in mcrpc_wgbs_hg38 for example purposes
#> 5                                                                                                                                                                                                                                                                                                                                                                                 
#> 6 Methylation values are not as precise as in other datasets. The original \n    methylation values were integers between 0 and 10 with separate values for the C and G positions of each CpG site.\n    The mean of these values was divided by 10 to produce the methylation values here, \n    with CpG sites missing methylation values for either to C or G given an NA value
#>                                                                                                                              original_publication
#> 1                                                            A genomic and epigenomic atlas of prostate cancer in Asian populations; Nature; 2020
#> 2                                      DNA methylation loss in late-replicating domains is linked to mitotic cell division; Nature genetics; 2018
#> 3                                                                The DNA methylation landscape of advanced prostate cancer; Nature genetics; 2020
#> 4                                                                The DNA methylation landscape of advanced prostate cancer; Nature genetics; 2020
#> 5              Multi-faceted epigenetic dysregulation of gene expression promotes esophageal squamous cell carcinoma; Nature communications; 2020
#> 6 Genome-Wide Profiles of Extra-cranial Malignant Rhabdoid Tumors Reveal Heterogeneity and Dysregulated Developmental Pathways; Cancer Cell; 2016
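
Because the table records genome build, sample numbers and download size, it can be filtered to shortlist datasets before downloading anything. The following is a minimal sketch in base R, assuming the table behaves like an ordinary data.frame. To keep it self-contained, it rebuilds a few columns by hand from the output printed above rather than loading the package; the 10 GB threshold is an arbitrary choice for illustration.

```r
# Toy copy of a few TumourMethDatasets columns, transcribed from the printed table
datasets <- data.frame(
  dataset_name = c("cpgea_wgbs_hg38", "tcga_wgbs_hg38", "mcrpc_wgbs_hg38",
                   "mcrpc_wgbs_hg38_chr11", "cao_esophageal_wgbs_hg19",
                   "target_rhabdoid_wgbs_hg19"),
  genome_build = c("hg38", "hg38", "hg38", "hg38", "hg19", "hg19"),
  number_normal_samples = c(187, 8, 0, 0, 9, 0),
  dataset_size_gb = c(40, 5.4, 16, 0.76, 2, 4.5),
  stringsAsFactors = FALSE
)

# Keep hg38 datasets that include normal samples and are smaller than 10 GB
small_hg38_with_normals <- subset(
  datasets,
  genome_build == "hg38" & number_normal_samples > 0 & dataset_size_gb < 10
)
print(small_hg38_with_normals$dataset_name)
```

With the package loaded, the same subset() call could presumably be applied directly to TumourMethDatasets in place of the hand-built copy.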

We use download_meth_dataset to download the methylation dataset we are interested in. Here we use mcrpc_wgbs_hg38_chr11, a small subset of mcrpc_wgbs_hg38 provided for example purposes.

# Download the mcrpc_wgbs_hg38_chr11 example dataset
mcrpc_wgbs_hg38_chr11 = download_meth_dataset(dataset = "mcrpc_wgbs_hg38_chr11")
#> see ?TumourMethData and browseVignettes('TumourMethData') for documentation
#> loading from cache
#> require("rhdf5")
print(mcrpc_wgbs_hg38_chr11)
#> class: RangedSummarizedExperiment 
#> dim: 1333114 100 
#> metadata(5): genome is_h5 ref_CpG chrom_sizes descriptive_stats
#> assays(2): beta cov
#> rownames: NULL
#> rowData names(0):
#> colnames(100): DTB_003 DTB_005 ... DTB_265 DTB_266
#> colData names(4): metastatis_site subtype age sex
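
Since the data is a RangedSummarizedExperiment, methylation values for a region of interest can be extracted with the standard SummarizedExperiment and GenomicRanges accessors. The sketch below illustrates the idea on a small in-memory toy object rather than the downloaded dataset, so that it runs without any download. The CpG positions, sample names, beta values and region coordinates are all made up for illustration.

```r
library(SummarizedExperiment)  # also attaches GenomicRanges and IRanges

# Toy stand-in for a downloaded dataset: 5 CpG sites x 3 samples (made-up values)
cpg_sites <- GRanges("chr11", IRanges(start = c(100, 250, 400, 900, 1500), width = 2))
beta <- matrix(
  c(0.9, 0.8, 0.1, 0.5, 0.7,   # sample S1
    0.7, 0.6, 0.2, 0.4, 0.9,   # sample S2
    0.8, NA,  0.3, 0.6, 0.8),  # sample S3
  nrow = 5, dimnames = list(NULL, c("S1", "S2", "S3"))
)
toy_rse <- SummarizedExperiment(assays = list(beta = beta), rowRanges = cpg_sites)

# Hypothetical region of interest
region <- GRanges("chr11", IRanges(start = 200, end = 1000))

# Keep the CpG sites overlapping the region, then pull out their methylation values
region_rse <- subsetByOverlaps(toy_rse, region)
region_beta <- as.matrix(assay(region_rse, "beta"))

# Mean methylation of the region in each sample, ignoring missing values
region_means <- colMeans(region_beta, na.rm = TRUE)
print(region_means)
```

The same pattern should carry over to mcrpc_wgbs_hg38_chr11 by substituting it for toy_rse. Its metadata includes an is_h5 entry, which suggests the assays are stored on disk, so subsetting to a region before calling as.matrix() avoids reading the whole beta matrix into memory.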

SessionInfo

sessionInfo()
#> R version 4.4.0 RC (2024-04-16 r86468)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] rhdf5_2.49.0                TumourMethData_1.3.0       
#>  [3] SummarizedExperiment_1.35.0 Biobase_2.65.0             
#>  [5] GenomicRanges_1.57.0        GenomeInfoDb_1.41.0        
#>  [7] IRanges_2.39.0              S4Vectors_0.43.0           
#>  [9] BiocGenerics_0.51.0         MatrixGenerics_1.17.0      
#> [11] matrixStats_1.3.0          
#> 
#> loaded via a namespace (and not attached):
#>  [1] KEGGREST_1.45.0         xfun_0.43               bslib_0.7.0            
#>  [4] lattice_0.22-6          rhdf5filters_1.17.0     vctrs_0.6.5            
#>  [7] tools_4.4.0             generics_0.1.3          curl_5.2.1             
#> [10] AnnotationDbi_1.67.0    tibble_3.2.1            fansi_1.0.6            
#> [13] RSQLite_2.3.6           blob_1.2.4              R.oo_1.26.0            
#> [16] pkgconfig_2.0.3         Matrix_1.7-0            dbplyr_2.5.0           
#> [19] lifecycle_1.0.4         GenomeInfoDbData_1.2.12 compiler_4.4.0         
#> [22] Biostrings_2.73.0       htmltools_0.5.8.1       sass_0.4.9             
#> [25] yaml_2.3.8              pillar_1.9.0            crayon_1.5.2           
#> [28] jquerylib_0.1.4         R.utils_2.12.3          DelayedArray_0.31.0    
#> [31] cachem_1.0.8            abind_1.4-5             mime_0.12              
#> [34] ExperimentHub_2.13.0    AnnotationHub_3.13.0    tidyselect_1.2.1       
#> [37] digest_0.6.35           purrr_1.0.2             dplyr_1.1.4            
#> [40] BiocVersion_3.20.0      fastmap_1.1.1           grid_4.4.0             
#> [43] cli_3.6.2               SparseArray_1.5.0       magrittr_2.0.3         
#> [46] S4Arrays_1.5.0          utf8_1.2.4              withr_3.0.0            
#> [49] rappdirs_0.3.3          filelock_1.0.3          UCSC.utils_1.1.0       
#> [52] bit64_4.0.5             rmarkdown_2.26          XVector_0.45.0         
#> [55] httr_1.4.7              bit_4.0.5               R.methodsS3_1.8.2      
#> [58] png_0.1-8               HDF5Array_1.33.0        memoise_2.0.1          
#> [61] evaluate_0.23           knitr_1.46              BiocFileCache_2.13.0   
#> [64] rlang_1.1.3             glue_1.7.0              DBI_1.2.2              
#> [67] BiocManager_1.30.22     jsonlite_1.8.8          Rhdf5lib_1.27.0        
#> [70] R6_2.5.1                zlibbioc_1.51.0