A large variety of RNA molecules exist in cells. They adopt diverse structures to serve important cellular functions. In the last decade, a number of deep sequencing-based technologies have been developed to study RNA structures at a genome-wide scale, at high resolution, and in vivo or in vitro. These technologies generate large-scale and highly noisy data. A common goal in many studies utilizing such data is to compare RNA structuromes under two or more conditions, and identify RNAs or their regions that may have altered structures between the conditions.
dStruct: dStruct is a package meant for differential analysis of RNA Structurome profiling data (Choudhary et al. 2019). It is broadly applicable to all existing technologies for RNA structurome profiling. It accounts for inherent biological variation in structuromes between replicate samples and returns differential regions. Here, we describe the input data format required by dStruct, provide use cases, and discuss options for analysis and visualization.
For details on the methods presented here, consider having a look at our manuscript:
Choudhary, K., Lai, Y. H., Tran, E. J., & Aviran, S. (2019). dStruct: identifying differentially reactive regions from RNA structurome profiling data. Genome Biology, 20(1), 1-26. doi: 10.1186/s13059-019-1641-3
The following code chunk loads the
tidyverse packages. We use
tidyverse for cleaner code presentation in this vignette.
dStruct is broadly applicable to technologies for RNA structurome profiling, e.g., SHAPE-Seq, Structure-Seq, PARS, SHAPE-MaP, etc. The raw data from these technologies are reads from DNA sequencing platforms. The start sites of read alignment coordinates are adjacent to sites of reactions of ribonucleotides with a structure-probing reagent. For some technologies, the sites of mismatches between reads and the reference genome sequences are sites of reactions, e.g., SHAPE-MaP and DMS-MaPseq. The counts of reads mapping to the nucleotides of an RNA should be tallied and converted to reactivities, which represent the degrees of reactions. The specific steps to process data from sequencing reads to reactivities depend on the structurome profiling technology. For a review on structurome profiling technologies and methods for assessing reactivities, see Choudhary, Deng, and Aviran (2017).
dStruct takes nucleotide-level normalized reactivities of one or multiple RNAs as input. Currently, the preffered normalization method depends on the RNA structurome profiling technology. We have provided an implementation of the 2-8 % method for normalization (Low and Weeks (2010)) via a function named
twoEightNormalize, which is commonly used to normalize data from technologies such as SHAPE-Seq, Structure-Seq, etc. In the following, we assume that the user has processed the raw reads to obtain normalized reactivities for their RNAs of interest.
The input data to dStruct is a
list object. Each element of the
list should be a
data.frame containing normalized reactivities.
We have provided two test datasets with this package. These are derived from experiments described in Lai et al. (2019) and Wan et al. (2014). As described in the section on differential analysis below, we use these to illustrate de novo and guided discovery use cases, respectively. To start with, let us examine the data structure required for input.
The Lai et al. data can be loaded as follows.
This adds an object named
lai2019 to the global environment, which can be checked using the following function.
##  "lai2019"
lai2019 is an object of class
list. Each of the list elements is a
data.frame containing normalized reactivities.
##  "list"
names(lai2019) %>% head()
##  "YJR045C" "YOR383C" "YHL033C" "YMR120C" "YJR009C" "YJL189W"
All elements of
lai2019 have the same structure. Let us check one of them.
##  "data.frame"
head(lai2019[["YAL042W"]], n= 20)
## A1 A2 A3 B1 B2 ## 1 NA NA NA NA NA ## 2 NA NA NA NA NA ## 3 NA NA NA NA NA ## 4 NA NA NA NA NA ## 5 NA NA NA NA NA ## 6 NA NA NA NA NA ## 7 NA NA NA NA NA ## 8 0.1543654 0.68574419 0.58373375 0.10324931 1.5352021 ## 9 NA NA NA NA NA ## 10 1.4011014 0.74277787 0.78535765 0.00000000 0.8647261 ## 11 NA NA NA NA NA ## 12 0.8934077 0.70739927 0.77902156 1.90763268 0.4470532 ## 13 0.2122749 0.04184798 0.08881596 0.43993182 0.0000000 ## 14 0.0000000 0.07652060 0.07850023 0.25289521 0.9604754 ## 15 0.1638981 0.25520100 0.07795018 0.06206092 0.0000000 ## 16 2.3609079 1.26914665 0.37622768 0.26726991 2.5536829 ## 17 0.7113571 0.15869237 0.22513679 0.07022696 0.0000000 ## 18 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 ## 19 0.2581967 0.00000000 0.18476585 0.01770362 0.4273101 ## 20 NA NA NA NA NA
Each row of the above
data.frame has the reactivity for one nucleotide of the RNA with id
"YAL042W" and each column is one sample. Consecutive rows must represent consecutive nucleotides.
NA values represent unavailable reactivities. If the data were generated using base-selective reagents (e.g., DMS used by Lai et al.),
dStruct expects that the reactivities of unprobed bases have been masked as
NA before running the differential analysis.
dStruct supports comparisons of samples from two conditions at a time. The prefixes
B in the columns of each
lai2019 stand for the two conditions. It is required that the conditions be labeled as
dStruct will be looking for these internally. The numeric suffixes in the column names are replicate sample numbers. These must start with
1 and increase in steps of 1. If the samples were prepared in batches with each batch having one sample of each group, the corresponding samples should be given the same numeric suffix.
Differential analysis can be performed for multiple RNAs in one step or a single RNA at a time. Additionally, we allow for two modes of analysis, one to identify differentially reactive regions de novo, and another called as guided discovery to assess the significance of differences in reactivities in the regions provided by the user.
dStruct analyzes reactivity profiles for a single transcript to identify differential regions de novo. Additionally, we provide a wrapper function called
dStructome which can analyze profiles for multiple transcripts simultaneously. For example, a single transcript, say
YAL042W, can be analyzed as follows.
dStruct(rdf = lai2019[["YAL042W"]], reps_A = 3, reps_B = 2, batches = TRUE, min_length = 21, between_combs = data.frame(c("A3", "B1", "B2")), within_combs = data.frame(c("A1", "A2", "A3")), ind_regions = TRUE)
## IRanges object with 5 ranges and 3 metadata columns: ## start end width | pval del_d FDR ## <integer> <integer> <integer> | <numeric> <numeric> <numeric> ##  19 146 128 | 0.000145679 0.1107181 0.000728395 ##  530 553 24 | 0.071184091 0.0793906 0.071184091 ##  583 632 50 | 0.021902932 0.0174842 0.036504886 ##  1122 1177 56 | 0.036887884 0.0204798 0.046109855 ##  1187 1218 32 | 0.016484691 0.0920271 0.036504886
Input arguments. The arguments
reps_B of the
dStruct function are required while the others are optional. These take in a
data.frame with reactivity values, the number of samples of group A (group of wild-type S. cerevisiae samples), and the number of samples of group B (group of dbp2\(\Delta\) S. cerevisiae samples), respectively. Set
TRUE if the samples were prepared in batches with a paired experiment design, i.e., two samples in each batch – one of each group.
min_length is the minimum length of a differential region that
dStruct would search for. For the Lai et al. dataset, we set it to 21 nt because that is the putative length of regions bound by Dbp2, which was the RNA helicase under investigation in their study. As described in our manuscript,
dStruct performs differential analysis by regrouping the samples in homogeneous and heterogeneous groups and assessing the dissimilarity of reactivity profiles in these groups. The homogeneous groups comprise of samples from the same original group (e.g., A or B) while the heterogeneous groups comprise of samples from both A and B, respectively (see Choudhary et al. (2019) for details). Users can explicitly specify these groupings to the
dStruct function as shown in the example above. Otherwise,
dStruct would automatically generate the homogeneous and heterogeneous groups. Setting
TRUE requires test for significance of difference in reactivity patterns in individual regions. If it is set to
FALSE all regions identified within a transcript are tested collectively to obtain significance scores at the level of a transcript.
Output. The output is an
IRanges object with columns giving the start (column titled start) and end (column titled end) coordinates of the tested regions from the transcript whose reactivity profile is input, p-values, medians of nucleotide-wise differences of the between-group and within-group d scores (column titled del_d), and the FDR-adjusted p-values.
Alternatively, all transcripts in a reactivity list can be analyzed in one step using the wrapper function
dStructome. By default,
dStructome spawns multiple processes, which speeds up the analysis by processing transcripts in parallel. This behavior can be changed using the argument
processes as shown in the following code chunk.
res <- dStructome(lai2019, 3, 2, batches= TRUE, min_length = 21, between_combs = data.frame(c("A3", "B1", "B2")), within_combs = data.frame(c("A1", "A2", "A3")), ind_regions = TRUE, processes = 1)
This returns the coordinates of regions in transcripts with the corresponding p- and FDR-values.
## IRanges object with 6 ranges and 4 metadata columns: ## start end width | pval del_d t ## <integer> <integer> <integer> | <numeric> <numeric> <character> ##  15 203 189 | 2.33508e-10 0.0931510 YJR045C ##  397 450 54 | 5.86849e-03 0.1454283 YJR045C ##  486 516 31 | 1.07727e-02 0.1392935 YJR045C ##  540 586 47 | 1.47197e-02 0.1025044 YJR045C ##  1009 1060 52 | 1.12415e-03 0.0947074 YJR045C ##  1065 1150 86 | 1.61434e-03 0.0758058 YJR045C ## FDR ## <numeric> ##  2.21833e-08 ##  1.50677e-02 ##  2.17746e-02 ##  2.74190e-02 ##  5.08545e-03 ##  6.66793e-03
dStructGuided tests for differential reactivity profiles of pre-defined regions. Additionally, the wrapper function
dStructome can analyze profiles for multiple regions simultaneously when its argument
method = "guided".
We illustrate this mode of running
dStruct using PARS data from Wan et al. (2014). We downloaded the data reported in their manuscript and processed it to get PARS scores as described in Choudhary et al. (2019). The PARS scores are available with the
dStruct package and can be loaded as follows.
This adds an object named
wan2014 to the global environment, which can be checked using the following function.
##  "lai2019" "res" "wan2014"
wan2014 is an object of class
list and has the same structure as described for Lai et al.’s data above. Each of its elements is a
data.frame containing PARS scores for an 11 nt region. At its center, each region has a single-nucleotide variant between the two groups of samples (the regions could be defined by users based on other considerations, e.g., regions bound by proteins based on assays such as eCLIP-seq). There are two samples of group A and one of group B.
## A1 A2 B1 ## 357 3.6080461 4.017427 2.9634741 ## 358 2.7955295 4.334984 2.4288433 ## 359 3.7938507 3.336283 2.3813066 ## 360 -4.6073303 -4.857981 -5.2343042 ## 361 -4.1478987 -3.960829 -3.5739914 ## 362 -0.3410369 -1.915608 -0.7267698 ## 363 -1.1824781 -1.669851 -2.0728629 ## 364 -1.6066576 -3.048363 -5.2223924 ## 365 -3.1375035 -2.807355 -2.8479969 ## 366 3.1375035 1.841302 1.2895066 ## 367 3.0356239 2.938599 2.4780473
The names of the list elements give a transcript’s ID and the coordinates of its region to be tested.
names(wan2014) %>% head()
##  "1__NM_000146__357__367" "3__NM_015841__2784__2794" ##  "2__ENST00000527880.1__281__291" "3__NM_033554__673__683" ##  "3__NM_020820__2667__2677" "3__NM_002473__6909__6919"
To test a single region, use
dStructGuided as follows.
dStructGuided(wan2014[], reps_A = 2, reps_B = 1)
## pval del_d ## 0.16015625 0.02291958
dStructGuided returns a p-value for significance of differential PARS scores in the input region and the median of nucleotide-wise differences of the between-group and within-group d scores. Alternatively, all regions in
wan2014 can be analyzed in one step using the wrapper function
res_predefined_regs <- dStructome(wan2014, reps_A = 2, reps_B = 1, method = "guided", processes = 1)
This returns a table with the coordinates of regions in transcripts with the corresponding p- and FDR-values.
## t pval del_d FDR ## 1 1__NM_000146__357__367 0.16015625 0.02291958 0.18017578 ## 2 2__ENST00000527880.1__281__291 0.20654297 0.09063676 0.20654297 ## 3 3__NM_020820__2667__2677 0.08740234 0.07210989 0.11237444 ## 4 1__NM_002482__1624__1634 0.07373047 0.01983767 0.11059570 ## 5 3__NM_032855__1812__1822 0.04150391 0.07708277 0.09140625 ## 6 2__NM_199440__385__395 0.05078125 0.09179449 0.09140625
plotDStructurome takes in a list of
data.frames with reactivities and the results from
dStructome to save a PDF file showing the reactivities in a region. We illustrate this with the result for the
lai2019 data. Let us plot the reactivities for the gene with the lowest FDR value. We can identify it as follows.
toPlot <- res@elementMetadata %>% data.frame() %$% magrittr::extract(t, order(FDR)) %>% head(., n = 1) toPlot
##  "YJR045C"
Users may pass the entire reactivity list to
plotDStructurome and the complete result table. By default, the results are plotted for FDR < 0.05 and \(y\)-axis limits range from 0 to 3. However, this may take some time in case there are a large number of differential regions. Here, we plot the reactivities for the gene in the object
plotDStructurome(rl = lai2019[toPlot], diff_regions = subset(res, t == toPlot), outfile = paste0("DRRs_in_", toPlot), fdr = 0.05, ylim = c(-0.05, 3))
This saves a PDF file named “DRRs_in_YJR045C.pdf” in the current working directory.
## R version 4.2.0 RC (2022-04-21 r82226) ## Platform: x86_64-pc-linux-gnu (64-bit) ## Running under: Ubuntu 20.04.4 LTS ## ## Matrix products: default ## BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so ## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so ## ## locale: ##  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ##  LC_TIME=en_GB LC_COLLATE=C ##  LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ##  LC_PAPER=en_US.UTF-8 LC_NAME=C ##  LC_ADDRESS=C LC_TELEPHONE=C ##  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## attached base packages: ##  stats graphics grDevices utils datasets methods base ## ## other attached packages: ##  forcats_0.5.1 stringr_1.4.0 dplyr_1.0.8 purrr_0.3.4 ##  readr_2.1.2 tidyr_1.2.0 tibble_3.1.6 ggplot2_3.3.5 ##  tidyverse_1.3.1 dStruct_1.3.0 BiocStyle_2.25.0 ## ## loaded via a namespace (and not attached): ##  Rcpp_126.96.36.199 lubridate_1.8.0 lattice_0.20-45 ##  zoo_1.8-10 assertthat_0.2.1 digest_0.6.29 ##  utf8_1.2.2 plyr_1.8.7 R6_2.5.1 ##  cellranger_1.1.0 backports_1.4.1 stats4_4.2.0 ##  reprex_2.0.1 evaluate_0.15 httr_1.4.2 ##  pillar_1.7.0 rlang_1.0.2 readxl_1.4.0 ##  rstudioapi_0.13 jquerylib_0.1.4 S4Vectors_0.35.0 ##  rmarkdown_2.14 labeling_0.4.2 munsell_0.5.0 ##  broom_0.8.0 compiler_4.2.0 modelr_0.1.8 ##  xfun_0.30 BiocGenerics_0.43.0 pkgconfig_2.0.3 ##  htmltools_0.5.2 tidyselect_1.1.2 bookdown_0.26 ##  IRanges_2.31.0 fansi_1.0.3 crayon_1.5.1 ##  tzdb_0.3.0 dbplyr_2.1.1 withr_2.5.0 ##  grid_4.2.0 jsonlite_1.8.0 gtable_0.3.0 ##  lifecycle_1.0.1 DBI_1.1.2 magrittr_2.0.3 ##  scales_1.2.0 cli_3.3.0 stringi_1.7.6 ##  reshape2_1.4.4 farver_2.1.0 fs_1.5.2 ##  xml2_1.3.3 bslib_0.3.1 ellipsis_0.3.2 ##  generics_0.1.2 vctrs_0.4.1 tools_4.2.0 ##  glue_1.6.2 hms_1.1.1 parallel_4.2.0 ##  fastmap_1.1.0 yaml_2.3.5 colorspace_2.0-3 ##  BiocManager_1.30.17 rvest_1.0.2 knitr_1.38 ##  haven_2.5.0 sass_0.4.1
Choudhary, Krishna, Fei Deng, and Sharon Aviran. 2017. “Comparative and Integrative Analysis of RNA Structural Profiling Data: Current Practices and Emerging Questions.” Quantitative Biology 5 (1): 3–24.
Choudhary, Krishna, Yu-Hsuan Lai, Elizabeth J Tran, and Sharon Aviran. 2019. “dStruct: Identifying Differentially Reactive Regions from RNA Structurome Profiling Data.” Genome Biology 20 (1): 1–26.
Lai, Yu-Hsuan, Krishna Choudhary, Sara C Cloutier, Zheng Xing, Sharon Aviran, and Elizabeth J Tran. 2019. “Genome-Wide Discovery of DEAD-Box RNA Helicase Targets Reveals RNA Structural Remodeling in Transcription Termination.” Genetics 212 (1): 153–74.
Low, Justin T, and Kevin M Weeks. 2010. “SHAPE-Directed RNA Secondary Structure Prediction.” Methods 52 (2): 150–58.
Wan, Yue, Kun Qu, Qiangfeng Cliff Zhang, Ryan A Flynn, Ohad Manor, Zhengqing Ouyang, Jiajing Zhang, et al. 2014. “Landscape and Variation of RNA Secondary Structure Across the Human Transcriptome.” Nature 505 (7485): 706–9.