Warning: In this vignette, due to space limitations, we demonstrate the functions of RCAS using static images. In order to see how an interactive report from RCAS looks see
For the most up-to-date functionality, usage and installation instructions, and example outputs, see our github repository here.
RCAS is an automated system that provides dynamic genome annotations for custom input files that contain transcriptomic regions. Such transcriptomic regions could be, for instance, peak regions detected by CLIP-Seq analysis that detect protein-RNA interactions, RNA modifications (alias the epitranscriptome), CAGE-tag locations, or any other collection of target regions at the level of the transcriptome.
RCAS is designed as a reporting tool for the functional analysis of RNA-binding sites detected by high-throughput experiments. It takes as input a BED format file containing the genomic coordinates of the RNA binding sites and a GTF file that contains the genomic annotation features usually provided by publicly available databases such as Ensembl and UCSC. RCAS performs overlap operations between the genomic coordinates of the RNA binding sites and the genomic annotation features and produces in-depth annotation summaries such as the distribution of binding sites with respect to gene features (exons, introns, 5’/3’ UTR regions, exon-intron boundaries, promoter regions, and whole transcripts) along with functionally enriched term for targeted genes. Moreover, RCAS can darry out discriminative motif discovery in the query regions. The final report of RCAS consists of high-quality dynamic figures and tables, which are readily applicable for publications or other academic usage.
RCAS minimally requires as input a BED file and a GTF file. The BED file should contain coordinates/intervals of transcriptomic regions which are located via transcriptomics methods such as Clip-Seq. The GTF file should provide reference annotation. The recommended source of GTF files is the ENSEMBLE database.
For this vignette, in order to demonstrate RCAS functionality, we use sample BED and GTF data that are built-in the RCAS library, which can be imported using a common R function: data(). To import custom BED and GTF files, the user should execute two RCAS functions called importBed() and importGtf().
To use importBed() and importGtf(), the user should provide file paths to the respective BED file and GTF file. To reduce memory usage and time consumption, we advise the user to set
sampleN=10000 to avoid huge input of intervals.
To find out the distribution of the query regions across gene types:
biotype_col <- grep('gene_biotype', colnames(overlaps), value = T) df <- overlaps[,length(unique(queryIndex)), by = biotype_col] colnames(df) <- c("feature", "count") df$percent <- round(df$count / length(queryRegions) * 100, 1) df <- df[order(count, decreasing = TRUE)] ggplot2::ggplot(df, aes(x = reorder(feature, -percent), y = percent)) + geom_bar(stat = 'identity', aes(fill = feature)) + geom_label(aes(y = percent + 0.5), label = df$count) + labs(x = 'transcript feature', y = paste0('percent overlap (n = ', length(queryRegions), ')')) + theme_bw(base_size = 14) + theme(axis.text.x = element_text(angle = 90))
GTF files contain some annotation features (e.g. exons, transcripts) that are usually explicitly defined, however, some transcript features such as introns, exon-intron boundaries, promoter regions are only implicitly defined. Such implicit features can be extracted from a GTF file using makeTxDb family of functions from the GenomicFeatures library.
First we create a list of GRanges objects, where each list element contains all the available coordinates of transcript features such as transcripts, exons, introns, 5’/3’ UTRs, exon-intron boundaries, and promoter regions.
To have a global overview of the distribution of query regions across gene features, we can use the summarizeQueryRegions function. If a given query region does not overlap with any of the given coordinates of the transcript features, it is categorized under
summary <- summarizeQueryRegions(queryRegions = queryRegions, txdbFeatures = txdbFeatures) df <- data.frame(summary) df$percent <- round((df$count / length(queryRegions)), 3) * 100 df$feature <- rownames(df) ggplot2::ggplot(df, aes(x = reorder(feature, -percent), y = percent)) + geom_bar(stat = 'identity', aes(fill = feature)) + geom_label(aes(y = percent + 3), label = df$count) + labs(x = 'transcript feature', y = paste0('percent overlap (n = ', length(queryRegions), ')')) + theme_bw(base_size = 14) + theme(axis.text.x = element_text(angle = 90))
To find out which genes overlap with how many queries and categorise overlaps by transcript features; we use
getTargetedGenesTable function, which returns a data.frame object.
It may be useful to look at the distribution of query regions at the boundaries of transcript features. For instance, it may be important to see the relative signal at transcript ends (transcription start sites versus transcription end sites). Or, it may be important to see how the signal is distributed at exon boundaries, which may give an idea about the regulation of the transcript. Here we demonstrate how to get such signal distributions at transcription start/end sites. The same approach can be done for any other collection of transcript features (exons, introns, promoters, UTRs etc.)
cvgF <- getFeatureBoundaryCoverage(queryRegions = queryRegions, featureCoords = txdbFeatures$transcripts, flankSize = 1000, boundaryType = 'fiveprime', sampleN = 10000) cvgT <- getFeatureBoundaryCoverage(queryRegions = queryRegions, featureCoords = txdbFeatures$transcripts, flankSize = 1000, boundaryType = 'threeprime', sampleN = 10000) cvgF$boundary <- 'fiveprime' cvgT$boundary <- 'threeprime' df <- rbind(cvgF, cvgT) ggplot2::ggplot(df, aes(x = bases, y = meanCoverage)) + geom_ribbon(fill = 'lightgreen', aes(ymin = meanCoverage - standardError * 1.96, ymax = meanCoverage + standardError * 1.96)) + geom_line(color = 'black') + facet_grid( ~ boundary) + theme_bw(base_size = 14)
Coverage profiles can be obtained for a single type of transcript feature or a list of transcript features. Here we demonstrate how to get coverage profile of query regions across all available transcript features. It might be a good idea to use sampleN parameter to randomly downsample the target regions to speed up the calculations.
cvgList <- calculateCoverageProfileList(queryRegions = queryRegions, targetRegionsList = txdbFeatures, sampleN = 10000) ggplot2::ggplot(cvgList, aes(x = bins, y = meanCoverage)) + geom_ribbon(fill = 'lightgreen', aes(ymin = meanCoverage - standardError * 1.96, ymax = meanCoverage + standardError * 1.96)) + geom_line(color = 'black') + theme_bw(base_size = 14) + facet_wrap( ~ feature, ncol = 3)
We build a classifier based on k-mer frequencies (allowing for mismatches) to find the most informative motifs that help discriminate the query sequences from the background distribution.
A summary table from the motif analysis results can be obtained
RCAS makes use of gprofiler2 package enriched functions in genes that overlap the query regions.
|0.00e+00||organonitrogen compound biosynthetic process||GO:BP|
|1.50e-06||cellular macromolecule metabolic process||GO:BP|
|1.70e-06||organic substance biosynthetic process||GO:BP|
|2.30e-06||cellular biosynthetic process||GO:BP|
|2.02e-05||regulation of mRNA metabolic process||GO:BP|
|4.82e-05||RNA catabolic process||GO:BP|
|5.10e-05||cellular macromolecule biosynthetic process||GO:BP|
|5.43e-05||cellular protein metabolic process||GO:BP|
|5.55e-05||mRNA catabolic process||GO:BP|
The users can use the runReport() function to generate full custom reports including all the analysis modules described above. There are four main parts of the analysis report.
By default, runReport() function aims to run all three modules, while the user can turn off these individual modules.
Below are example commands to generate reports using these functionalities.
runReport( queryFilePath = 'input.BED', gffFilePath = 'annotation.gtf')
runReport( queryFilePath = 'input.BED', gffFilePath = 'annotation.gtf', motifAnalysis = FALSE, goAnalysis = FALSE )
You can run RCAS for any of the genome versions available in the BSgenome package.
runReport( queryFilePath = 'input.mm9.BED', gffFilePath = 'annotation.mm9.gtf', genomeVersion = 'mm9' )
runReport(quiet = TRUE)
One may be interested in printing the raw data used to make the plots and tables in the HTML report output of runReport function. Such tables could be used for meta-analysis of multiple analysis results. In order to activate this function, printProcessedTables argument must be set to TRUE.
runReport(printProcessedTables = TRUE)
RCAS is developed in the group of Altuna Akalin (head of the Scientific Bioinformatics Platform) by Bora Uyar (Bioinformatics Scientist), Dilmurat Yusuf (Bioinformatics Scientist) and Ricardo Wurmus (System Administrator) at the Berlin Institute of Medical Systems Biology (BIMSB) at the Max-Delbrueck-Center for Molecular Medicine (MDC) in Berlin.