Precise knowledge on the binding sites of an RNA-binding protein (RBP) is key to understand (post-) transcriptional regulatory processes. The package BindingSiteFinder provides all functionalities to define exact binding sites defined from iCLIP data. The following vignette describes the complete workflow.
BindingSiteFinder 2.2.0
Most cellular processes are regulated by RNA-binding proteins (RBPs). Knowledge on their exact positioning can be obtained from individual-nucleotide resolution UV crosslinking and immunoprecipitation (iCLIP) experiments. In a recent publication we described a complete analysis workflow to detect RBP binding sites from iCLIP data. The workflow covers all essential steps from quality control of sequencing reads, different peak calling options, to the downstream analysis and definition of binding sites. The pre-processing and peak calling steps rely on publicly available software, whereas the definition of the final binding sites follows a custom procedure implemented in BindingSiteFinder. This vignette explains how equally sized binding sites can be defined from a genome-wide iCLIP coverage.
The workflow described herein is based on our recently published complete iCLIP analysis pipeline (Busch et al. 2020). Thus, we expect the user to have preprocessed their iCLIP sequencing reads up to the point of the peak calling step. In brief, this includes basic processing of the sequencing reads, such as quality filtering, barcode handling, mapping and the generation of a single nucleotide crosslink file for all replicates under consideration. As we describe in our manuscript, replicate .bam files may or may not be merged prior to peak calling, for which we suggest PureCLIP (Krakau, Richard, and Marsico 2017). For simplicity, we address only the case in which peak calling was performed on the merge of all replicates.
The BindingSiteFinder package is available at https://bioconductor.org and can be installed via BiocManager::install
:
if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("BindingSiteFinder")
Note: If you use BindingSiteFinder in published research, please cite:
Busch, A., Brüggemann, M., Ebersberger, S., & Zarnack, K. (2020) iCLIP data analysis: A complete pipeline from sequencing reads to RBP binding sites. Methods, 178, 49-62. https://doi.org/10.1016/j.ymeth.2019.11.008
This part is for the impatient user that wants to start as fast as possible. For that reason this section contains only the most basic steps and function calls to get to the final binding site set. For more details on each step see the Standard workflow section.
It all starts with your input iCLIP data, which has to be in form of a BSFDataSet
. This data-structure combines crosslink sites (the output from PureCLIP) with the individual crosslinks per replicate (iCLIP coverage). Further instructions on the data-set constructions are given in the chapter Construction of the BindingSiteFinder dataset.
files <- system.file("extdata", package="BindingSiteFinder")
load(list.files(files, pattern = ".rda$", full.names = TRUE))
bds
## Dataset: Test set
## Object of class BSFDataSet
## #N Ranges: 19,749
## Width ranges: 1
## #N Samples: 4
## #N Conditions: 1
To intersect binding sites with genomic features, such as genes and transcript regions the respective regions have to be presented as GenomicRanges
objects. Further instructions on how to import annotations from various sources are given in chapter Construct the annotation objects.
load(list.files(files, pattern = ".rds$", full.names = TRUE)[1])
gns
## GRanges object with 1387 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000008735.14 chr22 50600793-50613981 + | ENSG00000008735.14
## ENSG00000015475.19 chr22 17734138-17774770 - | ENSG00000015475.19
## ENSG00000025708.14 chr22 50525752-50530032 - | ENSG00000025708.14
## ENSG00000025770.19 chr22 50508224-50524780 + | ENSG00000025770.19
## ENSG00000040608.14 chr22 20241415-20283246 - | ENSG00000040608.14
## ... ... ... ... . ...
## ENSG00000288106.1 chr22 38886837-38903794 + | ENSG00000288106.1
## ENSG00000288153.1 chr22 30664961-30674134 + | ENSG00000288153.1
## ENSG00000288262.1 chr22 11474744-11479643 + | ENSG00000288262.1
## ENSG00000288540.1 chr22 48320412-48333199 - | ENSG00000288540.1
## ENSG00000288683.1 chr22 18077989-18131720 + | ENSG00000288683.1
## gene_type gene_name
## <character> <character>
## ENSG00000008735.14 protein_coding MAPK8IP2
## ENSG00000015475.19 protein_coding BID
## ENSG00000025708.14 protein_coding TYMP
## ENSG00000025770.19 protein_coding NCAPH2
## ENSG00000040608.14 protein_coding RTN4R
## ... ... ...
## ENSG00000288106.1 lncRNA AL022318.5
## ENSG00000288153.1 unprocessed_pseudogene AC003072.2
## ENSG00000288262.1 unprocessed_pseudogene CT867976.1
## ENSG00000288540.1 lncRNA AL008720.2
## ENSG00000288683.1 protein_coding AC016027.6
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
load(list.files(files, pattern = ".rds$", full.names = TRUE)[2])
regions
## GRangesList object of length 4:
## $CDS
## GRanges object with 17035 ranges and 4 metadata columns:
## seqnames ranges strand | cds_id exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <integer> <character>
## chr22 15528192-15529139 + | 1 <NA> <NA>
## chr22 15690078-15690314 + | 2 <NA> <NA>
## chr22 15690078-15690709 + | 3 <NA> <NA>
## chr22 15690246-15690709 + | 4 <NA> <NA>
## chr22 15690426-15690709 + | 5 <NA> <NA>
## . ... ... ... . ... ... ...
## chr22 50782188-50782294 - | 17031 <NA> <NA>
## chr22 50782188-50782294 - | 17032 <NA> <NA>
## chr22 50782188-50782294 - | 17033 <NA> <NA>
## chr22 50782188-50782294 - | 17034 <NA> <NA>
## chr22 50782234-50782294 - | 17035 <NA> <NA>
## exon_rank
## <integer>
## <NA>
## <NA>
## <NA>
## <NA>
## <NA>
## . ...
## <NA>
## <NA>
## <NA>
## <NA>
## <NA>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## ...
## <3 more elements>
The fastest way to accurately define binding site with very little manual intervention is running the wrapper function BSFind()
. It will perform all necessary computations and will store all results and diagnostics internally. Various plotting functions can be used to make all sorts of diagnostics. Export functions can be used to extract resulting binding sites for further processing. Each step is covered in more detail in the Standard workflow section.
# run BSFind to compute binding sites
bdsQuick = BSFind(object = bds, anno.genes = gns, anno.transcriptRegionList = regions, est.subsetChromosome = "chr22")
# export output as .bed
exportToBED(bdsQuick, con = "./myBindingSites.bed")
To get a fast glance at the data-set with minimal user intervention that is also visually appealing, the quickFigure()
function exists.
# not run
quickFigure(bdsQuick, what = "main", save.filename = "./my_fig_main.pdf", save.device = "pdf")