#Abstract

A number of recently developed next-generation sequencing based methods (e.g. PAR-CLIP, Bisulphite sequencing, RRBS) specifically induce nucleotide substitutions within the short reads with respect to the reference genome. wavClusteR provides functions for the analysis of the data obtained by such methods with a major focus on PAR-CLIP. The package leverages on experimentally induced substitutions to identify high confidence signals (e.g. RNA-binding sites) in the data. A wavClusteR workflow consists of two steps:

  1. The estimation of a non-parametric two-component mixture model to identify substitution frequencies that are likely to be affected by the experimental procedure
  2. The identification of binding sites (clusters).

The package supports multicore computing. For a detailed description of the method please refer to Sievers et al., 2012; Comoglio et al., 2015.

#Preparing the input

wavClusteR expects a sorted and indexed BAM file as input. A streamlined workflow to generate the required input from a fastq file is outlined below (line 1 is pseudo code, replace it with the aligner specific syntax).

    #ALIGN: 
        sample.fastq -> sample.sam
    #CONVERT: 
        samtools view -b -S sample.sam -o sample.bam
    #SORT: 
        samtools sort sample.bam sample_sorted
    #INDEXING: 
        samtools index sample_sorted.bam
  1. Short read alignment to the reference genome using relaxed alignment parameters. Experimentally-induced transitions will otherwise impede alignment of informative reads. Currently, wavClusteR has been tested with bowtie/Bowtie2 Langmead et al., 2009; Langmead and Salzberg, 2012
  2. Conversion of the alignment file from SAM to BAM format, e.g. using samtools view from SAMtools
  3. Sorting of the BAM file, e.g. using samtools sort
  4. Indexing of the sorted BAM file, e.g. with samtools index.

##Example dataset

wavClusteR provides a chunk of a human Argonaute 2 (AGO2) PAR-CLIP data set from Kishore et al., 2011. Reads in the chunk map to the genomic interval chrX:23996166-24023263. This data set is used below to illustrate a workflow for PAR-CLIP data analysis.

#A workflow for PAR-CLIP data analysis

##Reading sorted BAM files

A sorted and indexed BAM file can be loaded into the R session with readSortedBam. This function extracts aligned reads, sequences and the mismatch MD field, and returns a GRanges object.

library(wavClusteR)
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, basename, cbind, colMeans, colSums, colnames,
##     dirname, do.call, duplicated, eval, evalq, get, grep, grepl,
##     intersect, is.unsorted, lapply, lengths, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, rank, rbind,
##     rowMeans, rowSums, rownames, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which, which.max, which.min
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
## 
##     expand.grid
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: Rsamtools
## Loading required package: Biostrings
## Loading required package: XVector
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
filename <- system.file( "extdata", "example.bam", package = "wavClusteR" )
Bam <- readSortedBam(filename = filename)
## Warning in doTryCatch(return(expr), name, parentenv, handler): length of NULL
## cannot be changed
Bam
## GRanges object with 5358 ranges and 2 metadata columns:
##          seqnames            ranges strand |                    qseq
##             <Rle>         <IRanges>  <Rle> |          <DNAStringSet>
##      [1]     chrX 24001819-24001844      - | CAGAGATAAA...TATTTTAAAG
##      [2]     chrX 24001819-24001843      - | CAGAGATAAA...ATATTTTAAG
##      [3]     chrX 24001834-24001863      - | ATATTTTAGA...ATTTTATTTA
##      [4]     chrX 24001836-24001865      - | TTTTTAAAGA...TTTATTTAAA
##      [5]     chrX 24001841-24001876      - | AAAGATTAAA...TTTTCTTCAT
##      ...      ...               ...    ... .                     ...
##   [5354]     chrX 24023018-24023051      - | GTTTCACAGC...AAAAATATGT
##   [5355]     chrX 24023018-24023051      - | GTTTCACAGC...AAAAATATGT
##   [5356]     chrX 24023019-24023051      - | TTTCACAGCG...AAAAATATGT
##   [5357]     chrX 24023019-24023051      - | TTTCACAGCG...AAAAATATGT
##   [5358]     chrX 24023067-24023090      - | CAAAGGCGCG...GGTTTATTTT
##               tag.MD
##          <character>
##      [1]          26
##      [2]        24A0
##      [3]        8A21
##      [4]     0A13A15
##      [5]       24A11
##      ...         ...
##   [5354]       10A23
##   [5355]       10A23
##   [5356]        9A23
##   [5357]        9A23
##   [5358]        9A14
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths

##Extracting informative genomic positions

Prior to model fitting, genome-wide substitutions need to be identified and filtered based on a coverage threshold. The getAllSub function identifies all genomic positions exhibiting at least one substitution and coverage above this threshold.

countTable <- getAllSub( Bam, minCov = 10 )
## Loading required package: doMC
## Loading required package: foreach
## Loading required package: iterators
## Considering substitutions, n = 497, processing in 1 chunks
##    chunk #: 1
##    considering the + strand
## Computing local coverage at substitutions...
##    considering the - strand
## Computing local coverage at substitutions...
head( countTable )
## GRanges object with 6 ranges and 3 metadata columns:
##       seqnames    ranges strand | substitutions  coverage     count
##          <Rle> <IRanges>  <Rle> |   <character> <numeric> <integer>
##   [1]     chrX  24001959      - |            TC        17         2
##   [2]     chrX  24001973      - |            TC        17        12
##   [3]     chrX  24001977      - |            TC        13         1
##   [4]     chrX  24002046      - |            TC        10         1
##   [5]     chrX  24002057      - |            TC        10         6
##   [6]     chrX  24002147      - |            TC        22         3
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

The function returns a GRanges object specifying genomic position, strand, substitution type (e.g. “TC”: T in the reference genome; C in the read), strand-specific coverage and number of observed substitutions at the position.

###How to choose the coverage threshold?

The coverage threshold minCov defines the genomic positions to be used for parameter estimation, thus providing a means to tune the stringency of the analysis. Currently, wavClusteR does not allow to learn this threshold from the data. However, since the model is based on relative substitution frequencies, the value of minCov will influence the variance of the estimated parameters. Therefore, values smaller than default (minCov=10) are not recommended.

##Inspecting the substitutions profile (diagnostic I)

Once all substitutions are computed, a diagnostic substitution profile can be plotted with plotSubstitutions.

plotSubstitutions( countTable, highlight = "TC" )

The function returns a barplot showing the total number of genomic positions that exhibit a given type of substitution and highlights the substitution type that is expected to be generated by the experimental procedure. The percentage of substitution of this type is also shown. This plot can be readily used to assess the quality of the sequenced libraries and can be used to compare different data sets generated under the same experimental condition.

##Estimating the model

wavClusteR uses the identified genomic positions to learn a non-parametric mixture model discriminating PAR-CLIP-specific from extrinsic transitions. Model parameters are estimated by fitMixtureModel.

model <- fitMixtureModel(countTable, substitution = "TC")

The function returns a list of:

As the small size of our example data set does not to estimate the model reliably, the mixture model for the entire AGO2 dataset has been precomputed and is provided by the package.

data(model)
str(model)
## List of 5
##  $ l1: Named num 0.181
##   ..- attr(*, "names")= chr "TC"
##  $ l2: Named num 0.819
##   ..- attr(*, "names")= chr "TC"
##  $ p : num [1:999] 7.52 9.44 10.05 10.38 10.48 ...
##  $ p1: num [1:999] 89.6 64.4 50.4 41.5 35.3 ...
##  $ p2: num [1:999] 0 0 1.14 3.51 5 ...

##Inspecting the model fit (diagnostic II)

The model fit can be inspected using getExpInterval.

(support <- getExpInterval( model, bayes = TRUE ) )

## $supportStart
## [1] 0.007
## 
## $supportEnd
## [1] 0.98

The function returns two diagnostic plots. The first plot illustrates the estimated densities \(p\), \(p_1\) and \(p_2\), and log odds

\[ o=

The second plot shows the resulting posterior class probability, i.e. the probability that a given relative substitution frequency (RSF, horizontal axis) has been induced by PAR-CLIP. The area under the curve corresponding to the returned RSF interval is colored, and the RSF interval indicated. By default, getExpInterval returns the RSF interval according to the Bayes classifier, i.e. a posterior probability cutoff of 0.5. However, the user can specify a custom RSF interval:

  1. By means of the rightProb and leftProb parameters, e.g.

    (support <- getExpInterval( model, bayes = FALSE, leftProb = 0.9, rightProb =   0.9 ) )

    ## $supportStart
    ## [1] 0.076
    ## 
    ## $supportEnd
    ## [1] 0.905
  2. By inspecting the posterior class probability density and passing the RSF interval boundaries when calling high-confidence substitutions (see point 6).

Finally, the model can be used to produce further diagnostic plots. Particularly, the total number of reads exhibitng a given substitution and an RSF-based partitioning of genomic positions with substitutions can be obtained by

plotSubstitutions( countTable, highlight = "TC", model )