RNAmodR.ML 1.20.0
Post-transcriptional modifications can be found abundantly in rRNA and tRNA and can be detected classically via several strategies. However, difficulties arise if the identity and the position of the modified nucleotides is to be determined at the same time. Classically, a primer extension, a form of reverse transcription (RT), would allow certain modifications to be accessed by blocks during the RT changes or changes in the cDNA sequences. Other modification would need to be selectively treated by chemical reactions to influence the outcome of the reverse transcription.
With the increased availability of high throughput sequencing, these classical methods were adapted to high throughput methods allowing more RNA molecules to be accessed at the same time. However, patterns of some modification cannot be detected by accessing small number of parameters. For these cases machine learning models can be trained on data from positions known to be modified in order to detect additional modified positions.
To extend the functionality of the RNAmodR
package and classical detection
strategies used for RiboMethSeq or AlkAnilineSeq (see RNAmodR.RiboMethSeq
and
RNAmodR.AlkAnilineSeq
packages) towards detection through machine learning
models, RNAmodR.ML
provides classes and an example workflow.
## Warning: replacing previous import 'utils::findMatches' by
## 'S4Vectors::findMatches' when loading 'ExperimentHubData'
library(rtracklayer)
library(GenomicRanges)
library(RNAmodR.ML)
library(RNAmodR.Data)
The ModifierML
class extends the Modifier
class from the RNAmodR
package
and adds one slot, mlModel
, a getter/setter getMLModel
/setMLModel
, an
additional useMLModel
function to be called from the aggregate
function.
The slot mlModel
can either be an empty character or contain a name of a
ModifierMLModel
class, which is loaded upon creation of a ModifierML
object,
and serves as a wrapper around a machine learning model. For different types of
machine learning models ModifierMLModel
derived classes are available, which
currently are:
ModifierMLranger
for models generated with the ranger
package
(Wright and Ziegler 2017)ModifierMLkeras
for models generated with the keras
package
(Allaire and Chollet 2018)To illustrate how to develop a machine learning model with help from the
RNAmodR.ML
package, an example is given below.
Modifier
classAs an example for this vignette, we will try to detect D positions in
AlkAnilineSeq data. First define a specific ModifierML
class loading pileup
and coverage data. In this example, the RNA specific RNAModifierML
class is
used.
setClass("ModMLExample",
contains = c("RNAModifierML"),
prototype = list(mod = c("D"),
score = "score",
dataType = c("PileupSequenceData",
"CoverageSequenceData"),
mlModel = character(0)))
# constructor function for ModMLExample
ModMLExample <- function(x, annotation = NA, sequences = NA, seqinfo = NA,
...){
RNAmodR:::Modifier("ModMLExample", x = x, annotation = annotation,
sequences = sequences, seqinfo = seqinfo, ...)
}
setClass("ModSetMLExample",
contains = "ModifierSet",
prototype = list(elementType = "ModMLExample"))
# constructor function for ModSetMLExample
ModSetMLExample <- function(x, annotation = NA, sequences = NA, seqinfo = NA,
...){
RNAmodR:::ModifierSet("ModMLExample", x, annotation = annotation,
sequences = sequences, seqinfo = seqinfo, ...)
}
Since the mlModel
slot contains an empty character, the creation of the object
will not automatically trigger a search for modifications. However, it will
aggregate the data in the format we want to use. The aggregate_example
function is just an example and the aggregation of the data is part of the
model building. (See (#Summary))
setMethod(
f = "aggregateData",
signature = signature(x = "ModMLExample"),
definition =
function(x){
aggregate_example(x)
}
)
To gather training data, we just create a ModMLExample
object and let it do
its job.
me <- ModMLExample(files[[1]], annotation, sequences)
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
## Loading Pileup data from BAM files ... OK
## Loading Coverage data from BAM files ... OK
## Aggregating data and calculating scores ... Starting to search for 'Dihydrouridine' ...
## Warning: ML model not set. Skipped search for modifications ...
## done.
Afterwards we need to load/create coordinates for positions known to be modified as well as positions known to be unmodified.
data("dmod",package = "RNAmodR.ML")
# we just select the next U position from known positions
nextUPos <- function(gr){
nextU <- lapply(seq.int(1L,2L),
function(i){
subseq <- subseq(sequences(me)[dmod$Parent], start(dmod)+3L)
pos <- start(dmod) + 2L +
vapply(strsplit(as.character(subseq),""),
function(y){which(y == "U")[i]},integer(1))
ans <- dmod
ranges(ans) <- IRanges(start = pos, width = 1L)
ans
})
nextU <- do.call(c,nextU)
nextU$mod <- NULL
unique(nextU)
}
nondmod <- nextUPos(dmod)
nondmod <- nondmod[!(nondmod %in% dmod)]
coord <- unique(c(dmod,nondmod))
coord <- coord[order(as.integer(coord$Parent))]
With these coordinates the aggregated data of the ModMLExample
can be subset
to a training data set using the function trainingData
.
trainingData <- trainingData(me,coord)
trainingData <- unlist(trainingData, use.names = FALSE)
# converting logical labels to integer
trainingData$labels <- as.integer(trainingData$labels)
How a specific model can be trained or what kind of strategies can be employed
to successfully train a model, is out of scope for the vignette. For this
example a random forest is trained using the functionality provided by the
ranger
package.
library(ranger)
model <- ranger(labels ~ ., data.frame(trainingData))
Now, the trained model can be used to create a new ModifierMLModel
class and
object.
setClass("ModifierMLexample",
contains = c("ModifierMLranger"),
prototype = list(model = model))
ModifierMLexample <- function(...){
new("ModifierMLexample")
}
mlmodel <- ModifierMLexample()
To be able to use the model via the ModifierMLModel
class, we also need to
define an accessor to the predictions made by the model. This function is called
useModel
and is already prefined for the ModifierMLModel
classes mentioned
above in secion Using RNAmodR.ML.
getMethod("useModel", c("ModifierMLranger","ModifierML"))
## Method Definition:
##
## function (x, y)
## {
## data <- getAggregateData(y)
## model <- x@model
## if (!is(model, "ranger")) {
## stop("model is not a ranger model")
## }
## unlisted_data <- unlist(data, use.names = FALSE)
## p <- predict(x@model, data.frame(unlisted_data))
## relist(p$predictions, data)
## }
## <bytecode: 0x630447e0bd50>
## <environment: namespace:RNAmodR.ML>
##
## Signatures:
## x y
## target "ModifierMLranger" "ModifierML"
## defined "ModifierMLranger" "ModifierML"
If the results of these function is not usable for a specific model, it can
redefined for your needs. As defined by RNAmodR.ML
the function returns a
NumericList
along the aggregated data of the ModifierML
object.
The generated ModifierMLexample
wrapper can now be set for the ModifierML
object using the setMLModel
function. (If a model is saved as part of a
package, this step ist not necessary, because it can be part of the class
definition)
setMLModel(me) <- mlmodel
In order for the prediction data to be usable, another function has to be
implemented to save the predictions in the aggregated data. The function is
called useMLModel
.
setMethod(f = "useMLModel",
signature = signature(x = "ModMLExample"),
definition =
function(x){
predictions <- useModel(getMLModel(x), x)
data <- getAggregateData(x)
unlisted_data <- unlist(data, use.names = FALSE)
unlisted_data$score <- unlist(predictions)
x@aggregate <- relist(unlisted_data,data)
x
}
)
By re-running the aggregate
function and force an update of the data, the
predictions are made and used to populate the score
column as outlined above.
me <- aggregate(me, force = TRUE)
During the model building phase some form of a performance measurement usually
is included. In addition to these model specific measurements, RNAmodR.ML
includes the functionality from the ROCR
package inherited from the RNAmodR
package. With this the performance of a model can evaluted over the training set
or any coordinates.
plotROC(me, dmod)