Contents

1 Introduction

This package is intended to provide tools for the quality assessment of data produced by Oxford Nanopore’s MinION sequencer. It includes a functions to generate a number plots for examining the statistics that we think will be useful for this task.

However, nanopore sequencing is an emerging and rapidly developing technology. It is not clear what will be most informative. We hope that IONiseR will provide a framework for visualisation of metrics that we haven’t thought of, and welcome feedback at mike.smith@embl.de.

If you’re not interested in the quality assement of the raw or event level data, and want to jump straight to the getting FASTQ format files from fast5 files you can go straight to the final section of this document.

1.1 Getting started

In order to get started we need to load the IONiseR library. In addition also load ggplot2 and gridExtra, which are useful for arranging the plots in this vignette, but they are not essential to using the package.

library(IONiseR)
library(ggplot2)
library(gridExtra)

2 Reading data

Once the libraries are loaded we need to read some data. We do this using the function readFast5Summary(). This function takes a vector containing the path names of fast5 files you’d like to read. The example below looks in a specific folder and selects all of the files whose name ends with “.fast5”. We then pass this list of files to the reading function.

You should replace “/path/to/data” with the location of your fast5 files.

fast5files <- list.files(path = "/path/to/data/", pattern = ".fast5$", full.names = TRUE)
example.summary <- readFast5Summary( fast5files )

Raw fast5 data isn’t distributed with this package, but example of the summarised format can found in the accompanying minionSummaryData package. The following command will load this data, giving us an object called s.typhi.rep1. If you have your own MinION data you wish to work with you should ignore this section of code and modify the example above to read your own files.

The data presented in this example are taken from the publication by Ashton et al *(Ashton et al. 2015). You can obtain the original data from the European Nucleotide Archive here: http://www.ebi.ac.uk/ena/data/view/ERR668747*

library(minionSummaryData)
data(s.typhi.rep1)

Typing the name of the resulting object will print a short summary of its contents to the screen.

s.typhi.rep1
## Object of class: Fast5Summary
## Contains information from:
##   9502 fast5 files
##   |- 8209 template strands
##   |- 4454 complement strands
##   |- 3738 full 2D reads
##   |- 2159 pass reads

2.1 The Fast5Summary class

The s.typhi.rep1 object is an example of a the Fast5Summary class.

The structure of the class tries to reflect the variety of data one can find in fast5 files. Depending upon the quality of the data basecalling may not be successful resulting in fast5 files that are essentially empty or files that contain event information but no called bases. More commonly, basecalling is successful to some extent, but there is still a range of possibilities including files with only a template strand called, files with template and complement strands of differening lengths, and ideally files with well matched template and complement strands, plus a consensus 2D read.

It is possible to read fast5 files that have not been basecalled, but since in most cases uploading to the Metrichor basecaller is performed automatically, it is unlikely that this will be a common usecase.

The Fast5Summary class has four slots, each of which is designed to store data relating to one level of processing described above. The data themselves are stored as either as a data.frame or a ShortReadQ, one per slot. The table below gives the names of the four slots, along with a description of stage within the base calling process this represents and the specific fields that are currently stored in the appropriate data.frame.

Fast5Summary-class
readInfo - All fast5 files contain this level of information
File name, channel, mux, pass/fail status
eventData - If events were recorded this level is populated
Mean signal, start time, duration, no. of events
baseCalled - Created if base calling succeeded. Separate entries for each strand
Start time, duration, no. of events, strand, 2D status
fastq - Up to three entries per file (template, complement and 2D reads)
Bases and qualities

2.1.1 Extracting data

Data in the four slots can be obtained using the appropriate accessor methods: readInfo(), eventData(), baseCalled() and fastq(). The example below extracts the base called from our example object.

baseCalled(s.typhi.rep1)
## # A tibble: 12,663 x 6
##       id num_events duration start_time   strand full_2D
##    <int>      <int>    <dbl>      <dbl>    <chr>   <lgl>
##  1     1      10278 369.7328   4960.870 template    TRUE
##  2     2       5793 205.6166   6451.166 template    TRUE
##  3     3      11491 307.3706   6870.268 template    TRUE
##  4     4        115   2.3708   7547.124 template   FALSE
##  5     6       1051  51.5772   7561.986 template    TRUE
##  6     8      21384 543.7290   8127.883 template   FALSE
##  7     9       1065  24.2394   9002.287 template    TRUE
##  8    10       4504  74.0992   2166.572 template   FALSE
##  9    11       7438 205.1246   9069.968 template    TRUE
## 10    12       1497  76.5008   9572.040 template   FALSE
## # ... with 12,653 more rows

2.1.2 Subsetting

Since the the number of entries in each slot can be different between a specific reads, the id column is present in all entires corresponds to the fast5 file data was read from. Subsetting operations work relative to this id field, so all data from the selected files is retained. In the example below we select two files and can see that both the template and complement read information is retained in the baseCalled slot.

baseCalled(s.typhi.rep1[1:2])
## # A tibble: 4 x 6
##      id num_events duration start_time     strand full_2D
##   <int>      <int>    <dbl>      <dbl>      <chr>   <lgl>
## 1     1      10278 369.7328   4960.870   template    TRUE
## 2     2       5793 205.6166   6451.166   template    TRUE
## 3     1       8636 223.2206   5333.229 complement    TRUE
## 4     2       4796 168.4536   6659.120 complement    TRUE

3 Summary plots

3.1 Data production

Once data have been read into a summary object IONiseR contains a number of functions for plotting the data. The first example below visualises the accumulation of reads over the run time. The second plot shows the how many channels were active (i.e. the number of molecules being read) during each minute of the experiment.

p1 <- plotReadAccumulation(s.typhi.rep1)
p2 <- plotActiveChannels(s.typhi.rep1)
grid.arrange(p1, p2, ncol = 2)

3.2 Read types

We may also be interested in the proportion of reads types that were generated. Ideally, Oxford Nanopore’s sequencing technology works by reading both the template and complement strands of a double-stranded DNA molecule. The readings from both strands are then combined to give a higher confidence consensus sequence for the whole fragment - referred to as a 2D read.

Given the nature of this process, there is a strict hierarchy to the data that can be found in a fast5 file. A full 2D read requires both a complement and template strand to have been read correctly. Similarly, a complement strand can only be present if the template was read successfully. Finally, you can encounter a file containing no called bases on either strand.

The function plotReadCategories() will visualise the total number of fast5 files summarised in an Fast5Summary object, along with the counts of those containing template, complement and 2D calls. For an ideal dataset all four bars will be the same height, and the difference between them can reflect the quality of a dataset. These values are the same as those printed out when typing in the name of a summary data object.

It may also be interesting to examine the base quality scores for the reads in the three categories. The function plotReadCategoryQuals() allows one to do this, calculating the mean quality score for each sequence.

p1 <- plotReadCategoryCounts(s.typhi.rep1)
p2 <- plotReadCategoryQuals(s.typhi.rep1)
grid.arrange(p1, p2, ncol = 2)

3.3 Reading rates

We can also look the performance of the pores over time. For example, we may be interested in how rapidly events occur, which should be analagous to the rate at which molecules move through the nanopores. The function plotEventRate() visualises this.

In similar fashion, we can also look at the rate at which actual nucleotide bases are called changes over time using plotBaseProductionRate(). One would expect this to be closely related to the event rate (since each event should correspond to a base moving through the pore), however it is possible to envisage an scenario where events are recorded, but for some reason the base caller struggles to interpret them.

p1 <- plotEventRate(s.typhi.rep1)
p2 <- plotBaseProductionRate(s.typhi.rep1)
grid.arrange(p1, p2, ncol = 2)