Contents

1 Sample inference from amplicon sequencing data with dada2

Benjamin Callahan, Joey McMurdie, Susan Holmes

Statistics Department, Stanford University

Stanford, CA 94305, USA

benjaming.j.callahan@gmail.com

DADA2 Home Page

2 Introduction

The investigation of environmental microbial communities and microbiomes has been driven in significant part by the recent widespread adoption of amplicon sequencing. In amplicon sequencing a particular genetic locus is amplified from DNA extracted from the community of interest, and then sequenced on a next-generation sequencing platform. This technique removes the need to culture microbes in order to detect their presence, and cost-effectively provides a deep census of a microbial community.

However, the process of amplicon sequencing introduces errors into the DNA sequences being analyzed, and these errors severely complicate the interpretation of the results. DADA2 implements a novel algorithm that models the errors introduced during amplicon sequencing, and uses that error model to infer the true sample composition. DADA2 takes the place of the ubiquitous “OTU-picking” step in amplicon sequencing workflows. As demonstrated in our preprint and in further benchmarking, the DADA2 method provides both better sensitivity and specificity than OTU methods: DADA2 detect real biological variation missed by OTU methods while outputting fewer spurious sequences.

3 Overview of the dada2 pipeline

The starting point for the dada2 pipeline is a set of demultiplexed fastq files corresponding to the samples in your amplicon sequencing study. That is, dada2 expects there to be an individual fastq file for each sample (or two fastq files, one forward and one reverse, for each sample). Demultiplexing is often performed at the sequencing center, but if that has not been done there are a variety of tools do accomplish this, including the popular QIIME python scripts split_libraries_fastq.py followed by split_sequence_file_on_sample_ids.py, and the utility idemp, among others.

Once demultiplexed fastq files are in hand, the dada2 pipeline proceeds as follows:

  1. Filter and Trim: fastqFilter() or fastqPairedFilter()
  2. Dereplicate: derepFastq()
  3. Infer sample composition: dada()
  4. Merge paired reads: mergePairs()
  5. Make sequence table: makeSequenceTable()
  6. Remove chimeras: isBimeraDenovo() or removeBimeraDenovo()

The output of the dada2 pipeline is a sample-by-sequence matrix, with each entry corresponding to the number of times that inferred sample sequence was observed in that sample. This table is analogous to the common OTU table, except at higher resolution (exact sample sequences rather than 97% OTUs).

We’ll now go through that pipeline on a highly simplified dataset of just one paired-end sample (we’ll add a second later).

4 Filter and Trim

We’ll start by getting the filenames of our example paired-end fastq files. Usually you will define these filenames directly, or read them out of a directory, but for this tutorial we’re using files included with the package, which we can identify via a particular function call:

library(dada2); packageVersion("dada2")
## [1] '1.4.0'
fnF1 <- system.file("extdata", "sam1F.fastq.gz", package="dada2")
fnR1 <- system.file("extdata", "sam1R.fastq.gz", package="dada2")

filtF1 <- tempfile(fileext=".fastq.gz")
filtR1 <- tempfile(fileext=".fastq.gz")

Note that the dada2 package “speaks” the gzip format natively, all fastq files can remain in the space-saving gzip format throughout.

Now that we have the filenames, we’re going to inspect the quality profile of our data:

plotQualityProfile(fnF1) # Forward

plotQualityProfile(fnR1) # Reverse