Example using Negative Binomial in Microbiome Differential Abundance Testing

In this example I use the publicly available data from a study on colorectal cancer:

Genomic analysis identifies association of Fusobacterium with colorectal carcinoma. Kostic, A. D., Gevers, D., Pedamallu, C. S., Michaud, M., Duke, F., Earl, A. M., et al. (2012). Genome research, 22(2), 292-298.

As a side-note, this work was published ahead of print in Genome Research alongside a highly-related article from a separate group of researchers (long-live reproducible observations!): Fusobacterium nucleatum infection is prevalent in human colorectal carcinoma. In case you are interested. For the purposes of example, however, we will stick to the data from the former study, with data available at the microbio.me/qiime server.

Study ID: 1457

Project Name: Kostic_colorectal_cancer_fusobacterium

Study Abstract: The tumor microenvironment of colorectal carcinoma is a complex community of genomically altered cancer cells, nonneoplastic cells, and a diverse collection of microorganisms. Each of these components may contribute to carcino genesis; however, the role of the microbiota is the least well understood. We have characterized the composition of the microbiota in colorectal carcinoma using whole genome sequences from nine tumor/normal pairs. Fusobacterium sequences were enriched in carcinomas, confirmed by quantitative PCR and 16S rDNA sequence analysis of 95 carcinoma/normal DNA pairs, while the Bacteroidetes and Firmicutes phyla were depleted in tumors. Fusobacteria were also visualized within colorectal tumors using FISH. These findings reveal alterations in the colorectal cancer microbiota; however, the precise role of Fusobacteria in colorectal carcinoma pathogenesis requires further investigation.

Import data with phyloseq, convert to DESeq2

Start by loading phyloseq.

library("phyloseq")
packageVersion("phyloseq")
## [1] '1.9.2'

Defined file path, and import the published OTU count data into R.

filepath = system.file("extdata", "study_1457_split_library_seqs_and_mapping.zip", 
    package = "phyloseq")
kostic = microbio_me_qiime(filepath)
## Found biom-format file, now parsing it... 
## Done parsing biom... 
## Importing Sample Metdadata from mapping file...
## Merging the imported objects... 
## Successfully merged, phyloseq-class created. 
##  Returning...

Here I had to use a relative file path so that this example works on all systems that have phyloseq installed. In practice, your file path will look like this (if you've downloaded the data ahead of time):

filepath = "~/Downloads/study_1457_split_library_seqs_and_mapping.zip"
kostic = microbio_me_qiime(filepath)

Or like this (if you're accessing data directly from the microbio.me/qiime server directly):

kostic = microbio_me_qiime(1457)

Convert to DESeq2's DESeqDataSet class

In this example I'm using the major sample covariate, DIAGNOSIS, as the study design factor. The focus of this study was to compare the microbiomes of pairs of healthy and cancerous tissues, so this makes sense. Your study could have a more complex or nested design, and you should think carefully about the study design formula, because this is critical to the test results and their meaning. You might even need to define a new factor if none of the variables in your current table appropriately represent your study's design. See the DESeq2 home page for more details.

Here is the summary of the data variable kostic that we are about to use, as well as the first few entries of the DIAGNOSIS factor.

kostic
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 2505 taxa and 190 samples ]
## sample_data() Sample Data:       [ 190 samples by 71 sample variables ]
## tax_table()   Taxonomy Table:    [ 2505 taxa by 7 taxonomic ranks ]
head(sample_data(kostic)$DIAGNOSIS, 25)
##  [1] Healthy Tumor   Tumor   Healthy Healthy Healthy Tumor   Healthy
##  [9] Healthy Healthy Healthy Healthy Healthy Healthy Healthy Tumor  
## [17] Healthy Healthy Healthy Healthy Healthy Tumor   Tumor   Tumor  
## [25] Healthy
## Levels: Healthy None Tumor

DESeq2 conversion and call

First load DESeq2.

library("DESeq2")
packageVersion("DESeq2")
## [1] '1.5.0'

The following two lines actually do all the complicated DESeq2 work. The function phyloseq_to_deseq2 converts your phyloseq-format microbiome data into a DESeqDataSet with dispersions estimated, using the experimental design formula, also shown (the ~DIAGNOSIS term). The DESeq function does the rest of the testing, in this case with default testing framework, but you can actually use alternatives.

diagdds = phyloseq_to_deseq2(kostic, ~DIAGNOSIS)
diagdds = DESeq(diagdds)  #, fitType='local')
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## Warning: the parametric fit of dispersion estimates over the mean of counts
## failed, which occurs when the trend is not well captured by the
## function y = a/x + b. A local regression fit is automatically performed,
## and the analysis can continue. You can specify fitType='local' or 'mean'
## to avoid this message if re-running the same data.
## When using local regression fit, the user should examine plotDispEsts(dds)
## to make sure the fitted line is not sharply curving up or down based on
## the position of individual points.
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 337 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds))
## estimating dispersions
## fitting model and testing

Note: The default multiple-inference correction is Benjamini-Hochberg, and occurs within the DESeq function.

Investigate test results table

The following results function call creates a table of the results of the tests. Very fast. The hard work was already stored with the rest of the DESeq2-related data in our latest version of the diagdds object (see above). I then order by the adjusted p-value, removing the entries with an NA value. The rest of this example is just formatting the results table with taxonomic information for nice(ish) display in the HTML output.

res = results(diagdds)
res = res[order(res$padj, na.last = NA), ]
alpha = 0.01
sigtab = res[(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(tax_table(kostic)[rownames(sigtab), 
    ], "matrix"))
head(sigtab)
##        baseMean log2FoldChange  lfcSE   stat    pvalue      padj  Kingdom
## 592346    3.777         -1.661 0.2565 -6.474 9.548e-11 4.191e-08 Bacteria
## 72853    31.679         -1.654 0.2660 -6.217 5.058e-10 1.110e-07 Bacteria
## 168181    4.601         -1.376 0.2263 -6.080 1.205e-09 1.763e-07 Bacteria
## 274244    7.989         -1.637 0.2739 -5.976 2.287e-09 2.510e-07 Bacteria
## 175685    5.300         -1.259 0.2247 -5.604 2.094e-08 1.839e-06 Bacteria
## 186029   10.126         -1.345 0.2420 -5.558 2.727e-08 1.995e-06 Bacteria
##                Phylum                  Class             Order
## 592346 Proteobacteria    Gammaproteobacteria Enterobacteriales
## 72853      Firmicutes             Clostridia     Clostridiales
## 168181  Bacteroidetes            Bacteroidia     Bacteroidales
## 274244  Bacteroidetes            Bacteroidia     Bacteroidales
## 175685     Firmicutes             Clostridia     Clostridiales
## 186029 Actinobacteria Actinobacteria (class)  Coriobacteriales
##                    Family            Genus                 Species
## 592346 Enterobacteriaceae       Klebsiella                    <NA>
## 72853     Ruminococcaceae Faecalibacterium                    <NA>
## 168181      Rikenellaceae        Alistipes                    <NA>
## 274244               <NA>             <NA>                    <NA>
## 175685    Lachnospiraceae             <NA>                    <NA>
## 186029  Coriobacteriaceae      Collinsella Collinsella aerofaciens

Let's look at just the OTUs that were significantly enriched in the carcinoma tissue. First, cleaning up the table a little for legibility.

posigtab = sigtab[sigtab[, "log2FoldChange"] > 0, ]
posigtab = posigtab[, c("baseMean", "log2FoldChange", "lfcSE", "padj", "Phylum", 
    "Class", "Family", "Genus")]
OTU baseMean log2FoldChange lfcSE padj Phylum Class Family Genus
307981 3.764 1.2678 0.2361 4.308e-06 Proteobacteria Gammaproteobacteria Enterobacteriaceae Klebsiella
52289 1.920 1.0854 0.2250 2.671e-05 Firmicutes Clostridia Lachnospiraceae Clostridium
88701 1.434 0.9080 0.2116 1.950e-04 Bacteroidetes Bacteroidia Prevotellaceae Prevotella
110842 2.051 1.0159 0.2419 2.626e-04 Firmicutes Clostridia Catabacteriaceae NA
9778 2.591 0.9073 0.2215 3.627e-04 Proteobacteria Gammaproteobacteria Enterobacteriaceae Salmonella
551781 1.640 0.9394 0.2308 3.882e-04 Firmicutes Clostridia Ruminococcaceae NA
74391 2.111 0.9177 0.2372 7.146e-04 Fusobacteria Fusobacteria (class) Fusobacteriaceae Leptotrichia
28952 2.218 0.8417 0.2372 2.098e-03 Bacteroidetes Bacteroidia Prevotellaceae Prevotella
249899 1.584 0.7596 0.2259 3.347e-03 Firmicutes Clostridia Peptococcaceae Peptococcus
298806 1.508 0.6234 0.1922 4.797e-03 Actinobacteria Actinobacteria (class) Coriobacteriaceae Collinsella
470451 3.459 0.8098 0.2517 5.025e-03 Fusobacteria Fusobacteria (class) Fusobacteriaceae Fusobacterium
359171 2.027 0.7008 0.2182 5.030e-03 Firmicutes Clostridia Clostridiales Family XI. Incertae Sedis Anaerococcus
470243 3.317 0.7808 0.2566 7.614e-03 Proteobacteria Betaproteobacteria Neisseriaceae Eikenella

As expected from the original study abstract and title, a Fusobacterium OTU was most-significantly differentially abundant between the cancerous and healthy samples.

Here is a bar plot showing the log2-fold-change, showing Genus and Phylum. Uses some ggplot2 commands.

library("ggplot2")
theme_set(theme_bw())
scale_fill_discrete <- function(palname = "Set1", ...) {
    scale_fill_brewer(palette = palname, ...)
}
sigtabgen = subset(sigtab, !is.na(Genus))
# Phylum order
x = tapply(sigtabgen$log2FoldChange, sigtabgen$Phylum, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Phylum = factor(as.character(sigtabgen$Phylum), levels = names(x))
# Genus order
x = tapply(sigtabgen$log2FoldChange, sigtabgen$Genus, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Genus = factor(as.character(sigtabgen$Genus), levels = names(x))
ggplot(sigtabgen, aes(x = Genus, y = log2FoldChange, fill = Phylum)) + geom_point(size = 6) + 
    theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5))

plot of chunk bar-plot