# Contents

Paul J. McMurdie and Susan Holmes

mcmurdie@stanford.edu

If you find phyloseq and/or its tutorials useful, please acknowledge and cite phyloseq in your publications:

phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data (2013) PLoS ONE 8(4):e61217 http://dx.plos.org/10.1371/journal.pone.0061217

# 1 Other resources

The phyloseq project also has a number of supporting online resources, most of which can by found at the phyloseq home page, or from the phyloseq stable release page on Bioconductor.

To post feature requests or ask for help, try the phyloseq Issue Tracker.

# 2 The experimental data used in this example

In this example I use the publicly available data from a study on colorectal cancer:

Genomic analysis identifies association of Fusobacterium with colorectal carcinoma. Kostic, A. D., Gevers, D., Pedamallu, C. S., Michaud, M., Duke, F., Earl, A. M., et al. (2012). Genome research, 22(2), 292-298.

As a side-note, this work was published ahead of print in Genome Research alongside a highly-related article from a separate group of researchers (long-live reproducible observations!): Fusobacterium nucleatum infection is prevalent in human colorectal carcinoma. In case you are interested. For the purposes of example, however, we will stick to the data from the former study, with data available at the microbio.me/qiime server.

Data source, from methods section in article:

The 16S gene data set consists of 454 FLX Titanium sequences spanning the V3 to V5 variable regions obtained for 190 samples (95 pairs). Detailed protocols used for 16S amplification and se- quencing are available on the HMP Data Analysis and Coordination Center website (http://www.hmpdacc.org/tools_protocols/tools_ protocols.php).

Study ID: 1457

Project Name: Kostic_colorectal_cancer_fusobacterium

Study Abstract:

The tumor microenvironment of colorectal carcinoma is a complex community of genomically altered cancer cells, nonneoplastic cells, and a diverse collection of microorganisms. Each of these components may contribute to carcino genesis; however, the role of the microbiota is the least well understood. We have characterized the composition of the microbiota in colorectal carcinoma using whole genome sequences from nine tumor/normal pairs. Fusobacterium sequences were enriched in carcinomas, confirmed by quantitative PCR and 16S rDNA sequence analysis of 95 carcinoma/normal DNA pairs, while the Bacteroidetes and Firmicutes phyla were depleted in tumors. Fusobacteria were also visualized within colorectal tumors using FISH. These findings reveal alterations in the colorectal cancer microbiota; however, the precise role of Fusobacteria in colorectal carcinoma pathogenesis requires further investigation.

# 3 Import data with phyloseq, convert to DESeq2

library("phyloseq"); packageVersion("phyloseq")
## [1] '1.25.0'

Defined file path, and import the published OTU count data into R.

filepath = system.file("extdata", "study_1457_split_library_seqs_and_mapping.zip", package="phyloseq")
kostic = microbio_me_qiime(filepath)
## Found biom-format file, now parsing it...
## Done parsing biom...
## Importing Sample Metdadata from mapping file...
## Merging the imported objects...
## Successfully merged, phyloseq-class created.
##  Returning...

Here I had to use a relative file path so that this example works on all systems that have phyloseq installed. In practice, your file path will look like this (if you’ve downloaded the data ahead of time):

filepath = "~/Downloads/study_1457_split_library_seqs_and_mapping.zip"
kostic = microbio_me_qiime(filepath)

Or like this (if you’re accessing data directly from the microbio.me/qiime server directly):

kostic = microbio_me_qiime(1457)

# 4 Convert to DESeq2’s DESeqDataSet class

In this example I’m using the major sample covariate, DIAGNOSIS, as the study design factor. The focus of this study was to compare the microbiomes of pairs of healthy and cancerous tissues, so this makes sense. Your study could have a more complex or nested design, and you should think carefully about the study design formula, because this is critical to the test results and their meaning. You might even need to define a new factor if none of the variables in your current table appropriately represent your study’s design. See the DESeq2 home page for more details.

Here is the summary of the data variable kostic that we are about to use, as well as the first few entries of the DIAGNOSIS factor.

kostic
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 2505 taxa and 190 samples ]
## sample_data() Sample Data:       [ 190 samples by 71 sample variables ]
## tax_table()   Taxonomy Table:    [ 2505 taxa by 7 taxonomic ranks ]
alpha = 0.01
sigtab = res[(res$padj < alpha), ] sigtab = cbind(as(sigtab, "data.frame"), as(tax_table(kostic)[rownames(sigtab), ], "matrix")) head(sigtab) ## baseMean log2FoldChange lfcSE stat pvalue ## 64396 181.564648 2.195547 0.4282121 5.127241 2.940185e-07 ## 72853 28.633097 -1.604873 0.3310712 -4.847515 1.250173e-06 ## 374052 75.002469 2.523486 0.5411855 4.662885 3.118073e-06 ## 307981 3.258919 2.153038 0.4832019 4.455773 8.359162e-06 ## 180285 168.356513 -1.204357 0.2782362 -4.328542 1.501000e-05 ## padj Kingdom Phylum Class ## 64396 0.0007238735 Bacteria Fusobacteria Fusobacteria (class) ## 72853 0.0015389626 Bacteria Firmicutes Clostridia ## 374052 0.0025588983 Bacteria Fusobacteria Fusobacteria (class) ## 307981 0.0051450643 Bacteria Proteobacteria Gammaproteobacteria ## 180285 0.0073909224 Bacteria Firmicutes Clostridia ## Order Family Genus Species ## 64396 Fusobacteriales Fusobacteriaceae Fusobacterium <NA> ## 72853 Clostridiales Ruminococcaceae Faecalibacterium <NA> ## 374052 Fusobacteriales Fusobacteriaceae Fusobacterium <NA> ## 307981 Enterobacteriales Enterobacteriaceae Klebsiella <NA> ## 180285 Clostridiales Ruminococcaceae Faecalibacterium <NA> Let’s look at just the OTUs that were significantly enriched in the carcinoma tissue. First, cleaning up the table a little for legibility. posigtab = sigtab[sigtab[, "log2FoldChange"] > 0, ] posigtab = posigtab[, c("baseMean", "log2FoldChange", "lfcSE", "padj", "Phylum", "Class", "Family", "Genus")] OTU baseMean log2FoldChange lfcSE padj Phylum Class Family Genus 64396 181.564648 2.195547 0.4282121 0.0007238735 Fusobacteria Fusobacteria (class) Fusobacteriaceae Fusobacterium 374052 75.002469 2.523486 0.5411855 0.0025588983 Fusobacteria Fusobacteria (class) Fusobacteriaceae Fusobacterium 307981 3.258919 2.153038 0.4832019 0.0051450643 Proteobacteria Gammaproteobacteria Enterobacteriaceae Klebsiella As expected from the original study abstract and title, a Fusobacterium OTU was among the most-significantly differentially abundant between the cancerous and healthy samples. # 7 Plot Results Here is a bar plot showing the log2-fold-change, showing Genus and Phylum. Uses some ggplot2 commands. library("ggplot2") theme_set(theme_bw()) sigtabgen = subset(sigtab, !is.na(Genus)) # Phylum order x = tapply(sigtabgen$log2FoldChange, sigtabgen$Phylum, function(x) max(x)) x = sort(x, TRUE) sigtabgen$Phylum = factor(as.character(sigtabgen$Phylum), levels=names(x)) # Genus order x = tapply(sigtabgen$log2FoldChange, sigtabgen$Genus, function(x) max(x)) x = sort(x, TRUE) sigtabgen$Genus = factor(as.character(sigtabgen\$Genus), levels=names(x))
ggplot(sigtabgen, aes(y=Genus, x=log2FoldChange, color=Phylum)) +
geom_vline(xintercept = 0.0, color = "gray", size = 0.5) +
geom_point(size=6) +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5))