1 Introduction

MTseeker works best when given some interesting mitochondrial data to work with. Renal oncocytomas are great big pink cells that are jammed full of defective mitochondria, and sometimes progress to genomically unstable chromophobe renal cell carcinomas (kidney cancer) in unlucky hosts. Nobody seems to be entirely sure what role mitochondrial variants play in their evolution, but the cells have thousands of mitochondria stuffed into them. So that’s what we’ll study.

2 Loading data

First we needed to load the oncocytoma BAMs. We don’t actually do this here, since they are several gigabytes apiece, but notice that all of them have been aligned with BWA against the canonical rCRS mitogenome by splicing it into hg19. (As opposed to GRCh37, which is what we should have done… but the point is that any modern GRCh assembly or a spliced rCRS contig works.)

library(MTseeker)
## Loading required package: viridis
## Loading required package: viridisLite
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, basename, cbind, colMeans, colSums, colnames,
##     dirname, do.call, duplicated, eval, evalq, get, grep, grepl,
##     intersect, is.unsorted, lapply, mapply, match, mget, order,
##     paste, pmax, pmax.int, pmin, pmin.int, rank, rbind, rowMeans,
##     rowSums, rownames, sapply, setdiff, sort, table, tapply, union,
##     unique, unsplit, which, which.max, which.min
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
## 
##     expand.grid
## Loading required package: GenomeInfoDb
## Loading required package: IRanges
## Loading required package: GenomicAlignments
## Loading required package: GenomicRanges
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: DelayedArray
## Loading required package: matrixStats
## 
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians
## Loading required package: BiocParallel
## 
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following objects are masked from 'package:base':
## 
##     aperm, apply
## Loading required package: Biostrings
## Loading required package: XVector
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:DelayedArray':
## 
##     type
## The following object is masked from 'package:base':
## 
##     strsplit
## Loading required package: Rsamtools
## Loading required package: VariantAnnotation
## 
## Attaching package: 'VariantAnnotation'
## The following object is masked from 'package:base':
## 
##     tabulate
## 
## 
if (FALSE) { 
  # we use SamBlaster... a lot... in my lab.
  # however, this example takes a while even with SamBlaster. 
  # it is recorded here for posterity and also "how did you get that result". 
  BAMfiles <- grep("(split|disc)", value=T, invert=T, list.files(patt=".bam$"))
  names(BAMfiles) <- sapply(strsplit(BAMfiles, "\\."), `[`, 1)
  BAMs <- data.frame(BAM=BAMfiles, 
                     Sample_Group=ifelse(grepl("NKS", BAMfiles), 
                                         "normal","tumor"))
  rownames(BAMs) <- sub("NKS", "normal", sub("RO","oncocytoma", rownames(BAMs)))
  BAMs$subject <- as.integer(sapply(strsplit(BAMs$BAM, "(_|\\.)"), `[`, 2))

  # we merged all the BAMs after-the-fact, so...
  BAMs <- subset(BAMs, grepl("merged", BAMs$BAM))
  BAMs <- BAMs[order(BAMs$subject), ]

  library(parallel) 
  options("mc.cores"=detectCores())
  MTreads <- getMT(BAMs, filter=FALSE) 
  names(MTreads) <- sapply(strsplit(fileName(MTreads), "\\."), `[`, 1)
  saveRDS(MTreads, file="oncocytoma_and_matched_normal_MTreads.rds")
}

Since realigning 22 whole exomes and extracting/counting reads takes a while, we created the MTseekerData package to hold the output from doing the above. The RONKSreads and RONKSvariants data objects hold aligned reads (as an MAlignments object) and called variants (as an MVRangesList object) comparing _R_enal _O_ncocytomas and _N_ormal _K_idney _S_amples from 11 subjects who developed this premalignant neoplasia of the kidney. Most such cases seem to be self-limiting, perhaps due to their defective mitophagy, and they accumulate thousands of excess mitochondria with Complex I defects. If these tumors do become malignant, they tend to turn into chromophobe renal cell carcinomas, a type of kidney cancer that is characterized by aneuploidy of all or nearly all chromosomes. So the relationship between a self-limiting oncocytoma and a genomically unstable carcinoma is thought to hinge at least in part on how the affected cell deals with metabolic derangement, and whether that results in the accumulation of additional mutations (like TP53 in CRCCs).

library(MTseekerData)

3 Relative mitochondrial copy number changes

We’d like to compute the relative mitochondrial copy number for each. For whatever reason, this seems to throw an error in the vignette build if we plot all of them, so we will just plot a subset in this vignette.

data(RONKSreads, package="MTseekerData")
mVn <- Summary(RONKSreads)$mitoVsNuclear
names(mVn) <- names(RONKSreads) 
CN <- mVn[seq(2,22,2)]/mVn[seq(1,21,2)] 
mtCN <- data.frame(subject=names(CN), CN=CN)

library(ggplot2) 
library(ggthemes)
p <- ggplot(head(mtCN), aes(x=subject, y=CN, fill=subject)) + 
       geom_col() + theme_tufte(base_size=24) + ylim(0,4) + 
       ylab("Tumor/normal mitochondrial ratio") + 
       ggtitle("Mitochondrial retention in oncocytomas")
print(p)

Note that some of the oncocytomas appear to have fewer mitochondrial genome copies than their normal kidney sample counterparts, contrary to intuition. These tumors usually stain as great big pink eosinophil-like cells on slides, supposedly due to their massive cytoplasmic load of mitochongdria. For whatever reason, though, the amount of (non-duplicated) mtDNA recovered in some of the oncocytomas relative to nearby normal kidney cells isn’t always much greater. (It is also possible that some neighboring cells might also be premalignant.)

4 Calling variants

Obviously it’s not much good to have a variant caller that can’t call variants, so we demonstrate that here. (Note: tumor/normal calls, haplogroup inference, and soft-backfiltering of haplogroup-determining variants are works in progress, so we do not currently demonstrate them here, although the fpFilter datasets are useful for these purposes) At present, MTseeker supports the gmapR package for mitochondrial variant calling, though we plan to provide cross-platform support via Rsamtools::pileup in the near future.

gmapR can be a bit feisty, so we simply document the process below:

if (FALSE) { 
  # doing this requires the BAM files
  RONKSvariants <- callMT(RONKSreads)
  # which is why we skip it in the vignette 
  save(RONKSvariants, file="RONKSvariants.rda")
  # see ?callMT for a much simpler runnable example
}

For this vignette, we have stored the results in the MTseekerData package:

library(MTseekerData)
data(RONKSvariants, package="MTseekerData")

5 Plotting variants

Let’s filter out some of the common variants, to focus on those only seen in the renal oncocytoma samples. The filterMT function drops samples with median read depth less than 20x, and if requested, also drops variants that fall into known homopolymeric regions (fpFilter=TRUE) or have variant allele frequencies (VAFs) of less than 0.03, many or most of which are likely to be NuMT (nu-mite) sequences that map equally well between nuclear and mitochondrial assemblies.

The coverage filter is somewhat gratuitous, since the shallowest mitochondrial coverage in this study is 333x, but the false-positive filter and the NuMT filter are both a good idea in most studies. Additionally, variant calls that are not marked as PASSing quality control are all dropped at this stage. The granges method for MVRanges and MVRangesList objects turns a set of variant calls into a GenomicRanges object with the aggregated affected regions.

RO <- grep("RO_", names(RONKSvariants))
filtered_RO <- filterMT(RONKSvariants[RO], fpFilter=TRUE, NuMT=TRUE)
RO_recurrent <- subset(granges(filtered_RO), 
                       region == "coding" & rowSums(overlaps) > 1)
## Filtering out low-quality calls...
## Aggregating variants...
## Annotating variants by region...
## Annotating variants by sample...

Same thing for the normal kidney samples, so that we can weed out a bit more. Again, we’ll use the granges method to aggregate affected regions of chrM.

NKS <- grep("NKS_", names(RONKSvariants))
filtered_NKS <- filterMT(RONKSvariants[NKS], fpFilter=TRUE, NuMT=TRUE)
NKS_recurrent <- subset(granges(filtered_NKS), 
                        region == "coding" & rowSums(overlaps) > 1)
## Filtering out low-quality calls...
## Aggregating variants...
## Annotating variants by region...
## Annotating variants by sample...
NKS_gaps <- subset(gaps(NKS_recurrent), strand == "*")

Lastly, let’s take only the variants in the oncocytomas that do not overlap recurrent variants in the normal kidney samples, to simplify our plotting.

RONKSfiltered <- endoapply(filterMT(RONKSvariants), subsetByOverlaps, NKS_gaps)
RONKScoding <- encoding(RONKSfiltered)

OK, now we have whittled away some of the more common variants to focus on those that seem to be specific and recurrent in the oncocytomas. Let’s plot the resulting variants in the first few samples. The plot looks a bit like tree rings; each tree ring is one sample, and each black mark is a variant (of whatever minimum variant allele frequency, or VAF, enforced) in a particular sample. The color-coded lines point to where on chrM the variant maps back to, since there are sparse and dense regions of variation. For whatever reason, vignette builds fail if we plot all of the variants, so we skip this step in the production vignette. (It is done in the build!)

plot(RONKScoding)

The resulting plot looks like so: