eQTL tutorial, Bioc 2014

VJ Carey stvjc at channing dot harvard dot edu

Basic data from RECOUNT

We will use transcript profiles obtained through RNA-seq applied to HapMap cell lines (Montgomery, Sammeth, Gutierrez-Arcelus, Lach, Ingle, Nisbett, Guigo, and Dermitzakis (2010)). The reported gene-level count data have been rendered as an ExpressionSet (montpick_eset.RData) in the RECOUNT project (Frazee, Langmead, and Leek (2011)) using the ENSEMBL nomenclature for genes. These counts were filtered to exclude genes with uniform zero value over all samples, subjected to rlog transformation (DESeq2), and then filtered to include only those with MAD at the 40th percentile of the overall distribution of genewise MADs.

Characteristic code for the transformation to SummarizedExperiment is given here. You do not need to execute this as the transformed data are available for loading as noted below. (exs2se() is a bit of custom code available in the source of this vignette.)

library(Biobase)
load("/data/eQTL2014/montpick_eset.RData")
mp = montpick.eset
annotation(mp) = "org.Hs.eg.db"
tmp = select(org.Hs.eg.db, keytype="ENSEMBL", columns="ENTREZID",
  keys=featureNames(mp))
ezs = split(tmp[,2], tmp[,1])
ezs = sapply(ezs, "[", 1)
drop = which(duplicated(ezs) | is.na(ezs))
if (length(drop)>0) ezs = ezs[-drop]
mp = mp[names(ezs)]
featureNames(mp) = as.character(ezs[featureNames(mp)])
mpSE = exs2se(mp, annoDb = org.Hs.eg.db, probekeytype="ENTREZID")

Transformed expression data in SummarizedExperiment

We obtain the basic image of transformed count data as follows.

load("/data/eQTL2014/rmpDEPF.rda") # rlogged Montgomery Pickrell DESeqDataSet, some positive, filtered
rmpDEPF

## class: SummarizedExperiment 
## dim: 7054 129 
## exptData(1): MIAME
## assays(1): ''
## rownames(7054): 8813 57147 ... 56099 56104
## rowData metadata column names(6): baseMean baseVar ... dispFit
##   rlogIntercept
## colnames(129): NA06985 NA06986 ... NA19239 NA19257
## colData names(5): sample.id num.tech.reps population study
##   sizeFactor

Genotype data in an Amazon S3 bucket

All the data generated in the 1000 genomes project is available in a S3 (simple storage service) container. We have a URL that can be used to access the genotypes from chr22 in VCF format.

load("/data/eQTL2014/s3url.rda")
s3url

## [1] "http://1000genomes.s3.amazonaws.com/release/20110521/ALL.chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"

library(VariantAnnotation)

head = scanVcfHeader(s3url)
str(head, max.level=4)

## Formal class 'VCFHeader' [package "VariantAnnotation"] with 3 slots
##   ..@ reference: chr(0) 
##   ..@ samples  : chr [1:1092] "HG00096" "HG00097" "HG00099" "HG00100" ...
##   ..@ header   :Formal class 'SimpleDataFrameList' [package "IRanges"] with 4 slots
##   .. .. ..@ elementType    : chr "DataFrame"
##   .. .. ..@ elementMetadata: NULL
##   .. .. ..@ metadata       : list()
##   .. .. ..@ listData       :List of 4

A naive analysis

The GGtools package includes a function, cisAssoc(), that can test additive genetic associations in cis between SNP genotypes housed in tabix-indexed VCF and expression data housed in a SummarizedExperiment. The test is the score test for a generalized linear model fit to independent samples with a continuous response, with response variance approximately independent of response mean. These assumptions are generally not met for count data from RNA-seq experiments, but they may hold reasonably approximately after the rlog transformation. We will check on the severity of the violations after performing some tests.

Here is code that computes, for 200 genes on chr17, association test statistics and versions thereof under permutation of expression against genotype. Do not perform these calculations, they will take too long for the tutorial. The answer is provided in the tutorial data, see below.

library(GGtools)

## Loading required package: GGBase
## Loading required package: snpStats
## Loading required package: survival
## Loading required package: splines
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following object is masked from 'package:VariantAnnotation':
## 
##     expand
## 
## The following object is masked from 'package:IRanges':
## 
##     expand
## 
## Loading required package: data.table
## data.table 1.9.2  For help type: help("data.table")
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
## 
## Attaching package: 'GGtools'
## 
## The following object is masked from 'package:stats':
## 
##     getCall

library(foreach)
library(doParallel)
registerDoSEQ()
r17 = rmpDEPF[ which(seqnames(rmpDEPF)=="chr17"),]
set.seed(1234)
s3urlb = sub("22", "17", s3url)
c17 = cisAssoc(r17, TabixFile(s3urlb), cisradius=100000, lbmaf=.025)
c17d = addmindist(c17)

Here is an extract from the output. We have manually added a column 'mindist' with the number of nucleotides from SNP to nearest end of gene body. 'mindist' is zero for SNP lying within the gene body.

c17d[1:3,1:5]

## GRanges with 3 ranges and 5 metadata columns:
##       seqnames               ranges strand | paramRangeID            REF
##          <Rle>            <IRanges>  <Rle> |     <factor> <DNAStringSet>
##   [1]       17 [36926181, 36926181]      * |         3927              T
##   [2]       17 [36926303, 36926303]      * |         3927              C
##   [3]       17 [36926384, 36926384]      * |         3927              C
##                   ALT     chisq permScore_1
##       <CharacterList> <numeric>   <numeric>
##   [1]               A    2.9869     0.51171
##   [2]               T    4.3823     1.01236
##   [3]               T    0.4235     0.05441
##   ---
##   seqlengths:
##    17
##    NA

names(mcols(c17d))

##  [1] "paramRangeID" "REF"          "ALT"          "chisq"       
##  [5] "permScore_1"  "permScore_2"  "permScore_3"  "snp"         
##  [9] "MAF"          "probeid"      "score"        "fdr"         
## [13] "mindist"

Computing the plug-in FDR

The following algorithm is from The Elements of Statistical Learning by Hastie, Tibshirani and Friedman (Springer 2001).

We can obtain estimated FDR for the gene-SNP pairs tested in the example above, as follows, using the pifdr() function in GGtools.

fdr = pifdr( c17d$chisq, c(c17d$permScore_1, 
  c17d$permScore_2, c17d$permScore_3) )
sum(fdr == 0)

## [1] 430

sum(fdr <= .05)

## [1] 5292

Sensitivity analysis, and “what to count?”

The following computations generate a sensitivity analysis display, confronting the enumeration of significantly associated SNP-gene pairs.

library(data.table)
library(foreach)

## foreach: simple, scalable parallel programming from Revolution Analytics
## Use Revolution R for scalability, fault tolerance and more.
## http://www.revolutionanalytics.com

registerDoSEQ()
dt1 = data.table(as.data.frame(mcols(c17d)))

s1 = eqsens_dt( dt1 )

plotsens(s1, ylab="Count of significant SNP-gene pairs at given FDR")

## Loading required package: reshape2
## Loading required package: ggplot2

plot of chunk dopl

If instead of counting SNP-gene pairs, we wish to count genes that show evidence of association with some genotype, the plug-in FDR procedure can be used with maximal association scores per gene, for observed and permuted tests.

s2 = eqsens_dt( dt1, by="probes" )

plotsens(s2, ylab="Count of genes with evidence of eQTL at given FDR")

plot of chunk pl2

There is an indication here that we get a larger yield if we filter more sharply on MAF and distance than we did in the initial run.

Exercise 1.

How can we compute the FDR for the gene-wise hypotheses H0g: Mean expression of gene g is independent of SNP content for all SNP cis to g, under the restrictions cis radius less than 50kb and MAF greater than 5 percent?

Visualization

The following code creates a visualization of genotype-expression association for a selected SNP-gene pair.

plotOne = function (summex, vcf.tf, whicha, whichv, genome="hg19", ... ) {
    sampidsInSumm = colnames(summex)
    sampidsInVCF = sampsInVCF(vcf.tf)
    oksamp = intersect(sampidsInSumm, sampidsInVCF)
    stopifnot(length(oksamp) > 0)
    summex = summex[, oksamp]
    vp = ScanVcfParam(fixed = "ALT", info = NA, geno = "GT", 
        samples = oksamp, which = whichv)
    vdata = readVcf(vcf.tf, genome = genome, param = vp)
    rdd = rowData(vdata)
    vdata = GGtools:::snvsOnly(vdata)
    gtdata = factor(geno(vdata)[[1]])
    plot(  as.numeric(assays(summex)[[1]][whicha,]) ~ gtdata , ...)
}
r17 = rmpDEPF[ which(as.character(seqnames(rmpDEPF))=="chr17"),]
s3urlb = sub("22", "17", s3url)
plotOne(r17, TabixFile(s3urlb), 1, c17d[1] )

plot of chunk doonePlot

Exercise 2.

What are the names of the SNP and gene depicted here? Replot, annotating with this information (improve the x and y labels for maximal clarity of interpretation).

Exercise 3.

What is the FDR for the association shown in the display?

We can show four relationships at once fairly simply.

opar = par(no.readonly=TRUE)
par(mfrow=c(2,2))
for (i in 2:5)
 plotOne(r17, TabixFile(s3urlb), i, c17d[i] )

plot of chunk dofff

par(opar)

Exercise 4.

Depict the associations for four gene-SNP pairs with very small FDR. The four genes should be different.

Hint:

order(fdr)[1:10]
c17d[7659]
args(plotOne)
plotOne(r17, TabixFile(s3urlb), "114881", c17d[7659] )

Exercise 5: Master regulators.

Find SNP whose FDR for association with multiple distinct genes is found to be less than one percent. Visualize the relationships.

Connection with “GWAS hits”

We can acquire an image of the current NHGRI GWAS catalog as follows.

library(gwascat)
curgw = makeCurrentGwascat()
curgw

HOWEVER – as of 28 July 2014, the coordinates used for the current catalog are GRCh38. We used rtracklayer's liftOver with the appropriate chain file to obtain an hg19 addressing scheme, losing a small number of nonmappable loci.

load("/data/eQTL2014/curgwHG19.rda")
curgwHG19[1:3,1:5]

## GRanges with 3 ranges and 5 metadata columns:
##     seqnames                 ranges strand | Date.Added.to.Catalog
##        <Rle>              <IRanges>  <Rle> |           <character>
##   1    chr19 [  7739177,   7739177]      * |            04/16/2014
##   2     chr2 [191113009, 191113009]      * |            07/26/2014
##   3     chr6 [ 31197514,  31197514]      * |            07/26/2014
##      PUBMEDID First.Author        Date                Journal
##     <integer>  <character> <character>            <character>
##   1  24123702     Chung CM  03/03/2014 Diabetes Metab Res Rev
##   2  24325914        Chu M  12/09/2013         Carcinogenesis
##   3  24324648    Yatagai Y  12/04/2013               PLoS One
##   ---
##   seqlengths:
##    chr19 chr16  chr1 chr17 chr21 chr13 ... chr15  chr9 chr12 chr14  chrX
##       NA    NA    NA    NA    NA    NA ...    NA    NA    NA    NA    NA

Exercise 6

How frequent are exact coincidences between SNP exhibiting eQTL association at FDR five percent or less, and GWAS hits? Compare this frequency as computed for SNP exhibiting association at FDR 95 percent or greater.

Further exercises

Exercise 6: Population of origin.

We have disregarded the population of origin in this analysis. Use visualization to assess whether the association between gene OSBPL7 and SNP rs17774008 has similar forms in subjects from CEU and YRI populations.

You can use s3url to get access to genotype data for chr22. It takes 5 minutes or so to run cisAssoc on chr22 genes and SNP. Obtain analyses for CEU and YRI groups separately. What is the concordance on significance of SNP-gene pairs at FDR five percent or so?

Exercise 7: Efficient analysis with counted expression values.

/data/eQTL2014/montpick_eset.RData has the raw counts. (Note, probe identifiers are in ENSEMBL vocabulary.) Use DESeq2 or edgeR to obtain an FDR for a selected SNP-gene pair. Write a function that does this for any given selection. In principle the mean-variance relationship needs to be modeled separately for each gene-SNP pair, so obtaining a statistically optimal solution for a general search may be laborious. It would be nice to understand the costs of using rlog with blind and fast set to TRUE before doing a Gaussian-like analysis of eQTL, as opposed to pair-specific optimally modeled negative binomial testing for eQTL.

Bibliography

[1] A. C. Frazee, B. Langmead and J. T. Leek. “ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets.”. In: BMC bioinformatics 12.1 (Jan. 2011), p. 449. ISSN: 1471-2105. DOI: 10.1186/1471-2105-12-449. .

[2] S. B. Montgomery, M. Sammeth, M. Gutierrez-Arcelus, et al. “Transcriptome genetics using second generation sequencing in a Caucasian population.”. In: Nature 464.7289 (Apr. 2010), pp. 773-7. ISSN: 1476-4687. DOI: 10.1038/nature08903. .