Contents

1 Overview

This vignette provides a description of how to use the GENESIS package to analyze sequence data. We demonstrate the use of mixed models for genetic association testing, as PC-AiR PCs can be used as fixed effect covariates to adjust for population stratification, and a kinship matrix (or genetic relationship matrix) estimated from PC-Relate can be used to account for phenotype correlation due to genetic similarity among samples. To illustrate the methods, we use a small subset of data from 1000 Genomes Phase 3.

2 Convert VCF to GDS

The first step is to convert a VCF file into the GDS file format used by GENESIS. We use the SeqArray package, which defines the extended GDS format used to capture all data in a VCF file. If the VCF files are split by chromosome, they can be combined into a single GDS file.

library(SeqArray)
vcffile <- system.file("extdata", "1KG", 
                       paste0("1KG_phase3_subset_chr", 1:22, ".vcf.gz"), 
                       package="GENESIS")
gdsfile <- tempfile()
seqVCF2GDS(vcffile, gdsfile, verbose=FALSE)
gds <- seqOpen(gdsfile)
gds
## Object of class "SeqVarGDSClass"
## File: /tmp/RtmpsVQFtu/file3aeb18e17e92 (419.8K)
## +    [  ] *
## |--+ description   [  ] *
## |--+ sample.id   { Str8 100 LZMA_ra(37.8%), 309B } *
## |--+ variant.id   { Int32 24639 LZMA_ra(7.99%), 7.7K } *
## |--+ position   { Int32 24639 LZMA_ra(71.8%), 69.1K } *
## |--+ chromosome   { Str8 24639 LZMA_ra(0.36%), 237B } *
## |--+ allele   { Str8 24639 LZMA_ra(19.2%), 20.0K } *
## |--+ genotype   [  ] *
## |  |--+ data   { Bit2 2x100x24657 LZMA_ra(18.7%), 224.9K } *
## |  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
## |  \--+ extra   { Int16 0 LZMA_ra, 18B }
## |--+ phase   [  ]
## |  |--+ data   { Bit1 100x24639 LZMA_ra(0.06%), 201B } *
## |  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
## |  \--+ extra   { Bit1 0 LZMA_ra, 18B }
## |--+ annotation   [  ]
## |  |--+ id   { Str8 24639 LZMA_ra(37.3%), 87.8K } *
## |  |--+ qual   { Float32 24639 LZMA_ra(0.17%), 173B } *
## |  |--+ filter   { Int32,factor 24639 LZMA_ra(0.17%), 173B } *
## |  |--+ info   [  ]
## |  \--+ format   [  ]
## \--+ sample.annotation   [  ]

2.1 Create a SeqVarData object

Next, we combine the GDS file with information about the samples, which we store in an AnnotatedDataFrame (defined in the Biobase package). An AnnotatedDataFrame combines a data.frame with metadata describing each column. A SeqVarData object (defined in the SeqVarTools package), contains both an open GDS file and an AnnotatedDataFrame describing the samples. The sample.id column in the AnnotatedDataFrame must match the sample.id node in the GDS file.

library(GENESIS)
library(Biobase)
library(SeqVarTools)

data(sample_annotation_1KG)
annot <- sample_annotation_1KG
head(annot)
##   sample.id Population sex
## 1   HG00110        GBR   F
## 2   HG00116        GBR   M
## 3   HG00120        GBR   F
## 4   HG00128        GBR   F
## 5   HG00136        GBR   M
## 6   HG00137        GBR   F
# simulate some phenotype data
annot$outcome <- rnorm(nrow(annot))
metadata <- data.frame(labelDescription=c("sample id", 
                                          "1000 genomes population", 
                                          "sex", 
                                          "simulated phenotype"),
                       row.names=names(annot))
annot <- AnnotatedDataFrame(annot, metadata)

all.equal(annot$sample.id, seqGetData(gds, "sample.id"))
## [1] TRUE
seqData <- SeqVarData(gds, sampleData=annot)

3 Population structure and relatedness

PC-AiR and PC-Relate are described in detail in a separate vignette. Here, we demonstrate their use to provide adjustment for population structure and relatedness in a mixed model.

3.1 KING

Step 1 is to get initial estimates of kinship using KING, which is robust to population structure but not admixture. The KING algorithm is available in SNPRelate. We select a subset of variants for this calculation with LD pruning.

library(SNPRelate)

# LD pruning to get variant set
snpset <- snpgdsLDpruning(gds, method="corr", slide.max.bp=10e6, 
                          ld.threshold=sqrt(0.1), verbose=FALSE)
pruned <- unlist(snpset, use.names=FALSE)

king <- snpgdsIBDKING(gds, snp.id=pruned, verbose=FALSE)
kingMat <- king$kinship
dimnames(kingMat) <- list(king$sample.id, king$sample.id)

3.2 PC-AiR

The next step is PC-AiR, in which we select a set of unrelated samples that is maximally informative about all ancestries in the sample, use this unrelated set for Principal Component Analysis (PCA), then project the relatives onto the PCs. We use a kinship threshold of degree 3 (unrelated is less than first cousins). In this example, we use the KING estimates for both kinship (kinobj) and ancestry divergence (divobj). KING kinship estimates are negative for samples with different ancestry.

pcs <- pcair(seqData, 
             kinobj=kingMat, kin.thresh=2^(-9/2),
             divobj=kingMat, div.thresh=-2^(-9/2),
             snp.include=pruned)
## Principal Component Analysis (PCA) on genotypes:
## Calculating allele counts/frequencies ...
## 
[..................................................]  0%, ETC: ---    
[==================================================] 100%, completed in 0s
## Excluding 333 SNVs (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
## Working space: 88 samples, 10,490 SNVs
##     using 1 (CPU) core
## CPU capabilities: Double-Precision SSE2
## Mon Jun  3 21:09:34 2019    (internal increment: 36816)
## 
[..................................................]  0%, ETC: ---        
[==================================================] 100%, completed in 1s
## Mon Jun  3 21:09:35 2019    Begin (eigenvalues and eigenvectors)
## Mon Jun  3 21:09:35 2019    Done.
## SNP loading:
## Working space: 88 samples, 10490 SNPs
##     using 1 (CPU) core
##     using the top 32 eigenvectors
## Mon Jun  3 21:09:35 2019    (internal increment: 65536)
## 
[..................................................]  0%, ETC: ---        
[==================================================] 100%, completed in 0s
## Mon Jun  3 21:09:35 2019    Done.
## Sample loading:
## Working space: 12 samples, 10490 SNPs
##     using 1 (CPU) core
##     using the top 32 eigenvectors
## Mon Jun  3 21:09:35 2019    (internal increment: 65536)
## 
[..................................................]  0%, ETC: ---        
[==================================================] 100%, completed in 0s
## Mon Jun  3 21:09:35 2019    Done.

We need to determine which PCs are ancestry informative. To do this we need population information for the 1000 Genomes samples. We make a parallel coordinates plot, color-coding by 1000 Genomes population.

library(dplyr)
library(RColorBrewer)
library(ggplot2)
library(GGally)

pc.df <- as.data.frame(pcs$vectors)
names(pc.df) <- paste0("PC", 1:ncol(pcs$vectors))
pc.df$sample.id <- row.names(pcs$vectors)
pc.df <- left_join(pc.df, pData(annot), by="sample.id")

pop.cols <- setNames(brewer.pal(12, "Paired"),
    c("ACB", "ASW", "CEU", "GBR", "CHB", "JPT", 
      "CLM", "MXL", "LWK", "YRI", "GIH", "PUR"))

ggplot(pc.df, aes(PC1, PC2, color=Population)) + geom_point() +
    scale_color_manual(values=pop.cols)

ggparcoord(pc.df, columns=1:10, groupColumn="Population", scale="uniminmax") +
    scale_color_manual(values=pop.cols) +
    xlab("PC") + ylab("")