This vignette provides a description of how to use the GENESIS package to analyze sequence data. We demonstrate the use of mixed models for genetic association testing, as PC-AiR PCs can be used as fixed effect covariates to adjust for population stratification, and a kinship matrix (or genetic relationship matrix) estimated from PC-Relate can be used to account for phenotype correlation due to genetic similarity among samples. To illustrate the methods, we use a small subset of data from 1000 Genomes Phase 3.
The first step is to convert a VCF file into the GDS file format used by GENESIS. We use the SeqArray package, which defines the extended GDS format used to capture all data in a VCF file. If the VCF files are split by chromosome, they can be combined into a single GDS file.
library(SeqArray)
vcffile <- system.file("extdata", "1KG",
paste0("1KG_phase3_subset_chr", 1:22, ".vcf.gz"),
package="GENESIS")
gdsfile <- tempfile()
seqVCF2GDS(vcffile, gdsfile, verbose=FALSE)
gds <- seqOpen(gdsfile)
gds
## Object of class "SeqVarGDSClass"
## File: /tmp/RtmpMR7dVC/filed4055283d798e (419.7K)
## + [ ] *
## |--+ description [ ] *
## |--+ sample.id { Str8 100 LZMA_ra(37.8%), 309B } *
## |--+ variant.id { Int32 24639 LZMA_ra(7.99%), 7.7K } *
## |--+ position { Int32 24639 LZMA_ra(71.8%), 69.1K } *
## |--+ chromosome { Str8 24639 LZMA_ra(0.36%), 237B } *
## |--+ allele { Str8 24639 LZMA_ra(19.2%), 20.0K } *
## |--+ genotype [ ] *
## | |--+ data { Bit2 2x100x24657 LZMA_ra(18.7%), 224.9K } *
## | |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
## | \--+ extra { Int16 0 LZMA_ra, 18B }
## |--+ phase [ ]
## | |--+ data { Bit1 100x24639 LZMA_ra(0.06%), 201B } *
## | |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
## | \--+ extra { Bit1 0 LZMA_ra, 18B }
## |--+ annotation [ ]
## | |--+ id { Str8 24639 LZMA_ra(37.3%), 87.8K } *
## | |--+ qual { Float32 24639 LZMA_ra(0.17%), 173B } *
## | |--+ filter { Int32,factor 24639 LZMA_ra(0.17%), 173B } *
## | |--+ info [ ]
## | \--+ format [ ]
## \--+ sample.annotation [ ]
Next, we combine the GDS file with information about the samples, which we store in an AnnotatedDataFrame
(defined in the Biobase package). An AnnotatedDataFrame
combines a data.frame
with metadata describing each column. A SeqVarData
object (defined in the SeqVarTools package), contains both an open GDS file and an AnnotatedDataFrame
describing the samples. The sample.id
column in the AnnotatedDataFrame
must match the sample.id
node in the GDS file.
library(GENESIS)
library(Biobase)
library(SeqVarTools)
data(sample_annotation_1KG)
annot <- sample_annotation_1KG
head(annot)
## sample.id Population sex
## 1 HG00110 GBR F
## 2 HG00116 GBR M
## 3 HG00120 GBR F
## 4 HG00128 GBR F
## 5 HG00136 GBR M
## 6 HG00137 GBR F
# simulate some phenotype data
set.seed(4)
annot$outcome <- rnorm(nrow(annot))
metadata <- data.frame(labelDescription=c("sample id",
"1000 genomes population",
"sex",
"simulated phenotype"),
row.names=names(annot))
annot <- AnnotatedDataFrame(annot, metadata)
all.equal(annot$sample.id, seqGetData(gds, "sample.id"))
## [1] TRUE
seqData <- SeqVarData(gds, sampleData=annot)