# 1 Introduction

scoreInvHap infers haplotype inversion’s status of a set of samples using SNP data. This method computes a similarity score between the sample SNPs in your cohort and the reference haplotypes. Samples are classified into the haplotype having the highest score. There are other approaches to perform this task such us inveRsion, invClust or PFIDO that are based on multivariate clustering procedures. However, these approaches may have several limitation that are addressed by using scoreInvHap. These limitations include:

• analizing limited or moderate number of samples. Clustering methods require to have large number of samples to perform clustering to properly discriminate haplotype inversion status. In particular, calling of a single sample cannot be performed without having a reference data.
• analyzing data from different SNP arrays. PFIDO and invClust may provide different answers depending on which array is used (i.e., Illumina 1M, Affy 6.0, …) since dfferent coverage may lead to perform the calling using different number of SNPs of a given inverted region.
• combining calling from different SNP arrrays. invClust and PFIDO clasify each individual as homozygous normal, heterozygous or homozygous inverted depending on cluster frequency. In those cases where inversion with haplotype frequency is close to 50%, both methods may have problems to determine which is the reference haplotype. Performing calling in different cohort of samples may lead to wrong results when combining data from different studies.

scoreInvHap overcomes these difficulties by using a set of reference genotypes that has been used to properly characterize the inversion of interest. Genomic information such as linkage desequillibrium (R2) and hetgerozygous sequence has been determined for each of the region of interest (ROI) and are used to create the score that discriminate the inversion status.

# 2 Inversion characterization

The package can be loaded by typing:

library(scoreInvHap)

Calling procedure requires three objects that characterize the inversion of interest:

• the reference genotypes,
• the SNPs R2, and
• the heterozygous reference.

scoreInvHap already contains these required objects of four inversions. Two of them are well known and characterized genomic inversions: 8p23 (ROIno.8.3) and 17q21.31 (ROIno.17.16). These inversions are annotated using the ROI numbers used in Sander’s paper. We have also included two inversions that are described in invFEST database: 7p11.2 (HsInv0286) and Xq13.2 (HsInv0396). These objects have been created using VCF phase 3 data of 1000 Genomes Project. The code used to generate them can be found in the /inst/scripts folder of the package. This code can be modified to create the required files of any other inversion of interest. Required information of other inversions will be incorporated in the future when checked in our group or described in the literature.

## 2.1 Reference genotypes

As previously stated, the method uses the frequency of the SNP genotypes of each SNP located into the inversion region. This information is provided for he different haplotype populations. This information is encoded in an object called Refs that can be loaded by typing:

data("Refs")
names(Refs)
## [1] "ROIno.8.3"   "ROIno.17.16" "HsInv0286"   "HsInv0396"

Each element of this list is a reference for one of the four available inversions. For instance, the reference of inversion ROIno.8.3 can be obtained by:

ref <- Refs$ROIno.8.3 class(ref) ## [1] "list" ref[1:2] ##$rs141039449
##
## haplos         CC         CG          GG
##   NI/NI 0.9310345 0.06321839 0.005747126
##   NI/I  0.8595745 0.12765957 0.012765957
##   I/I   0.6914894 0.25531915 0.053191489
##
## $rs138092889 ## ## haplos AA AG GG ## NI/NI 0.005747126 0.02298851 0.9712644 ## NI/I 0.012765957 0.05531915 0.9319149 ## I/I 0.042553191 0.19148936 0.7659574 This object is a list of matrices containing the frequency of each genotype (columns) in each inversion genotype (rows). Each component is named with the SNP id contained in the ROI. Notice that the alleles of the heteryzogous genotypes are alphabetically ordered. ## 2.2 SNPs R2 with inversion The second object required is a vector containing the R2 between the inversion status and the SNPs genotypes. The SNPs with higher R2 will have more influence when computing the similarity score. We have formatted this object as a numeric vector, named with the SNPs ids. This information is provided in the object called SNPsR2. This is a list that contains the R2 of the four available inversions. For instance, this information for the inversion ROIno.8.3 can be get by: data("SNPsR2") names(SNPsR2) ## [1] "ROIno.8.3" "ROIno.17.16" "HsInv0286" "HsInv0396" R2s <- SNPsR2$ROIno.8.3
head(R2s)
##  rs141039449  rs138092889  rs138217047  rs143682252  rs550257716
## 0.0378562621 0.0289266440 0.0264322431 0.0264322431 0.0005282751
##    rs2955571
## 0.0004570336

## 2.3 Heterozygous reference

The last required information is the heterozygous genotypes of the SNPs included in the references. This information is used to ensure that input SNPs have the same coded alleles than those used in the reference. This information can be retrieve from the object called hetRefs that can be inspect by typing:

data("hetRefs")
names(hetRefs)
## [1] "ROIno.8.3"   "ROIno.17.16" "HsInv0286"   "HsInv0396"
hRefs <- hetRefs$ROIno.8.3 head(hRefs) ## rs141039449 rs138092889 rs138217047 rs143682252 rs550257716 rs2955571 ## "CG" "AG" "AC" "AG" "CT" "AG" In that case, hRefs is a character vector contaning the heterozygous genotypes of the SNPs used as references in the ROI. It should be noticed that, in the heterozygous genotype, the alleles MUST BE ordered ALPHABETICALLY. # 3 Running scoreInvHap: calling inversions scoreInvHap deals with data either in a SNPMatrix or as Bioconductor VCF class. In the case of SNPMatrix, a list with two elements is required: • genotypes: a SNPMatrix with individuals in rows and SNPs in columns • map: a data.frame with the SNPs annotation. It must contain the columns allele.1 and allele.2 with the alleles of the SNPs. We can load our data from a ped file or from plink binary format (.bed, .bim) to a SNPMatrix using snpStats: library(snpStats) ## From a bed snps <- read.plink("example.bed") ## From a pedfile snps <- read.pedfile("example.ped", snps = "example.map") In both cases, snps is a list containing the elements genotypes and map. This object can be passed to scoreInvHap functions. We can load a vcf file into a VCF object using the VariantAnnotation. We have included a small vcf in scoreInvHap package to illustrate how to deal with this data. This file contains a subset of SNPs of 30 European individuals belonging to the 1000 Genomes project. All these SNPs are located at the region 7p11.2, the region annotated as HsInv0286 in the Sander’s paper. This vcf file contains imputed data. We can load the vcf with the following code: library(VariantAnnotation) vcf_file <- system.file("extdata", "example.vcf", package = "scoreInvHap") vcf <- readVcf(vcf_file, "hg19") vcf ## class: CollapsedVCF ## dim: 380 30 ## rowRanges(vcf): ## GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER ## info(vcf): ## DataFrame with 4 columns: AF, MAF, R2, ER2 ## info(header(vcf)): ## Number Type Description ## AF 1 Float Estimated Alternate Allele Frequency ## MAF 1 Float Estimated Minor Allele Frequency ## R2 1 Float Estimated Imputation Accuracy ## ER2 1 Float Empirical (Leave-One-Out) R-square (available only fo... ## geno(vcf): ## SimpleList of length 3: GT, DS, GP ## geno(header(vcf)): ## Number Type Description ## GT 1 String Genotype ## DS 1 Float Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)] ## GP 3 Float Estimated Posterior Probabilities for Genotypes 0/0, ... We can observe that the object vcf contains 380 SNPs and 30 samples. Now we are ready to classify the samples with regard to the inversion HsInv0286 by using the function scoreInvHap. This function requires four pieces of information: • sample genotypes and their annotation (argument SNPlist), • the list of matrices with the frequency of each genotypes in each inversion population (argument Refs), • the numeric vector with the SNPs R2 (argument SNPsR2), and • the vector with the heterozygous genotypes of the SNPs (argument hetRefs). The HsInv0286 inversion status of the 30 samples from 1000 genomes is obtained by: res <- scoreInvHap(SNPlist = vcf, SNPsR2 = SNPsR2$HsInv0286,
hetRefs = hetRefs$HsInv0286, Refs = Refs$HsInv0286)
res
## scoreInvHapRes
## Samples:  30
## Genotypes' table:
##  IaIa     IaIb    IbIb    NaIa    NaIb    NaNa    NaNb    NbIa    NbIb    NbNb
##  2    0   4   5   5   0   4   3   4   3
## - Inversion genotypes' table:
##  NN   NI  II
##  7    17  6
## - Inversion frequency: 48.33%

The results of scoreInvHap are encapsulated in a object of class scoreInvHapes. This object contains the classification of the samples and the simmilarity scores. We can obtain this data with the following getters:

# Get classification
head(classification(res))
## HG00096 HG00097 HG00099 HG00100 HG00101 HG00102
##    IbIb    IbIb    NaIb    NbIb    IaIa    NbNb
## Levels: IaIa IaIb IbIb NaIa NaIb NaNa NaNb NbIa NbIb NbNb
# Get scores
head(scores(res))
##              IaIa      IaIb      IbIb      NaIa      NaIb      NaNa
## HG00096 0.7636176 0.7659578 0.9945596 0.6346131 0.7729563 0.7725742
## HG00097 0.7636176 0.7659578 0.9945596 0.6346131 0.7729563 0.7725742
## HG00099 0.6438462 0.7267291 0.7685422 0.7768338 0.9758600 0.7703418
## HG00100 0.3431981 0.4263283 0.4892990 0.2604141 0.4764038 0.3691768
## HG00101 0.9841178 0.7603894 0.7584714 0.7188997 0.6280646 0.7182172
## HG00102 0.2518848 0.1732271 0.3167020 0.1441663 0.1994766 0.3004813
##              NaNb      NbIa      NbIb      NbNb
## HG00096 0.4156252 0.3784514 0.5670885 0.3911247
## HG00097 0.4156252 0.3784514 0.5670885 0.3911247
## HG00099 0.5317241 0.3025523 0.5345369 0.2689791
## HG00100 0.6083141 0.6376241 0.6834476 0.5605652
## HG00101 0.3558992 0.4545927 0.4165003 0.3266407
## HG00102 0.5812806 0.5527802 0.5480628 0.7223922

## 3.1 Quality control

We can retrieve other values that are useful to evaluate the quality of the classification. For each sample, we can obtain its highest similarity score and the difference between the highest similarity score and the second highest:

# Get max score
head(maxscores(res))
##   HG00096   HG00097   HG00099   HG00100   HG00101   HG00102
## 0.9945596 0.9945596 0.9758600 0.6834476 0.9841178 0.7223922
# Get difference score
head(diffscores(res))
##   HG00096   HG00097   HG00099   HG00100   HG00101   HG00102
## 0.2216032 0.2216032 0.1990263 0.0458235 0.2237283 0.1411116

A classification is good when the highest score is close to 1 and the other scores are small. This means that the samples SNPs are almost the same than in one of the reference haplotypes and that they are different to the other references. We use the difference between the highest score and the second highest score as a measure of how different is the highest score from the rest. We can have a visual evaluation of these quality parameters with the function plotScores:

plotScores(res, pch = 16, main = "QC based on scores")

The horizontal line of the plot is a suggestion of the minimum difference between the highest and the second score that we accept. By default, this value is set to 0.1 but it can be changed with the parameter minDiff. This default value equals to considering that the sample SNPs are at least 10% more similar to the top reference than to the other references. plotScores relies on the base plot function, so we can pass additional parameters to customize the plot.

The other quality control estimate are based on the number of SNPs used in the computation. scoreInvHap allows having some missing measurements in the input data. However, this measurements are excluded from the computation of the scores. To reflect this issue, we have two measurements: the number of SNPs used in the computation and the proportion of non-missing measurements, or call rate:

# Get Number of scores used
head(numSNPs(res))
## HG00096 HG00097 HG00099 HG00100 HG00101 HG00102
##     307     307     307     307     307     307
# Get call rate
head(propSNPs(res))
## HG00096 HG00097 HG00099 HG00100 HG00101 HG00102
##       1       1       1       1       1       1

The number of SNPs must always be taken into account to evaluate the performance of the computation. It is highly recommended to use, at least, 15 SNPs in the computation. We have also included the function plotCallRate to plot the call rate of the samples:

plotCallRate(res, main = "Call Rate QC")

The vertical line is the minimum recommended call rate to consider a sample. By default, it is set to 0.9 but can be changed with the parameter callRate. Again, plotCallRate relies on the base plot function, so we can customize the plot.

The function classification have two parameters that selects samples based on these two QC parameters. The argument minDiff sets the minimum difference between the highest and the second highest score to consider a sample. The argument callRate sets the minimum call rate of a sample to pass the QC. By default, both arguments are set to 0 so all the sample are included:

## No filtering
length(classification(res))
## [1] 30
## QC filtering
length(classification(res, minDiff = 0.1, callRate = 0.9))
## [1] 27

Finally, the function classification has the argument inversion that, when it is set to TRUE, the haplotype based classification is transformed to an inversion based classification. This is useful for inversions HsInv0286 and HsInv0396 that have more than one haplotype per inversion status:

## No filtering
table(classification(res))
##
## IaIa IaIb IbIb NaIa NaIb NaNa NaNb NbIa NbIb NbNb
##    2    0    4    5    5    0    4    3    4    3
## QC filtering
table(classification(res, inversion = TRUE))
##
## II NI NN
##  6 17  7

## 3.2 Imputed data

When SNPs data are imputed, we obtain three different types of results: the best-guess, the dosage and the posterior probabilities. By default, scoreInvHap use the best-guess when computing the simmilarity scores. However, we can also use posterior probabilities setting the argument imputed to TRUE:

res_imp <- scoreInvHap(SNPlist = vcf, SNPsR2 = SNPsR2$HsInv0286, hetRefs = hetRefs$HsInv0286, Refs = Refs\$HsInv0286, imputed = TRUE)
res_imp
## scoreInvHapRes
## Samples:  30
## Genotypes' table:
##  IaIa     IaIb    IbIb    NaIa    NaIb    NaNa    NaNb    NbIa    NbIb    NbNb
##  2    0   4   5   5   0   4   3   4   3
## - Inversion genotypes' table:
##  NN   NI  II
##  7    17  6
## - Inversion frequency: 48.33%

In this case, the samples were identically classified in both cases:

table(PostProbs = classification(res_imp),
BestGuess = classification(res))
##          BestGuess
## PostProbs IaIa IaIb IbIb NaIa NaIb NaNa NaNb NbIa NbIb NbNb
##      IaIa    2    0    0    0    0    0    0    0    0    0
##      IaIb    0    0    0    0    0    0    0    0    0    0
##      IbIb    0    0    4    0    0    0    0    0    0    0
##      NaIa    0    0    0    5    0    0    0    0    0    0
##      NaIb    0    0    0    0    5    0    0    0    0    0
##      NaNa    0    0    0    0    0    0    0    0    0    0
##      NaNb    0    0    0    0    0    0    4    0    0    0
##      NbIa    0    0    0    0    0    0    0    3    0    0
##      NbIb    0    0    0    0    0    0    0    0    4    0
##      NbNb    0    0    0    0    0    0    0    0    0    3

# 4 Other features

There are two additional parameters of scoreInvHap than can reduce computing time: R2 and BPPARAM. R2 indicates the minimum R2 that a SNP should have with the inversion to be included in the score. The less number of SNPs the less time it takes. By default, all SNPs are included in the computation. On the other hand, BPPARAM requires an instance of BiocParallelParam, which allows to parallelize the computation of the score. You can find more information about this class in its help page (?bpparam) and in the BiocParallel vignette.

sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets
## [8] methods   base
##
## other attached packages:
##  [1] snpStats_1.28.0            Matrix_1.2-11
##  [3] survival_2.41-3            VariantAnnotation_1.24.0
##  [5] Rsamtools_1.30.0           Biostrings_2.46.0
##  [7] XVector_0.18.0             SummarizedExperiment_1.8.0
##  [9] DelayedArray_0.4.0         matrixStats_0.52.2
## [11] Biobase_2.38.0             GenomicRanges_1.30.0
## [13] GenomeInfoDb_1.14.0        IRanges_2.12.0
## [15] S4Vectors_0.16.0           BiocGenerics_0.24.0
## [17] scoreInvHap_1.0.0          BiocStyle_2.6.0
##
## loaded via a namespace (and not attached):
##  [1] progress_1.1.2           splines_3.4.2
##  [3] lattice_0.20-35          htmltools_0.3.6
##  [5] rtracklayer_1.38.0       yaml_2.1.14
##  [7] GenomicFeatures_1.30.0   blob_1.1.0
##  [9] XML_3.98-1.9             rlang_0.1.2
## [11] DBI_0.7                  BiocParallel_1.12.0
## [13] bit64_0.9-7              GenomeInfoDbData_0.99.1
## [15] stringr_1.2.0            zlibbioc_1.24.0
## [17] evaluate_0.10.1          memoise_1.1.0
## [19] knitr_1.17               biomaRt_2.34.0
## [21] AnnotationDbi_1.40.0     Rcpp_0.12.13
## [23] backports_1.1.1          BSgenome_1.46.0
## [25] bit_1.1-12               RMySQL_0.10.13
## [27] digest_0.6.12            stringi_1.1.5
## [29] bookdown_0.5             grid_3.4.2
## [31] rprojroot_1.2            tools_3.4.2
## [33] bitops_1.0-6             magrittr_1.5
## [35] RCurl_1.95-4.8           RSQLite_2.0
## [37] tibble_1.3.4             prettyunits_1.0.2
## [39] assertthat_0.2.0         rmarkdown_1.6
## [41] R6_2.2.2                 GenomicAlignments_1.14.0
## [43] compiler_3.4.2