RVS 1.0.0

- 1 Introduction
- 2 Pedigree format
- 3 Computing Standard Sharing Probabilities
- 4 Correcting for Related Founders
- 5 Using Monte Carlo Simulation
- 6 Calculating Sharing Probabilties Across Multiple Families
- 7 Precomputing Sharing Probabilities and Number of Carriers for all Possible Carrier Subsets
- 8 Analyzing variants in genomic sequence
- References

The primary use of the *RVS* package is to compute the rare variant (RV) sharing probabilities outlined in Bureau et al. (2014).

The main input used in this package is a *pedigree* object from the *kinship2* package on CRAN (Therneau et al. (2015)). The package vignette (https://cran.r-project.org/web/packages/kinship2/vignettes/pedigree.pdf) outlines the basic steps to creating a *pedigree*. Only the *id*, *findex*, *mindex* fields are necessary for computing sharing probabilities. The *RVS* package comes with 8 sample pedigrees.

```
data(samplePedigrees) # load sample pedigrees
kinship2::plot.pedigree(samplePedigrees$firstCousinPair) # plot pedigree
```

The primary function for computing sharing probabilities is *RVsharing*. There are two simple cases in which the calculation is straightforward.

In this case, we assume the variant is rare enough so that the probability of more than one founder introducing it to the pedigree is negligible. This is the default scenario for *RVsharing*.

We define the following random variables:

*\(C_i\): Number of copies of the RV received by subject \(i\), *\(F_j\): Indicator variable that founder \(j\) introduced one copy of the RV into the pedigree,

where the expression on the third line results from our assumption of a single copy of that RV among all alleles present in the \(n_f\) founders. The probabilities \(P[F_j] = {1 \over n_f}\) cancel from the numerator and denominator.

`RVsharing(samplePedigrees$firstCousinPair)`

`## Probability subjects 7 8 among 7 8 share a rare variant: 0.06667`

`## [1] 0.06666667`

In this case, we know the allele frequency of the rare variant in the population the founders are drawn from. This allows for quick, exact calculation of the sharing probability. To specify the allele frequency, use the argument *alleleFreq*.

```
defaultProbs <- sapply(samplePedigrees[1:4], function(p)
suppressMessages(RVsharing(p)))
exactProbs <- list()
freq <- c(0.001,0.0025,0.005,0.01,0.025,0.05)
exactProbs$fistCousinPair <- sapply(freq, function(f) suppressMessages(
RVsharing(samplePedigrees$firstCousinPair, alleleFreq=f)))
exactProbs$secondCousinPair <- sapply(freq, function(f) suppressMessages(
RVsharing(samplePedigrees$secondCousinPair, alleleFreq=f)))
exactProbs$firstCousinTriple <- sapply(freq, function(f) suppressMessages(
RVsharing(samplePedigrees$firstCousinTriple, alleleFreq=f)))
exactProbs$secondCousinTriple <- sapply(freq, function(f) suppressMessages(
RVsharing(samplePedigrees$secondCousinTriple, alleleFreq=f)))
plot(NULL, xlim=c(0.001,0.05), ylim=c(0,0.12),log='x',xaxt='n',
ylab='probability of sharing', xlab='variant frequency [%]')
axis(side=1, at=freq, labels=freq*100)
invisible(sapply(exactProbs, lines, x=freq))
invisible(sapply(exactProbs, points, x=freq, pch=19))
abline(h=defaultProbs,lty=2)
```

Similiar to Figure 2 in Bureau et al. (2014)

*RVsharing* also allows for estimating sharing probabilities through monte carlo simulation. The primary use of this feature is for calculating sharing probabilities under non standard assumptions about the founders. However, this feature is available for the standard assumptions as well. To run a monte carlo simulation, specify all parameters as normal and additionally provide the *nSim* parameter specifying how many simulations should be run.

`p <- RVsharing(samplePedigrees$firstCousinPair)`

`## Probability subjects 7 8 among 7 8 share a rare variant: 0.06667`

`p <- RVsharing(samplePedigrees$firstCousinPair, nSim=1e4)`

`## Probability subjects 7 8 among 7 8 share a rare variant: 0.07032`

`p <- RVsharing(samplePedigrees$firstCousinPair, alleleFreq=0.01)`

`## Probability subjects 7 8 among 7 8 share a rare variant: 0.07631`

`p <- RVsharing(samplePedigrees$firstCousinPair, alleleFreq=0.01, nSim=1e4)`

`## Probability subjects 7 8 among 7 8 share a rare variant: 0.07494`

`p <- RVsharing(samplePedigrees$firstCousinPair, kinshipCoeff=0.05)`

`## Probability subjects 7 8 among 7 8 share a rare variant: 0.1262`

`p <- RVsharing(samplePedigrees$firstCousinPair, kinshipCoeff=0.05, nSim=1e4)`

`## Probability subjects 7 8 among 7 8 share a rare variant: 0.1195`

This method allows for more complex relationships among the founders to be given. *RVsharing* allows for a complete distribution among the founders to be passed in as the parameter *founderDist*. This function should accept a single argument, N, and should return a vector of length N with values in {0,1,2} representing the number of copies of the variant each founder has.

```
# assumption that 1 founder introduces
fDist <- function(N) sample(c(rep(0,N-1), 1))
RVsharing(samplePedigrees$firstCousinPair, nSim=1e4, founderDist=fDist)
```

`## Probability subjects 7 8 among 7 8 share a rare variant: 0.05992`

`## [1] 0.05991868`

`RVsharing(samplePedigrees$firstCousinPair)`

`## Probability subjects 7 8 among 7 8 share a rare variant: 0.06667`

`## [1] 0.06666667`

For variants seen in only one family, the sharing probability can be interpreted directly as a P-value from a Bernoulli trial. For variants seen in M families and shared by affected relatives in a subset S of them, the P-value can be obtained as the sum of the probability of events as (or more) extreme as the observed sharing in the family subset S. The function *multipleFamilyPValue* takes a vector of sharing probabilities and a vector of TRUE/FALSE describing whether or not the variant was shared among the carriers.

```
probs <- sapply(samplePedigrees, function(p) suppressMessages(RVsharing(p)))
observed <- c(rep(FALSE, 3), rep(TRUE, 2), rep(FALSE, 4))
multipleFamilyPValue(probs, observed)
```

`## [1] 0.0002114833`

The *RVgene* function requires the lists *pattern.prob.list* of vectors of sharing probabilities and *N.list* of number of carriers for all possible affected carrier subsets in each family in the sample being analyzed. The elements of both of these lists must have the same names as the pedigree objects in the *ped.listfams* argument. When all affected subjecs in a family are final descendants, the sharing probabilities and number of carriers for all subsets can be generated automatically. Here is an exanple with three second cousins:

```
carriers = c(15,16,17)
carrier.sets = list()
for (i in length(carriers):1)
carrier.sets = c(carrier.sets, combn(carriers,i,simplify=FALSE))
fam15157.pattern.prob = sapply(carrier.sets,function (vec) RVsharing(samplePedigrees$secondCousinTriple,carriers=vec))
```

`## Probability subjects 15 16 17 among 15 16 17 share a rare variant: 0.001342`

`## Probability subjects 15 16 among 15 16 17 share a rare variant: 0.009396`

`## Probability subjects 15 17 among 15 16 17 share a rare variant: 0.009396`

`## Probability subjects 16 17 among 15 16 17 share a rare variant: 0.009396`

`## Probability subjects 15 among 15 16 17 share a rare variant: 0.3235`

`## Probability subjects 16 among 15 16 17 share a rare variant: 0.3235`

`## Probability subjects 17 among 15 16 17 share a rare variant: 0.3235`

`fam15157.N = sapply(carrier.sets,length)`

While this code applies to any configuration of affected final descendants, symmetries in the relationships of these third cousins results in equal sharing probabilities for multiple subsets. Subsets with the same probabilities are equivalent, and the optional argument *nequiv.list* can be used to indicate the number of equivalent subset for each sharing probability. While shorter vectors in *pattern.prob.list* and *N.list* result in more efficient computation, identification of the equivalent subsets is not easily automated, and will usually require custom code for each pedigree in a sample. With three second cousins we can use:

```
fam15157.pattern.prob = c(RVsharing(samplePedigrees$secondCousinTriple,carriers=c(15,16,17)),
RVsharing(samplePedigrees$secondCousinTriple,carriers=c(15,16)),
RVsharing(samplePedigrees$secondCousinTriple,carriers=c(15)))
```

`## Probability subjects 15 16 17 among 15 16 17 share a rare variant: 0.001342`

`## Probability subjects 15 16 among 15 16 17 share a rare variant: 0.009396`

`## Probability subjects 15 among 15 16 17 share a rare variant: 0.3235`

```
fam15157.N = 3:1
fam15157.nequiv = c(1,3,3)
```

It is then easy to check that the distribution sums to 1:

`sum(fam15157.pattern.prob*fam15157.nequiv)`

`## [1] 1`

When some affected subjects are not final descendants, some subsets are incompatible with a variant being IBD among carriers. Assume individual 3, the grand-father of subject 15 in family 15157, is also affected and his genotype is available.

```
fam15157 <- samplePedigrees$secondCousinTriple
fam15157$affected[3] = 1
plot(fam15157)
```

Then the carrier subsets (15,16,17), (15,16) and (15,17) involving subject 15 but not 3 are incompatible with sharing IBD and must be removed from the list of subsets. The code then becomes:

```
carriers = c(3,15,16,17)
carrier.sets = list()
for (i in length(carriers):1)
carrier.sets = c(carrier.sets, combn(carriers,i,simplify=FALSE))
carrier.sets
```

```
## [[1]]
## [1] 3 15 16 17
##
## [[2]]
## [1] 3 15 16
##
## [[3]]
## [1] 3 15 17
##
## [[4]]
## [1] 3 16 17
##
## [[5]]
## [1] 15 16 17
##
## [[6]]
## [1] 3 15
##
## [[7]]
## [1] 3 16
##
## [[8]]
## [1] 3 17
##
## [[9]]
## [1] 15 16
##
## [[10]]
## [1] 15 17
##
## [[11]]
## [1] 16 17
##
## [[12]]
## [1] 3
##
## [[13]]
## [1] 15
##
## [[14]]
## [1] 16
##
## [[15]]
## [1] 17
```

```
carrier.sets = carrier.sets[-c(5,9,10)]
fam15157.pattern.prob = sapply(carrier.sets,function (vec) RVsharing(fam15157,carriers=vec,useAffected=TRUE))
```

`## Probability subjects 3 15 16 17 among 3 15 16 17 share a rare variant: 0.001121`

`## Probability subjects 3 15 16 among 3 15 16 17 share a rare variant: 0.007848`

`## Probability subjects 3 15 17 among 3 15 16 17 share a rare variant: 0.007848`

`## Probability subjects 3 16 17 among 3 15 16 17 share a rare variant: 0.003363`

`## Probability subjects 3 15 among 3 15 16 17 share a rare variant: 0.05493`

`## Probability subjects 3 16 among 3 15 16 17 share a rare variant: 0.02354`

`## Probability subjects 3 17 among 3 15 16 17 share a rare variant: 0.02354`

`## Probability subjects 16 17 among 3 15 16 17 share a rare variant: 0.004484`

`## Probability subjects 3 among 3 15 16 17 share a rare variant: 0.1648`

`## Probability subjects 15 among 3 15 16 17 share a rare variant: 0.2152`

`## Probability subjects 16 among 3 15 16 17 share a rare variant: 0.2466`

`## Probability subjects 17 among 3 15 16 17 share a rare variant: 0.2466`

`fam15157.N = sapply(carrier.sets,length)`

Notice the use of the option *useAffected=TRUE* with affected subjects who are not final descendants. Again, one can check that the distribution sums to 1:

`sum(fam15157.pattern.prob)`

`## [1] 1`

Precomputed sharing probabilities and numbers of carriers can be used directly to obtain p-values of observed sharing events, by summing the probability of all events as or more extreme as the one observed (both in terms of sharing probability and number of carriers), i.e. this is a one-sided exact test. For instance, if subjects 3, 16 and 17 share a rare variant, the probability of that event is

`pobs = RVsharing(fam15157,carriers=c(3,16,17),useAffected=TRUE)`

`## Probability subjects 3 16 17 among 3 15 16 17 share a rare variant: 0.003363`

The p-value of that sharing event is then:

`sum(fam15157.pattern.prob[fam15157.pattern.prob<=pobs & fam15157.N >= 3])`

`## [1] 0.004484305`

The *RVgene* function enables these computations with more than one family harboring the same or different variants. Once the vectors of sharing probabilities and number of carriers have been computed for all families in the sample, they need to be stored in lists. We return to the original second cousin triple family and add a first and second cousin triple family. Then we create the lists of pattern probabilities, number of equivalent subsets and number of carriers in the subsets.

```
fam15157.pattern.prob = c(RVsharing(samplePedigrees$secondCousinTriple,carriers=c(15,16,17)),
RVsharing(samplePedigrees$secondCousinTriple,carriers=c(15,16)),
RVsharing(samplePedigrees$secondCousinTriple,carriers=c(15)))
```

`## Probability subjects 15 16 17 among 15 16 17 share a rare variant: 0.001342`

`## Probability subjects 15 16 among 15 16 17 share a rare variant: 0.009396`

`## Probability subjects 15 among 15 16 17 share a rare variant: 0.3235`

```
fam15157.N = 3:1
fam15157.nequiv = c(1,3,3)
fam28003.pattern.prob = c(RVsharing(samplePedigrees$firstAndSecondCousinsTriple,carriers=c(36,104,110)),
RVsharing(samplePedigrees$firstAndSecondCousinsTriple,carriers=c(36,104)),
RVsharing(samplePedigrees$firstAndSecondCousinsTriple,carriers=c(104,110)),
RVsharing(samplePedigrees$firstAndSecondCousinsTriple,carriers=c(36)),
RVsharing(samplePedigrees$firstAndSecondCousinsTriple,carriers=c(104)))
```

`## Probability subjects 36 104 110 among 36 104 110 share a rare variant: 0.00277`

`## Probability subjects 36 104 among 36 104 110 share a rare variant: 0.00831`

`## Probability subjects 104 110 among 36 104 110 share a rare variant: 0.04155`

`## Probability subjects 36 among 36 104 110 share a rare variant: 0.3352`

`## Probability subjects 104 among 36 104 110 share a rare variant: 0.3019`

```
fam28003.N = c(3,2,2,1,1)
fam28003.nequiv = c(1,2,1,1,2)
ex.pattern.prob.list = list("15157"=fam15157.pattern.prob,"28003"=fam28003.pattern.prob)
ex.nequiv.list = list("15157"=fam15157.nequiv,"28003"=fam28003.nequiv)
ex.N.list = list("15157"=fam15157.N,"28003"=fam28003.N)
```

We now turn to the analysis of variants observed in the simulated genomic sequence of the gene *PEAR1* in a sample of related affected subjects. The processing of the sequence data results in Variant Call Format (VCF) files, which can be read into R with the function *readVcf* from the *variantAnnotatin* package. Two *VCF* objects obtained with *readVcf* from VCF files of sequence data for the second cousin triple and first and second cousin triple families are contained in the *famVCF* data. These VCF files are converted to *snpMatrix* objects using the *genotypeToSnpMatrix* function.

```
data(famVCF)
library(VariantAnnotation)
```

`## Loading required package: BiocGenerics`

`## Loading required package: parallel`

```
##
## Attaching package: 'BiocGenerics'
```

```
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
```

```
## The following objects are masked from 'package:Matrix':
##
## colMeans, colSums, rowMeans, rowSums, which
```

```
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
```

```
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, append,
## as.data.frame, cbind, colMeans, colSums, colnames, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect,
## is.unsorted, lapply, lengths, mapply, match, mget, order, paste,
## pmax, pmax.int, pmin, pmin.int, rank, rbind, rowMeans, rowSums,
## rownames, sapply, setdiff, sort, table, tapply, union, unique,
## unsplit, which, which.max, which.min
```

`## Loading required package: GenomeInfoDb`

`## Loading required package: S4Vectors`

`## Loading required package: stats4`

```
##
## Attaching package: 'S4Vectors'
```

```
## The following object is masked from 'package:Matrix':
##
## expand
```

```
## The following object is masked from 'package:base':
##
## expand.grid
```

`## Loading required package: IRanges`

`## Loading required package: GenomicRanges`

`## Loading required package: SummarizedExperiment`

`## Loading required package: Biobase`

```
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
```

`## Loading required package: DelayedArray`

`## Loading required package: matrixStats`

```
##
## Attaching package: 'matrixStats'
```

```
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
```

```
##
## Attaching package: 'DelayedArray'
```

```
## The following objects are masked from 'package:matrixStats':
##
## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
```

```
## The following object is masked from 'package:base':
##
## apply
```

`## Loading required package: Rsamtools`

`## Loading required package: Biostrings`

`## Loading required package: XVector`

```
##
## Attaching package: 'Biostrings'
```

```
## The following object is masked from 'package:DelayedArray':
##
## type
```

```
## The following object is masked from 'package:base':
##
## strsplit
```

```
##
## Attaching package: 'VariantAnnotation'
```

```
## The following object is masked from 'package:base':
##
## tabulate
```

`fam15157.snp = genotypeToSnpMatrix(fam15157.vcf)`

`## non-single nucleotide variations are set to NA`

`## Warning in .local(x, ...): non-diploid variants are set to NA`

`fam28003.snp = genotypeToSnpMatrix(fam28003.vcf)`

`## non-single nucleotide variations are set to NA`

`## Warning in .local(x, ...): non-diploid variants are set to NA`

*RVgene* requires lists of the *snpMatrix* and *pedigree* objects for these two families. The names given to the elements of these lists are not used by *RVgene* and are thus arbitrary. Family IDs are extracted from the *famid* element of the *pedigree* objects. Please note that currently *RVgene* does not accept a *pedigreeList*, but only a plain list of *pedigree* objects.

```
ex.SnpMatrix.list = list(fam15157=fam15157.snp$genotypes,fam28003=fam28003.snp$genotypes)
ex.ped.obj = list(fam15157=samplePedigrees$secondCousinTriple,fam28003=samplePedigrees$firstAndSecondCousinsTriple)
```

In the sequence segment, one can specify which variants are rare and possibly satisfy other filtering criteria (e.g. coding variants) using the *sites* argument. Here, we will focus on two sites: 92 where the three second cousins of family 15157 share the rare allele and 119 where the two first cousins of family 28003 share the rare allele but not their second cousin.

```
sites = c(92,119)
ex.SnpMatrix.list[["fam15157"]][,sites[1]]@.Data
```

```
## rs187756653
## 1 00
## 2 00
## 3 00
## 4 00
## 5 00
## 6 00
## 7 00
## 8 00
## 9 00
## 10 00
## 11 00
## 12 00
## 13 00
## 14 00
## 15 02
## 16 02
## 17 02
```

`ex.SnpMatrix.list[["fam28003"]][,sites[2]]@.Data`

```
## rs199628333
## 3 00
## 4 00
## 6 00
## 7 00
## 11 00
## 12 00
## 13 00
## 15 00
## 27 00
## 28 00
## 36 01
## 103 00
## 104 02
## 109 00
## 110 02
```

Finally, the call to *RVgene* returns the P-value of the exact rare variant sharing test allowing for sharing by a subset of affected subjects (p), the P-value of the exact rare variant sharing test requiring sharing by all affected subjects (pall) and the minimum achievable p-value if all affected subjects were carriers of a rare variant (potentialp).

```
RVgene(ex.SnpMatrix.list,ex.ped.obj,sites,pattern.prob.list=ex.pattern.prob.list,
nequiv.list=ex.nequiv.list,N.list=ex.N.list,type="count")
```

```
## $p
## [1] 0.000159884
##
## $pall
## [1] 0.001342282
##
## $potentialp
## [1] 3.718232e-06
```

Bureau, A., S. Younkin, M. Parker, J. Bailey-Wilson, M. Marazita, J. Murray, E. Mangold, H. Albacha-Hejazi, T. Beaty, and I. Ruczinski. 2014. “Inferring Rare Disease Risk Variants Based on Exact Probabilities of Sharing by Multiple Affected Relatives.” *Bioinformatics* 30 (15): 2189–96.

Therneau, T., S. Daniel, J. Sinnwell, and E. Atkinson. 2015. “Kinship2: Pedigree Functions.” *CRAN*. https://cran.r-project.org/package=kinship2.