Within omics sample relationship verification

Here we illustrate how to detect data linkage errors with the omicsPrint package using both artificially generated data and experimental data. Several additional vignettes are available that show the use of the package on further experimental data in different settings, i.e., 450k DNA methylation and imputed genotypes.

Create toy data

Here we generate a single vector with 100 randomly drawn integers from the set; 1, 2, 3, representing 100 SNP calls from a single individual. Three additional individuals are generated by randomly swapping a certain fraction of the SNPs. Swapping 5 SNPs will introduce a few mismatches mimicking a situation where the same individual was measured twice (replicate) but with measurement error. Swapping 50% of the SNPs will be similar to the difference in genotypes between parents and offspring. Swapping all SNPs will result in a situation similar to comparing two unrelated individuals.

swap <- function(x, frac=0.05) {
    n <- length(x)
    k <- floor(n*frac)
    x1 <- sample(1:n,k)
    x2 <- sample(1:n,k) ##could be overlapping
    x[x2] <- x[x1]
    x
}
x1 <- 1 + rbinom(100, size=2, prob=1/3)
x2 <- swap(x1, 0.05) ##strongly related e.g. replicate
x3 <- swap(x1, 0.5) ##related e.g. parent off spring
x4 <- swap(x1, 1) ##unrelated
x <- cbind(x1, x2, x3, x4)

Now x contains the 100 SNPs for the four individuals using head we can inspect the first six SNPs.

head(x)
##      x1 x2 x3 x4
## [1,]  1  1  1  2
## [2,]  2  2  2  3
## [3,]  1  1  2  2
## [4,]  1  1  1  3
## [5,]  1  1  1  2
## [6,]  2  2  2  1

Running the allelesharing algorithm

We use Identity by State (IBS) for the set of SNPs to infer sample relations. See Abecasis et al. (2001), for the description of this approach applied to genetic data. Briefly, between each sample pair, the IBS-vector is calculated, which is a measure of genetic distance between individuals. Next, the vector is summarized by its mean and variance. A mean of 2 and variance of 0 indicates that the samples are identical.

data <- alleleSharing(x, verbose=TRUE)
## Hash relations
## Pruning 100 SNPs ...
## 0 SNPs removed because of low call rate!
## 0 samples removed because too few SNPs called!
## Using 100 polymorphic SNPs to determine allele sharing.
## Running `square` IBS algorithm!

The set of SNPs may contain uninformative SNPs, SNPs of bad quality or even SNPs could be missing. The following pruning steps are implemented to yield the most informative set of SNPs (thresholds can be adjusted see ?alleleSharing):

  1. a SNP is remove if missing in >5% of the samples (callRate = 0.95)
  2. a sample should have at least 2/3 of the SNPs called (coverageRate = 2/3)
  3. a SNP is can be removed if it violates the Hardy-Weinberg equilibrium (alpha = 0).
  4. a SNP is can be removed if the minor allele frequency is below the given threshold (maf = 0)

Hardy-Weinberg test-statistics is calculated using a \(\chi^2\)-test and Bonferonni multiple testing correction is performed.

data
##    mean        var colnames.x colnames.y  relation
## 1  2.00 0.00000000         x1         x1 identical
## 2  1.98 0.01979798         x2         x1 unrelated
## 3  1.70 0.23232323         x3         x1 unrelated
## 4  1.34 0.38828283         x4         x1 unrelated
## 5  2.00 0.00000000         x2         x2 identical
## 6  1.68 0.24000000         x3         x2 unrelated
## 7  1.34 0.38828283         x4         x2 unrelated
## 8  2.00 0.00000000         x3         x3 identical
## 9  1.34 0.36808081         x4         x3 unrelated
## 10 2.00 0.00000000         x4         x4 identical

By default no relations are assumed except for the self-self relations.

The output is a data.frame containing all pairwise comparisons with the mean and variance of the IBS over the set of SNPs and the reported sample relationship, including the identifiers.

Report mismatches and provide graphical summary

Since we provided a list of known relations and assume that the majority is correct, we can build a classifier to discover misclassified relations. Linear discriminant analysis is used to generate a confusion-matrix, which is subsequently used to graphically represent the classification boundaries and generate an output file with misclassified sample pairs.

mismatches <- inferRelations(data)
Scatter-plot of IBS mean and variance with classification boundary for pairwise comparison between the samples without specifying sample relationships using artificially generated data.

Scatter-plot of IBS mean and variance with classification boundary for pairwise comparison between the samples without specifying sample relationships using artificially generated data.

mismatches
##   mean        var colnames.x colnames.y  relation predicted
## 2 1.98 0.01979798         x2         x1 unrelated identical

There is one misclassified sample, namely the replicate that we introduced but was not a priori specified as an existing relationship. The true relationship with between sample x1 and sample x2 is an identical relation. Furthermore, we see two sample pairs with mean IBS of 1.98 and variance 0.02 which is an indication that also these pairs share a considerable number of alleles. If known, such relationships can be specified prior to analysis.

relations <- expand.grid(idx = colnames(x), idy= colnames(x))
relations$relation_type <- "unrelated"
relations$relation_type[relations$idx == relations$idy] <- "identical"
relations$relation_type[c(2,5)] <- "identical" ##replicate
relations$relation_type[c(3,7,9,10)] <- "parent offspring"
relations
##    idx idy    relation_type
## 1   x1  x1        identical
## 2   x2  x1        identical
## 3   x3  x1 parent offspring
## 4   x4  x1        unrelated
## 5   x1  x2        identical
## 6   x2  x2        identical
## 7   x3  x2 parent offspring
## 8   x4  x2        unrelated
## 9   x1  x3 parent offspring
## 10  x2  x3 parent offspring
## 11  x3  x3        identical
## 12  x4  x3        unrelated
## 13  x1  x4        unrelated
## 14  x2  x4        unrelated
## 15  x3  x4        unrelated
## 16  x4  x4        identical

Rerun the allelesharing algorithm now provided with the known relations.

data <- alleleSharing(x, relations=relations)
## Hash relations
## Pruning 100 SNPs ...
## 0 SNPs removed because of low call rate!
## 0 samples removed because too few SNPs called!
## Using 100 polymorphic SNPs to determine allele sharing.
## Running `square` IBS algorithm!
data
##    mean        var colnames.x colnames.y         relation
## 1  2.00 0.00000000         x1         x1        identical
## 2  1.98 0.01979798         x2         x1        identical
## 3  1.70 0.23232323         x3         x1 parent offspring
## 4  1.34 0.38828283         x4         x1        unrelated
## 5  2.00 0.00000000         x2         x2        identical
## 6  1.68 0.24000000         x3         x2 parent offspring
## 7  1.34 0.38828283         x4         x2        unrelated
## 8  2.00 0.00000000         x3         x3        identical
## 9  1.34 0.36808081         x4         x3        unrelated
## 10 2.00 0.00000000         x4         x4        identical
mismatches <- inferRelations(data)
Scatter-plot of IBS mean and variance with classification boundaries for pairwise comparison between the samples with specifying sample relationships using artificially generated data.

Scatter-plot of IBS mean and variance with classification boundaries for pairwise comparison between the samples with specifying sample relationships using artificially generated data.

mismatches
## [1] mean       var        colnames.x colnames.y relation   predicted 
## <0 rows> (or 0-length row.names)

All misclassified relations were resolved.

Across omics data type sample relationship verification

The previous example showed how to perform sample relationship verification within a single omics data type. If a second set of SNPs is available obtained from a different omic data type (and the SNPs are partly overlapping), omicsPrint can be used to verify relationships between omics types, e.g. to establish whether two omics data types were indeed generated for the same individual in order to exclude or detect sample mix-ups.

In this artificial example a random subset of 80 SNPs is selected as the set of SNPs from a different omic type. First running without providing the known relations.

rownames(x) <- paste0("rs", 1:100)
y <- x[sample(1:100, 80),]
data <- alleleSharing(x, y)
## Hash relations
## Pruning 100 SNPs ...
## 0 SNPs removed because of low call rate!
## 0 samples removed because too few SNPs called!
## Pruning 80 SNPs ...
## 0 SNPs removed because of low call rate!
## 0 samples removed because too few SNPs called!
## Using 80 polymophic SNPs to determine allele sharing.
## Running `rectangular` IBS algorithm!

Note that pruning is performed on both data types and automatically a set of overlapping SNPs (80) is used provided that the rownames of x and y are identical (this also holds for sample relations where the relation identifiers idx and idy should match with the columnames of x and y).

data
##      mean        var colnames.x colnames.y  relation
## 1  2.0000 0.00000000         x1         x1 identical
## 2  1.9750 0.02468354         x2         x1 unrelated
## 3  1.7125 0.23275316         x3         x1 unrelated
## 4  1.3500 0.38227848         x4         x1 unrelated
## 5  1.9750 0.02468354         x1         x2 unrelated
## 6  2.0000 0.00000000         x2         x2 identical
## 7  1.6875 0.24287975         x3         x2 unrelated
## 8  1.3500 0.38227848         x4         x2 unrelated
## 9  1.7125 0.23275316         x1         x3 unrelated
## 10 1.6875 0.24287975         x2         x3 unrelated
## 11 2.0000 0.00000000         x3         x3 identical
## 12 1.3625 0.36060127         x4         x3 unrelated
## 13 1.3500 0.38227848         x1         x4 unrelated
## 14 1.3500 0.38227848         x2         x4 unrelated
## 15 1.3625 0.36060127         x3         x4 unrelated
## 16 2.0000 0.00000000         x4         x4 identical
mismatches <- inferRelations(data)