1 Introduction

1.1 Background

Differential Allelic Representation (DAR) describes a situation commonly encountered in RNA-seq experiments involving organisms that are not isogenic, and where modern techniques such as CRISPR have been used for generation of individual organisms. DAR compromises the integrity of Differential Expression analysis results as it can bias expression, influencing the classification of genes (or transcripts) as being differentially expressed (DE). The concern does not lie within the statistical measures for detecting differential expression, because the underlying biology does indeed result in differential expression. However, without DAR analysis, the impacts of DAR are confounded with the experimental condition of interest and features can be incorrectly inferred as DE in response to the experimental condition, when they are actually an artefact of DAR. DAR analysis is therefore recommended as a complementary technique alongside Differential Expression analysis where non-isogenic conditions are present.

DAR occurs when the construction of experimental sample groups unexpectedly results in an unequal distribution of polymorphic alleles between the groups. This is often due to random chance when selecting group composition, however, may also be compounded by a study’s experimental design (described further below). When unequally-represented polymorphic alleles are also expression quantitative trait loci (eQTLs), expression differences between experimental groups are observed similarly to that of a functional response to an experimental condition (Figure 1). Analysis of gene expression in absence of the consideration of DAR is problematic as the source of differential expression will be unclear.

Two confounding scenarios that result in differential gene expression between experimental groups when samples are not isogenic. *Experimental grouping is illustrated as mutant vs wild-type control, but is also applicable for other experimental designs (e.g. treatment vs control). The key aspect here is that polymorphic regions of the homologous chromosomes differ between the experimental groups. **A)** Functional differential expression. The experimental factor (genotype of the gene-of-interest, GOI) has a functional interaction with a bystander gene (linked gene, LG), resulting in expression differences between the experimental groups. This is the assumed mechanism when inferring differential expression outcomes. However, a lack of between-group isogenicity complicates this assumption. **B)** eQTL-driven differential expression. A polymorphism in the promoter region of mutant samples, for example, causes expression differences between mutant and wild-type groups in the absence of a functional interaction between GOI and LG. This may incorrectly be inferred as differential expression due to the experimental condition.*

Figure 1: Two confounding scenarios that result in differential gene expression between experimental groups when samples are not isogenic
Experimental grouping is illustrated as mutant vs wild-type control, but is also applicable for other experimental designs (e.g. treatment vs control). The key aspect here is that polymorphic regions of the homologous chromosomes differ between the experimental groups. A) Functional differential expression. The experimental factor (genotype of the gene-of-interest, GOI) has a functional interaction with a bystander gene (linked gene, LG), resulting in expression differences between the experimental groups. This is the assumed mechanism when inferring differential expression outcomes. However, a lack of between-group isogenicity complicates this assumption. B) eQTL-driven differential expression. A polymorphism in the promoter region of mutant samples, for example, causes expression differences between mutant and wild-type groups in the absence of a functional interaction between GOI and LG. This may incorrectly be inferred as differential expression due to the experimental condition.

The presence of DAR is influenced by the nature of an experiment’s design. In studies that select experimental groups from a common sample pool, for example “treatment vs control” designs, DAR is predominantly driven by the isogenicity within the initial pool and the stochastic selection of samples. However, in studies involving the selection of groups based on a genetic feature, for example “mutant vs wild-type”, the presence of DAR is often intensified on the chromosome(s) containing the determining feature. This is because the selection criteria also drives selection for alleles linked to the determining feature (Figure 2).

DAR is frequently encountered in studies that involve experimental group selection based on the presence of a genetic feature, and is intensified on the chromosome containing the feature. *This figure illustrates the scenario where experimental sample groups are selected based on the presence of a mutant locus. Zebrafish are used as an example to signify that the model organism is not isogenic. **A)** A common breeding scheme for an intrafamily homozygous mutant vs wild-type transcriptome comparison. As a result of selection across multiple generations for the mutant chromosome (indicated in red), which originates from a single F~0~ fish (not pictured), homozygous mutant F~y~ fish posess two exact copies of the chromosome harbouring the mutation, disregarding recombination. However, wild-type F~y~ fish likely posses two different wild-type chromosomes (shaded using different stripe pattterns to indicate they are not isogenic). **B)** Experimental selection of progeny homozygous for a mutant allele involves increased homozygosity for alleles of genes linked to that mutation (i.e. on the same chromosome). High DAR is observed at the eQTL location, resulting in expression differences between the groups that are unrelated to the impact of the mutation. Moderate DAR is observed at the polymorphic locus, due to one of the wild-type chromosomes posessing the same polymorphism. If this polymorphism was also an eQTL, expression differences would be observed to a lesser extent.*

Figure 2: DAR is frequently encountered in studies that involve experimental group selection based on the presence of a genetic feature, and is intensified on the chromosome containing the feature
This figure illustrates the scenario where experimental sample groups are selected based on the presence of a mutant locus. Zebrafish are used as an example to signify that the model organism is not isogenic. A) A common breeding scheme for an intrafamily homozygous mutant vs wild-type transcriptome comparison. As a result of selection across multiple generations for the mutant chromosome (indicated in red), which originates from a single F0 fish (not pictured), homozygous mutant Fy fish posess two exact copies of the chromosome harbouring the mutation, disregarding recombination. However, wild-type Fy fish likely posses two different wild-type chromosomes (shaded using different stripe pattterns to indicate they are not isogenic). B) Experimental selection of progeny homozygous for a mutant allele involves increased homozygosity for alleles of genes linked to that mutation (i.e. on the same chromosome). High DAR is observed at the eQTL location, resulting in expression differences between the groups that are unrelated to the impact of the mutation. Moderate DAR is observed at the polymorphic locus, due to one of the wild-type chromosomes posessing the same polymorphism. If this polymorphism was also an eQTL, expression differences would be observed to a lesser extent.

DAR analysis results in an easy-to-interpret value between 0 and 1 for each genetic feature of interest, where 0 represents identical allelic representation and 1 represents complete diversity. This metric can be used to identify features prone to false-positive calls in Differential Expression analysis, and can be leveraged with statistical methods to alleviate the impact of such artefacts on RNA-seq data.

1.2 Further reading

  • The methodologies of this package were developed and described in Baer et al. (2023).
  • This phenomenon was flagged as problematic in White et al. (2022).

2 Setup

2.1 Installation

BiocManager is recommended for the installation of packages required in this vignette. BiocManager handles the installation of packages from both the CRAN and Bioconductor repositories.

if (!"BiocManager" %in% rownames(installed.packages()))
    install.packages("BiocManager")
pkgs <- c("tidyverse", "limma", "tadar")
BiocManager::install(pkgs, update = FALSE)

Now we can load the required packages.

library(tidyverse)
library(limma)
library(tadar)

2.2 Data

The example data used in this vignette is provided within the tadar package.

The VCF file chr1.vcf.bgz contains multi-sample genotype calls produced from raw RNA-seq FASTQ data. The data originates from zebrafish (Danio rerio) brain transcriptomes, and further information can be found in the associated manuscript by Gerken et al. (2023). The VCF has been modified to remove headers and other data that are not essential for example purposes, and has been subset to a single chromosome (Chromosome 1).

The GRanges object chr1_genes contains gene feature information from the Ensembl database (release version 101) for Danio rerio Chromosome 1 (Cunningham et al. (2022)).

vcf <- system.file("extdata", "chr1.vcf.bgz", package = "tadar")
data("chr1_genes")

3 DAR analysis

DAR analysis is performed by incorporating single nucleotide-level genotype calls from variant calling software. We recommend the GATK best practices workflow for RNAseq short variant discovery (SNPs + Indels) as a reference for the generation of data required to begin DAR analysis. However, should individual genomes be also available for the experimental samples, these can add significant important information. Ultimately, we require a multi-sample VCF file with each entry representing the genomic location of a single nucleotide polymorphism (SNP) that differs from the reference genome in at least one sample.

The functions contained in this package are intended to be implemented as a sequential workflow, with each function addressing a discrete step in the data processing/analysis procedure. Please follow the steps in order as outlined below.

3.1 Loading genotype data

Genotype data from a VCF file is parsed into a GRanges object using the readGenotypes() function. This function is essentially a wrapper to VariantAnnotation::readVcf(), but only loads the data required for DAR analysis. By default, phasing information is removed as it is not required for DAR analysis and complicates downstream processing. This simply converts genotype calls represented as, for example 0|1 to 0/1, and is required if proceeding with the DAR analysis workflow. This can optionally be turned off with the unphase argument if this function is intended to be used for alternative purposes. The genome option is also available to override the genome automatically detected from the VCF. We intend to work with multiple GRanges objects and keeping the genome consistent avoids downstream errors. Additional arguments will be passed directly to VariantAnnotation::readVcf().

genotypes <- readGenotypes(file = vcf)
#> GRanges object with 6 ranges and 13 metadata columns:
#>       seqnames    ranges strand |     sample1     sample2     sample3     sample4     sample5     sample6     sample7     sample8     sample9    sample10    sample11    sample12    sample13
#>          <Rle> <IRanges>  <Rle> | <character> <character> <character> <character> <character> <character> <character> <character> <character> <character> <character> <character> <character>
#>   [1]        1      6698      * |         0/1         0/1         0/1         0/1         0/1         0/1         1/1         0/1         0/1         0/1         1/1         0/1         0/1
#>   [2]        1      7214      * |         ./.         ./.         ./.         ./.         ./.         ./.         ./.         ./.         ./.         ./.         ./.         ./.         1/1
#>   [3]        1      7218      * |         ./.         ./.         ./.         ./.         ./.         ./.         ./.         ./.         ./.         ./.         ./.         ./.         1/1
#>   [4]        1      7274      * |         ./.         0/0         ./.         ./.         ./.         ./.         ./.         ./.         ./.         ./.         ./.         0/0         0/1
#>   [5]        1      9815      * |         ./.         ./.         ./.         ./.         ./.         0/0         ./.         ./.         ./.         ./.         1/1         ./.         ./.
#>   [6]        1      9842      * |         ./.         ./.         ./.         ./.         ./.         0/0         ./.         ./.         ./.         ./.         1/1         ./.         ./.
#>   -------
#>   seqinfo: 1 sequence from GRCz11 genome

Genotypes are reported as numeric indices. 0 indicates the reference allele, 1 is the first alternate allele, 2 is the second alternate allele, and 3 is the third alternate allele. Even though we are working with a diploid organism, four alleles are theoretically possible as the VCF consists of multiple samples. Genotypes that could not be determined by the variant calling software are reported as ./.. We work directly with the indices as they are consistent across all samples for a single variant position.

3.2 Counting alleles

We aim to calculate a DAR value at each suitable variant locus. This requires us to firstly summarise the genotype data into counts of the alleles reported at each variant locus. First we define our sample grouping structure as a list, where each element contains a character vector of samples within a single group.

groups <- list(
    group1 = paste0("sample", 1:6),
    group2 = paste0("sample", 7:13)
)

Now at each locus we can count the number of alleles that exist within each group with countAlleles(). This returns a GRangesList with each element corresponding to a different sample group. Columns indicate how many samples had called alleles, how many were missing and the counts of reference allele, followed by alternate alleles n_1, n_2 and n_3 ensuring multi-allelic sites are retained.

counts <- countAlleles(genotypes = genotypes, groups = groups)
#> GRangesList object of length 2:
#> $group1
#> GRanges object with 16994 ranges and 6 metadata columns:
#>           seqnames    ranges strand |  n_called n_missing       n_0       n_1       n_2       n_3
#>              <Rle> <IRanges>  <Rle> | <integer> <integer> <integer> <integer> <integer> <integer>
#>       [1]        1      6698      * |        12         0         6         6         0         0
#>       [2]        1      7214      * |         0        12         0         0         0         0
#>       [3]        1      7218      * |         0        12         0         0         0         0
#>       [4]        1      7274      * |         2        10         2         0         0         0
#>       [5]        1      9815      * |         2        10         2         0         0         0
#>       ...      ...       ...    ... .       ...       ...       ...       ...       ...       ...
#>   [16990]        1  59566458      * |         4         8         4         0         0         0
#>   [16991]        1  59567075      * |         6         6         5         1         0         0
#>   [16992]        1  59567620      * |         2        10         2         0         0         0
#>   [16993]        1  59570107      * |         0        12         0         0         0         0
#>   [16994]        1  59570454      * |         2        10         2         0         0         0
#>   -------
#>   seqinfo: 1 sequence from GRCz11 genome
#> 
#> $group2
#> GRanges object with 16994 ranges and 6 metadata columns:
#>           seqnames    ranges strand |  n_called n_missing       n_0       n_1       n_2       n_3
#>              <Rle> <IRanges>  <Rle> | <integer> <integer> <integer> <integer> <integer> <integer>
#>       [1]        1      6698      * |        14         0         5         9         0         0
#>       [2]        1      7214      * |         2        12         0         2         0         0
#>       [3]        1      7218      * |         2        12         0         2         0         0
#>       [4]        1      7274      * |         4        10         3         1         0         0
#>       [5]        1      9815      * |         2        12         0         2         0         0
#>       ...      ...       ...    ... .       ...       ...       ...       ...       ...       ...
#>   [16990]        1  59566458      * |         4        10         3         1         0         0
#>   [16991]        1  59567075      * |         4        10         4         0         0         0
#>   [16992]        1  59567620      * |         6         8         6         0         0         0
#>   [16993]        1  59570107      * |         0        14         0         0         0         0
#>   [16994]        1  59570454      * |         2        12         0         2         0         0
#>   -------
#>   seqinfo: 1 sequence from GRCz11 genome

3.3 Removing undesired loci

Not all samples were assigned genotypes during the variant calling procedure. Loci with a large number of assigned genotypes across all samples possess information that can be used to more accurately determine differences in allelic representation between experimental groups. We can remove the less informative data by filtering variant loci independently within each sample group with the filterLoci() function. The filter argument controls the criterion for selection of loci that we want to keep, by providing a logical expression using the metadata column names of our counts object (i.e. n_called, n_missing, n_0, n_1, n_2 and n_3). By default, filterLoci() keeps loci that satisfy the criterion: number of samples with called genotypes > number of samples with missing genotypes.

counts_filt <- filterLoci(counts = counts)
#> GRangesList object of length 2:
#> $group1
#> GRanges object with 10701 ranges and 6 metadata columns:
#>           seqnames    ranges strand |  n_called n_missing       n_0       n_1       n_2       n_3
#>              <Rle> <IRanges>  <Rle> | <integer> <integer> <integer> <integer> <integer> <integer>
#>       [1]        1      6698      * |        12         0         6         6         0         0
#>       [2]        1     10093      * |        12         0         6         6         0         0
#>       [3]        1     13098      * |        12         0        12         0         0         0
#>       [4]        1     13300      * |        12         0        12         0         0         0
#>       [5]        1     13558      * |        12         0        12         0         0         0
#>       ...      ...       ...    ... .       ...       ...       ...       ...       ...       ...
#>   [10697]        1  59563688      * |         8         4         6         2         0         0
#>   [10698]        1  59563996      * |        10         2        10         0         0         0
#>   [10699]        1  59564693      * |        10         2         8         2         0         0
#>   [10700]        1  59565427      * |         8         4         8         0         0         0
#>   [10701]        1  59565445      * |         8         4         8         0         0         0
#>   -------
#>   seqinfo: 1 sequence from GRCz11 genome
#> 
#> $group2
#> GRanges object with 12199 ranges and 6 metadata columns:
#>           seqnames    ranges strand |  n_called n_missing       n_0       n_1       n_2       n_3
#>              <Rle> <IRanges>  <Rle> | <integer> <integer> <integer> <integer> <integer> <integer>
#>       [1]        1      6698      * |        14         0         5         9         0         0
#>       [2]        1     10093      * |        14         0         9         5         0         0
#>       [3]        1     13098      * |        14         0        13         1         0         0
#>       [4]        1     13300      * |        14         0        14         0         0         0
#>       [5]        1     13558      * |        14         0        14         0         0         0
#>       ...      ...       ...    ... .       ...       ...       ...       ...       ...       ...
#>   [12195]        1  59564693      * |        12         2        11         1         0         0
#>   [12196]        1  59565850      * |        10         4         8         2         0         0
#>   [12197]        1  59565874      * |        12         2        10         2         0         0
#>   [12198]        1  59566219      * |        10         4        10         0         0         0
#>   [12199]        1  59566270      * |         8         6         4         4         0         0
#>   -------
#>   seqinfo: 1 sequence from GRCz11 genome

Specifying a filter with high stringency will result in greater removal of loci, but more accurate downstream results. For example, we can remove all loci where at least one sample does not possess a genotype call, resulting in complete information of allelic representation in each sample group.

counts_filt_2 <- filterLoci(counts = counts, filter = n_missing == 0)
#> GRangesList object of length 2:
#> $group1
#> GRanges object with 7551 ranges and 6 metadata columns:
#>          seqnames    ranges strand |  n_called n_missing       n_0       n_1       n_2       n_3
#>             <Rle> <IRanges>  <Rle> | <integer> <integer> <integer> <integer> <integer> <integer>
#>      [1]        1      6698      * |        12         0         6         6         0         0
#>      [2]        1     10093      * |        12         0         6         6         0         0
#>      [3]        1     13098      * |        12         0        12         0         0         0
#>      [4]        1     13300      * |        12         0        12         0         0         0
#>      [5]        1     13558      * |        12         0        12         0         0         0
#>      ...      ...       ...    ... .       ...       ...       ...       ...       ...       ...
#>   [7547]        1  59332656      * |        12         0        11         1         0         0
#>   [7548]        1  59335556      * |        12         0         6         6         0         0
#>   [7549]        1  59335583      * |        12         0         7         5         0         0
#>   [7550]        1  59335775      * |        12         0        12         0         0         0
#>   [7551]        1  59335805      * |        12         0         9         3         0         0
#>   -------
#>   seqinfo: 1 sequence from GRCz11 genome
#> 
#> $group2
#> GRanges object with 8425 ranges and 6 metadata columns:
#>          seqnames    ranges strand |  n_called n_missing       n_0       n_1       n_2       n_3
#>             <Rle> <IRanges>  <Rle> | <integer> <integer> <integer> <integer> <integer> <integer>
#>      [1]        1      6698      * |        14         0         5         9         0         0
#>      [2]        1     10093      * |        14         0         9         5         0         0
#>      [3]        1     13098      * |        14         0        13         1         0         0
#>      [4]        1     13300      * |        14         0        14         0         0         0
#>      [5]        1     13558      * |        14         0        14         0         0         0
#>      ...      ...       ...    ... .       ...       ...       ...       ...       ...       ...
#>   [8421]        1  59335556      * |        14         0         9         5         0         0
#>   [8422]        1  59335583      * |        14         0         9         5         0         0
#>   [8423]        1  59335775      * |        14         0        13         1         0         0
#>   [8424]        1  59406400      * |        14         0        12         2         0         0
#>   [8425]        1  59559081      * |        14         0        10         4         0         0
#>   -------
#>   seqinfo: 1 sequence from GRCz11 genome

Note that because the filter is applied within each experimental sample group, we now have loci that are present in one group but not the other. This allows greater flexibility for more complex experimental designs that involve multiple comparisons between a greater number of sample groups than provided in this example. Ultimately, comparisons of allelic representation will be performed only on the intersection of loci present in each sample group.

3.4 Normalisation of allele counts

Depending on the stringency of the previously applied filter, it’s still possible that not every sample will have a genotype call. One experimental group may also contain more samples than another, which is the case in this example, where group1 contains 6 samples and group2 contains 7 samples. We account for this by normalising the allele counts as a proportion of total counts at each locus.

props <- countsToProps(counts = counts_filt)
#> GRangesList object of length 2:
#> $group1
#> GRanges object with 10701 ranges and 4 metadata columns:
#>           seqnames    ranges strand |    prop_0    prop_1    prop_2    prop_3
#>              <Rle> <IRanges>  <Rle> | <numeric> <numeric> <numeric> <numeric>
#>       [1]        1      6698      * |       0.5       0.5         0         0
#>       [2]        1     10093      * |       0.5       0.5         0         0
#>       [3]        1     13098      * |       1.0       0.0         0         0
#>       [4]        1     13300      * |       1.0       0.0         0         0
#>       [5]        1     13558      * |       1.0       0.0         0         0
#>       ...      ...       ...    ... .       ...       ...       ...       ...
#>   [10697]        1  59563688      * |      0.75      0.25         0         0
#>   [10698]        1  59563996      * |      1.00      0.00         0         0
#>   [10699]        1  59564693      * |      0.80      0.20         0         0
#>   [10700]        1  59565427      * |      1.00      0.00         0         0
#>   [10701]        1  59565445      * |      1.00      0.00         0         0
#>   -------
#>   seqinfo: 1 sequence from GRCz11 genome
#> 
#> $group2
#> GRanges object with 12199 ranges and 4 metadata columns:
#>           seqnames    ranges strand |    prop_0    prop_1    prop_2    prop_3
#>              <Rle> <IRanges>  <Rle> | <numeric> <numeric> <numeric> <numeric>
#>       [1]        1      6698      * |  0.357143 0.6428571         0         0
#>       [2]        1     10093      * |  0.642857 0.3571429         0         0
#>       [3]        1     13098      * |  0.928571 0.0714286         0         0
#>       [4]        1     13300      * |  1.000000 0.0000000         0         0
#>       [5]        1     13558      * |  1.000000 0.0000000         0         0
#>       ...      ...       ...    ... .       ...       ...       ...       ...
#>   [12195]        1  59564693      * |  0.916667 0.0833333         0         0
#>   [12196]        1  59565850      * |  0.800000 0.2000000         0         0
#>   [12197]        1  59565874      * |  0.833333 0.1666667         0         0
#>   [12198]        1  59566219      * |  1.000000 0.0000000         0         0
#>   [12199]        1  59566270      * |  0.500000 0.5000000         0         0
#>   -------
#>   seqinfo: 1 sequence from GRCz11 genome

3.5 Calculating DAR

Now that we have normalised values of allelic representation at each variant locus within our sample groups, we can calculate the DAR metric between our experimental groups. We require at least two sample groups to proceed with DAR analysis, however we can set up more comparisons depending on how many groups are present. We define these comparisons as a contrast matrix.

contrasts <- matrix(
    data = c(1, -1),
    dimnames = list(
        Levels = c("group1", "group2"),
        Contrasts = c("group1v2")
    )
)

Alternatively, this can be simplified with the makeContrasts() function from the limma package (Ritchie et al. (2015)).

contrasts <- makeContrasts(
    group1v2 = group1 - group2,
    levels = names(groups)
)

DAR is calculated by firstly determining the Euclidean distance between allelic proportions of the contrasted sample groups. The Euclidean distance is then converted to the DAR metric by dividing by the maximum possible distance, \(\sqrt{2}\), resulting in an easy-to-interpret value between 0 and 1, where 0 represents identical allelic representation and 1 represents complete diversity. This is handled within the dar() function by passing our allelic proportions and intended contrasts as arguments.

Two types of DAR values are reported by the dar() function as metadata columns of the resulting GRanges objects:

  • dar_origin: The raw DAR values calculated at single nucleotide positions (the origin) between sample groups. These values represent DAR estimates at a precise locus.
  • dar_region: The mean of raw DAR values in a specified region surrounding the origin. This is optionally returned using either of the region_fixed or region_loci arguments, which control the strategy and size for establishing regions (more information below). This option exists because eQTLs don’t necessarily confer their effects on genes in close proximity. Therefore, DAR estimates that are representative of regions may be more suitable for downstream assignment DAR values to genomic features.

region_fixed and region_loci are optional arguments that offer users dar_region values in conjunction with dar_origin values. While only one of these two arguments is required, if both are provided, region_fixed takes precedence. Understanding the distinctions in how each option defines regions around an origin is important, as their selection carries subjective implications downstream.

  • region_fixed: Establishes a region with a fixed size around the origin. The total region size is defined in base pairs, meaning that the region will extend region_fixed / 2 base pairs either side of the origin. For example if region_fixed = 101 is specified, an origin at position 500 will have an associated region spanning positions 450-550.
  • region_loci: Establishes an elastic region to average the specified number of loci with dar_origin values, such that this number of values are included. Note that whilst the number of points around the central value may be equal, the genomic distances may not be. In contrast to region_fixed, region_loci results in DAR estimates that cover a substantial proportion of each chromosome (i.e from the first locus to the last), which is useful for downstream assignment of DAR values to a greater number of genomic features.

The remainder of this vignette uses examples produced from specifying the region_loci argument.

region_loci <- 5

With a chosen elastic region size of 5 loci, this will smooth the DAR metric at each origin locus with the DAR values of the 2 loci either side.

dar <- dar(props = props, contrasts = contrasts, region_loci = region_loci)
#> GRangesList object of length 1:
#> $group1v2
#> GRanges object with 10369 ranges and 2 metadata columns:
#>           seqnames    ranges strand | dar_origin dar_region
#>              <Rle> <IRanges>  <Rle> |  <numeric>  <numeric>
#>       [1]        1      6698      * |  0.1428571         NA
#>       [2]        1     10093      * |  0.1428571         NA
#>       [3]        1     13098      * |  0.0714286  0.0714286
#>       [4]        1     13300      * |  0.0000000  0.0547619
#>       [5]        1     13558      * |  0.0000000  0.1061905
#>       ...      ...       ...    ... .        ...        ...
#>   [10365]        1  59562848      * |   0.000000  0.1300000
#>   [10366]        1  59563684      * |   0.150000  0.0800000
#>   [10367]        1  59563688      * |   0.150000  0.0833333
#>   [10368]        1  59563996      * |   0.000000         NA
#>   [10369]        1  59564693      * |   0.116667         NA
#>   -------
#>   seqinfo: 1 sequence from GRCz11 genome

We now have DAR values for loci shared between the two sample groups of the defined contrasts. Each element of the resulting GRangesList object represents a single contrast and are named as defined in our contrasts object.

3.6 Assigning DAR values to features

The final step of DAR analysis involves assigning DAR values to features of interest. It makes sense to select features that were tested for differential expression, because this step provides an estimate of the potential for eQTL impacts on a feature’s expression. This is performed with the assignFeatureDar() function, passing the features of interest as a GRanges object. In this example we use the genes contained in chr1_genes, which was loaded earlier.

The dar_val argument controls whether origin or region DAR values are used when assigned to features. In the code below we specify dar_val = "origin", as the ranges of our GRanges objects held in dar currently represent those associated with origin DAR values (read further to see the alternative scenario). The fill_missing argument controls what value is assigned to features with no overlaps. This defaults to NA for easy filtering of features with unassigned DAR values.

For each feature, assignFeatureDar() takes the mean of DAR values for any associated ranges that overlap the feature range. With the following configuration, it means the resulting assigned DAR values represent the average DAR that exists solely within the feature.

gene_dar <- assignFeatureDar(
    dar = dar,
    features = chr1_genes,
    dar_val = "origin",
    fill_missing = NA
)
#> GRangesList object of length 1:
#> $group1v2
#> GRanges object with 1456 ranges and 3 metadata columns:
#>          seqnames            ranges strand |            gene_id      gene_name       dar
#>             <Rle>         <IRanges>  <Rle> |        <character>    <character> <numeric>
#>      [1]        1        6408-12027      - | ENSDARG00000099104          rpl24 0.1428571
#>      [2]        1       11822-16373      + | ENSDARG00000102407          cep97 0.1352381
#>      [3]        1       18716-23389      + | ENSDARG00000102097         nfkbiz 0.1622024
#>      [4]        1       25585-27255      + | ENSDARG00000099319     CU651657.1        NA
#>      [5]        1       27690-34330      + | ENSDARG00000099640            eed 0.0794872
#>      ...      ...               ...    ... .                ...            ...       ...
#>   [1452]        1 59525648-59528626      - | ENSDARG00000105092     FP236157.3        NA
#>   [1453]        1 59533317-59539898      + | ENSDARG00000099880            sp6        NA
#>   [1454]        1 59542001-59542908      - | ENSDARG00000102872     FP236157.2        NA
#>   [1455]        1 59555936-59567685      - | ENSDARG00000098899 zmp:0000001082 0.0810516
#>   [1456]        1 59569516-59571785      - | ENSDARG00000101364        wfikkn1        NA
#>   -------
#>   seqinfo: 1 sequence from GRCz11 genome

Whilst eQTLs don’t necessarily exist within a feature itself, smoothed values within a feature may still be representative of the surrounding region as inheritied haplotypes. Additionally, we can opt to use the region DAR values (dar_region metadata column in the dar object) to assign DAR from the regions surrounding a feature to increase the amount of smoothing. The size of this region is controlled with the region_loci argument in the dar() function, which we previously set as 5.

To assign DAR values based on regions, we must firstly utilise the flipRanges() function. We can also specify the extend_edges option to extend the outermost ranges of each chromosome to encompass the entire chromosome. This is useful for ensuring the assignment of DAR values to all features. However, caution should be taken with features that exist toward the edges of a chromosome as any assigned DAR values may be less accurate.

dar_regions <- flipRanges(dar = dar, extend_edges = TRUE)
#> GRangesList object of length 1:
#> $group1v2
#> GRanges object with 10365 ranges and 3 metadata columns:
#>     seqnames            ranges strand | dar_origin dar_region origin_ranges
#>        <Rle>         <IRanges>  <Rle> |  <numeric>  <numeric>     <IRanges>
#>   1        1           1-13558      * |  0.0714286  0.0714286         13098
#>   1        1       10093-14083      * |  0.0000000  0.0547619         13300
#>   1        1       13098-14740      * |  0.0000000  0.1061905         13558
#>   1        1       13300-15050      * |  0.0595238  0.1204762         14083
#>   1        1       13558-15772      * |  0.4000000  0.1490476         14740
#>   .      ...               ...    ... .        ...        ...           ...
#>   1        1 59559120-59562848      * |       0.25  0.1000000      59559963
#>   1        1 59559331-59563684      * |       0.10  0.1200000      59559970
#>   1        1 59559963-59563688      * |       0.00  0.1300000      59562848
#>   1        1 59559970-59563996      * |       0.15  0.0800000      59563684
#>   1        1 59562848-59578282      * |       0.15  0.0833333      59563688
#>   -------
#>   seqinfo: 1 sequence from GRCz11 genome

flipRanges() can also be used to revert back to ranges that represent the origins.

identical(dar, flipRanges(dar_regions))
#> [1] TRUE

Now we can assign DAR values to features based on their surrounding region, by supplying our new ranges and selecting dar_val = region. A warning will be produced if the ranges don’t match the selected dar_val, as this is likely unintended by the user. However, use cases may exist and the warning can be ignored if this is intended.

gene_dar_regions <- assignFeatureDar(
    dar = dar_regions,
    features = chr1_genes,
    dar_val = "region"
)
#> GRangesList object of length 1:
#> $group1v2
#> GRanges object with 1456 ranges and 3 metadata columns:
#>          seqnames            ranges strand |            gene_id      gene_name       dar
#>             <Rle>         <IRanges>  <Rle> |        <character>    <character> <numeric>
#>      [1]        1        6408-12027      - | ENSDARG00000099104          rpl24 0.0630952
#>      [2]        1       11822-16373      + | ENSDARG00000102407          cep97 0.1467262
#>      [3]        1       18716-23389      + | ENSDARG00000102097         nfkbiz 0.1711012
#>      [4]        1       25585-27255      + | ENSDARG00000099319     CU651657.1 0.1585714
#>      [5]        1       27690-34330      + | ENSDARG00000099640            eed 0.0833333
#>      ...      ...               ...    ... .                ...            ...       ...
#>   [1452]        1 59525648-59528626      - | ENSDARG00000105092     FP236157.3 0.0966071
#>   [1453]        1 59533317-59539898      + | ENSDARG00000099880            sp6 0.0966071
#>   [1454]        1 59542001-59542908      - | ENSDARG00000102872     FP236157.2 0.0966071
#>   [1455]        1 59555936-59567685      - | ENSDARG00000098899 zmp:0000001082 0.0945833
#>   [1456]        1 59569516-59571785      - | ENSDARG00000101364        wfikkn1 0.0833333
#>   -------
#>   seqinfo: 1 sequence from GRCz11 genome

Note that because elastic regions were defined, and extend_edges was set to TRUE, all genes in chr1_genes now have an assigned DAR value.

3.7 DAR analysis using the pipe operator

The functions in this package have been designed to enable DAR analysis with minimal effort. The steps above can be performed in several lines of code as follows.

First load the required packages.

library(tidyverse)
library(limma)
library(tadar)

Then define the input objects.

vcf <- system.file("extdata", "chr1.vcf.bgz", package="tadar")
data("chr1_genes")
groups <- list(
    group1 = paste0("sample", 1:6),
    group2 = paste0("sample", 7:13)
)
contrasts <- makeContrasts(
    group1v2 = group1 - group2,
    levels = names(groups)
)
region_loci <- 5

Now use the base pipe to assign DAR values to genes based on their surrounding region.

gene_dar <- readGenotypes(vcf) |>
    countAlleles(groups = groups) |>
    filterLoci() |>
    countsToProps() |>
    dar(contrasts = contrasts, region_loci = region_loci) |>
    flipRanges(extend_edges = TRUE) |>
    assignFeatureDar(features = chr1_genes, dar_val = "region")

4 Visualisation

4.1 The trend in DAR along a chromosome

The plotChrDar() function produces a localised visualisation of the trend in DAR along a chromosomal axis with the option to overlay the positions of features of interest. This allows us to compare DAR between regions and/or chromosomes, and visually assess if features may be prone eQTL artefacts. plotChrDar() is a convenience function that returns a GViz plot consisting of an optional number of tracks. To produce a minimal plot, this function requires the dar argument, expecting a GRanges object with DAR values in the metadata columns, and the dar_val argument as a character string to select the appropriate DAR values from either the dar_origin or dar_region metadata columns. These arguments form the DataTrack displaying the trend in DAR. Features of interest (e.g. a mutation locus) can be passed to the foi argument as a GRanges object and will be plotted along the GenomeAxisTrack. Additional features (e.g. DE genes) can similarly be passed to the features argument to be plotted as a separate AnnotationTrack. The *_anno and *_highlight arguments are used to select the metadata column containing feature labels, and to choose if their positions should be highlighted over the DataTrack displaying the trend in DAR respectively. We can also add a title with the title argument. GViz requires that any GRanges objects consist of ranges on a single chromosome, however these can be internally subset using the chr argument (which will also control the title of the GenomeAxisTrack).

set.seed(230822)
foi <- sample(chr1_genes, 1)
features <- sample(chr1_genes, 20)
plotChrDar(
    dar = dar$group1v2, dar_val = "region", chr = "1",
    foi = foi, foi_anno = "gene_name", foi_highlight = TRUE,
    features = features, features_anno = "gene_name", features_highlight = TRUE,
    title = "Example plot of DAR along Chromosome 1"
)

4.2 Comparing DAR between chromosomes

It may be of use to quickly assess whether a particular chromosome is affected by high DAR more so than other chromosomes. This is particularly relevant when a study’s experimental design causes increased DAR on a particular chromosome. The plotDarECDF() function facilitates this purpose by plotting the Empirical Cumulative Distribution Function (ECDF) of DAR for each chromosome. This function returns a ggplot2 object, allowing us to add our own styling. For this example we use simulated data that results in a visualisation commonly observed when experimental sample groups have been constructed based on the presence/absence of a genetic feature of interest (e.g. a mutant locus).

set.seed(230704)
simulate_dar <- function(n, mean) {
    vapply(
        rnorm(n = n, mean = mean),
        function(x) exp(x) / (1 + exp(x)),
        numeric(1)
    )
}
gr <- GRanges(
    paste0(rep(seq(1,25), each = 100), ":", seq(1,100)),
    dar_origin = c(simulate_dar(2400, -2), simulate_dar(100, 0.5))
)
plotDarECDF(dar = gr, dar_val = "origin") +
    theme_bw()

We can also choose to highlight the chromosome of interest with the highlight argument.

plotDarECDF(dar = gr, dar_val = "origin", highlight = "25") +
    scale_colour_manual(values = c("TRUE" = "red", "FALSE" = "grey")) +
    theme_bw()

5 Bibliography

Baer, Lachlan, Karissa Barthelson, John Postlethwait, David Adelson, Stephen Pederson, and Michael Lardelli. 2023. “Differential Allelic Representation (DAR) Identifies Candidate eQTLs and Improves Transcriptome Analysis.” Preprint. Genetics. https://doi.org/10.1101/2023.03.02.530865.

Cunningham, Fiona, James E Allen, Jamie Allen, Jorge Alvarez-Jarreta, M Ridwan Amode, Irina M Armean, Olanrewaju Austine-Orimoloye, et al. 2022. “Ensembl 2022.” Nucleic Acids Research 50 (D1): D988–D995. https://doi.org/10.1093/nar/gkab1049.

Gerken, Ewan, Syahida Ahmad, Karissa Barthelson, and Michael Lardelli. 2023. “Zebrafish Models of Mucopolysaccharidosis Types IIIA, B, & C Show Hyperactivity and Changes in Oligodendrocyte State.” Preprint. Genetics. https://doi.org/10.1101/2023.08.02.550904.

Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, and Gordon K. Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47–e47. https://doi.org/10.1093/nar/gkv007.

White, Richard J, Eirinn Mackay, Stephen W Wilson, and Elisabeth M Busch-Nentwich. 2022. “Allele-Specific Gene Expression Can Underlie Altered Transcript Abundance in Zebrafish Mutants.” eLife 11 (February): e72825. https://doi.org/10.7554/eLife.72825.

6 Session information

#> R version 4.4.0 beta (2024-04-15 r86425)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] tadar_1.2.0          GenomicRanges_1.56.0 GenomeInfoDb_1.40.0 
#>  [4] IRanges_2.38.0       S4Vectors_0.42.0     BiocGenerics_0.50.0 
#>  [7] limma_3.60.0         lubridate_1.9.3      forcats_1.0.0       
#> [10] stringr_1.5.1        dplyr_1.1.4          purrr_1.0.2         
#> [13] readr_2.1.5          tidyr_1.3.1          tibble_3.2.1        
#> [16] ggplot2_3.5.1        tidyverse_2.0.0      BiocStyle_2.32.0    
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3          rstudioapi_0.16.0          
#>   [3] jsonlite_1.8.8              magrittr_2.0.3             
#>   [5] GenomicFeatures_1.56.0      farver_2.1.1               
#>   [7] rmarkdown_2.26              BiocIO_1.14.0              
#>   [9] zlibbioc_1.50.0             vctrs_0.6.5                
#>  [11] memoise_2.0.1               Rsamtools_2.20.0           
#>  [13] RCurl_1.98-1.14             base64enc_0.1-3            
#>  [15] htmltools_0.5.8.1           S4Arrays_1.4.0             
#>  [17] progress_1.2.3              curl_5.2.1                 
#>  [19] SparseArray_1.4.0           Formula_1.2-5              
#>  [21] sass_0.4.9                  bslib_0.7.0                
#>  [23] htmlwidgets_1.6.4           Gviz_1.48.0                
#>  [25] httr2_1.0.1                 cachem_1.0.8               
#>  [27] GenomicAlignments_1.40.0    lifecycle_1.0.4            
#>  [29] pkgconfig_2.0.3             Matrix_1.7-0               
#>  [31] R6_2.5.1                    fastmap_1.1.1              
#>  [33] GenomeInfoDbData_1.2.12     MatrixGenerics_1.16.0      
#>  [35] digest_0.6.35               colorspace_2.1-0           
#>  [37] AnnotationDbi_1.66.0        Hmisc_5.1-2                
#>  [39] RSQLite_2.3.6               labeling_0.4.3             
#>  [41] filelock_1.0.3              fansi_1.0.6                
#>  [43] timechange_0.3.0            httr_1.4.7                 
#>  [45] abind_1.4-5                 compiler_4.4.0             
#>  [47] bit64_4.0.5                 withr_3.0.0                
#>  [49] htmlTable_2.4.2             backports_1.4.1            
#>  [51] BiocParallel_1.38.0         DBI_1.2.2                  
#>  [53] highr_0.10                  biomaRt_2.60.0             
#>  [55] rappdirs_0.3.3              DelayedArray_0.30.0        
#>  [57] rjson_0.2.21                tools_4.4.0                
#>  [59] foreign_0.8-86              nnet_7.3-19                
#>  [61] glue_1.7.0                  restfulr_0.0.15            
#>  [63] grid_4.4.0                  checkmate_2.3.1            
#>  [65] cluster_2.1.6               generics_0.1.3             
#>  [67] gtable_0.3.5                BSgenome_1.72.0            
#>  [69] tzdb_0.4.0                  ensembldb_2.28.0           
#>  [71] data.table_1.15.4           hms_1.1.3                  
#>  [73] xml2_1.3.6                  utf8_1.2.4                 
#>  [75] XVector_0.44.0              pillar_1.9.0               
#>  [77] BiocFileCache_2.12.0        lattice_0.22-6             
#>  [79] deldir_2.0-4                rtracklayer_1.64.0         
#>  [81] bit_4.0.5                   biovizBase_1.52.0          
#>  [83] tidyselect_1.2.1            Biostrings_2.72.0          
#>  [85] knitr_1.46                  gridExtra_2.3              
#>  [87] bookdown_0.39               ProtGenerics_1.36.0        
#>  [89] SummarizedExperiment_1.34.0 xfun_0.43                  
#>  [91] Biobase_2.64.0              statmod_1.5.0              
#>  [93] matrixStats_1.3.0           stringi_1.8.3              
#>  [95] UCSC.utils_1.0.0            lazyeval_0.2.2             
#>  [97] yaml_2.3.8                  evaluate_0.23              
#>  [99] codetools_0.2-20            interp_1.1-6               
#> [101] BiocManager_1.30.22         cli_3.6.2                  
#> [103] rpart_4.1.23                munsell_0.5.1              
#> [105] jquerylib_0.1.4             Rcpp_1.0.12                
#> [107] dichromat_2.0-0.1           dbplyr_2.5.0               
#> [109] png_0.1-8                   XML_3.99-0.16.1            
#> [111] parallel_4.4.0              blob_1.2.4                 
#> [113] prettyunits_1.2.0           jpeg_0.1-10                
#> [115] latticeExtra_0.6-30         AnnotationFilter_1.28.0    
#> [117] bitops_1.0-7                VariantAnnotation_1.50.0   
#> [119] scales_1.3.0                crayon_1.5.2               
#> [121] rlang_1.1.3                 KEGGREST_1.44.0