1 Overview

ABAEnrichment is designed to test user-defined genes for expression enrichment in different human brain regions. The package integrates the expression of the input gene set and the structural information of the brain using an ontology, both provided by the Allen Brain Atlas project [1-4]. The statistical analysis is performed by the core function aba_enrich which interfaces with the ontology enrichment software FUNC [5]. Additional functions provided in this package are get_expression, plot_expression, get_name, get_id, get_sampled_substructures, get_superstructures and get_annotated_genes supporting the exploration and visualization of the expression data.

1.1 Expression data

The package incorporates three different brain expression datasets:

  1. microarray data from six adult individuals
  2. RNA-seq data from 42 individuals of five different developmental stages (prenatal, infant, child, adolescent, adult)
  3. developmental effect scores measuring the age effect on expression per gene

All three datasets are filtered for protein-coding genes and gene expression is averaged across donors. The third dataset contains a derived score rather than expression data, but for simplicity this documentation refers only to ‘expression’. For details on the datasets see the ABAData vignette.

1.2 Annotation of genes to brain regions

Using the ontology that describes the hierarchical organization of the brain, each brain region is annotated with all genes that are expressed in the region itself or in any of its substructures. The boundary between ‘expressed’ and ‘not expressed’ is defined by different expression quantiles (e.g. using a quantile of 0.4, the lowest 40% of gene expression in the brain are considered ‘not expressed’ and the upper 60% are considered ‘expressed’). These cutoffs are set with the parameter cutoff_quantiles and an analysis is run for every cutoff separately. The default cutoffs are 10% to 90% in steps of 10%.
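The effect of a single cutoff can be illustrated in base R. This is only a sketch on a made-up expression matrix; it assumes that the quantile is taken over all expression values at once, and the gene and region names are hypothetical. It does not reproduce the internal code of ABAEnrichment.

```r
## toy expression matrix: 6 genes (rows) x 3 brain regions (columns);
## all values are made up for illustration
expr <- matrix(c(0.5, 1.2, 3.4,
                 7.9, 0.1, 2.2,
                 4.8, 5.5, 6.1,
                 0.3, 0.9, 1.1,
                 9.7, 8.8, 2.9,
                 3.3, 6.6, 0.7),
    nrow=6, byrow=TRUE,
    dimnames=list(paste0('gene', 1:6), c('regionA', 'regionB', 'regionC')))
## expression value at the 40% quantile across all genes and regions
cutoff <- quantile(expr, probs=0.4)
## a gene counts as 'expressed' in a region if it reaches the cutoff
expressed <- expr >= cutoff
## genes annotated to each region at this cutoff
annotated <- lapply(colnames(expr), function(r) rownames(expr)[expressed[, r]])
names(annotated) <- colnames(expr)
```

In the ontology, a parent region would additionally receive the union of the genes annotated to its substructures.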

1.3 Enrichment analysis

The enrichment analysis is performed by using either the hypergeometric test, the Wilcoxon rank-sum test, the binomial test or the 2x2 contingency table test implemented in the ontology enrichment software FUNC [5]. The hypergeometric test evaluates the enrichment of annotated (expressed) candidate genes compared to annotated background genes for each brain region. The background genes can be defined explicitly like the candidate genes or, by default, consist of all protein-coding genes from the dataset that are not contained in the set of candidate genes. In contrast to this binary distinction between candidate and background genes, the Wilcoxon rank-sum test uses user-defined scores that are assigned to the input genes. It then tests every brain region for an enrichment of genes with high scores in the set of expressed input genes. When genes are associated with two counts (A and B), e.g. amino-acid changes since a common ancestor in two species, a binomial test can be used to identify brain regions with an enrichment of expressed genes with a high fraction of A compared to the fraction of A in the brain in general. When genes are associated with four counts (A-D), e.g. non-synonymous or synonymous variants that are fixed between or variable within species, like for a McDonald-Kreitman test [6], the 2x2 contingency table test can be used. It can identify brain regions which have a high ratio of A/B compared to C/D in their expressed genes.
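For a single brain region, the hypergeometric test described above can be sketched with the base R function phyper. All counts are made up, and the sketch only illustrates the principle; FUNC's own implementation is not reproduced here.

```r
## made-up counts for a single brain region
n_candi_expr <- 8    # candidate genes expressed in the region
n_candi <- 13        # all candidate genes
n_bg_expr <- 4000    # background genes expressed in the region
n_bg <- 15000        # all background genes
## expressed and not expressed genes among candidate + background genes
n_expr <- n_candi_expr + n_bg_expr
n_not_expr <- (n_candi + n_bg) - n_expr
## probability of finding at least n_candi_expr expressed genes when
## drawing n_candi genes at random from all candidate and background genes
p_over <- phyper(n_candi_expr - 1, n_expr, n_not_expr, n_candi,
    lower.tail=FALSE)
```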

To account for multiple testing, FUNC computes the family-wise error rate (FWER) using randomsets. The randomsets are generated by permuting the gene-associated variables (e.g. candidate and background genes or the scores assigned to genes for the hypergeometric and Wilcoxon rank-sum test, respectively, see Schematic 1 below). This is also the default behavior in ABAEnrichment. For the hypergeometric test, ABAEnrichment additionally provides the option to make the chance of a background gene being selected as a random candidate gene proportional to the length of the background gene (option gene_len). Furthermore, instead of defining genes explicitly, whole genomic regions can be provided as input. ABAEnrichment then tests brain regions for enrichment of expressed genes located in the candidate regions, compared to expressed genes located in the background regions. The randomsets then also consist of randomly chosen candidate regions inside the background regions, either as a whole block in one background region (default), or on the same chromosome, where a block is allowed to overlap multiple background regions on that chromosome (option circ_chrom, see Schematic 2 below).
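The idea of a permutation-based FWER can be sketched in base R as follows. The annotation matrix is random toy data, and the helper region_pvals is hypothetical. The sketch assumes that the FWER of a region is the fraction of randomsets in which at least one region reaches a p-value as small as the observed one; the procedure implemented in FUNC may differ in detail.

```r
set.seed(1)
n_genes <- 200
n_regions <- 5
## toy annotation: TRUE if a gene is 'expressed' in a region (random here)
anno <- matrix(runif(n_genes * n_regions) < 0.4, nrow=n_genes,
    dimnames=list(NULL, paste0('region', 1:n_regions)))
is_candidate <- rep(c(1, 0), times=c(20, n_genes - 20))

## hypothetical helper: hypergeometric p-value for every region
region_pvals <- function(is_candidate, anno) {
    n_candi <- sum(is_candidate)
    n_expr <- colSums(anno)
    n_candi_expr <- colSums(anno[is_candidate == 1, , drop=FALSE])
    phyper(n_candi_expr - 1, n_expr, nrow(anno) - n_expr, n_candi,
        lower.tail=FALSE)
}

p_obs <- region_pvals(is_candidate, anno)
## randomsets: permute candidate/background labels and keep the
## smallest p-value over all regions from each randomset
n_randsets <- 100
min_p_rand <- replicate(n_randsets,
    min(region_pvals(sample(is_candidate), anno)))
## FWER per region: fraction of randomsets with at least one p-value
## as small as the observed p-value of that region
fwer <- sapply(p_obs, function(p) mean(min_p_rand <= p))
```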

1.4 Functions included in ABAEnrichment:

function                   description
aba_enrich                 core function for performing enrichment analyses given a candidate gene set
get_expression             returns expression data for a given set of genes and brain regions
plot_expression            plots a heatmap with expression data for a given set of genes and brain regions
get_name                   returns the full name of a brain region given a structure ID
get_sampled_substructures  returns the substructures of a given brain region that have expression data available
get_superstructures        returns the superstructures of a given brain region
get_id                     returns the structure ID given the name of a brain region
get_annotated_genes        returns genes annotated to enriched or user-defined brain regions

2 Examples

2.1 Test gene expression enrichment using the hypergeometric test

For a random set of 13 candidate genes, two analyses to identify human brain regions with enriched expression of the candidate genes are performed: one using data from adult donors (from Allen Human Brain Atlas [3]) and one using data from five developmental stages (from BrainSpan Atlas of the Developing Human Brain [4]). The hypergeometric test evaluates the over-representation of a set of expressed candidate genes in brain regions, compared to a set of expressed background genes (see Schematic 1 below). The input for the hypergeometric test is a dataframe with two columns: (1) a column with gene identifiers (Entrez-ID, Ensembl-ID or gene-symbol) and (2) a binary column with 1 for a candidate gene and 0 for a background gene. In this example no background genes are defined, so all remaining protein-coding genes of the dataset are used as default background.

## load ABAEnrichment package
library(ABAEnrichment)
## create input vector with candidate genes
gene_ids = c('NCAPG', 'APOL4', 'NGFR', 'NXPH4', 'C21orf59', 'CACNG2', 'AGTR1',
    'ANO1', 'BTBD3', 'MTUS1', 'CALB1', 'GYG1', 'PAX2')
is_candidate = rep(1, length(gene_ids))
input_hyper = data.frame(gene_ids, is_candidate)
##   gene_ids is_candidate
## 1    NCAPG            1
## 2    APOL4            1
## 3     NGFR            1
## 4    NXPH4            1
## 5 C21orf59            1
## 6   CACNG2            1

The core function aba_enrich performs the enrichment analysis. It takes the genes dataframe as input, together with a dataset argument which is set to adult (default) or 5_stages for the analyses of the adult and the developing human brain, respectively. An example with the developmental effect score (dev_effect) can be found below.

## run enrichment analyses with default parameters 
## for the adult and developing human brain
res_adult = aba_enrich(input_hyper, dataset='adult')
res_devel = aba_enrich(input_hyper, dataset='5_stages')

In the following examples two additional parameters are set to lower computation time: cutoff_quantiles=c(0.5,0.7,0.9) to use the 50%, 70% and 90% expression quantiles across all genes as the boundary between ‘expressed’ and ‘not expressed’ genes, and n_randsets=100 to use 100 random permutations to calculate the FWER. cutoff_quantiles and n_randsets have default values seq(0.1,0.9,0.1) and 1000, respectively.

## run enrichment analysis with fewer cutoffs and randomsets
## to save computation time
res_devel = aba_enrich(input_hyper, dataset='5_stages',
    cutoff_quantiles=c(0.5,0.7,0.9), n_randsets=100)

The function aba_enrich returns a list, the first element of which contains the results of the statistical analysis for each brain region and age category (analyses are performed independently for each developmental stage):

## extract first element from the output list, which contains the statistics
fwers_devel = res_devel[[1]]
## see results for the brain regions with highest enrichment
## for children (3-11 yrs, age_category 3)
fwers_3 = fwers_devel[fwers_devel[,1]==3, ]
##    age_category structure_id                                     structure n_significant mean_FWER min_FWER
## 55            3  Allen:10657                         CBC_cerebellar cortex             0 0.5000000     0.36
## 56            3  Allen:10361                        AMY_amygdaloid complex             0 0.9500000     0.85
## 57            3  Allen:10163    M1C_primary motor cortex (area M1, area 4)             0 0.9566667     0.87
## 58            3  Allen:10225 IPC_posteroventral (inferior) parietal cortex             0 0.9733333     0.92
## 59            3  Allen:10173            DFC_dorsolateral prefrontal cortex             0 0.9833333     0.96
## 60            3  Allen:10161                         FCx_frontal neocortex             0 0.9866667     0.96
##                                          equivalent_structures          FWERs
## 55 Allen:10657;Allen:10656;Allen:10655;Allen:10654;Allen:10653 0.66;0.48;0.36
## 56                                                 Allen:10361       0.85;1;1
## 57                                     Allen:10163;Allen:10162       0.87;1;1
## 58                                     Allen:10225;Allen:10214       0.92;1;1
## 59                                                 Allen:10173    0.96;1;0.99
## 60                                                 Allen:10161       0.96;1;1

The rows in the output data frame are ordered by age_category, n_significant, min_FWER and mean_FWER; with e.g. min_FWER denoting the minimum FWER for enrichment of expressed candidate genes in that brain region across all expression cutoffs. ‘n_significant’ reports the number of cutoffs at which the FWER was below 0.05. The column FWERs lists the individual FWERs for each cutoff.
The column equivalent_structures lists brain regions with identical expression data due to lack of independent expression measurements in all regions. Nodes (brain regions) in the ontology inherit data from their children (substructures), and in the case of only one child node with expression data, the parent node inherits the child’s data leading to identical enrichment statistics.
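The summary columns can be recomputed from the FWERs column. The following base R sketch uses the FWERs string of the first row shown above and assumes only that the values are separated by semicolons, one per cutoff.

```r
## the FWERs string of the first row shown above (one value per cutoff)
fwers_string <- '0.66;0.48;0.36'
fwers <- as.numeric(strsplit(fwers_string, ';', fixed=TRUE)[[1]])
## summary statistics as reported in the output columns
min_fwer <- min(fwers)
mean_fwer <- mean(fwers)
n_significant <- sum(fwers < 0.05)
```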

In addition to the statistics, the list that is returned from aba_enrich also contains the input genes for which expression data are available, and for each age category the gene expression values that correspond to the requested cutoff_quantiles:

## $genes
##    gene_ids is_candidate
## 1     AGTR1            1
## 2      ANO1            1
## 3     BTBD3            1
## 4  C21orf59            1
## 5    CACNG2            1
## 6     CALB1            1
## 7      GYG1            1
## 8     MTUS1            1
## 9     NCAPG            1
## 10     NGFR            1
## 11    NXPH4            1
## 12     PAX2            1
## $cutoffs
##     age_category_1 age_category_2 age_category_3 age_category_4 age_category_5
## 50%       3.144481       2.854802       2.716617       2.776235       2.862117
## 70%       7.823920       7.017616       6.897414       6.842193       7.118609
## 90%      23.768641      22.478328      23.124388      21.625395      22.680811

For example, in the enrichment analysis of age category 2 (infant) with an expression cutoff of 0.7 (70%), genes are considered ‘expressed’ in a particular brain region when their expression value in that region is at least 7.017616.

2.1.1 Choose random candidate genes dependent on gene length

The default behavior of aba_enrich is to permute candidate and background genes randomly to compute the FWER. With the option gene_len=TRUE, the random selection of background genes as candidate genes depends on gene length, i.e. a gene twice as long as another gene is also twice as likely to be selected as a candidate gene in a randomset. This is useful when the procedure that led to the identification of the candidate gene set is also more likely to discover longer genes. Gene coordinates were obtained from (GRCh37.p13). The option ref_genome='grch38' uses gene coordinates from the GRCh38 genome (GRCh38.p10) obtained from

## run enrichment analysis, with randomsets dependent on gene length
res_len = aba_enrich(input_hyper, gene_len=TRUE)
## run the same analysis using gene coordinates
## from GRCh38 instead of the default GRCh37
res_len_grch38 = aba_enrich(input_hyper, gene_len=TRUE, ref_genome='grch38')
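Length-dependent selection can be illustrated with the prob argument of the base R function sample. The gene names and lengths below are made up, and the sketch is not the code used by aba_enrich.

```r
set.seed(42)
## hypothetical background genes with made-up lengths in base pairs
bg_len <- c(geneA=1000, geneB=2000, geneC=4000, geneD=8000)
## draw a single gene many times, with the chance of being drawn
## proportional to gene length
draws <- sample(names(bg_len), size=10000, replace=TRUE, prob=bg_len)
freq <- table(draws) / length(draws)
## one randomset with two random candidate genes (drawn without replacement)
rand_candi <- sample(names(bg_len), size=2, prob=bg_len)
```

In this toy example geneD is eight times as long as geneA and is therefore drawn about eight times as often.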

2.2 Test gene expression enrichment using the Wilcoxon rank-sum test

When the genes are not divided into candidate and background genes, but are ranked by scores, a Wilcoxon rank-sum test can be performed to find brain regions with a high proportion of genes with high scores in the set of expressed genes. The second column of the genes input dataframe then contains the scores assigned to the genes. The output has the same format as the output of the hypergeometric test.

## assign random scores to the genes used above
scores = sample(1:50, length(gene_ids))
input_wicox = data.frame(gene_ids, scores)
##   gene_ids scores
## 1    NCAPG     28
## 2    APOL4     21
## 3     NGFR     46
## 4    NXPH4     37
## 5 C21orf59     24
## 6   CACNG2     20
## test for enrichment of expressed genes with high scores in the adult brain
## using the Wilcoxon rank-sum test
res_wilcox = aba_enrich(input_wicox, test='wilcoxon',
    cutoff_quantiles=c(0.5,0.7,0.9), n_randsets=100)
##   age_category structure_id                                 structure n_significant mean_FWER min_FWER
## 1            5   Allen:9017 ICjl_interstitial nucleus of Cajal, Right             0 0.8600000     0.59
## 2            5   Allen:4314           SI_Substantia Innominata, Right             0 0.9000000     0.74
## 3            5  Allen:12902            AHA_Anterior Hypothalamic Area             0 0.9033333     0.74
## 4            5  Allen:12919     VMH_Ventromedial Hypothalamic Nucleus             0 0.9033333     0.74
## 5            5   Allen:4598      AHA_Anterior Hypothalamic Area, Left             0 0.9033333     0.74
## 6            5   Allen:4627      LHA_Lateral hypothalamic area, Right             0 0.9033333     0.74
##   equivalent_structures          FWERs
## 1            Allen:9017    1;0.99;0.59
## 2            Allen:4314 0.97;0.99;0.74
## 3           Allen:12902    0.97;1;0.74
## 4           Allen:12919    0.97;1;0.74
## 5            Allen:4598    0.97;1;0.74
## 6            Allen:4627    0.97;1;0.74
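For a single brain region, the principle of the test can be sketched with the base R function wilcox.test. The scores are random toy data, and the comparison of expressed against not expressed input genes is an assumption made for illustration; it is not the FUNC implementation.

```r
set.seed(7)
## toy scores for 30 genes; the first 10 are 'expressed' in the region
scores <- c(sample(30:50, 10), sample(1:40, 20))
expressed <- rep(c(TRUE, FALSE), times=c(10, 20))
## one-sided test: do expressed genes tend to have higher scores?
## exact=FALSE uses the normal approximation, which also handles ties
wt <- wilcox.test(scores[expressed], scores[!expressed],
    alternative='greater', exact=FALSE)
wt$p.value
```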

2.3 NEW: Test for gene expression enrichment using the binomial test

When genes are associated with two counts A and B, e.g. amino-acid changes since a common ancestor in two species, a binomial test can be used to identify brain regions with an enrichment of expressed genes with a high fraction A/(A+B) compared to the fraction of A in the brain in general (the root node). To perform the binomial test the input dataframe needs a column with the gene symbols and two additional columns with the corresponding counts:

## create a toy example dataset with two counts per gene
high_A_genes = c('RFFL', 'NTS', 'LIPE', 'GALNT6', 'GSN', 'BTBD16', 'CERS2')
low_A_genes = c('GDA', 'ENC1', 'EGR4', 'VIPR1', 'DOC2A', 'OASL', 'FRY', 'NAV3')
A_counts = c(sample(20:30, length(high_A_genes)),
    sample(5:15, length(low_A_genes)))
B_counts = c(sample(5:15, length(high_A_genes)),
    sample(20:30, length(low_A_genes)))
input_binom = data.frame(gene_ids=c(high_A_genes, low_A_genes),
    A_counts, B_counts)
##   gene_ids A_counts B_counts
## 1     RFFL       27       13
## 2      NTS       21        8
## 3     LIPE       30        7
## 4   GALNT6       29       11
## 5      GSN       28        6
## 6   BTBD16       25        9

This example also uses the silent option, which suppresses all output that would be written to the screen (except for warnings and errors):

## run binomial test
res_binom = aba_enrich(input_binom, cutoff_quantiles=c(0.2,0.9),
    test='binomial', n_randsets=100, silent=TRUE)
##   age_category structure_id                                            structure n_significant mean_FWER
## 1            5   Allen:4518                         Sb_Subthalamic Nucleus, Left             0     0.125
## 2            5  Allen:12923                          DTL_Lateral Group of Nuclei             0     0.185
## 3            5  Allen:12925       DTLv_Lateral Group of Nuclei, Ventral Division             0     0.185
## 4            5   Allen:4417 DTLv_Lateral group of Nuclei, Left, Ventral division             0     0.185
## 5            5   Allen:9054                                 RN_Red Nucleus, Left             0     0.185
## 6            5   Allen:9598                     GiRt_gigantocellular group, Left             0     0.185
##   min_FWER equivalent_structures     FWERs
## 1     0.12            Allen:4518 0.12;0.13
## 2     0.12           Allen:12923 0.12;0.25
## 3     0.12           Allen:12925 0.12;0.25
## 4     0.12            Allen:4417 0.12;0.25
## 5     0.12            Allen:9054 0.12;0.25
## 6     0.12            Allen:9598 0.12;0.25
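The comparison made for one brain region can be sketched with the base R function binom.test. All counts are made up, and summing the counts over the expressed genes is an assumption made for illustration rather than a description of the FUNC internals.

```r
## made-up counts summed over the genes expressed in one brain region
A_region <- 120
B_region <- 60
## made-up counts summed over all input genes (the root node)
A_root <- 300
B_root <- 330
## fraction of A in the root node serves as the expected fraction
p_root <- A_root / (A_root + B_root)
## one-sided test for a higher fraction of A in the brain region
bt <- binom.test(A_region, A_region + B_region, p=p_root,
    alternative='greater')
bt$p.value
```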

2.4 NEW: Test for gene expression enrichment using the 2x2 contingency table test

When genes are associated with four counts (A-D), e.g. non-synonymous or synonymous variants that are fixed between or variable within species, like for a McDonald-Kreitman test [6], the 2x2 contingency table test can be used. It can identify brain regions which have a high ratio of A/B compared to C/D, which in this example would correspond to a high ratio of non-synonymous substitutions / synonymous substitutions compared to non-synonymous variable / synonymous variable:

## create a toy example with four counts per gene
high_substi_genes = c('RFFL', 'NTS', 'LIPE', 'GALNT6', 'GSN', 'BTBD16', 'CERS2')
low_substi_genes = c('ENC1', 'EGR4', 'NPTX1', 'DOC2A', 'OASL', 'FRY', 'NAV3')
subs_non_syn = c(sample(5:15, length(high_substi_genes), replace=TRUE),
    sample(0:5, length(low_substi_genes), replace=TRUE))
subs_syn = sample(5:15, length(c(high_substi_genes, low_substi_genes)),
    replace=TRUE)
vari_non_syn = c(sample(0:5, length(high_substi_genes), replace=TRUE),
    sample(0:10, length(low_substi_genes), replace=TRUE))
vari_syn = sample(5:15, length(c(high_substi_genes, low_substi_genes)),
    replace=TRUE)
input_conti = data.frame(gene_ids=c(high_substi_genes, low_substi_genes),
    subs_non_syn, subs_syn, vari_non_syn, vari_syn)
##   gene_ids subs_non_syn subs_syn vari_non_syn vari_syn
## 1     RFFL            9        5            2       15
## 2      NTS           15        6            2        5
## 3     LIPE           13       14            4        6
## 4   GALNT6           11       10            5        8
## 5      GSN           12       13            4       15
## 6   BTBD16           11       10            1       10
## the corresponding contingency table for the first gene would be:
matrix(input_conti[1, 2:5], ncol=2, dimnames=list(c('non-synonymous',
    'synonymous'), c('substitution','variable')))
##                substitution variable
## non-synonymous 9            2       
## synonymous     5            15
res_conti = aba_enrich(input_conti, test='contingency',
    cutoff_quantiles=c(0.7,0.8,0.9), n_randsets=100)

The output is analogous to that of the other tests:

##   age_category structure_id                structure n_significant mean_FWER min_FWER equivalent_structures
## 1            5   Allen:4671 MB_Mammillary Body, Left             1 0.6700000     0.01            Allen:4671
## 2            5   Allen:9512        MY_Myelencephalon             1 0.3833333     0.04            Allen:9512
## 3            5   Allen:4391         DiE_Diencephalon             1 0.4900000     0.04            Allen:4391
## 4            5   Allen:4392              TH_Thalamus             1 0.4900000     0.04            Allen:4392
## 5            5   Allen:4665   MamR_Mammillary Region             1 0.5000000     0.04            Allen:4665
## 6            5  Allen:12909       MB_Mammillary Body             1 0.6800000     0.04           Allen:12909
##         FWERs
## 1    0.01;1;1
## 2 0.04;0.11;1
## 3 0.43;0.04;1
## 4 0.43;0.04;1
## 5 0.04;0.46;1
## 6    0.04;1;1

Depending on the counts for each brain region a Chi-square or Fisher’s exact test is performed. Note that this is the only test that is not dependent on the distribution of the gene-associated variables in the root node.
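The choice between the two tests can be sketched in base R for one brain region. The counts are made up, and the rule used here (Fisher’s exact test if any expected count is below 5) is a common convention and an assumption; the criterion applied by FUNC is not stated in this document.

```r
## made-up counts summed over the genes expressed in one brain region
region_table <- matrix(c(40, 25,    # non-synonymous: substitution, variable
                         60, 90),   # synonymous: substitution, variable
    nrow=2, byrow=TRUE,
    dimnames=list(c('non-synonymous', 'synonymous'),
        c('substitution', 'variable')))
## expected counts under independence of rows and columns
expected <- outer(rowSums(region_table), colSums(region_table)) /
    sum(region_table)
## assumed rule: Fisher's exact test for small expected counts,
## Chi-square test otherwise
if (any(expected < 5)) {
    res <- fisher.test(region_table)
} else {
    res <- chisq.test(region_table)
}
res$p.value
```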

2.5 Test gene expression enrichment for genomic regions

Instead of defining candidate and background genes explicitly in the genes input dataframe, it is also possible to define entire chromosomal regions as candidate and background regions. The expression enrichment is then tested for all protein-coding genes located in, or overlapping the candidate regions on the plus or the minus strand. The gene coordinates used to identify those genes were obtained from (GRCh37.p13). The option ref_genome='grch38' uses gene coordinates from the GRCh38.p10 genome version obtained from

In comparison to defining candidate and background genes explicitly, this option has the advantage that the FWER accounts for spatial clustering of genes. For the random permutations used to compute the FWER, blocks as long as candidate regions are chosen from the merged candidate and background regions and genes contained in these blocks are considered candidate genes (Schematic 2).

To define chromosomal regions in the input dataframe, the first column has to be of the form chr:start-stop, where start always has to be smaller than stop. Note that this option requires the input of background regions. If multiple candidate regions are provided, in the randomsets they are placed randomly (but without overlap) into the merged candidate and background regions. The output of aba_enrich is identical to the one that is produced for single genes. The second element of the output list contains the candidate and background genes located in the user-defined regions:

## create input vector with a candidate region on chromosome 8
## and background regions on chromosome 7, 8 and 9
regions = c('8:82000000-83000000', '7:1300000-56800000',
    '7:74900000-148700000', '8:7400000-44300000', '8:47600000-146300000',
    '9:0-39200000', '9:69700000-140200000')
is_candidate = c(1, rep(0,6))
input_region = data.frame(regions, is_candidate)
## run enrichment analysis for the adult human brain
res_region = aba_enrich(input_region, dataset='adult',
    cutoff_quantiles=c(0.5,0.7,0.9), n_randsets=100)
## look at the results from the enrichment analysis
fwers_region = res_region[[1]]
##   age_category structure_id                           structure n_significant mean_FWER min_FWER
## 1            5  Allen:12926        MG_Medial Geniculate Complex             1 0.5166667     0.01
## 2            5   Allen:9150            LC_locus ceruleus, Right             1 0.1566667     0.02
## 3            5   Allen:4671            MB_Mammillary Body, Left             1 0.3833333     0.02
## 4            5   Allen:4734 He-III_III, Left Lateral Hemisphere             1 0.3366667     0.03
## 5            5   Allen:4738   He-VI_VI, Left Lateral Hemisphere             1 0.3433333     0.03
## 6            5  Allen:12909                  MB_Mammillary Body             1 0.4100000     0.03
##   equivalent_structures          FWERs
## 1           Allen:12926 0.78;0.76;0.01
## 2            Allen:9150  0.3;0.02;0.15
## 3            Allen:4671 0.78;0.35;0.02
## 4            Allen:4734 0.53;0.45;0.03
## 5            Allen:4738 0.54;0.46;0.03
## 6           Allen:12909 0.78;0.42;0.03
## see which genes are located in the candidate region
input_genes = res_region[[2]]
candidate_genes = input_genes[input_genes[,2]==1,]
##         gene score
## 278   CHMP4C     1
## 484    FABP4     1
## 485    FABP5     1
## 486    FABP9     1
## 487   FABP12     1
## 727    IMPA1     1
## 1053    PAG1     1
## 1117    PMP2     1
## 1347 SLC10A5     1
## 1393   SNX16     1
## 1691  ZFAND1     1
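Before running an analysis it can be useful to check that all regions follow the chr:start-stop format with start smaller than stop. A minimal base R sketch, using three of the regions from the example above; the names parts and coords are chosen for illustration:

```r
regions <- c('8:82000000-83000000', '7:1300000-56800000', '9:0-39200000')
## split 'chr:start-stop' into its three parts
parts <- do.call(rbind, strsplit(regions, '[:-]'))
coords <- data.frame(chr=parts[, 1], start=as.numeric(parts[, 2]),
    stop=as.numeric(parts[, 3]))
## every region needs start < stop
stopifnot(all(coords$start < coords$stop))
## region lengths in base pairs
coords$length <- coords$stop - coords$start
```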

An alternative method to choose random blocks from the background regions can be used with the option circ_chrom=TRUE. Every candidate region is then compared to background regions on the same chromosome (Schematic 2). In contrast to the default circ_chrom=FALSE, randomly chosen blocks do not have to be located inside a single background region, but are allowed to overlap multiple background regions. This means that a randomly chosen block can start at the end of the last background region and continue at the beginning of the first background region on a given chromosome.
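The circular placement can be sketched in base R by joining the background regions of one chromosome end to end and letting a block that runs past the end continue at the beginning. The function place_block is hypothetical and written only for this illustration; the actual implementation in ABAEnrichment may differ.

```r
## background regions on chromosome 8 (from the example above)
bg <- data.frame(start=c(7400000, 47600000), stop=c(44300000, 146300000))

## hypothetical helper: place a block of length block_len at position
## offset on the concatenated (circular) background regions and return
## the genomic coordinates it covers
place_block <- function(offset, block_len, bg) {
    bg_len <- bg$stop - bg$start
    total <- sum(bg_len)
    end <- offset + block_len
    ## pieces of the block in concatenated coordinates
    if (end <= total) {
        pieces <- data.frame(from=offset, to=end)
    } else {
        pieces <- data.frame(from=c(offset, 0), to=c(total, end - total))
    }
    ## start of every background region in concatenated coordinates
    cum_start <- cumsum(c(0, head(bg_len, -1)))
    out <- list()
    for (i in seq_len(nrow(pieces))) {
        for (j in seq_len(nrow(bg))) {
            ## overlap of piece i with background region j
            from <- max(pieces$from[i], cum_start[j])
            to <- min(pieces$to[i], cum_start[j] + bg_len[j])
            if (from < to) {
                out[[length(out) + 1]] <- data.frame(
                    start=bg$start[j] + from - cum_start[j],
                    stop=bg$start[j] + to - cum_start[j])
            }
        }
    }
    do.call(rbind, out)
}

## a 1 Mb block starting 0.4 Mb before the end of the last region
## continues at the beginning of the first region
wrapped <- place_block(135200000, 1000000, bg)
## a random placement
set.seed(3)
random_block <- place_block(sample.int(sum(bg$stop - bg$start), 1) - 1,
    1000000, bg)
```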

2.6 Explore expression data

2.6.1 get_expression

The function get_expression returns gene and brain region-specific expression data averaged across donors. By only setting the parameter structure_ids that defines the brain regions, the gene_ids and dataset are automatically set to the genes and dataset used in the last enrichment analysis. In comparison to defining genes and brain regions explicitly this saves some time since some pre-computations on the original dataset, e.g. aggregation of expression per gene, do not have to be redone. Using the default options (background=FALSE), get_expression returns expression data for the candidate genes. If background=TRUE, the gene expression data for both candidate and background genes are returned.

## get expression data for the first 5 brain regions
## from the last aba_enrich-analysis
top_regions = fwers_region[1:5, 'structure_id']
## [1] "Allen:12926" "Allen:9150"  "Allen:4671"  "Allen:4734"  "Allen:4738"
expr = get_expression(top_regions, background=FALSE)
##              CHMP4C   FABP12    FABP4    FABP5    FABP9    IMPA1     PAG1     PMP2  SLC10A5    SNX16   ZFAND1
## Allen:4444 2.586348 1.648789 2.129794 7.979775 1.358975 9.586679 8.861224 11.05810 1.990756 6.375037 8.401070
## Allen:4499 3.086266 1.302643 3.250694 9.242289 1.293894 8.830867 8.682235 11.50669 1.704897 6.400059 8.977492
## Allen:9150 2.801755 1.737777 2.188948 9.165938 1.568329 8.782606 7.299574 10.75592 1.969351 7.270022 8.160494
## Allen:4671 3.315379 1.322007 2.948199 8.331502 1.289105 9.367726 8.846593 10.50742 2.517208 6.726171 8.877031
## Allen:4675 2.784213 1.969504 3.043131 9.223920 1.588357 8.954029 7.125901 10.47701 2.137033 7.080715 8.722243
## Allen:4672 2.645451 1.744787 2.177720 7.968980 1.448837 9.034155 8.390096 10.48329 2.437719 6.958238 8.863811

The same output would be created independently of an aba_enrich analysis by, in addition to structure_ids, setting gene_ids and dataset manually. Like in all functions of the ABAEnrichment package gene_ids can be Entrez-ID, Ensembl-ID or gene-symbol.

## get expression data independent from previous aba_enrich analysis
regions = c('Allen:12926', 'Allen:4738', 'Allen:4671', 'Allen:12909')
gene_ids = c('CHMP4C', 'FABP12', 'FABP4', 'FABP5', 'FABP9', 'IMPA1',
    'PAG1', 'PMP2', 'SLC10A5', 'SNX16', 'ZFAND1') 
expr2 = get_expression(regions, gene_ids=gene_ids, dataset='adult')

For the 5_stages dataset the output of get_expression is a list with a data frame for each developmental stage, where the first element corresponds to the first developmental stage, the second element to the second developmental stage, and so on.

Note that the brain regions passed to get_expression do not have to match the brain regions returned in the output. This is due to the fact that not all brain regions were measured independently. In case a brain region was not measured directly, all available expression data from its substructures are returned. The function get_sampled_substructures can be used to identify substructures with expression data.

2.6.2 plot_expression

The function plot_expression enables the visualization of expression data. The usage of plot_expression is similar to that of get_expression. Providing only brain regions as input, it plots the expression data for the genes and dataset used in the last aba_enrich call.

## get expression data for the first 5 brain regions
## from the last aba_enrich-analysis
top_regions = fwers_region[1:5, 'structure_id']
plot_expression(top_regions, background=FALSE)

The optional argument dendro determines whether or not a dendrogram should be added to the heatmap. The colored side bar in the plot without dendrogram indicates candidate genes (red) and background genes (black). In this case only candidate gene expression was plotted (with the default option background=FALSE):

## plot the same expression data without dendrogram
plot_expression(top_regions, dendro=FALSE, background=FALSE)

When plotting expression data following an enrichment analysis with the Wilcoxon rank-sum test, the option dendro=FALSE results in a side bar that indicates the scores that were used for the enrichment analysis. For the binomial test the side bar shows A/(A+B) and for the 2x2 contingency table test ((A+1)/(B+1)) / ((C+1)/(D+1)) (+1 added to prevent division by 0, this is just a visual indication of the proportion of the ratios and not the real odds ratio from the 2x2 contingency table test).
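The two side bar values can be computed directly from the counts. The following base R sketch uses the counts of the first gene of the contingency-table toy example above (A=9, B=5, C=2, D=15); for the binomial test A and B would be the two counts of the respective input.

```r
## counts of the first gene in the contingency-table toy example above
A <- 9; B <- 5; C <- 2; D <- 15
## binomial test side bar: fraction of A
frac_A <- A / (A + B)
## 2x2 contingency table test side bar: ratio of ratios, with +1 added
## to every count to prevent division by 0
ratio <- ((A + 1) / (B + 1)) / ((C + 1) / (D + 1))
```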

Like get_expression, plot_expression can also be used independently of an enrichment analysis. In that case the arguments gene_ids and dataset have to be defined. If the 5_stages dataset is used, the additional argument age_category selects the developmental stage for which the expression data should be plotted:

## plot expression of some genes for the frontal neocortex (Allen:10161)
## in age category 3 (children, 3-11 yrs)
gene_ids = c('ENSG00000157764', 'ENSG00000163041', 'ENSG00000182158',
    'ENSG00000147889', 'ENSG00000103126', 'ENSG00000184634')
plot_expression('Allen:10161', gene_ids=gene_ids, dataset='5_stages',
    age_category=3)