The basic characteristics of the GO (The Gene Ontology Consortium, 2000) data are described in The Gene Ontology Consortium (2000) and the interested reader is referred there for more details. In this paper we assume that readers are familiar with the basic DAG structure of GO and with the mappings of genes to GO terms that are provide by GOA (Camon et al., 2004). We examine some of the different visualization problems and data analytic problems that can be addressed using these data. To make some of the concepts concrete and to provide extensive, but related examples we will make use of a microarray data set (Chiaretti et al., 2004).
We begin with brief descriptions and reminders of some of the important concepts, as well as of our example data. Once that has been done we will then consider several different bioinformatic tasks that are of interest and show how these tasks can be carried out using software from the Bioconductor Project.
The results reported in this document are based on hgu95av2.db 3.13.0 version of the hgu95av2.db data package and on GO.db 3.19.1 of the GO.db package. If you have different versions your results will probably be different.
The relationships between different GO terms, within a specific ontology are represented in the form of a directed acyclic graph. The leaves of this graph represent the most specific terms and their are edges from a specific term (child) to all less specific terms (each is a parent). The induced GO graph is the graph that obtains from taking a set of GO terms and finding all parents of those terms, and so on until the root node has been obtained. Given any set of genes one may use the mappings provided by GOA (Camon et al., 2004) to find the most specific set of GO terms associated with those genes and hence, from those, to the induced GO graph. Thus we can speak of the induced GO graph either from the perspective of selected genes or from the perspective of selected terms.
We note that there is a form of algebra possible on these induced GO graphs. If \(G_1\) and \(G_2\) are two different GO graphs from the same ontology then we can form their union, \(G_1 \cup G_2\) which is the union of the nodes in the two graphs together with the union of the edge sets. We note that this is a bit different than the usual mathematical definition of the union of two graphs. Other set operations such as intersection and complement are also easy to define and implement and they will play a role in our subsequent discussions.
One should be aware that our state of biological knowledge is constantly changing and that the terms, their relationships and the genes mapped to them are constantly changing and being updated. The meta-data used to prepare this report will likely be out of date in less than one year. Hence, one must try to ensure that the mappings being used are up to date (Bioconductor meta-data packages are built quarterly).
Another practical issue that needs some attention is consideration of the incompleteness of the data. Not all genes (or all transcripts) are assayed and not all genes are well enough understood to accurately annotate them at the existing GO nodes; to say nothing of the issues regarding the evolving nature of GO discussed above. For each problem the data analyst will need to consider what, if any, biases these issues might raise.
In all comparisons and gene list developments reported in the remainder of this paper we restrict attention to those genes which satisfy two conditions first they were included on the chip being used and second they have a GO annotation for the ontology of interest.
To demonstrate some of the tools that are included in the GOstats package we consider expression data from 79 samples from patients with acute lymphoblastic leukemia (ALL) that were investigated using HGU95AV2 Affymetrix GeneChip arrays (Chiaretti et al., 2004). The data were normalized using quantile normalization and expression estimates were computed using RMA (Irizarry et al., 2003). We will consider the subset of these patients that have B-cell ALL and within those we will further restrict our interest to comparing three groups. Those with BCR/ABL (this is a translocation between chromosomes 9 and 22), those with ALL1/AF4 (this is a translocation between chromosomes 4 and 11) and those with no detected cytogenetic abnormalities (labeled NEG). There are 37 patients with BCR/ABL, 10 with ALL1/AF4 and 42 from the NEG group.
We will examine genes that are differentially expressed between these different phenotypes and then make use of GO and other meta-data resources to help understand the genes that were selected. The actual method of determining differentially expressed genes is in some sense irrelevant to the subsequent analysis using GO and readers can easily substitute their own favorite methods. We refer readers to Heydebreck et al. (2004) for a general discussion of some of the issues involved in gene filtering.
Our approach is fairly simplistic. First, we require that genes be expressed in a reasonable set of the samples. Since our smallest group, those with ALL1/AF4, has only 10 samples we will require that a gene be expressed in at least 7 samples to be deemed interesting. For our purposes we consider a gene to be expressed if its value is greater than 100. We also introduce a second selection criteria that is based on differences between the group means. We require that the difference between the smallest group mean and the largest be larger than 100 units. Applying both of these transformations yielded 771 probes that were differentially expressed.
Next restricting attention to just these 771 probes we use the mt.maxT function from the multtest package to compute adjusted \(p\)-values for F-test comparisons. We found 151 of these to have adjusted \(p\)-values less than 0.05. We will deem these to be the differentially expressed genes and will now turn our attention to using GO to gain a better understanding of the associated genes. Before going further though, we note that these 151 probes map to 131 distinct Entrez Gene identifiers and since our GO mappings are based on Entrez Gene we further reduce our attention to these. We kept the first (in index order) of each of the duplicated probes for any analyses or computations that require gene expression data and note that others may want to use different conventions.
Using the genes selected in Section 0.3 we can construct the induced GO graph for any one of the three different Ontologies. In Figures 1, 2 and 3 we present these three different graphs. Of course one would like to add some of the experimental data and information to these plots.
The nodes in these different graphs can be colored according to different criteria. In Storch et al. (2002) they considered using color for the nodes. The studied differences in gene expression between heart and liver in mice. Their Figure 2 presents data that were assembled in roughly the following manner. Genes of interest were mapped to terms in the Biological Process arm of GO The Gene Ontology Consortium (2000). They then colored the nodes in this tree red if the associated genes were expressed in heart, green if they were expressed in liver and yellow if they were expressed in both. The figure itself is quite helpful and clearly displays the data informatively.
This is an interesting idea and we amplify it a bit here. In our example we have three groups and we might want to consider coloring the nodes differently. For each node in the induced graph there is one or more genes from our list of interesting genes annotated at that node (due to the way we constructed the graph). We could color the nodes according to the phenotype (ALL1/AF4, BCR/ABL, or NEG) that had the highest mean (if there is more than one gene then we might use the multiplicity of highest means). Or, possibly of more interest would be to categorize the genes (again we could use which phenotype had the highest mean) and render the nodes using pie charts.