| Authors: | Seth Falcon and Robert Gentleman |
|---|---|
| Date: | 2007-01-06 |

GO evidence codes:

| Code | Meaning |
|---|---|
| IMP: | inferred from mutant phenotype |
| IGI: | inferred from genetic interaction |
| IPI: | inferred from physical interaction |
| ISS: | inferred from sequence similarity |
| IDA: | inferred from direct assay |
| IEP: | inferred from expression pattern |
| IEA: | inferred from electronic annotation |
| TAS: | traceable author statement |
| NAS: | non-traceable author statement |
| ND: | no biological data available |
| IC: | inferred by curator |
Are there any GO terms that have a larger than expected subset of our selected genes in their annotation list?
If so, these GO terms will give us insight into the functional characteristics of the respective subset of the gene list.
But what does larger than expected mean?
From Wikipedia, the free encyclopedia:
... the hypergeometric distribution is a discrete probability distribution that describes the number of successes in a sequence of n draws from a finite population without replacement.
TODO: insert detailed proof that a hypergeometric test is equivalent to a one-tailed Fisher's Exact test.
Testing a GO term amounts to drawing the genes annotated at it from the urn and tallying white and black.
| . | Selected (white) | Not (black) |
|---|---|---|
| in GO term | n11 | n12 |
| not in GO term | n21 | n22 |
Testing a GO term amounts to drawing the genes annotated at it from the urn and filling out the table.
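As a numerical check rather than a proof, the sketch below fills the two-way table with illustrative counts (10 selected genes in the term, 400 selected genes, 40 genes in the term, and an assumed universe of 1000) and compares the Hypergeometric tail probability with a one-tailed Fisher's Exact test. The counts and variable names are assumptions chosen for illustration.

```r
# Cell counts for the two-way table (illustrative values only).
n11 <- 10    # in GO term, selected
n12 <- 30    # in GO term, not selected
n21 <- 390   # not in GO term, selected
n22 <- 570   # not in GO term, not selected

# Hypergeometric tail: P(X >= n11) with the term's genes as white balls.
pHyper <- phyper(n11 - 1, n11 + n12, n21 + n22, n11 + n21,
                 lower.tail = FALSE)

# One-tailed Fisher's Exact test on the same table.
# matrix() fills column by column, so rows are "in term" / "not in term".
tab <- matrix(c(n11, n21, n12, n22), nrow = 2)
pFisher <- fisher.test(tab, alternative = "greater")$p.value

# The two p-values agree up to numerical tolerance.
agree <- isTRUE(all.equal(pHyper, pFisher))
```
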
NB: You can apply other two-way table tests besides Fisher's Exact test. For large categories, that may make sense.
Which genes were candidates for selection?
The choice of this gene universe has a big impact on Hypergeometric test results.
Possibilities:
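P <- function(size) {
    nFound <- 10                  # drawn genes annotated at the category
    nDrawn <- 400                 # genes drawn (the selected gene list)
    nAtCat <- 40                  # universe genes annotated at the category
    nNotAtCat <- size - nAtCat    # universe genes not at the category
    # P(X >= nFound): upper tail of the Hypergeometric distribution
    phyper(nFound - 1, nAtCat, nNotAtCat,
           nDrawn, lower.tail = FALSE)
}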
P(1000) ---> 0.986
P(5000) ---> 0.000914
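To see the trend between those two values, the same calculation can be swept over several universe sizes. This is a small sketch: the function name `pOver` and the intermediate sizes are assumptions added for illustration, while the counts are the ones used above.

```r
# Same tail probability as above, with the counts exposed as arguments.
pOver <- function(size, nFound = 10, nDrawn = 400, nAtCat = 40) {
    phyper(nFound - 1, nAtCat, size - nAtCat, nDrawn, lower.tail = FALSE)
}

# Evaluate over a range of universe sizes.
sizes <- c(1000, 2000, 5000, 10000)
pvals <- sapply(sizes, pOver)

# One row per universe size, for easy inspection.
result <- data.frame(universeSize = sizes, pvalue = pvals)
print(result)
```

With the other counts held fixed, a larger universe lowers the expected number of selected genes in the category, so the same observed count looks more surprising.
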
First, a short diversion:
Object Oriented Programming in R: the S4 Object System.
Inputs:
- HyperGParams
- GOHyperGParams
- KEGGHyperGParams
- PFAMHyperGParams

Outputs:
- HyperGResult
- GOHyperGResult
If p is a GOHyperGParams instance:
- geneIds(p)
- testDirection(p)
- universeGeneIds(p)
- conditional(p)
- annotation(p)
- pvalueCutoff(p)
- ontology(p)

There are also replacement forms for setting:
- conditional(p) <- TRUE
- pvalueCutoff(p) <- 0.0000001
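For readers new to S4, here is a minimal sketch of how an accessor and its replacement form can be defined. The class `ToyParams` and the generic `toyCutoff` are hypothetical names invented for this example; they are not the real HyperGParams definitions.

```r
library(methods)

# Hypothetical toy class standing in for a parameter object.
setClass("ToyParams",
         representation(geneIds = "character",
                        pvalueCutoff = "numeric",
                        conditional = "logical"))

# Accessor: a generic plus a method for the toy class.
setGeneric("toyCutoff", function(obj) standardGeneric("toyCutoff"))
setMethod("toyCutoff", "ToyParams", function(obj) obj@pvalueCutoff)

# Replacement form: enables the syntax toyCutoff(p) <- value.
setGeneric("toyCutoff<-",
           function(obj, value) standardGeneric("toyCutoff<-"))
setReplaceMethod("toyCutoff", "ToyParams", function(obj, value) {
    obj@pvalueCutoff <- value
    validObject(obj)   # re-check slot types after the update
    obj
})

# Use one instance as a template and tweak a copy.
p <- new("ToyParams", geneIds = c("g1", "g2"),
         pvalueCutoff = 0.01, conditional = FALSE)
p2 <- p
toyCutoff(p2) <- 1e-7   # p keeps its original cutoff
```

Because R copies objects on modification, changing `p2` leaves the template `p` untouched.
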
If r is a GOHyperGResult instance:
- pvalues
- oddsRatios
- expectedCounts
- universeCounts
- geneCounts
- geneIdUniverse
- selectedGenes
- summary
- htmlReport
- goDag
Most of the accessors for HyperGParams work here too, so you can answer: Was it conditional? Over or under representation? Etc.
hyperGTest(p)
p can be a:
- GOHyperGParams
- KEGGHyperGParams
- PFAMHyperGParams
Parameter class design makes it easier to run many tests and allows using a single instance as a template for tweaking.
The Hypergeometric test assumes independence of categories.
GO terms are not independent of each other.
Test results often include directly related terms with significant gene overlap.
Is there really evidence for both terms?
More general statements require evidence beyond that which is required to prove more specific statements.
This is an essential component of the scientific method.
We only want to call a GO term significant if there is evidence beyond that provided by its significant children.
Condition out child terms that have tested as significant when testing a given term.
Assess whether there is additional evidence for the parent term.
How:
- Start at the leaves of the GO DAG and compute the Hypergeometric test as usual.
- When testing terms at the next level up, first remove the genes annotated at their significant children.
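The two steps can be sketched on a toy example with one child term and one parent term. All gene sets, the cutoff, and the helper `hyperP` are assumptions made up for illustration, and this simplified version leaves the gene universe unchanged when conditioning; it is not the package's implementation.

```r
# Toy data; every name and number here is hypothetical.
universe <- 1:1000
selected <- 1:50
child    <- c(1:20, 501:510)     # 30 genes, 20 of them selected
parent   <- c(child, 511:540)    # inherits the child's genes, adds 30 more

# Over-representation p-value for one category: P(X >= observed).
hyperP <- function(catGenes, selected, universe) {
    catGenes <- intersect(catGenes, universe)
    nFound <- length(intersect(catGenes, selected))
    phyper(nFound - 1, length(catGenes),
           length(universe) - length(catGenes),
           length(intersect(selected, universe)), lower.tail = FALSE)
}

cutoff <- 0.01

# Step 1: test the leaf as usual.
pChild <- hyperP(child, selected, universe)

# Usual (unconditional) test of the parent, for comparison.
pParentUsual <- hyperP(parent, selected, universe)

# Step 2: drop genes from the significant child, then test the parent.
parentCond <- if (pChild < cutoff) setdiff(parent, child) else parent
pParentCond <- hyperP(parentCond, selected, universe)
```

In this toy case the parent looks significant in the usual test only because it inherits the child's genes. Once those genes are removed, none of its remaining genes are selected and the evidence for the parent disappears.
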