`r knitr::opts_chunk$set(tidy=FALSE, eval=file.exists("~/xml/msigdb_v4.0.xml"))`

# XML

XML documents are a series of nested tags, possibly with attributes. An example is the [MSigDB xml][MSigDB] file, which contains curated gene sets stored following a [format specification][MSigDBFormat]. Download a copy to your AMI, and store it in a directory `~/xml/`. The first few lines of this file look like

[1] [2] [3] [4] ... [10299]

Line 1 tells us about the version of XML used in the document, and the character encoding. Line 3 opens the `MSIGDB` node. The node has several attributes, `NAME`, `VERSION`, `BUILD_DATE`, as described in the [format specification][MSigDBFormat]. Nested inside the `MSIGDB` node is the first of many `GENESET` nodes; the `MSIGDB` node terminates on the final line of the file, with `</MSIGDB>`. The `GENESET` node has several attributes (of which only one is shown) and an empty body, and terminates with ``.

# Interacting with XML: XPath

Load the database into R

```{r xml-load}
library(XML)
xml <- xmlTreeParse("~/xml/msigdb_v4.0.xml", useInternalNodes=TRUE)
```

_Don't bother to print `xml`_; it'll scroll across your screen for quite a while.

Elements of XML can be addressed using [XPath][]. The idea is to specify the path from the root of the document to the node(s) or attributes that you're interested in. The path is like a Linux file path, starting with `/`. Attributes are specified with `@` before their name. We can subset the `xml` object using this language, e.g.,

```{r xpath-attr}
xml[["/MSIGDB/@NAME"]]
```

An alternative is to use the `xmlAttrs` function to extract the attributes of the node we're interested in

```{r xmlAttrs}
xmlAttrs(xml[["/MSIGDB"]])
```

There is only one `NAME` attribute of `MSIGDB`, but there are many `GENESET` child nodes.
Here we create a _node set_ of all of these

```{r nodeset}
sets <- xml["/MSIGDB/GENESET"]
class(sets)
length(sets)
```

XPath provides a convenient syntax for querying nested paths: `//GENESET` says to start at the root and find all paths that have `GENESET` at any level.

We could manipulate `sets` at the R level, e.g., selecting the second element and viewing the first four attributes

```{r xml-R}
head(xmlAttrs(sets[[2]]), 4)
```

but it's more fun to formulate this query using XPath to select all attributes of the second gene set

```{r xml-xpath}
head(xml["//GENESET[2]/@*"], 4)
```

Notice that this gene set has a `STANDARD_NAME` attribute. We can use this to select the gene set

```{r xml-attr-select}
yy <- xml[["//GENESET[@STANDARD_NAME = 'EXTRINSIC_TO_PLASMA_MEMBRANE']"]]
xmlAttrs(yy)[1:4]
```

There are many gene sets in our document; we might like to visit them all and extract a particular element, e.g., the `ORGANISM` attribute. We can do this by iterating over the node set in R

```{r xml-nodeset-iter}
organism <- sapply(sets, function(elt) xmlAttrs(elt)["ORGANISM"])
```

but again a fun way to do this is to use an `sapply`-like formulation on the XML document itself

```{r xml-nodeset-xpath}
organism <- xpathSApply(xml, "//GENESET/@ORGANISM")
table(organism)
```

The XPath specification includes functions that are useful for, e.g., string matching. A simple example is to count the number of gene sets in our document

```{r xml-count}
xml[["count(//GENESET)"]]
xml[["count(//GENESET[@ORGANISM='Homo sapiens'])"]]
```

Section [2.5 Abbreviated Syntax][XPathAbbrev] of the XPath specification is a very handy introduction to the flexibility of XPath queries.

**Exercise** Use an XPath query to select the 5 gene sets that have `ORGANISM` equal to 'Danio rerio'. Use a single XPath query to determine the `STANDARD_NAME` of these gene sets.

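
The string-matching functions mentioned above can be tried out without the full MSigDB file. The chunk below is a minimal sketch on a small in-memory document: `toy`, `txt`, and every attribute value in it are made up for illustration, and it assumes the XPath 1.0 functions `starts-with()` and `contains()` behave as described in the specification.

```{r xpath-functions-sketch}
library(XML)

## A tiny, made-up document; the attribute values are illustrative only
txt <- '<MSIGDB NAME="toy">
  <GENESET STANDARD_NAME="SET_A" ORGANISM="Homo sapiens"/>
  <GENESET STANDARD_NAME="SET_B" ORGANISM="Mus musculus"/>
  <GENESET STANDARD_NAME="OTHER_C" ORGANISM="Homo sapiens"/>
</MSIGDB>'
toy <- xmlParse(txt, asText=TRUE)

## starts-with(): STANDARD_NAME of gene sets whose name begins with 'SET_'
set_names <- xpathSApply(toy,
    "//GENESET[starts-with(@STANDARD_NAME, 'SET_')]/@STANDARD_NAME")
set_names

## contains() inside count(): number of gene sets whose ORGANISM
## attribute includes the substring 'sapiens'
n_sapiens <- toy[["count(//GENESET[contains(@ORGANISM, 'sapiens')])"]]
n_sapiens
```

In both queries the predicate in square brackets filters the `GENESET` nodes first; the trailing `/@STANDARD_NAME` step or the enclosing `count()` is then applied only to the nodes that passed the filter.
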
# (Advanced) XML event parsing

- Scenario: very large XML file
- Solution: iterative processing
- Implementation: [XML][] package `xmlEventParse()`

`xmlEventParse()`

- Provide a 'callback' function to process data each time a node of a particular type is encountered
- Implement the callback as a _closure_, e.g., by using a 'factory' function that returns a function that retains state across calls.

**Example**: from [StackOverflow][SOEvents]

**Advanced exercise**: implement event parsing to retrieve the `STANDARD_NAME` and `DESCRIPTION_BRIEF` attributes from all `GENESET` nodes.

[XML]: http://cran.r-project.org/web/packages/XML
[MSigDBFormat]: http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/MSigDB_XML_description
[MSigDB]: http://www.broadinstitute.org/gsea/downloads.jsp
[XPath]: http://www.w3.org/TR/xpath/
[XPathAbbrev]: http://www.w3.org/TR/xpath/#path-abbrev
[SOEvents]: http://stackoverflow.com/questions/16676798#16681768
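
The closure idea described above can be sketched on a small in-memory document. This is only an illustration under stated assumptions: the `<library>` document, the factory `titleCollector()`, and the object `collector` are all hypothetical names, and the sketch uses a plain `startElement` handler, which may differ from the approach taken in the linked StackOverflow answer.

```{r event-parse-sketch}
library(XML)

## A tiny, made-up document standing in for a very large file
txt <- '<library>
  <book title="A" year="2001"/>
  <book title="B" year="2005"/>
  <journal title="C"/>
</library>'

## Factory function (hypothetical name): 'titles' lives in the factory's
## environment, so the handler can update it with <<- and the value
## persists from one callback invocation to the next
titleCollector <- function(tag) {
    titles <- character()
    startElement <- function(name, attrs, ...) {
        ## called once for every opening tag the parser encounters
        if (name == tag)
            titles <<- c(titles, unname(attrs["title"]))
    }
    list(handlers = list(startElement = startElement),
         result = function() titles)
}

collector <- titleCollector("book")
invisible(xmlEventParse(txt, handlers = collector$handlers,
                        asText = TRUE, addContext = FALSE))

## Titles accumulated across all callback invocations
collector$result()
```

Because the document is never held in memory as a tree, only the accumulated `titles` vector grows as parsing proceeds; to process a file on disk, pass its path and drop `asText = TRUE`.
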