`r knitr::opts_chunk$set(tidy=FALSE); options(width=130)` # How to make a data resource into a database package for Bioconductor ## Why on earth would you want to do this? Many kinds of data can be expressed as a simple table. So why would you not just use a data.frame to hold all your data? Data as a single table quickly becomes hard to represent efficiently if the following things happen: - Missing data - Many to one relationships - Many to many relationships Any of these things will cause your table to become very sparse and inneficient at representing the contents. So when that happens, it makes sense to break the data down into multiple tables. Relational databases like SQLite offer both an efficient means for storing data of this kind as well as a simple language for easily and quickly extracting it. SQL also allows you to recombine the results into single tables as needed. ## Querying data using RSQLite and DBI 1st we need a database file. Lets steal one from an installed package ```{r} dbFile <- system.file("extdata", "TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite", package="TxDb.Hsapiens.UCSC.hg19.knownGene") dbFile ``` Now let's get a connection ```{r} library(RSQLite) txcon <- dbConnect(SQLite(), dbname=dbFile) ``` There is a handy helper function that lists tables: ```{r} dbListTables(txcon) ``` Aand another one that lists fields for a table ```{r} dbListFields(txcon, name="metadata") ``` And now: let's actually look at some useful SQL queries.. First: let's look at a basic select statement ```{r} dbGetQuery(txcon, "SELECT * FROM metadata") ``` Now just get the 1st column ```{r} dbGetQuery(txcon, "SELECT name FROM metadata") ``` So using that, we can explore the chrominfo table like this ```{r} dbListFields(txcon, name="chrominfo") dbGetQuery(txcon, "SELECT chrom,length FROM chrominfo LIMIT 4") ``` And we can restrict our results to only get those chromosomes that are longer than 100000000 using a "WHERE" clause ```{r} dbGetQuery(txcon, "SELECT chrom,length FROM chrominfo WHERE length > 100000000") ``` Or we can restrict it to where the chromsome name is 'chrX' ```{r} dbGetQuery(txcon, "SELECT chrom,length FROM chrominfo WHERE chrom='chrX'") ``` So far so good. But what if we want to get data from a couple tables at once? lets look at a couple of different tables. 1st the gene table ```{r} dbGetQuery(txcon, "SELECT * FROM gene LIMIT 4") ``` And now the transcript table ```{r} dbGetQuery(txcon, "SELECT * FROM transcript LIMIT 4") ``` Now for these two tables, we can do a simple inner join like this ```{r} sql <- "SELECT gene_id, tx_name FROM gene,transcript WHERE gene._tx_id=transcript._tx_id LIMIT 4" dbGetQuery(txcon, sql) ``` This same join can also be written in another fashion using the AS keyword to generate an alias. ```{r} sql <- "SELECT gene_id, tx_name, tx_strand FROM gene AS g, transcript AS t WHERE g._tx_id=t._tx_id LIMIT 4" dbGetQuery(txcon, sql) ``` At 1st it may seem like the alias is just a way to make queries shorter, but it is actually useful for subqueries. A subquery is when you need to make a query inside of another query. So for example you may want to combine a couple of queries like this: ```{r} sql <- "SELECT * FROM gene AS g, (SELECT * FROM transcript WHERE tx_strand='+') AS t WHERE g._tx_id=t._tx_id LIMIT 4" dbGetQuery(txcon, sql) ``` In the above example, the query "SELECT * FROM transcript WHERE tx_strand='+' is run and then the results of that are joined with the contents of the genes table by the outer query." If this all seems like it can get complicated in a hurry that's because it can. But fortunately, there exists another way for joining multiple tables that involves a lot less typing. When you have several tables that all contain the same foreign key, you can join them with the "USING" keyword like this: ```{r} sql <- "SELECT * FROM (SELECT * FROM gene, splicing USING(_tx_id)), transcript USING (_tx_id) LIMIT 4" dbGetQuery(txcon, sql) ``` Another query for joining these three tables would look like this: ```{r} sql <- "SELECT * FROM gene AS g, transcript AS t, splicing AS s WHERE g._tx_id=s._tx_id AND s._tx_id=t._tx_id LIMIT 4" dbGetQuery(txcon, sql) ``` Of course there are other many other keywords that you can learn about as well, but this should be enough to get you started. ### Exercises for Querying Databases from R. Exercise 1: Use what you have learned to extract out the names and chromosomes from the transcript table. Exercise 2: Now look at the splicing and exon tables. Notice how the splicing table has the same field "_exon_id" as the exon table. Use that field to join the contents of the two tables. Exercise 3: Now look at the transcript table again. Notice that it too has a key that can be used to join to the splicing table. Now use the ALIAS feature described above along with using () to group a subequery so that you can merge the exon, splicing AND transcript table into a single result.

[ Back to top ]

## Creating tables 1st lets look at some of the data we want to database. ```{r} gfl <- system.file("extdata","gene_names.txt", package="UnderstandingRBioc") gene_names <- read.table(gfl, header=TRUE) head(gene_names) cfl <- system.file("extdata","chroms.txt", package="UnderstandingRBioc") chroms <- read.table(cfl, header=TRUE) head(chroms) pfl <- system.file("extdata","pmids.txt", package="UnderstandingRBioc") pmids <- read.table(pfl, header=TRUE) head(pmids) ``` Notice that for each of these data.frames, there is a common column that we want to use as a foreign key... Next we need a new database. One that we can write into. The following code will spawn a DB in memory. ```{r} library(RSQLite) dbFile <- sprintf("%s.sqlite", tempfile()) dbFile con <- dbConnect(SQLite(), dbname=dbFile) ``` And once you create a table as below: your DB will exist on the disk as well. Create table statements allow us to create a table. ```{r} sql <- "CREATE TABLE gene_info ( gene_id TEXT, gene_symbol TEXT, gene_name TEXT )" dbGetQuery(con, sql) ``` You can easily use the helpers from before to verify that table is there ```{r} dbListTables(con) ``` And look at its fields as well ```{r} dbListFields(con, name="gene_info") ``` ### Exercise for Creating Tables. Exercise 4: Now that you know how to make a table, make one for the chrom and pmid data.frames above. Be sure to include all the fields that you want to populate.

[ Back to top ]

## Populating tables Lets start by paying close attention to the names of our data.frame. Notice how each column is named. ```{r} head(gene_names) ``` Now we are going to insert that entire data.frame into the table we created above. But we are going to carefully name the columns of the data.frame that we want to insert by using their names like this: ```{r} sql <- "INSERT INTO gene_info VALUES ($gene_id, $symbol, $name)" dbBeginTransaction(con) dbGetPreparedQuery(con, sql, bind.data = gene_names) dbCommit(con) ``` ### Exercises for Populating Tables. Exercise 5: Now take the next step and put the data from the pmid and chrom data.frames into your database. If you have not already done so, you should also make the gene_info table described above. Once you have done this. take a minute to make sure that the data you have inserted is present by writing some queries to extract it.

[ Back to top ]

Now we are going to talk about how we can formally make our database into an actual R object that people can use with select() etc. ## Creating an AnnotationDb object You always have to put the "Db type" and the "Supporting package" into a metadata table for your package. These fields are required by the loadDb supporting method so that it can spawn an AnnotationDb based object for you from your database. The 1st of these tells loadDb what kind of object it is supposed to make and the 2nd tells it where the methods for that object will be found. In this case we know what the "Db type" will be since it will be the new object type we want to define. But it is not yet clear what the "Supporting package" will be since we haven't made a package to hold our new methods yet. So just put AnnotationDbi in as a placeholder for now. ```{r} dbGetQuery(con, "CREATE Table metadata (name TEXT, value TEXT)") dbGetQuery(con, "INSERT INTO metadata VALUES ('Db type','myHamsterDb')") dbGetQuery(con, "INSERT INTO metadata VALUES ('Supporting package', 'AnnotationDbi')") ``` Test for success: ```{r} dbGetQuery(con, "SELECT * FROM metadata") ``` Then declare the class and then use loadDb to make an instance of your object. ```{r} library(AnnotationDbi) setRefClass("myHamsterDb", contains="AnnotationDb") myHamster.db <- loadDb(dbFile) ``` ### Exercise for Metadata. Exercise 6: All databases that intend to eventually use the select method need to also have a metadata table with a name and value field. Run the code above to make sure you have one too. But then look at the original transcriptDb (txcon) database that we started with and examine the kind of data that is found in it's metadata table. What other kinds of data would be good to include here?

[ Back to top ]

## Writing accessors So once you have an object, you will want to make convenient accessors. There are four that most annotation packages support: 'collumns', 'keys', 'keytypes' and 'select'. The columns method normally just returns potential fields that can concievably be returned from the AnnotationDb object when using select. Here is how we can implement that method. First of all we should define a function that knows how to access the connection to our AnnotationDb object and then extract the fields from it. ```{r} .cols <- function(x) { con <- AnnotationDbi:::dbConn(x) list <- dbListTables(con) unwanted <- c("metadata") list <- list[!list %in% unwanted] fields <- as.vector(sapply(list, dbListFields, con=con)) toupper(fields) } ``` Then simply wrap this into a method ```{r} setMethod("columns", "myHamsterDb", function(x){.cols(x)}) ``` Then we can call it like this ```{r} columns(myHamster.db) ``` ### Exercise for Writing Accessors. Exercise 7: The keytypes method often returns the same results as the columns method. Assuming that this is the case this time create a keytypes method.

[ Back to top ]

## ANSWERS to exercises ### Exercise 1: ```{r} dbGetQuery(txcon, "SELECT tx_name,tx_chrom FROM transcript LIMIT 4") ``` ### Exercise 2: ```{r} dbGetQuery(txcon, "SELECT * FROM splicing, exon WHERE splicing._exon_id=exon._exon_id LIMIT 4") ``` ### Exercise 3: ```{r} dbGetQuery(txcon, "SELECT * FROM transcript AS t, (SELECT * FROM splicing,exon WHERE splicing._exon_id=exon._exon_id) AS e WHERE e._tx_id=t._tx_id LIMIT 4") ``` ### Exercise 4: ```{r} sql <- "CREATE TABLE chrom ( gene_id TEXT, chromosome TEXT )" dbGetQuery(con, sql) ``` ```{r} sql <- "CREATE TABLE pmid ( gene_id TEXT, pmid TEXT )" dbGetQuery(con, sql) ``` ### Exercise 5: ```{r} sql <- "INSERT INTO chrom VALUES ($gene_id, $chromosome)" dbBeginTransaction(con) dbGetPreparedQuery(con, sql, bind.data = chroms) dbCommit(con) ``` ```{r} sql <- "INSERT INTO pmid VALUES ($gene_id, $pmid)" dbBeginTransaction(con) dbGetPreparedQuery(con, sql, bind.data = pmids) dbCommit(con) ``` ```{r} dbGetQuery(con, "SELECT * FROM chrom LIMIT 4") dbGetQuery(con, "SELECT * FROM pmid LIMIT 4") ``` ### Exercise 6: This is the code that will create a metadata table for you: ```{r, eval=FALSE} dbGetQuery(con, "CREATE Table metadata (name TEXT, value TEXT)") dbGetQuery(con, "INSERT INTO metadata VALUES ('Db type','myHamsterDb')") dbGetQuery(con, "INSERT INTO metadata VALUES ('Supporting package', 'AnnotationDbi')") ``` And looking at the other metadata table we see all sorts of interesting stuff: ```{r} dbGetQuery(txcon, "SELECT * FROM metadata") ``` ### Exercise 7: Since the keytypes method will be the same as columns, we can just use the same underlying function here. ```{r} setMethod("keytypes", "myHamsterDb", function(x){.cols(x)}) ``` Then we can call it like this ```{r} keytypes(myHamster.db) ```