<!--
%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{E1: Using SQLite Databases with R}
-->
`r knitr::opts_chunk$set(tidy=FALSE); options(width=130)`

# How to make a data resource into a database package for Bioconductor 

##  Why on earth would you want to do this? 

Many kinds of data can be expressed as a simple table.  So why would
you not just use a data.frame to hold all your data?

Data as a single table quickly becomes hard to represent efficiently
if the following things happen:

- Missing data
- Many to one relationships
- Many to many relationships

Any of these things will cause your table to become very sparse and
inneficient at representing the contents. So when that happens, it
makes sense to break the data down into multiple tables.  Relational
databases like SQLite offer both an efficient means for storing data
of this kind as well as a simple language for easily and quickly
extracting it.  SQL also allows you to recombine the results into
single tables as needed.

##  Querying data using RSQLite and DBI 

1st we need a database file. Lets steal one from an installed package
```{r}
dbFile <- system.file("extdata",
	"TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite",
	package="TxDb.Hsapiens.UCSC.hg19.knownGene")
dbFile
```

Now let's get a connection
```{r}
library(RSQLite)
txcon <- dbConnect(SQLite(), dbname=dbFile)
```

There is a handy helper function that lists tables:
```{r}
dbListTables(txcon)
```

Aand another one that lists fields for a table
```{r}
dbListFields(txcon, name="metadata")
```

And now: let's actually look at some useful SQL queries..

First: let's look at a basic select statement
```{r}
dbGetQuery(txcon, "SELECT * FROM metadata")
```

Now just get the 1st column
```{r}
dbGetQuery(txcon, "SELECT name FROM metadata")
```

So using that, we can explore the chrominfo table like this
```{r}
dbListFields(txcon, name="chrominfo")
dbGetQuery(txcon, "SELECT chrom,length FROM chrominfo LIMIT 4")
```

And we can restrict our results to only get those chromosomes that are
longer than 100000000 using a "WHERE" clause
```{r}
dbGetQuery(txcon, "SELECT chrom,length FROM chrominfo WHERE length > 100000000")
```

Or we can restrict it to where the chromsome name is 'chrX'
```{r}
dbGetQuery(txcon, "SELECT chrom,length FROM chrominfo WHERE chrom='chrX'")
```

So far so good.  But what if we want to get data from a couple tables
at once?

lets look at a couple of different tables.  1st the gene table
```{r}
dbGetQuery(txcon, "SELECT * FROM gene LIMIT 4")
```

And now the transcript table
```{r}
dbGetQuery(txcon, "SELECT * FROM transcript LIMIT 4")
```

Now for these two tables, we can do a simple inner join like this
```{r}
sql <- "SELECT gene_id, tx_name 
        FROM gene,transcript 
       	WHERE gene._tx_id=transcript._tx_id 
        LIMIT 4"
dbGetQuery(txcon, sql)
```

This same join can also be written in another fashion using the AS
keyword to generate an alias.
```{r}
sql <- "SELECT gene_id, tx_name, tx_strand
        FROM gene AS g, transcript AS t 
       	WHERE g._tx_id=t._tx_id 
        LIMIT 4"
dbGetQuery(txcon, sql)
```

At 1st it may seem like the alias is just a way to make queries
shorter, but it is actually useful for subqueries.  A subquery is when
you need to make a query inside of another query.  So for example you
may want to combine a couple of queries like this:
```{r}
sql <- "SELECT * FROM gene AS g,
       (SELECT * FROM transcript WHERE tx_strand='+') AS t
       WHERE  g._tx_id=t._tx_id
       LIMIT 4"
dbGetQuery(txcon, sql)
```

In the above example, the query "SELECT * FROM transcript WHERE
tx_strand='+' is run and then the results of that are joined with the
contents of the genes table by the outer query."

If this all seems like it can get complicated in a hurry that's
because it can.  But fortunately, there exists another way for joining
multiple tables that involves a lot less typing.  When you have
several tables that all contain the same foreign key, you can join
them with the "USING" keyword like this:

```{r}
sql <- "SELECT * FROM (SELECT * FROM gene, splicing USING(_tx_id)), transcript  USING (_tx_id) LIMIT 4"
dbGetQuery(txcon, sql)
```

Another query for joining these three tables would look like this:

```{r}
sql <- "SELECT * FROM gene AS g, transcript AS t, splicing AS s WHERE g._tx_id=s._tx_id AND s._tx_id=t._tx_id LIMIT 4"
dbGetQuery(txcon, sql)
```

Of course there are other many other keywords that you can learn about
as well, but this should be enough to get you started.


### Exercises for Querying Databases from R.

Exercise 1: Use what you have learned to extract out the names and
chromosomes from the transcript table.

Exercise 2: Now look at the splicing and exon tables.  Notice how the
splicing table has the same field "_exon_id" as the exon table.  Use
that field to join the contents of the two tables.

Exercise 3: Now look at the transcript table again.  Notice that it
too has a key that can be used to join to the splicing table.  Now use
the ALIAS feature described above along with using () to group a
subequery so that you can merge the exon, splicing AND transcript
table into a single result.

<p class="back_to_top">[ <a href="#top">Back to top</a> ]</p>

##  Creating tables 

1st lets look at some of the data we want to database.

```{r}
gfl <- system.file("extdata","gene_names.txt", package="UnderstandingRBioc")
gene_names <- read.table(gfl, header=TRUE)
head(gene_names)

cfl <- system.file("extdata","chroms.txt", package="UnderstandingRBioc")
chroms <- read.table(cfl, header=TRUE)
head(chroms)

pfl <- system.file("extdata","pmids.txt", package="UnderstandingRBioc")
pmids <- read.table(pfl, header=TRUE)
head(pmids)
```

Notice that for each of these data.frames, there is a common column
that we want to use as a foreign key...

Next we need a new database.  One that we can write into.  The
following code will spawn a DB in memory.

```{r}
library(RSQLite)
dbFile <- sprintf("%s.sqlite", tempfile())
dbFile
con <- dbConnect(SQLite(), dbname=dbFile)
```

And once you create a table as below: your DB will exist on the disk as well.

Create table statements allow us to create a table.
```{r}
sql <- "CREATE TABLE gene_info (
            gene_id TEXT,
            gene_symbol TEXT,
            gene_name TEXT
        )"
dbGetQuery(con, sql)
```

You can easily use the helpers from before to verify that table is there
```{r}
dbListTables(con)
```

And look at its fields as well
```{r}
dbListFields(con, name="gene_info")
```

### Exercise for Creating Tables.

Exercise 4: Now that you know how to make a table, make one for the
chrom and pmid data.frames above.  Be sure to include all the fields
that you want to populate.

<p class="back_to_top">[ <a href="#top">Back to top</a> ]</p>

##  Populating tables 

Lets start by paying close attention to the names of our data.frame.
Notice how each column is named.
```{r}
head(gene_names)
```

Now we are going to insert that entire data.frame into the table we
created above.  But we are going to carefully name the columns of the
data.frame that we want to insert by using their names like this:

```{r}
sql <- "INSERT INTO gene_info
        VALUES ($gene_id, $symbol, $name)"
dbBeginTransaction(con)
dbGetPreparedQuery(con, sql, bind.data = gene_names)
dbCommit(con)
```

### Exercises for Populating Tables.

Exercise 5: Now take the next step and put the data from the pmid and
chrom data.frames into your database.  If you have not already done
so, you should also make the gene_info table described above.  Once
you have done this. take a minute to make sure that the data you have
inserted is present by writing some queries to extract it.

<p class="back_to_top">[ <a href="#top">Back to top</a> ]</p>

Now we are going to talk about how we can formally make our database
into an actual R object that people can use with select() etc.

##  Creating an AnnotationDb object 

You always have to put the "Db type" and the "Supporting package" into
a metadata table for your package.  These fields are required by the
loadDb supporting method so that it can spawn an AnnotationDb based
object for you from your database.  The 1st of these tells loadDb what
kind of object it is supposed to make and the 2nd tells it where the
methods for that object will be found.  In this case we know what the
"Db type" will be since it will be the new object type we want to
define.  But it is not yet clear what the "Supporting package" will be
since we haven't made a package to hold our new methods yet.  So just
put AnnotationDbi in as a placeholder for now.

```{r}
dbGetQuery(con, "CREATE Table metadata (name TEXT, value TEXT)")
dbGetQuery(con, "INSERT INTO metadata VALUES ('Db type','myHamsterDb')")
dbGetQuery(con, "INSERT INTO metadata VALUES ('Supporting package',
		'AnnotationDbi')")
```

Test for success:
```{r}
dbGetQuery(con, "SELECT * FROM metadata")
```

Then declare the class and then use loadDb to make an instance of your
object.

```{r}
library(AnnotationDbi)
setRefClass("myHamsterDb", contains="AnnotationDb")
myHamster.db <- loadDb(dbFile)
```

### Exercise for Metadata.

Exercise 6: All databases that intend to eventually use the select
method need to also have a metadata table with a name and value field.
Run the code above to make sure you have one too.  But then look at
the original transcriptDb (txcon) database that we started with and examine
the kind of data that is found in it's metadata table.  What other
kinds of data would be good to include here?

<p class="back_to_top">[ <a href="#top">Back to top</a> ]</p>

##  Writing accessors 

So once you have an object, you will want to make convenient
accessors.  There are four that most annotation packages support:
'collumns', 'keys', 'keytypes' and 'select'.  The columns method
normally just returns potential fields that can concievably be
returned from the AnnotationDb object when using select.  Here is how
we can implement that method. First of all we should define a function
that knows how to access the connection to our AnnotationDb object and
then extract the fields from it.

```{r}
.cols <- function(x)
{
    con <- AnnotationDbi:::dbConn(x)
    list <- dbListTables(con)
    unwanted <- c("metadata")
    list <- list[!list %in% unwanted]
    fields <- as.vector(sapply(list, dbListFields, con=con))
    toupper(fields)
}
```

Then simply wrap this into a method
```{r}
setMethod("columns", "myHamsterDb", function(x){.cols(x)})
```

Then we can call it like this
```{r}
columns(myHamster.db)
```
### Exercise for Writing Accessors.

Exercise 7: The keytypes method often returns the same results as the
columns method.  Assuming that this is the case this time create a
keytypes method.

<p class="back_to_top">[ <a href="#top">Back to top</a> ]</p>


## ANSWERS to exercises
### Exercise 1:

```{r}
dbGetQuery(txcon, "SELECT tx_name,tx_chrom FROM transcript LIMIT 4")
```

### Exercise 2:

```{r}
dbGetQuery(txcon, 
  "SELECT * FROM splicing, exon WHERE splicing._exon_id=exon._exon_id LIMIT 4")
```

### Exercise 3:

```{r}
dbGetQuery(txcon,
   "SELECT * FROM transcript AS t,
   (SELECT * FROM splicing,exon WHERE splicing._exon_id=exon._exon_id) AS e
   WHERE e._tx_id=t._tx_id LIMIT 4")
```


### Exercise 4:

```{r}
sql <- "CREATE TABLE chrom (
            gene_id TEXT,
            chromosome TEXT
        )"
dbGetQuery(con, sql)
```

```{r}
sql <- "CREATE TABLE pmid (
            gene_id TEXT,
            pmid TEXT
        )"
dbGetQuery(con, sql)
```


### Exercise 5:

```{r}
sql <- "INSERT INTO chrom
        VALUES ($gene_id, $chromosome)"
dbBeginTransaction(con)
dbGetPreparedQuery(con, sql, bind.data = chroms)
dbCommit(con)
```

```{r}
sql <- "INSERT INTO pmid
        VALUES ($gene_id, $pmid)"
dbBeginTransaction(con)
dbGetPreparedQuery(con, sql, bind.data = pmids)
dbCommit(con)
```

```{r}
dbGetQuery(con, "SELECT * FROM chrom LIMIT 4")
dbGetQuery(con, "SELECT * FROM pmid LIMIT 4")
```


### Exercise 6:

This is the code that will create a metadata table for you:

```{r, eval=FALSE}
dbGetQuery(con, "CREATE Table metadata (name TEXT, value TEXT)")
dbGetQuery(con, "INSERT INTO metadata VALUES ('Db type','myHamsterDb')")
dbGetQuery(con, "INSERT INTO metadata VALUES ('Supporting package',
		'AnnotationDbi')")
```


And looking at the other metadata table we see all sorts of interesting stuff:

```{r}
dbGetQuery(txcon, "SELECT * FROM metadata")
```


### Exercise 7:

Since the keytypes method will be the same as columns, we can just use
the same underlying function here.

```{r}
setMethod("keytypes", "myHamsterDb", function(x){.cols(x)})
```

Then we can call it like this
```{r}
keytypes(myHamster.db)
```