---
title: "A.1 -- Using R"
author: "Martin Morgan <Martin.Morgan@RoswellPark.org>"
date: "11 - 12 September 2017"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  % \VignetteIndexEntry{A.1 -- Using R}
  % \VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```

# _RStudio_: A Quick Tour

Panes

Options

Help

Environment, History, and Files

# _R_: First Impressions

Type values and mathematical formulas into _R_'s command prompt

```{r plus}
1 + 1
```

Assign values to symbols (variables)

```{r values-symbols}
x = 1
x + x
```

Invoke functions such as `c()`, which takes any number of values and
returns a single _vector_

```{r vector}
x = c(1, 2, 3)
x
```

_R_ functions, such as `sqrt()`, often operate efficiently on vectors

```{r vectorized}
y = sqrt(x)
y
```
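
To make 'operate efficiently on vectors' concrete, the sketch below compares a single vectorized call with an explicit loop. The names `v`, `vectorized`, and `looped` are illustrative only and are not used elsewhere in this document.

```{r vectorized-vs-loop}
v <- c(1, 4, 9, 16)                  # hypothetical data

# Vectorized: one call handles every element
vectorized <- sqrt(v)

# Explicit loop: same result, more code
looped <- numeric(length(v))
for (i in seq_along(v))
    looped[i] <- sqrt(v[i])

vectorized
identical(vectorized, looped)
```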

There are often several ways to accomplish a task in _R_

```{r skinning-the-cat}
x = c(1, 2, 3)
x
x <- c(4, 5, 6)
x
x <- 7:9
x
10:12 -> x
x
```

Sometimes _R_ does 'surprising' things that can be fun to figure out

```{r surprise}
x <- c(1, 2, 3) -> y
x
y
```

## _R_ Data types: vector and list

'Atomic' vectors

- Types include integer, numeric (floating-point; real), complex,
  logical, character, raw (bytes)

    ```{r atomic-vectors}
    people <- c("Lori", "Nitesh", "Valerie", "Herve")
    people
    ```

- Atomic vectors can be named

    ```{r named-vectors}
    population <- c(Buffalo=259000, Rochester=210000, `New York`=8400000)
    population
    log10(population)
    ```

- Statistical concepts like `NA` ("not available")

    ```{r NA-concept}
    truthiness <- c(TRUE, FALSE, NA)
    truthiness
    ```

- Logical concepts like 'and' (`&`), 'or' (`|`), and 'not' (`!`)

    ```{r logical-concept}
    !truthiness
    truthiness | !truthiness
    truthiness & !truthiness
    ```

- Numerical concepts like infinity (`Inf`) or not-a-number (`NaN`,
  e.g., 0 / 0)

    ```{r numerical-concept}
    undefined_numeric_values <- c(NA, 0/0, NaN, Inf, -Inf)
    undefined_numeric_values
    sqrt(undefined_numeric_values)
    ```

- Common string manipulations

    ```{r string-manipulation}
    toupper(people)
    substr(people, 1, 3)
    ```

- _R_ is a green consumer -- recycling short vectors to align with
  long vectors

    ```{r greenery}
    x <- 1:3
    x * 2            # '2' (vector of length 1) recycled to c(2, 2, 2)
    truthiness | NA
    truthiness & NA
    ```

- It's very common to nest operations, which can be simultaneously
  compact, confusing, and expressive (`[`: subset; `<`: less than)

    ```{r nested-operations}
    substr(tolower(people), 1, 3)
    population[population < 1000000]
    ```
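
As a small illustration of the atomic types and the recycling rule listed above, the following sketch uses `typeof()` to report the type of a few values, shows that `c()` coerces mixed inputs to one common type, and recycles a length-2 vector against a length-4 vector. The name `mixed` is made up for this example.

```{r atomic-types-recycling}
# typeof() reports the underlying atomic type
typeof(1L)           # integer
typeof(1)            # double, i.e., floating-point 'numeric'
typeof(1+2i)         # complex
typeof(TRUE)         # logical
typeof("a")          # character
typeof(as.raw(255))  # raw

# c() coerces mixed inputs to a single common type
mixed <- c(1, "a", TRUE)
typeof(mixed)
mixed

# Recycling repeats the shorter vector as often as needed
1:4 + c(10, 20)      # c(10, 20) recycled to c(10, 20, 10, 20)
```

When the longer length is not a multiple of the shorter one, _R_ still recycles but issues a warning.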

Lists

- The list type can contain other vectors, including other lists

    ```{r lists}
    frenemies = list(
        friends=c("Larry", "Richard", "Vivian"),
        enemies=c("Dick", "Mike")
    )
    frenemies
    ```

- `[` subsets one list to create another list; `[[` extracts a single list element

    ```{r list-subset}
    frenemies[1]
    frenemies[c("enemies", "friends")]
    frenemies[["enemies"]]
    ```
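
The difference between `[` and `[[` can be checked by asking for the class of each result. This is a minimal sketch; the list `pets` and the names `sub` and `elt` are hypothetical.

```{r list-subset-classes}
pets <- list(dogs = c("Rex", "Fido"), cats = "Tom")   # hypothetical data

sub <- pets["dogs"]        # `[`: a list of length 1
elt <- pets[["dogs"]]      # `[[`: the character vector itself

class(sub)
class(elt)
identical(pets$dogs, elt)  # `$dogs` extracts the same element as [["dogs"]]
```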

Factors

- Character-like vectors, but with values restricted to specific levels

    ```{r factors}
    sex = factor(c("Male", "Male", "Female"),
                 levels=c("Female", "Male", "Hermaphrodite"))
    sex
    sex == "Female"
    table(sex)
    sex[sex == "Female"]
    ```
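
One way to see how a factor restricts values to its levels is to look at the levels and the integer codes stored underneath. The factor `size` below is invented for illustration.

```{r factor-levels}
size <- factor(c("small", "large", "small"),
               levels = c("small", "medium", "large"))  # hypothetical data

levels(size)         # all permitted values, including unused 'medium'
as.integer(size)     # underlying integer codes index into levels()
table(size)          # unused levels are counted as 0
droplevels(size)     # drop levels that do not occur in the data
```

Assigning a value that is not one of the levels produces `NA` with a warning.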

## Classes: data.frame and beyond

Variables are often related to one another in a highly structured way,
e.g., two 'columns' of data in a spreadsheet

```{r related-variables}
x = rnorm(1000)       # 1000 random normal deviates
y = x + rnorm(1000)   # another 1000 deviates, as a function of x
plot(y ~ x)           # relationship between x and y
```

Convenient to manipulate them together

- `data.frame()`: like columns in a spreadsheet

    ```{r data.frame}
    df = data.frame(X=x, Y=y)
    head(df)           # first 6 rows
    plot(Y ~ X, df)    # same as above
    ```

- See all data with `View(df)`. Summarize data with `summary(df)`

    ```{r data.frame-summary}
    summary(df)
    ```

- Easy to manipulate data in a coordinated way, e.g., access column
  `X` with `$` and subset for just those values greater than 0

    ```{r data.frame-subset}
    positiveX = df[df$X > 0,]
    head(positiveX)
    plot(Y ~ X, positiveX)
    ```

- _R_ is introspective -- ask it about itself

    ```{r introspection}
    class(df)
    dim(df)
    colnames(df)
    ```

- `matrix()` is a related class, where all elements have the same type (a
  `data.frame()` requires elements within a column to be the same
  type, but elements between columns can be different types).
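
The contrast between `matrix()` and `data.frame()` can be sketched with a few lines. The objects `m` and `d` are hypothetical, and `stringsAsFactors = FALSE` is set explicitly because the default differs between _R_ versions.

```{r matrix-vs-data-frame}
# A matrix has one type for all elements
m <- matrix(1:6, nrow = 2)
m
typeof(m)

# Combining numbers and strings in a matrix coerces everything to character
cbind(id = 1:2, label = c("a", "b"))

# A data.frame keeps one type per column
d <- data.frame(id = 1:2, label = c("a", "b"),
                stringsAsFactors = FALSE)   # hypothetical data
sapply(d, class)
```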

A scatterplot makes one want to fit a linear model (do a regression
analysis)

- Use a _formula_ to describe the relationship between variables
- Variables named in the formula are looked up in the second argument
  (here, `df`)

    ```{r lm-formula}
    fit <- lm(Y ~ X, df)
    ```

- Visualize the points, and add the regression line

    ```{r lm-plot}
    plot(Y ~ X, df)
    abline(fit, col="red", lwd=3)
    ```

- Summarize the fit as an ANOVA table

    ```{r anova}
    anova(fit)
    ```

- N.B. -- 'Type I' sums-of-squares, so order of independent variables
  matters; use `drop1()` for 'Type III'. See [DataCamp Quick-R][]

- Introspection -- what class is `fit`? What _methods_ can I apply to
  an object of that class?

    ```{r class-method-introspection}
    class(fit)
    methods(class=class(fit))
    ```
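
A few of the methods reported for class `lm` are sketched below on freshly simulated data, so the example does not depend on objects created earlier. The names `demo` and `demo_fit` are hypothetical.

```{r lm-methods}
set.seed(123)                          # reproducible simulated data
demo <- data.frame(X = rnorm(100))     # hypothetical data
demo$Y <- demo$X + rnorm(100)
demo_fit <- lm(Y ~ X, demo)

coef(demo_fit)                         # intercept and slope
head(residuals(demo_fit))              # observed minus fitted values
predict(demo_fit, data.frame(X = c(-1, 0, 1)))   # predictions at new X
summary(demo_fit)$r.squared            # proportion of variance explained
```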

## Help!

Help is available in _RStudio_ or interactively

- Check out the help page for `rnorm()`

    ```{r, eval=FALSE}
    ?rnorm
    ```

- 'Usage' section describes how the function can be used

    ```
    rnorm(n, mean = 0, sd = 1)
    ```

- Arguments, some with default values. Arguments are matched first by
  name, then by position

- 'Arguments' section describes what the arguments are supposed to be

- 'Value' section describes return value

- 'Examples' section illustrates use

- Help pages often include citations to relevant technical documentation,
  references to related functions, and obscure details

- Can be intimidating, but in the end actually _very_ useful
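
The matching rule described above (first by name, then by position) can be checked directly with `rnorm()`. In this sketch the seed is reset before each call so both calls draw the same random numbers; `by_position` and `by_name` are illustrative names.

```{r argument-matching}
args(rnorm)                           # shows arguments and their defaults

set.seed(1)
by_position <- rnorm(3, 10, 2)        # n = 3, mean = 10, sd = 2

set.seed(1)
by_name <- rnorm(sd = 2, n = 3, 10)   # named arguments first; 10 fills 'mean'

identical(by_position, by_name)
```

`help("rnorm")` opens the same page as `?rnorm`.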

[DataCamp Quick-R]: http://www.statmethods.net/stats/anova.html