# Introduction to R

Martin Morgan
February 2, 2015

## R

Language and environment for statistical computing and graphics

• Full-featured programming language
• Interactive and interpretted – convenient and forgiving
• Coherent, extensive documentation
• Statistical, e.g. `factor()`, `NA`
• Extensible – CRAN, Bioconductor, github, …

Vector, class, object

• Efficient vectorized calculations on 'atomic' vectors `logical`, `integer`, `numeric`, `complex`, `character`, `byte`
• Atomic vectors are building blocks for more complicated objects
• `matrix` – atomic vector with 'dim' attribute
• `data.frame` – list of equal length atomic vectors
• Formal classes represent complicated combinations of vectors, e.g., the return value of `lm()`, below

Function, generic, method

• Functions transform inputs to outputs, perhaps with side effects, e.g., `rnorm(1000)`
• Argument matching first by name, then by position
• Functions may define (some) arguments to have default values
• Generic functions dispatch to specific methods based on class of argument(s), e.g., `print()`.
• Methods are functions that implement specific generics, e.g., `print.factor`; methods are invoked indirectly, via the generic.

Introspection

• General properties, e.g., `class()`, `str()`
• Class-specific properties, e.g., `dim()`

Help

• `?print`: help on the generic print
• `?print.data.frame`: help on print method for objects of class data.frame.

Example

``````x <- rnorm(1000)                   # atomic vectors
y <- x + rnorm(1000, sd=.5)
df <- data.frame(x=x, y=y)         # object of class 'data.frame'
plot(y ~ x, df)                    # generic plot, method plot.formula
``````

``````fit <- lm(y ~x, df)                # object of class 'lm'
methods(class=class(fit))          # introspection
``````
``````##  [1] add1.lm*           alias.lm*          anova.lm*
##  [4] case.names.lm*     confint.lm         cooks.distance.lm*
##  [7] deviance.lm*       dfbeta.lm*         dfbetas.lm*
## [10] drop1.lm*          dummy.coef.lm      effects.lm*
## [13] extractAIC.lm*     family.lm*         formula.lm*
## [16] hatvalues.lm*      influence.lm*      kappa.lm
## [19] labels.lm*         logLik.lm*         model.frame.lm*
## [22] model.matrix.lm    nobs.lm*           plot.lm*
## [25] predict.lm         print.lm*          proj.lm*
## [28] qr.lm*             residuals.lm       rstandard.lm*
## [31] rstudent.lm*       simulate.lm*       summary.lm
## [34] variable.names.lm* vcov.lm*
##
##    Non-visible functions are asterisked
``````

## Lab

### 1. R data manipulation

This exercise servers as a refresher / tutorial on basic input and manipulation of data.

Input a file that contains ALL (acute lymphoblastic leukemia) patient information

``````fname <- file.choose()   ## "ALLphenoData.tsv"
stopifnot(file.exists(fname))
``````

Check out the help page `?read.delim` for input options, and explore basic properties of the object you've created, for instance…

``````class(pdata)
``````
``````## [1] "data.frame"
``````
``````colnames(pdata)
``````
``````##  [1] "id"             "diagnosis"      "sex"            "age"
##  [5] "BT"             "remission"      "CR"             "date.cr"
##  [9] "t.4.11."        "t.9.22."        "cyto.normal"    "citog"
## [13] "mol.biol"       "fusion.protein" "mdr"            "kinet"
## [17] "ccr"            "relapse"        "transplant"     "f.u"
## [21] "date.last.seen"
``````
``````dim(pdata)
``````
``````## [1] 127  21
``````
``````head(pdata)
``````
``````##     id diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE
## 2 1010 3/29/2000   M  19 B2        CR CR 6/27/2000   FALSE   FALSE
## 3 3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA
## 4 4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE
## 5 4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE
## 6 4008 7/30/1997   M  17 B1        CR CR 9/27/1997   FALSE   FALSE
##   cyto.normal        citog mol.biol fusion.protein mdr   kinet   ccr
## 1       FALSE      t(9;22)  BCR/ABL           p210 NEG dyploid FALSE
## 2       FALSE  simple alt.      NEG           <NA> POS dyploid FALSE
## 3          NA         <NA>  BCR/ABL           p190 NEG dyploid FALSE
## 4       FALSE      t(4;11) ALL1/AF4           <NA> NEG dyploid FALSE
## 5       FALSE      del(6q)      NEG           <NA> NEG dyploid FALSE
## 6       FALSE complex alt.      NEG           <NA> NEG hyperd. FALSE
##   relapse transplant               f.u date.last.seen
## 1   FALSE       TRUE BMT / DEATH IN CR           <NA>
## 2    TRUE      FALSE               REL      8/28/2000
## 3    TRUE      FALSE               REL     10/15/1999
## 4    TRUE      FALSE               REL      1/23/1998
## 5    TRUE      FALSE               REL      11/4/1997
## 6    TRUE      FALSE               REL     12/15/1997
``````
``````summary(pdata\$sex)
``````
``````##    F    M NA's
##   42   83    2
``````
``````summary(pdata\$cyto.normal)
``````
``````##    Mode   FALSE    TRUE    NA's
## logical      69      24      34
``````

Remind yourselves about various ways to subset and access columns of a data.frame

``````pdata[1:5, 3:4]
``````
``````##   sex age
## 1   M  53
## 2   M  19
## 3   F  52
## 4   M  38
## 5   M  57
``````
``````pdata[1:5, ]
``````
``````##     id diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE
## 2 1010 3/29/2000   M  19 B2        CR CR 6/27/2000   FALSE   FALSE
## 3 3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA
## 4 4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE
## 5 4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE
##   cyto.normal       citog mol.biol fusion.protein mdr   kinet   ccr
## 1       FALSE     t(9;22)  BCR/ABL           p210 NEG dyploid FALSE
## 2       FALSE simple alt.      NEG           <NA> POS dyploid FALSE
## 3          NA        <NA>  BCR/ABL           p190 NEG dyploid FALSE
## 4       FALSE     t(4;11) ALL1/AF4           <NA> NEG dyploid FALSE
## 5       FALSE     del(6q)      NEG           <NA> NEG dyploid FALSE
##   relapse transplant               f.u date.last.seen
## 1   FALSE       TRUE BMT / DEATH IN CR           <NA>
## 2    TRUE      FALSE               REL      8/28/2000
## 3    TRUE      FALSE               REL     10/15/1999
## 4    TRUE      FALSE               REL      1/23/1998
## 5    TRUE      FALSE               REL      11/4/1997
``````
``````head(pdata[, 3:5])
``````
``````##   sex age BT
## 1   M  53 B2
## 2   M  19 B2
## 3   F  52 B4
## 4   M  38 B1
## 5   M  57 B2
## 6   M  17 B1
``````
``````tail(pdata[, 3:5], 3)
``````
``````##     sex age BT
## 125   M  19 T2
## 126   M  30 T3
## 127   M  29 T2
``````
``````head(pdata\$age)
``````
``````## [1] 53 19 52 38 57 17
``````
``````head(pdata\$sex)
``````
``````## [1] M M F M M M
## Levels: F M
``````
``````head(pdata[pdata\$age > 21,])
``````
``````##      id diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22.
## 1  1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE
## 3  3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA
## 4  4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE
## 5  4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE
## 10 8001 1/15/1997   M  40 B2        CR CR 3/26/1997   FALSE   FALSE
## 11 8011 8/21/1998   M  33 B3        CR CR 10/8/1998   FALSE   FALSE
##    cyto.normal        citog mol.biol fusion.protein mdr   kinet   ccr
## 1        FALSE      t(9;22)  BCR/ABL           p210 NEG dyploid FALSE
## 3           NA         <NA>  BCR/ABL           p190 NEG dyploid FALSE
## 4        FALSE      t(4;11) ALL1/AF4           <NA> NEG dyploid FALSE
## 5        FALSE      del(6q)      NEG           <NA> NEG dyploid FALSE
## 10       FALSE     del(p15)  BCR/ABL           p190 NEG    <NA> FALSE
## 11       FALSE del(p15/p16)  BCR/ABL      p190/p210 NEG dyploid FALSE
##    relapse transplant               f.u date.last.seen
## 1    FALSE       TRUE BMT / DEATH IN CR           <NA>
## 3     TRUE      FALSE               REL     10/15/1999
## 4     TRUE      FALSE               REL      1/23/1998
## 5     TRUE      FALSE               REL      11/4/1997
## 10    TRUE      FALSE               REL      7/11/1997
## 11   FALSE       TRUE BMT / DEATH IN CR           <NA>
``````

It seems from below that there are 17 females over 40 in the data set, but when sub-setting `pdata` to contain just those individuals 19 rows are selected. Why? What can we do to correct this?

``````idx <- pdata\$sex == "F" & pdata\$age > 40
table(idx)
``````
``````## idx
## FALSE  TRUE
##   108    17
``````
``````dim(pdata[idx,])
``````
``````## [1] 19 21
``````

Use the `mol.biol` column to subset the data to contain just individuals with 'BCR/ABL' or 'NEG', e.g.,

``````bcrabl <- pdata[pdata\$mol.biol %in% c("BCR/ABL", "NEG"),]
``````

The `mol.biol` column is a factor, and retains all levels even after subsetting. How might you drop the unused factor levels?

``````bcrabl\$mol.biol <- factor(bcrabl\$mol.biol)
``````

The `BT` column is a factor describing B- and T-cell subtypes

``````levels(bcrabl\$BT)
``````
``````##  [1] "B"  "B1" "B2" "B3" "B4" "T"  "T1" "T2" "T3" "T4"
``````

How might one collapse B1, B2, … to a single type B, and likewise for T1, T2, …, so there are only two subtypes, B and T

``````table(bcrabl\$BT)
``````
``````##
##  B B1 B2 B3 B4  T T1 T2 T3 T4
##  4  9 35 22  9  4  1 15  9  2
``````
``````levels(bcrabl\$BT) <- substring(levels(bcrabl\$BT), 1, 1)
table(bcrabl\$BT)
``````
``````##
##  B  T
## 79 31
``````

Use `xtabs()` (cross-tabulation) to count the number of samples with B- and T-cell types in each of the BCR/ABL and NEG groups

``````xtabs(~ BT + mol.biol, bcrabl)
``````
``````##    mol.biol
## BT  BCR/ABL NEG
##   B      37  42
##   T       0  31
``````

Use `aggregate()` to calculate the average age of males and females in the BCR/ABL and NEG treatment groups.

``````aggregate(age ~ mol.biol + sex, bcrabl, mean)
``````
``````##   mol.biol sex      age
## 1  BCR/ABL   F 39.93750
## 2      NEG   F 30.42105
## 3  BCR/ABL   M 40.50000
## 4      NEG   M 27.21154
``````

Use `t.test()` to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using `boxplot()`. In both cases, use the `formula` interface. Consult the help page `?t.test` and re-do the test assuming that variance of ages in the two groups is identical. What parts of the test output change?

``````t.test(age ~ mol.biol, bcrabl)
``````
``````##
##  Welch Two Sample t-test
##
## data:  age by mol.biol
## t = 4.8172, df = 68.529, p-value = 8.401e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.13507 17.22408
## sample estimates:
## mean in group BCR/ABL     mean in group NEG
##              40.25000              28.07042
``````
``````boxplot(age ~ mol.biol, bcrabl)
``````

## Resources

Publications (General R)