Contents

1 Inspiration: correlated coin tosses

Wolfgang mentioned on Monday that observing 50 heads and then 50 tails would not be evidence of a ‘fair’ coin, even though the observed number of head (50) was exactly that expected of a fair coin.

I wonder, how do you generate correlated coin tosses?

Criteria (most to least important)

  1. Correct
  2. Understandable
  3. Robust
  4. Fast

2 First approach

2.1 Implementation

Suppose we represent heads as ‘0’ and tails as ‘1’. Let p be the probability that a coin that is currently a head is tossed and remains a head, and similarly for tails. If p == 0.5, then the tosses would seem to be uncorrelated. We implement this idea as a simple function

f1 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    outcome <- NULL
    for (i in 1:n) {
        if (runif(1) > p)
            current <- (current + 1) %% 2
        outcome <- c(outcome, current)
    }
    outcome
}

2.2 Assessment

Is it correct?

f1(10, .5)
##  [1] 1 1 0 0 0 0 0 0 1 0
table(f1(1000, .5))
## 
##   0   1 
## 496 504
res <- f1(1000, .9)
mean(rle(res)$length) # expectation: 1 / (1 - p)
## [1] 10.75269

Is it understandable? Pretty much

Is it robust? We’ll come back to this.

Is it fast? Seems so initially, but it scales poorly!

system.time(f1(5000, .5))
##    user  system elapsed 
##   0.141   0.019   0.161
system.time(f1(10000, .5))
##    user  system elapsed 
##   0.455   0.047   0.503
system.time(f1(20000, .5))
##    user  system elapsed 
##   1.796   0.261   2.062

Doubling the problem size (n) causes a 4-fold increase in execution time.

What’s the problem? c(outcome, current) causes outcome to be copied each time through the loop!

3 Second approach

3.1 Implementation

Pre-allocate and fill the outcome vector; outome updated in place, without copying.

f2 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    outcome <- numeric(n)
    for (i in 1:n) {
        if (runif(1) > p)
            current <- (current + 1) %% 2
        outcome[i] <- current
    }
    outcome
}

3.2 Assessment

Is it correct?

set.seed(123)
res1 <- f1(100, .9)
set.seed(123)
res2 <- f2(100, .9)
identical(res1, res2)
## [1] TRUE

Is it understandable? As understandable as f1().

Is it robust? In a minute…

Is it fast?

system.time(f2(5000, .5))
##    user  system elapsed 
##   0.029   0.000   0.029
system.time(f2(10000, .5))
##    user  system elapsed 
##   0.055   0.001   0.057
system.time(f2(20000, .5))
##    user  system elapsed 
##   0.106   0.003   0.111

f2() seems to scale linearly, so performs increasingly well compared to f1().

4 Third approach

4.1 Implementation

Note that runif() is called n times, but the program could be modified, and still be understandable, if it were called just once – hoist runif(1) > p out of the loop.

f3 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    outcome <- numeric(n)
    test <- runif(n) > p
    for (i in 1:n) {
        if (test[i])
            current <- (current + 1) %% 2
        outcome[i] <- current
    }
    outcome
}

4.2 Assessment

  1. Correct?
set.seed(123)
res1 <- f1(100, .9)
set.seed(123)
res3 <- f3(100, .9)
identical(res1, res3)
## [1] TRUE
  1. Understandable? Yes.

  2. Robust? None of them have been, and f3() really isn’t!

set.seed(123)
f1(0, .5)
## [1] 1 1
set.seed(123)
try(f3(0, .5))
## Error in if (test[i]) current <- (current + 1)%%2 : 
##   missing value where TRUE/FALSE needed
  1. Fast? Yes, about 10 times faster than f3()
n <- 100000
system.time(f2(n, .5))
##    user  system elapsed 
##   0.551   0.033   0.587
system.time(f3(n, .5))
##    user  system elapsed 
##   0.043   0.001   0.044

… with linear scaling.

system.time(f3(n * 10, .5))
##    user  system elapsed 
##   0.424   0.007   0.434

5 Fourth approach

5.1 Implementation

The problem is that 1:n is not robust, especially 1:0 generates the sequence c(1, 0), whereas we were expecting a zero-length sequence!

Solution: use seq_len(n)

lapply(3:1, seq_len)
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] 1 2
## 
## [[3]]
## [1] 1
f4 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    outcome <- numeric(n)
    test <- runif(n) > p
    for (i in seq_len(n)) {
        if (test[i])
            current <- (current + 1) %% 2
        outcome[i] <- current
    }
    outcome
}

5.2 Assessment

  1. Correct? Yes

  2. Understandable? Yes

  3. Robust? Yes

set.seed(123)
f4(3, .5)
## [1] 1 1 0
f4(2, .5)
## [1] 1 0
f4(1, .5)
## [1] 0
f4(0, .5)
## numeric(0)
  1. Fast? Yes

6 Fifth approach

6.1 Implementation

Use cumsum() (cummulative sum) to produce sequential groups that have the same head or tail status. Use %% on the cummulative sum to translate those groups into heads (cummsum() %% 2 == 0 or tails (cummsum() %% 2 == 1).

f5 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    test <- runif(n) > p
    cumsum(current + test) %% 2
}

6.2 Assessment

  1. Correct?
set.seed(123); res1 <- f1(10, .8)
set.seed(123); res5 <- f5(10, .8)
identical(res1, res5)
## [1] TRUE
  1. Understandable? Harder to understand…

  2. Robust? Yes

f5(0, .8)
## numeric(0)
  1. Fast?
n <- 1000000
system.time(f4(n, .5))
##    user  system elapsed 
##   0.416   0.002   0.420
system.time(f5(n, .5))
##    user  system elapsed 
##   0.087   0.002   0.088
system.time(f5(n * 10, .5))
##    user  system elapsed 
##   1.086   0.063   1.151

About 4x faster than f4(), scales linearly, fast even for very large n.

Could be used to generate a large data set for developing methods about correlated samples, along the lines of

correlated_tosses_expts <- function(m, n, p) {
    ## m tosses (rows) in each of n experiments
    start0 <- rep(rbinom(m, 1, .5), each = m)
    group0 <- cumsum(runif(m * n) > p)
    toss <- (start0 + group0) %% 2
    matrix(toss, m)
}
system.time({
    expt <- correlated_tosses_expts(1000, 10000, .8)
})
##    user  system elapsed 
##   0.977   0.058   1.035
hist(colSums(expt))

7 XXX Approach

Probably we have reached the point of diminishing gains, we’ve already spent far more time developing f5() than we’ll ever save by further investigation… However,

Other tools

8 Session info

sessionInfo()
## R version 3.6.1 Patched (2019-07-16 r76845)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Users/ma38727/bin/R-3-6-branch/lib/libRblas.dylib
## LAPACK: /Users/ma38727/bin/R-3-6-branch/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocStyle_2.13.2
## 
## loaded via a namespace (and not attached):
##  [1] BiocManager_1.30.4 compiler_3.6.1     magrittr_1.5      
##  [4] bookdown_0.12      htmltools_0.3.6    tools_3.6.1       
##  [7] yaml_2.2.0         Rcpp_1.0.1         codetools_0.2-16  
## [10] stringi_1.4.3      rmarkdown_1.14     knitr_1.23        
## [13] stringr_1.4.0      digest_0.6.20      xfun_0.8          
## [16] evaluate_0.14