-
Notifications
You must be signed in to change notification settings - Fork 1
/
Lab9.Rmd
170 lines (129 loc) · 4.28 KB
/
Lab9.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
---
title: "Assignment 3"
author: Cynthia Begay
output: github_document
---
---
title: "Lab 9 - HPC"
output: tufte::tufte_html
link-citations: yes
---
# Learning goals
In this lab, you are expected to learn/put in practice the following skills:
- Evaluate whether a problem can be parallelized or not.
- Practice with the parallel package.
- Use Rscript to submit jobs
- Practice your skills with Git.
## Problem 1: Think
Give yourself a few minutes to think about what you just learned. List three
examples of problems that you believe may be solved using parallel computing,
and check for packages on the HPC CRAN task view that may be related to it.
-Cross-validation in machine learning: 'carat', 'mlr'.
-Bootstrapping: 'boot'
-Monte Carlo Simulation:
## Problem 2: Before you
The following functions can be written to be more efficient without using
parallel:
1. This function generates a `n x k` dataset with all its entries distributed
poission with mean `lambda`.
```{r p2-fun1, eval = TRUE}
fun1 <- function(n = 100, k = 4, lambda = 4) {
x <- NULL
for (i in 1:n)
x <- rbind(x, rpois(k, lambda))
x
}
fun1alt <- function(n = 100, k = 4, lambda = 4) {
matrix(rpois(n * k, lambda = lambda), ncol = k)
}
# Benchmarking
microbenchmark::microbenchmark(
fun1(),
fun1alt(), unit = "relative"
)
```
2. Find the column max (hint: Checkout the function `max.col()`).
```{r p2-fun2, eval = TRUE}
# Data Generating Process (10 x 10,000 matrix)
set.seed(1234)
x <- matrix(rnorm(1e4), nrow=10)
# Find each column's max value
fun2 <- function(x) {
apply(x, 2, max)
}
fun2alt <- function(x) {
x[cbind(max.col(t(x)),1:ncol(x))]
}
# Benchmarking
microbenchmark::microbenchmark(
fun2(x),
fun2alt(x), unit = "relative"
)
```
## Problem 3: Parallelize everyhing
We will now turn our attention to non-parametric
[bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)).
Among its many uses, non-parametric bootstrapping allow us to obtain confidence
intervals for parameter estimates without relying on parametric assumptions.
The main assumption is that we can approximate many experiments by resampling
observations from our original dataset, which reflects the population.
This function implements the non-parametric bootstrap:
```{r p3-boot-fun, eval = TRUE}
library(parallel)
my_boot <- function(dat, stat, R, ncpus = 1L) {
# Getting the random indices
n <- nrow(dat)
idx <- matrix(sample.int(n, n*R, TRUE), nrow=n, ncol=R)
# Making the cluster using `ncpus`
# STEP 1: Make a cluster
cl <- makePSOCKcluster(ncpus)
# STEP 2: Set it up (export data if needed), idk, dt
clusterExport(cl, varlist = c("idx", "dat", "stat"), envir = environment())
# STEP 3: THIS FUNCTION NEEDS TO BE REPLACES WITH parLapply
ans <- parLapply(seq_len(R), function(i) {
stat(dat[idx[,i], , drop=FALSE])
})
# Coercing the list into a matrix
ans <- do.call(rbind, ans)
# STEP 4: Stop the cluster
stopCluster(cl)
ans
}
```
1. Use the previous pseudocode, and make it work with parallel. Here is just an example
for you to try:
```{r p3-test-boot, eval = FALSE}
# Bootstrap of an OLS
my_stat <- function(d) coef(lm(y ~ x, data=d))
# DATA SIM
set.seed(1)
n <- 500
R <- 1e4
x <- cbind(rnorm(n))
y <- x*5 + rnorm(n)
# Checking if we get something similar as lm
ans0 <- confint(lm(y~x))
ans1 <- my_boot(dat = data.frame(x, y), my_stat, R = R, ncpus = 2L)
# You should get something like this
t(apply(ans1, 2, quantile, c(.025,.975)))
## 2.5% 97.5%
## (Intercept) -0.1372435 0.05074397
## x 4.8680977 5.04539763
ans0
## 2.5 % 97.5 %
## (Intercept) -0.1379033 0.04797344
## x 4.8650100 5.04883353
```
2. Check whether your version actually goes faster than the non-parallel version:
```{r benchmark-problem3, eval = FALSE}
system.time(my_boot(dat = data.frame(x, y), my_stat, R = 4000, ncpus = 1L))
system.time(my_boot(dat = data.frame(x, y), my_stat, R = 4000, ncpus = 2L))
```
## Problem 4: Compile this markdown document using Rscript
Once you have saved this Rmd file, try running the following command
in your terminal:
```bash
Rscript --vanilla -e 'rmarkdown::render("[full-path-to-your-Rmd-file.Rmd]")' &
```
Where `[full-path-to-your-Rmd-file.Rmd]` should be replace with the full path to
your Rmd file... :).