forked from b-rodrigues/rap4all
-
Notifications
You must be signed in to change notification settings - Fork 0
/
preface.qmd
241 lines (209 loc) · 13.7 KB
/
preface.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
# Preface {.unnumbered}
In the summer of 2022, a former colleague from my first job asked me if I wanted
to help him teach a class at the University of Luxembourg. It was a class for
the Master's of Data Science, and the class was supposed to be taught by
non-academics like us. The idea was to teach the students some "real-world"
skills from the industry. It was a 40 hours class, and naturally we split them
equally between us; my colleague focused on time series statistics but I really
didn't know what I should do. I knew I wanted to teach, I always liked teaching,
but I am a public servant in the ministry of higher education and research in
Luxembourg. I still code a lot, but I don't do exciting machine learning
anymore, or advanced econometrics like my colleague. Before (re)joining the
public service I was a senior data scientist and then manager in one of the big
four accounting firms. Before that, and this is where my colleague and I met, I
was a research assistant in the research department of the national statistical
institute of statistics in Luxembourg, and my colleague is still an applied
researcher there.
What could I teach these students? What "skills from the industry" could I
possibly share with them? I am an expert in nothing in particular. Actually, I
don’t really know anything very deeply, but know at least a little about many
different things. There are many self-help books out there that state that it’s
better to know a lot about only a few, maybe even only one, topic, than know
a lot about many topics. I tend to disagree with this; at least in my
experience, knowing enough about many different topics always allowed me to
communicate effectively with many different people, from researchers focusing on
very specific topics that needed my help to assist them in their research, to
clients from a wide range of industries that were sharing their problems with me
in my consulting years. If I needed to deepen my knowledge on a particular topic
before I could intervene, I had the necessary theoretical background to grab a
few books and learn the material. Also, I was never afraid of asking questions.
This is reflected in my blogging. As I’m writing these lines (beginning of
2023), I have been blogging for about ten years. Most of my blog posts are me
trying to lay out a problem I had at work and how I solved it. Sometimes I do
some things for pleasure or curiosity, like the [two posts on the video game
nethack](https://www.brodrigues.co/blog/2018-11-03-nethack_analysis/), or the
ones [on 19th century
newspapers](https://www.brodrigues.co/blog/2019-01-04-newspapers/) where I
learned a lot about NLP. Because I was lucky enough to work with different
people from many backgrounds, I always had to solve a very wide range of
problems.
But that still didn’t really help me to find a topic to teach... but then it
dawned on me. Even though in my career I had to help many different people with
many different backgrounds and needs, there were two things that everyone always
required: traceability and reliability.
Everyone wanted to know how I came to the conclusions that I came to, and most
of them even wanted to be able to reproduce my steps as a form of double
checking what I did (consultants are expensive, so you better make sure that
they’re worth their hourly rate!). When I was a manager, I applied the same logic
to my teammates. I wanted to be able to understand what they were doing, or at
least know that if I needed to review their work deeply, the possibility was there.
So what I had to teach these students of data science was some best practices in
software engineering. Most people working with data don’t get taught software
engineering skills. Courses focus on probability theory, linear algebra,
algorithms, and programming but not software engineering. That's because
software engineering skills get taught to software engineers. But while
statisticians, data scientists, (or whatever we get called these days), are not
software engineers, they do write a lot of code. And code that is quite
important at that. And yet, most of us do work like pigs (no disrespect to
pigs).
For example, how much of the code you write that produces very sensitive and
important results, be it in science or in industry, is thoroughly tested? How
much of the code you use relies on a single person showing up for work and using
some secret knowledge that is undocumented? What if that person ends up under a
bus? How much code do you run that no one dares touch anymore because that one
person from before did end up under a bus?
How many people do you have to ping when you need to get an update to a
quarterly report? How many people do you have to ping to know how Table 3 from
that report from 2020 that was quickly put together during the Covid-19
lockdowns was computed? Are all the people involved even working in your
company still?
When collaborating with teammates to write a report or scientific paper, do you
consider potential risks? (If you're wondering *What risks?* then you’re definitely
not considering them.)
Are you able to tell anyone, *exactly*, how that number that gets used by the
CEO in that one report was made? What if there's an investigation, or some
external audit? Would the auditors be able to run the code and understand what
is going on with as little intervention as possible (ideally none) from you?
*But I don't work in an industry that gets audited*, you may think. Well, maybe
not, or maybe one day your work will get audited anyways. Maybe it'll get
audited internally for whatever reason. Maybe there's a new law that went into
force that requires your work, or parts of your work, to be easily traceable.
And if you're a scientist, your work does get audited, or at least it should be
in theory. I don't know any scientist (and I know more scientists than the
average person, thanks to my background and current job) that is against the
idea of open science, open data, reproducibility, and so on. Not one. But in
practice, how many papers are truly reproducible? How many scientific results
are auditable and traceable?
Lack of traceability and reproducibility can sometimes lead to serious
consequences. If you’re in the social sciences, you likely know about the
*Reinhart and Rogoff* paper. Reinhard and Rogoff are two American economists
that published a paper in 2010 that showed that when countries are too much in
debt (over 60% of GDP according to the authors) then annual growth decreases by
two percent. These papers provided an empirical justification for austerity
measures in the aftermath of the 2009 European debt crisis. But there was a
serious problem with the Reinhard and Rogoff paper. It’s not that they somehow
didn’t use the *correct* theoretical framework or modelling procedure in their
paper. It’s not that their assumptions were disputable or too unrealistic. It’s
that they performed their calculations inside an Excel spreadsheet and did not,
and this is not a joke, they did not select every country’s real GDP growth to
compute the average real GDP growth for high-debt countries:
::: {.content-hidden when-format="pdf"}
<figure>
<img src="images/rr_excel.png"
alt="You can see that not all countries are selected..."></img>
<figcaption>You can see that not all countries are selected...</figcaption>
</figure>
:::
::: {.content-visible when-format="pdf"}
```{r, echo = F, out.width="300px"}
#| fig-cap: "You can see that not all countries are selected..."
knitr::include_graphics("images/rr_excel.png")
```
:::
(source to image, [archived link](https://archive.is/DTGpC)^[https://archive.is/DTGpC])
And this is not the only problem with this paper.
The problem is not that this mistake was made. Reinhard and Rogoff are only
human and mistakes can happen. What’s problematic is that this was picked up and
corrected too late. In an ideal world, Reinhard and Rogoff would not have used
tools that make mistakes like this almost impossible to find once they’re made.
Instead, they would have used tools that would have made such a thing not happen
in the first place, or, as a second best, making it easier and faster for someone
else to find this mistake. And this is not something that is only useful in
research, but also in any industry. Being able to trust results, tracing back
calculations and auditing are not only concerns of researchers.
So this is what I decided to teach the students: how they could structure their
projects in such a way that they could spot problems like that during
development, but also make it easy to reproduce and retrace who did what and
when. I wrote my course notes into a [freely available
bookdown](https://rap4mads.eu/) that I used for teaching. When I started
compiling my notes, I discovered the concept *Reproducible Analytical Pipelines*
as developed by the [Office for National
Statistics](https://analysisfunction.civilservice.gov.uk/support/reproducible-analytical-pipelines/)
(henceforth ONS). I found the name "Reproducible Analytical Pipeline"
(henceforth RAP) really perfect for what I was aiming at. The ONS team
responsible for evangelising RAPs also published a free
[ebook](https://ukgovdatascience.github.io/rap_companion/) in 2019 already.
Another big source of inspiration is [Software
Carpentry](https://software-carpentry.org/) to which I was exposed during my PhD
years, around 2014-ish if memory serves. While working on a project with some
German colleagues from the University of Bonn, the principal investigator made
us work using these concepts to manage the project. I was really impressed by
it, and these ideas and techniques stayed with me since then.
The bottom line is: the ideas I’m presenting here are nothing new. It’s just
that I took some time to compile them and make them accessible and interesting
(at least I hope so) for users of the R programming language.
At least my students found the course interesting. But not just students. I
tweeted about this course and shared the notes with a wider audience, and this
is when I got very positive feedback from people that were not my students.
People wanted to buy this as a book and go deeper into the topics laid out. This
is when I realised that, as far as I know, there is not a practical book
available discussing these topics. So I decided to write one, but I took my time
getting started. What finally, really, got me working on it was when [Dmytro
Perepolkin](https://github.com/dmi3kno/) reached out to me and suggested I
contact several persons to get their inputs and ideas and get started. I
followed his advice, and this led to very fruitful discussions with [Sébastien
Rochette](https://github.com/statnmap), [Miles
McBain](https://github.com/MilesMcBain) and Dmytro. Their ideas and inputs
definitely improved the quality of this book, so many thanks to them. Also
thanks to [David Solito](https://twitter.com/dsolito), [Matan
Hakim](https://github.com/matanhakim), [Stas
Kolenikov](https://github.com/skolenik), [Sam
Parmar](https://github.com/parmsam), [Chuck](https://github.com/chorgan182),
[Matouš Eibich](https://github.com/MatousEibich), [Jonathan
Moore](https://github.com/jonathandmoore), [Alain
Vagner](https://github.com/AlainVagner) and [Matthias Meiksner](https://github.com/IZE85)
for proofreading the book and providing valuable feedback and fixes. And thank you,
dear reader, for picking this up!
This book is divided into two parts. The first part teaches you what I believe
is essential knowledge you should possess in order to write truly reproducible
pipelines. This essential knowledge is constituted of:
- Version control with Git and how to manage projects with Github;
- Functional programming;
- Literate programming.
The main idea from part 1 is "don’t repeat yourself". Git and Github will help
us avoid losing code, and losing track of who should do what in a project (even
if you’re working alone on a project, you will see that using Git and Github
will save you many hours and headaches). Getting familiar with functional and
literate programming should improve the quality of our code by avoiding two
common sources of mistakes: computing results that rely on the state of our
program (and later, the state of the whole hardware we are using) and copy and
paste mistakes.
The second part of the book will then build upon this knowledge to introduce
several tools that will help us go beyond the benefits of version control and
functional and literate programming:
- Dependency management with `{renv}`;
- Package development with `{fusen}`;
- Unit and assertive testing;
- Build automation with `{targets}`;
- Reproducible environments with Docker;
- Continuous integration and delivery.
While this is not a book for beginners (you really should be familiar with R
before reading this), I will not assume that you have any knowledge of the tools
presented in part 2. In fact, even if you’re already familiar with Git, Github,
functional programming and literate programming, I think that you will still
learn something useful from reading part 1. But be warned, this book will
require you to take the time to read it, and then type on your computer. Type *a
lot*.
I hope that you will enjoy reading this book and applying the ideas in your
day-to-day, ideas which hopefully should improve the reliability, traceability
and reproducibility of your code. You can read this book for free on
[https://raps-with-r.dev/](https://raps-with-r.dev/), or if you want you can buy
a DRM-free PDF or Epub over at
[https://leanpub.com/raps-with-r](https://leanpub.com/raps-with-r).
::: {.content-hidden when-format="pdf"}
You can also buy a physical copy of the book on [Amazon](https://www.amazon.com/dp/B0C87H6MGF/).
:::
If you want to get to know me better, read my
[bio](https://www.brodrigues.co/about/me/)^[https://www.brodrigues.co/about/me/].
If you have feedback, drop me an email at bruno [at] brodrigues [dot] co.
Enjoy!