forked from b-rodrigues/rap4all
-
Notifications
You must be signed in to change notification settings - Fork 0
/
prerequisites.qmd
133 lines (116 loc) · 8.37 KB
/
prerequisites.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
# Before we start
This is not an introductory book, so before tackling the topics presented here,
make sure that you are familiar with the different topics presented below. If
you read this chapter and everything is obvious or known to you, then you should
have no trouble following along. If instead what you read here is cryptic, then
take some time to improve your understanding of these topics first.
## Essential knowledge
It’s important to know the parts that constitute R. Let’s make something clear:
R is not RStudio, or whatever interface you are using to interact with R. R is a
domain-specific interpreted programming language. R is domain-specific because
its primary use is in performing statistics. Interpreted, because results get
returned immediately when you execute a script in the console. In other words,
when you write `1+1` in the console, you get back `2` immediately. There are
programming languages, called *compiled* programming languages, that
require code to be compiled into binaries before execution. C is such a
language. The fact that R is interpreted makes interactive exploratory data
analysis easy, but also introduces certain negative aspects. I will discuss
these in detail in the book. R’s console is an example of a
REPL -- *Read-Eval-Print-Loop* -- environment. Code gets *read*, *evaluated*,
*printed* and the read state gets returned, starting the *loop* over.
To make working with R easier, you should not write code in the console and
execute it, but instead write it in a text file. You can keep these text files,
update and share them with collaborators. Such text files are called scripts.
You *could* write these scripts using the most basic text editor included in
your operating system (that would be Notepad.exe on Windows for example), but
you should instead use a text editor made specifically to make programming
easier. Popular choices among R users include RStudio, Visual Studio Code, or
maybe something more exotic like Emacs combined with ESS (my personal choice).
Whatever text editor you choose, take time to configure it and learn how to use
it. You will spend many, many, many hours inside that text editor. The code you
write in that text editor is what’s going to feed you and your family. Learn
your chosen text editor’s keyboard shortcuts and other advanced features. This
initial investment will pay for itself many times over. Also, you need to know
what an actual text file is. A document written in Word (with the `.docx`
extension) is not a text file. It looks like text, but is not. The `.docx`
format is a much more complex format with many layers of abstraction. "True"
plain text files can be opened with the simplest text editor included in your
operating system. I’ve had students trying to create text files with word
processors like MS Word and then being confused when things would not work.
As stated before, R is a domain-specific programming language mainly used for
doing statistics, or whatever modernized term you may prefer like *"data
science"*. Its base capabilities can be extended by installing packages. For
example, a base installation of R provides you with useful functions like
`mean()` or `sd()`, to compute the average or standard deviation of a vector of
numbers, or `rnorm()` to compute random variates from a Gaussian (Normal)
distribution. However, there is no function available to train a random forest.
If you need to train a random forest you need to install a package using the
`install.packages("randomForest")` command. This installs the `{randomForest}`
package (in the rest of the book, I will surround package names with curly
braces). The collection of packages installed is called a "library". Packages
get downloaded from CRAN, the *Comprehensive R Archive Network*. There is no
doubt in my mind that the reason R became so popular is because it is quite easy
to write packages for it; and this is something that we will learn as well! Some
packages are written with other programming languages, very often Fortran or
C++. The code included in these packages is then compiled and can be executed by
R using a user-facing function. For example, if you dig into the source code of
the `{randomForest}` package, you will find C and Fortran code. This is
important to know, because sometimes R packages need to be compiled by
`install.packages()`, and this compilation can sometimes fail (especially on
Linux, but more on that later in the book).
When you use R, you will load data sets, create plots, train models, etc. These
data sets, plots, models, are all *objects* and they get saved in the global
environment. To see a list of objects currently available in the global
environment, type `ls()` in the R console. When you quit R, you get asked to
save the *workspace*: this will save the current state of the global environment
and load it next time you start R. I highly recommend for you to not save the
workspace. If you are using RStudio you can change this behaviour in the global
options (under *Workspace*, set *Save workspace to .RData on exit* to *Never*).
Other editors might have a similar option. Saving and loading the workspace
makes it impossible to start with a fresh R session (unless you start R with the
`--vanilla` flag), which can cause issues that are difficult to pinpoint.
You should also be comfortable with paths and your computer’s file system.
*Comfortable* means having no problems finding where a file gets downloaded for
example, or being able to navigate to any folder, either through a GUI file
browser or through a terminal (if you’re familiar with navigating your computer
using the terminal, you will have an easier time with this book than if you
didn’t). I also highly recommend that you strive to use relative paths in your
scripts, and not absolute paths. In other words: don’t start your scripts with a
line such as:
```{r, eval = F}
setwd("H:/Username/Projects/housing_regression/")
```
but instead, use "Projects" if you’re using RStudio, or similar features from
your preferred IDE. This way, you can use relative paths instead. This makes
collaboration much easier. Using "Projects" in RStudio, if you need to load
data, you can simply write:
```{r, eval = F}
dataset <- read.csv("data.csv")
```
and don’t need to set working directories using `setwd()`, which obviously will
not exist on your collaborators computer.
There is also the `{here}` package that makes using relative paths
easier, but I won’t discuss it in this book. If you’re interested
you can read this
[post](https://github.com/jennybc/here_here)^[https://github.com/jennybc/here_here].
You should be familiar with writing functions. This book has a whole chapter on
functional programming, and I will teach you how to write functions, but if
you’re already familiar with this, then it will make going through that chapter
easier.
Finally, you should know how to ask for help. If you need help with this book,
feel free to open an issue on the book’s Github repo
[here](https://github.com/b-rodrigues/rap4all)^[https://github.com/b-rodrigues/rap4all],
or open a thread on the book’s Leanpub forum (if you bought a copy) over
[here](https://community.leanpub.com/c/raps-with-r/)^[https://community.leanpub.com/c/raps-with-r/].
Just like for this book, if you have an issue with an R package, look for its
repository: many packages’ source code is hosted on Github (but not always). You can
also try to reach out to the author, or open a thread on Stackoverflow. Whatever
you do, make sure that you do your homework first:
- Read the documentation. Maybe you’re using the tool wrong.
- Take note of the error message. Error messages can be cryptic sometimes, but as you gain in experience, you will learn to decrypt them.
- Write down the simplest script possible that reproduces the issue you're facing. This is called an MRE, "Minimal Reproducible Example". If you need to open a thread asking for help, post this MRE, this will make helping you much easier. For general advice on how to write an MRE, you can read this [classic blog post](https://jonskeet.uk/csharp/complete.html)^[https://jonskeet.uk/csharp/complete.html].
Finally, keep in mind the following saying from my father, a mason (the ones
that lay bricks, not the ones meeting in secrecy to govern the world):
> The tools are always right. If you’re using a tool and it’s not behaving as
> expected, it is much more likely that your expectations are wrong.
> Take this opportunity to review your knowledge of the tool.