layout | title | subtitle | minutes |
---|---|---|---|
page | R for reproducible scientific analysis | Project management with RStudio | 20 |
- To create self-contained projects in RStudio
- To use git from within RStudio
The scientific process is naturally incremental and many projects start life as random notes, some code, then a manuscript, and eventually everything is a bit mixed together.
A good project layout will make your life easier by:
- ensuring the integrity of your data;
- making it simpler to share the code with peers;
- allowing you to easily upload your code with your manuscript submission;
- enabling you to pick the project up again after a break hassle-free.
Fortunately, there are tools and packages which can help you manage your work effectively.
One of the most powerful and useful aspects of RStudio is the project management functionality. We'll be using this today to create a self-contained, reproducible project.
We're going to create a new project in RStudio:
- Click the "File" menu button, then "New Project".
- Click "New Directory".
- Click "Empty Project".
- Type in the name of the directory to store your project, e.g. "test_project".
- Make sure that the checkbox for "Create a git repository" is selected.
- Click the "Create Project" button.
Now when we start R in this project directory, or open this project with RStudio, all of our work on this project will be entirely self-contained in this directory.
We also set up our project to integrate with git, putting it under version control. RStudio has an interface to git, but it is very limited in what it can do. Let's make an initial commit of our template files.
The top right panel in RStudio has a tab for "Git". When files are not yet tracked by git these are marked by yellow question marks. Now our project has two items untracked:
- .gitignore (automatically generated by RStudio when it sets up the git repository)
- test_project.Rproj (automatically generated by RStudio)
Stage these two files by selecting them in the Git tab, then press "Commit". In the pop-up window, type in the "Comment" box: "Adding .gitignore and test_project.Rproj files". Press "Commit", then "Close" on the pop-up window, and then close the last pop-up window too.
Although there is no "best" way to lay out a project, there are some general principles to adhere to that will make project management easier:
Treat data as read-only. This is probably the most important goal of setting up a project. Data is typically time-consuming and/or expensive to collect. Working with it interactively (e.g., in Excel), where it can be modified, means you are never sure where the data came from or how it has been modified since collection. It is therefore a good idea to treat your data as "read-only".
In many cases your data will be "dirty": it will need significant preprocessing to get it into a format that R (or any other programming language) can work with. It is often useful to store the preprocessing scripts in a separate folder, and to create a second "read-only" data folder to hold the "processed" data sets.
Anything generated by your scripts should be treated as disposable: you should be able to regenerate all of it from your scripts.
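One way to back up the "read-only" principle above is to remove write permission from a raw data file once it has been collected. The following is only a sketch: the file is a hypothetical example created in a temporary location, and it assumes Unix-style file permissions (on Windows, `Sys.chmod` behaves differently).

```r
# Hypothetical raw data file, written to a temporary location for illustration.
raw_file <- tempfile("example-raw-", fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)),
          raw_file, row.names = FALSE)

# Remove write permission for everyone (read-only).
# use_umask = FALSE applies the mode exactly as given.
Sys.chmod(raw_file, mode = "0444", use_umask = FALSE)

# Inspect the resulting permission bits.
file.info(raw_file)$mode
```

The file can still be read by your scripts, but accidental edits now require deliberately restoring write permission first.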
The first step for our project is to create a new folder, called data, and store the raw data files in it.
To create a new folder:
- In the bottom right panel, click the "New Folder" button.
- In the pop-up window, type the name of the new folder.
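The same folder can also be created from the R console instead of the RStudio panel. This is a minimal sketch, assuming the working directory is the project directory:

```r
# Create the "data" folder in the current working directory,
# unless it already exists.
if (!dir.exists("data")) {
  dir.create("data")
}

# Confirm that the folder is there.
dir.exists("data")
```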
To download the gapminder data, you would typically have used the wget command in the shell. In that case the command would be:

wget http://tinyurl.com/gapminder-FiveYearData-csv

To run a shell command in R we can use the system function. Look at the arguments of the function in its help page and then:
- Use the R function system to run the shell command wget with the link above. To save the file under the recently created data folder, add the flag --output-document=data/gapminder-FiveYearData.csv to the wget command above.
- Click on the data folder (bottom right panel) to check whether a new file exists.
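One possible way to put the steps above together is sketched below. It assumes that wget is installed and on the shell's search path, and that the working directory is the project directory; the names `wget_command` and `fetch_gapminder` are made up for this illustration.

```r
gapminder_url  <- "http://tinyurl.com/gapminder-FiveYearData-csv"
gapminder_file <- file.path("data", "gapminder-FiveYearData.csv")

# Build the shell command: wget <url> --output-document=<file>
wget_command <- paste0("wget ", gapminder_url,
                       " --output-document=", gapminder_file)

# Hypothetical helper: make sure the data folder exists, then run wget.
# system() returns the exit status of the command (0 means success).
fetch_gapminder <- function() {
  dir.create("data", showWarnings = FALSE)
  system(wget_command)
}
```

Calling `fetch_gapminder()` runs the download; afterwards the new file should appear in the data folder.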
We will load, inspect and analyse these data later.
Check the "Git" tab for any changes. What do you see?
Generally you do not want to version disposable output or read-only data. You should modify the .gitignore file to tell git to ignore these files and directories.
- Modify the .gitignore file to contain data/* so that the contents of the data folder are not versioned. To do so, click on the file in the bottom right panel and add a line at the end. Save the changes.
- Stage and commit the .gitignore file using the RStudio git interface.
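The edit to .gitignore can also be made from the R console rather than by hand. The sketch below assumes the working directory is the project directory; it adds the data/* rule only if it is not already listed, so running it twice does no harm.

```r
ignore_file <- ".gitignore"
pattern <- "data/*"

# Read the existing rules, if the file is already there.
existing <- if (file.exists(ignore_file)) readLines(ignore_file) else character(0)

# Add the pattern only when it is missing, then write the file back.
if (!(pattern %in% existing)) {
  writeLines(c(existing, pattern), ignore_file)
}

# Show the current contents of the file.
readLines(ignore_file)
```

Staging and committing the changed file still happens in the "Git" tab as described above.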