---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
set.seed(42)
```
# rbackupr
<!-- badges: start -->
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
<!-- badges: end -->
The goal of rbackupr is to facilitate online backups to Google Drive from R. Key features:

- it requires only limited permissions (access to files created by this app, not to all of your Google Drive)
- it caches metadata locally for speedy updates
- it allows for easy management of multiple backups based on the idea of "projects" located under a top-level folder
- the top-level folder on the local disk can be moved, or read from somewhere else: all paths are relative to the base folder
- it supports recursive backups (e.g. of deeply nested folders), and makes it easy to include only some of the contents (e.g. all ".csv" files from subfolders that also contain other files)

__Warning: I have not finalised testing and documentation, but the package seems to work as expected at this stage. Given that, by default, it requests access only to files created by itself, it should be mostly safe to use. Of course, make sure it is fit for (your) purpose. Consider creating your own Google app (see below).__
## Core motivation
By default, when you use the `googledrive` package you give it access to all of your Google Drive. This has a number of shortcomings, in particular if you use Google Drive for different purposes and have many files and folders stored there:

- when going through a lot of files, `googledrive` is slow to find what you need
- in particular if you use `googledrive` from a remote server, you may feel uncomfortable potentially leaving an open door to *all* of the files you have stored there
- even if you use it locally, you may be concerned about the mistakes you could make by giving full access to all of your files to scripts written by you, or to packages published by a random GitHub user (such as the author of this package)

While you can give full access to Google Drive if you so wish, `rbackupr` is developed based on the expectation that it will have access only to the ["drive.file" scope of Google Drive](https://developers.google.com/drive/api/v2/about-auth), i.e. that it has "access to files created or opened by the app". You still need to trust the scripts you are running, of course, but at least you can be sure that they will not mess with completely unrelated files you keep on Google Drive.
See the screenshot of the Google authentication prompt, where it is made clear that you give `rbackupr` access only to "the specific Google Drive files that you use with this app."
<img src="man/figures/google_login_screen.png" align="center"/>
To reduce the impact of some of the issues that come with this choice, `rbackupr` caches metadata about remote folders and files locally in order to speed up processing.

An additional benefit of `rbackupr` is that it is agnostic about the location of the base folder, as it stores only relative paths: it is possible to keep a folder up to date from different locations, or to move a folder and have it updated without issues.
## Installation
You can install the development version of `rbackupr` from [GitHub](https://github.com/) with:
```{r install, eval = FALSE}
# install.packages("remotes")
remotes::install_github("giocomai/rbackupr")
```
## How to use
In order to show how this package works, we will first create some files with random data in a temporary folder. This is what our base folder looks like: a main folder, with some sub-folders, and some files within each of them.
```{r example_folder_tree, eval=TRUE}
library("rbackupr")
base_temp_folder <- fs::dir_create(fs::path(
tempdir(),
"rbackupr_testing"
))
subfolder_names <- stringr::str_c(
"data_",
sample(x = 10:100, size = 3, replace = FALSE)
)
purrr::walk(
.x = subfolder_names,
.f = function(x) {
current_folder <- fs::path(base_temp_folder, x)
fs::dir_create(path = current_folder)
current_number <- stringr::str_extract(string = x, pattern = "[[:digit:]]+") %>%
as.numeric()
purrr::walk(
.x = rep(NA, sample(x = 2:5, size = 1)),
.f = function(x) {
readr::write_csv(
x = tibble::tibble(numbers = rnorm(n = 10, mean = current_number)),
file = fs::file_temp(
pattern = stringr::str_c("spreadsheet_", current_number, "_"),
tmp_dir = current_folder,
ext = "csv"
)
)
}
)
}
)
fs::dir_tree(base_temp_folder)
```
By default, `rbackupr` does not cache metadata about the files and folders it stores on Google Drive, but you are strongly encouraged to enable caching with `rb_enable_cache()`. By default, metadata is cached in the current working directory; if you use `rbackupr` with different projects, however, you may not want to leave files with cached metadata scattered around your local drive, but rather keep them in a single place where they can easily be accessed. Something like the following could typically be included at the beginning of `rbackupr` backup scripts.
```{r eval = FALSE}
library("rbackupr")
rb_enable_cache()
rb_set_cache_folder(path = fs::path(
fs::path_home_r(),
"R",
"rb_data"
))
rb_create_cache_folder(ask = FALSE)
```
To reduce mixing of files you uploaded with `rbackupr` with other files you likely have on Google Drive, all your files and folders are stored under a base folder, which by default is called "rbackupr". You can call your base folder something else, and you can have more than one base folder, but you are probably fine just keeping the defaults. Under the base folder you will have a "project" folder, and under that folder you will have all the files related to that project.

No matter the full path of your files on your hard drive, the folder you set up for backup will correspond to the project folder.

The first step, then, is to set a project for the current session, and create its folder on Google Drive if it doesn't yet exist.
```{r eval=FALSE}
rb_set_project(project = "rbackupr_testing") # this will set the project for the current session
rb_drive_auth()
rb_get_project(create = TRUE) # if it already exists, it just returns its dribble
```
At this stage, among all your files and folders on Google Drive, you should expect to find a folder called `rbackupr` by default (you can customise this in `rb_drive_create_project()`). Within that folder, you will find another folder, `rbackupr_testing` (the one we set as project). All the files you back up from now on in the current session will be located under this folder.

Notice that since we request only the `drive.file` scope, i.e. access only to files and folders created with the current app, if you run `googledrive::drive_ls()` you should see only those two folders and nothing else.

The following steps are based on the basic idea of caching locally key information about the online resources stored on Google Drive. This is helpful in particular because, if you have many files, even retrieving a full file list from Google Drive to check what needs to be uploaded can be extremely time-consuming. For simplicity, not all of the data included in a dribble are stored in the local cache.

For the base folder, under which all projects are expected to be located, only the name and dribble id are stored: the base folder has no "parent folder".
```{r eval=FALSE}
rb_drive_find_base_folder()
```
Even if all projects are expected to be located under a single base folder, the local cache for projects also stores the id of the base folder as `parent_id`.
```{r eval=FALSE}
rb_get_project()
```
Under the project folder, the real folders and files that are part of the backup are stored.
We can see the folders included in the base project with:
```{r eval=FALSE}
rb_get_project() %>%
rb_get_folders()
```
For files, more details are stored in the cache:
```{r eval=FALSE}
rb_get_project() %>%
rb_get_folders() %>%
dplyr::slice(1) %>%
rb_get_files()
```
## Creating your own app
For more details on creating your own app see in particular:
- [How to get your own API credentials](https://gargle.r-lib.org/articles/get-api-credentials.html)
- [Bring your own OAuth app or API key](https://googledrive.tidyverse.org/articles/bring-your-own-app.html)
Creating your own app has many benefits; most importantly:
- you're in control (well, Google is ultimately always in control, but at least it's just you and them)
- you don't risk hitting rate limits because some other user is busy with the same app

Ultimately, `rbackupr` relies on `gargle` for authentication, so you can still follow the relevant `gargle` documentation to adjust settings such as the following.
```{r eval = FALSE}
# See ?gargle_oauth_cache()
options(
gargle_oauth_email = "jane@example.com",
gargle_oauth_cache = "/path/to/folder/that/does/not/sync/to/cloud"
)
```
## Database cache structure
The local SQLite database created for caching has the following structure:
- a table with details about the "base_folder" ("rbackupr_base_folder"). This may well have a single row in many common use cases.
- a table with details about the project folders ("rbackupr_projects"). This will have one row per project created.
- one table for folders and one for files under each project. They will be called "rbackupr_folders_ProjectName" and "rbackupr_files_ProjectName", where "ProjectName" is the name of the project, typically set with `rb_set_project()`.
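As an illustration, the cache can be inspected directly with `DBI` and `RSQLite`. This is only a sketch: the database file name and path used below are assumptions made for the example, so check the cache folder you set with `rb_set_cache_folder()` for the actual SQLite file created on your system.

```{r inspect_cache, eval = FALSE}
library("DBI")

# Connect to the local cache database. The file name below is an assumption
# for illustration purposes: look inside your cache folder for the actual
# SQLite file created by rbackupr.
cache_db <- DBI::dbConnect(
  RSQLite::SQLite(),
  fs::path(fs::path_home_r(), "R", "rb_data", "rbackupr_cache.sqlite")
)

# List the cached tables, e.g. "rbackupr_base_folder", "rbackupr_projects",
# plus one folders table and one files table per project
DBI::dbListTables(cache_db)

# Inspect the cached project metadata
DBI::dbReadTable(cache_db, "rbackupr_projects")

DBI::dbDisconnect(cache_db)
```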
## Things to keep in mind
Once authenticated, you can still use `googledrive`. Keep in mind that some things may not work as expected, due to the restricted access to Google Drive granted to this app.

For example, listing all files with `googledrive::drive_ls()` will only list files created with this app.
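For instance, after authenticating with `rb_drive_auth()`, listing your Drive contents shows only app-created items (a sketch; the actual output depends on what you have backed up):

```{r drive_ls_example, eval = FALSE}
library("googledrive")

# With the restricted "drive.file" scope, this lists only the files and
# folders created through rbackupr, starting from the "rbackupr" base folder.
googledrive::drive_ls()
```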
## Dockerfile
This repository includes a Dockerfile. It has been created with `dockerfiler::dock_from_desc()`.
## License
Released under the MIT license.