Add support for data repositories #1
base: master
…rage Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
…orrectly Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Swapping two variables when rewriting a data.frame results in a large diff even though the information content of the data hasn't changed. Therefore the variables will be reordered to match the original order. Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
When a line is moved in a file, the resulting diff is a deletion at the original location and an addition at the new location. Changing the order of the observations in a data.frame does not change the information content. Sorting the data before writing avoids unnecessary diffs. Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
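The two commit messages above describe the same idea: bring the data into a fixed layout before writing, so that unchanged information yields an unchanged file. A minimal base R sketch of that idea follows. The helper `canonical_layout()` and its arguments are hypothetical names chosen for illustration and are not taken from the PR.

```r
# Hypothetical helper, not the PR's implementation: put `new` into a fixed
# layout before writing so that unchanged data produces an unchanged file.
canonical_layout <- function(new, original_names, sorting) {
  stopifnot(
    setequal(names(new), original_names),
    all(sorting %in% original_names)
  )
  # restore the column order used when the file was first written
  new <- new[, original_names, drop = FALSE]
  # sort the observations on the sorting variables
  new <- new[do.call(order, unname(as.list(new[sorting]))), , drop = FALSE]
  # drop the shuffled row names so they do not leak into the comparison
  rownames(new) <- NULL
  new
}

old <- data.frame(id = 1:3, value = c("a", "b", "c"), stringsAsFactors = FALSE)
# same information, but with the columns swapped and the rows shuffled
shuffled <- old[c(3, 1, 2), c("value", "id")]
all.equal(canonical_layout(shuffled, names(old), sorting = "id"), old)
```

Sorting only gives a stable result when the sorting variables identify each row uniquely; with ties, the relative order of the tied rows still depends on the input.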
…within the sorting variables Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
…tead of "data_repository" Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
@stijnvanhoey and @florisvdh: can you have another look at this? Main changes:
dir.exists() is not available in R < 3.2.0 Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
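For readers on older R versions, one common way around the missing `dir.exists()` is to test `file.info()` instead. The sketch below is an assumption about how such a fallback could look; the name `dir_exists()` is hypothetical and the commit itself may solve this differently.

```r
# Hypothetical fallback, not taken from the PR: works on R versions that
# predate dir.exists(), which was added in R 3.2.0.
dir_exists <- function(path) {
  info <- file.info(path)
  # isdir is NA for paths that do not exist
  !is.na(info$isdir) & info$isdir
}

dir_exists(tempdir())
dir_exists(file.path(tempdir(), "no-such-directory"))
```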
### Storing data

Use `write_delim_git()` to store a `data.frame` into the repository. The function will separate the data and the metadata. The data is stored as a headerless, unquoted tab delimited file with ".tsv" extension and UTF-8 encoding. The metadata is stored in YAML format with ".yml" extension. Therefore any extension given to the `file` will be stripped (with a warning).
Just to check if I get it right... the metadata contains the variable names, their type and in case of factors the mapping between levels and labels + the sorting order?
Should attributes of the data.frame be saved as metadata as well? For instance rownames?
IMHO: if rownames are important they should be a variable
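To make the data/metadata split discussed in this thread concrete, here is a rough base R stand-in. It is not `write_delim_git()`: the helper name `write_split()`, the metadata fields and the hand-rolled YAML-like layout are all assumptions for illustration, and the metadata writer does no escaping of special characters.

```r
# Illustrative only: a base R stand-in for the data/metadata split.
# The metadata fields and their layout are assumptions, not the PR's format.
write_split <- function(x, file) {
  base <- tools::file_path_sans_ext(file)
  if (!identical(base, file)) {
    warning("file extension stripped from '", file, "'")
  }
  # data: headerless, unquoted, tab delimited, UTF-8
  write.table(
    x, paste0(base, ".tsv"), sep = "\t", quote = FALSE,
    row.names = FALSE, col.names = FALSE, fileEncoding = "UTF-8"
  )
  # metadata: one entry per variable with its class, plus levels for factors
  meta <- vapply(names(x), function(v) {
    entry <- paste0(v, ":\n  class: ", class(x[[v]])[1])
    if (is.factor(x[[v]])) {
      entry <- paste0(
        entry, "\n  levels: [", paste(levels(x[[v]]), collapse = ", "), "]"
      )
    }
    entry
  }, character(1))
  writeLines(meta, paste0(base, ".yml"))
  invisible(base)
}

df <- data.frame(site = factor(c("b", "a")), count = c(2L, 5L))
base <- write_split(df, file.path(tempdir(), "observations"))
readLines(paste0(base, ".tsv"))
readLines(paste0(base, ".yml"))
```

Row names are deliberately not written, in line with the reply above: if they matter, they can be stored as a regular variable.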
```{r}
# undo the remove by resetting to the last commit
reset(commits(repo)[[1]], "hard")
```
is this line necessary here?
The alternative is rewriting the data objects. Resetting the repo to the last commit is more elegant.
vignettes/data-repository.Rmd
Outdated

### Verbose data storage

`write_delim_git()` will store the data by default in an optimize way in the repository. The downside of this is that the stored data is less human-readable.
optimize -> optimized
fixed in 51c4d7e
vignettes/data-repository.Rmd
Outdated

### Reading data

Retrieving data is straight forward. Use `read_delim_git` and provide the `file` and the `repo`. The retrieved data is identical to the original data after applying the ordering of variables and observations.
straight forward -> straightforward
fixing in 51c4d7e
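The round trip described in the "Reading data" excerpt can be imitated in base R. This is a sketch under stated assumptions, not `read_delim_git()`: the metadata is held in a plain list named `meta`, whose structure is hypothetical, instead of being read from a ".yml" file.

```r
# Illustrative only: restore a data.frame from a headerless tab delimited
# file plus metadata kept in a plain list (hypothetical structure).
original <- data.frame(
  site = factor(c("b", "a"), levels = c("a", "b")),
  count = c(2L, 5L)
)
meta <- list(
  names = names(original),
  classes = c("character", "integer"),
  levels = list(site = levels(original$site))
)

path <- tempfile(fileext = ".tsv")
write.table(
  original, path, sep = "\t", quote = FALSE,
  row.names = FALSE, col.names = FALSE
)

# the file has no header, so names and classes come from the metadata
restored <- read.table(
  path, sep = "\t", header = FALSE, quote = "", comment.char = "",
  col.names = meta$names, colClasses = meta$classes,
  stringsAsFactors = FALSE
)
# rebuild factors with the stored level order
for (v in names(meta$levels)) {
  restored[[v]] <- factor(restored[[v]], levels = meta$levels[[v]])
}
all.equal(restored, original)
```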
Really useful to have a write and read for data.frames to and from a git repo! Cool!
With the concept of a data repository cleared up, the functionality is very clear to me, nice work! I have not verified the code itself, but the vignette provides a good introduction. Thanks, looking forward to using and testing it further.
I agree with stijnvanhoey. I did not retest the functions, but from the git status output in the vignette they seem to behave very well. Functionality is very straightforward now, and at least for me the data handling might become the primary use of the git2r package. I like the way rm_file() now works. The other changes and enhancements are also much appreciated.
I have just one small remark: the title of the vignette still says 'data repository'. It does not bother me though, if you left it on purpose; it is clear from the title that the focus of the vignette is on data, and from the explanation it is clear how it works.
Add functionality to store and retrieve R data.frames to a git repository as git optimized text files