Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for data repositories #1

Open
wants to merge 52 commits into
base: master
Choose a base branch
from
Open

Add support for data repositories #1

wants to merge 52 commits into from

Conversation

ThierryO
Copy link
Owner

Add functionality to store and retrieve R data.frames to a git repository as git optimized text files

ThierryO added 30 commits July 23, 2018 13:33
…rage

Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
…orrectly

Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Swapping two variables when rewritten a data.frame results in a large diff
while the information content of the data hasn't changed. Therefore the
variables will be reordered to match the original order.

Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
When a line is moved in a file, the resulting diff is a deletion at the
original location and an addition at the new location. Changing the order of
the observations in a data.frame does not change the information content.
Sorting the data before writing avoids unnecessary diffs.

Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
…within the sorting variables

Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
…tead of "data_repository"

Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
@ThierryO
Copy link
Owner Author

@stijnvanhoey and @florisvdh: can you have another look at this?

Main changes:

  • use the git_repository class instead of the data_repository class (which has been removed)
  • allow the user to store data in verbose format (default remains the optimized format)
  • automatic staging of the changes is now optional
  • add a new functionrm_data() to remove data objects in bulk
  • updated vignette

Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
dir.exists() is not available in R < 3.2.0

Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
### Storing data

Use `write_delim_git()` to store a `data.frame` into the repository. The function will separate the data and the metadata. The data is stored as a headerless, unquoted tab delimited file with ".tsv" extension and UTF-8 encoding. The metadata is stored in YAML format with ".yml" extension. Therefore any extension given to the `file` will be stripped (with a warning).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to check if I get it right... the metadata contains the variable names, their type and in case of factors the mapping between levels and labels + the sorting order?

Should attributes of the data.frame be saved as metadata as well? For instance rownames?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO: if rownames are important they should be a variable


```{r}
# undo the remove by resetting to the last commit
reset(commits(repo)[[1]], "hard")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this line necessary here?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative is rewritting the data objects. resetting the repo to the last commit is more elegant.


### Verbose data storage

`write_delim_git()` will store the data by default in an optimize way in the repository. The downside of this is that the stored data is less human-readable.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optimize -> optimized

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 51c4d7e


### Reading data

Retrieving data is straight forward. Use `read_delim_git` and provide the `file` and the `repo`. The retrieved data is identical to the original data after applying the ordering of variables and observations.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

straight forward -> straightforward

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixing in 51c4d7e

Copy link

@hansvancalster hansvancalster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really useful to have a write and read for data.frames to and from a git repo! Cool!

Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Signed-off-by: Thierry Onkelinx <thierry.onkelinx@inbo.be>
Copy link
Collaborator

@stijnvanhoey stijnvanhoey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the concept of data repository cleared out, the functionality is very clear to me, nice work! I have not verified the code itself, but the vignette provides a good introduction. Thanks, looking forward to further use and test it.

@florisvdh
Copy link
Collaborator

I agree with stijnvanhoey. I did not retest the functions but from the git status output in the vignette they seems to behave very well. Functionality is very straightforward now and at least for me the data handling might become the primary use of the git2r package. I like the way rm_file() now works. Also other changes and enhancements much appreciated.

@florisvdh
Copy link
Collaborator

I have just one small remark; the title of the vignette still says 'data repository'. It does not bother me though, if you left it on purpose -- it is clear from the title that the focus of the vignette is on data and from the explanation it is clear how it works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants