Support loading datapackages from zip files #193

Open
khusmann opened this issue Mar 25, 2024 · 0 comments
Labels
enhancement New feature or request function:read_package Function read_package()

khusmann commented Mar 25, 2024

As I've briefly mentioned in other discussions (#158), I think the ability to load data packages as zipped blobs would potentially aid in the adoption of frictionless by many research communities.

Presently, SPSS, SAS, and Stata dominate the scene in many fields for curating and exchanging data. For example, ICPSR prefers data submitted in these formats: https://www.icpsr.umich.edu/web/pages/deposit/index.html

I think one of the key reasons these formats are so popular (aside from vendor lock-in) is that they allow data and metadata to be bundled together into a single file. This makes it really easy to distribute data: when I attach the file to an email or upload it to OneDrive to share with colleagues, the data and metadata travel as a unified package.

By contrast, a standard data package is a folder containing a collection of files and a "datapackage.json" file. If I wish to send this to someone by email or download link, I need to zip it, and the person receiving it must unzip it and call read_package("./survey-data/datapackage.json"). It might seem like a small thing, but I think this ends up creating a lot of friction for users who don't have much technical experience -- I end up having to explain to them what the "datapackage.json" file is, and how it holds metadata with references to the actual data but doesn't hold data itself. These are good things to learn, but I think they add complexity for end users that is a barrier to adoption. (In my experience, users are often tempted to go straight to the CSV files in the data package and load them manually, ignoring or feeling intimidated by the datapackage.json file.)

It would be so much easier if I could send someone a "survey-data.zip" file (or post it at a download link) and tell them all they need to do is download it and call read_package("./survey-data.zip"). This way, they can store and access the package as a self-contained, read-only blob they can instantly start using without having to learn anything about how its internals are organized. This matches the use case of the binary SPSS / SAS / Stata formats, and allows frictionless to be a drop-in replacement. Actually, better than a drop-in replacement, because the data package zip can include multiple resources, not just a single table.

This also gives us a nice path to migrate users away from SPSS / SAS / Stata formats -- provide me with "survey_data.sav" and I can produce a one-to-one "survey_data.zip" containing all of the data and metadata in frictionless format. They're both formats that bundle data and metadata, but one is open!

Note that readr / vroom support loading specific files from multi-file archives (https://vroom.r-lib.org/articles/vroom.html#reading-individual-files-from-a-multi-file-zip-archive), so it'd be pretty easy to load only what we need when read_resource() calls for it, rather than unzipping the entire archive to a temporary folder.
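To make the "only load what's needed" idea concrete, here is a rough, language-neutral sketch in Python using only the standard library. It is not the frictionless R API: the function names `read_descriptor` and `read_resource_rows` are hypothetical, and it assumes that "datapackage.json" sits at the root of the archive, that each resource has a single string `path`, and that the resource is a UTF-8 CSV with a header row.

```python
import csv
import io
import json
import zipfile


def read_descriptor(zip_path):
    """Read only datapackage.json from the archive (assumed to be at its root)."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open("datapackage.json") as handle:
            return json.load(handle)


def read_resource_rows(zip_path, resource_name):
    """Read one CSV resource by name; other archive members are never extracted."""
    descriptor = read_descriptor(zip_path)
    resource = next(
        (r for r in descriptor["resources"] if r["name"] == resource_name),
        None,
    )
    if resource is None:
        raise KeyError(f"No resource named {resource_name!r} in {zip_path}")
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: `path` is a single string relative to the archive root.
        with archive.open(resource["path"]) as handle:
            text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
            return list(csv.DictReader(text))
```

In this sketch nothing is written to disk: the descriptor is read when the package is opened, and each table is decompressed only when it is requested, which mirrors the read_package() / read_resource() split described above.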

I could go either way on loading *.zip files from remote URLs -- instead of downloading / caching remote zip files, I'd be more inclined to have this feature work only for local paths, and if a remote zip is given, show the user a message that they'll need to download it themselves.
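The local-only behavior could look roughly like the following Python sketch. Again, this is only an illustration and not the package's implementation: `require_local_zip` is a hypothetical helper, and the list of schemes treated as remote is an assumption.

```python
from urllib.parse import urlparse

# Assumption: these are the URL schemes we would treat as "remote".
REMOTE_SCHEMES = {"http", "https", "ftp"}


def require_local_zip(path):
    """Return the path unchanged if it is local; otherwise ask the user to download it."""
    scheme = urlparse(path).scheme.lower()
    if scheme in REMOTE_SCHEMES:
        raise ValueError(
            f"Reading zipped data packages from remote URLs is not supported. "
            f"Please download {path} and pass the local file path instead."
        )
    return path
```

Failing early with a clear message keeps the feature simple (no download or caching logic) while still telling the user exactly what to do next.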
