Use/Maintain Appropriate File Formats for Preservation and Reproducibility #6006

Closed
djbrooke opened this issue Jul 10, 2019 · 7 comments

@djbrooke
Contributor

We discussed #6002 and #2720 in sprint planning today and plan to work on both during a future sprint. I'm closing out both of those in favor of this one.

  • For reproducibility, we want to maintain the original file format for tabular files because .tab may be referenced in scripts
  • For preservation, .tsv is preferred as it is more recognized in the community

We should determine a way to meet both needs.

@pdurbin
Member

pdurbin commented Dec 5, 2019

> For preservation, .tsv is preferred as it is more recognized in the community

Not just for preservation. For simple things like opening the tab-separated file in Excel. Please see https://twitter.com/Ray_J__/status/1202296388618457089 and the screenshot below:

[Image: Screen Shot 2019-12-04 at 10 34 40 PM]

@mheppler
Contributor

mheppler commented Mar 5, 2020

Review as part of "Add originalFileName field to json" (#2734) when that is picked up in development.

@pdurbin
Member

pdurbin commented Oct 5, 2022

> For reproducibility, we want to maintain the original file format for tabular files because .tab may be referenced in scripts

I'm not sure I understand what the change would be. We always maintain the original file.

> For preservation, .tsv is preferred as it is more recognized in the community

I re-opened #2720 because I feel strongly that we should use .tsv instead of .tab

Given the above, is there any reason to keep this issue open?

Vote to close.

@jggautier
Contributor

Regarding the first comment about reproducibility: the Dataverse software always maintains the original file, but the file and information about it are not always easily accessible. I think this has improved since this issue was opened, but I can think of at least one case where it could be handled better:

The last time I ran the Binder integration on a dataset I uploaded, Binder ignored my dataset's .csv files and tried instead to use the .tab files that were created by the Dataverse software's ingest process. But my dataset's Python script was written to do things with the .csv files. It assumed the files would be .csv files.

To work around this, I had to replace the .csv files in my dataset with .tab files and adjust my Python script to do things with .tab files instead. I would imagine that a researcher who wants to make their computational workflow reproducible by uploading it to a Dataverse repository and using something like Binder would not anticipate needing to use .tab files instead of .csv files.
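As a rough illustration of the mismatch described above, a script could be written defensively so that it reads the .csv file when it is present and otherwise falls back to a .tab file with the same base name. This is only a sketch: the function name `load_table` is hypothetical, and it assumes the .tab file is plain tab-delimited text encoded as UTF-8.

```python
import csv
from pathlib import Path


def load_table(csv_path):
    """Return all rows from csv_path, or from a sibling .tab file.

    Assumption: if only the .tab file exists, it is tab-delimited text
    with the same base name as the expected .csv file.
    """
    path = Path(csv_path)
    if path.exists():
        delimiter = ","
    else:
        # Fall back to the ingested version, e.g. data.csv -> data.tab
        path = path.with_suffix(".tab")
        delimiter = "\t"
    # Raises FileNotFoundError if neither file is present
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter=delimiter))
```

A script would then call `load_table("data.csv")` and get the same rows whether the environment delivered the .csv or the .tab file. This only papers over the problem on the script side; it does not change which file Binder downloads.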

@pdurbin
Member

pdurbin commented Oct 5, 2022

@jggautier you're definitely right that there's something to fix for Binder. I just launched my dataset there and what I see is the .tab version, like you're saying.

Binder uses repo2docker under the covers and here's where Dataverse support was added: jupyterhub/repo2docker#739

We could submit a PR to repo2docker to change the behavior so that original files rather than preservation (.tab) files are downloaded from Dataverse. I'd be worried about backward compatibility though.

Anyway, we need a specific, actionable plan. I'm happy to talk about this whenever.
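To make the idea of downloading original files concrete, here is a minimal sketch of how a client could build a download URL that asks for the originally uploaded file instead of the ingested .tab version. It relies on the `format=original` query parameter of the Dataverse Data Access API as I understand it. The helper name `datafile_url`, the server URL, and the file ID are hypothetical, and this is not necessarily how repo2docker does or would do it.

```python
from urllib.parse import urlencode


def datafile_url(base_url, file_id, original=True):
    """Build a Data Access API download URL for one file.

    Assumption: appending format=original requests the uploaded file
    (e.g. .csv) instead of the ingested .tab version.
    """
    query = "?" + urlencode({"format": "original"}) if original else ""
    return f"{base_url.rstrip('/')}/api/access/datafile/{file_id}{query}"


# Hypothetical server and file ID, for illustration only
print(datafile_url("https://demo.dataverse.org", 42))
```

Making the behavior a parameter with a default, as in this sketch, is one way to think about the backward compatibility concern: existing callers could keep getting the .tab files if the default were set to `False`.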

@pdurbin
Member

pdurbin commented Oct 8, 2023

> We could submit a PR to repo2docker to change the behavior so that original files rather than preservation (.tab) files are downloaded from Dataverse.

This is exactly what I did:

@pdurbin
Member

pdurbin commented Nov 11, 2023

Closing in favor of this issue:

@pdurbin closed this as not planned on Nov 11, 2023
5 participants