Example Dataset #657
@cliu72 how does this look?
Looks great.
Yeah, we could create a small function that can download the example dataset if it's not present and the user declares that they want it downloaded.
Yes, that is correct.
Okay cool, looks good. First priority is getting the ark notebook reorganization PR in, then can get started on this/docker tag update
Looks good to me! Just to be clear, are we only going to upload single-channel TIFs for now, or are we uploading intermediate files too? Since pixel clustering is stochastic, everyone would get slightly different weights, so I wonder if it will be confusing (since the feather files they generate won't match up with the ones uploaded). But also, having the intermediate files would be good if users only want to test later parts of the pipeline.
I don't think they'll ever know the feather files are different, since they won't be looking at them.
We can do both. Set up one branch with just the dataset, and no extraneous data (good for general users). And another, 'development' branch with all necessary data to start from any notebook.
Yeah, I think having both would be good. And in the version with intermediate files, have a disclaimer that says that if you run it from the beginning on your own, your files might not match with the intermediate files.
@srivarra forgot about this one, this should come before all the other stuff we talked about today
This is for internal use only; if you'd like to open an issue or request a new feature, please open a bug or enhancement issue
Section 1: Design details
Relevant background
Currently we do not have a general example dataset consistent across ark and toffy, nor a method to keep one up to date for example use cases, testing, and development. In addition, keeping all the example data, notebook output, and auxiliary data files in a separate directory will benefit both users and developers; for example, devs won't have to manually remove the files created by test runs.
Design overview
We need a general directory which stores data of all forms: raw input, the processed intermediate stages (like feather files), figures (segmentation labels, DeepCell, Mantis), and final data.
Having a version-controlled example dataset will allow us to update it, make changes, and improve it for future users. Most importantly, it'll be a good learning resource for anyone using Ark. Hugging Face has a robust solution for this, where users can upload data, model weights, and more; for now we'll only consider uploading a dataset. The UI is very similar to GitHub's as well, so its familiarity will allow for a smoother process. Hugging Face also provides a nice Python API for downloading and uploading datasets via Git Large File Storage (LFS).
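As a sketch of the download-if-missing helper floated in the comments above: the function name and `fetch_fn` hook here are hypothetical, and the real fetch would wrap a Hugging Face call such as `datasets.load_dataset(...)` or `huggingface_hub.snapshot_download(...)`.

```python
from pathlib import Path
from typing import Callable

def get_example_dataset(dataset_dir: str,
                        fetch_fn: Callable[[Path], None],
                        overwrite: bool = False) -> Path:
    """Download the example dataset into dataset_dir unless it already exists.

    Hypothetical sketch: fetch_fn performs the actual (network) download,
    e.g. via the Hugging Face Hub; it is injected so the caching logic
    stays testable offline.
    """
    target = Path(dataset_dir) / "example_dataset"
    if target.exists() and not overwrite:
        # User already opted in previously; reuse the cached copy.
        return target
    target.mkdir(parents=True, exist_ok=True)
    fetch_fn(target)
    return target
```

Injecting the fetch step keeps the "only download if the user declares they want it" check separate from whichever Hub API we settle on.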
Design list/flowchart
Required inputs
Output files
Files will be outputted to the `dataset_directory`, into a user-determined subfolder. For the case of the example dataset, that would be `example_dataset`. The contents of `dataset_directory` will be included in the `.gitignore`.
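As a sketch, if `dataset_directory` points at a top-level `datasets/` folder (a hypothetical layout — the actual path is up to the implementation), the corresponding `.gitignore` entry would be a one-liner:

```
# downloaded example/test data -- never committed
datasets/
```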
Section 2: Implementation details
Once you have completed section 1, please tag the relevant parties and iterate on the initial design details until everyone is satisfied. Then, proceed to section 2.
Control flow
The dataset will contain two zip files: one with just enough data to work with ark from notebook 1, and a debug dataset that contains all the data needed for every notebook.
Download via `load_dataset(org/dataset-name)` from `datasets`, with the data stored under `/datasets/`.
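As a rough sketch of the control flow above — the archive names are placeholders, and the actual download of the zips would go through `datasets`/`huggingface_hub` rather than already being on disk — selecting and unpacking one of the two zips might look like:

```python
import zipfile
from pathlib import Path

# Hypothetical archive names; the real dataset repo would define its own.
ARCHIVES = {
    "minimal": "example_dataset.zip",  # just enough to run notebook 1
    "debug": "debug_dataset.zip",      # data needed for every notebook
}

def unpack_dataset(download_dir: str, dataset_dir: str,
                   flavor: str = "minimal") -> Path:
    """Extract the chosen (already-downloaded) archive into dataset_dir."""
    archive = Path(download_dir) / ARCHIVES[flavor]
    target = Path(dataset_dir) / archive.stem
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target
```

Keeping the two flavors behind a single `flavor` switch mirrors the minimal-vs-development split discussed in the comments.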
Milestones and timeline
Hugging Face Links