Example Dataset #657

Closed
1 of 4 tasks
srivarra opened this issue Aug 16, 2022 · 10 comments · Fixed by #685
@srivarra
Contributor

srivarra commented Aug 16, 2022

This is for internal use only; if you'd like to open an issue or request a new feature, please open a bug or enhancement issue

Section 1: Design details

Relevant background

Currently we do not have a general example dataset that is consistent across ark and toffy, nor a method to keep it up to date for example use cases, testing, and development. In addition, keeping all the example data, notebook output, and auxiliary data files in a separate directory will benefit users and developers. For example, devs won't have to manually remove the files created by running tests.

Design overview

We need a general directory that stores data of all forms: raw input, processed intermediate stages (such as feather files), figures (segmentation labels, deepcell, mantis), and final data.

Having a version-controlled example dataset will allow us to update, change, and improve it for future users. Most importantly, it'll be a good learning resource for anyone using Ark. Hugging Face has a robust solution for this, where users can upload data, model weights, and more. For now we'll only consider uploading a dataset. The UI is very similar to GitHub's, so its familiarity should make for a smoother process. Hugging Face also provides a nice Python API for downloading and uploading datasets via Git Large File Storage (LFS).
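As a rough illustration of the upload side, here is a minimal sketch that pushes a local folder to a Hugging Face dataset repository with `HfApi.upload_folder` from the `huggingface_hub` package. The function name `upload_example_dataset` and the repository id are hypothetical placeholders, not names from this design, and the sketch assumes a token has already been stored locally (for example via `huggingface-cli login`).

```python
from pathlib import Path


def upload_example_dataset(folder, repo_id, api=None):
    """Upload a local folder to a Hugging Face dataset repository.

    `repo_id` is a placeholder such as "org/dataset-name"; the real
    repository name is not fixed by this design doc.
    """
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"{folder} is not a directory")

    if api is None:
        # Imported lazily so this module loads without huggingface_hub installed.
        from huggingface_hub import HfApi

        api = HfApi()  # assumes a token saved by a prior login

    # repo_type="dataset" targets a dataset repo rather than a model repo.
    return api.upload_folder(
        folder_path=str(folder),
        repo_id=repo_id,
        repo_type="dataset",
    )
```

Passing `api` explicitly makes it possible to exercise the function with a stand-in object, without touching the network.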

Design list/flowchart

[Image: Example Dataset-3]

Required inputs

  • Dataset:
    • Pick a sample of FOVs with their respective channels (ideally from a lab member's project).
    • Organize the sample as a real user would see it.
    • Create a small script which can be used to upload the dataset to Hugging Face.
      • Uploading requires authentication; downloading does not.
    • Adjust the Notebook input and output directories to the following, depending on the context:
dataset_directory/
├── example_dataset/
│   ├── raw_dataset/
│   │   ├── fov0/
│   │   │   ├── chan0.tiff
│   │   │   ├── chan1.tiff
│   │   │   ├── chan2.tiff
│   │   │   └── ...
│   │   └── fov1/
│   │       ├── chan0.tiff
│   │       ├── chan1.tiff
│   │       ├── chan2.tiff
│   │       └── ...
│   ├── processed_dataset/
│   │   ├── deepcell_output/
│   │   └── pixie_output/
│   │       ├── pixel_clustering/
│   │       └── cell_clustering/
│   ├── figures/
│   └── output/
│       ├── fov0.feather
│       └── ...
└── your_dataset/
    └── ...
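The layout above could be created ahead of time with a few lines of `pathlib`. This is only a sketch: the helper name `make_dataset_skeleton` is hypothetical, and the per-FOV folders and files are left out because they come from the data itself.

```python
from pathlib import Path

# Subfolders taken from the directory tree in this design doc.
SUBDIRS = [
    "raw_dataset",
    "processed_dataset/deepcell_output",
    "processed_dataset/pixie_output/pixel_clustering",
    "processed_dataset/pixie_output/cell_clustering",
    "figures",
    "output",
]


def make_dataset_skeleton(dataset_directory, name="example_dataset"):
    """Create the empty folder skeleton for one dataset and return its root."""
    base = Path(dataset_directory) / name
    for sub in SUBDIRS:
        # exist_ok=True makes repeated calls harmless.
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base
```

Calling `make_dataset_skeleton("dataset_directory", name="your_dataset")` would produce the same skeleton for a user's own data.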

Output files

Files will be written to dataset_directory, in a user-determined subfolder. For the example dataset, that subfolder is example_dataset.

The contents of dataset_directory will be included in the .gitignore.

Section 2: Implementation details

Once you have completed section 1, please tag the relevant parties and iterate on the initial design details until everyone is satisfied. Then, proceed to section 2.

Control flow

The dataset will contain two zip files: one with just enough data to work with ark starting from notebook 1, and a debug dataset that contains all the data needed for every notebook.

  • Call load_dataset("org/dataset-name") from the datasets library.
  • Decide whether this is for the regular dataset or the debug dataset.
  • If the files do not exist, download them and move them into /datasets/.
  • In the notebooks, have an optional get dataset cell.
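The control flow above can be sketched as a small helper that skips the download when the data is already on disk. The name `get_example_dataset` and the `fetch` callable are assumptions for illustration: `fetch` stands in for whatever actually retrieves the files from Hugging Face (for example a wrapper around `load_dataset`), which this sketch does not implement.

```python
from pathlib import Path

VARIANTS = ("regular", "debug")


def get_example_dataset(dataset_dir, fetch, variant="regular",
                        name="example_dataset"):
    """Return the local dataset folder, downloading it only if missing.

    `fetch(variant, target)` is a caller-supplied function (hypothetical)
    that places the files for the chosen variant into `target`.
    """
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}, got {variant!r}")

    target = Path(dataset_dir) / name
    # A non-empty folder is treated as "already downloaded".
    if target.is_dir() and any(target.iterdir()):
        return target

    target.mkdir(parents=True, exist_ok=True)
    fetch(variant, target)
    return target
```

An optional notebook cell could then call `get_example_dataset(...)` only when the user opts in, and running the cell twice would not trigger a second download.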

Milestones and timeline

  • Create the dataset for Hugging Face
  • Reorganize the folder structure for datasets
  • Adjust notebook paths
  • Notebook test adjustments if necessary.

Hugging Face Links

@srivarra srivarra added the design_doc Detailed implementation plan label Aug 16, 2022
@srivarra srivarra self-assigned this Aug 16, 2022
@srivarra
Contributor Author

@ngreenwald

@ngreenwald
Member

@cliu72 how does this look?

@ngreenwald
Member

Looks great.

  1. Can we add the necessary code to download the data at the top of the notebooks, so the user doesn't have to do anything? Maybe with a check to see if the folder exists already; if it does, skip?
  2. If I'm reading this correctly, the entire example dataset folder being .gitignored means we don't have to have separate input and output folders. Instead, we can have the data get saved in the default, logical place, overwriting (if necessary) the previous version without any version control issues, correct?

@srivarra
Contributor Author

  1. Can we add the necessary code to download the data at the top of the notebooks, so the user doesn't have to do anything? Maybe with a check to see if the folder exists already; if it does, skip?

Yeah, we could create a small function that can download the example dataset if it's not present and the user declares that they want it downloaded.

  2. If I'm reading this correctly, the entire example dataset folder being .gitignored means we don't have to have separate input and output folders. Instead, we can have the data get saved in the default, logical place, overwriting (if necessary) the previous version without any version control issues, correct?

Yes, that is correct.

@ngreenwald
Member

Okay cool, looks good. First priority is getting the ark notebook reorganization PR in, then we can get started on this and the docker tag update.

@cliu72
Contributor

cliu72 commented Aug 17, 2022

@cliu72 how does this look?

Looks good to me! Just to be clear, are we only going to upload single-channel TIFs for now, or are we uploading intermediate files too? Since pixel clustering is stochastic, everyone would get slightly different weights, so I wonder if it will be confusing (since the feather files they generate won't match up with the ones uploaded). But also, having the intermediate files would be good if users only want to test later parts of the pipeline.

@ngreenwald
Member

I don't think they'll ever know the feather files are different, since they won't be looking at them.
It's up to you. Would it be useful to give users the option of just running cell clustering without the pixel notebook? Or are most people gonna want to run both? If so, no need for the intermediate files.

@srivarra
Contributor Author

Looks good to me! Just to be clear, are we only going to upload single-channel TIFs for now, or are we uploading intermediate files too? Since pixel clustering is stochastic, everyone would get slightly different weights, so I wonder if it will be confusing (since the feather files they generate won't match up with the ones uploaded). But also, having the intermediate files would be good if users only want to test later parts of the pipeline.

We can do both: set up one branch with just the dataset and no extraneous data (good for general users), and another 'development' branch with all the data needed to start from any notebook.

@cliu72
Contributor

cliu72 commented Aug 22, 2022

Yeah, I think having both would be good. And in the version with intermediate files, include a disclaimer saying that if you run it from the beginning on your own, your files might not match the intermediate files.

@ngreenwald
Member

@srivarra forgot about this one, this should come before all the other stuff we talked about today

This was referenced Aug 25, 2022