Example Dataset #657

srivarra · 2022-08-16T23:17:56Z

This is for internal use only; if you'd like to open an issue or request a new feature, please open a bug or enhancement issue

Section 1: Design details

Relevant background

Currently we do not have a general example dataset that is consistent across ark and toffy, nor a method to keep it up to date for example use cases, testing, and development. In addition, keeping all the example data, notebook output, and auxiliary data files in a separate directory will benefit users and developers. For example, after running tests, devs won't have to remove the created files manually.

Design overview

We need a general directory that stores data in all forms: raw input, processed intermediate stages (like feather files), figures (segmentation labels, deepcell, mantis), and final data.

Having a version-controlled example dataset will let us update and improve it for future users. Most importantly, it will be a good learning resource for anyone using Ark. Hugging Face has a robust solution for this: users can upload datasets, model weights, and more. For now we'll only consider uploading a dataset. The UI is very similar to GitHub's, so its familiarity should make the process smoother. Hugging Face also provides a Python API for downloading and uploading datasets, which are stored via Git Large File Storage (LFS).

Design list/flowchart

Required inputs

Dataset:
- Pick a sample of FOVs with their respective channels (ideally from a lab member's project).
- Organize the sample as a real user would see it.
- Create a small script which can be used to upload the dataset to Hugging Face.
  - Uploading requires authentication; downloading does not.
- Adjust the Notebook input and output directories to the following, depending on the context:

dataset_directory/
├── example_dataset/
│   ├── raw_dataset/
│   │   ├── fov0/
│   │   │   ├── chan0.tiff
│   │   │   ├── chan1.tiff
│   │   │   ├── chan2.tiff
│   │   │   └── ...
│   │   └── fov1/
│   │       ├── chan0.tiff
│   │       ├── chan1.tiff
│   │       ├── chan2.tiff
│   │       └── ...
│   ├── processed_dataset/
│   │   ├── deepcell_output/
│   │   └── pixie_output/
│   │       ├── pixel_clustering/
│   │       └── cell_clustering/
│   ├── figures/
│   └── output/
│       ├── fov0.feather
│       └── ...
└── your_dataset/
    └── ...
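
The folder layout above can be created ahead of time with a few lines of Python. This is only a sketch: the function name `make_dataset_skeleton` and the `SUBFOLDERS` constant are hypothetical and not part of ark or toffy, and the per-FOV folders are left out because they depend on the data.

```python
from pathlib import Path

# Subfolders copied from the tree above. FOV folders and their channel
# files are omitted on purpose, since they come from the data itself.
SUBFOLDERS = (
    "raw_dataset",
    "processed_dataset/deepcell_output",
    "processed_dataset/pixie_output/pixel_clustering",
    "processed_dataset/pixie_output/cell_clustering",
    "figures",
    "output",
)


def make_dataset_skeleton(dataset_directory, name="example_dataset"):
    """Create the empty folder layout for one dataset and return its root.

    Hypothetical helper; existing folders are left untouched.
    """
    root = Path(dataset_directory) / name
    for sub in SUBFOLDERS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```

Calling `make_dataset_skeleton("dataset_directory", "your_dataset")` would give a user's own dataset the same layout as the example one.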

Output files

Files will be written to the dataset_directory, into a user-determined subfolder. For the example dataset, that subfolder is example_dataset.

The contents of dataset_directory will be included in the .gitignore.

Section 2: Implementation details

Once you have completed section 1, please tag the relevant parties and iterate on the initial design details until everyone is satisfied. Then, proceed to section 2.

Control flow

The dataset will contain two zip files: one with just enough data to work with ark starting from notebook 1, and a debug dataset that contains all the data needed for every notebook.

- Call load_dataset("org/dataset-name") from the datasets library.
- Decide whether this is for the regular dataset or the debug dataset.
- If the files do not exist, download them and move them into /datasets/.
- In the notebooks, have an optional "get dataset" cell.
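
The "download only if missing" step above could look roughly like the sketch below. The function `ensure_dataset` and its parameters are hypothetical, not an existing ark function. The actual download is passed in as a callable (`fetch`), so the check itself stays independent of how the data is retrieved; in practice `fetch` would wrap the `load_dataset` call from the `datasets` library mentioned above.

```python
from pathlib import Path
from typing import Callable


def ensure_dataset(
    dataset_directory,
    name: str,
    fetch: Callable[[Path], None],
    download: bool = True,
) -> Path:
    """Return the dataset folder, fetching its contents only if missing.

    Hypothetical helper. `fetch` receives the target folder and is expected
    to place the dataset files inside it.
    """
    target = Path(dataset_directory) / name

    # Skip the download when the folder already exists and is not empty.
    if target.is_dir() and any(target.iterdir()):
        return target

    # The user has to opt in to downloading.
    if not download:
        raise FileNotFoundError(f"{target} is missing and download=False")

    target.mkdir(parents=True, exist_ok=True)
    fetch(target)
    return target
```

An optional notebook cell would call this once at the top, choosing `name` according to whether the regular or the debug dataset is wanted.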

Milestones and timeline

- Create the dataset for Hugging Face.
- Reorganize the folder structure for datasets.
- Adjust notebook paths.
- Adjust notebook tests if necessary.

Hugging Face Links

Example Dataset Repo

srivarra · 2022-08-16T23:18:55Z

@ngreenwald

ngreenwald · 2022-08-16T23:29:22Z

@cliu72 how does this look?

ngreenwald · 2022-08-16T23:32:06Z

Looks great.

Can we add the necessary code to download the data at the top of the notebooks, so the user doesn't have to do anything? Maybe with a check to see if the folder exists already; if it does, skip?
If I'm reading this correctly, the entire example dataset folder being .gitignored means we don't have to have separate input and output folders. Instead, we can have the data get saved in the default, logical place, overwriting (if necessary) the previous version without any version control issues, correct?

srivarra · 2022-08-17T00:26:42Z

> Can we add the necessary code to download the data at the top of the notebooks, so the user doesn't have to do anything? Maybe with a check to see if the folder exists already; if it does, skip?

Yeah, we could create a small function that can download the example dataset if it's not present and the user declares that they want it downloaded.

> If I'm reading this correctly, the entire example dataset folder being .gitignored means we don't have to have separate input and output folders. Instead, we can have the data get saved in the default, logical place, overwriting (if necessary) the previous version without any version control issues, correct?

Yes, that is correct.

ngreenwald · 2022-08-17T02:43:28Z

Okay cool, looks good. First priority is getting the ark notebook reorganization PR in, then can get started on this/docker tag update

cliu72 · 2022-08-17T07:08:57Z

> @cliu72 how does this look?

Looks good to me! Just to be clear, are we only going to upload single-channel TIFs for now, or are we uploading intermediate files too? Since pixel clustering is stochastic, everyone would get slightly different weights, so I wonder if it will be confusing (since the feather files they generate won't match up with the ones uploaded). But also, having the intermediate files would be good if users only want to test later parts of the pipeline.

ngreenwald · 2022-08-17T15:19:22Z

I don't think they'll ever know the feather files are different, since they won't be looking at them.
It's up to you, would it be useful to give users the option of just running cell clustering without the pixel notebook? Or are most people gonna want to run both. If so, no need for the intermediate files.

srivarra · 2022-08-18T17:58:57Z

> Looks good to me! Just to be clear, are we only going to upload single-channel TIFs for now, or are we uploading intermediate files too? Since pixel clustering is stochastic, everyone would get slightly different weights, so I wonder if it will be confusing (since the feather files they generate won't match up with the ones uploaded). But also, having the intermediate files would be good if users only want to test later parts of the pipeline.

We can do both: set up one branch with just the dataset and no extraneous data (good for general users), and another 'development' branch with all the data needed to start from any notebook.

cliu72 · 2022-08-22T23:41:40Z

Yeah, I think having both would be good. And in the version with intermediate files, have a disclaimer that says that if you run it from the beginning on your own, your files might not match with the intermediate files.

ngreenwald · 2022-08-24T21:57:14Z

@srivarra forgot about this one, this should come before all the other stuff we talked about today

srivarra added the design_doc Detailed implementation plan label Aug 16, 2022

srivarra self-assigned this Aug 16, 2022

This was referenced Aug 18, 2022

Switch example data back to channels last format #383

Closed

Create single consolidated example dataset that is used across all notebooks #408

Closed

This was referenced Aug 25, 2022

Next Release - v0.4.1 #674

Merged

Example Dataset #685

Merged

ngreenwald closed this as completed in #685 Sep 8, 2022
