Dataset ingestion as pipeline step or async job? #3

jfomhover · 2022-03-22T18:14:06Z

jfomhover
Mar 22, 2022
Collaborator

In some of the vision use cases, the data is available as an archive at a given URL, for example:

Stanford Dogs dataset: http://vision.stanford.edu/aditya86/ImageNetDogs/images.tar
Places365 dataset: http://data.csail.mit.edu/places/places365/places365standard_easyformat.tar

To benchmark our components on those, we need not just a dataset registration, but a step that will untar those archives and potentially split them into train/validation sets.
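To make the shape of such a step concrete, here is a minimal sketch using only the Python standard library. It assumes the archive has already been downloaded to local disk; the function names, the 20% validation fraction, and the fixed seed are illustrative choices, not something decided in this thread.

```python
import random
import tarfile
from pathlib import Path


def extract_archive(archive_path: str, dest_dir: str) -> list[Path]:
    """Untar archive_path into dest_dir and return the extracted files, sorted."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        for member in tar.getmembers():
            # Refuse members that would land outside dest (path traversal).
            target = (dest / member.name).resolve()
            if not target.is_relative_to(dest):
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest)
    return sorted(p for p in dest.rglob("*") if p.is_file())


def split_train_validation(files, validation_fraction=0.2, seed=0):
    """Shuffle with a fixed seed, then cut into (train, validation) lists."""
    shuffled = sorted(files)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]
```

Sorting before the seeded shuffle means the same set of files gives the same split on every run, which matters if the step is rerun on each training pipeline execution.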

There are two options here:

1. Add a step in the training pipeline that performs this data ingestion, potentially running every time we run training.
2. Have a distinct job that produces the train/validation datasets separately and runs just once.

What would be the preference?

For option 1, I was thinking about creating CLI jobs that you run once in your workspace to register the datasets and be done with it.

wayliums · 2022-03-22T18:24:33Z

wayliums
Mar 22, 2022
Collaborator

Those archives are 757MB and 25GB. I definitely think we should go with Option 2. Is this needed as part of a test job? If so, we need a way to easily rerun some code and set it up in the Staging workspace.

7 replies

wayliums Mar 22, 2022
Collaborator

Here's my detailed suggestion now, taking some inspiration from our dpv2 samples

Basically:

In your benchmark GitHub workflow, use a step to register the dataset, directly referencing the HTTP location.
Invoke a pipeline job that takes that dataset and splits it into train/validation.
In the same pipeline, do your other benchmarking activities.
If the dataset hasn't changed, I would assume the training and validation datasets will be reused from the cache.
Once DPV2 outputs support explicit dataset registration, we can modify the split step to register those two datasets so other pipelines can use them too.

wayliums Mar 22, 2022
Collaborator

Actually, could you upload directly to Azure Storage, untar it, and use wasb to mount the dataset?

https://github.com/Azure/azureml-examples/blob/main/cli/jobs/single-step/spark/nyctaxi/job.yml#L9

Then, as part of setup, we just need to set up a workspace connection to that storage location.

Also, there's Azure Open Datasets; if possible, could we use those too? No image datasets for now, though.
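For reference, a wasb-style path has the form `<scheme>://<container>@<account>.blob.core.windows.net/<path>`. The helper below is only a sketch of that format; it uses `wasbs` (the TLS variant of `wasb`), and the function name and any account or container values passed to it are hypothetical.

```python
def wasbs_uri(account: str, container: str, path: str = "") -> str:
    """Build a wasbs:// URI for a blob path (secure variant of the wasb scheme)."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"
```

A URI built this way could then be referenced from the job definition, in the same spirit as the linked example.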

https://docs.microsoft.com/en-us/azure/open-datasets/dataset-catalog

jfomhover Mar 22, 2022
Collaborator Author

Open Datasets is great, but the only image dataset is MNIST, which is provided in binary format (not as images).

wayliums Mar 22, 2022
Collaborator

Right, so that's why I would suggest we host the data in our own storage, so we avoid downloading from the web every time.

dkmiller Mar 22, 2022

@jfomhover, since the data here is non-sensitive (I'm presuming), it should in principle be fine to put it in an unauthenticated blob storage container, which you can either register on the fly or refer to inline.

Alternatively, having deterministic components for creating the benchmarking data is a good approach; this meshes nicely with what @ShizeSu06 and I did for canary pipelines: canary-data-write.
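One way to keep such a data-creation component deterministic is to derive each file's split from a hash of its relative path rather than from a random shuffle. This is a sketch under that assumption and is not taken from canary-data-write; the function name and the 20% default are illustrative.

```python
import hashlib


def assign_split(relative_path: str, validation_percent: int = 20) -> str:
    """Return "train" or "validation" based only on the file's relative path."""
    digest = hashlib.sha256(relative_path.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a bucket in [0, 100).
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "validation" if bucket < validation_percent else "train"
```

Because the assignment depends only on the path, a file keeps the same split across reruns and machines, even if other files are added to or removed from the dataset.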
