How can I use imagenet-22k-wds and imagenet-w21-wds from Hugging Face with timm/train.py? #2152

TheDarkKnight-21th · 2024-04-18T03:50:28Z

TheDarkKnight-21th
Apr 18, 2024

I know that wds means WebDataset, and that it is different from image files stored locally.

So how can I use timm/imagenet-22k-wds and timm/imagenet-w21-wds from Hugging Face with timm/train.py?

Which arguments should I pass to that script?

Also, is it possible to stream the .tar (wds) shards from a URL, without storing the data locally? If so, how can I get the URL, like the imagenet-12k-wds one in the WebDataset example? https://huggingface.co/docs/hub/datasets-webdataset

If loading data from a URL is possible, which is faster: loading from a URL or loading from a locally stored dataset?

I already finished downloading timm/imagenet-w21-wds to my local server with the "datasets" library.

< imagenet-w21-wds path>

Please check timm/train.py: https://github.com/huggingface/pytorch-image-models/blob/main/train.py

rwightman · 2024-04-18T06:02:21Z

rwightman
Apr 18, 2024
Maintainer

@TheDarkKnight-21th you need to find the local path that has the .tar files in it. If it also has the _info.json file, you can just use:
python train.py --data-dir /path/to/imagenet-w21-wds/ --dataset wds/ --val-split '' --num-classes 21842

It's actually faster to download the large wds tar datasets using the cli tool and enabling HF transfer.
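As a rough sketch of that download step, the same thing can be done from Python instead of the CLI. This swaps in `huggingface_hub.snapshot_download` for the cli tool; the `download_wds` helper name and the local path are placeholders, and the `HF_HUB_ENABLE_HF_TRANSFER` switch assumes a `huggingface_hub` version that still supports the separately installed `hf_transfer` package.

```python
import os

# Assumption: hf_transfer is installed (pip install hf_transfer). The variable has
# to be set before huggingface_hub is first imported, or it has no effect.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"


def download_wds(repo_id="timm/imagenet-w21-wds",
                 local_dir="/path/to/imagenet-w21-wds"):
    """Hypothetical helper: fetch every file of a dataset repo into local_dir."""
    # Imported here so the environment variable above is already in place.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",  # the wds sets are dataset repos, not model repos
        local_dir=local_dir,
    )

# Usage (downloads the full dataset, so it is left commented out):
# path = download_wds()
# then: python train.py --data-dir <path> --dataset wds/ ...
```

The returned path is the folder to pass as --data-dir, provided the .tar shards and _info.json end up directly inside it.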

train.py will then use the shard info in the _info.json file. There is no val split for w21-wds, which is why I pass an empty '' to disable validation. The in12k and in22k datasets have a val subset that I created, so the empty --val-split isn't needed for them. You can also manually specify the splits and use a subset of the shards for val (they are shuffled). The manual split format is the shard names, followed by | and then the number of samples:

python train.py --data-dir /imagenet-w21-wds/ --dataset wds/ --train-split 'imagenet_w21-train-{0000..1983}.tar|12741248' --val-split 'imagenet_w21-train-{1984..2047}.tar|411008' --num-classes 21842
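To make the manual split format concrete, here is a small sketch that takes a `shards|num_samples` string apart. `parse_split_spec` is a hypothetical helper written for illustration, not timm's own parser, and it only handles a single numeric `{start..end}` range.

```python
import re


def parse_split_spec(spec):
    """Split 'pattern|num_samples' and expand one {start..end} range.

    Assumes the spec contains a '|' separator, as in the commands above.
    """
    pattern, _, count = spec.rpartition("|")
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        return [pattern], int(count)
    start, end = match.group(1), match.group(2)
    width = len(start)  # preserve zero padding, e.g. 0000
    shards = [
        pattern[:match.start()] + f"{i:0{width}d}" + pattern[match.end():]
        for i in range(int(start), int(end) + 1)
    ]
    return shards, int(count)


train_shards, train_samples = parse_split_spec(
    "imagenet_w21-train-{0000..1983}.tar|12741248")
val_shards, val_samples = parse_split_spec(
    "imagenet_w21-train-{1984..2047}.tar|411008")

print(len(train_shards), train_samples // len(train_shards))  # 1984 6422
print(len(val_shards), val_samples // len(val_shards))        # 64 6422
```

Both sample counts work out to 6422 samples per shard, which is a handy sanity check when carving out a different number of shards for val.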

5 replies

rwightman Apr 18, 2024
Maintainer

Oh wait, w21 has fewer classes: it's 19167, not 21842, so use --num-classes 19167 in the commands above. I need to add it to https://github.com/huggingface/pytorch-image-models/blob/main/timm/data/imagenet_info.py.

rwightman Apr 18, 2024
Maintainer

And make sure those splits are in quotes, or the {0000..1983} braces will be expanded by your shell.
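One way to sidestep shell quoting entirely is to launch train.py from Python with an argument list, since a list argv never passes through the shell. This is only a sketch: `build_train_cmd` is a hypothetical helper, the data path is a placeholder, and the flags are the ones used in the commands above.

```python
import shlex


def build_train_cmd(data_dir, train_split, val_split, num_classes):
    """Hypothetical helper: assemble the train.py argv as a list of strings."""
    return [
        "python", "train.py",
        "--data-dir", data_dir,
        "--dataset", "wds/",
        "--train-split", train_split,
        "--val-split", val_split,
        "--num-classes", str(num_classes),
    ]


cmd = build_train_cmd(
    "/path/to/imagenet-w21-wds/",
    "imagenet_w21-train-{0000..1983}.tar|12741248",
    "imagenet_w21-train-{1984..2047}.tar|411008",
    19167,
)

# shlex.join shows the equivalent, safely quoted shell command line.
print(shlex.join(cmd))

# To actually launch training (no shell involved, so no brace expansion):
# import subprocess
# subprocess.run(cmd, check=True)
```

The printed line wraps both split strings in single quotes, which is the form to copy if you do run the command from a shell.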

TheDarkKnight-21th Apr 18, 2024
Author

....

Also, is it possible to stream the .tar (wds) shards from a URL, without storing the data locally? If so, how can I get the URL, like the imagenet-12k-wds one in the WebDataset example? https://huggingface.co/docs/hub/datasets-webdataset

If loading data from a URL is possible, which is faster: loading from a URL or loading from a locally stored dataset?

...

rwightman Apr 18, 2024
Maintainer

@TheDarkKnight-21th you can see details on streaming in a past tweet: https://twitter.com/wightmanr/status/1743083207443267759

rwightman Apr 18, 2024
Maintainer

While streaming is possible, for training it'd be a LOT of data (every file is re-streamed each epoch), and given the cost of compute at large scale, a network hiccup that causes a timeout means wasted compute. It's better to download fully, and local reads perform better, especially for distributed training.

I'd use streaming for exploring or maybe validation, but not for training.
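For the exploring case, the shard URL asked about earlier can be put together from the Hub's `resolve` URL pattern, following the imagenet-12k-wds example in the WebDataset docs linked above. This is a sketch under assumptions: `hub_shard_url` is a hypothetical helper, and it assumes the shards sit at the root of the dataset repo under the names used in the split strings above, so check the repo's file listing first.

```python
def hub_shard_url(repo_id, shard_pattern, revision="main"):
    """Hypothetical helper: URL for shard files stored at the root of a dataset repo."""
    return (
        f"https://huggingface.co/datasets/{repo_id}"
        f"/resolve/{revision}/{shard_pattern}"
    )


url = hub_shard_url("timm/imagenet-w21-wds", "imagenet_w21-train-{0000..2047}.tar")
print(url)

# Streaming with the webdataset library, as in the Hub docs example
# (add an Authorization header to the curl call if the repo requires a token):
# import webdataset as wds
# dataset = wds.WebDataset(f"pipe:curl -s -L {url}").decode()
```

The brace range is left unexpanded in the URL on purpose, because webdataset expands it into the individual shard URLs itself.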
