Advice for training with TPU #2158
-
I'm training some timm models from scratch for face recognition. Recently I signed up for the TPU Research Cloud. Do you have any advice for training timm models with TPUs? Is PyTorch XLA the way to go, or is rewriting the code in JAX worth it for performance gains/compatibility?
-
Thinking of using the big_vision codebase to train ViT on TPUs, but I would need to convert the dataset to TFDS. Not sure if I can stream TFDS from Hugging Face (trying to avoid the cost of storing the data on GCS).
-
Technically it probably would be possible to hack something together to HTTP-stream TFDS from .tfrecord shards in a HF dataset (it's just raw data after all). But in any case, you wouldn't want to stream from the Hub to Google Cloud for TPU training because it's too slow; you'd be wasting the TPUs. You need to copy your dataset to GCS for training with TPUs if you want any sort of reasonable performance and reliability.
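A minimal sketch of that workflow, assuming the dataset already exists as .tfrecord shards in a HF dataset repo; the repo id, bucket name, and shard pattern below are placeholders, not anything from the original thread:

```python
# Hedged sketch: stage HF-hosted .tfrecord shards onto GCS once, then build the
# input pipeline against the GCS copy. Repo id and bucket name are placeholders.
import tensorflow as tf
from huggingface_hub import snapshot_download

# 1) Pull the shards down from the Hub (a one-time cost, not per training run).
local_dir = snapshot_download(repo_id="you/face-dataset", repo_type="dataset")

# 2) Copy them to your bucket out of band, e.g.:
#      gsutil -m cp -r <local_dir>/*.tfrecord gs://your-bucket/face-ds/
#    TPU workers then read straight from GCS, which is fast and reliable.
files = tf.io.gfile.glob("gs://your-bucket/face-ds/*.tfrecord")
ds = (
    tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
    .shuffle(10_000)
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)
)
```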
I did have timm working quite well with TPUs + PyTorch XLA on an alternate branch with a different API I called bits: https://github.com/huggingface/pytorch-image-models/tree/bits_and_tpu/timm/bits ... a few people were using it successfully at the time. However, I lost reliable access to TPUs and …
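The bits API aside, here is a minimal sketch of the stock single-device torch_xla training pattern with a timm model; the model name, batch shape, and synthetic loader are placeholders, and this is not the branch's API:

```python
# Minimal single-device torch_xla training sketch with a timm model.
# This is the standard torch_xla pattern, not the `bits` API from the branch above.
import timm
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # the TPU core, exposed as a torch device
model = timm.create_model("vit_base_patch16_224", num_classes=1000).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

# Placeholder loader: a few random batches just to make the sketch runnable.
loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,)))
          for _ in range(4)]

model.train()
for images, labels in loader:
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    # barrier=True materializes the lazy XLA graph each step (single-device case).
    xm.optimizer_step(optimizer, barrier=True)
```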