
Florence2 distributed inference example #3123

Merged — 3 commits from hlky:distributed-inference-florence2 merged into huggingface:main on Oct 9, 2024

Conversation

@hlky (Contributor) commented on Sep 19, 2024

What does this PR do?

This PR adds a distributed inference example for Florence-2.

#3078

Notes

Based on https://gist.github.com/sayakpaul/dfb26d6562b62ba8c2f613dc912a28a8#file-generate_captions_multigpu-py and https://gist.github.com/sayakpaul/cfaebd221820d7b43fae638b4dfa01ba

  • nodesplitter=None in wds.WebDataset seems to be an important implementation detail: by default I get ValueError: you need to add an explicit nodesplitter to your input pipeline for multi-node training, and with wds.split_by_node only 1 GPU is utilized and the script hangs (see the sketch after this list).
  • Similarly, empty_check=False avoids having to set num_workers to less than or equal to the number of shards; otherwise: ValueError: No samples found in dataset; perhaps you have fewer shards than workers.
  • We have to convert BatchFeature to a dict (dict(inputs)) for split_between_processes to work.
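
For concreteness, here is a minimal sketch of the dataset/dataloader setup these notes describe; data_path, preprocess, and collate_fn are placeholders rather than the exact code in the example script:

```python
import webdataset as wds
from torch.utils.data import DataLoader

# Any path webdataset accepts works here; this URL pattern is from the thread below.
data_path = "https://huggingface.co/datasets/pixparse/cc3m-wds/resolve/main/cc3m-train-{0000..0011}.tar"

def preprocess(sample):
    # Placeholder for the example script's per-sample preprocessing.
    return sample

def collate_fn(batch):
    # Placeholder for the example script's collate function.
    return batch

# nodesplitter=None avoids "you need to add an explicit nodesplitter to your input
# pipeline for multi-node training"; Accelerate's split_between_processes handles the
# per-GPU split instead. empty_check=False avoids "No samples found in dataset" when
# num_workers exceeds the number of shards.
dataset = (
    wds.WebDataset(data_path, nodesplitter=None, empty_check=False)
    .decode("pil")
    .map(preprocess)
)
dataloader = DataLoader(dataset, batch_size=12, num_workers=1, collate_fn=collate_fn)
```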

Who can review?

@sayakpaul

@sayakpaul (Member) left a comment:

Thank you for working on this!

Could we perhaps make data_path configurable such that it accepts the paths supported by webdataset?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@hlky force-pushed the distributed-inference-florence2 branch from 1049d75 to 906aa8f on September 20, 2024 09:32
@hlky (Contributor Author) commented on Sep 20, 2024

I've added CLI args using Fire. --data_path can be any of the paths supported by webdataset, e.g. https://huggingface.co/datasets/pixparse/cc3m-wds/resolve/main/cc3m-train-{0000..0011}.tar, /local/dir/shard-{0000..0011}.tar, or pipe:aws s3 cp s3://data/shard-{0000..0011}.tar - (a rough sketch of the entry point is below).
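
For illustration, a rough sketch of the Fire-based entry point; the parameter names and defaults here are indicative and may not match the merged script exactly:

```python
import fire


def main(
    data_path: str = "https://huggingface.co/datasets/pixparse/cc3m-wds/resolve/main/cc3m-train-{0000..0011}.tar",
    output_dir: str = "outputs",
    batch_size: int = 12,
    prompt: str = "<CAPTION>",
):
    # data_path accepts anything webdataset understands: an https URL with
    # brace-expanded shards, a local pattern like /local/dir/shard-{0000..0011}.tar,
    # or a pipe such as "pipe:aws s3 cp s3://data/shard-{0000..0011}.tar -".
    ...


if __name__ == "__main__":
    # Launched with e.g.: accelerate launch florence2.py --data_path ... --prompt "<CAPTION>"
    fire.Fire(main)
```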

@sayakpaul (Member) left a comment:

Thanks! I left two minor comments.

examples/inference/distributed/florence2.py (outdated)


if __name__ == "__main__":
    fire.Fire(main)
Member:

It would be nice to add some notes above the script about how long it takes to produce the artifacts (maybe on a single shard)?

Contributor Author:

Sure, I'll do some benchmarking. Any preference on GPU model and count?

Member:

I think as long as the script runs on 2 GPUs (maybe 4090s), it should be okay. But I will let @muellerzr comment further.

Contributor Author:

I've tested this on 2x4090. It appears to be CPU-bound due to sha1 and pil_to_bytes in preprocess, Image.open, and running processor on the batch in collate.

Testing with 1 shard, https://huggingface.co/datasets/pixparse/cc3m-wds/resolve/main/cc3m-train-0000.tar, batch size 12 (6 per GPU), and the <CAPTION> task:

420it [10:33,  1.51s/it]

Compared to my project florence-tool using 1 GPU:

5046it [06:22, 13.18it/s]

In that project I save the results with __key__ and __url__, use the processor in preprocess, then concat in collate, and don't re-save the image (roughly as sketched below), although I do need to test whether the processor is more efficient in preprocess or collate. Note: I increment the progress bar by batch size rather than batch index.
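
A rough sketch of that preprocess/collate split, assuming the standard Florence-2 processor and webdataset samples decoded with a "jpg" image key; names here are illustrative, not the exact florence-tool code:

```python
import torch
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
prompt = "<CAPTION>"


def preprocess(sample):
    # Run the processor per sample here instead of over the whole batch in collate,
    # and keep __key__/__url__ so each result can be traced back to its source shard.
    inputs = processor(text=prompt, images=sample["jpg"], return_tensors="pt")
    return {
        "__key__": sample["__key__"],
        "__url__": sample["__url__"],
        "input_ids": inputs["input_ids"],
        "pixel_values": inputs["pixel_values"],
    }


def collate_fn(batch):
    # Only concatenate already-processed tensors; the image bytes are never re-encoded.
    return {
        "__key__": [s["__key__"] for s in batch],
        "__url__": [s["__url__"] for s in batch],
        "input_ids": torch.cat([s["input_ids"] for s in batch]),
        "pixel_values": torch.cat([s["pixel_values"] for s in batch]),
    }
```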

Testing a version of this script that does the same gets roughly the same speed, accounting for 1 vs 2 GPUs:

420it [03:16,  2.13it/s]

WDYT?

Also the last partial batch is missing in this script:

ls -f1 outputs/ | grep .json | wc -l
5040

Looks like we'd need to use apply_padding=True in split_between_processes.
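
For reference, a minimal sketch of that change; inputs and model stand in for the objects already built in the script:

```python
from accelerate import Accelerator

accelerator = Accelerator()

# inputs is the dict(BatchFeature) from the collate step; model is the prepared
# Florence-2 model. apply_padding=True repeats the last element so every process
# receives the same number of items, so the trailing partial batch is not dropped
# (the duplicated results can be skipped when saving).
with accelerator.split_between_processes(inputs, apply_padding=True) as batch:
    generated_ids = model.generate(**batch)
```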

Member:

Thanks!

Not serializing the images is fine. I think it depends on how one is ultimately using the datasets in their workflows. But how are we doing the captions? They should be serialized, right?

What I usually do is first serialize the images and captions, and then package them as webdataset shards. I think we could directly do that here instead of serializing the captions separately (roughly as sketched below). WDYT?
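
If we went that route, a rough sketch with webdataset's ShardWriter; the results iterable and the shard name pattern are assumptions, not part of this PR:

```python
import io

import webdataset as wds

# results is assumed to be an iterable of (key, PIL image, generated caption).
with wds.ShardWriter("outputs/florence2-%04d.tar", maxcount=5000) as writer:
    for key, image, caption in results:
        buffer = io.BytesIO()
        image.save(buffer, format="JPEG")
        writer.write({
            "__key__": key,            # keeps the original webdataset key
            "jpg": buffer.getvalue(),  # serialized image
            "txt": caption,            # generated caption
        })
```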

Additionally, for speeding up the PIL work, I find https://github.com/uploadcare/pillow-simd to be essential.

Contributor Author:

I've pushed an optimized version that still saves the images and original captions; it seems the issue was pil_to_bytes and Image.open. See the diff here.

Performance:

On 2x4090: 420it [03:15,  2.15it/s] (~25.8 images/s)

With --prompt "<DETAILED_CAPTION>": 420it [08:16,  1.18s/it] (~10.17 images/s)

Member:

Nice. IIUC, is this the same performance you obtained with your tool?

Additionally, would it be possible to check if pillow-simd leads to any speedup?

Contributor Author:

Yes it's the same performance.

No improvement using pillow-simd.

@hlky force-pushed the distributed-inference-florence2 branch from 906aa8f to 8fc5c63 on September 20, 2024 11:23
@hlky force-pushed the distributed-inference-florence2 branch from 9767e70 to 21bef83 on October 6, 2024 08:50
@muellerzr (Collaborator) left a comment:

Thanks for this wonderful example (and for your patience while I was OOO for a bit)! Code-wise this looks solid; just two suggestions regarding the documentation at the top :)

Two review comments on examples/inference/distributed/florence2.py (outdated, resolved)
@hlky force-pushed the distributed-inference-florence2 branch from 21bef83 to 9222c43 on October 9, 2024 09:32
@sayakpaul (Member) commented:

Thanks for your contributions here, @hlky! I didn't recognize you at first, but after huggingface/diffusers#9509 I can immediately recall!

@sayakpaul (Member) commented:

@muellerzr is the failing test related?

@muellerzr (Collaborator) commented:

@sayakpaul Nope, it's not.

@muellerzr merged commit f4ee5a2 into huggingface:main on Oct 9, 2024 (24 of 25 checks passed)