
Florence2 distributed inference example #3123

Merged — 3 commits from hlky:distributed-inference-florence2 merged into huggingface:main on Oct 9, 2024

Conversation

@hlky (Contributor) commented on Sep 19, 2024

What does this PR do?

This PR adds a distributed inference example for Florence-2.

#3078

Notes

Based on https://gist.github.com/sayakpaul/dfb26d6562b62ba8c2f613dc912a28a8#file-generate_captions_multigpu-py and https://gist.github.com/sayakpaul/cfaebd221820d7b43fae638b4dfa01ba

  • nodesplitter=None in wds.WebDataset seems to be an important implementation detail: by default I get ValueError: you need to add an explicit nodesplitter to your input pipeline for multi-node training, and with wds.split_by_node only 1 GPU is utilized and the script hangs (see the sketch after this list).
  • Similarly, empty_check=False avoids having to set num_workers to less than or equal to the number of shards; otherwise: ValueError: No samples found in dataset; perhaps you have fewer shards than workers.
  • We have to convert BatchFeature to a dict (dict(inputs)) for split_between_processes to work.
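
For concreteness, here is a minimal sketch of the dataset/dataloader setup these notes describe; data_path, preprocess, and collate_fn are placeholders rather than the exact code in the example script:

```python
import webdataset as wds
from torch.utils.data import DataLoader

# Any path webdataset accepts works here; this URL pattern is from the thread below.
data_path = "https://huggingface.co/datasets/pixparse/cc3m-wds/resolve/main/cc3m-train-{0000..0011}.tar"

def preprocess(sample):
    # Placeholder for the example script's per-sample preprocessing.
    return sample

def collate_fn(batch):
    # Placeholder for the example script's collate function.
    return batch

# nodesplitter=None avoids "you need to add an explicit nodesplitter to your input
# pipeline for multi-node training"; Accelerate's split_between_processes handles the
# per-GPU split instead. empty_check=False avoids "No samples found in dataset" when
# num_workers exceeds the number of shards.
dataset = (
    wds.WebDataset(data_path, nodesplitter=None, empty_check=False)
    .decode("pil")
    .map(preprocess)
)
dataloader = DataLoader(dataset, batch_size=12, num_workers=1, collate_fn=collate_fn)
```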

Who can review?

@sayakpaul

@sayakpaul (Member) left a comment:

Thank you for working on this!

Could we perhaps make data_path configurable such that it accepts the paths supported by webdataset?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@hlky force-pushed the distributed-inference-florence2 branch from 1049d75 to 906aa8f on September 20, 2024 09:32
@hlky (Contributor Author) commented on Sep 20, 2024

I've added CLI args using Fire. --data_path can be any of the paths supported by webdataset, e.g. https://huggingface.co/datasets/pixparse/cc3m-wds/resolve/main/cc3m-train-{0000..0011}.tar, /local/dir/shard-{0000..0011}.tar, or pipe:aws s3 cp s3://data/shard-{0000..0011}.tar - (a rough sketch of the entry point is below).
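
For illustration, a rough sketch of the Fire-based entry point; the parameter names and defaults here are indicative and may not match the merged script exactly:

```python
import fire


def main(
    data_path: str = "https://huggingface.co/datasets/pixparse/cc3m-wds/resolve/main/cc3m-train-{0000..0011}.tar",
    output_dir: str = "outputs",
    batch_size: int = 12,
    prompt: str = "<CAPTION>",
):
    # data_path accepts anything webdataset understands: an https URL with
    # brace-expanded shards, a local pattern like /local/dir/shard-{0000..0011}.tar,
    # or a pipe such as "pipe:aws s3 cp s3://data/shard-{0000..0011}.tar -".
    ...


if __name__ == "__main__":
    # Launched with e.g.: accelerate launch florence2.py --data_path ... --prompt "<CAPTION>"
    fire.Fire(main)
```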

@sayakpaul (Member) left a comment:

Thanks! I left two minor comments.

examples/inference/distributed/florence2.py (outdated)


if __name__ == "__main__":
    fire.Fire(main)
Member:

It would be nice to add some notes above the script about how long it takes to produce the artifacts (maybe on a single shard)?

Contributor Author:

Sure, I'll do some benchmarking. Any preference on GPU model and count?

Member:

I think as long as the script runs on 2 GPUs (maybe 4090s), it should be okay. But I will let @muellerzr comment further.

Contributor Author:

I've tested this on 2x4090. It appears to be CPU-bound due to sha1 and pil_to_bytes in preprocess, Image.open, and running processor on the batch in collate.

Testing with 1 shard, https://huggingface.co/datasets/pixparse/cc3m-wds/resolve/main/cc3m-train-0000.tar, batch size 12 (6 per GPU), and the <CAPTION> task:

420it [10:33,  1.51s/it]

Compared to my project florence-tool using 1 GPU:

5046it [06:22, 13.18it/s]

In that project I save the results with __key__ and __url__, use the processor in preprocess, then concat in collate, and don't re-save the image (roughly as sketched below), although I do need to test whether the processor is more efficient in preprocess or collate. Note: I increment the progress bar by batch size rather than batch index.
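
A rough sketch of that preprocess/collate split, assuming the standard Florence-2 processor and webdataset samples decoded with a "jpg" image key; names here are illustrative, not the exact florence-tool code:

```python
import torch
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
prompt = "<CAPTION>"


def preprocess(sample):
    # Run the processor per sample here instead of over the whole batch in collate,
    # and keep __key__/__url__ so each result can be traced back to its source shard.
    inputs = processor(text=prompt, images=sample["jpg"], return_tensors="pt")
    return {
        "__key__": sample["__key__"],
        "__url__": sample["__url__"],
        "input_ids": inputs["input_ids"],
        "pixel_values": inputs["pixel_values"],
    }


def collate_fn(batch):
    # Only concatenate already-processed tensors; the image bytes are never re-encoded.
    return {
        "__key__": [s["__key__"] for s in batch],
        "__url__": [s["__url__"] for s in batch],
        "input_ids": torch.cat([s["input_ids"] for s in batch]),
        "pixel_values": torch.cat([s["pixel_values"] for s in batch]),
    }
```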

Testing a version of this script that does the same gets roughly the same speed, accounting for 1 vs 2 GPUs:

420it [03:16,  2.13it/s]

WDYT?

Also the last partial batch is missing in this script:

ls -f1 outputs/ | grep .json | wc -l
5040

Looks like we'd need to use apply_padding=True in split_between_processes.
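
For reference, a minimal sketch of that change; inputs and model stand in for the objects already built in the script:

```python
from accelerate import Accelerator

accelerator = Accelerator()

# inputs is the dict(BatchFeature) from the collate step; model is the prepared
# Florence-2 model. apply_padding=True repeats the last element so every process
# receives the same number of items, so the trailing partial batch is not dropped
# (the duplicated results can be skipped when saving).
with accelerator.split_between_processes(inputs, apply_padding=True) as batch:
    generated_ids = model.generate(**batch)
```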

Member:

Thanks!

Not serializing the images is fine. I think it depends on how one is ultimately using the datasets in their workflows. But how are we doing the captions? They should be serialized, right?

What I usually do is first serialize the images and captions, and then package them as webdataset shards. I think we could directly do that here instead of serializing the captions separately (roughly as sketched below). WDYT?
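
If we went that route, a rough sketch with webdataset's ShardWriter; the results iterable and the shard name pattern are assumptions, not part of this PR:

```python
import io

import webdataset as wds

# results is assumed to be an iterable of (key, PIL image, generated caption).
with wds.ShardWriter("outputs/florence2-%04d.tar", maxcount=5000) as writer:
    for key, image, caption in results:
        buffer = io.BytesIO()
        image.save(buffer, format="JPEG")
        writer.write({
            "__key__": key,            # keeps the original webdataset key
            "jpg": buffer.getvalue(),  # serialized image
            "txt": caption,            # generated caption
        })
```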

Additionally, for speeding up the PIL work, I find https://github.com/uploadcare/pillow-simd to be essential.

Contributor Author:

I've pushed an optimized version that still saves the images and original captions; it seems the issue was pil_to_bytes and Image.open. See the diff here.

Performance:

On 2x4090: 420it [03:15,  2.15it/s] (~25.8 images/s)

With --prompt "<DETAILED_CAPTION>": 420it [08:16,  1.18s/it] (~10.17 images/s)

Member:

Nice. IIUC, is this the same performance you obtained with your tool?

Additionally, would it be possible to check if pillow-simd leads to any speedup?

Contributor Author:

Yes it's the same performance.

No improvement using pillow-simd.

@hlky force-pushed the distributed-inference-florence2 branch from 906aa8f to 8fc5c63 on September 20, 2024 11:23
@hlky force-pushed the distributed-inference-florence2 branch from 9767e70 to 21bef83 on October 6, 2024 08:50
@muellerzr (Collaborator) left a comment:

Thanks for this wonderful example (and for your patience while I was OOO for a bit)! Code-wise this looks solid; just two suggestions regarding the documentation at the top :)

Two review comments on examples/inference/distributed/florence2.py (outdated, resolved)
@hlky force-pushed the distributed-inference-florence2 branch from 21bef83 to 9222c43 on October 9, 2024 09:32
@sayakpaul (Member) commented:

Thanks for your contributions here, @hlky! I didn't recognize you at first, but after huggingface/diffusers#9509 I can immediately recall!

@sayakpaul (Member) commented:

@muellerzr is the failing test related?

@muellerzr (Collaborator) commented:

@sayakpaul Nope, it's not.

@muellerzr merged commit f4ee5a2 into huggingface:main on Oct 9, 2024 (24 of 25 checks passed)