Florence2 distributed inference example #3123
Conversation
Thank you for working on this!
Could we perhaps make the data_path configurable such that it accepts the supported paths by webdataset
?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from 1049d75 to 906aa8f
I've added CLI args using `fire`.
Thanks! I left two minor comments.
```python
if __name__ == "__main__":
    fire.Fire(main)
```
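For illustration, here is a minimal sketch of how a `fire` entry point like this exposes CLI args; the `main` signature, its defaults, and the `build_cli` helper are hypothetical, not the PR's actual code:

```python
def main(data_path: str = "cc3m-train-0000.tar", batch_size: int = 12,
         prompt: str = "<CAPTION>"):
    # Hypothetical entry point: fire turns each keyword argument into a flag,
    # e.g. `python script.py --data_path shard.tar --batch_size 6`.
    # The real script would build the WebDataset loader and run inference here;
    # this sketch just returns the parsed config so the wiring is visible.
    return {"data_path": data_path, "batch_size": batch_size, "prompt": prompt}


def build_cli():
    # In the script this lives under `if __name__ == "__main__": fire.Fire(main)`.
    import fire  # third-party: pip install fire
    fire.Fire(main)
```

Calling `python script.py --batch_size 6` would then invoke `main(batch_size=6)`, with the remaining arguments taking their defaults.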
It would be nice to add some notes above the script about the time it takes to generate the artifacts (maybe on a single shard).
Sure, I'll do some benchmarking. Any preference on GPU model and count?
I think as long as the script runs on 2 GPUs (maybe 4090s), it should be okay. But I will let @muellerzr comment further.
I've tested this on 2x4090. It appears to be CPU-bound due to `sha1` and `pil_to_bytes` in `preprocess`, and `Image.open` and the `processor` on the batch in `collate`.

Testing with 1 shard (https://huggingface.co/datasets/pixparse/cc3m-wds/resolve/main/cc3m-train-0000.tar), batch size 12 (6 per GPU), and the `<CAPTION>` task:

`420it [10:33, 1.51s/it]`
Compared to my project florence-tool using 1 GPU:

`5046it [06:22, 13.18it/s]`

In that project I save the results with `__key__` and `__url__`, use the `processor` in `preprocess` then concat in `collate`, and don't re-save the image, although I do need to test whether the `processor` is more efficient in `preprocess` or `collate`. Note: I increment the progress bar by batch size rather than batch index.

Testing a version of this script that does the same gets roughly the same speed, accounting for 1 vs 2 GPUs:

`420it [03:16, 2.13it/s]`

WDYT?
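The progress-bar note above (advance by batch size, not batch index) can be sketched with a minimal tqdm-like stand-in, so the counter reports samples rather than batches:

```python
class Progress:
    # Minimal stand-in for tqdm: just tracks a running sample count.
    def __init__(self):
        self.n = 0

    def update(self, k=1):
        self.n += k


def run(batches):
    pbar = Progress()
    for batch in batches:
        # ... model inference on the batch would happen here ...
        pbar.update(len(batch))  # advance by samples processed, not by 1
    return pbar.n
```

With real tqdm the same idea is `pbar.update(len(batch))` inside the loop, which makes the displayed rate images/s instead of batches/s.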
Also, the last partial batch is missing in this script:

```shell
ls -f1 outputs/ | grep .json | wc -l
5040
```

Looks like we'd need to use `apply_padding=True` in `split_between_processes`.
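As a rough stdlib illustration of the padding idea (a hypothetical `split_between` helper, not Accelerate's actual implementation): an equal-length split can drop the trailing remainder, while padding with the last element keeps every sample on some process:

```python
import math


def split_between(items, num_processes, apply_padding=False):
    # Hypothetical sketch of splitting work across processes. With
    # apply_padding=True, the list is padded with its last element so every
    # process receives the same number of items and no sample is dropped.
    if apply_padding and items:
        target = math.ceil(len(items) / num_processes) * num_processes
        items = items + [items[-1]] * (target - len(items))
    chunk = len(items) // num_processes
    return [items[i * chunk:(i + 1) * chunk] for i in range(num_processes)]
```

With 10 items across 4 processes, the unpadded split loses the last 2 items, while the padded split duplicates the final item so all 10 are processed (duplicates can be deduplicated afterwards by key).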
Thanks!
Not serializing the images is fine. I think it's dependent on how one is ultimately using the datasets in their workflows. But how are we handling the captions? They should be serialized, right?

What I usually do is first serialize the images and captions, and then package them as `webdataset` shards. I think we could directly do that here instead of serializing the captions. WDYT?
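To make the packaging idea concrete, here is a stdlib-only sketch of writing one webdataset-style shard: a tar file where each sample's fields are entries named `<key>.<extension>`. (In practice `webdataset.ShardWriter` does this, with size-based shard rotation; the `write_shard` helper below is hypothetical.)

```python
import io
import tarfile


def write_shard(path, samples):
    # samples: iterable of dicts like {"__key__": "0001", "jpg": bytes, "txt": "caption"}.
    # Each non-key field becomes a tar entry named <key>.<extension>, which is
    # the layout webdataset expects when reading shards back.
    with tarfile.open(path, "w") as tar:
        for sample in samples:
            key = sample["__key__"]
            for ext, value in sample.items():
                if ext == "__key__":
                    continue
                data = value if isinstance(value, bytes) else str(value).encode("utf-8")
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
```

A shard written this way can be read back directly with `wds.WebDataset("shard-0000.tar")`, since the shared-key naming is all webdataset needs to group fields into samples.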
Additionally, for speeding up the PIL stuff, I find https://github.com/uploadcare/pillow-simd to be invaluable.
I've pushed an optimized version that still saves the images and original captions; it seems the issue was `pil_to_bytes` and `Image.open`. See the diff here.

Performance:

- On 2x4090: `420it [03:15, 2.15it/s]` (~25.8 images/s)
- With `--prompt "<DETAILED_CAPTION>"`: `420it [08:16, 1.18s/it]` (~10.17 images/s)
Nice. IIUC, is this the same performance you obtained with your tool?
Additionally, would it be possible to check if pillow-simd leads to any speedup?
Yes, it's the same performance.
No improvement using pillow-simd.
Force-pushed from 906aa8f to 8fc5c63
Force-pushed from 9767e70 to 21bef83
Thanks for this wonderful example! (And patience while I was OOO for a bit.) Code-wise this looks solid; just two suggestions regarding the documentation at the top :)
Force-pushed from 21bef83 to 9222c43
Thanks for your contributions here, @hlky! I didn't recognize you at first, but after huggingface/diffusers#9509, I can immediately recall!
@muellerzr is the failing test related?
@sayakpaul nope it's not
What does this PR do?
This PR adds a distributed inference example for Florence-2.
#3078
Notes
Based on https://gist.github.com/sayakpaul/dfb26d6562b62ba8c2f613dc912a28a8#file-generate_captions_multigpu-py and https://gist.github.com/sayakpaul/cfaebd221820d7b43fae638b4dfa01ba
- `nodesplitter=None` in `wds.WebDataset` seems to be an important implementation detail: by default I'm getting `ValueError: you need to add an explicit nodesplitter to your input pipeline for multi-node training`, and with `wds.split_by_node` only 1 GPU is utilized and the script hangs.
- `empty_check=False` avoids the need to set `num_workers` to less than or equal to the number of shards; otherwise: `ValueError: No samples found in dataset; perhaps you have fewer shards than workers.`
- `BatchFeature` must be converted to a dict (`dict(inputs)`) for `split_between_processes` to work.

Who can review?
@sayakpaul