[WIP] [doc] performance/scalability revamp #15213
Conversation
Hi @stas00, thanks for shaping the structure - this is looking great! I have been thinking about this a bit more and have two main comments:
In terms of implementation, what do you think about tackling the sections with the most need first? I'd expect single-GPU training and CPU inference to be the most widely used settings, followed by multi-GPU training and GPU inference (AFAICT for many companies GPUs in prod are still out of reach). Finally, multi-GPU inference, which is probably only needed for a few companies with huge models and fast response requirements, right?
re: sub-sections - The idea is to have the map, and some of those map entries will redirect elsewhere for details. E.g. mixed precision is wanted in all paths, so we cover its theory in one place and link to it from all other places. We will need to decide whether we cover it in the first path (e.g. 1-gpu train) or in a shared doc (e.g. the current performance.mdx). I'm inclined to think the latter.

re: Performance vs. scalability - Same idea: keep the general scalability document that describes the whole domain, including parts that we don't yet have. This is an overview document. Then in each specific path doc we cover only the tech that is available to our users, with references to the main doc for theory/overview, focusing on the how-to details. That way the specific paths (scenarios) will be 100% actionable, and the general document shows a curious reader the bigger picture and perhaps even entices them to go and fill the missing holes. Also note that all those missing holes will be filled out sooner or later, since we are actively trying to sort it out.

re: priority/order - We first re-shuffle what we have and your PR, and then perhaps add a bit of specific how-to that hasn't been written yet (e.g. make a really neat 1-gpu-train path) and merge this PR. Then gradually improve the other sections.
That sounds good to me - let's discuss this with the rest of the team then.
Progress made:
Need your input:
I'd say for now please ignore any spelling, grammar, or other minor details, as it's likely that many of these sections will be completely re-written - please don't waste your time on premature editing. We will do it at the very end, when y'all are happy with the first doc. Thank you!

p.s. Also, how do we make the doc builder add its link, as this PR was created before that feature was added? I could start a new PR if it's easier, since there is no commentary to preserve yet. I think the ability to read the rendered doc will help a lot in producing better documentation.
Thanks for drafting this! I've made a high-level review as requested. I won't have time to fill the Accelerate bits until next week, so if you have some time to do it @lvwerra, don't hesitate!
@@ -63,6 +63,20 @@
   title: 'Performance and Scalability: How To Fit a Bigger Model and Train It Faster'
 - local: parallelism
   title: Model Parallelism
 - local: perf_infer
Let's make a subfolder instead of having so many document names prefixed with `perf`.
great idea
### fp16 / bf16

Enabling mixed precision will make your training faster and use less memory. The science of it is explained [here](performance#fp16) and [here](performance#bf16).
We should remove the "use less memory" as it's only the case when the batch size is big. We already have enough issues opened by users complaining it uses more memory for large models with batch size 1.
Very good call. fp16 is touted everywhere as saving memory, but most of the time it doesn't, since keeping both the fp16 and fp32 versions of the weights takes 0.5x extra memory.
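To make the weight arithmetic concrete: with AMP-style mixed precision, a model with N parameters holds 4N bytes of fp32 master weights plus a 2N-byte fp16 working copy, i.e. 6N bytes instead of 4N - a 1.5x weight footprint. The savings come from smaller activations and the larger batch sizes they allow. As a minimal sketch, here is how mixed precision is switched on via the `Trainer` API (argument names as in recent `transformers` releases):

```python
from transformers import TrainingArguments

# fp16=True turns on AMP mixed precision inside Trainer;
# bf16=True is the alternative on Ampere or newer GPUs.
# Caveat from the discussion above: peak memory can go *up* for a
# large model at batch size 1, since fp32 master weights are kept
# alongside the fp16 copies.
training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=8,
    fp16=True,  # or bf16=True
)
```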
First, a quick decision tree:

1. Model fits onto a single GPU and you have enough space to fit a small batch size - you don't need to use Deepspeed as it'll only slow things down in this use case.
I would add "for training" as you need space for the gradients, optimizer states etc.
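As a rough, hedged illustration of why training needs that extra headroom (using the common 4 + 4 + 8 bytes-per-parameter rule of thumb for fp32 weights, gradients and Adam states; activations and overhead excluded):

```python
# Back-of-the-envelope memory estimate for full fp32 training with Adam.
# Illustrative only: activations, temporary buffers and framework
# overhead are not counted.
n_params = 1_000_000_000       # a 1B-parameter model

weights = 4 * n_params         # fp32 weights
grads = 4 * n_params           # fp32 gradients
optim_states = 8 * n_params    # Adam keeps two fp32 moments per parameter

total_gib = (weights + grads + optim_states) / 2**30
print(f"~{total_gib:.0f} GiB for model + optimizer state alone")  # ~15 GiB
```

So a model that comfortably fits on one GPU for inference may still need far more room, or sharding, for training.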
- Custom training loop: This is somewhat complex but you can study how this is implemented in [HF Trainer](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) - simply search for `batch_size` in the code.
This is the argument passed along to the dataloader, so not complex at all :-)
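In other words, in a custom loop the batch size is just the argument you pass to the `DataLoader` - roughly what `Trainer` does internally. A self-contained sketch (the toy tensors stand in for a real dataset):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for your real one.
train_dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))

# The batch size is simply the DataLoader argument.
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

for inputs, labels in train_dataloader:
    pass  # forward / backward / optimizer step would go here
```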
### Optimizer

The choice of optimizer can impact the throughput. For example, using the fused AdamW optimizer from [NVIDIA/apex](https://github.com/NVIDIA/apex) will be faster than the same optimizer from `torch`.

The science of optimizer speed, and which optimizer to choose when, is explained [here](performance#optimizer).

To activate it, please see the Optimizer section in the "Less Memory" section of this document (XXX: how to link?)
There is already a section on optimizers?
yes - expanded it in this PR
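For reference, a hedged sketch of swapping in apex's fused optimizer in place of the stock torch one (requires a CUDA build of [NVIDIA/apex](https://github.com/NVIDIA/apex) and a GPU; falls back to plain `torch.optim.AdamW` if apex isn't installed):

```python
import torch

model = torch.nn.Linear(10, 10).cuda()

try:
    # FusedAdam with adam_w_mode=True behaves like AdamW, using fused CUDA kernels.
    from apex.optimizers import FusedAdam
    optimizer = FusedAdam(model.parameters(), lr=1e-4, adam_w_mode=True)
except ImportError:
    # Stock PyTorch AdamW as the fallback.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```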
Hi @stas00, thanks for starting to work on this! First, I am not sure about the doc builder - maybe it is indeed easiest to just create a new PR. A few high-level comments on the current structure:
What do you think? If you'd like I can have a stab at it.
Indeed, we can move the in-depth reference to another file and refer to it instead. For now it's just easier to xref to it.
Indeed, something like that. I was just trying to use a few sections to lay out a possible structure before we fill the gaps in.
Which means that many sections will be duplicated at least 5 times and in some cases more than that - as you can see I have 2 identical sub-sections for several entries since those impact both speed and memory, but the explanations are slightly different.
Sure, please feel free to shift things around and propose a different approach. If you prefer that I open a new PR first, let me know, but then I will need to integrate Sylvain's suggestions and I'm a bit too busy with BigScience at the moment, so it's your call. I can of course integrate them in the new PR as well; I won't forget.
Awesome, thanks for clarifying - it is sometimes hard to read the intentions :) If you could open a new PR that would be great and I can work on it a bit next week and also integrate Sylvain's comments (or make notes). Thanks!
The PR has been moved to #15723. Will address all the suggestions so far in the new PR.
@lvwerra and I are working on a massive performance/scalability docs revamp:
So the rough plan is to make custom plans for each of the combinations
[inference|training] * [1 gpu|many gpus|cpu]
so that it's very easy for the user to follow the instructions that are specific to their needs. So the proposed doc layout is:
See the PR's changes for a rough layout of the content.
One big question is this: at the moment everything is pytorch-centric, as we don't have any info on tf/flax. Down the road we will either inject tf/flax-specific instructions into the current docs, or perhaps it'd be better to have dedicated docs for pt/tf/flax. It'd help a lot to decide ahead of time, to avoid document renaming and potentially breaking links. If we plan to have these PT-specific, perhaps let's embed `_pt` in the filenames? @lvwerra