
[WIP] [doc] performance/scalability revamp #15213

Closed
stas00 wants to merge 9 commits

Conversation

stas00
Contributor

@stas00 stas00 commented Jan 18, 2022

@lvwerra and I are working on a massive performance/scalability docs revamp:

So the rough plan is to make a custom guide for each of the combinations [inference|training] * [1 gpu|many gpus|cpu], so that it's very easy for the user to follow instructions specific to their needs.

So the proposed doc layout is:

  • performance.mdx (main entry point)
  • perf_infer.mdx
    • perf_infer_cpu.mdx
    • perf_infer_gpu_many.mdx
    • perf_infer_gpu_one.mdx
  • perf_train.mdx
    • perf_train_gpu_many.mdx
    • perf_train_gpu_one.mdx
  • scalability.mdx (rename from parallelism.mdx) (XXX: to do)

See the PR's changes for a rough layout of the content.

One big question is this: At the moment everything is pytorch-centric, as we don't have any info on tf/flax. Down the road we will either inject tf/flax-specific instructions into the current docs, or perhaps it'd be better to have dedicated docs for pt/tf/flax. It'd help a lot to decide ahead of time, to avoid document renaming and potentially breaking links. If we plan to keep these PT-specific, perhaps we should embed _pt in the filenames?

@lvwerra

@stas00 stas00 changed the title [doc] performance/scalability revamp [WIP] [doc] performance/scalability revamp Jan 18, 2022
@stas00 stas00 marked this pull request as draft January 18, 2022 22:43
@lvwerra
Member

lvwerra commented Jan 19, 2022

Hi @stas00

Thanks for shaping the structure - this is looking great! I have been thinking about this a bit more and have two main comments:

  • Not sure if you copy-pasted the subsections or if they are supposed to be like that. There is a lot of redundancy, and I think if we e.g. explain mixed precision schemes in the single GPU section we don't need to explain it again in the multi-GPU section. And I would suggest to the reader that they first look at the single GPU section, as the methods carry over to the multi-GPU case.
  • Performance vs. scalability: the only parallelism strategy that is currently natively supported with the Trainer and accelerate is data parallelism, with the exception of ZeRO with DeepSpeed. What do you think about outsourcing the theoretical parallelism parts to a blog post and keeping only the aspects in the docs that can be used natively in transformers? I think we could then also merge perf_train_gpu_many.mdx and scalability.mdx into one section where we highlight for each technique whether it helps performance or scalability.

In terms of implementation, what do you think about tackling the sections with the most need first? I'd expect single GPU training and CPU inference to be the most widely used settings, followed by multi-GPU training and GPU inference (AFAICT for many companies GPUs in prod are still out of reach). Finally, multi-GPU inference, which is probably only needed for a few companies with huge models and fast response requirements, right?

@stas00
Contributor Author

stas00 commented Jan 19, 2022

re: sub-sections

The idea is to have the map, and some of those map entries will redirect elsewhere for details. E.g. mixed precision is wanted in all paths, so we cover its theory in one place and link to it from all other places. We will need to decide whether we cover it in the first path (e.g. 1-gpu train) or in a shared doc (e.g. the current performance.mdx). I'm inclined to think the latter, since performance.mdx currently has a lot of shared info - e.g. most of the hardware notes. So this could be our theory w/ brief examples - precisely as it is now. And then on each specific path (e.g. many-gpu train) we link to the overview in the main doc and include recipes for how to use it on this particular path, with code, etc.

re: Performance vs. scalability

Same idea: keep the general scalability document that describes the whole domain, including parts that we don't yet have. This is an overview document. Then in each specific path doc we cover only the tech that is available to our users, with references to the main doc for theory/overview, focusing on the how-to details. That way the specific paths (scenarios) will be 100% actionable, and the general document shows a curious reader the bigger picture and perhaps even entices them to go and fill in the missing holes. Also note that all those holes will be filled sooner or later, since we are actively trying to sort this out.

re: priority/order

We first re-shuffle what we have plus your PR, and then perhaps add a bit of specific how-to that hasn't been written yet (e.g. make a really neat 1-gpu-train path) and merge this PR. Then gradually improve the other sections.

@lvwerra
Member

lvwerra commented Jan 24, 2022

That sounds good to me - let's discuss this with the rest of the team then.

@stas00
Contributor Author

stas00 commented Feb 16, 2022

Progress made:

  • beefed up the main perf doc - the reference that the other docs point to
  • started working on a "model" file, docs/source/perf_train_gpu_one.mdx, for the first of the 5 main scenarios we mapped out - once it's polished we continue with the rest.

Need your input:

  • please review docs/source/perf_train_gpu_one.mdx and let me know if this feels good or whether a different approach should be taken. As you can see, since there is going to be a lot of duplication, I'm offloading all the why's to performance.mdx and leaving only the how-to information specific to each situation/scenario. I'm of course open to a change of course, but let's sort out one doc first and then replicate the same structure across the other 4 docs.
  • I don't know anything about accelerate, so if @sgugger / @lvwerra could fill in the gaps, that would be great. (marked with XXX: Sylvain/Leandro)

I'd say for now please ignore any spelling, grammar, or other minor details, as it's likely that many of these sections will be re-written completely - please don't waste your time on premature editing. We will do it at the very end, when y'all are happy with the first doc.

Thank you!

p.s. Also, how do we make the doc builder add its link, as this PR was created before that feature was added? I could start a new PR if that's easier, since there is no commentary to preserve yet. I think the ability to read the rendered doc will help a lot in producing better documentation.

Collaborator

@sgugger sgugger left a comment


Thanks for drafting this! I've made a high-level review as requested. I won't have time to fill in the Accelerate bits until next week, so if you have some time to do it @lvwerra, don't hesitate!

@@ -63,6 +63,20 @@
title: 'Performance and Scalability: How To Fit a Bigger Model and Train It Faster'
- local: parallelism
title: Model Parallelism
- local: perf_infer
Collaborator

Let's make a subfolder instead of having so many document names prefixed with perf.

Contributor Author

great idea


### fp16 / bf16

Enabling mixed precision will make your training both faster and use less memory. The science of it is explained [here](performance#fp16) and [here](performance#bf16).
Collaborator

We should remove the "use less memory" as it's only the case when the batch size is big. We already have enough issues opened by users complaining it uses more memory for large models with batch size 1.

Contributor Author

Very good call. It's because fp16 is touted everywhere as saving memory, but most of the time it doesn't, since keeping both the fp16 and fp32 versions of the weights takes 0.5x extra memory.
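
For context, with the Trainer enabling mixed precision is a one-flag change. A minimal sketch, assuming an NVIDIA GPU and a recent `transformers` (the `output_dir` value is just a placeholder):

```python
from transformers import TrainingArguments

# fp16 runs on most NVIDIA GPUs; bf16 needs Ampere or newer hardware
args = TrainingArguments(
    output_dir="output",  # placeholder
    fp16=True,            # or bf16=True - pick one, not both
)
```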


First, a quick decision tree:

1. Model fits onto a single GPU and you have enough space to fit a small batch size - you don't need to use DeepSpeed as it'll only slow things down in this use case.
Collaborator

I would add "for training" as you need space for the gradients, optimizer states etc.
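
To make the "for training" point concrete, here's a back-of-the-envelope sketch; the per-parameter byte counts are the usual mixed-precision AdamW estimates, not measurements:

```python
def train_mem_gib(n_params: float) -> float:
    """Rough training-memory floor for mixed-precision AdamW, excluding activations."""
    # per parameter: 2 bytes fp16 weights + 2 bytes fp16 grads
    # + 12 bytes optimizer states (fp32 master weights, momentum, variance)
    return n_params * (2 + 2 + 12) / 2**30

print(f"{train_mem_gib(1.5e9):.1f} GiB")  # ~22.4 GiB for a 1.5B model, vs ~2.8 GiB for fp16 inference
```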

Comment on lines +173 to +174
- Custom training loop: This is somewhat complex but you can study how this is implemented in [HF Trainer](
https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) - simply search for `batch_size` in the code.
Collaborator

This is the argument passed along to the dataloader, so not complex at all :-)
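
For illustration, in a custom loop it is indeed just the `DataLoader` argument; a minimal sketch where `train_dataset` stands in for whatever dataset object you already have:

```python
from torch.utils.data import DataLoader

# the batch size is set on the DataLoader that feeds the loop
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

for batch in train_dataloader:
    ...  # usual forward/backward
```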

Comment on lines +178 to +184
### Optimizer

The choice of optimizer can impact throughput. For example, using the fused AdamW optimizer from [NVIDIA/apex](https://github.com/NVIDIA/apex) will be faster than the same optimizer from `torch`.

The science of optimizer speed, and which optimizer to choose when, is explained [here](performance#optimizer).

To activate it, please see the Optimizer sub-section in the "Less Memory" section of this document (XXX: how to link?)
Collaborator

There is already a section on optimizers?

Contributor Author

yes - expanded it in this PR
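
For reference, the apex optimizer the hunk mentions is a drop-in replacement for `torch.optim.AdamW`. A minimal sketch, assuming apex was built with its CUDA extensions and `model` is already instantiated:

```python
import apex

# FusedAdam fuses the per-tensor update kernels into one launch;
# it defaults to AdamW behavior (adam_w_mode=True)
optimizer = apex.optimizers.FusedAdam(model.parameters(), lr=5e-5)
```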

@lvwerra
Member

lvwerra commented Feb 17, 2022

Hi @stas00 thanks for starting to work on this!

First, I am not sure about the doc builder - maybe it is indeed easiest to just create a new PR.

A few high level comments on the current structure:

  • Since performance.mdx is the main entry point, I think its main purpose should just be to give an overview of all the subsections and maybe a guide on how to read them.
  • My understanding was that we would essentially move the guide added in add model scaling section #15119 from performance.mdx to the perf_train_gpu_one.mdx section and properly merge it with the material that was already there and is currently appended at the end of the document.
  • Regarding the separation of why and how: I am actually not in favour of that because I think it is easier for a user if both are together. Explain a concept and directly show how it is done, otherwise there will be a lot of switching between the two.

What do you think? If you'd like I can have a stab at it.

@stas00
Contributor Author

stas00 commented Feb 17, 2022

> Hi @stas00 thanks for starting to work on this!
>
> First, I am not sure about the doc builder - maybe it is indeed easiest to just create a new PR.
>
> A few high level comments on the current structure:
>
> • Since performance.mdx is the main entry point, I think its main purpose should just be to give an overview of all the subsections and maybe a guide on how to read them.

Indeed, we can move the in-depth reference to another file and refer to it instead. For now it's just easier to xref to it.

> • My understanding was that we would essentially move the guide added in add model scaling section #15119 from performance.mdx to the perf_train_gpu_one.mdx section and properly merge it with the material that was already there and is currently appended at the end of the document.

Indeed, something like that. I was just trying to use a few sections to lay out a possible structure before we fill the gaps in.

> • Regarding the separation of why and how: I am actually not in favour of that because I think it is easier for a user if both are together. Explain a concept and directly show how it is done, otherwise there will be a lot of switching between the two.

That would mean many sections get duplicated at least 5 times, and in some cases more than that - as you can see, I have 2 near-identical sub-sections for several entries, since those impact both speed and memory, but the explanations are slightly different.

> What do you think? If you'd like I can have a stab at it.

Sure, please feel free to shift things around and propose a different approach.

Let me know if you prefer that I open a new PR first, but then I will need to integrate Sylvain's suggestions there, and I'm a bit too busy with BigScience at the moment. So it's your call.

I can of course integrate them in the new PR as well, I won't forget.

@lvwerra
Member

lvwerra commented Feb 18, 2022

Awesome, thanks for clarifying - it is sometimes hard to read the intentions :)

If you could open a new PR that would be great and I can work on it a bit next week and also integrate Sylvain's comments (or make notes).

Thanks!

@stas00
Contributor Author

stas00 commented Feb 18, 2022

The PR has been moved to #15723

Will address all the suggestions so far in the new PR.

@stas00 stas00 closed this Feb 18, 2022
@stas00 stas00 deleted the doc-perf-revamp branch April 27, 2022 16:05