
[WIP] [doc] performance/scalability revamp #15213

Closed
stas00 wants to merge 9 commits

Conversation

stas00
Contributor

@stas00 stas00 commented Jan 18, 2022

@lvwerra and I are working on a massive performance/scalability docs revamp:

So the rough plan is to make a custom guide for each of the combinations [inference|training] * [1 gpu|many gpus|cpu], so that it's very easy for the user to follow instructions specific to their needs.

So the proposed doc layout is:

  • performance.mdx (main entry point)
  • perf_infer.mdx
    • perf_infer_cpu.mdx
    • perf_infer_gpu_many.mdx
    • perf_infer_gpu_one.mdx
  • perf_train.mdx
    • perf_train_gpu_many.mdx
    • perf_train_gpu_one.mdx
  • scalability.mdx (rename from parallelism.mdx) (XXX: to do)

See the PR's changes for a rough layout of the content.

One big question is this: At the moment everything is pytorch-centric, as we don't have any info on tf/flax. Down the road we will either inject tf/flax-specific instructions into the current docs, or perhaps it'd be better to have dedicated docs for pt/tf/flax. It'd help a lot to decide ahead of time, to avoid document renaming and potentially breaking links. If we plan to keep these PT-specific, perhaps we should embed _pt in the filenames?

@lvwerra

@stas00 stas00 changed the title [doc] performance/scalability revamp [WIP] [doc] performance/scalability revamp Jan 18, 2022
@stas00 stas00 marked this pull request as draft January 18, 2022 22:43
@lvwerra
Member

lvwerra commented Jan 19, 2022

Hi @stas00

Thanks for shaping the structure - this is looking great! I have been thinking about this a bit more and have two main comments:

  • Not sure if you copy-pasted the subsections or if they are supposed to be like that. There is a lot of redundancy, and I think if we e.g. explain mixed precision schemes in the single GPU section we don't need to explain it again in the multi-GPU section. And I would suggest to the reader that they first look at the single GPU section, as the methods carry over to the multi-GPU case.
  • Performance vs. scalability: the only parallelism strategy that is currently natively supported with the Trainer and accelerate is data parallelism, with the exception of ZeRO with DeepSpeed. What do you think about outsourcing the theoretical parallelism parts to a blog post and keeping only the aspects in the docs that can be used natively in transformers? I think we could then also merge perf_train_gpu_many.mdx and scalability.mdx into one section where we highlight for each technique whether it helps performance or scalability.

In terms of implementation, what do you think about tackling the sections with the most need first? I'd expect single GPU training and CPU inference to be the most widely used settings, followed by multi-GPU training and GPU inference (AFAICT for many companies GPUs in prod are still out of reach). Finally, multi-GPU inference, which is probably only needed for a few companies with huge models and fast response requirements, right?

@stas00
Contributor Author

stas00 commented Jan 19, 2022

re: sub-sections

The idea is to have the map, and some of those map entries will redirect elsewhere for details. E.g. mixed precision is wanted in all paths, so we cover its theory in one place and link to it from all other places. We will need to decide whether we cover it in the first path (e.g. 1-gpu train) or in a shared doc (e.g. the current performance.mdx). I'm inclined to think the latter, since performance.mdx currently has a lot of shared info - e.g. most of the hardware notes. So this could be our theory w/ brief examples - precisely as it is now. And then on each specific path (e.g. many-gpu train) we link to the overview in the main doc and include recipes for how to use it on this particular path, with code, etc.

re: Performance vs. scalability

Same idea: keep the general scalability document that describes the whole domain, including parts that we don't yet have. This is an overview document. Then in each specific path doc we cover only the tech that is available to our users, with references to the main doc for theory/overview, focusing on the how-to details. That way the specific paths (scenarios) will be 100% actionable, and the general document shows a curious reader the bigger picture and perhaps even entices them to go and fill in the missing holes. Also note that all those holes will be filled sooner or later, since we are actively trying to sort this out.

re: priority/order

We first re-shuffle what we have plus your PR, and then perhaps add a bit of specific how-to that hasn't been written yet (e.g. make a really neat 1-gpu-train path) and merge this PR. Then gradually improve the other sections.

@lvwerra
Member

lvwerra commented Jan 24, 2022

That sounds good to me - let's discuss this with the rest of the team then.

@stas00
Contributor Author

stas00 commented Feb 16, 2022

Progress made:

  • beefed up the main perf doc - the reference that the other docs point to
  • started working on a "model" file, docs/source/perf_train_gpu_one.mdx, for the first of the 5 main scenarios we mapped out - once it's polished we continue with the rest.

Need your input:

  • please review docs/source/perf_train_gpu_one.mdx and let me know if this feels good or whether a different approach should be taken. As you can see, since there is going to be a lot of duplication, I'm offloading all the why's to performance.mdx and leaving only the how-to information specific to each situation/scenario. I'm of course open to a change of course, but let's sort out one doc first and then replicate the same structure across the other 4 docs.
  • I don't know anything about accelerate, so if @sgugger / @lvwerra could fill in the gaps, that would be great. (marked with XXX: Sylvain/Leandro)

I'd say for now please ignore any spelling, grammar, or other minor details, as it's likely that many of these sections will be re-written completely - please don't waste your time on premature editing. We will do it at the very end, when y'all are happy with the first doc.

Thank you!

p.s. Also, how do we make the doc builder add its link, as this PR was created before that feature was added? I could start a new PR if that's easier, since there is no commentary to preserve yet. I think the ability to read the rendered doc will help a lot in producing better documentation.

Collaborator

@sgugger sgugger left a comment


Thanks for drafting this! I've made a high-level review as requested. I won't have time to fill in the Accelerate bits until next week, so if you have some time to do it @lvwerra, don't hesitate!

@@ -63,6 +63,20 @@
title: 'Performance and Scalability: How To Fit a Bigger Model and Train It Faster'
- local: parallelism
title: Model Parallelism
- local: perf_infer
Collaborator

Let's make a subfolder instead of having so many document names prefixed with perf.

Contributor Author

great idea


### fp16 / bf16

Enabling mixed precision will make your training both faster and use less memory. The science of it is explained [here](performance#fp16) and [here](performance#bf16).
Collaborator

We should remove the "use less memory" as it's only the case when the batch size is big. We already have enough issues opened by users complaining it uses more memory for large models with batch size 1.

Contributor Author

Very good call. It's because fp16 is touted everywhere as saving memory, but most of the time it doesn't, since keeping both the fp16 and fp32 versions of the weights takes 0.5x extra memory.
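
For context, with the Trainer enabling mixed precision is a one-flag change. A minimal sketch, assuming an NVIDIA GPU and a recent `transformers` (the `output_dir` value is just a placeholder):

```python
from transformers import TrainingArguments

# fp16 runs on most NVIDIA GPUs; bf16 needs Ampere or newer hardware
args = TrainingArguments(
    output_dir="output",  # placeholder
    fp16=True,            # or bf16=True - pick one, not both
)
```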


First, a quick decision tree:

1. Model fits onto a single GPU and you have enough space to fit a small batch size - you don't need to use DeepSpeed as it'll only slow things down in this use case.
Collaborator

I would add "for training" as you need space for the gradients, optimizer states etc.
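
To make the "for training" point concrete, here's a back-of-the-envelope sketch; the per-parameter byte counts are the usual mixed-precision AdamW estimates, not measurements:

```python
def train_mem_gib(n_params: float) -> float:
    """Rough training-memory floor for mixed-precision AdamW, excluding activations."""
    # per parameter: 2 bytes fp16 weights + 2 bytes fp16 grads
    # + 12 bytes optimizer states (fp32 master weights, momentum, variance)
    return n_params * (2 + 2 + 12) / 2**30

print(f"{train_mem_gib(1.5e9):.1f} GiB")  # ~22.4 GiB for a 1.5B model, vs ~2.8 GiB for fp16 inference
```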

Comment on lines +173 to +174
- Custom training loop: This is somewhat complex but you can study how this is implemented in [HF Trainer](
https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) - simply search for `batch_size` in the code.
Collaborator

This is the argument passed along to the dataloader, so not complex at all :-)
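
For illustration, in a custom loop it is indeed just the `DataLoader` argument; a minimal sketch where `train_dataset` stands in for whatever dataset object you already have:

```python
from torch.utils.data import DataLoader

# the batch size is set on the DataLoader that feeds the loop
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

for batch in train_dataloader:
    ...  # usual forward/backward
```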

Comment on lines +178 to +184
### Optimizer

The choice of optimizer can impact throughput. For example, using the fused AdamW optimizer from [NVIDIA/apex](https://github.com/NVIDIA/apex) will be faster than the same optimizer from `torch`.

The science of optimizer speed, and which optimizer to choose when, is explained [here](performance#optimizer).

To activate it, please see the Optimizer sub-section in the "Less Memory" section of this document (XXX: how to link?)
Collaborator

There is already a section on optimizers?

Contributor Author

yes - expanded it in this PR
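
For reference, the apex optimizer the hunk mentions is a drop-in replacement for `torch.optim.AdamW`. A minimal sketch, assuming apex was built with its CUDA extensions and `model` is already instantiated:

```python
import apex

# FusedAdam fuses the per-tensor update kernels into one launch;
# it defaults to AdamW behavior (adam_w_mode=True)
optimizer = apex.optimizers.FusedAdam(model.parameters(), lr=5e-5)
```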

@lvwerra
Member

lvwerra commented Feb 17, 2022

Hi @stas00 thanks for starting to work on this!

First, I am not sure about the doc builder - maybe it is indeed easiest to just create a new PR.

A few high level comments on the current structure:

  • Since performance.mdx is the main entry point, I think its main purpose should just be to give an overview of all the subsections and maybe a guide on how to read them.
  • My understanding was that we would essentially move the guide added in add model scaling section #15119 from performance.mdx to the perf_train_gpu_one.mdx section and properly merge it with the material that was already there and is currently appended at the end of the document.
  • Regarding the separation of why and how: I am actually not in favour of that because I think it is easier for a user if both are together. Explain a concept and directly show how it is done, otherwise there will be a lot of switching between the two.

What do you think? If you'd like I can have a stab at it.

@stas00
Contributor Author

stas00 commented Feb 17, 2022

> Hi @stas00 thanks for starting to work on this!
>
> First, I am not sure about the doc builder - maybe it is indeed easiest to just create a new PR.
>
> A few high level comments on the current structure:
>
> • Since performance.mdx is the main entry point, I think its main purpose should just be to give an overview of all the subsections and maybe a guide on how to read them.

Indeed, we can move the in-depth reference to another file and refer to it instead. For now it's just easier to xref to it.

> • My understanding was that we would essentially move the guide added in add model scaling section #15119 from performance.mdx to the perf_train_gpu_one.mdx section and properly merge it with the material that was already there and is currently appended at the end of the document.

Indeed, something like that. I was just trying to use a few sections to lay out a possible structure before we fill the gaps in.

> • Regarding the separation of why and how: I am actually not in favour of that because I think it is easier for a user if both are together. Explain a concept and directly show how it is done, otherwise there will be a lot of switching between the two.

That would mean many sections get duplicated at least 5 times, and in some cases more than that - as you can see, I have 2 near-identical sub-sections for several entries, since those impact both speed and memory, but the explanations are slightly different.

> What do you think? If you'd like I can have a stab at it.

Sure, please feel free to shift things around and propose a different approach.

Let me know if you prefer that I open a new PR first, but then I will need to integrate Sylvain's suggestions there, and I'm a bit too busy with BigScience at the moment. So it's your call.

I can of course integrate them in the new PR as well, I won't forget.

@lvwerra
Member

lvwerra commented Feb 18, 2022

Awesome, thanks for clarifying - it is sometimes hard to read the intentions :)

If you could open a new PR that would be great and I can work on it a bit next week and also integrate Sylvain's comments (or make notes).

Thanks!

@stas00
Contributor Author

stas00 commented Feb 18, 2022

The PR has been moved to #15723

Will address all the suggestions so far in the new PR.

@stas00 stas00 closed this Feb 18, 2022
@stas00 stas00 deleted the doc-perf-revamp branch April 27, 2022 16:05