[transformer] Allow for skipping stream synch #1505

crcrpar · 2022-10-06T17:31:39Z

Optionally disable stream synchronization after batched p2p communication

Exported from the nvcr.io/nvidia/pytorch:22.09-py3 container, with some test cases.
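The behaviour described above can be pictured with a minimal sketch. It assumes the batched communication goes through `torch.distributed.batch_isend_irecv` (named in one of this PR's commit titles) and that the synchronization being skipped is `torch.cuda.synchronize`. The function name `run_batched_p2p` and the injectable `batch_fn` / `sync_fn` parameters are hypothetical, added only so the control flow can be exercised without GPUs; `sync_batch_comm` is the flag name used in this PR's test cases. This is not the apex implementation.

```python
from typing import Callable, List, Optional, Sequence


def run_batched_p2p(
    ops: Sequence[object],
    sync_batch_comm: bool = True,
    batch_fn: Optional[Callable[[List[object]], List[object]]] = None,
    sync_fn: Optional[Callable[[], None]] = None,
) -> None:
    """Launch batched p2p ops, wait on them, then optionally synchronize.

    `batch_fn` and `sync_fn` are hypothetical injection points; when omitted
    they fall back to the PyTorch calls assumed in the lead-in.
    """
    if not ops:
        return
    if batch_fn is None:
        # Imported lazily so the sketch can be exercised without PyTorch.
        import torch.distributed as dist
        batch_fn = dist.batch_isend_irecv
    if sync_fn is None:
        import torch
        sync_fn = torch.cuda.synchronize

    # Issue all sends and receives as one batch and wait for each request.
    requests = batch_fn(list(ops))
    for request in requests:
        request.wait()

    # The optional part: skip the synchronization when the flag is False.
    if sync_batch_comm:
        sync_fn()
```

With real PyTorch, `ops` would be a list of `torch.distributed.P2POp` objects; passing `sync_batch_comm=False` leaves out only the final synchronization and nothing else.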


Aidyn-A

Sorry, one of my latest PRs created confusion in test naming. I prefer test_learning and test_inference, which are (subjectively) more convenient when running the tests, so I made several suggestions for them. Everything else looks good to me 👍

tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py

crcrpar · 2022-10-12T00:19:55Z

On DGX A100, `python tests/L0/run_test.py --include run_transformer` hit an out-of-memory error and the interleaving cases failed, while `(python|pytest) tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py` did not. So the failures look unrelated to this PR.

* Optionally disable stream synchronization after batched p2p communication
* Add test cases with `sync_batch_comm=False` only when pytorch/pytorch#82450 is included in pytorch.
  Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
* utilize existing test methods
  Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
* consistent naming
  Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
  Co-authored-by: Aidyn-A <Aidyn-A@users.noreply.github.com>
* silly boy, to skip the sync, set False
  Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
* cosmetic
  Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
* Test with async_pipelinign w/o sync after batch_isend_irecv
  Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
* again, set sync_batch_comm to False
  Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
  Co-authored-by: Aidyn-A <Aidyn-A@users.noreply.github.com>
* Remove `torch.testing._internal.common_cuda`
  Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: Sangkug Lym <slym@nvidia.com>
Co-authored-by: Aidyn-A <Aidyn-A@users.noreply.github.com>

Optionally disable stream synchronization after batched p2p communication

5009af6

crcrpar marked this pull request as draft October 6, 2022 18:54

crcrpar added 2 commits October 6, 2022 12:06

Add test cases with sync_batch_comm=False

a47893f

only when pytorch/pytorch#82450 is included in pytorch. Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
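Gating a test case on an upstream fix can be sketched with `unittest.skipUnless`. How this PR actually detects pytorch/pytorch#82450 is not shown in the thread, so `pytorch_has_batched_p2p_fix` below is a hypothetical placeholder that conservatively reports the fix as absent; the class name, test names, and `_run_case` helper are illustrative only.

```python
import unittest


def pytorch_has_batched_p2p_fix() -> bool:
    """Hypothetical probe for pytorch/pytorch#82450.

    The real detection logic is not shown in this thread, so this
    placeholder conservatively reports that the fix is absent.
    """
    return False


class SyncBatchCommCases(unittest.TestCase):
    def _run_case(self, sync_batch_comm: bool) -> None:
        # Stand-in for the real forward/backward pipeline test body.
        self.assertIsInstance(sync_batch_comm, bool)

    def test_learning(self) -> None:
        # Default behaviour: synchronize after batched p2p communication.
        self._run_case(sync_batch_comm=True)

    @unittest.skipUnless(
        pytorch_has_batched_p2p_fix(), "requires pytorch/pytorch#82450"
    )
    def test_learning_without_sync(self) -> None:
        # Only meaningful when the upstream fix is available.
        self._run_case(sync_batch_comm=False)


def run_cases() -> unittest.TestResult:
    """Run both cases and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SyncBatchCommCases)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Because the placeholder returns `False`, the `sync_batch_comm=False` case is reported as skipped rather than failed, which keeps the suite green on PyTorch builds without the fix.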

utilize existing test methods

7ac65a6

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

Aidyn-A approved these changes Oct 6, 2022

crcrpar and others added 4 commits October 6, 2022 12:50

consistent naming

287f8d7

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com> Co-authored-by: Aidyn-A <Aidyn-A@users.noreply.github.com>

silly boy, to skip the sync, set False

7234123

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

cosmetic

4717056

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

Test with async_pipelinign w/o sync after batch_isend_irecv

d5638fc

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

crcrpar marked this pull request as ready for review October 6, 2022 20:17

crcrpar requested a review from Aidyn-A October 6, 2022 20:17

Aidyn-A reviewed Oct 6, 2022

tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py (outdated, resolved)

tests/L0/run_transformer/test_pipeline_parallel_fwd_bwd.py (outdated, resolved)

again, set sync_batch_comm to False

3e16c68

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com> Co-authored-by: Aidyn-A <Aidyn-A@users.noreply.github.com>

Aidyn-A approved these changes Oct 6, 2022

Remove torch.testing._internal.common_cuda

eb60f96

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

crcrpar merged commit 806f9b0 into NVIDIA:master Oct 12, 2022

crcrpar deleted the optionally_skip_sync_after_p2p branch October 12, 2022 00:31

crcrpar added this to the 22.09 milestone Oct 25, 2022
