Support all the softmax extensions and cherry-pick transformer-related commits #101

Open

wants to merge 25 commits into base: master
Conversation

@hubertlu-tw commented Dec 30, 2022

  • Cherry-picked transformer-related commits from upstream
  • Support generic_scaled_masked_softmax_cuda
  • Support scaled_softmax_cuda
  • Support fused_weight_gradient_mlp_cuda for ROCm
  • Add optimizers and clip_grad to run_rocm_extensions.py
  • Fix a bug in run_rocm_extensions.py

To run the extension unit tests, run the following commands:

cd apex/contrib/test
APEX_TEST_WITH_ROCM=1 APEX_SKIP_FLAKY_TEST=1 python3 run_rocm_extensions.py

To run the transformer unit tests, run the following command:

python tests/L0/run_test.py --include run_transformer

Ran 120 tests in 506.928s
FAILED (errors=7, skipped=55)

TODO: We will need to work on an IFU PR and investigate the failing tests so that they can be skipped on ROCm.
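
For context, the softmax extensions listed above (scaled_softmax_cuda, scaled_masked_softmax_cuda, generic_scaled_masked_softmax_cuda) all fuse a scale, an optional mask, and a softmax over the last dimension. The snippet below is only a plain-PyTorch reference of that computation, the kind of ground truth the unit tests compare against; it is a sketch of the math, not the extension API, and the shapes are illustrative.

```python
import torch

def reference_scaled_masked_softmax(scores, mask, scale):
    """Reference for the fused kernels: scale, optionally mask, then softmax over the last dim.

    scores: (batch, heads, seq_q, seq_k) attention scores
    mask:   boolean tensor broadcastable to `scores` (True = masked out), or None
    scale:  scalar applied before the softmax
    """
    scores = scores * scale
    if mask is not None:
        # Megatron-style kernels fill masked positions with a large negative value.
        scores = scores.masked_fill(mask, -10000.0)
    return torch.softmax(scores, dim=-1)

# Illustrative shapes only; mask=None exercises the unmasked scaled-softmax path.
scores = torch.randn(2, 4, 8, 8)
mask = torch.zeros(2, 1, 8, 8, dtype=torch.bool)
out_masked = reference_scaled_masked_softmax(scores, mask, scale=0.5)
out_unmasked = reference_scaled_masked_softmax(scores, None, scale=0.5)
```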

yidong72 and others added 25 commits December 28, 2022 00:10
* new kernel

Signed-off-by: Yi Dong <yidong@nvidia.com>

* added the unit tests

Signed-off-by: Yi Dong <yidong@nvidia.com>

* clean up unittest

Signed-off-by: Yi Dong <yidong@nvidia.com>

* use float

Signed-off-by: Yi Dong <yidong@nvidia.com>

* more clean up

Signed-off-by: Yi Dong <yidong@nvidia.com>

* remove the long seq test case
…DIA#1448)

* less mem consumption by fused generic softmax tests

ran with RTX 3070 Ti

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* Deduplicate qlen of 1234
…VIDIA#1451)

* Use xmlrunner.XMLTestRunner accordingly

TODO:
- [x] Remove `subTest` because it's not compatible with the current way
of running L0 tests

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* use `torch.testing` more to enable xmlrunner

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* Remove `subTest` for xmlrunner

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* removing subTest

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* not depend on an env var

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* fix syntax errors

* open with `"wb"`

* xml file per dir

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* remove comment-out

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* Refactor `TestTransformer`: define member methods (#5)

* setUpClass to define `test_` methods

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* manually define

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* add a missing test

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* remove print

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* remove ext

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
to use `torch.testing.assert_close` instead of
`numpy.testing.assert_allclose`. The former uses a bit looser threshold
values.

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
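
The commit body above switches the comparisons from `numpy.testing.assert_allclose` to `torch.testing.assert_close`, whose default tolerances are dtype-aware and a bit looser. A minimal illustration of the two calls (tolerance values chosen only for the example):

```python
import numpy as np
import torch

expected = torch.arange(8, dtype=torch.float32)
actual = expected + 1e-4  # small numerical difference, e.g. from a fused kernel

# torch.testing.assert_close picks default rtol/atol from the dtype
# (looser for reduced-precision types); explicit tolerances are still allowed.
torch.testing.assert_close(actual, expected, rtol=0.0, atol=1e-3)

# numpy.testing.assert_allclose defaults to rtol=1e-7 and atol=0,
# so GPU tests usually had to pass tolerances explicitly.
np.testing.assert_allclose(actual.numpy(), expected.numpy(), rtol=0.0, atol=1e-3)
```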
* apex.amp migration to torch.cuda.amp

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* add autocast tests

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* split with and without autocast

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
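
The commits above migrate tests from `apex.amp` to the upstream `torch.cuda.amp` autocast/GradScaler API. A generic sketch of that pattern (the model, optimizer, and data below are placeholders, not code from this PR):

```python
import torch

model = torch.nn.Linear(32, 32).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

data = torch.randn(8, 32, device="cuda")
target = torch.randn(8, 32, device="cuda")

optimizer.zero_grad()
# autocast replaces apex.amp.initialize(): ops run in reduced precision where safe.
with torch.cuda.amp.autocast():
    loss = torch.nn.functional.mse_loss(model(data), target)
scaler.scale(loss).backward()  # analogous to apex's amp.scale_loss(loss, optimizer)
scaler.step(optimizer)
scaler.update()
```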
* Label smoothing in vocab parallel cross entropy

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix context saving

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Remove .item() calls

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Update tests

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
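
The label-smoothing commits above extend the vocab-parallel cross entropy; for reference, the non-parallel form of label-smoothed cross entropy amounts to the following (a sketch of the math only, not the tensor-parallel implementation in this PR):

```python
import torch
import torch.nn.functional as F

def label_smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Blend the NLL of the gold token with the mean NLL over the whole vocabulary."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)
    smooth = -log_probs.mean(dim=-1)
    return ((1.0 - smoothing) * nll + smoothing * smooth).mean()

logits = torch.randn(4, 10)           # (tokens, vocab)
target = torch.randint(0, 10, (4,))   # gold token ids
loss = label_smoothed_cross_entropy(logits, target)
```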
…atron pipeline parallelism (NVIDIA#1475)

* Refactor how dist Adam handles overlapped grad sync

Each grad bucket independently keeps track of grads that have been generated. Add helper function to create callback functions. Change default param arg in grad norm functions to None. Perform communication for checkpointing in main stream to avoid memory pool overheads.

* Support Megatron pipeline parallelism with async grad reduction

Enables async grad reduction in first pipeline stage during last backward pass, and disables async grad reduction in all other pipeline stages.

* Review suggestions from crcrpar

Add unit test for pipeline parallelism with custom sync context. Style tweaks.

* Use unittest assert functions in pipeline parallelism test

Review suggestion from crcrpar
* Optionally disable stream synchronization after batched p2p communication

* Add test cases with `sync_batch_comm=False`

only when pytorch/pytorch#82450 is included in
pytorch.

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* utilize existing test methods

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* consistent naming

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: Aidyn-A <Aidyn-A@users.noreply.github.com>

* silly boy, to skip the sync, set False

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* cosmetic

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* Test with async_pipelining w/o sync after batch_isend_irecv

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* again, set sync_batch_comm to False

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: Aidyn-A <Aidyn-A@users.noreply.github.com>

* Remove `torch.testing._internal.common_cuda`

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: Sangkug Lym <slym@nvidia.com>
Co-authored-by: Aidyn-A <Aidyn-A@users.noreply.github.com>
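
The `sync_batch_comm=False` commits above make the device synchronization after batched p2p communication optional once the fix from pytorch/pytorch#82450 is available. A rough sketch of the pattern being toggled (the function name, peer rank, and tensors are placeholders):

```python
import torch
import torch.distributed as dist

def exchange_with_peer(send_tensor, recv_tensor, peer_rank, sync_batch_comm=True):
    """Batched isend/irecv followed by an optional full-device sync."""
    ops = [
        dist.P2POp(dist.isend, send_tensor, peer_rank),
        dist.P2POp(dist.irecv, recv_tensor, peer_rank),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    if sync_batch_comm:
        # Works around the stream issue fixed upstream; with the fix present,
        # callers can pass sync_batch_comm=False and skip this.
        torch.cuda.synchronize()
```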
…lelism (NVIDIA#1514)

* Add option to use no_sync context with interleaved pipeline parallelism

* Add unit test for no_sync context with interleaved pipeline parallelism

* Debug no_sync context support in interleaved pipeline parallelism
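
The no_sync commits above allow gradient reduction to be deferred across microbatches under (interleaved) pipeline parallelism. The general shape of that pattern, shown here with a generic context manager rather than apex's own implementation (with DDP one would pass `model.no_sync` as the factory):

```python
import contextlib
import torch

def run_microbatches(model, microbatches, no_sync_factory=contextlib.nullcontext):
    """Skip gradient synchronization for all but the last microbatch."""
    for i, (data, target) in enumerate(microbatches):
        last = i == len(microbatches) - 1
        ctx = contextlib.nullcontext() if last else no_sync_factory()
        with ctx:
            loss = torch.nn.functional.mse_loss(model(data), target)
            loss.backward()  # grads accumulate locally; reduction happens on the last pass
```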
…nstead of torch_ucc (NVIDIA#1495)

* update HAS_TORCH_UCC to TORCH_UCC

* add comments for failing tests

* move HAS_UCC to _ucc_utils.py

* whitespace

* small changes

* newline

* updated list of failing tests

* update failing tests list
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
* Update megatron fused softmax follow megatron-lm

Signed-off-by: Yu Yao <yuya@nvidia.com>

* Add mask=None support in scaled_masked_softmax

Signed-off-by: Yu Yao <yuya@nvidia.com>

* Update setup.py for scaled_softmax_cuda

Signed-off-by: Yu Yao <yuya@nvidia.com>

* Add tests for fused_scale_softmax (mask=None)

Signed-off-by: Yu Yao <yuya@nvidia.com>

* Assert grad equal in fused softmax test

Signed-off-by: Yu Yao <yuya@nvidia.com>

* Revert "Assert grad equal in fused softmax test"

Signed-off-by: Yu Yao <yuya@nvidia.com>

Signed-off-by: Yu Yao <yuya@nvidia.com>
Co-authored-by: Yu Yao <yuya@nvidia.com>
)

* working test_bert_minimal.py

* remove some debugging statements

* working test_gpt_minimal.py

* test_dynamic_batchsize.py having issues with torch.backends.cudnn.allow_tf32

* working test_dynamic_batchsize.py

* refactor test_bert_minimal.py, need to investigate rng of MANUAL_SEED for nccl only pipeline with virtual_pipeline_model_parallel_size = 2

* add test_bert_minimal_alt.py for visibility

* update test_gpt_minimal.py

* lint

* update loss cutoff for bert test

* split with / without interleaving tests for bert

* use skipTest

* remove ONCE

* add ignore_unknown_args=True

* remove old testing files

* add num_devices logic to override_args
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
@hubertlu-tw self-assigned this Dec 30, 2022