
Add out-of-place reduce-scatter coalescing #6058

Merged
JackCaoG merged 2 commits into master from jeffhataws_reduce_scatter_coalesce_out on Dec 11, 2023

Conversation

@jeffhataws (Collaborator)

In #5956 we added reduce-scatter coalescing, but the out-of-place handling was combined with the in-place processing, making the code hard to understand and maintain.

Per a reviewer's recommendation, this PR adds the out-of-place version of reduce-scatter coalescing as a separate path.
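
For reference, here is a minimal usage sketch of the out-of-place coalesced call. It assumes the `xm.reduce_scatter(reduce_type, input, scale, scatter_dim, shard_count, ..., output=...)` signature and the list-input coalescing behavior added in #5956; the shapes and the choice of `REDUCE_SUM` are illustrative, not taken from this PR:

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
world_size = xm.xrt_world_size()

# Two tensors to reduce-scatter in a single coalesced operation.
inputs = [torch.randn(world_size * 8, device=device) for _ in range(2)]

# Out-of-place: results go into caller-provided buffers, so the
# inputs are left untouched (the in-place path reuses the inputs).
outputs = [torch.zeros(8, device=device) for _ in range(2)]

xm.reduce_scatter(
    xm.REDUCE_SUM,
    inputs,           # a list of inputs selects the coalesced path
    scale=1.0,
    scatter_dim=0,
    shard_count=world_size,
    output=outputs)   # providing `output` selects the out-of-place path
xm.mark_step()
```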

jeffhataws force-pushed the jeffhataws_reduce_scatter_coalesce_out branch from 6c1923f to a560206 on December 8, 2023, 05:35
@jeffhataws (Collaborator, Author)

I don't understand why the profiler test failed with:

======================================================================
FAIL: test_trace_detached (__main__.ProfilerTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/pytorch/xla/test/test_profiler.py", line 130, in test_trace_detached
    path = self._check_xspace_pb_exist(logdir)
  File "/tmp/pytorch/xla/test/test_profiler.py", line 42, in _check_xspace_pb_exist
    self.assertEqual(1, len(paths), f'Expected one path match: {path}')
AssertionError: 1 != 0 : Expected one path match: /tmp/tmp9cod83or/plugins/profile/*/*.xplane.pb

@JackCaoG (Collaborator) commented Dec 8, 2023

@jonb377 hmm is it possible that test_trace_detached is flaky in any way?

@jonb377 (Collaborator) commented Dec 8, 2023

I suppose it could be; the capture and the assertion are async. Let's rerun the workflow and see if it passes. I'll also open a PR to add more headroom to the test.

@JackCaoG (Collaborator) left a comment

mostly lgtm, I think we should add a test too

jeffhataws force-pushed the jeffhataws_reduce_scatter_coalesce_out branch from 2d2b532 to 87fd463 on December 10, 2023, 17:03
jeffhataws force-pushed the jeffhataws_reduce_scatter_coalesce_out branch from 87fd463 to 451f9c5 on December 11, 2023, 04:33
@jeffhataws (Collaborator, Author)

@JackCaoG is this failing due to a flaky test? It was green until I rebased.

.Epoch 1 train begin 05:43:45
| Training Device=xla:0/0 Step=0 Loss=nan Rate=30.90 GlobalRate=30.90 Time=05:43:46
Starting to trace for 5000 ms. Remaining attempt(s): 2
2023-12-11 05:44:03.218014: W external/tsl/tsl/profiler/lib/profiler_session.cc:110] Profiling is late by 1279519 nanoseconds and will start immediately.
*** Received signal 11 ***
*** BEGIN MANGLED STACK TRACE ***
/opt/conda/lib/python3.8/site-packages/torch_xla-2.2.0+git26cb269-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so(+0x5f1e196)[0x7fd2d0585196]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x13140)[0x7fd456404140]
/usr/local/cuda/lib64/libcupti.so.12(+0x10be91)[0x7fd0acfbae91]
/usr/local/cuda/lib64/libcupti.so.12(+0x104714)[0x7fd0acfb3714]

@jeffhataws (Collaborator, Author)

> mostly lgtm, I think we should add a test too

Done, I added a new test. Please take a look. Thanks.
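
(For illustration only, here is a hypothetical shape such a test could take; it is not the actual test added in this PR. It assumes the same `xm.reduce_scatter` signature as in the sketch above and an `xmp.spawn` multiprocessing entry point:)

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
  device = xm.xla_device()
  world_size = xm.xrt_world_size()
  # Each replica contributes all-ones scaled by (i + 1), so a summed
  # reduce-scatter should yield (i + 1) * world_size in every element.
  inputs = [
      torch.ones(world_size * 4, device=device) * (i + 1) for i in range(2)
  ]
  outputs = [torch.zeros(4, device=device) for _ in range(2)]
  xm.reduce_scatter(
      xm.REDUCE_SUM,
      inputs,
      scale=1.0,
      scatter_dim=0,
      shard_count=world_size,
      output=outputs)
  xm.mark_step()
  for i, out in enumerate(outputs):
    expected = torch.full((4,), float((i + 1) * world_size))
    assert torch.allclose(out.cpu(), expected)

if __name__ == '__main__':
  xmp.spawn(_mp_fn)
```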

@JackCaoG (Collaborator)

Yeah, that flaky test is already fixed at head; you can ignore it for now.

@JackCaoG (Collaborator) left a comment

Thanks!

JackCaoG merged commit ba7c347 into master on Dec 11, 2023. 19 of 20 checks passed.
@JackCaoG (Collaborator)

I will take care of the backport once #6059 is also merged.
