
[ADAG] Refactor nccl to communicator channel. #47845

Closed
wants to merge 24 commits

Conversation

Contributor

@Bye-legumes Bye-legumes commented Sep 27, 2024

Why are these changes needed?

This PR enables the ADAG channel to use different hardware while the user-facing API stays the same.
On the user side, transport='nccl' or transport='hccl' selects the hardware.
Internally, ADAG treats both as nodes that need a communicator, so at the compiled-DAG and channel level this is just a rename of nccl to communicator.
At the bottom level, nccl_group or hccl_group is called to provide the hardware-level communicator.
The API and logic above torch_tensor_communicator_channel.py (previously torch_tensor_nccl_channel.py) stay the same.

So to support a new accelerator, all that is needed is to implement an xccl_group with the same API as the existing groups; then the new hardware can be used.
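Roughly, the shared interface all such groups would follow looks like the sketch below. The class and method names are illustrative assumptions for this discussion, not the actual classes touched by this PR.

```python
# Illustrative sketch only: names are assumptions, not Ray internals.
from abc import ABC, abstractmethod

import torch


class CommunicatorGroup(ABC):
    """Common interface that nccl_group / hccl_group (or a future xccl_group)
    would implement, so the channel layer never needs to know the backend."""

    @abstractmethod
    def initialize(self, rank: int) -> None:
        """Set up the collective group on this worker."""

    @abstractmethod
    def send(self, tensor: torch.Tensor, peer_rank: int) -> None:
        """Send a tensor to another rank in the group."""

    @abstractmethod
    def recv(self, shape, dtype, peer_rank: int) -> torch.Tensor:
        """Receive a tensor of the given shape/dtype from another rank."""

    @abstractmethod
    def destroy(self) -> None:
        """Tear down the collective group and release backend resources."""
```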

RFC doc https://docs.google.com/document/d/1zu9SllrEAjPHqs-eeITtrSSbv0rBxtkyCJeweZJl100/edit?usp=sharing

Related issue number

Checks

  • [√] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [√] I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • [√] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Bye-legumes and others added 7 commits September 26, 2024 16:30
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
@Bye-legumes Bye-legumes changed the title from "[WIP][ADAG] Refactor nccl to communicator channel." to "[ADAG] Refactor nccl to communicator channel." Oct 2, 2024
Bye-legumes and others added 14 commits October 4, 2024 10:40
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
@Bye-legumes
Contributor Author

Bye-legumes commented Oct 9, 2024

@ruisearch42 @rkooo567 Can you take a look at this and also #47658? Thanks! Currently I just keep the same API at the top level, so we can use transport='hccl' or transport='nccl', while for future hardware we just need to implement xccl_group.
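For reference, a rough usage sketch of what "same API at the top level" means here. Exact import paths and method names may differ across Ray versions, the Worker actor is made up for illustration, and transport="hccl" is the behavior proposed in this PR, not an existing option.

```python
# Sketch only: paths/names may differ across Ray versions; "hccl" is proposed here.
import ray
import torch
from ray.dag import InputNode
from ray.experimental.channel.torch_tensor_type import TorchTensorType


@ray.remote(num_gpus=1)
class Worker:
    def produce(self, _):
        return torch.ones(4, device="cuda")

    def consume(self, t):
        return t.sum().item()


sender, receiver = Worker.remote(), Worker.remote()

with InputNode() as inp:
    dag = sender.produce.bind(inp)
    # Only the transport string changes between hardware backends.
    dag = dag.with_type_hint(TorchTensorType(transport="nccl"))  # or "hccl"
    dag = receiver.consume.bind(dag)

compiled_dag = dag.experimental_compile()
print(ray.get(compiled_dag.execute(0)))
```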

@ruisearch42 ruisearch42 self-assigned this Oct 9, 2024
@ruisearch42
Contributor

ruisearch42 commented Oct 10, 2024

Hi @Bye-legumes, this changes quite a bit of the interface. Perhaps it's better to set up a meeting with @rkooo567 and me to discuss your high-level plans and timelines for NPU support. Can we set up something next week?

Also cc @stephanie-wang @anyscalesam in case you have any thoughts on this change and future ones.

@Bye-legumes
Contributor Author

Hi @Bye-legumes, this changes quite a bit of the interface. Perhaps it's better to set up a meeting with @rkooo567 and me to discuss your high-level plans and timelines for NPU support. Can we set up something next week?

Also cc @stephanie-wang @anyscalesam in case you have any thoughts on this change and future ones.

Sure, let's set up a meeting next week; I am available any time!

@stephanie-wang
Contributor

Can you update the PR description with the changes included? If it is mainly renaming internal APIs from nccl -> communicator, that's probably okay.

For adding a new transport, the preferred method for now would probably be to pass a custom communicator during compilation, which is being worked on in #47540.

@Bye-legumes
Contributor Author

Can you update the PR description with the changes included? If it is mainly renaming internal APIs from nccl -> communicator, that's probably okay.

For adding a new transport, the preferred method for now would probably be to pass a custom communicator during compilation, which is being worked on in #47540.

Thanks for your reply! It's mainly renaming internal APIs from nccl -> communicator, plus some hardware checking during the compile stage, since compilation currently always checks for a GPU even when new hardware would use xccl. That is my motivation: for future hardware, we just need to implement the xccl_group and add a hardware check at the compile stage.
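A minimal sketch of the kind of compile-time check meant here. The function name and the torch_npu-based detection for hccl are assumptions for illustration, not the code in this PR.

```python
import torch


def check_transport_hardware(transport: str) -> None:
    """Sketch: verify the requested transport's device stack is usable,
    instead of unconditionally assuming CUDA at compile time."""
    if transport == "nccl":
        if not torch.cuda.is_available():
            raise ValueError("transport='nccl' requires a CUDA-capable GPU")
    elif transport == "hccl":
        try:
            # Assumption: presence of the Ascend NPU PyTorch plugin is the check.
            import torch_npu  # noqa: F401
        except ImportError:
            raise ValueError("transport='hccl' requires an Ascend NPU with torch_npu")
    else:
        raise ValueError(f"Unsupported transport: {transport!r}")
```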

@stephanie-wang
Contributor

Thanks for your reply! It's mainly renaming internal APIs from nccl -> communicator, plus some hardware checking during the compile stage, since compilation currently always checks for a GPU even when new hardware would use xccl. That is my motivation: for future hardware, we just need to implement the xccl_group and add a hardware check at the compile stage.

I see, thanks! Does the PR include the change for hardware checking?

I think in general we can accept the internal API renaming right away, but as @ruisearch42 said, it would be best if we can discuss the long-term plan for xccl support. One question I have is whether we need to add "xccl" as a possible transport type or if just using/extending the custom communicator interface would be sufficient. Could you put together a short RFC doc on this so that we can discuss with the OSS community?

@Bye-legumes
Contributor Author

Thanks for your reply! It's mainly renaming internal APIs from nccl -> communicator, plus some hardware checking during the compile stage, since compilation currently always checks for a GPU even when new hardware would use xccl. That is my motivation: for future hardware, we just need to implement the xccl_group and add a hardware check at the compile stage.

I see, thanks! Does the PR include the change for hardware checking?

I think in general we can accept the internal API renaming right away, but as @ruisearch42 said, it would be best if we can discuss the long-term plan for xccl support. One question I have is whether we need to add "xccl" as a possible transport type or if just using/extending the custom communicator interface would be sufficient. Could you put together a short RFC doc on this so that we can discuss with the OSS community?

  1. Yes, it adds some resource checking at the channel level.
  2. I am not sure whether we should use "xccl" directly at the higher level or not. (Of course, I think we can achieve this by checking the resources and then deciding which xccl should be used.) Currently I just keep the higher-level transport the same: we specify "hccl" or "nccl" and then do the resource check.
  3. To make extension to different hardware possible, the layers above python/ray/experimental/channel/torch_tensor_communicator_channel.py can keep the same logic (just renaming requires_nccl to requires_communicator), and every xccl_group uses the same API (init, send, receive, destroy). I think that's enough. That is: 1) keep the logic unchanged for the code above python/ray/experimental/channel/torch_tensor_communicator_channel.py; 2) use the same API for every xccl_group. To add a new resource, we add a resource check at the channel level in python/ray/experimental/channel/torch_tensor_type.py and then implement an xccl_group with that same API (see the sketch below).
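A hypothetical sketch of that extension point. None of these names exist in Ray; they only illustrate the "resource check + same-API group" recipe described above.

```python
# Hypothetical extension point: transport name -> (resource check, group factory).
_COMMUNICATOR_BACKENDS = {}


def register_backend(transport, resource_check, group_factory):
    """Register a new *ccl backend without touching the channel-level logic."""
    _COMMUNICATOR_BACKENDS[transport] = (resource_check, group_factory)


def create_communicator_group(transport, actors):
    """Called at compile time: check hardware, then build the backend group."""
    try:
        resource_check, group_factory = _COMMUNICATOR_BACKENDS[transport]
    except KeyError:
        raise ValueError(f"Unknown transport: {transport!r}")
    if not all(resource_check(actor) for actor in actors):
        raise ValueError(f"Some actors lack the hardware required by {transport!r}")
    return group_factory(actors)


# Adding new hardware would then only need:
#   register_backend("xccl", has_xpu, XcclGroup)
# where XcclGroup implements the same init/send/recv/destroy API as the others.
```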

@anyscalesam anyscalesam added the core, compiled-graphs, and P1 labels Oct 17, 2024
Signed-off-by: zhilong chen <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong chen <zhilong.chen@mail.mcgill.ca>
@Bye-legumes
Contributor Author

I'll propose another PR later.
