
[ADAG] Refactor nccl to communicator channel. #47845

Closed
wants to merge 24 commits

Conversation

Contributor

@Bye-legumes Bye-legumes commented Sep 27, 2024

Why are these changes needed?

This PR enables the ADAG channel to use different hardware while the user-facing API stays the same.
On the user side, transport='nccl' or transport='hccl' selects the hardware.
Internally, ADAG treats both as nodes that need a communicator, so at the compiled-DAG and channel level this is just a rename of nccl to communicator.
At the bottom level, nccl_group or hccl_group is called to provide the hardware-level communicator.
The API and logic above torch_tensor_communicator_channel.py (previously torch_tensor_nccl_channel.py) stay the same.

So to support a new accelerator, all that is needed is to implement an xccl_group with the same API as the existing groups; then the new hardware can be used.
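Roughly, the shared interface all such groups would follow looks like the sketch below. The class and method names are illustrative assumptions for this discussion, not the actual classes touched by this PR.

```python
# Illustrative sketch only: names are assumptions, not Ray internals.
from abc import ABC, abstractmethod

import torch


class CommunicatorGroup(ABC):
    """Common interface that nccl_group / hccl_group (or a future xccl_group)
    would implement, so the channel layer never needs to know the backend."""

    @abstractmethod
    def initialize(self, rank: int) -> None:
        """Set up the collective group on this worker."""

    @abstractmethod
    def send(self, tensor: torch.Tensor, peer_rank: int) -> None:
        """Send a tensor to another rank in the group."""

    @abstractmethod
    def recv(self, shape, dtype, peer_rank: int) -> torch.Tensor:
        """Receive a tensor of the given shape/dtype from another rank."""

    @abstractmethod
    def destroy(self) -> None:
        """Tear down the collective group and release backend resources."""
```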

RFC doc https://docs.google.com/document/d/1zu9SllrEAjPHqs-eeITtrSSbv0rBxtkyCJeweZJl100/edit?usp=sharing

Related issue number

Checks

  • [√] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [√] I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • [√] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Bye-legumes and others added 7 commits September 26, 2024 16:30
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
@Bye-legumes Bye-legumes changed the title from "[WIP][ADAG] Refactor nccl to communicator channel." to "[ADAG] Refactor nccl to communicator channel." Oct 2, 2024
Bye-legumes and others added 14 commits October 4, 2024 10:40
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
@Bye-legumes
Contributor Author

Bye-legumes commented Oct 9, 2024

@ruisearch42 @rkooo567 Can you take a look at this and also #47658? Thanks! Currently I just keep the same API at the top level, so we can use transport='hccl' or transport='nccl', while for future hardware we just need to implement xccl_group.
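For reference, a rough usage sketch of what "same API at the top level" means here. Exact import paths and method names may differ across Ray versions, the Worker actor is made up for illustration, and transport="hccl" is the behavior proposed in this PR, not an existing option.

```python
# Sketch only: paths/names may differ across Ray versions; "hccl" is proposed here.
import ray
import torch
from ray.dag import InputNode
from ray.experimental.channel.torch_tensor_type import TorchTensorType


@ray.remote(num_gpus=1)
class Worker:
    def produce(self, _):
        return torch.ones(4, device="cuda")

    def consume(self, t):
        return t.sum().item()


sender, receiver = Worker.remote(), Worker.remote()

with InputNode() as inp:
    dag = sender.produce.bind(inp)
    # Only the transport string changes between hardware backends.
    dag = dag.with_type_hint(TorchTensorType(transport="nccl"))  # or "hccl"
    dag = receiver.consume.bind(dag)

compiled_dag = dag.experimental_compile()
print(ray.get(compiled_dag.execute(0)))
```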

@ruisearch42 ruisearch42 self-assigned this Oct 9, 2024
@ruisearch42
Contributor

ruisearch42 commented Oct 10, 2024

Hi @Bye-legumes, this changes quite a bit of the interface. Perhaps it's better to set up a meeting with @rkooo567 and me to discuss your high-level plans and timelines for NPU support. Can we set up something next week?

Also cc @stephanie-wang @anyscalesam in case you have any thoughts on this change and future ones.

@Bye-legumes
Contributor Author

Hi @Bye-legumes, this changes quite a bit of the interface. Perhaps it's better to set up a meeting with @rkooo567 and me to discuss your high-level plans and timelines for NPU support. Can we set up something next week?

Also cc @stephanie-wang @anyscalesam in case you have any thoughts on this change and future ones.

Sure, let's set up a meeting next week; I am available any time!

@stephanie-wang
Contributor

Can you update the PR description with the changes included? If it is mainly renaming internal APIs from nccl -> communicator, that's probably okay.

For adding a new transport, the preferred method for now would probably be to pass a custom communicator during compilation, which is being worked on in #47540.

@Bye-legumes
Contributor Author

Can you update the PR description with the changes included? If it is mainly renaming internal APIs from nccl -> communicator, that's probably okay.

For adding a new transport, the preferred method for now would probably be to pass a custom communicator during compilation, which is being worked on in #47540.

Thanks for your reply! It's mainly renaming internal APIs from nccl -> communicator, plus some hardware checking during the compile stage, since compilation currently always checks for a GPU even when new hardware would use xccl. That is my motivation: for future hardware, we just need to implement the xccl_group and add a hardware check at the compile stage.
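A minimal sketch of the kind of compile-time check meant here. The function name and the torch_npu-based detection for hccl are assumptions for illustration, not the code in this PR.

```python
import torch


def check_transport_hardware(transport: str) -> None:
    """Sketch: verify the requested transport's device stack is usable,
    instead of unconditionally assuming CUDA at compile time."""
    if transport == "nccl":
        if not torch.cuda.is_available():
            raise ValueError("transport='nccl' requires a CUDA-capable GPU")
    elif transport == "hccl":
        try:
            # Assumption: presence of the Ascend NPU PyTorch plugin is the check.
            import torch_npu  # noqa: F401
        except ImportError:
            raise ValueError("transport='hccl' requires an Ascend NPU with torch_npu")
    else:
        raise ValueError(f"Unsupported transport: {transport!r}")
```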

@stephanie-wang
Contributor

Thanks for your reply! It's mainly renaming internal APIs from nccl -> communicator, plus some hardware checking during the compile stage, since compilation currently always checks for a GPU even when new hardware would use xccl. That is my motivation: for future hardware, we just need to implement the xccl_group and add a hardware check at the compile stage.

I see, thanks! Does the PR include the change for hardware checking?

I think in general we can accept the internal API renaming right away, but as @ruisearch42 said, it would be best if we can discuss the long-term plan for xccl support. One question I have is whether we need to add "xccl" as a possible transport type or if just using/extending the custom communicator interface would be sufficient. Could you put together a short RFC doc on this so that we can discuss with the OSS community?

@Bye-legumes
Contributor Author

Thanks for your reply! It's mainly renaming internal APIs from nccl -> communicator, plus some hardware checking during the compile stage, since compilation currently always checks for a GPU even when new hardware would use xccl. That is my motivation: for future hardware, we just need to implement the xccl_group and add a hardware check at the compile stage.

I see, thanks! Does the PR include the change for hardware checking?

I think in general we can accept the internal API renaming right away, but as @ruisearch42 said, it would be best if we can discuss the long-term plan for xccl support. One question I have is whether we need to add "xccl" as a possible transport type or if just using/extending the custom communicator interface would be sufficient. Could you put together a short RFC doc on this so that we can discuss with the OSS community?

  1. Yes, it adds some resource checking at the channel level.
  2. I am not sure whether we should use "xccl" directly at the higher level or not. (Of course, I think we can achieve this by checking the resources and then deciding which xccl should be used.) Currently I just keep the higher-level transport the same: we specify "hccl" or "nccl" and then do the resource check.
  3. To make extension to different hardware possible, the layers above python/ray/experimental/channel/torch_tensor_communicator_channel.py can keep the same logic (just renaming requires_nccl to requires_communicator), and every xccl_group uses the same API (init, send, receive, destroy). I think that's enough. That is: 1) keep the logic unchanged for the code above python/ray/experimental/channel/torch_tensor_communicator_channel.py; 2) use the same API for every xccl_group. To add a new resource, we add a resource check at the channel level in python/ray/experimental/channel/torch_tensor_type.py and then implement an xccl_group with that same API (see the sketch below).
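A hypothetical sketch of that extension point. None of these names exist in Ray; they only illustrate the "resource check + same-API group" recipe described above.

```python
# Hypothetical extension point: transport name -> (resource check, group factory).
_COMMUNICATOR_BACKENDS = {}


def register_backend(transport, resource_check, group_factory):
    """Register a new *ccl backend without touching the channel-level logic."""
    _COMMUNICATOR_BACKENDS[transport] = (resource_check, group_factory)


def create_communicator_group(transport, actors):
    """Called at compile time: check hardware, then build the backend group."""
    try:
        resource_check, group_factory = _COMMUNICATOR_BACKENDS[transport]
    except KeyError:
        raise ValueError(f"Unknown transport: {transport!r}")
    if not all(resource_check(actor) for actor in actors):
        raise ValueError(f"Some actors lack the hardware required by {transport!r}")
    return group_factory(actors)


# Adding new hardware would then only need:
#   register_backend("xccl", has_xpu, XcclGroup)
# where XcclGroup implements the same init/send/recv/destroy API as the others.
```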

@anyscalesam anyscalesam added the core, compiled-graphs, and P1 labels Oct 17, 2024
Signed-off-by: zhilong chen <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong chen <zhilong.chen@mail.mcgill.ca>
@Bye-legumes
Contributor Author

I'll propose another PR later.
