-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ADAG] Refactor nccl to communicator channel. #47845
Conversation
@ruisearch42 @rkooo567 Can you take a look on this and also #47658. Thx! Currently I just keep the same api at the top level so we can use |
Hi @Bye-legumes , this changes quite a bit of the interface. Perhaps it's better to set up a meeting with @rkooo567 and me to discuss about your high level plans and timelines for NPU support. Can we set up something next week? Also cc @stephanie-wang @anyscalesam in case you have any thoughts on this change and future ones. |
Sure, you can set up a meeting next week, I am available all the time! |
Can you update the PR description with the changes included? If it is mainly renaming internal APIs from nccl -> communicator, that's probably okay. For adding a new transport, the preferred method for now would probably be to pass a custom communicator during compilation, which is being worked on in #47540. |
Thx for your reply! , it;s just some renaming internal APIs from nccl -> communicator and also some hardware checking during complied stage. As if we need new Hardware to use xccl, the compile will always check GPU currently. So that is my motivation so for future hardware, we just need to implement the xccl_group and add hardware checking at compile stages. |
I see, thanks! Does the PR include the change for hardware checking? I think in general we can accept the API internal renaming right away, but as @ruisearch42 said, it would be best if we can discuss the long-term plan for xccl support. One question I have is whether we need to add "xccl" as a possible transport type or if just using/extending the custom communicator interface would be sufficient. Could you put together a short RFC doc on this so the we can discuss with the OSS community? |
|
propose another PR later |
Why are these changes needed?
This is try to enable the ADAG channel can use different hardware while the user API keeps the same.
For user side, they can use
transport='nccl'
ortransport='hccl'
for different hardware.While internally, ADAG will treat them the nodes that needs the communicator. So for the complied and channel level, it just rename the nccl to communicator.
In the bottom level, the
nccl_group
orhccl_group
will be called to achieve the hardware level communicator.The API and logical above the
torch_tensor_communicator_channel.py
previouslytorch_tensor_nccl_channel.py
keeps the same.So when new accelerator will be used, what they need is just implement xccl_group, with same API for all groups. Then we can use new hardware.
RFC doc https://docs.google.com/document/d/1zu9SllrEAjPHqs-eeITtrSSbv0rBxtkyCJeweZJl100/edit?usp=sharing
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.