[Feature Request] Implement distributed.py in the c++ #15061

Open
dmakoviichuk-tt opened this issue Nov 14, 2024 · 4 comments
Labels
feature-request External feature request forge P1

Comments

@dmakoviichuk-tt
Contributor

Is your feature request related to a problem? Please describe.
We have a lot of classes implemented in distributed.py that we also need in C++.
However, PyTorch is not available in our C++ code, so another library (xtensor) should be used instead.
Also, we have to_torch/from_torch functions in Python, but there are no convenient C++ versions of them.

Describe the solution you'd like

  1. Introduce xtensor in ttnn. We already have it in tt-train: https://github.com/tenstorrent/tt-metal/blob/main/tt-train/cmake/dependencies.cmake#L57
    The easiest way to add it is via CPM:
    CPMAddPackage(NAME xtl GITHUB_REPOSITORY xtensor-stack/xtl GIT_TAG 0.7.7 OPTIONS "XTL_ENABLE_TESTS OFF")
    CPMAddPackage(NAME xtensor GITHUB_REPOSITORY xtensor-stack/xtensor GIT_TAG 0.25.0 OPTIONS "XTENSOR_ENABLE_TESTS OFF")
  2. Implement to_vector/from_vector/from_view and to_xtensor/from_xtensor in C++. Please take a look at the reference:
    tt::tt_metal::Tensor from_vector<float, DataType::BFLOAT16>(

    We need to make sure that all types are supported. My reference does not support the 4-bit and 8-bit types.
  3. Implement all Sharding and Replicating strategies described in distributed.py.
  4. Reuse the C++ implementations in Python.

It seems that xtensor can support our custom bfloat16, which might be useful in some cases.
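
To make the float-to-bfloat16 part of `from_vector` / `to_vector` concrete, here is a minimal sketch of the conversion on plain `std::vector` buffers. It is not the ttnn or tt-train implementation: the helper names (`float_to_bf16_bits`, `bf16_bits_to_float`, `pack_bf16`, `unpack_bf16`) are hypothetical, and round-to-nearest-even is an assumption made here; the real bfloat16 type may truncate instead.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helpers for illustration only; not the ttnn API.
// bfloat16 keeps the sign, the 8 exponent bits and the top 7 mantissa bits
// of an IEEE-754 float32, i.e. the upper 16 bits of its bit pattern.
inline std::uint16_t float_to_bf16_bits(float value) {
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));
    const bool is_nan = (bits & 0x7F800000u) == 0x7F800000u && (bits & 0x007FFFFFu) != 0;
    if (is_nan) {
        // Force a mantissa bit so the result stays a NaN after dropping the low half.
        return static_cast<std::uint16_t>((bits >> 16) | 0x0040u);
    }
    // Assumption: round to nearest, ties to even.
    const std::uint32_t rounding_bias = 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>((bits + rounding_bias) >> 16);
}

inline float bf16_bits_to_float(std::uint16_t half) {
    // Widening is exact: the dropped low 16 bits are filled with zeros.
    const std::uint32_t bits = static_cast<std::uint32_t>(half) << 16;
    float value = 0.0f;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// Converts a float buffer to raw bfloat16 bit patterns.
std::vector<std::uint16_t> pack_bf16(const std::vector<float>& src) {
    std::vector<std::uint16_t> dst;
    dst.reserve(src.size());
    for (float v : src) {
        dst.push_back(float_to_bf16_bits(v));
    }
    return dst;
}

// Converts raw bfloat16 bit patterns back to floats.
std::vector<float> unpack_bf16(const std::vector<std::uint16_t>& src) {
    std::vector<float> dst;
    dst.reserve(src.size());
    for (std::uint16_t h : src) {
        dst.push_back(bf16_bits_to_float(h));
    }
    return dst;
}
```

Values that fit in 7 mantissa bits, such as `1.0f` or `-2.5f`, survive the round trip unchanged; other values lose precision on the way in, so a `to_vector` result generally differs slightly from the original input.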

Describe alternatives you've considered
I've considered implementing a few hacks in tt-train.

@staylorTT

@cfjchu Do we have any updates on this? An ETA or anything else from your team?

@omilyutin-tt
Contributor

@staylorTT we are looking for a simpler multi-device sharding API than what is currently provided in distributed.py; in parallel, I'm adding creation functions to/from vector/view, plus support for xtensor.

If the multi-device sharding is a blocker for you, would it work to provide from_vector / from_xarray plus an extension similar to what we have in distributed.py? I think we can get that in within a week or two, but I'm afraid that fleshing out a ttnn-native API might take more time, as there are some unknowns.

@nsmithtt
Contributor

nsmithtt commented Dec 4, 2024

@omilyutin-tt I think that API is workable for us. Ideally, if this eventually becomes a TTNN API, it accepts a tensor type and returns a tensor type (multi-device sharded). It also seems like it might be straightforward to achieve this by just bouncing a host storage tensor through std::vector and then calling this proposed API. Anyway, we'll work with what we can get in the short term, thank you for looking into this :)

omilyutin-tt added a commit that referenced this issue Dec 9, 2024
### Ticket
#15061 

### What's changed
* Refactor `DistributedTensorConfig` into its own header
* Use typed `struct` to represent `MeshShape` and `MeshOffset`

### Checklist
- [X] [Post commit CI
passes](https://github.com/tenstorrent/tt-metal/actions/runs/12210236362)
- [X] New/Existing tests provide coverage for changes
omilyutin-tt added a commit that referenced this issue Dec 17, 2024
…++ ttnn tensors (#15886)

### Ticket
#15755

### Problem description
Multi-device tensor distribution currently works through
`distributed.py`, which relies on PyTorch libraries to perform sharding
/ concatenation.

### What's changed
* Add xtensor to ttnn.
* Lower facilities from tt-train down to ttnn. In particular: `chunk`,
`concatenate` functions along with some conversion utils, and the
relevant tests.
* Add `distributed_tensor.hpp` header with the multi-device distribution
APIs.
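
As an illustration of what `chunk` and `concatenate` do, here is a minimal sketch on row-major host buffers using only the standard library. It is not the code added in this PR: the `HostTensor` struct, the `volume` helper, and both signatures are hypothetical, and the sketch assumes the chunked dimension divides evenly by the number of chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side tensor for illustration; not a ttnn type.
struct HostTensor {
    std::vector<std::size_t> shape;
    std::vector<float> data;  // row-major
};

// Product of shape[begin, end).
std::size_t volume(const std::vector<std::size_t>& shape, std::size_t begin, std::size_t end) {
    std::size_t v = 1;
    for (std::size_t i = begin; i < end; ++i) {
        v *= shape[i];
    }
    return v;
}

// Splits `t` into `num_chunks` equal pieces along `dim` (one piece per device).
std::vector<HostTensor> chunk(const HostTensor& t, std::size_t num_chunks, std::size_t dim) {
    assert(dim < t.shape.size());
    assert(num_chunks > 0 && t.shape[dim] % num_chunks == 0);  // assumption: even split
    const std::size_t outer = volume(t.shape, 0, dim);
    const std::size_t inner = volume(t.shape, dim + 1, t.shape.size());
    const std::size_t chunk_len = t.shape[dim] / num_chunks;
    std::vector<HostTensor> out(num_chunks);
    for (std::size_t c = 0; c < num_chunks; ++c) {
        out[c].shape = t.shape;
        out[c].shape[dim] = chunk_len;
        out[c].data.reserve(outer * chunk_len * inner);
        for (std::size_t o = 0; o < outer; ++o) {
            // Contiguous run belonging to chunk `c` inside outer slice `o`.
            const auto start = static_cast<std::ptrdiff_t>((o * t.shape[dim] + c * chunk_len) * inner);
            const auto len = static_cast<std::ptrdiff_t>(chunk_len * inner);
            out[c].data.insert(out[c].data.end(), t.data.begin() + start, t.data.begin() + start + len);
        }
    }
    return out;
}

// Inverse of `chunk`: joins the pieces back together along `dim`.
HostTensor concatenate(const std::vector<HostTensor>& parts, std::size_t dim) {
    assert(!parts.empty() && dim < parts[0].shape.size());
    HostTensor out;
    out.shape = parts[0].shape;
    out.shape[dim] = 0;
    for (const auto& p : parts) {
        out.shape[dim] += p.shape[dim];
    }
    const std::size_t outer = volume(out.shape, 0, dim);
    const std::size_t inner = volume(out.shape, dim + 1, out.shape.size());
    out.data.reserve(volume(out.shape, 0, out.shape.size()));
    for (std::size_t o = 0; o < outer; ++o) {
        for (const auto& p : parts) {
            const auto len = static_cast<std::ptrdiff_t>(p.shape[dim] * inner);
            const auto start = static_cast<std::ptrdiff_t>(o) * len;
            out.data.insert(out.data.end(), p.data.begin() + start, p.data.begin() + start + len);
        }
    }
    return out;
}
```

Sharding a tensor across N devices then amounts to `chunk(t, N, dim)` followed by one upload per piece, and gathering results is the matching `concatenate` call.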

**In follow up PRs:**
* Support bf4 / bf8 and other formats in `from_vector` / `to_vector` and
other overloads.
* Support outputting a tilized tensor.
* Migrate functionality from `pytensor.cpp` to use the new APIs.

### Checklist
- [x] [Post commit CI
passes](https://github.com/tenstorrent/tt-metal/actions/runs/12333746639/job/34427015707)
(failure in clang-tidy in unrelated tt-train directory)
- [X] [code analysis
run](https://github.com/tenstorrent/tt-metal/actions/runs/12360844971)
- [x] [T3K unit + frequent + model reg
tests](https://github.com/tenstorrent/tt-metal/actions/runs/12360656141)
- same breakage on main.
- [X] New/Existing tests provide coverage for changes
@omilyutin-tt
Contributor

#15886 adds the initial support for distributing a tensor across devices. I'm working on a couple of follow-ups to support more data types, handle tilized layouts, and make some performance optimizations. Please report any issues you encounter!

omilyutin-tt added a commit that referenced this issue Dec 24, 2024
…rmats (#16105)

### Ticket
#15061

### Problem description
`to_vector` / `from_vector` don't support some of the special cases,
which prevents more widespread adoption (distributing tensors across
a mesh of devices in particular).

### What's changed
* Support tilized layouts.
* Support bf4 / bf8 data types with auto-padding.
* Extended `chunk` / `concat` support for the added types.
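
A rough sketch of what auto-padding can look like on the host side is given below. It assumes that padding means rounding the last two dimensions up to a multiple of the tile size, and that tiles are 32x32; both are assumptions made for this example, and `kTileDim`, `round_up`, and `pad_to_tiles` are hypothetical names rather than functions from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed tile edge length; adjust if the target layout differs.
constexpr std::size_t kTileDim = 32;

// Rounds `value` up to the nearest multiple of `multiple`.
constexpr std::size_t round_up(std::size_t value, std::size_t multiple) {
    return (value + multiple - 1) / multiple * multiple;
}

// Copies a row-major rows x cols matrix into a buffer whose height and width
// are multiples of kTileDim, filling the extra elements with `pad_value`.
std::vector<float> pad_to_tiles(
    const std::vector<float>& src, std::size_t rows, std::size_t cols, float pad_value = 0.0f) {
    assert(src.size() == rows * cols);
    const std::size_t padded_rows = round_up(rows, kTileDim);
    const std::size_t padded_cols = round_up(cols, kTileDim);
    std::vector<float> dst(padded_rows * padded_cols, pad_value);
    for (std::size_t r = 0; r < rows; ++r) {
        const auto src_begin = src.begin() + static_cast<std::ptrdiff_t>(r * cols);
        const auto dst_begin = dst.begin() + static_cast<std::ptrdiff_t>(r * padded_cols);
        std::copy(src_begin, src_begin + static_cast<std::ptrdiff_t>(cols), dst_begin);
    }
    return dst;
}
```

A caller converting back with a `to_vector`-style function would need to remember the original `rows` and `cols` in order to strip the padding again.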

### Next steps
* Optimize certain operations on-device, such as tilization, whenever
possible.
* Perform auto-padding in tilized layouts / when using sharding.
* Switch pytensor logic to use the `from_vector` API.

### Checklist
- [X] [Post commit CI
passes](https://github.com/tenstorrent/tt-metal/actions/runs/12422597810)
- [X] New/Existing tests provide coverage for changes

---------

Co-authored-by: Oleg Milyutin <omilyutin-tt@tenstorrent.com>
arikTT pushed a commit that referenced this issue Dec 27, 2024
…rmats (#16105)
