-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#15061: Implement multi-device tensor distribution APIs in terms of C++ ttnn tensors #15886
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great wrok, we're getting close to API parity between Python / C++ side for multi-device!
using ::tt::tt_metal::Tensor; | ||
|
||
// copypaste from deprecated tensor pybinds ttnn | ||
tt::tt_metal::OwnedBuffer create_owned_buffer(const std::vector<float>& data, DataType data_type) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is data - a raw data or already partially prepared data? e.g. is it supposed to be tiled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't raw data, and it is in row major layout. Does it make sense to allow providing buffer in non row major layout? Same goes for to_vector
- I am assuming the primary use case is getting the data back in row major view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same goes for to_vector - I am assuming the primary use case is getting the data back in row major view?
Yes
Does it make sense to allow providing buffer in non row major layout?
If you take data that was already prepared but is still on host.
e.g. getting data from another tensor or getting some data that was serialized
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. We can provide this as part of the Tensor(Data data, const TensorSpec&)
as per the comment below?
|
||
template <typename T> | ||
Tensor create_owned_tensor( | ||
std::vector<T> data, const ttnn::Shape& shape, tt::tt_metal::DataType data_type, tt::tt_metal::Layout layout) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no ttnn::Shape please. anything prevents us from making this ttnn::SimpleShape?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
must be Tensor(vector raw_data, const TensorSpec&)
raw data in this case would mean that some transformation must happen to align raw_data to the requested spec.
we can additionally have Tensor(Data data, const TensorSpec&)
for cases where we want to create a Tensor from a data which was already processed and user knows its aligned with the spec. Data in this case is just a type that provides a semantic distinction between raw and prepared vector. Maybe its a HostBuffer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
anything prevents us from making this ttnn::SimpleShape?
No reason - had this change locally:)
+1 to TensorSpec
. I'm adding some documentation to clarify that the input has to match the tensor volume, and that we are assuming the row major layout.
For the second use case where we have the prepared / pre-processed data, we can do this in a follow up, this seems a somewhat separate effort.
cc9e599
to
eb71e44
Compare
Tensor input_tensor = | ||
from_vector(test_data, GetTensorSpec(ttnn::SimpleShape{1, num_rows, num_cols, 3}, DataType::FLOAT32)); | ||
|
||
auto mapper = api::shard_tensor_2d_to_mesh_mapper( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this api::
namespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, agree, I removed. In general, we use this pattern frequently, which I don't quite understand:
namespace ttnn::distributed::api {
// Declare stuff....
}
namespace ttnn::distributed {
using namespace api;
}
I can see how this might be useful to navigate through very large headers, but then would it be better to simply split them up? Other thoughts?
tests/ttnn/unit_tests/gtests/tensor/test_distributed_tensor.cpp
Outdated
Show resolved
Hide resolved
using ::ttnn::experimental::xtensor::to_vector; | ||
|
||
const std::vector<ttnn::SimpleShape>& GetShapesForTest() { | ||
static auto* shapes = new std::vector<ttnn::SimpleShape>{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need to use new here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't, just muscle memory to never use static storage for types with non-trivial dtor:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be corrected, there is really no need for new
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm wait - what is the issue exactly? The pattern is just so that we don't have a static variable to run a non-trivial dtor. See https://google.github.io/styleguide/cppguide.html#Static_and_Global_Variables
Here of course it does not matter and I can just return a copy. No strong opinion, but was wondering where are you coming from:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good for me. just some codestyle related comments.
c8cc96d
to
3bc5157
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀 Looks great, just kick off the model regressions so we see whether your change had any impact on those serialized tensors and open up issues for the follow-up items in the PR!
// library for efficient host-side operations. | ||
|
||
// Splits a tensor into chunks along the specified dimension. | ||
std::vector<tt::tt_metal::Tensor> chunk(const tt::tt_metal::Tensor& tensor, int num_chunks, int dim = 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sminakov-tt FYI since we talked about this yesterday. This is one of the primitives used to build the user-facing multi-device sharding (this adds C++ parity now).
ttnn::SimpleShape shape{128, 128}; | ||
|
||
auto input = arange<TypeParam>(0, shape.volume(), 1); | ||
// TODO: Support this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whats the issue with this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to encourage on-host tilization, so ideally from_vector
should accept a device as well... There is just too many problems with tilization now, so I'd rather have this in and work on tilization / performance in a follow up.
ttnn::distributed::api::close_mesh_device(m_mesh_device); | ||
ttnn::distributed::close_mesh_device(m_mesh_device); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
namespace ttnn::distributed { | ||
|
||
// Mapper interface that distributes a host tensor onto a multi-device configuration. | ||
class TensorToMesh { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would love to hop on call to try finding better names.
Something feels very off about current names of classes and methods and it is not clear from the initial glance how this must be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These come for the Python world, but yeah, not ideal. Let's do a follow up with naming and possible API unification? I think we can do better in general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
// Elements in the buffer will be stored in row-major order. The type of the elements has to match that of the | ||
// `Tensor`. | ||
template <typename T> | ||
std::vector<T> to_vector() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is only currently supported for a subset of cases. Again, afaik tile layout is not handled, block float formats not supported, sharding, including logical sharding is not supported. Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a clarifying comment. Does tilized std::vector
make sense though? For bf4/bf8, my plan was to just convert them into float and return a row-major slice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
yes, tilized output does not makes sense. returning floats is right. but it is not handled if I got it right
std::vector<xt::xarray<T>> chunk(const xt::xarray<T>& tensor, int num_chunks, int dim = 0); | ||
|
||
// Concatenates a list of tensors along the specified dimension. | ||
tt::tt_metal::Tensor concatenate(const std::vector<tt::tt_metal::Tensor>& tensors, int dim = 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider concat
or join
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two other apis operate on xt::xarray, but this one operates on a Tensor. This seem a bit off to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe (just maybe) this is another Tensor constructor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two other apis operate on xt::xarray, but this one operates on a Tensor.
Yeah, likely we will move tt-train off the one that relies on xt::array
. I'm not too worried for now, as both of these rely on the same impl.
I'm not sure about the ctor, this feels more like an utility function (which also shout sit next to chunk
IMO). Big +1 that we need to have a better organization though.
And sure, concat
sounds good. No strong opinion here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(thought) I imagine a newcomer looking at ttnn/tensor/xtensor and getting confused. xtensor is not like boost, it is not a well known library/typename. Instead of calling this folder xtensor, I'd rather focus on the goal it helps to achieve. But I also understand that its goal seem to be to bridge tensor with xtensor... Just thinking out load.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal is to sandbox the xtensor stuff:) I hope the namespace makes it clear it is a less stable API.
a910a0d
to
981a34f
Compare
981a34f
to
a955137
Compare
…mpatibility with tensor serialization
Ticket
#15755
Problem description
Multi-device tensor distribution currently works through
distributed.py
, which relies on PyTorch libraries to perform sharding / concatenation.What's changed
chunk
,concatenate
functions along with some conversion utils, and the relevant tests.distributed_tensor.hpp
header with the multi-device distribution APIs.In follow up PRs:
from_vector
/to_vector
and other overloads.pytensor.cpp
to using the new APIs.Checklist