
Runtime stitching / runtime weights #671

Closed
AleksKnezevic opened this issue Sep 12, 2024 · 2 comments · Fixed by #1301
AleksKnezevic commented Sep 12, 2024

Proposal for Front-End (FE) Interaction with the tt-mlir Runtime for On-Device Tensors

In this document, I will refer to the tt-mlir runtime as the “runtime” and the Forge and PJRT runtimes as “third-party runtimes” (TPRT).

Key Use Cases:

  • Multi-batch inference: avoid re-writing weights/constants to the device on every iteration.
  • Runtime stitching: the output of one model feeds as the input to another (including past cache) without routing IOs back through the host.
  • Training loops: weights remain on the device and can be updated across training iterations.

The following, I believe, addresses all these use cases:

  • TPRT should push parameters once, and they should remain live on the device.
  • TPRT pushes activations for each iteration.
  • TPRT can create two input tensors for double-buffered graph execution if needed.
  • TPRT is responsible for deallocating all tensors.
  • The runtime leaves outputs on the device until explicitly copied to the host by TPRT.
  • Outputs can be stored in either L1 or DRAM.
  • The compiler may not always know the layout of an on-device tensor (e.g., the FE may not recompile the same graph when executed twice, and activations could be in DRAM on the first execution and L1 cache on the second).
  • If there is a layout mismatch, the runtime converts the tensor to the required layout.
  • Tensors in L1 that are not needed by the current program should be moved to DRAM.

To accomplish this, I propose the following API:


Tensor toDevice(Tensor, Device, Layout)

  • Copies a tensor to the device with the specified layout.
  • Returns a handle to the on-device tensor.

Tensor toDevice(Tensor, Device)

  • Copies a tensor to the device, interleaved into DRAM.
  • Returns a handle to the on-device tensor.

Tensor toHost(Tensor)

  • Waits for tensor operations to complete and copies a tensor to the host.
  • The tensor remains allocated in device memory.

void wait(Tensor)

  • Blocks until all outstanding operations on the tensor have completed.

Layout getLayout(Binary, ProgramIndex, InputIndex)

  • Returns the layout of the input at the given index, as defined in the binary.


std::vector<Tensor> submit(Device, Binary, programIndex, inputTensors)

  • Executes the binary on the device.
  • Asserts that all input tensors are on the device.
  • Non-blocking: returns immediately; the caller can barrier if desired.
  • Calls toLayout on any input tensors that are not in the correct layout.
  • Moves any tensors in L1 that are not used by this program to DRAM.
  • Returns a list of on-device output tensors.

Tensor toLayout(Device, Tensor, Layout)

  • Returns tensor of required layout

void Deallocate(Tensor, Device)

  • Deallocates tensor on device
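
To make the flow concrete, here is a rough sketch of how a TPRT might drive this API for the multi-batch and runtime-stitching cases. The Tensor/Device/Binary/Layout types, the input ordering (activation at index 0, weights after it), and the surrounding helper structure are illustrative assumptions, not part of the proposal:

#include <vector>

// Push parameters once; they stay live on the device across iterations.
std::vector<Tensor> pushWeights(Device device, Binary binary,
                                const std::vector<Tensor> &hostWeights) {
  std::vector<Tensor> deviceWeights;
  for (size_t i = 0; i < hostWeights.size(); ++i) {
    Layout layout = getLayout(binary, /*programIndex=*/0, /*inputIndex=*/i + 1);
    deviceWeights.push_back(toDevice(hostWeights[i], device, layout));
  }
  return deviceWeights;
}

void runBatches(Device device, Binary modelA, Binary modelB,
                const std::vector<Tensor> &weightsA,
                const std::vector<Tensor> &weightsB,
                const std::vector<Tensor> &hostActivations) {
  for (const Tensor &hostAct : hostActivations) {
    // Activations are pushed per iteration; DRAM-interleaved is fine, since
    // submit() calls toLayout on anything not in the expected layout.
    Tensor act = toDevice(hostAct, device);

    std::vector<Tensor> inputsA = {act};
    inputsA.insert(inputsA.end(), weightsA.begin(), weightsA.end());
    std::vector<Tensor> outsA = submit(device, modelA, /*programIndex=*/0, inputsA);

    // Runtime stitching: model A's on-device output feeds model B directly,
    // without a round trip through the host.
    std::vector<Tensor> inputsB = {outsA[0]};
    inputsB.insert(inputsB.end(), weightsB.begin(), weightsB.end());
    std::vector<Tensor> outsB = submit(device, modelB, /*programIndex=*/0, inputsB);

    // toHost barriers on the tensor and copies it back; the device copy stays
    // allocated until the TPRT deallocates it.
    Tensor result = toHost(outsB[0]);
    (void)result;

    // The TPRT owns deallocation of everything it no longer needs.
    Deallocate(act, device);
    for (const Tensor &t : outsA) Deallocate(t, device);
    for (const Tensor &t : outsB) Deallocate(t, device);
  }
}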
jnie-TT commented Sep 12, 2024

Hey @AleksKnezevic, this looks great! I had a couple of minor comments.

void toLayout(Device, Tensor, Layout)

Converts tensor layout

I don't think we can convert the layout of a Tensor in place; TTNN ops always allocate and return a new tensor. Maybe we could update this to:

Tensor toLayout(Tensor, Layout)

That would solely convert the layout of the tensor. If we ever want to convert the layout and move tensors to the device or across devices, we can call the toDevice API.
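
For example (the names here are purely illustrative):

// toLayout allocates and returns a new tensor in the requested layout;
// the input tensor is left untouched (no in-place conversion).
Tensor converted = toLayout(hostTensor, requiredLayout);

// Converting the layout while also moving to (or across) devices would
// instead go through the toDevice API.
Tensor onDevice = toDevice(hostTensor, device, requiredLayout);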


void Deallocate(Tensor, Device)

Deallocates tensor on device

We should probably add a force flag that signals to ttnn whether or not to force-deallocate a tensor. By default, it would deallocate only when the reference count reaches 0.
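
Something along these lines (the flag name and default are just a suggestion):

// With force = false (the default), TTNN frees the buffer only once the
// tensor's reference count reaches 0; force = true deallocates immediately.
void Deallocate(Tensor tensor, Device device, bool force = false);

// Usage:
Deallocate(activation, device);                   // respects the reference count
Deallocate(staleOutput, device, /*force=*/true);  // free now, unconditionally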


Layout getLayout(Device, Binary, ProgramIndex, InputIndex)

  • Returns the layout of the input at the given index, as defined in the binary.

We probably don't need the Device for this API.

@AleksKnezevic
Thanks @jnie-TT, I modified the API above. As for deallocate, if the user (through TTRT) is deallocating a tensor, then it should be fine to force-deallocate in TTNN.
