Add TTNN ToLayout + To/FromDevice Ops around hoisted func calls #1925

Open · wants to merge 15 commits into main
Conversation

@vwellsTT (Contributor) commented Jan 21, 2025

Goal: integrate an end-to-end path to compile and execute specific ops, or sets of ops, on the CPU.

Context:

The entire task will be split into (tentatively) 7 PRs, as follows:

  1. Hoist specific ops into isolated funcs in a separate module
  2. Convert TTIR ops to linalg ops within the module of hoisted funcs
  3. Build a pipeline to lower linalg to llvm from existing conversion passes
  4. Translate LLVM Dialect into a dynamic library for packing into flatbuffer
  5. Generate helper functions so that we can call all of our hoisted funcs with a common signature
  6. Insert TTNN instructions to move operands to host before executing hoisted func, then back to device afterwards
  7. Update the ttir-to-ttnn and ttnn-to-flatbuffer pipelines to use the new passes, generate dylibs and embed them into output flatbuffers, and update the runtime to consume dylibs from flatbuffers

This PR implements the 6th bullet above. It changes the TTNN passes so that inputs to a CPU-hoisted func are moved from device to host (from_device), results of CPU-hoisted funcs are moved from host back to device (to_device), and inputs to CPU-hoisted funcs are converted to row-major layout. It also relaxes several assertions in the TTNN passes so they no longer reject func declarations without bodies, since hoisted funcs rely on these.

Example

Input:

  func.func @forward(%arg0: tensor<64x128xf32>, %arg1: tensor<64x128xf32>) -> tensor<64x128xf32> {
    %0 = tensor.empty() : tensor<64x128xf32>
    %1 = call @hoisted_func_decl(%arg0, %arg1, %0) {hoisted_call} : (tensor<64x128xf32>, tensor<64x128xf32>, tensor<64x128xf32>) -> tensor<64x128xf32>
    return %1 : tensor<64x128xf32>
  }

Output:

  func.func @forward(%arg0: tensor<64x128xf32, #ttnn_layout>, %arg1: tensor<64x128xf32, #ttnn_layout>) -> tensor<64x128xf32, #ttnn_layout> {
    %0 = "ttnn.get_device"() <{mesh_shape = #ttnn<mesh_shape 1x1>}> : () -> !tt.device<#device>
    %1 = "ttnn.empty"(%0) <{dtype = #tt.supportedDataTypes<f32>, layout = #ttnn.layout<row_major>, memory_config = #ttnn.memory_config<#system_memory, <<64x128>>>, shape = #ttnn.shape<64x128>}> : (!tt.device<#device>) -> tensor<64x128xf32, #ttnn_layout>
    %2 = "ttnn.from_device"(%arg0) : (tensor<64x128xf32, #ttnn_layout>) -> tensor<64x128xf32, #ttnn_layout>
    %3 = "ttnn.from_device"(%arg1) : (tensor<64x128xf32, #ttnn_layout>) -> tensor<64x128xf32, #ttnn_layout>
    %4 = "ttnn.from_device"(%1) : (tensor<64x128xf32, #ttnn_layout>) -> tensor<64x128xf32, #ttnn_layout>
    %5 = call @hoisted_func_decl(%2, %3, %4) : (tensor<64x128xf32, #ttnn_layout>, tensor<64x128xf32, #ttnn_layout>, tensor<64x128xf32, #ttnn_layout>) -> tensor<64x128xf32, #ttnn_layout>
    %6 = "ttnn.to_device"(%5, %0) : (tensor<64x128xf32, #ttnn_layout>, !tt.device<#device>) -> tensor<64x128xf32, #ttnn_layout>
    return %6 : tensor<64x128xf32, #ttnn_layout>
  }
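
For orientation, here is a rough sketch of how such a rewrite could be structured as a pattern. This is not the code from this PR: the ttnn op builder signatures, the hoisted_call attribute lookup, and the reuse of an existing ttnn.get_device are assumptions based on the example above, and the row-major ttnn.to_layout handling is omitted.

// Hypothetical sketch only: wrap a CPU-hoisted call so its operands are
// moved to host before the call and its results are moved back to device.
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/PatternMatch.h"
// (TTNN dialect op headers from this repo are assumed to be included.)

using namespace mlir;

struct WrapHoistedCallPattern : public OpRewritePattern<func::CallOp> {
  using OpRewritePattern<func::CallOp>::OpRewritePattern;

  LogicalResult matchAndRewrite(func::CallOp callOp,
                                PatternRewriter &rewriter) const override {
    // Only rewrite calls tagged as hoisted, and only rewrite them once.
    if (!callOp->hasAttr("hoisted_call") || callOp->getNumOperands() == 0 ||
        callOp->getOperand(0).getDefiningOp<ttnn::FromDeviceOp>())
      return failure();

    // Reuse an existing ttnn.get_device result in the enclosing func
    // (assumed to already exist, as in the example output above).
    auto funcOp = callOp->getParentOfType<func::FuncOp>();
    auto deviceOps = funcOp.getBody().getOps<ttnn::GetDeviceOp>();
    if (deviceOps.empty())
      return failure();
    Value device = (*deviceOps.begin())->getResult(0);

    // Move every operand to host before the call.
    SmallVector<Value> hostOperands;
    for (Value operand : callOp->getOperands())
      hostOperands.push_back(
          rewriter
              .create<ttnn::FromDeviceOp>(callOp.getLoc(), operand.getType(),
                                          operand)
              ->getResult(0));
    rewriter.modifyOpInPlace(callOp,
                             [&] { callOp->setOperands(hostOperands); });

    // Move each result back to device after the call, redirecting all other
    // users of the result to the to_device value.
    rewriter.setInsertionPointAfter(callOp);
    for (Value result : callOp->getResults()) {
      auto toDeviceOp = rewriter.create<ttnn::ToDeviceOp>(
          callOp.getLoc(), result.getType(), result, device,
          ttnn::MemoryConfigAttr{});
      rewriter.replaceAllUsesExcept(result, toDeviceOp->getResult(0),
                                    toDeviceOp.getOperation());
    }
    return success();
  }
};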

  // If this attr is null, it should mean device is on host; this should be
  // legal, so we propagate here.
  if (!attr) {
    return builder.getType<emitc::OpaqueAttr>("nullptr");
Contributor Author (@vwellsTT):
Note: I'm not strictly sure this is the correct way to do this. I talked with @nsmithtt about how we should treat tensors without this TensorMemoryLayoutAttr and we agreed it should be treated as optional, so this seems like a reasonable way to handle it for EmitC.

Contributor:
FYI @svuckovicTT, we need some way of constructing an empty host tensor that can act as storage for the CPU fallback destination.

@vwellsTT, I don't think we can return a nullptr attribute; I think host tensor creation probably needs a whole new path in EmitC that doesn't program tensor layout to begin with. I found some runtime code in runtime/lib/ttnn/runtime.cpp that creates TTNN host tensors in C++:

    auto tensor = std::make_shared<::ttnn::Tensor>(    
        createStorage<BorrowedStorage>(data.get(), numElements, dataType),    
        ::ttnn::Shape(small_vector_shape), utils::toTTNNDataType(dataType),    
        ::ttnn::Layout::ROW_MAJOR);

Contributor Author (@vwellsTT):

Oh 🤦‍♂️ yeah that's clearly not right, sorry. I meant to make the ::ttnn::MemoryConfig itself null, not an enum, but yeah I guess I need to spend a bit more time looking into how our emitC stuff works. (I made similar changes to ttrt recently, and I think many of the ::ttnn ops do accept std::optional, including ::ttnn::empty, so hopefully I can do something similar)

Contributor Author (@vwellsTT):

static ::ttnn::Tensor
createEmptyOnSingleDevice(ProgramContext &context, EmptyTensorConfig &config,
                          const ::tt::target::DeviceRef *deviceRef) {
  if (deviceRef && config.memoryConfig.has_value() &&
      config.memoryConfig.value().buffer_type !=
          ::ttnn::BufferType::SYSTEM_MEMORY) {
    ::ttnn::MeshDevice &subMesh = context.getSubMesh(deviceRef->global_id());
    LOG_ASSERT(subMesh.num_devices() == 1);
    ::ttnn::Device *device = subMesh.get_device_index(0);
    return ::ttnn::empty(config.shape, config.dtype, config.layout, device,
                         config.memoryConfig.value());
  }
  return ::ttnn::zeros(config.shape, config.dtype, config.layout);
}

I guess ideally I just want to mirror this logic from ttrt.
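
In case it helps, here is a minimal sketch of what mirroring that branching could look like on the generated C++ side, using only the ::ttnn calls quoted above (the helper name and the std::optional parameter are assumptions, not an existing API):

#include <optional>
// TTNN headers are assumed to be available in the generated translation unit.

// Hypothetical helper mirroring createEmptyOnSingleDevice from ttrt: fall
// back to a host tensor when there is no device or no device memory config.
static ::ttnn::Tensor createEmptyTensor(
    const ::ttnn::Shape &shape, ::ttnn::DataType dtype, ::ttnn::Layout layout,
    ::ttnn::Device *device,
    const std::optional<::ttnn::MemoryConfig> &memoryConfig) {
  if (device && memoryConfig.has_value() &&
      memoryConfig->buffer_type != ::ttnn::BufferType::SYSTEM_MEMORY) {
    // Device-resident empty tensor, as in the snippet above.
    return ::ttnn::empty(shape, dtype, layout, device, memoryConfig.value());
  }
  // Host tensor: nothing device-specific to program.
  return ::ttnn::zeros(shape, dtype, layout);
}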

@@ -349,13 +406,14 @@ class TTNNLayoutForceSystemMemoryRewriter : public OpRewritePattern<SrcOp> {
       appendInputSuffix(op.getLoc(), operand.getOperandNumber());
       std::optional<Value> layout = createToLayoutOp(
           rewriter, newLoc, operand.get(), BufferType::SystemMemory,
-          nullptr /* tensorMemoryLayoutAttr */, false /* tiled */);
+          nullptr /* desiredMemLayoutAttr */, false /* tiled */);
Contributor Author (@vwellsTT):
Really, tensorMemoryLayoutAttr is more descriptive, but desiredMemLayoutAttr is the actual param name, which IMO makes more sense for this type of comment.


  // Insert ToDevice op after the result.
  auto toDeviceOp = rewriter.create<ttnn::ToDeviceOp>(
      callOp.getLoc(), result.getType(), result, device,
      ttnn::MemoryConfigAttr{});
Contributor:
It should be OK to unconditionally call FromDevice even though the input tensor may already be on host (though I haven't tested). But it's not clear what to do afterwards; ideally we'd put the tensor back into the exact same memory space the input tensors were in. I think the way you have this written is the most correct thing to do, just want to spawn a discussion in case others have any ideas.

Contributor Author (@vwellsTT):
Yeah, this has worked fine in all my testing so far (though I confess I haven't tried anything complicated yet).
