Add TTNN ToLayout + To/FromDevice Ops around hoisted func calls #1925

Open · wants to merge 15 commits into main
Conversation

@vwellsTT (Contributor) commented Jan 21, 2025

Goal: integrate an end-to-end path to compile and execute specific ops, or sets of ops, on the CPU.

Context:

The entire task will be split into (tentatively) 7 PRs, as follows:

  1. Hoist specific ops into isolated funcs in a separate module
  2. Convert TTIR ops to linalg ops within the module of hoisted funcs
  3. Build a pipeline to lower linalg to llvm from existing conversion passes
  4. Translate LLVM Dialect into a dynamic library for packing into flatbuffer
  5. Generate helper functions so that we can call all of our hoisted funcs with a common signature
  6. Insert TTNN instructions to move operands to host before executing hoisted func, then back to device afterwards
  7. Update the ttir-to-ttnn and ttnn-to-flatbuffer pipelines to use the new passes, generate dylibs and embed them into output flatbuffers, and update the runtime to consume dylibs from flatbuffers

This PR implements the 6th bullet above. It changes the TTNN passes so that inputs to a CPU-hoisted func are moved from device to host (from_device), results of CPU-hoisted funcs are moved from host back to device (to_device), and inputs to CPU-hoisted funcs are converted to row-major layout. It also relaxes several assertions in the TTNN passes so they no longer reject func declarations without bodies, since hoisted funcs rely on these.

Example

Input:

  func.func @forward(%arg0: tensor<64x128xf32>, %arg1: tensor<64x128xf32>) -> tensor<64x128xf32> {
    %0 = tensor.empty() : tensor<64x128xf32>
    %1 = call @hoisted_func_decl(%arg0, %arg1, %0) {hoisted_call} : (tensor<64x128xf32>, tensor<64x128xf32>, tensor<64x128xf32>) -> tensor<64x128xf32>
    return %1 : tensor<64x128xf32>
  }

Output:

  func.func @forward(%arg0: tensor<64x128xf32, #ttnn_layout>, %arg1: tensor<64x128xf32, #ttnn_layout>) -> tensor<64x128xf32, #ttnn_layout> {
    %0 = "ttnn.get_device"() <{mesh_shape = #ttnn<mesh_shape 1x1>}> : () -> !tt.device<#device>
    %1 = "ttnn.empty"(%0) <{dtype = #tt.supportedDataTypes<f32>, layout = #ttnn.layout<row_major>, memory_config = #ttnn.memory_config<#system_memory, <<64x128>>>, shape = #ttnn.shape<64x128>}> : (!tt.device<#device>) -> tensor<64x128xf32, #ttnn_layout>
    %2 = "ttnn.from_device"(%arg0) : (tensor<64x128xf32, #ttnn_layout>) -> tensor<64x128xf32, #ttnn_layout>
    %3 = "ttnn.from_device"(%arg1) : (tensor<64x128xf32, #ttnn_layout>) -> tensor<64x128xf32, #ttnn_layout>
    %4 = "ttnn.from_device"(%1) : (tensor<64x128xf32, #ttnn_layout>) -> tensor<64x128xf32, #ttnn_layout>
    %5 = call @hoisted_func_decl(%2, %3, %4) : (tensor<64x128xf32, #ttnn_layout>, tensor<64x128xf32, #ttnn_layout>, tensor<64x128xf32, #ttnn_layout>) -> tensor<64x128xf32, #ttnn_layout>
    %6 = "ttnn.to_device"(%5, %0) : (tensor<64x128xf32, #ttnn_layout>, !tt.device<#device>) -> tensor<64x128xf32, #ttnn_layout>
    return %6 : tensor<64x128xf32, #ttnn_layout>
  }
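
For orientation, here is a rough sketch of how such a rewrite could be structured as a pattern. This is not the code from this PR: the ttnn op builder signatures, the hoisted_call attribute lookup, and the reuse of an existing ttnn.get_device are assumptions based on the example above, and the row-major ttnn.to_layout handling is omitted.

// Hypothetical sketch only: wrap a CPU-hoisted call so its operands are
// moved to host before the call and its results are moved back to device.
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/PatternMatch.h"
// (TTNN dialect op headers from this repo are assumed to be included.)

using namespace mlir;

struct WrapHoistedCallPattern : public OpRewritePattern<func::CallOp> {
  using OpRewritePattern<func::CallOp>::OpRewritePattern;

  LogicalResult matchAndRewrite(func::CallOp callOp,
                                PatternRewriter &rewriter) const override {
    // Only rewrite calls tagged as hoisted, and only rewrite them once.
    if (!callOp->hasAttr("hoisted_call") || callOp->getNumOperands() == 0 ||
        callOp->getOperand(0).getDefiningOp<ttnn::FromDeviceOp>())
      return failure();

    // Reuse an existing ttnn.get_device result in the enclosing func
    // (assumed to already exist, as in the example output above).
    auto funcOp = callOp->getParentOfType<func::FuncOp>();
    auto deviceOps = funcOp.getBody().getOps<ttnn::GetDeviceOp>();
    if (deviceOps.empty())
      return failure();
    Value device = (*deviceOps.begin())->getResult(0);

    // Move every operand to host before the call.
    SmallVector<Value> hostOperands;
    for (Value operand : callOp->getOperands())
      hostOperands.push_back(
          rewriter
              .create<ttnn::FromDeviceOp>(callOp.getLoc(), operand.getType(),
                                          operand)
              ->getResult(0));
    rewriter.modifyOpInPlace(callOp,
                             [&] { callOp->setOperands(hostOperands); });

    // Move each result back to device after the call, redirecting all other
    // users of the result to the to_device value.
    rewriter.setInsertionPointAfter(callOp);
    for (Value result : callOp->getResults()) {
      auto toDeviceOp = rewriter.create<ttnn::ToDeviceOp>(
          callOp.getLoc(), result.getType(), result, device,
          ttnn::MemoryConfigAttr{});
      rewriter.replaceAllUsesExcept(result, toDeviceOp->getResult(0),
                                    toDeviceOp.getOperation());
    }
    return success();
  }
};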

  // If this attr is null, it should mean device is on host; this should be
  // legal, so we propagate here.
  if (!attr) {
    return builder.getType<emitc::OpaqueAttr>("nullptr");
Contributor Author (@vwellsTT):
Note: I'm not strictly sure this is the correct way to do this. I talked with @nsmithtt about how we should treat tensors without this TensorMemoryLayoutAttr and we agreed it should be treated as optional, so this seems like a reasonable way to handle it for EmitC.

Contributor:
FYI @svuckovicTT, we need some way of constructing an empty host tensor that can act as storage for the CPU fallback destination.

@vwellsTT, I don't think we can return a nullptr attribute; I think host tensor creation probably needs a whole new path in EmitC that doesn't program tensor layout to begin with. I found some runtime code in runtime/lib/ttnn/runtime.cpp that creates TTNN host tensors in C++:

    auto tensor = std::make_shared<::ttnn::Tensor>(    
        createStorage<BorrowedStorage>(data.get(), numElements, dataType),    
        ::ttnn::Shape(small_vector_shape), utils::toTTNNDataType(dataType),    
        ::ttnn::Layout::ROW_MAJOR);

Contributor Author (@vwellsTT):

Oh 🤦‍♂️ yeah that's clearly not right, sorry. I meant to make the ::ttnn::MemoryConfig itself null, not an enum, but yeah I guess I need to spend a bit more time looking into how our emitC stuff works. (I made similar changes to ttrt recently, and I think many of the ::ttnn ops do accept std::optional, including ::ttnn::empty, so hopefully I can do something similar)

Contributor Author (@vwellsTT):

static ::ttnn::Tensor
createEmptyOnSingleDevice(ProgramContext &context, EmptyTensorConfig &config,
                          const ::tt::target::DeviceRef *deviceRef) {
  if (deviceRef && config.memoryConfig.has_value() &&
      config.memoryConfig.value().buffer_type !=
          ::ttnn::BufferType::SYSTEM_MEMORY) {
    ::ttnn::MeshDevice &subMesh = context.getSubMesh(deviceRef->global_id());
    LOG_ASSERT(subMesh.num_devices() == 1);
    ::ttnn::Device *device = subMesh.get_device_index(0);
    return ::ttnn::empty(config.shape, config.dtype, config.layout, device,
                         config.memoryConfig.value());
  }
  return ::ttnn::zeros(config.shape, config.dtype, config.layout);
}

I guess ideally I just want to mirror this logic from ttrt.
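
In case it helps, here is a minimal sketch of what mirroring that branching could look like on the generated C++ side, using only the ::ttnn calls quoted above (the helper name and the std::optional parameter are assumptions, not an existing API):

#include <optional>
// TTNN headers are assumed to be available in the generated translation unit.

// Hypothetical helper mirroring createEmptyOnSingleDevice from ttrt: fall
// back to a host tensor when there is no device or no device memory config.
static ::ttnn::Tensor createEmptyTensor(
    const ::ttnn::Shape &shape, ::ttnn::DataType dtype, ::ttnn::Layout layout,
    ::ttnn::Device *device,
    const std::optional<::ttnn::MemoryConfig> &memoryConfig) {
  if (device && memoryConfig.has_value() &&
      memoryConfig->buffer_type != ::ttnn::BufferType::SYSTEM_MEMORY) {
    // Device-resident empty tensor, as in the snippet above.
    return ::ttnn::empty(shape, dtype, layout, device, memoryConfig.value());
  }
  // Host tensor: nothing device-specific to program.
  return ::ttnn::zeros(shape, dtype, layout);
}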

@@ -349,13 +406,14 @@ class TTNNLayoutForceSystemMemoryRewriter : public OpRewritePattern<SrcOp> {
       appendInputSuffix(op.getLoc(), operand.getOperandNumber());
       std::optional<Value> layout = createToLayoutOp(
           rewriter, newLoc, operand.get(), BufferType::SystemMemory,
-          nullptr /* tensorMemoryLayoutAttr */, false /* tiled */);
+          nullptr /* desiredMemLayoutAttr */, false /* tiled */);
Contributor Author (@vwellsTT):
Really, tensorMemoryLayoutAttr is more descriptive, but desiredMemLayoutAttr is the actual param name, which IMO makes more sense for this type of comment.


  // Insert ToDevice op after the result.
  auto toDeviceOp = rewriter.create<ttnn::ToDeviceOp>(
      callOp.getLoc(), result.getType(), result, device,
      ttnn::MemoryConfigAttr{});
Contributor:
It should be OK to unconditionally call FromDevice even though the input tensor may already be on host (though I haven't tested). But it's not clear what to do afterwards; ideally we'd put the tensor back into the exact same memory space the input tensors were in. I think the way you have this written is the most correct thing to do, just want to spawn a discussion in case others have any ideas.

Contributor Author (@vwellsTT):
Yeah, this has worked fine in all my testing so far (though I confess I haven't tried anything complicated yet).
