#11403: SubMesh Support + Porting/Stamping T3K Tests to Galaxy #12962

Merged (1 commit) on Oct 3, 2024

Conversation

@cfjchu (Collaborator) commented Sep 21, 2024:

Ticket

Link to Github Issue

Problem description

We want to add support to instantiate submeshes on our mesh. Key reasons:

  1. Port additional T3000 tests over to Galaxy. Infrastructure support is needed to specify submeshes at arbitrary offsets in the SystemMesh.
  2. Scale throughput by stamping out / “tiling” our submesh across the full mesh, as sketched below.
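
As a rough illustration of the stamping idea in (2), a minimal sketch (the 8x4 Galaxy mesh shape and 2x2 submesh size are assumptions for illustration, not values taken from this PR):

# Sketch only: how many independent 2x2 submeshes a Galaxy-sized mesh could host.
# The 8x4 mesh shape below is an assumption for illustration.
galaxy_rows, galaxy_cols = 8, 4
submesh_rows, submesh_cols = 2, 2

num_stamps = (galaxy_rows // submesh_rows) * (galaxy_cols // submesh_cols)
print(f"A {submesh_rows}x{submesh_cols} submesh tiles the "
      f"{galaxy_rows}x{galaxy_cols} mesh {num_stamps} times")  # 8 independent submeshes

Each stamp can then run the same model independently; the example below shows the same mechanism on a T3000 with two 2x2 submeshes.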

What's changed

  1. T3000 Tests to TG: Supporting changes to enable all T3000 tests to be ported to TG
  2. SubMesh Support: Enable a user to instantiate a submesh (MeshDevice) from a mesh. This also specializes the MeshDevice type: {RowMajor, Ring, Line}.
  3. SubMesh Stamping: "Stamp-out"/tile the submesh across TG. llama is added as an example.

Example API Usage:

For example, create two 2x2 ring submeshes on a T3000 machine and run a ring all-gather on each.

import torch
import ttnn
from ttnn import ShardTensorToMesh

def run_model(submesh):
    # Build a host tensor where each 32-wide slice along dim 3 is filled with its device index.
    full_tensor = torch.ones((1, 1, 32, 32 * submesh.get_num_devices()), dtype=torch.bfloat16)
    for i in range(submesh.get_num_devices()):
        full_tensor[..., i * 32 : (i + 1) * 32] = i

    # Shard along dim 3 so each device in the submesh holds one 32-wide slice.
    ttnn_tensor = ttnn.from_torch(full_tensor, mesh_mapper=ShardTensorToMesh(submesh, dim=3))
    ttnn_tensor = ttnn.to_device(ttnn_tensor, submesh)

    # Ring all-gather within the submesh: every device ends up with the full tensor.
    ttnn_tensor = ttnn.all_gather(ttnn_tensor, dim=3, num_links=1)

    for device_tensor in ttnn.get_device_tensors(ttnn_tensor):
        device_tensor_torch = ttnn.to_torch(device_tensor)
        assert torch.all(device_tensor_torch == full_tensor)

# Carve the parent mesh into 2x2 ring submeshes and run the same model on each.
submesh_devices = mesh_device.get_submeshes((2, 2), ttnn.MeshType.Ring)
for submesh in submesh_devices:
    run_model(submesh)


Checklist

  • Post commit CI passes
  • Blackhole Post commit (if applicable)
  • Model regression CI testing passes (if applicable)
  • Device performance regression CI testing passes (if applicable)
  • New/Existing tests provide coverage for changes

@SeanNijjar (Contributor) commented:

Hey @cfjchu, I'm mildly concerned with this inversion of responsibility where the CCL must request the topology it "wants", because in general it shouldn't or wouldn't know what it wants.

Instead, what I thought the usage model would be is that we have a universal entry point for a given CCL op, and the CCL op is given APIs to query mesh information so it can figure out whether it's a ring or a line.

Is this still the case, except that the mesh device infra knows which op variant to dispatch to? I'm a little confused about this interface.

@cfjchu (Collaborator, Author) commented Sep 21, 2024:

> Hey @cfjchu, I'm mildly concerned with this inversion of responsibility where the CCL must request the topology it "wants", because in general it shouldn't or wouldn't know what it wants.
>
> Instead, what I thought the usage model would be is that we have a universal entry point for a given CCL op, and the CCL op is given APIs to query mesh information so it can figure out whether it's a ring or a line.
>
> Is this still the case, except that the mesh device infra knows which op variant to dispatch to? I'm a little confused about this interface.

@SeanNijjar This is a layered problem. So let's talk about what's currently in main. Today, our CCL all_gather op receives a list of devices and then tries to decide whether to use RING / LINE topology based on the number of devices (I'm ignoring ttnn.line_all_gather here and just talking about ttnn.all_gather). This "inversion of responsibility" is already happening.

In main, if it infers we should be using ring topology, then we assume the devices are ordered in ring order, and today that order is tied to the MeshDevice device creation. Part of my changes in this PR is to explicitly decouple how ring all-gather expects the devices to be ordered from how the MeshDevice is instantiated and internally orders its devices. This does not attempt to resolve the "inversion of responsibility" that already exists; this PR is needed as a stepping stone so that I can actually attempt to fix that problem.

My plan is to commonize ttnn.all_gather(..) and ttnn.line_all_gather(..) into ttnn.all_gather(..., topology={LINE, RING}) and put the TT-NN user in control of the decision: they declare what all-gather topology they want as a function parametrization. This is what we talked about here: #10915.
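
A minimal sketch of what that unified entry point could look like from the user's side (the topology keyword and the ttnn.Topology values below are the proposed shape, not the API as it exists in this PR):

import ttnn

def gather_with_declared_topology(ttnn_tensor, topology):
    # The model writer declares the topology explicitly instead of the op
    # inferring ring vs. line from the number of devices it receives.
    return ttnn.all_gather(ttnn_tensor, dim=3, num_links=1, topology=topology)

# e.g. gather_with_declared_topology(tensor, ttnn.Topology.Ring)
#      gather_with_declared_topology(tensor, ttnn.Topology.Linear)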

@@ -196,17 +197,19 @@ Tensor all_gather(
    if (num_devices == 1) {
        topology = all_gather_op::Topology::Linear;
    }
    auto mesh_device = MeshDevice::fetch_mesh_device(devices);
    auto ring_devices = mesh_device->get_ring_devices();
@SeanNijjar (Contributor) commented Sep 22, 2024:
This line is really my only current concern.

I agree that we should have only one allgather entrypoint which can temporarily be qualified with a topology. Then allgather can request a chip order accordingly. I also agree we need a way to have a consistent way to get an order of chips.

However, I wouldn't want the op to have to request a ring or line, but instead just be given a sequence of chips, and it decides if it can make a ring out of it or not. This way we get portability across Galaxy variants (those that are and aren't torused up), and the op can automatically take advantage of the torus without the model writer having to update all their CCL calls.

This is more in line with allgather forcing a line for a 2-chip invocation.
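
For concreteness, a minimal sketch of the alternative being described here, where the op is handed an ordered sequence of chips and decides on its own whether they close into a ring (choose_topology and are_adjacent are hypothetical helpers, not existing APIs):

def choose_topology(device_ids, are_adjacent):
    # are_adjacent(a, b) is a hypothetical predicate backed by mesh/fabric connectivity info.
    if len(device_ids) <= 2:
        return "Linear"  # mirrors allgather forcing a line for a 2-chip invocation
    closes_into_ring = all(
        are_adjacent(device_ids[i], device_ids[(i + 1) % len(device_ids)])
        for i in range(len(device_ids))
    )
    return "Ring" if closes_into_ring else "Linear"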

@cfjchu (Collaborator, Author) replied:

> I agree that we should have only one allgather entrypoint which can temporarily be qualified with a topology. Then allgather can request a chip order accordingly. I also agree we need a way to have a consistent way to get an order of chips.
>
> However, I wouldn't want the op to have to request a ring or line, but instead just be given a sequence of chips, and it decides if it can make a ring out of it or not.

These seem to conflict. A MeshTensor is mapped onto a set of devices, and an invocation of ttnn.all_gather(mesh_tensor) will land here. What we're doing here is requesting the ring devices from the mesh tensor.
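
To illustrate what "ring devices" means relative to creation order, a small sketch (the specific ordering is illustrative; it is not necessarily the exact order get_ring_devices returns):

# For a 2x4 mesh created in row-major order, one valid ring order walks the
# first row left-to-right and the second row right-to-left, so the last
# device is adjacent to the first.
row_major_order = [0, 1, 2, 3,
                   4, 5, 6, 7]
ring_order      = [0, 1, 2, 3,
                   7, 6, 5, 4]

assert sorted(ring_order) == sorted(row_major_order)  # same devices, different traversal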

@TT-BrianLiu (Contributor) left a comment:

Looks good
