#11403: SubMesh Support + Porting/Stamping T3K Tests to Galaxy #12962

Merged (1 commit) on Oct 3, 2024

Conversation

@cfjchu (Collaborator) commented Sep 21, 2024:

Ticket

Link to Github Issue

Problem description

We want to add support to instantiate submeshes on our mesh. Key reasons:

  1. Port additional T3000 tests over to Galaxy. Infrastructure support is needed to specify submeshes at arbitrary offsets in the SystemMesh.
  2. Scale throughput by stamping out / “tiling” our submesh across the full mesh, as sketched below.
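
As a rough illustration of the stamping idea in (2), a minimal sketch (the 8x4 Galaxy mesh shape and 2x2 submesh size are assumptions for illustration, not values taken from this PR):

# Sketch only: how many independent 2x2 submeshes a Galaxy-sized mesh could host.
# The 8x4 mesh shape below is an assumption for illustration.
galaxy_rows, galaxy_cols = 8, 4
submesh_rows, submesh_cols = 2, 2

num_stamps = (galaxy_rows // submesh_rows) * (galaxy_cols // submesh_cols)
print(f"A {submesh_rows}x{submesh_cols} submesh tiles the "
      f"{galaxy_rows}x{galaxy_cols} mesh {num_stamps} times")  # 8 independent submeshes

Each stamp can then run the same model independently; the example below shows the same mechanism on a T3000 with two 2x2 submeshes.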

What's changed

  1. T3000 Tests to TG: Supporting changes to enable all T3000 tests to be ported to TG
  2. SubMesh Support: Enable a user to instantiate a submesh (MeshDevice) from a mesh. This also specializes the MeshDevice type: {RowMajor, Ring, Line}.
  3. SubMesh Stamping: "Stamp-out"/tile the submesh across TG. llama is added as an example.

Example API Usage:

For example, create two 2x2 ring submeshes on a T3000 machine and run a ring all-gather on each.

import torch
import ttnn
from ttnn import ShardTensorToMesh

def run_model(submesh):
    # Build a host tensor where each 32-wide slice along dim 3 is filled with its device index.
    full_tensor = torch.ones((1, 1, 32, 32 * submesh.get_num_devices()), dtype=torch.bfloat16)
    for i in range(submesh.get_num_devices()):
        full_tensor[..., i * 32 : (i + 1) * 32] = i

    # Shard along dim 3 so each device in the submesh holds one 32-wide slice.
    ttnn_tensor = ttnn.from_torch(full_tensor, mesh_mapper=ShardTensorToMesh(submesh, dim=3))
    ttnn_tensor = ttnn.to_device(ttnn_tensor, submesh)

    # Ring all-gather within the submesh: every device ends up with the full tensor.
    ttnn_tensor = ttnn.all_gather(ttnn_tensor, dim=3, num_links=1)

    for device_tensor in ttnn.get_device_tensors(ttnn_tensor):
        device_tensor_torch = ttnn.to_torch(device_tensor)
        assert torch.all(device_tensor_torch == full_tensor)

# Carve the parent mesh into 2x2 ring submeshes and run the same model on each.
submesh_devices = mesh_device.get_submeshes((2, 2), ttnn.MeshType.Ring)
for submesh in submesh_devices:
    run_model(submesh)


Checklist

  • Post commit CI passes
  • Blackhole Post commit (if applicable)
  • Model regression CI testing passes (if applicable)
  • Device performance regression CI testing passes (if applicable)
  • New/Existing tests provide coverage for changes

@SeanNijjar (Contributor) commented:

Hey @cfjchu, I'm mildly concerned with this inversion of responsibility where the CCL must request the topology it "wants", because in general it shouldn't or wouldn't know what it wants.

Instead, what I thought the usage model would be is that we have a universal entry point for a given CCL op, and the CCL op is given APIs to query mesh information so it can figure out whether it's a ring or a line.

Is this still the case, except that the mesh device infra knows which op variant to dispatch to? I'm a little confused about this interface.

@cfjchu (Collaborator, Author) commented Sep 21, 2024:

> Hey @cfjchu, I'm mildly concerned with this inversion of responsibility where the CCL must request the topology it "wants", because in general it shouldn't or wouldn't know what it wants.
>
> Instead, what I thought the usage model would be is that we have a universal entry point for a given CCL op, and the CCL op is given APIs to query mesh information so it can figure out whether it's a ring or a line.
>
> Is this still the case, except that the mesh device infra knows which op variant to dispatch to? I'm a little confused about this interface.

@SeanNijjar This is a layered problem. So let's talk about what's currently in main. Today, our CCL all_gather op receives a list of devices and then tries to decide whether to use RING / LINE topology based on the number of devices (I'm ignoring ttnn.line_all_gather here and just talking about ttnn.all_gather). This "inversion of responsibility" is already happening.

In main, if it infers we should be using ring topology, then we assume the devices are ordered in ring order, and today that order is tied to the MeshDevice device creation. Part of my changes in this PR is to explicitly decouple how ring all-gather expects the devices to be ordered from how the MeshDevice is instantiated and internally orders its devices. This does not attempt to resolve the "inversion of responsibility" that already exists; this PR is needed as a stepping stone so that I can actually attempt to fix that problem.

My plan is to commonize ttnn.all_gather(..) and ttnn.line_all_gather(..) into ttnn.all_gather(..., topology={LINE, RING}) and put the TT-NN user in control of the decision: they declare what all-gather topology they want as a function parametrization. This is what we talked about here: #10915.
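
A minimal sketch of what that unified entry point could look like from the user's side (the topology keyword and the ttnn.Topology values below are the proposed shape, not the API as it exists in this PR):

import ttnn

def gather_with_declared_topology(ttnn_tensor, topology):
    # The model writer declares the topology explicitly instead of the op
    # inferring ring vs. line from the number of devices it receives.
    return ttnn.all_gather(ttnn_tensor, dim=3, num_links=1, topology=topology)

# e.g. gather_with_declared_topology(tensor, ttnn.Topology.Ring)
#      gather_with_declared_topology(tensor, ttnn.Topology.Linear)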

@@ -196,17 +197,19 @@ Tensor all_gather(
    if (num_devices == 1) {
        topology = all_gather_op::Topology::Linear;
    }
    auto mesh_device = MeshDevice::fetch_mesh_device(devices);
    auto ring_devices = mesh_device->get_ring_devices();
@SeanNijjar (Contributor) commented Sep 22, 2024:
This line is really my only current concern.

I agree that we should have only one allgather entrypoint which can temporarily be qualified with a topology. Then allgather can request a chip order accordingly. I also agree we need a way to have a consistent way to get an order of chips.

However, I wouldn't want the op to have to request a ring or line, but instead just be given a sequence of chips, and it decides if it can make a ring out of it or not. This way we get portability across Galaxy variants (those that are and aren't torused up), and the op can automatically take advantage of the torus without the model writer having to update all their CCL calls.

This is more in line with allgather forcing a line for a 2-chip invocation.
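
For concreteness, a minimal sketch of the alternative being described here, where the op is handed an ordered sequence of chips and decides on its own whether they close into a ring (choose_topology and are_adjacent are hypothetical helpers, not existing APIs):

def choose_topology(device_ids, are_adjacent):
    # are_adjacent(a, b) is a hypothetical predicate backed by mesh/fabric connectivity info.
    if len(device_ids) <= 2:
        return "Linear"  # mirrors allgather forcing a line for a 2-chip invocation
    closes_into_ring = all(
        are_adjacent(device_ids[i], device_ids[(i + 1) % len(device_ids)])
        for i in range(len(device_ids))
    )
    return "Ring" if closes_into_ring else "Linear"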

@cfjchu (Collaborator, Author) replied:

> I agree that we should have only one allgather entrypoint which can temporarily be qualified with a topology. Then allgather can request a chip order accordingly. I also agree we need a way to have a consistent way to get an order of chips.
>
> However, I wouldn't want the op to have to request a ring or line, but instead just be given a sequence of chips, and it decides if it can make a ring out of it or not.

These seem to conflict. A MeshTensor is mapped onto a set of devices, and an invocation of ttnn.all_gather(mesh_tensor) will land here. What we're doing here is requesting the ring devices from the mesh tensor.
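
To illustrate what "ring devices" means relative to creation order, a small sketch (the specific ordering is illustrative; it is not necessarily the exact order get_ring_devices returns):

# For a 2x4 mesh created in row-major order, one valid ring order walks the
# first row left-to-right and the second row right-to-left, so the last
# device is adjacent to the first.
row_major_order = [0, 1, 2, 3,
                   4, 5, 6, 7]
ring_order      = [0, 1, 2, 3,
                   7, 6, 5, 4]

assert sorted(ring_order) == sorted(row_major_order)  # same devices, different traversal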

@TT-BrianLiu (Contributor) left a comment:

Looks good
