[MLIR] API Request: Query location of device #10745

Closed
jnie-TT opened this issue Jul 25, 2024 · 7 comments

@jnie-TT
Contributor

jnie-TT commented Jul 25, 2024

Requesting an API that can be used to query the location of a ttnn device.

Usage: contributes to creating a system description in Uhuru, which includes information about chip locations.

Related ticket: #10671

@nsmithtt
Contributor

Just adding an asterisk: to be as flexible as possible, we want this API on the TTMetal device. The TTNN device could potentially just forward the info from the TTMetal device and/or modify the API however it sees fit.

We would like to know not only chip locations but also their connectivity. This could be a list of edges, where each edge is a pair of device ids plus a placeholder for additional info, like BW info. This API should also account for ethernet streams that are exclusively reserved for fast dispatch, i.e. not advertise those as available connections.
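
To make the request concrete, here is a rough sketch of what such an edge list could look like. Every name in it (ChipEdge, ChipConnectivity, bandwidth_gbps, add_edge, neighbors) is hypothetical and not an existing TTMetal or TTNN API, and chip_id_t is assumed to be a plain integer.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Stand-in for the runtime's chip id type (assumed to be an integer here).
using chip_id_t = int;

// One usable chip-to-chip connection. Hypothetical type, for illustration only.
struct ChipEdge {
    chip_id_t chip_a;
    chip_id_t chip_b;
    // Placeholder for additional per-link info such as bandwidth; empty when unknown.
    std::optional<double> bandwidth_gbps;
};

// Connectivity as a flat list of edges. Links reserved for fast dispatch
// would simply never be added to this list.
struct ChipConnectivity {
    std::vector<ChipEdge> edges;

    void add_edge(chip_id_t a, chip_id_t b, std::optional<double> bw = std::nullopt) {
        assert(a != b && "an edge connects two different chips");
        edges.push_back(ChipEdge{a, b, bw});
    }

    // Chips directly reachable from `chip` over the listed edges.
    std::vector<chip_id_t> neighbors(chip_id_t chip) const {
        std::vector<chip_id_t> out;
        for (const ChipEdge& e : edges) {
            if (e.chip_a == chip) {
                out.push_back(e.chip_b);
            } else if (e.chip_b == chip) {
                out.push_back(e.chip_a);
            }
        }
        return out;
    }
};
```

Keeping the extra info optional means the API could ship with ids only and gain bandwidth data later without changing its shape.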

@aliuTT
Contributor

aliuTT commented Aug 1, 2024

We're doing a round of refactoring for the Device APIs, and will probably pull the useful user-facing APIs out into an API list like host_api.hpp. I'll keep these asks in mind. A few comments/questions for now:

  • What is chip location and why do you need it? Will it be used to infer any connectivity?
  • We have some APIs for querying connectivity in between chips, although documentation is very poor. That is a miss we'll address when pulling together the Device APIs.
  • APIs already ignore dispatch reserved connections.
  • BW info is interesting, do you want a static number? Or want runtime to run real workloads and collect numbers into a file?

@aliuTT aliuTT added the P2 label Aug 1, 2024
@aliuTT aliuTT changed the title from "API Request: Query location of device" to "[MLIR] API Request: Query location of device" Aug 1, 2024
@nsmithtt
Contributor

nsmithtt commented Aug 1, 2024

Hey @aliuTT thanks, I have a couple more runtime API requests I'll file those now! Comments inline:

  • What is chip location and why do you need it? Will it be used to infer any connectivity?

Chip location as in its physical coordinate in the galaxy system. This should include rack/shelf whatever nomenclature we're using to distinguish chips between galaxies (like on a TGG).

  • We have some APIs for querying connectivity in between chips, although documentation is very poor. That is a miss we'll address when pulling together the Device APIs.

Sounds good.

  • APIs already ignore dispatch reserved connections.

So I think we're asking for the opposite: we want to know the set of usable links, i.e. these APIs should filter out connections / links / chips / etc. that are reserved for fast dispatch and only present usable ones.

  • BW info is interesting, do you want a static number? Or want runtime to run real workloads and collect numbers into a file?

Probably a static number of theoretical max (or max that we've physically measured) is the way to go. And then different users could implement whatever heuristic to better model their real world usage.
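
As a rough illustration of that split (the runtime reports one static number, the user applies their own model), something like the sketch below could work. LinkBandwidth, estimated_gbps, and estimated_transfer_seconds are hypothetical names, and the unit (gigabits per second) is an assumption, not something the runtime defines today.

```cpp
#include <cassert>

// Hypothetical per-link record: a single static number, either the
// theoretical max or the max that has been physically measured.
// Unit is assumed to be gigabits per second.
struct LinkBandwidth {
    double static_max_gbps;
};

// User-side heuristic: derate the static number by an efficiency factor
// in [0, 1] chosen by the user to model their own workload.
double estimated_gbps(const LinkBandwidth& link, double efficiency) {
    assert(efficiency >= 0.0 && efficiency <= 1.0);
    return link.static_max_gbps * efficiency;
}

// Estimated time in seconds to move `bytes` over the link under that heuristic.
// The caller must pass an efficiency greater than zero.
double estimated_transfer_seconds(const LinkBandwidth& link, double efficiency, double bytes) {
    const double bytes_per_second = estimated_gbps(link, efficiency) * 1e9 / 8.0;
    return bytes / bytes_per_second;
}
```

The efficiency factor lives entirely on the user side, so different users can plug in different heuristics against the same static number.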

@aliuTT
Contributor

aliuTT commented Aug 1, 2024

Continuing discussions:

  • Hey @aliuTT thanks, I have a couple more runtime API requests I'll file those now! Comments inline:

For tracking, can you follow what I tagged in this request? Link the Metal Runtime - Team Dashboard and definitely tag feature or bug; otherwise it won't show up in the GitHub UI we have set up. I also prefixed the issue title with [MLIR], so that at a glance we can tell where issues come from.

  • Chip location as in its physical coordinate in the galaxy system. This should include rack/shelf whatever nomenclature we're using to distinguish chips between galaxies (like on a TGG).

This is an interesting request. Overall, ethernet coordinates are only used for link training. Today we have ethernet coordinates as (x, y, rack, shelf), so they map nicely to the physical topology. But in the future we will have 3D coordinates for torus connectivity. Specs are still up in the air, but as an example you can have four chips in the same Galaxy shelf/box with coordinates (0,0,0,0), (1,0,0,0), (1,0,0,1), (0,0,0,1). Are these helpful to print out? In this example, to get the physical locations we'll need some other bits flashed to the chip to list actual rack and shelf locations. Does that make sense? Are the raw ethernet coords something you are interested in exposing?

  • So I think we're asking for the opposite: we want to know the set of usable links, i.e. these APIs should filter out connections / links / chips / etc. that are reserved for fast dispatch and only present usable ones.

I meant to say, we do have what you list here. Device APIs today present active links/connectivity that filter out links reserved for fast dispatch.

  • Probably a static number of theoretical max (or max that we've physically measured) is the way to go. And then different users could implement whatever heuristic to better model their real world usage.

Sounds good!
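
To illustrate the point about raw ethernet coordinates above, here is a small sketch. EthCoord and chips_per_rack_shelf are hypothetical names; only the (x, y, rack, shelf) field order and the four example coordinates come from this thread. It assumes someone naively keeps reading the last two fields as (rack, shelf) under the future scheme.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Ethernet coordinate in today's (x, y, rack, shelf) layout. Hypothetical struct.
struct EthCoord {
    int x;
    int y;
    int rack;
    int shelf;
};

// Count chips per (rack, shelf) pair, i.e. treat the last two fields
// as if they were a physical location.
std::map<std::pair<int, int>, int> chips_per_rack_shelf(const std::vector<EthCoord>& coords) {
    std::map<std::pair<int, int>, int> counts;
    for (const EthCoord& c : coords) {
        ++counts[std::make_pair(c.rack, c.shelf)];
    }
    return counts;
}

// The four example coordinates from the comment above, which are said to
// sit in the same Galaxy shelf/box.
std::vector<EthCoord> example_coords() {
    return {{0, 0, 0, 0}, {1, 0, 0, 0}, {1, 0, 0, 1}, {0, 0, 0, 1}};
}
```

Grouping the example coordinates this way yields two distinct (rack, shelf) keys for four chips that share one physical shelf, which is why extra bits would be needed to recover actual rack and shelf locations.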

@nsmithtt
Contributor

nsmithtt commented Aug 1, 2024

For tracking, can you follow what I tagged in this request? Link the Metal Runtime - Team Dashboard and definitely tag feature or bug; otherwise it won't show up in the GitHub UI we have set up. I also prefixed the issue title with [MLIR], so that at a glance we can tell where issues come from.

Done! Updated the other 2 issues I filed.

  • Chip location as in its physical coordinate in the galaxy system. This should include rack/shelf whatever nomenclature we're using to distinguish chips between galaxies (like on a TGG).

This is an interesting request. Overall, ethernet coordinates are only used for link training. Today we have ethernet coordinates as (x, y, rack, shelf), so they map nicely to the physical topology. But in the future we will have 3D coordinates for torus connectivity. Specs are still up in the air, but as an example you can have four chips in the same Galaxy shelf/box with coordinates (0,0,0,0), (1,0,0,0), (1,0,0,1), (0,0,0,1). Are these helpful to print out? In this example, to get the physical locations we'll need some other bits flashed to the chip to list actual rack and shelf locations. Does that make sense? Are the raw ethernet coords something you are interested in exposing?

I see, are device_ids guaranteed to be unique across a whole TGG system? If so I think this might be enough (+ the API below) to reconstruct the full topology.

  • So I think we're asking for the opposite: we want to know the set of usable links, i.e. these APIs should filter out connections / links / chips / etc. that are reserved for fast dispatch and only present usable ones.

I meant to say, we do have what you list here. Device APIs today present active links/connectivity that filter out links reserved for fast dispatch.

Ah ok, I didn't realize this. I suppose we can get all of this information by calling the following with skip_reserved_tunnel_cores=true:

std::unordered_set<CoreCoord> get_active_ethernet_cores(bool skip_reserved_tunnel_cores=false)

And then to get connectivity using:

std::tuple<chip_id_t, CoreCoord> get_connected_ethernet_core(CoreCoord eth_core)

@aliuTT
Contributor

aliuTT commented Aug 1, 2024

For tracking, can you follow what I tagged in this request? Link the Metal Runtime - Team Dashboard and definitely tag feature or bug; otherwise it won't show up in the GitHub UI we have set up. I also prefixed the issue title with [MLIR], so that at a glance we can tell where issues come from.

Done! Updated the other 2 issues I filed.

Thanks!

  • Chip location as in its physical coordinate in the galaxy system. This should include rack/shelf whatever nomenclature we're using to distinguish chips between galaxies (like on a TGG).

This is an interesting request. Overall, ethernet coordinates are only used for link training. Today we have ethernet coordinates as (x, y, rack, shelf), so they map nicely to the physical topology. But in the future we will have 3D coordinates for torus connectivity. Specs are still up in the air, but as an example you can have four chips in the same Galaxy shelf/box with coordinates (0,0,0,0), (1,0,0,0), (1,0,0,1), (0,0,0,1). Are these helpful to print out? In this example, to get the physical locations we'll need some other bits flashed to the chip to list actual rack and shelf locations. Does that make sense? Are the raw ethernet coords something you are interested in exposing?

I see, are device_ids guaranteed to be unique across a whole TGG system? If so I think this might be enough (+ the API below) to reconstruct the full topology.

Device ids are guaranteed to be unique. And you're right, ids + APIs should be able to reconstruct full topology.

  • So I think we're asking for the opposite: we want to know the set of usable links, i.e. these APIs should filter out connections / links / chips / etc. that are reserved for fast dispatch and only present usable ones.

I meant to say, we do have what you list here. Device APIs today present active links/connectivity that filter out links reserved for fast dispatch.

Ah ok, I didn't realize this. I suppose we can get all of this information by calling the following with skip_reserved_tunnel_cores=true:

std::unordered_set<CoreCoord> get_active_ethernet_cores(bool skip_reserved_tunnel_cores=false)

And then to get connectivity using:

std::tuple<chip_id_t, CoreCoord> get_connected_ethernet_core(CoreCoord eth_core)

Right, exactly. Or we have this other concept of a socket, which returns an ordered representation of ethernet cores (also excludes dispatch reserved cores).

// Option 1: pick any non-reserved active ethernet core, then look up its peer.
CoreCoord free_eth_core = *device_0->get_active_ethernet_cores(true).begin();
std::tuple<chip_id_t, CoreCoord> pair = device_0->get_connected_ethernet_core(free_eth_core);
// OR
// Option 2: use sockets; entry i on one device is connected to entry i on the other.
CoreCoord eth_core_0 = device_0->get_ethernet_sockets(/*chip_id=*/1)[0];
CoreCoord connected_eth_core_0 = device_1->get_ethernet_sockets(/*chip_id=*/0)[0];
CoreCoord eth_core_1 = device_0->get_ethernet_sockets(/*chip_id=*/1)[1];
CoreCoord connected_eth_core_1 = device_1->get_ethernet_sockets(/*chip_id=*/0)[1];

So the following ethernet cores are connected:

device_0.get_ethernet_sockets(device_1)[0]
device_1.get_ethernet_sockets(device_0)[0]
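
Putting the pieces together, reconstructing the chip-level topology from unique device ids plus these two calls could look roughly like the sketch below. It runs against a mock, not the real Device class: MockDevice, add_link, ChipEdge, and build_chip_edges are hypothetical, CoreCoord and chip_id_t are simplified stand-ins, and the mock returns std::set rather than std::unordered_set to avoid needing a hash. Only the two method names and the skip_reserved_tunnel_cores parameter are taken from the signatures quoted above.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <tuple>
#include <utility>
#include <vector>

using chip_id_t = int;  // stand-in for the runtime's chip id type

// Simplified stand-in for the runtime's CoreCoord.
struct CoreCoord {
    int x;
    int y;
    bool operator<(const CoreCoord& o) const { return std::tie(x, y) < std::tie(o.x, o.y); }
};

// Mock of a device that mimics the two calls discussed in this thread.
struct MockDevice {
    chip_id_t id;
    // Local ethernet core -> (remote chip, remote core).
    std::map<CoreCoord, std::tuple<chip_id_t, CoreCoord>> links;
    // Local cores reserved for fast dispatch.
    std::set<CoreCoord> reserved;

    void add_link(CoreCoord local, chip_id_t remote_chip, CoreCoord remote_core,
                  bool is_reserved = false) {
        links[local] = std::make_tuple(remote_chip, remote_core);
        if (is_reserved) reserved.insert(local);
    }

    std::set<CoreCoord> get_active_ethernet_cores(bool skip_reserved_tunnel_cores = false) const {
        std::set<CoreCoord> cores;
        for (const auto& entry : links) {
            if (skip_reserved_tunnel_cores && reserved.count(entry.first) > 0) continue;
            cores.insert(entry.first);
        }
        return cores;
    }

    std::tuple<chip_id_t, CoreCoord> get_connected_ethernet_core(CoreCoord eth_core) const {
        assert(links.count(eth_core) > 0);
        return links.at(eth_core);
    }
};

// Undirected chip-to-chip edge, stored as (smaller id, larger id).
using ChipEdge = std::pair<chip_id_t, chip_id_t>;

// Walk every device's usable ethernet cores and collect the chip pairs they connect.
std::set<ChipEdge> build_chip_edges(const std::vector<MockDevice>& devices) {
    std::set<ChipEdge> edges;
    for (const MockDevice& dev : devices) {
        for (const CoreCoord& core : dev.get_active_ethernet_cores(true)) {
            const chip_id_t remote = std::get<0>(dev.get_connected_ethernet_core(core));
            edges.emplace(std::min(dev.id, remote), std::max(dev.id, remote));
        }
    }
    return edges;
}
```

Storing each edge with the smaller id first deduplicates the two directions of a link, and relies on device ids being unique across the system as confirmed above.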

@nsmithtt
Contributor

nsmithtt commented Aug 1, 2024

Awesome, thanks @aliuTT for the explanations. I think we probably have what we need then. Will close this ticket and file a new one if something comes up.

@nsmithtt nsmithtt closed this as completed Aug 1, 2024

3 participants