-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for multi-device system descriptor #220
Conversation
I think we can just create a PR and it shouldn't be an issue. Can you send me your branch name and I'll take a look? |
runtime/lib/ttnn/runtime.cpp
Outdated
chipCapabilities.push_back(chipCapability); | ||
// Extract chip location | ||
int x, y, rack, shelf; | ||
std::tie(x, y, rack, shelf) = device.get_chip_location(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is get_chip_location
the only new API you added to metal? If so let's file an issue on metal (to have an issue number in the commit) something like "Add device chip location API".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool sounds good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nsmithtt Metal PR here: https://github.com/tenstorrent/tt-metal/pull/10674/files
issue here: tenstorrent/tt-metal#10671
runtime/lib/ttnn/runtime.cpp
Outdated
std::tie(x, y, rack, shelf) = device.get_chip_location(); | ||
chipCoords.emplace_back(rack, shelf, y, x); | ||
// Extract chip connected channels | ||
std::pair<int, int> connectedChips = getConnectedChips(device); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't there be more than 1 connect chips? Seems like a remote chip in the middle of a galaxy could have up to 4 connections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmm that's a good point, I'll update the chipChannels schema to be an array of ints then instead of having only 2 endpoints hardcoded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention for chipChannels
was to be just a flat list of connections, not tied to the other arrays, but maybe this is more cumbersome to query. If you think of it like a graph you have a list of nodes chipIds
and a list of edges chipChannels
where each edge is just a pair of node ids (but could be extended with new attributes in the future, like number of streams).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nsmithtt Ahhh I see, I misinterpreted what chipChannels was intended to do. I pushed an update that now constructs a flat list of connections - also ported to a metal device mesh API that handles opening/closing devices safely.
Updated system descriptor looks like:
{'version': {'major': 0, 'minor': 0, 'patch': 40}, 'ttmlir_git_hash': '1fdd4a0970b17925db82c74c163834ccb24c737f', 'product_identifier': 'unknown', 'system_desc': {'chip_descs': [{'arch': 'Wormhole_b0', 'grid_size': {'y': 8, 'x': 8}, 'l1_size': 1499136, 'num_dram_channels': 12, 'dram_channel_size': 1073741824, 'noc_l1_address_align_bytes': 16, 'pcie_address_align_bytes': 32, 'noc_dram_address_align_bytes': 32}, {'arch': 'Wormhole_b0', 'grid_size': {'y': 8, 'x': 8}, 'l1_size': 1499136, 'num_dram_channels': 12, 'dram_channel_size': 1073741824, 'noc_l1_address_align_bytes': 16, 'pcie_address_align_bytes': 32, 'noc_dram_address_align_bytes': 32}], 'chip_desc_indices': [0, 1], 'chip_capabilities': [3, 0], 'chip_coords': [{'rack': 0, 'shelf': 0, 'y': 0, 'x': 0}, {'rack': 0, 'shelf': 0, 'y': 0, 'x': 1}], 'chip_channels': [{'endpoint0': 0, 'endpoint1': 1}]}}
@tapspatel I had to comment out the strict system descriptor check or else the assertion always fires. I'll open an issue for the compiler to properly generate a system descriptor. @nsmithtt does this match what you were expecting? I ignored chip coords and updated chip channels to include the two endpoint devices and their connected ethernet coords. Multi-device system descriptor looks like this for an x2 card: {
"chip_descs": [
{
"arch": "Wormhole_b0",
"grid_size": {
"y": 8,
"x": 8
},
"l1_size": 1499136,
"num_dram_channels": 12,
"dram_channel_size": 1073741824,
"noc_l1_address_align_bytes": 16,
"pcie_address_align_bytes": 32,
"noc_dram_address_align_bytes": 32
},
{
"arch": "Wormhole_b0",
"grid_size": {
"y": 8,
"x": 8
},
"l1_size": 1499136,
"num_dram_channels": 12,
"dram_channel_size": 1073741824,
"noc_l1_address_align_bytes": 16,
"pcie_address_align_bytes": 32,
"noc_dram_address_align_bytes": 32
}
],
"chip_desc_indices": [0, 1],
"chip_capabilities": [3, 0],
"chip_coords": [],
"chip_channels": [
{
"device_id0": 0,
"ethernet_core_coord0": {
"y": 8,
"x": 0
},
"device_id1": 1,
"ethernet_core_coord1": {
"y": 0,
"x": 0
}
}
]
} |
} | ||
|
||
inline ::tt::target::Dim2d toFlatbuffer(FlatbufferObjectCache &cache, | ||
GridAttr arch) { | ||
const GridAttr &arch) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not blocking the review, but typically we only pass the MLIR types, i.e. Value
, *Attr
, *Type
, by value. MLIR already enforces that they are wrappers around some underlying storage and it is convention to treat them as value types. It doesn't hurt to pass it by const&
, but it's not necessary. All MLIR types are also immutable, so they are truly value types in this sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh I see, thanks for the explanation, I'll undo these changes!
runtime/lib/ttnn/runtime.cpp
Outdated
std::vector<::tt::target::ChipChannel> allConnections; | ||
allConnections.resize(connectionSet.size()); | ||
|
||
std::transform( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Couple comments inline
runtime/lib/ttnn/runtime.cpp
Outdated
|
||
for (const ::ttnn::Device *device : devices) { | ||
// Construct chip descriptor | ||
::tt::target::Dim2d deviceGrid = toFlatbuffer(device->logical_grid_size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
device->logical_grid_size()
-> device.compute_with_storage_grid_size()
@@ -153,7 +153,8 @@ def run(args): | |||
|
|||
# execution | |||
print("executing action for all provided flatbuffers") | |||
device = ttrt.runtime.open_device(device_ids) | |||
system_desc, device_ids = ttrt.runtime.get_current_system_desc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is already called earlier on in the function, you can remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch! This was leftover after the rebase, removed it
ttrt code lgtm! |
71cde8c
to
1bf8d22
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome refactor! Both runtimes have a shared system desc! 🙌
Hey @jnie-TT , am I nuts or is there some issue here at the merged commit 0b393ec in main? Looks good in this format:
Not so good here though, empty brackets causes error:
|
@kmabeeTT Seems like chip_capabilities=0 is causing an issue, I think it's an issue on the mlir side, I'll take a closer look at this. In the meantime you can try running on an n150 machine, the issue should go away. |
Created a ticket #380 |
#162
Example of system descriptor for n300:
{'version': {'major': 0, 'minor': 0, 'patch': 39}, 'ttmlir_git_hash': '484963fb271e0817d5aa5fd6e6c4c3626787c2f2', 'product_identifier': 'unknown', 'system_desc': {'chip_descs': [{'arch': 'Wormhole_b0', 'grid_size': {'y': 8, 'x': 8}, 'l1_size': 1499136, 'num_dram_channels': 12, 'dram_channel_size': 1073741824, 'noc_l1_address_align_bytes': 16, 'pcie_address_align_bytes': 32, 'noc_dram_address_align_bytes': 32}, {'arch': 'Wormhole_b0', 'grid_size': {'y': 8, 'x': 8}, 'l1_size': 1499136, 'num_dram_channels': 12, 'dram_channel_size': 1073741824, 'noc_l1_address_align_bytes': 16, 'pcie_address_align_bytes': 32, 'noc_dram_address_align_bytes': 32}], 'chip_desc_indices': [0, 1], 'chip_capabilities': [3, 0], 'chip_coords': [{'rack': 0, 'shelf': 0, 'y': 0, 'x': 0}, {'rack': 0, 'shelf': 0, 'y': 0, 'x': 1}], 'chip_channels': [{'endpoint0': 1, 'endpoint1': -1}, {'endpoint0': 0, 'endpoint1': -1}]}}
This requires metal API updates, specifically to support querying the chip location. I added it locally for testing, but currently this API is missing so the build will fail. @nsmithtt What's the process for adding such APIs to metal?