Support 16B aligned reads over PCIE #407

nsmithtt · 2024-08-15T20:45:16Z

Even though PCIE on WH only supports 32B aligned memory transactions, I believe we should still be able collect arbitrarily aligned data. This was working at one point, but somehow regressed, most likely mis-programming the metal buffer that views the L1 allocation.

Repro (access row major data that's only 16B aligned):

#layout = #tt.layout<(d0, d1) -> (d0, d1), undef, <1x1>, memref<4x8xf32, #l1_>>                                                                                                                               
#layout1 = #tt.layout<(d0, d1) -> (d0, d1), undef, <2x2>, memref<2x4xf32, #l1_>>                                                                                                                       
func.func @simple(%arg0: tensor<4x8xf32, #layout>) -> tensor<4x8xf32, #layout1> {                                                                                                            
  %0 = tensor.empty() : tensor<4x8xf32, #layout1>                                                                                                                                                      
  %1 = "ttir.to_layout"(%arg0, %0) : (tensor<4x8xf32, #layout>, tensor<4x8xf32, #layout1>) -> tensor<4x8xf32, #layout1>                                                                                  
  return %1 : tensor<4x8xf32, #layout1>                                                                                                                                                                 
}

- Introduce new `LayoutAttr` member `projectOnto` which projects a tensor linear map onto a physical grid. We simply sample this map to generate all data movement. - Read physical cores from the system desc to generate noc coordinates. - Enforce address alignment, could be relaxed per #407

jdesousa-TT · 2025-01-21T21:33:47Z

Hey @nsmithtt , you had a comment in TTIRToTTMetal.cpp:208 which removed the memorySpace argument to systemDesc.getAddressAlignBytes which meant the maximum alignment restriction (32B) of all memory spaces was always being used. The comment quoted this issue.

Simply removing the comment and allowing the address alignment to be based on the input memory space seems to allow tt-opt to compile this example, and it runs on device successfully (although the data is incorrect on some rows of the tensor).

Wondering what the context of that comment is, assuming it has something to do with when this broke?

nsmithtt · 2025-01-22T02:48:58Z

(although the data is incorrect on some rows of the tensor).

@jdesousa-TT, I noticed exactly this and narrowed it down to PCIE reads not being able to handle alignment < 32B. It does run, but ends up copying the data back incorrectly, it might be possible that we're programming the metal buffer incorrectly, but I sanity checked it many times and nothing popped out. It'd be interesting to see if we could make a repro using a standalone metal test.

you had a comment in TTIRToTTMetal.cpp:208 which removed the memorySpace argument to systemDesc.getAddressAlignBytes which meant the maximum alignment restriction (32B) of all memory spaces was always being used

I probably should have called this out in the issue initially, so here goes 😃. Since PCIE seems to only be able to read 32B aligned source data (regardless of memory space it's reading from), it seemed safer to just align everything to 32B to aid in debugging, i.e. if you wanted to read any intermediate tensor off the device via some runtime debugger then the tensor your reading needs to be already aligned to 32B. It seemed safer (for now) to just align everything to 32B vs having someone else run into this which is difficult to debug. Also I got tired of investigating metal source to figure out why, so this was an easier fix.

nsmithtt self-assigned this Aug 15, 2024

nsmithtt added this to the [Metal Direct 0] milestone Aug 15, 2024

nsmithtt mentioned this issue Aug 15, 2024

Support basic row major data movement + tilize/untilize #321 #323 #410

Merged

nsmithtt modified the milestones: [Metal Direct 0], [D2M 1] Jan 20, 2025

nsmithtt added the bug Something isn't working label Jan 20, 2025

nsmithtt removed their assignment Jan 20, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support 16B aligned reads over PCIE #407

Support 16B aligned reads over PCIE #407

nsmithtt commented Aug 15, 2024

jdesousa-TT commented Jan 21, 2025 •

edited

Loading

nsmithtt commented Jan 22, 2025

Support 16B aligned reads over PCIE #407

Support 16B aligned reads over PCIE #407

Comments

nsmithtt commented Aug 15, 2024

jdesousa-TT commented Jan 21, 2025 • edited Loading

nsmithtt commented Jan 22, 2025

jdesousa-TT commented Jan 21, 2025 •

edited

Loading