Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support 16B aligned reads over PCIE #407

Open
nsmithtt opened this issue Aug 15, 2024 · 2 comments
Open

Support 16B aligned reads over PCIE #407

nsmithtt opened this issue Aug 15, 2024 · 2 comments
Labels
bug Something isn't working
Milestone

Comments

@nsmithtt
Copy link
Contributor

Even though PCIE on WH only supports 32B aligned memory transactions, I believe we should still be able collect arbitrarily aligned data. This was working at one point, but somehow regressed, most likely mis-programming the metal buffer that views the L1 allocation.

Repro (access row major data that's only 16B aligned):

#layout = #tt.layout<(d0, d1) -> (d0, d1), undef, <1x1>, memref<4x8xf32, #l1_>>                                                                                                                               
#layout1 = #tt.layout<(d0, d1) -> (d0, d1), undef, <2x2>, memref<2x4xf32, #l1_>>                                                                                                                       
func.func @simple(%arg0: tensor<4x8xf32, #layout>) -> tensor<4x8xf32, #layout1> {                                                                                                            
  %0 = tensor.empty() : tensor<4x8xf32, #layout1>                                                                                                                                                      
  %1 = "ttir.to_layout"(%arg0, %0) : (tensor<4x8xf32, #layout>, tensor<4x8xf32, #layout1>) -> tensor<4x8xf32, #layout1>                                                                                  
  return %1 : tensor<4x8xf32, #layout1>                                                                                                                                                                 
}
@nsmithtt nsmithtt self-assigned this Aug 15, 2024
@nsmithtt nsmithtt added this to the [Metal Direct 0] milestone Aug 15, 2024
nsmithtt added a commit that referenced this issue Aug 15, 2024
- Introduce new `LayoutAttr` member `projectOnto` which projects a
  tensor linear map onto a physical grid. We simply sample this map to
  generate all data movement.
- Read physical cores from the system desc to generate noc coordinates.
- Enforce address alignment, could be relaxed per #407
nsmithtt added a commit that referenced this issue Aug 15, 2024
- Introduce new `LayoutAttr` member `projectOnto` which projects a
  tensor linear map onto a physical grid. We simply sample this map to
  generate all data movement.
- Read physical cores from the system desc to generate noc coordinates.
- Enforce address alignment, could be relaxed per #407
nsmithtt added a commit that referenced this issue Aug 15, 2024
- Introduce new `LayoutAttr` member `projectOnto` which projects a
  tensor linear map onto a physical grid. We simply sample this map to
  generate all data movement.
- Read physical cores from the system desc to generate noc coordinates.
- Enforce address alignment, could be relaxed per #407
nsmithtt added a commit that referenced this issue Aug 16, 2024
- Introduce new `LayoutAttr` member `projectOnto` which projects a
  tensor linear map onto a physical grid. We simply sample this map to
  generate all data movement.
- Read physical cores from the system desc to generate noc coordinates.
- Enforce address alignment, could be relaxed per #407
nsmithtt added a commit that referenced this issue Aug 16, 2024
- Introduce new `LayoutAttr` member `projectOnto` which projects a
  tensor linear map onto a physical grid. We simply sample this map to
  generate all data movement.
- Read physical cores from the system desc to generate noc coordinates.
- Enforce address alignment, could be relaxed per #407
nsmithtt added a commit that referenced this issue Aug 16, 2024
- Introduce new `LayoutAttr` member `projectOnto` which projects a
  tensor linear map onto a physical grid. We simply sample this map to
  generate all data movement.
- Read physical cores from the system desc to generate noc coordinates.
- Enforce address alignment, could be relaxed per #407
nsmithtt added a commit that referenced this issue Aug 20, 2024
- Introduce new `LayoutAttr` member `projectOnto` which projects a
  tensor linear map onto a physical grid. We simply sample this map to
  generate all data movement.
- Read physical cores from the system desc to generate noc coordinates.
- Enforce address alignment, could be relaxed per #407
@nsmithtt nsmithtt modified the milestones: [Metal Direct 0], [D2M 1] Jan 20, 2025
@nsmithtt nsmithtt added the bug Something isn't working label Jan 20, 2025
@nsmithtt nsmithtt removed their assignment Jan 20, 2025
@jdesousa-TT
Copy link
Contributor

jdesousa-TT commented Jan 21, 2025

Hey @nsmithtt , you had a comment in TTIRToTTMetal.cpp:208 which removed the memorySpace argument to systemDesc.getAddressAlignBytes which meant the maximum alignment restriction (32B) of all memory spaces was always being used. The comment quoted this issue.

Simply removing the comment and allowing the address alignment to be based on the input memory space seems to allow tt-opt to compile this example, and it runs on device successfully (although the data is incorrect on some rows of the tensor).

Wondering what the context of that comment is, assuming it has something to do with when this broke?

@nsmithtt
Copy link
Contributor Author

(although the data is incorrect on some rows of the tensor).

@jdesousa-TT, I noticed exactly this and narrowed it down to PCIE reads not being able to handle alignment < 32B. It does run, but ends up copying the data back incorrectly, it might be possible that we're programming the metal buffer incorrectly, but I sanity checked it many times and nothing popped out. It'd be interesting to see if we could make a repro using a standalone metal test.

you had a comment in TTIRToTTMetal.cpp:208 which removed the memorySpace argument to systemDesc.getAddressAlignBytes which meant the maximum alignment restriction (32B) of all memory spaces was always being used

I probably should have called this out in the issue initially, so here goes 😃. Since PCIE seems to only be able to read 32B aligned source data (regardless of memory space it's reading from), it seemed safer to just align everything to 32B to aid in debugging, i.e. if you wanted to read any intermediate tensor off the device via some runtime debugger then the tensor your reading needs to be already aligned to 32B. It seemed safer (for now) to just align everything to 32B vs having someone else run into this which is difficult to debug. Also I got tired of investigating metal source to figure out why, so this was an easier fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants