-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support 16B aligned reads over PCIE #407
Comments
- Introduce new `LayoutAttr` member `projectOnto` which projects a tensor linear map onto a physical grid. We simply sample this map to generate all data movement. - Read physical cores from the system desc to generate noc coordinates. - Enforce address alignment, could be relaxed per #407
- Introduce new `LayoutAttr` member `projectOnto` which projects a tensor linear map onto a physical grid. We simply sample this map to generate all data movement. - Read physical cores from the system desc to generate noc coordinates. - Enforce address alignment, could be relaxed per #407
- Introduce new `LayoutAttr` member `projectOnto` which projects a tensor linear map onto a physical grid. We simply sample this map to generate all data movement. - Read physical cores from the system desc to generate noc coordinates. - Enforce address alignment, could be relaxed per #407
- Introduce new `LayoutAttr` member `projectOnto` which projects a tensor linear map onto a physical grid. We simply sample this map to generate all data movement. - Read physical cores from the system desc to generate noc coordinates. - Enforce address alignment, could be relaxed per #407
- Introduce new `LayoutAttr` member `projectOnto` which projects a tensor linear map onto a physical grid. We simply sample this map to generate all data movement. - Read physical cores from the system desc to generate noc coordinates. - Enforce address alignment, could be relaxed per #407
- Introduce new `LayoutAttr` member `projectOnto` which projects a tensor linear map onto a physical grid. We simply sample this map to generate all data movement. - Read physical cores from the system desc to generate noc coordinates. - Enforce address alignment, could be relaxed per #407
- Introduce new `LayoutAttr` member `projectOnto` which projects a tensor linear map onto a physical grid. We simply sample this map to generate all data movement. - Read physical cores from the system desc to generate noc coordinates. - Enforce address alignment, could be relaxed per #407
Hey @nsmithtt , you had a comment in Simply removing the comment and allowing the address alignment to be based on the input memory space seems to allow tt-opt to compile this example, and it runs on device successfully (although the data is incorrect on some rows of the tensor). Wondering what the context of that comment is, assuming it has something to do with when this broke? |
@jdesousa-TT, I noticed exactly this and narrowed it down to PCIE reads not being able to handle alignment < 32B. It does run, but ends up copying the data back incorrectly, it might be possible that we're programming the metal buffer incorrectly, but I sanity checked it many times and nothing popped out. It'd be interesting to see if we could make a repro using a standalone metal test.
I probably should have called this out in the issue initially, so here goes 😃. Since PCIE seems to only be able to read 32B aligned source data (regardless of memory space it's reading from), it seemed safer to just align everything to 32B to aid in debugging, i.e. if you wanted to read any intermediate tensor off the device via some runtime debugger then the tensor your reading needs to be already aligned to 32B. It seemed safer (for now) to just align everything to 32B vs having someone else run into this which is difficult to debug. Also I got tired of investigating metal source to figure out why, so this was an easier fix. |
Even though PCIE on WH only supports 32B aligned memory transactions, I believe we should still be able collect arbitrarily aligned data. This was working at one point, but somehow regressed, most likely mis-programming the metal buffer that views the L1 allocation.
Repro (access row major data that's only 16B aligned):
The text was updated successfully, but these errors were encountered: