#2066: Updated tests to ensure parity between all supported ccl operations and all supported meshes. Cleaned up test directory structure #2067

Open
wants to merge 1 commit into
base: main


@tapspatel tapspatel commented Feb 2, 2025

Supported meshes

1x2
1x32
1x8
2x4
8x4

Supported machine configurations

n150
n300
llmbox
tg

There are 4 directories under Silicon, one for each supported machine configuration.
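As a rough illustration of what the mesh shapes above imply, the sketch below parses each RxC shape and computes how many devices it spans. The helper name `device_count` is hypothetical and not part of this PR, and since the thread does not state which mesh runs on which machine, no such mapping is assumed here.

```python
# Mesh shapes as listed in the PR description.
SUPPORTED_MESHES = ["1x2", "1x32", "1x8", "2x4", "8x4"]


def device_count(mesh: str) -> int:
    """Return the number of devices in a 'ROWSxCOLS' mesh shape (hypothetical helper)."""
    rows, cols = (int(dim) for dim in mesh.split("x"))
    return rows * cols


counts = {mesh: device_count(mesh) for mesh in SUPPORTED_MESHES}
for mesh, n in counts.items():
    print(f"{mesh}: {n} devices")
```

Note that 1x8 and 2x4 span the same number of devices, as do 1x32 and 8x4, so they differ only in how the devices are arranged.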

@tapspatel tapspatel requested a review from nsmithtt February 2, 2025 18:31
@tapspatel tapspatel self-assigned this Feb 2, 2025
@tapspatel tapspatel linked an issue Feb 2, 2025 that may be closed by this pull request
@tapspatel commented

The directory structure of the multi-device tests could potentially use a per-directory lit config like the following: https://github.com/llvm/llvm-project/blob/main/llvm/test/MC/M68k/lit.local.cfg
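To make the idea concrete, here is a minimal sketch of the gating logic such a per-directory config could apply. A real `lit.local.cfg` is a Python file that lit executes with a `config` object already in scope; the `SimpleNamespace` below is only a stand-in for it so the snippet runs on its own. The feature names (`n300`, `llmbox`) and the helper `gate_directory` are assumptions for illustration, not something this PR defines.

```python
from types import SimpleNamespace


def gate_directory(config, required_system):
    """Mark all tests in a directory unsupported unless the required system is available.

    In a real lit.local.cfg the body of this function would sit at top level,
    operating on the `config` object that lit injects.
    """
    if required_system not in config.available_features:
        config.unsupported = True
    return config


# Stand-in for lit's config: pretend we are running on an n300 machine.
config = SimpleNamespace(available_features={"n300"}, unsupported=False)

llmbox_dir = gate_directory(
    SimpleNamespace(available_features=config.available_features, unsupported=False),
    "llmbox",
)
n300_dir = gate_directory(
    SimpleNamespace(available_features=config.available_features, unsupported=False),
    "n300",
)
print("llmbox tests skipped:", llmbox_dir.unsupported)
print("n300 tests skipped:", n300_dir.unsupported)
```

With one such file per machine directory, a runner would only execute the tests whose directory matches a feature it advertises, assuming the top-level lit config registers those features.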

@tapspatel commented Feb 3, 2025

tg machine: huge pages were not set up correctly. Downloaded them from metal; fixed.
tg machine: the galaxy reset.json file had incorrect keys for credo and disabled_ports; corrected.
tg machine: tt-topology needed to be set in order to train the n300 <-> galaxy connections; fixed.

llmbox machine: it had not been flashed with tt-topology. Installed the tool and set the topology to mesh.
llmbox machine: the reset script installed from the infra repo was not the correct one. We need the one that runs tt-smi -r 0,1,2,3 (all PCIe devices). I updated the script manually.
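For reference, a minimal sketch of how a reset script could assemble that command. Only the `tt-smi -r 0,1,2,3` invocation comes from this thread; the helper name `build_reset_command` is hypothetical, and the snippet only builds and prints the command rather than running it.

```python
import shlex


def build_reset_command(pcie_ids):
    """Build the tt-smi reset command for the given PCIe device indices (hypothetical helper)."""
    ids = ",".join(str(i) for i in pcie_ids)
    return ["tt-smi", "-r", ids]


# llmbox: reset all four PCIe devices, as described above.
cmd = build_reset_command(range(4))
print(shlex.join(cmd))
```

On the machine itself, the list could be passed to `subprocess.run(cmd, check=True)` so that a failed reset stops the script.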

@tapspatel
Copy link
Contributor Author

We can speed up the galaxy reset by removing the credo and disabled_ports entries from the reset.json file. Flashing a galaxy erases everything, so after a flash you have to run tt-topology and then reset once with the full reset.json file (with the credo and disabled_ports params in it). After that, you can “quick reset” using a reset.json with credo and disabled_ports removed. However, if you flash again, you have to repeat all of these steps.
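A minimal sketch of deriving the "quick reset" config from the full one. The thread does not show the layout of reset.json, so the sample contents below are purely hypothetical; only the key names credo and disabled_ports come from the discussion, and the function strips them wherever they appear.

```python
import json

# Key names mentioned in the thread; where they sit inside reset.json is an assumption.
SLOW_KEYS = {"credo", "disabled_ports"}


def strip_keys(node, keys=SLOW_KEYS):
    """Return a copy of a parsed JSON value with the given keys removed at any depth."""
    if isinstance(node, dict):
        return {k: strip_keys(v, keys) for k, v in node.items() if k not in keys}
    if isinstance(node, list):
        return [strip_keys(item, keys) for item in node]
    return node


# Hypothetical stand-in for the full reset.json contents.
full_config = {
    "some_section": {
        "credo": ["placeholder"],
        "disabled_ports": ["placeholder"],
        "other_setting": 1,
    }
}

quick_config = strip_keys(full_config)
print(json.dumps(quick_config, indent=2))
```

Keeping the full file and generating the quick variant from it means the full reset is still available after the next flash.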

fyi @vmilosevic

Successfully merging this pull request may close these issues.

Expand test coverage for all multi-device configurations