
Test hook support #263

Merged
merged 66 commits into from
Nov 7, 2024

Conversation

TaekyungHeo
Member

@TaekyungHeo TaekyungHeo commented Oct 14, 2024

Summary

This PR introduces hooks to CloudAI. Hooks are tests that run before or after each test in a test scenario. They are defined globally within a test scenario and are executed automatically for every test. There are two types of hooks: pre-tests, which run before each test, and post-tests, which run after it. Multiple pre-tests and post-tests can be specified in each scenario.

An example of how hooks are defined within a test scenario:

name = "nccl-test"

pre_test = "nccl_test_pre"
post_test = "nccl_test_post"

[[Tests]]
id = "Tests.1"
test_name = "nccl_test_all_reduce"
num_nodes = "2"
time_limit = "00:20:00"

[[Tests]]
id = "Tests.2"
test_name = "nccl_test_all_gather"
num_nodes = "2"
time_limit = "00:20:00"
  [[Tests.dependencies]]
  type = "start_post_comp"
  id = "Tests.1"

You can see the pre_test and post_test fields. These are used to look up the corresponding hook file. A hook file is a separate test scenario file as shown below:

name = "nccl_test_pre"

[[Tests]]
id = "Tests.1"
test_name = "nccl_test_all_reduce"
time_limit = "00:20:00"

If any test in the pre-test fails, neither the main test nor the post-test runs; in other words, the main test and the post-test run only when the pre-test succeeds. Tests in hooks have time limits, just as tests in the main scenario do. Output files are stored in the output directory under a "pre_test" or "post_test" subdirectory, following the proper directory hierarchy. Hooks are not supported for NeMo 1.0 (NeMo launcher).
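The conditional execution described above can be sketched as follows (a simplified illustration, not CloudAI's actual runner code):

```python
from typing import Callable, Sequence

def run_with_hooks(
    pre_tests: Sequence[Callable[[], bool]],
    main_test: Callable[[], bool],
    post_tests: Sequence[Callable[[], bool]],
) -> bool:
    """Run pre-test hooks first; if any fails, skip the main test and the post-tests."""
    if not all(t() for t in pre_tests):  # short-circuits on the first failure
        return False
    ok = main_test()
    for t in post_tests:  # post-tests run once the pre-tests have succeeded
        ok = t() and ok
    return ok
```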

Note

  • Idea
    • We may need to generate reports from plugins.
    • We may need to consider the performance impact of plugins.
    • Dependencies are not implemented for now.

Test Plan

  1. CI passes
  2. Manual run
    2.1 Success
$ cloudai run --system-config ~/cloudaix-main/conf/common/system/israel_1.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/nccl_test.toml
/.autodirect/mswg2/E2E/theo/venv/lib/python3.10/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.19) or chardet (5.2.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
[INFO] System Name: Israel-1                                                                            
[INFO] Scheduler: slurm                                                                                 
[INFO] Test Scenario Name: nccl-test                                                                    
[INFO] Checking if test templates are installed.                                                        
[INFO] Test Scenario: nccl-test                                                                         

Section Name: Tests.1                                                                                   
  Test Name: nccl_test_all_reduce                                                                       
  Description: all_reduce                                                                               
  No dependencies                                                                                       
[INFO] Initializing Runner [RUN] mode                                                                   
[INFO] Creating SlurmRunner                                                                             
[INFO] Starting test scenario execution.                                                                
[INFO] Starting test: Tests.1                                                                           
[INFO] Running test: Tests.1                                                                            
[INFO] Executing command for test Tests.1: sbatch /auto/e2e/israel1/workload_results/nccl-test_2024-10-25_22-05-46/Tests.1/0/cloudai_sbatch_script.sh
[INFO] Job completed: Tests.1
[INFO] All test scenario results stored at: /auto/e2e/israel1/workload_results/nccl-test_2024-10-25_22-05-46
[INFO] All test scenario execution attempts are complete. Please review the 'debug.log' file to confirm successful completion or to identify any issues.
$ cd /auto/e2e/israel1/workload_results/nccl-test_2024-10-25_22-05-46/Tests.1/0
$ ls
cloudai_sbatch_script.sh  epilogue  prologue  stderr.txt  stdout.txt

$ ls prologue/nccl_test_all_reduce/
stderr.txt  stdout.txt

$ ls epilogue/nccl_test_all_gather/
stderr.txt  stdout.txt
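The listing above nests each hook test's stdout/stderr under the parent test's output directory. A minimal sketch of building those paths (hypothetical helper name; the directory names are taken from the listing):

```python
from pathlib import Path

def hook_output_dir(test_output_dir: Path, hook_kind: str, hook_test_name: str) -> Path:
    """Return (and create) e.g. <test_output_dir>/prologue/<hook_test_name>/."""
    if hook_kind not in ("prologue", "epilogue"):
        raise ValueError(f"unknown hook kind: {hook_kind}")
    out = test_output_dir / hook_kind / hook_test_name
    out.mkdir(parents=True, exist_ok=True)
    return out
```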

2.2 Failure

$ cloudai run --system-config ~/cloudaix-main/conf/common/system/israel_1.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/nccl_test.toml
/.autodirect/mswg2/E2E/theo/venv/lib/python3.10/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.19) or chardet (5.2.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
[INFO] System Name: Israel-1
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nccl-test
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nccl-test

Section Name: Tests.1
  Test Name: nccl_test_all_reduce
  Description: all_reduce
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Executing command for test Tests.1: sbatch /auto/e2e/israel1/workload_results/nccl-test_2024-10-25_22-16-25/Tests.1/0/cloudai_sbatch_script.sh
[ERROR] Job 383928 for test Tests.1 failed: Missing success indicators in /auto/e2e/israel1/workload_results/nccl-test_2024-10-25_22-16-25/Tests.1/0/stdout.txt: '# Out of bounds values', '# Avg bus bandwidth'. These keywords are expected to be present in stdout.txt, usually towards the end of the file. Please review the NCCL test output and errors in the file. Ensure the NCCL test ran to completion. You can run the generated sbatch script manually and check if /auto/e2e/israel1/workload_results/nccl-test_2024-10-25_22-16-25/Tests.1/0/stdout.txt is created and contains the expected keywords. If the issue persists, contact the system administrator.
[INFO] Terminating all jobs...
[INFO] All jobs have been killed.

Contributor

@amaslenn amaslenn left a comment


For the existing prologue we use a real NCCL run. In your examples, it seems we are switching to some predefined commands.

  1. How are we going to generate it?
  2. Will that cover our needs? cc @srivatsankrishnan

I do have some code related notes, but let's leave it for later discussion.

@TaekyungHeo
Member Author

@amaslenn

How are we going to generate it?

Yes, it is one of the main design choices that we need to make.

@amaslenn
Contributor

Yes, it is one of the main design choices that we need to make.

Can we rely on existing mechanisms? Each plugin will be defined as a regular Test TOML, meaning we can generate a CLI for it for a particular system. This is what we do now and it seems to cover all our needs for this feature.

@TaekyungHeo TaekyungHeo force-pushed the plugin-jan branch 15 times, most recently from 7594c19 to 852fee8 on October 24, 2024 at 19:54
@TaekyungHeo
Member Author

@amaslenn, I ran verify-configs and got these warnings:

$ cloudai verify-configs conf
[WARNING] Test configuration directory not provided, using all found test TOMLs in the specified directory.
[INFO] Checked systems: 3, all passed
[INFO] Checked tests: 40, all passed
[WARNING] System configuration not provided, mocking it.
[WARNING] Prologue 'nccl_test_prologue' not found in plugin mapping. Ensure that a proper plugin directory is set under the working directory.
[WARNING] Epilogue 'nccl_test_epilogue' not found in plugin mapping. Ensure that a proper plugin directory is set under the working directory.
[INFO] Checked scenarios: 9, all passed
[INFO] Checked 52 configuration files, all passed

@TaekyungHeo TaekyungHeo changed the title from "Plugin support" to "Test hook support" on Nov 4, 2024
@TaekyungHeo
Member Author

TaekyungHeo commented Nov 4, 2024

do you have any suggestions for fixing the CI error in the verification function?

Let's always add hooks into the lookup; we always know where they are:

...
err, tomls = expand_file_list(root, glob="**/*.toml")
err, hook_tomls = expand_file_list(HOOKS_DIR, glob="**/*.toml")
tomls += hook_tomls
...

Let's also change if "conf" in toml_file.parts and "hook" in toml_file.parts to an exact comparison (==) against the hooks' directory constants in load_tomls_by_type.
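A sketch of both suggestions together (hypothetical constant and function names; the real CloudAI code may differ):

```python
from pathlib import Path

HOOKS_DIR = Path("conf/hook")  # assumed hooks directory constant

def collect_tomls(root: Path) -> list[Path]:
    """Always include hook TOMLs in the lookup, since their location is fixed."""
    tomls = sorted(root.glob("**/*.toml"))
    tomls += sorted(HOOKS_DIR.glob("**/*.toml"))
    return tomls

def is_hook_toml(toml_file: Path) -> bool:
    """Exact comparison against the hooks directory, not substring membership checks."""
    return toml_file.parts[: len(HOOKS_DIR.parts)] == HOOKS_DIR.parts
```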

Contributor

@srivatsankrishnan srivatsankrishnan left a comment


As discussed on the call, Taekyung mentioned that he tested with different NCCL tests for both the pre and post scenarios. This will be a continuing feature to cover other use cases.

@TaekyungHeo TaekyungHeo merged commit 95b3681 into NVIDIA:main Nov 7, 2024
2 checks passed
Labels: feature, Jan25 (Jan'25 release feature)