Test hook support #263

Merged
merged 66 commits into from
Nov 7, 2024
Changes from 50 commits
Commits
bf1d9fb
Reorder SlurmCommandGenStrategy methods
TaekyungHeo Oct 22, 2024
38bb8a7
Rename generate_srun_command to _gen_srun_command
TaekyungHeo Oct 22, 2024
5719230
Remove pre-test implementation from JaxToolbox
TaekyungHeo Oct 23, 2024
8e8ee3e
Add prologue and epilogue to _TestScenarioTOML
TaekyungHeo Oct 23, 2024
32d7d93
Add example plugin files
TaekyungHeo Oct 22, 2024
28a38b8
Add plugin option to CLI
TaekyungHeo Oct 23, 2024
bb3275f
Parse plugins and pass them to TestRun
TaekyungHeo Oct 23, 2024
3bc3822
Generate plugin commands
TaekyungHeo Oct 25, 2024
06d4b7d
Remove plugin option from CLI
TaekyungHeo Oct 25, 2024
9174f00
Make plugin directory self-contained
TaekyungHeo Oct 25, 2024
7afa73f
Update Parser to support self-contained plugin directory
TaekyungHeo Oct 25, 2024
3e45b25
Refactor plugin path handling in parse to use a single plugin_path param
TaekyungHeo Oct 28, 2024
5634f74
Remove test_scenario directory from conf/common/plugin/
TaekyungHeo Oct 28, 2024
b798981
Restore comments in src/cloudai/parser.py
TaekyungHeo Oct 29, 2024
da9abbf
Remove unused tmp_path from unit tests
TaekyungHeo Oct 29, 2024
6b11f54
Set prologue and epilogue to None by default
TaekyungHeo Oct 29, 2024
b84c16f
Add validation to ensure 'prologue' and 'epilogue' are not empty strings
TaekyungHeo Oct 29, 2024
8d840cf
Reorder SlurmCommandGenStrategy methods
TaekyungHeo Oct 22, 2024
9725b36
Rename generate_srun_command to _gen_srun_command
TaekyungHeo Oct 22, 2024
5a658c3
Remove pre-test implementation from JaxToolbox
TaekyungHeo Oct 23, 2024
cac5484
Add prologue and epilogue to _TestScenarioTOML
TaekyungHeo Oct 23, 2024
aab165b
Add example plugin files
TaekyungHeo Oct 22, 2024
265e42e
Add plugin option to CLI
TaekyungHeo Oct 23, 2024
d9b2e83
Parse plugins and pass them to TestRun
TaekyungHeo Oct 23, 2024
bfb653f
Generate plugin commands
TaekyungHeo Oct 25, 2024
9f83cd5
Remove plugin option from CLI
TaekyungHeo Oct 25, 2024
f656eee
Make plugin directory self-contained
TaekyungHeo Oct 25, 2024
5af7113
Update Parser to support self-contained plugin directory
TaekyungHeo Oct 25, 2024
b22c2f2
Refactor plugin path handling in parse to use a single plugin_path param
TaekyungHeo Oct 28, 2024
c88fe2e
Remove test_scenario directory from conf/common/plugin/
TaekyungHeo Oct 28, 2024
d66e0fe
Merge branch 'main'
TaekyungHeo Oct 29, 2024
c814ccb
Use Pydantic model to load prologue and epilogue
TaekyungHeo Oct 29, 2024
a6d3efc
Recover acceptance tests with plugin
TaekyungHeo Oct 29, 2024
46cabe9
Clean up unit tests
TaekyungHeo Oct 29, 2024
764e181
Refactor parser to remove explicit plugin_path argument, use default …
TaekyungHeo Oct 29, 2024
d9e8c1f
Refactor gen_exec_command to simplify indentation logic for readability
TaekyungHeo Oct 29, 2024
897a7da
Make prologue and epilogue fields optional
TaekyungHeo Oct 29, 2024
d44023b
Set prologue and epilogue to None by default
TaekyungHeo Oct 29, 2024
12022de
Recover comments
TaekyungHeo Oct 29, 2024
3cf27df
Remove unused tmp_path from unit tests
TaekyungHeo Oct 29, 2024
b37127d
Merge branch 'main'
TaekyungHeo Oct 30, 2024
9244bd6
Do not allow empty test runs in plugins
TaekyungHeo Oct 30, 2024
00f34f2
Simplify prologue unit tests
TaekyungHeo Oct 30, 2024
7de1185
Move plugin directory to conf
TaekyungHeo Oct 30, 2024
c2b8d83
Reflect Andrei's comments
TaekyungHeo Oct 30, 2024
e1534d1
Reflect Andrei's comments
TaekyungHeo Oct 31, 2024
42080e2
Print out warning when plugins are missing
TaekyungHeo Oct 31, 2024
3dbb4d3
Update acceptance test sbatch script names
TaekyungHeo Oct 31, 2024
a9f5c97
Reflect Andrei's comments
TaekyungHeo Oct 31, 2024
d3c7cfd
Make vulture happy
TaekyungHeo Oct 31, 2024
f123099
Add logging messages to parser.parse
TaekyungHeo Nov 1, 2024
e886bf6
Simplify unit test for readability
TaekyungHeo Nov 1, 2024
59d5cb3
Reflect Andrei's comments
TaekyungHeo Nov 1, 2024
4894972
Reflect Andrei's comments
TaekyungHeo Nov 1, 2024
c19f24b
Rename plugin to hook
TaekyungHeo Nov 4, 2024
f53420c
Rename plugin to hook
TaekyungHeo Nov 4, 2024
904f377
Rename plugin to hook
TaekyungHeo Nov 4, 2024
2c84d43
Rename plugin to hook
TaekyungHeo Nov 4, 2024
b598779
Rename plugin to hook
TaekyungHeo Nov 4, 2024
de1c1a6
Raise an exception when hooks are not found
TaekyungHeo Nov 4, 2024
70e8fd7
Fix verify-configs errors
TaekyungHeo Nov 4, 2024
b852bb8
Reflect Andrei's comments
TaekyungHeo Nov 5, 2024
701cf94
Reflect Andrei's comments
TaekyungHeo Nov 5, 2024
8c0cbb5
Rename plugin to hook
TaekyungHeo Nov 5, 2024
526aecb
Fix verify-configs errors
TaekyungHeo Nov 5, 2024
04430e4
Reflect Andrei's comments
TaekyungHeo Nov 5, 2024
4 changes: 4 additions & 0 deletions conf/common/test_scenario/nccl_test.toml
@@ -15,6 +15,10 @@
# limitations under the License.

name = "nccl-test"

prologue = "nccl_test_prologue"
epilogue = "nccl_test_epilogue"

[[Tests]]
id = "Tests.1"
test_name = "nccl_test_all_reduce"
22 changes: 22 additions & 0 deletions conf/plugin/nccl_test_epilogue.toml
@@ -0,0 +1,22 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name = "nccl_test_epilogue"

[[Tests]]
id = "Tests.1"
test_name = "nccl_test_all_gather"
time_limit = "00:20:00"
22 changes: 22 additions & 0 deletions conf/plugin/nccl_test_prologue.toml
@@ -0,0 +1,22 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name = "nccl_test_prologue"

[[Tests]]
id = "Tests.1"
test_name = "nccl_test_all_reduce"
time_limit = "00:20:00"
33 changes: 33 additions & 0 deletions conf/plugin/test/nccl_test_all_gather.toml
@@ -0,0 +1,33 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name = "nccl_test_all_gather"
description = "all_gather"
test_template_name = "NcclTest"

[cmd_args]
"subtest_name" = "all_gather_perf_mpi"
"ngpus" = "1"
"minbytes" = "128"
"maxbytes" = "4G"
"iters" = "100"
"warmup_iters" = "50"

[extra_cmd_args]
"--stepfactor" = "2"

[extra_env_vars]
"NCCL_TEST_SPLIT_MASK" = "0x7"
30 changes: 30 additions & 0 deletions conf/plugin/test/nccl_test_all_reduce.toml
@@ -0,0 +1,30 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name = "nccl_test_all_reduce"
description = "all_reduce"
test_template_name = "NcclTest"

[cmd_args]
"subtest_name" = "all_reduce_perf_mpi"
"ngpus" = "1"
"minbytes" = "128"
"maxbytes" = "16G"
"iters" = "100"
"warmup_iters" = "50"

[extra_cmd_args]
"--stepfactor" = "2"
26 changes: 26 additions & 0 deletions src/cloudai/_core/command_gen_strategy.py
@@ -39,3 +39,29 @@ def gen_exec_command(self, tr: TestRun) -> str:
str: The generated execution command.
"""
pass

@abstractmethod
def gen_srun_command(self, tr: TestRun) -> str:
"""
Generate the Slurm srun command for a test based on the given parameters.

Args:
tr (TestRun): Contains the test and its run-specific configurations.

Returns:
str: The generated Slurm srun command.
"""
pass

@abstractmethod
def gen_srun_success_check(self, tr: TestRun) -> str:
"""
Generate the Slurm success check command to verify if a test run was successful.

Args:
tr (TestRun): Contains the test and its run-specific configurations.

Returns:
str: The generated command to check the success of the test run.
"""
pass
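
To make the contract of the two new abstract methods concrete, the following is a minimal sketch of a strategy that implements them. `FakeTestRun`, `SrunStrategy`, and `EchoStrategy` are hypothetical names used only for illustration, and the generated command strings are placeholders rather than what CloudAI's Slurm strategies emit.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class FakeTestRun:
    """Hypothetical stand-in for TestRun with only the fields this sketch needs."""

    name: str
    output_path: str


class SrunStrategy(ABC):
    """Mirrors the two abstract methods added in this diff."""

    @abstractmethod
    def gen_srun_command(self, tr: FakeTestRun) -> str: ...

    @abstractmethod
    def gen_srun_success_check(self, tr: FakeTestRun) -> str: ...


class EchoStrategy(SrunStrategy):
    """Illustrative implementation; the command strings are assumptions."""

    def gen_srun_command(self, tr: FakeTestRun) -> str:
        return f"srun --output={tr.output_path}/stdout.txt echo {tr.name}"

    def gen_srun_success_check(self, tr: FakeTestRun) -> str:
        # Prints 1 when the expected text is in the output file, 0 otherwise.
        return f'grep -q "{tr.name}" {tr.output_path}/stdout.txt && echo 1 || echo 0'
```

Because both methods are abstract, a subclass that omits either one cannot be instantiated, so missing hook support in a strategy surfaces at construction time instead of at script generation time.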
2 changes: 2 additions & 0 deletions src/cloudai/_core/test_scenario.py
@@ -58,6 +58,8 @@ class TestRun:
weight: float = 0.0
ideal_perf: float = 1.0
dependencies: dict[str, TestDependency] = field(default_factory=dict)
prologue: Optional["TestScenario"] = None
epilogue: Optional["TestScenario"] = None

def __hash__(self) -> int:
return hash(self.name + self.test.name + str(self.iterations) + str(self.current_iteration))
35 changes: 32 additions & 3 deletions src/cloudai/_core/test_scenario_parser.py
@@ -54,6 +54,8 @@ class _TestScenarioTOML(BaseModel):
name: str
job_status_check: bool = True
tests: list[_TestRunTOML] = Field(alias="Tests", min_length=1)
prologue: Optional[str] = None
epilogue: Optional[str] = None

@model_validator(mode="after")
def check_no_self_dependency(self):
@@ -99,9 +101,10 @@ class TestScenarioParser:

__test__ = False

def __init__(self, file_path: Path, test_mapping: Dict[str, Test]) -> None:
def __init__(self, file_path: Path, test_mapping: Dict[str, Test], plugin_mapping: Dict[str, TestScenario]) -> None:
self.file_path = file_path
self.test_mapping = test_mapping
self.plugin_mapping = plugin_mapping

def parse(self) -> TestScenario:
"""
@@ -136,8 +139,24 @@ def _parse_data(self, data: Dict[str, Any]) -> TestScenario:
total_weight = sum(tr.weight for tr in ts_model.tests)
normalized_weight = 0 if total_weight == 0 else 100 / total_weight

prologue, epilogue = None, None
if ts_model.prologue:
prologue = self.plugin_mapping.get(ts_model.prologue)
if prologue is None:
logging.warning(
f"Prologue '{ts_model.prologue}' not found in plugin mapping. "
"Ensure that a proper plugin directory is set under the working directory."
)
if ts_model.epilogue:
epilogue = self.plugin_mapping.get(ts_model.epilogue)
if epilogue is None:
logging.warning(
f"Epilogue '{ts_model.epilogue}' not found in plugin mapping. "
"Ensure that a proper plugin directory is set under the working directory."
)

test_runs_by_id: dict[str, TestRun] = {
tr.id: self._create_test_run(tr, normalized_weight) for tr in ts_model.tests
tr.id: self._create_test_run(tr, normalized_weight, prologue, epilogue) for tr in ts_model.tests
}

tests_data: dict[str, _TestRunTOML] = {tr.id: tr for tr in ts_model.tests}
@@ -153,13 +172,21 @@ def _parse_data(self, data: Dict[str, Any]) -> TestScenario:
job_status_check=ts_model.job_status_check,
)

def _create_test_run(self, test_info: _TestRunTOML, normalized_weight: float) -> TestRun:
def _create_test_run(
self,
test_info: _TestRunTOML,
normalized_weight: float,
prologue: Optional[TestScenario] = None,
epilogue: Optional[TestScenario] = None,
) -> TestRun:
"""
Create a section-specific Test object by copying from the test mapping.

Args:
test_info (_TestRunTOML): Information of the test.
normalized_weight (float): Normalized weight for the test.
prologue (Optional[TestScenario]): TestScenario object representing the prologue sequence.
epilogue (Optional[TestScenario]): TestScenario object representing the epilogue sequence.

Returns:
TestRun: Test run built from a copy of the mapped Test, updated for the section.
@@ -192,5 +219,7 @@ def _create_test_run(self, test_info: _TestRunTOML, normalized_weight: float) ->
sol=test_info.sol,
weight=test_info.weight * normalized_weight,
ideal_perf=test_info.ideal_perf,
prologue=prologue,
epilogue=epilogue,
)
return tr
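
The prologue and epilogue lookups in `_parse_data` above follow the same pattern, which can be sketched as a single function. `resolve_hook` is a hypothetical name, and the mapping values are plain objects here instead of `TestScenario` instances. The sketch mirrors the warning behavior in this revision of the diff; the later commit titled "Raise an exception when hooks are not found" suggests the merged code is stricter.

```python
import logging
from typing import Dict, Optional


def resolve_hook(kind: str, name: Optional[str], hook_mapping: Dict[str, object]) -> Optional[object]:
    """Look up a named hook scenario; warn and return None when it is unknown.

    Hypothetical helper: `kind` is only used in the log message ("Prologue" or "Epilogue").
    """
    if not name:
        # No hook requested for this scenario.
        return None
    hook = hook_mapping.get(name)
    if hook is None:
        logging.warning(f"{kind} '{name}' not found in plugin mapping.")
    return hook


if __name__ == "__main__":
    mapping = {"nccl_test_prologue": "prologue-scenario"}
    print(resolve_hook("Prologue", "nccl_test_prologue", mapping))
    print(resolve_hook("Epilogue", "nccl_test_epilogue", mapping))
```

One consequence of warning instead of raising is that a misspelled hook name silently disables the hook for every test run in the scenario.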
34 changes: 34 additions & 0 deletions src/cloudai/_core/test_template.py
@@ -94,6 +94,40 @@ def gen_exec_command(self, tr: TestRun) -> str:
)
return self.command_gen_strategy.gen_exec_command(tr)

def gen_srun_command(self, tr: TestRun) -> str:
"""
Generate a Slurm srun command for a test using the provided command generation strategy.

Args:
tr (TestRun): Contains the test and its run-specific configurations.

Returns:
str: The generated Slurm srun command.
"""
if self.command_gen_strategy is None:
raise ValueError(
"command_gen_strategy is missing. Ensure the strategy is registered in the Registry "
"by calling the appropriate registration function for the system type."
)
return self.command_gen_strategy.gen_srun_command(tr)

def gen_srun_success_check(self, tr: TestRun) -> str:
"""
Generate a Slurm success check command for a test using the provided command generation strategy.

Args:
tr (TestRun): Contains the test and its run-specific configurations.

Returns:
str: The generated command to check the success of the test run.
"""
if self.command_gen_strategy is None:
raise ValueError(
"command_gen_strategy is missing. Ensure the strategy is registered in the Registry "
"by calling the appropriate registration function for the system type."
)
return self.command_gen_strategy.gen_srun_success_check(tr)

def gen_json(self, tr: TestRun) -> Dict[Any, Any]:
"""
Generate a JSON string representing the Kubernetes job specification for this test using this template.
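
The delegating methods added in this hunk repeat the same `None` guard as `gen_exec_command`. One possible way to factor it out, shown purely as a sketch, is a small helper that either returns the strategy or raises. `require_strategy` is a hypothetical name and is not part of this pull request.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def require_strategy(strategy: Optional[T], attr_name: str) -> T:
    """Return the strategy if it is set, otherwise raise the error used in the diff."""
    if strategy is None:
        raise ValueError(
            f"{attr_name} is missing. Ensure the strategy is registered in the Registry "
            "by calling the appropriate registration function for the system type."
        )
    return strategy
```

With such a helper, each delegating method would reduce to a single line of the form `return require_strategy(self.command_gen_strategy, "command_gen_strategy").gen_srun_command(tr)`.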
77 changes: 63 additions & 14 deletions src/cloudai/parser.py
@@ -34,6 +34,9 @@
format_validation_error,
)

PLUGIN_ROOT = Path("conf/plugin")
PLUGIN_TEST_ROOT = PLUGIN_ROOT / "test"


class Parser:
"""Main parser for parsing all types of configurations."""
@@ -49,14 +52,21 @@ def __init__(self, system_config_path: Path) -> None:
self.system_config_path = system_config_path

def parse(
self, test_path: Path, test_scenario_path: Optional[Path] = None
self,
test_path: Path,
test_scenario_path: Optional[Path] = None,
) -> Tuple[System, List[Test], Optional[TestScenario]]:
"""
Parse configurations for system, test templates, and test scenarios.

Returns
Tuple[System, List[TestTemplate], TestScenario]: A tuple containing the system object, a list of test
template objects, and the test scenario object.
Args:
test_path (Path): The file path for tests.
test_scenario_path (Optional[Path]): The file path for the main test scenario.
If None, all tests are included.

Returns:
Tuple[System, List[Test], Optional[TestScenario]]: A tuple containing the system object, a list of filtered
test template objects, and the main test scenario object if provided.
"""
if not test_path.exists():
raise FileNotFoundError(f"Test path '{test_path}' not found.")
@@ -71,24 +81,63 @@ def parse(
except TestConfigParsingError:
exit(1) # exit right away to keep error message readable for users

logging.debug(f"Parsed {len(tests)} tests: {[t.name for t in tests]}")
test_mapping = {t.name: t for t in tests}
try:
plugin_tests = (
self.parse_tests(list(PLUGIN_TEST_ROOT.glob("*.toml")), system) if PLUGIN_TEST_ROOT.exists() else []
)
except TestConfigParsingError:
exit(1) # exit right away to keep error message readable for users

filtered_tests = tests
test_scenario: Optional[TestScenario] = None
if test_scenario_path:
if not test_scenario_path:
all_tests = list({test.name: test for test in tests + plugin_tests}.values())
return system, all_tests, None

test_mapping = {t.name: t for t in tests}
plugin_test_scenario_mapping = {}
if PLUGIN_ROOT.exists() and list(PLUGIN_ROOT.glob("*.toml")):
try:
test_scenario = self.parse_test_scenario(test_scenario_path, test_mapping)
plugin_test_scenario_mapping = self.parse_plugins(
list(PLUGIN_ROOT.glob("*.toml")), {t.name: t for t in plugin_tests}
)
except TestScenarioParsingError:
exit(1) # exit right away to keep error message readable for users
scenario_tests = set(tr.test.name for tr in test_scenario.test_runs)
filtered_tests = [t for t in tests if t.name in scenario_tests]

try:
test_scenario = self.parse_test_scenario(test_scenario_path, test_mapping, plugin_test_scenario_mapping)
except TestScenarioParsingError:
exit(1) # exit right away to keep error message readable for users

scenario_tests = {tr.test.name for tr in test_scenario.test_runs}
plugin_scenario_tests = {
tr.test.name
for plugin_scenario in plugin_test_scenario_mapping.values()
for tr in plugin_scenario.test_runs
}

relevant_test_names = scenario_tests.union(plugin_scenario_tests)
filtered_tests = [t for t in tests if t.name in relevant_test_names] + plugin_tests
filtered_tests = list({test.name: test for test in filtered_tests}.values())

return system, filtered_tests, test_scenario

@staticmethod
def parse_test_scenario(test_scenario_path: Path, test_mapping: Dict[str, Test]) -> TestScenario:
test_scenario_parser = TestScenarioParser(test_scenario_path, test_mapping)
def parse_plugins(plugin_tomls: List[Path], test_mapping: Dict[str, Test]) -> Dict[str, TestScenario]:
plugin_mapping = {}
for plugin_test_scenario_path in plugin_tomls:
plugin_scenario = Parser.parse_test_scenario(plugin_test_scenario_path, test_mapping)
plugin_mapping[plugin_scenario.name] = plugin_scenario
return plugin_mapping

@staticmethod
def parse_test_scenario(
test_scenario_path: Path,
test_mapping: Dict[str, Test],
plugin_mapping: Optional[Dict[str, TestScenario]] = None,
) -> TestScenario:
if plugin_mapping is None:
plugin_mapping = {}

test_scenario_parser = TestScenarioParser(test_scenario_path, test_mapping, plugin_mapping)
test_scenario = test_scenario_parser.parse()
return test_scenario
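
The filtering step in `Parser.parse` above combines two idioms: keep only tests referenced by a scenario, then deduplicate by name with a dict comprehension. The sketch below isolates that logic. `NamedTest` and `select_tests` are hypothetical names, and the `source` field exists only to show which duplicate survives.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class NamedTest:
    """Hypothetical stand-in for Test; `source` records where it was loaded from."""

    name: str
    source: str


def select_tests(
    tests: Iterable[NamedTest], hook_tests: Iterable[NamedTest], relevant_names: Set[str]
) -> List[NamedTest]:
    """Keep referenced tests, append all hook tests, then deduplicate by name."""
    selected = [t for t in tests if t.name in relevant_names] + list(hook_tests)
    # A later entry with the same name replaces the earlier value but keeps its position.
    return list({t.name: t for t in selected}.values())


if __name__ == "__main__":
    tests = [NamedTest("nccl_test_all_reduce", "main"), NamedTest("sleep", "main")]
    hook_tests = [NamedTest("nccl_test_all_reduce", "hook"), NamedTest("nccl_test_all_gather", "hook")]
    relevant = {"nccl_test_all_reduce", "nccl_test_all_gather"}
    for t in select_tests(tests, hook_tests, relevant):
        print(t.name, t.source)
```

Because hook tests are appended last, a hook test that shares a name with a regular test replaces it in the result. Whether that is intended is not stated in the diff, so it may be worth confirming with the author.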
