Add support for producing dashboard outputs #426
Conversation
* Create a `LaunchedManifest` class to surface `Step` info to the `Controller` so that it can be dumped into a `manifest.json` file as a static representation of an `Experiment`'s execution, for use by tools external to the driver script itself (a rough sketch of such a manifest follows this list). [ committed by @MattToast ] [ reviewed by @ankona @AlyssaCote ] Co-authored-by: Matt Drozt <drozt@hpe.com>
* Add ability to passthrough to CLI plugin [ committed by @ankona ] [ reviewed by @MattToast ]
* Add initial telemetry monitor. Co-authored-by: Andrew Shao <Andrew.Shao@hpe.com> Co-authored-by: Matt Drozt <drozt@hpe.com> [ committed by @ankona @ashao @MattToast ] [ reviewed by @MattToast ]
* Move the logic for launching a process through the indirect module into a dedicated proxy step class decorator, so that unmanaged steps can be tracked by the telemetry monitor without changing the logic of existing steps or launchers.
* …or (#416): reduce snooze time in tests; use telemetry cooldown to avoid premature auto-shutdown; add cooldown param and test autoshutdown; remove commented line; fix incorrect CLI arg type; add debug logging; fix launcher overwrite bug; add new logger param to tests; update test for fixed CLI arg type; avoid suppressing telemetry monitor output; add multi-start tests for telmon; add test assertions for cooldown verification; loosen assertion; use torch.save to avoid jit segfault; better logging redirection; add faux return codes for WLM tasks; format tests w/ black; fix incorrectly logged data; fix help text typo; add typehint. [ committed by @ankona @AlyssaCote ] [ reviewed by @al-rigazzi ]
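To make the manifest idea above more concrete, here is a minimal, purely illustrative sketch of the kind of static JSON a driver-external tool could consume. The field names and values are assumptions for illustration only, not the actual schema the `Controller` dumps.

```python
# Purely illustrative: field names/values are assumptions, not the real schema.
import json

launched_manifest = {
    "experiment": {"name": "my-exp", "path": "/scratch/my-exp", "launcher": "slurm"},
    "runs": [
        {
            "run_id": "a1b2c3",
            "model": {
                "name": "my-model",
                "run_command": "srun",
                "run_args": {"nodes": "1", "ntasks": "32"},
            },
        }
    ],
}

# The controller would write a file like this once per experiment start so that
# external tools (e.g. a dashboard) can read it without importing the driver script.
with open("manifest.json", "w") as manifest_file:
    json.dump(launched_manifest, manifest_file, indent=2)
```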
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##           develop     #426      +/-   ##
===========================================
+ Coverage    89.65%   90.33%    +0.67%
===========================================
  Files           59       60        +1
  Lines         3617     3839      +222
===========================================
+ Hits          3243     3468      +225
+ Misses         374      371        -3
Small notes from discussion
* Fix hiding `-h` parameter from plugin
* Add test for plugin argument passthrough
* Avoid exception breaking CLI on invalid plugin
* Modify plugin load check for python>=3.7
* Revert removal of positional-only args to CLI action handlers
* Remove unnecessary member var
* Remove `self.args` member from init
* Fix formatting issues
* Remove invalidated test assertion
Found some things I would like addressed in the TM tests before we approve for merge
Hides the stacktrace from users who terminate the dashboard via Ctrl-C. The keyboard interrupt is now caught and the CLI shuts down gracefully, displaying an appropriate message to stdout. [ committed by @ankona ] [ reviewed by @MattToast ]
Respond to the first round of reviewer feedback for the dashboard. [ committed by @ankona, @MattToast ] [ reviewed by @MattToast ]
As an author, I addressed my own change requests and my review is now stale
In the interest of getting feedback quickly, I'll do two reviews. One which goes through the implementation (here). I'll post any comments about the tests later on.
# "exe_args": run_settings.exe_args,
"run_command": run_settings.run_command,
"run_args": run_settings.run_args,
# TODO: We currently do not have a way to represent MPMD commands!
This doesn't seem like it should be that hard to do? MPMD should just be lists of the extra Slurm/PBS commands, e.g. `-N 1 -n 1 -c 32 : -N 1 -n32 -c 1`.
I would agree if the dashboard were expecting a `str`. Unfortunately, right now the dashboard expects a `dict[str, str]`, and we cannot "re-add" keys for the dashboard to display (e.g. in your example it would be non-trivial to display the two different values of `-n`).

We could turn the `"run_args"` key into a `list[dict[str, str]]`, but I would have to check with @AlyssaCote to see how much effort it would take for the dashboard to render this info on the frontend!
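To make the trade-off concrete, here is a hypothetical sketch of the two shapes being discussed; the keys and values are invented for illustration and do not reflect the actual manifest schema.

```python
# Hypothetical illustration of the two shapes discussed above (not the real schema).

# What the dashboard consumes today: a single flat mapping, so repeated flags
# from an MPMD command (e.g. a second "-n") cannot be represented.
run_args_flat = {"N": "1", "n": "1", "c": "32"}

# A possible MPMD-friendly shape: one mapping per program segment,
# e.g. for `-N 1 -n 1 -c 32 : -N 1 -n 32 -c 1`.
run_args_mpmd = [
    {"N": "1", "n": "1", "c": "32"},
    {"N": "1", "n": "32", "c": "1"},
]
```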
Ahhhh, I see. OK, as a stopgap for now let's throw a warning here and put up a ticket.
Some very minor comments. Otherwise the tests look really good!
@@ -398,7 +398,7 @@ def test_colocated_db_model_tf(fileutils, wlmutils, mlutils):
     test_script = fileutils.get_test_conf_path("run_tf_dbmodel_smartredis.py")

     # Create SmartSim Experience
-    exp = Experiment(exp_name, launcher=test_launcher)
+    exp = Experiment(exp_name, launcher=test_launcher, exp_path=test_dir)
Al and I found that this is also needed to get the tests to run on horizon. Out of curiosity, do you know what the behaviour is for the dashboard if you omit `exp_path`?
If you omit `exp_path` and leave the `SMARTSIM_FLAG_TELEMETRY` flag enabled, `.smartsim` directories end up in the CWD. To make sure these directories always land in the `test_output` dir, we decided to give every experiment an appropriate path.
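For context, the pattern being applied across the tests looks roughly like the sketch below; the experiment name and directory are placeholders standing in for whatever the test fixtures actually provide.

```python
# Sketch of the test pattern described above; names and paths are placeholders.
import os

from smartsim import Experiment

# Stand-in for the per-test output directory that the test fixtures provide.
test_dir = "./tests/test_output/test_colocated_db_model_tf"
os.makedirs(test_dir, exist_ok=True)

# Rooting the experiment under the test output directory keeps telemetry
# artifacts (the `.smartsim` directory) out of the current working directory.
exp = Experiment("colocated-db-model-tf", launcher="local", exp_path=test_dir)
```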
Great work so far! Just requesting a couple of minor fixes to docs and maybe some more tests for files for which coverage seems a bit low.
smartsim/_core/_cli/__main__.py (outdated)
    return smart_cli.execute(sys.argv)
except SmartSimCLIActionCancelled as ssi:
    logger.debug(ssi, exc_info=True)
    logger.info(ssi)
Don't lines L44-45 lead to a duplicate message if `SMARTSIM_LOG_LEVEL>="debug"`? I understand that the `debug` version has exception info attached, but in that case it looks like we don't really need the `info` one? Or am I missing something?
Nope, that is absolutely the case! I will give the debug logger call a generic "here is the traceback" message.
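To picture the change, the pattern might look roughly like this self-contained sketch; the CLI exception type is stubbed out here and this is not the exact committed code.

```python
# Self-contained sketch of the logging pattern; the real CLI types are stubbed.
import logging

logger = logging.getLogger("smartsim")


class SmartSimCLIActionCancelled(Exception):
    """Stand-in for SmartSim's CLI exception type."""


def run_cli_action() -> None:
    try:
        raise SmartSimCLIActionCancelled("action cancelled")  # simulate a cancel
    except SmartSimCLIActionCancelled as ssi:
        # The traceback is attached only at debug level behind a generic message,
        # so the info line below is the single user-facing copy of the error text.
        logger.debug("Traceback:", exc_info=True)
        logger.info(ssi)
```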
smartsim/_core/_cli/__main__.py (outdated)
    logger.info(ssi)
except KeyboardInterrupt:
    msg = "SmartSim was terminated by user"
    logger.debug(msg, exc_info=True)
Don't lines L48-49 lead to a duplicate message if `SMARTSIM_LOG_LEVEL>="debug"`? I understand that the `debug` version has exception info attached, but in that case it looks like we don't really need the `info` one? Or am I missing something?
ditto here
smartsim/_core/_cli/plugin.py (outdated)
    if spec is None:
        raise AttributeError()
except (ModuleNotFoundError, AttributeError):
    print(not_found)
Do we want to resort to `print` instead of using a better logger?
Good call!
smartsim/_core/_cli/plugin.py (outdated)
stdout, _ = process.communicate()

plugin_stdout = stdout.decode("utf-8")
print(plugin_stdout)
Do we want to resort to `print` instead of using a better logger?
On second read of this, I'm not sure why we capture the output and then print it to the screen immediately after the process dies. It seems like we could just let the plugin decide what to do with its output and simply not capture it.

I replaced this block with a `subprocess.run` call without capturing any IO. Let me know if you can think of a good reason that we should instead capture the output and send it through a logger!
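For reference, the replacement behaves roughly like the sketch below; the function name and the example command are placeholders, not the actual plugin code.

```python
# Sketch of the approach described above; names and the command are placeholders.
import subprocess
import sys


def run_plugin(plugin_cmd: list) -> int:
    # Let the plugin inherit stdout/stderr rather than capturing and re-printing,
    # so the plugin decides how to present its own output.
    completed = subprocess.run(plugin_cmd, check=False)
    return completed.returncode


if __name__ == "__main__":
    # Stand-in command; a real plugin entry point would be substituted here.
    sys.exit(run_plugin([sys.executable, "-c", "print('hello from a plugin')"]))
```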
@@ -204,6 +204,17 @@ def test_account(self) -> t.Optional[str]:  # pragma: no cover
         # no account by default
         return os.environ.get("SMARTSIM_TEST_ACCOUNT", None)

+    @property
+    def telemetry_frequency(self) -> int:
+        return int(os.environ.get("SMARTSIM_TELEMETRY_FREQUENCY", 5))
Looks like an easy test to add to make Codecov happy.
Test added!
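A test along these lines could look roughly like the sketch below; the `CONFIG` import path and fixture usage are assumptions for illustration and may not match the test that was actually added.

```python
# Hypothetical sketch; the import path is an assumption, not the committed test.
import pytest

from smartsim._core.config import CONFIG  # assumed location of the config object


def test_telemetry_frequency_default(monkeypatch: pytest.MonkeyPatch) -> None:
    # With the env var unset, the default of 5 seconds should apply.
    monkeypatch.delenv("SMARTSIM_TELEMETRY_FREQUENCY", raising=False)
    assert CONFIG.telemetry_frequency == 5


def test_telemetry_frequency_override(monkeypatch: pytest.MonkeyPatch) -> None:
    # The property reads the env var at access time and coerces it to an int.
    monkeypatch.setenv("SMARTSIM_TELEMETRY_FREQUENCY", "30")
    assert CONFIG.telemetry_frequency == 30
```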
Two super-minor fixes to docs (I missed on first review) and then it looks good to me!
self.task_id: str = ""
self.type: str = ""
self.timestamp: int = 0
self.status_dir: str = ""
Looks good!
exp_name: str
exp_path: str
launcher_name: str
run_id: str = field(default_factory=_helpers.create_short_id_str)
Thanks! Possibly not needed after your explanation, but I really appreciate you making this clearer for devs.
LGTM! Thanks for addressing the very last remarks!
Bumps the required number of nodes in the test docs from 3 to 4 as required by the tests in #381 and #426. [ committed by @MattToast ] [ reviewed by @ashao ]
Add support for producing & consuming dashboard outputs.
* `telemetry monitor` to check for updates and produce events for the dashboard
* `controller` to conditionally start the `telemetry monitor`
* `controller` to produce a runtime manifest to trigger telemetry collection
* `indirect proxy` to produce events for the dashboard for unmanaged tasks