
Add support for producing dashboard outputs #426

Merged: 31 commits into develop from dashboard, Dec 6, 2023

Conversation

@ankona (Contributor) commented Nov 29, 2023

Add support for producing & consuming dashboard outputs.

  • Adds telemetry monitor to check for updates and produce events for the dashboard
  • Updates controller to conditionally start telemetry monitor
  • Updates controller to produce a runtime manifest to trigger telemetry collection
  • Adds indirect proxy to produce events for the dashboard for unmanaged tasks
  • Adds CLI capability to launch dashboard
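The telemetry monitor described above watches for updates and produces events for the dashboard. A minimal sketch of one polling pass over a runtime manifest, assuming a hypothetical JSON schema and function name (the real SmartSim entrypoint differs):

```python
import json
import pathlib
import typing as t

def collect_new_runs(manifest_path: pathlib.Path, seen: t.Set[str]) -> t.List[str]:
    """One polling pass of a telemetry-monitor-style loop: read the runtime
    manifest written by the controller (if present) and return any run IDs
    not yet reported. Schema and names here are illustrative only."""
    if not manifest_path.exists():
        return []
    manifest = json.loads(manifest_path.read_text())
    new_runs = [run["run_id"] for run in manifest.get("runs", [])
                if run["run_id"] not in seen]
    seen.update(new_runs)
    return new_runs
```

A real monitor would call this on a timer (see the telemetry frequency setting discussed below) and translate each new run into dashboard events.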

ankona and others added 8 commits October 24, 2023 16:59
Create a `LaunchedManifest` class to surface `Step` info to the `Controller`
so that it can be dumped into a `manifest.json` file as a static representation
of an `Experiment`'s execution for use by tools external to the driver
script itself.

[ committed by @MattToast ]
[ reviewed by @ankona @AlyssaCote ]

Co-authored-by: Matt Drozt <drozt@hpe.com>
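The commit above describes dumping a `LaunchedManifest` to `manifest.json`. A stripped-down sketch of that idea, with assumed field names (`LaunchedStep`, `exp_name`, etc. are illustrative stand-ins, not SmartSim's actual classes):

```python
import dataclasses
import json
import typing as t

@dataclasses.dataclass
class LaunchedStep:
    # Hypothetical stand-in for the per-Step info surfaced to the Controller
    name: str
    step_id: str

@dataclasses.dataclass
class LaunchedManifest:
    # Static representation of an Experiment's execution, dumpable to
    # manifest.json for tools external to the driver script
    exp_name: str
    steps: t.List[LaunchedStep]

    def to_json(self) -> str:
        return json.dumps(dataclasses.asdict(self), indent=2)
```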
* Add ability to passthrough to CLI plugin

[ committed by @ankona ]
[ reviewed by @MattToast ]
add initial telemetry monitor 

Co-authored-by: Andrew Shao <Andrew.Shao@hpe.com>
Co-authored-by: Matt Drozt <drozt@hpe.com>

[ committed by @ankona @ashao @MattToast  ]
[ reviewed by @MattToast ]
Moves logic for launching a process through the indirect module into a dedicated proxy step class decorator, so that unmanaged steps can be tracked by the telemetry monitor without changing the logic of existing steps or launchers.
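The proxy-decorator approach can be sketched as follows; the decorator name and entrypoint module are assumptions for illustration, not SmartSim's actual identifiers:

```python
import typing as t

def proxy_launch_cmd(get_launch_cmd: t.Callable) -> t.Callable:
    """Illustrative decorator: reroute a step's launch command through an
    intermediary 'indirect' entrypoint so the spawned process can emit
    telemetry events for otherwise unmanaged tasks."""
    def wrapper(self: object) -> t.List[str]:
        original_cmd = get_launch_cmd(self)
        # The proxy entrypoint (name assumed) exec's the original command
        # and reports start/stop events on its behalf
        return ["python", "-m", "indirect_entrypoint"] + original_cmd
    return wrapper

class DemoStep:
    @proxy_launch_cmd
    def get_launch_cmd(self) -> t.List[str]:
        return ["echo", "hello"]
```

Because only the decorator changes, existing step classes keep their launch logic untouched.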
…or (#416)

* reduce snooze time in tests

* use telemetry cooldown to avoid premature auto-shutdown

* add cooldown param, test autoshutdown

* remove commented line

* fix incorrect CLI arg type

* add debug logging

* fix launcher overwrite bug

* add new logger param to tests

* update test for fixed CLI arg type

* avoid suppressing telemetrymonitor output

* add multi-start tests for telmon

* add test assertions for cooldown verification

* loosen assertion

* use torch.save to avoid jit segfault

* better logging redirection

* add faux return codes for WLM tasks

* format tests w/black

* fix incorrectly logged data

* fix help text typo

* add typehint
[ committed by @ankona @AlyssaCote ]
[ reviewed by @al-rigazzi ]

codecov bot commented Nov 29, 2023

Codecov Report

Merging #426 (c353cf3) into develop (15ba145) will increase coverage by 0.67%.
Report is 1 commit behind head on develop.
The diff coverage is 99.29%.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #426      +/-   ##
===========================================
+ Coverage    89.65%   90.33%   +0.67%     
===========================================
  Files           59       60       +1     
  Lines         3617     3839     +222     
===========================================
+ Hits          3243     3468     +225     
+ Misses         374      371       -3     
Files Coverage Δ
smartsim/_core/config/config.py 98.71% <100.00%> (+0.16%) ⬆️
smartsim/_core/control/controller.py 86.88% <100.00%> (+1.49%) ⬆️
smartsim/_core/control/job.py 94.94% <100.00%> (+1.52%) ⬆️
smartsim/_core/control/jobmanager.py 93.95% <100.00%> (-0.09%) ⬇️
smartsim/_core/launcher/launcher.py 100.00% <ø> (ø)
smartsim/_core/launcher/local/local.py 95.91% <100.00%> (+1.91%) ⬆️
smartsim/_core/launcher/step/localStep.py 89.74% <100.00%> (+1.17%) ⬆️
smartsim/_core/launcher/step/mpiStep.py 88.88% <100.00%> (+0.15%) ⬆️
smartsim/_core/utils/helpers.py 92.17% <100.00%> (+1.97%) ⬆️
smartsim/_core/utils/serialize.py 100.00% <100.00%> (ø)
... and 6 more

... and 1 file with indirect coverage changes

@MattToast (Member) left a comment

Small notes from discussion

Resolved review threads: conftest.py, smartsim/_core/utils/serialize.py, smartsim/_core/control/controller.py, smartsim/_core/config/config.py
* Fix hiding -h parameter from plugin

* add test for plugin argument passthrough

* Avoid exception breaking CLI on invalid plugin

* modify plugin load check for python>=3.7

* Revert removal of positional-only args to cli action handlers

* remove unnecessary member var

* remove self.args member from init

* fix formatting issues

* remove invalidated test assertion
@MattToast (Member) left a comment

Found some things I would like addressed in the TM tests before we approve for merge

Resolved review threads: tests/test_telemetry_monitor.py (10 comments)
Hides stacktrace from users who terminate the dashboard via ctrl-c. 
Now the keyboard interrupt is caught and the CLI will shutdown gracefully
displaying an appropriate message to stdout

[ committed by @ankona ]
[ reviewed by @MattToast ]
Respond to the first round of reviewer feedback for the dashboard.

[ committed by @ankona, @MattToast ]
[ reviewed by @MattToast ]
@MattToast MattToast self-requested a review December 1, 2023 17:16
@MattToast MattToast marked this pull request as ready for review December 1, 2023 17:25
@MattToast MattToast dismissed their stale review December 4, 2023 19:16

As an author, I addressed my own change requests and my review is now stale

@ashao (Collaborator) left a comment

In the interest of getting feedback quickly, I'll do two reviews. One which goes through the implementation (here). I'll post any comments about the tests later on.

Resolved review threads: smartsim/_core/_cli/validate.py, smartsim/_core/_cli/cli.py (2), smartsim/_core/_cli/dbcli.py, smartsim/_core/_cli/plugin.py, smartsim/_core/entrypoints/telemetrymonitor.py (2), smartsim/_core/launcher/step/slurmStep.py
# "exe_args": run_settings.exe_args,
"run_command": run_settings.run_command,
"run_args": run_settings.run_args,
# TODO: We currently do not have a way to represent MPMD commands!
Collaborator commented:

This doesn't seem like it should be that hard to do? MPMD should just be lists of the extra slurm/pbs commands, e.g. -N 1 -n 1 -c 32 : -N 1 -n32 -c 1

Member commented:

I would agree if the dashboard were expecting a str. Unfortunately right now the dashboard expects a dict[str, str] and we cannot "re-add" keys for the dashboard to display (e.g. in your example it would be non-trivial to display the two different values of -n).

We could turn the "run_args" key into a list[dict[str, str]], but I would have to check with @AlyssaCote to see how much effort it would take for the dashboard to render this info on the frontend!

Collaborator commented:

Ahhhh, I see. Ok, as a stopgap for now let's throw a warning here and put up a ticket.
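The list[dict[str, str]] encoding floated above could be produced by splitting the flat MPMD token list on the ":" separators from the earlier `-N 1 -n 1 -c 32 : -N 1 -n 32 -c 1` example. A purely illustrative sketch (not SmartSim code):

```python
import typing as t

def split_mpmd_run_args(tokens: t.List[str]) -> t.List[t.Dict[str, str]]:
    """One possible list[dict[str, str]] representation for MPMD run args:
    split the flat token list on ':' separators and pair up '-flag value'
    tokens within each program block. Illustrative only; assumes every flag
    takes exactly one value."""
    blocks: t.List[t.Dict[str, str]] = []
    current: t.List[str] = []
    for token in tokens + [":"]:  # trailing ':' flushes the final block
        if token == ":":
            if current:
                blocks.append({current[i].lstrip("-"): current[i + 1]
                               for i in range(0, len(current), 2)})
            current = []
        else:
            current.append(token)
    return blocks
```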

Resolved review thread: smartsim/_core/utils/serialize.py
@ashao (Collaborator) left a comment

Some very minor comments. Otherwise the tests look really good!

@@ -398,7 +398,7 @@ def test_colocated_db_model_tf(fileutils, wlmutils, mlutils):
test_script = fileutils.get_test_conf_path("run_tf_dbmodel_smartredis.py")

# Create SmartSim Experience
-exp = Experiment(exp_name, launcher=test_launcher)
+exp = Experiment(exp_name, launcher=test_launcher, exp_path=test_dir)
Collaborator commented:

Al and I found that this is also needed to get the tests to run on horizon. Out of curiosity, do you know what the behaviour is for the dashboard if you omit exp_path?

Member commented:

If you omit exp_path and leave the SMARTSIM_FLAG_TELEMETRY flag enabled, it leads to placement of .smartsim directories in the CWD. To make sure that these directories always ended up in the test_output dir, it was decided to give every experiment an appropriate path.

Resolved review thread: tests/test_indirect.py
@al-rigazzi (Collaborator) left a comment

Great work so far! Just requesting a couple of minor fixes to docs and maybe some more tests for files for which coverage seems a bit low.

return smart_cli.execute(sys.argv)
except SmartSimCLIActionCancelled as ssi:
logger.debug(ssi, exc_info=True)
logger.info(ssi)
Collaborator commented:

Don't lines L44-45 lead to a duplicate message if SMARTSIM_LOG_LEVEL>="debug"? I understand that the debug version has exception info attached, but in that case it looks like we don't really need the info one. Or am I missing something?

Member commented:

Nope, that is absolutely the case! Will give the debug logger call a generic "here is the traceback" message.
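The pattern agreed on in this thread can be sketched as follows (logger name and function are illustrative, not the actual CLI code):

```python
import logging

logger = logging.getLogger("smartsim-cli")

def report_cancelled(exc: BaseException) -> str:
    """Log the traceback once at debug level behind a generic message, and
    the user-facing text once at info, so the two lines are no longer
    duplicates when SMARTSIM_LOG_LEVEL is set to debug."""
    logger.debug("An exception was raised; traceback follows:", exc_info=True)
    logger.info(str(exc))
    return str(exc)
```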

logger.info(ssi)
except KeyboardInterrupt:
msg = "SmartSim was terminated by user"
logger.debug(msg, exc_info=True)
Collaborator commented:

Don't lines L48-49 lead to a duplicate message if SMARTSIM_LOG_LEVEL>="debug"? I understand that the debug version has exception info attached, but in that case it looks like we don't really need the info one. Or am I missing something?

Member commented:

ditto here

if spec is None:
raise AttributeError()
except (ModuleNotFoundError, AttributeError):
print(not_found)
Collaborator commented:

Do we want to resort to print instead of using a better logger?

Member commented:

Good call!

stdout, _ = process.communicate()

plugin_stdout = stdout.decode("utf-8")
print(plugin_stdout)
Collaborator commented:

Do we want to resort to print instead of using a better logger?

Member commented:

On second read of this, I'm not sure why we capture the output and then print it to the screen immediately after the process dies. Seems like we could just let the plugin decide what to do with its output and simply not capture it.

I replaced this block with a subprocess.run call without capturing any io. Let me know if you can think of a good reason that we should instead capture the output and send it through a logger!
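The replacement described above amounts to something like the following sketch (function name is illustrative):

```python
import subprocess
import sys
import typing as t

def run_plugin(cmd: t.List[str]) -> int:
    """Run a CLI plugin without capturing its stdio: the child inherits the
    parent's streams and writes directly to the terminal, and we surface
    only its return code."""
    completed = subprocess.run(cmd, check=False)
    return completed.returncode
```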

@@ -204,6 +204,17 @@ def test_account(self) -> t.Optional[str]: # pragma: no cover
# no account by default
return os.environ.get("SMARTSIM_TEST_ACCOUNT", None)

@property
def telemetry_frequency(self) -> int:
return int(os.environ.get("SMARTSIM_TELEMETRY_FREQUENCY", 5))
Collaborator commented:

Looks like an easy test to add to make Codecov happy.

Member commented:

Test added!
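The suggested coverage test is indeed straightforward; a standalone sketch, using a free function that mirrors the property (since the real Config class isn't reproduced here) and a manual env-var round trip in place of pytest's monkeypatch:

```python
import os

def telemetry_frequency() -> int:
    # Mirrors the property above: env-var override with a default of 5
    return int(os.environ.get("SMARTSIM_TELEMETRY_FREQUENCY", 5))

def check_default_and_override() -> tuple:
    """Exercise both branches: unset env -> default, set env -> override."""
    os.environ.pop("SMARTSIM_TELEMETRY_FREQUENCY", None)
    default = telemetry_frequency()
    os.environ["SMARTSIM_TELEMETRY_FREQUENCY"] = "10"
    overridden = telemetry_frequency()
    os.environ.pop("SMARTSIM_TELEMETRY_FREQUENCY", None)  # clean up
    return default, overridden
```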

Resolved review threads: tests/test_configs/telemetry/db_and_model_1run.json, ensembles.json, serialmodels.json, telemetry.json, tests/test_telemetry_monitor.py
@MattToast MattToast requested review from al-rigazzi and ashao December 6, 2023 01:05
@al-rigazzi (Collaborator) left a comment

Two super-minor fixes to docs (I missed on first review) and then it looks good to me!

Resolved review thread: smartsim/_core/control/controller.py
self.task_id: str = ""
self.type: str = ""
self.timestamp: int = 0
self.status_dir: str = ""
Collaborator commented:

Looks good!

exp_name: str
exp_path: str
launcher_name: str
run_id: str = field(default_factory=_helpers.create_short_id_str)
Collaborator commented:

Thanks, possibly not needed after you explained, but really appreciate making this clearer for devs
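The `field(default_factory=...)` pattern in the snippet above is worth a small illustration; `RunRecord` and `create_short_id_str` here are hypothetical stand-ins for the actual dataclass and `_helpers.create_short_id_str`:

```python
import dataclasses
import uuid

def create_short_id_str() -> str:
    # Hypothetical stand-in for _helpers.create_short_id_str
    return str(uuid.uuid4())[:7]

@dataclasses.dataclass
class RunRecord:
    """Shows why run_id uses field(default_factory=...): the factory runs
    per instance, so every record gets a fresh ID rather than one value
    evaluated once at class-definition time."""
    exp_name: str
    run_id: str = dataclasses.field(default_factory=create_short_id_str)
```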

Resolved review threads: smartsim/_core/launcher/step/step.py, smartsim/experiment.py
@MattToast MattToast requested a review from al-rigazzi December 6, 2023 18:21
@al-rigazzi (Collaborator) left a comment

LGTM! Thanks for addressing the very last remarks!

@MattToast MattToast merged commit d8fba1b into develop Dec 6, 2023
26 checks passed
MattToast added a commit that referenced this pull request Dec 15, 2023
Bumps the required number of nodes in the test docs from 3 to 4 as required by the tests in #381 and #426.

[ committed by @MattToast ]
[ reviewed by @ashao ]
@MattToast MattToast deleted the dashboard branch December 18, 2023 22:27