Add support for producing dashboard outputs #426
Conversation
* Create a `LaunchedManifest` class to surface `Step` info to the `Controller` so that it can be dumped into a `manifest.json` file as a static representation of an `Experiment`'s execution, for use by tools external to the driver script itself (a rough sketch of such a manifest follows this list). [ committed by @MattToast ] [ reviewed by @ankona @AlyssaCote ] Co-authored-by: Matt Drozt <drozt@hpe.com>
* Add ability to passthrough to CLI plugin [ committed by @ankona ] [ reviewed by @MattToast ]
* Add initial telemetry monitor. Co-authored-by: Andrew Shao <Andrew.Shao@hpe.com> Co-authored-by: Matt Drozt <drozt@hpe.com> [ committed by @ankona @ashao @MattToast ] [ reviewed by @MattToast ]
* Move the logic for launching a process through the indirect module into a dedicated proxy step class decorator, so that unmanaged steps can be tracked by the telemetry monitor without changing the logic of existing steps or launchers.
* …or (#416): reduce snooze time in tests; use telemetry cooldown to avoid premature auto-shutdown; add cooldown param and test autoshutdown; remove commented line; fix incorrect CLI arg type; add debug logging; fix launcher overwrite bug; add new logger param to tests; update test for fixed CLI arg type; avoid suppressing telemetry monitor output; add multi-start tests for telmon; add test assertions for cooldown verification; loosen assertion; use torch.save to avoid jit segfault; better logging redirection; add faux return codes for WLM tasks; format tests w/ black; fix incorrectly logged data; fix help text typo; add typehint. [ committed by @ankona @AlyssaCote ] [ reviewed by @al-rigazzi ]
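To make the manifest idea above more concrete, here is a minimal, purely illustrative sketch of the kind of static JSON a driver-external tool could consume. The field names and values are assumptions for illustration only, not the actual schema the `Controller` dumps.

```python
# Purely illustrative: field names/values are assumptions, not the real schema.
import json

launched_manifest = {
    "experiment": {"name": "my-exp", "path": "/scratch/my-exp", "launcher": "slurm"},
    "runs": [
        {
            "run_id": "a1b2c3",
            "model": {
                "name": "my-model",
                "run_command": "srun",
                "run_args": {"nodes": "1", "ntasks": "32"},
            },
        }
    ],
}

# The controller would write a file like this once per experiment start so that
# external tools (e.g. a dashboard) can read it without importing the driver script.
with open("manifest.json", "w") as manifest_file:
    json.dump(launched_manifest, manifest_file, indent=2)
```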
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##           develop     #426      +/-   ##
===========================================
+ Coverage    89.65%   90.33%    +0.67%
===========================================
  Files           59       60        +1
  Lines         3617     3839      +222
===========================================
+ Hits          3243     3468      +225
+ Misses         374      371        -3
Small notes from discussion
* Fix hiding `-h` parameter from plugin
* Add test for plugin argument passthrough
* Avoid exception breaking CLI on invalid plugin
* Modify plugin load check for python>=3.7
* Revert removal of positional-only args to CLI action handlers
* Remove unnecessary member var
* Remove `self.args` member from init
* Fix formatting issues
* Remove invalidated test assertion
Found some things I would like addressed in the TM tests before we approve for merge
Hides the stacktrace from users who terminate the dashboard via Ctrl-C. The keyboard interrupt is now caught and the CLI shuts down gracefully, displaying an appropriate message to stdout. [ committed by @ankona ] [ reviewed by @MattToast ]
Respond to the first round of reviewer feedback for the dashboard. [ committed by @ankona, @MattToast ] [ reviewed by @MattToast ]
As an author, I addressed my own change requests and my review is now stale
In the interest of getting feedback quickly, I'll do two reviews. One which goes through the implementation (here). I'll post any comments about the tests later on.
# "exe_args": run_settings.exe_args,
"run_command": run_settings.run_command,
"run_args": run_settings.run_args,
# TODO: We currently do not have a way to represent MPMD commands!
This doesn't seem like it should be that hard to do? MPMD should just be lists of the extra Slurm/PBS commands, e.g. `-N 1 -n 1 -c 32 : -N 1 -n32 -c 1`.
I would agree if the dashboard were expecting a `str`. Unfortunately, right now the dashboard expects a `dict[str, str]`, and we cannot "re-add" keys for the dashboard to display (e.g. in your example it would be non-trivial to display the two different values of `-n`).

We could turn the `"run_args"` key into a `list[dict[str, str]]`, but I would have to check with @AlyssaCote to see how much effort it would take for the dashboard to render this info on the frontend!
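To make the trade-off concrete, here is a hypothetical sketch of the two shapes being discussed; the keys and values are invented for illustration and do not reflect the actual manifest schema.

```python
# Hypothetical illustration of the two shapes discussed above (not the real schema).

# What the dashboard consumes today: a single flat mapping, so repeated flags
# from an MPMD command (e.g. a second "-n") cannot be represented.
run_args_flat = {"N": "1", "n": "1", "c": "32"}

# A possible MPMD-friendly shape: one mapping per program segment,
# e.g. for `-N 1 -n 1 -c 32 : -N 1 -n 32 -c 1`.
run_args_mpmd = [
    {"N": "1", "n": "1", "c": "32"},
    {"N": "1", "n": "32", "c": "1"},
]
```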
Ahhhh, I see. OK, as a stopgap for now let's throw a warning here and put up a ticket.
Some very minor comments. Otherwise the tests look really good!
@@ -398,7 +398,7 @@ def test_colocated_db_model_tf(fileutils, wlmutils, mlutils):
     test_script = fileutils.get_test_conf_path("run_tf_dbmodel_smartredis.py")

     # Create SmartSim Experience
-    exp = Experiment(exp_name, launcher=test_launcher)
+    exp = Experiment(exp_name, launcher=test_launcher, exp_path=test_dir)
Al and I found that this is also needed to get the tests to run on horizon. Out of curiosity, do you know what the behaviour is for the dashboard if you omit `exp_path`?
If you omit `exp_path` and leave the `SMARTSIM_FLAG_TELEMETRY` flag enabled, `.smartsim` directories end up in the CWD. To make sure these directories always land in the `test_output` dir, we decided to give every experiment an appropriate path.
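For context, the pattern being applied across the tests looks roughly like the sketch below; the experiment name and directory are placeholders standing in for whatever the test fixtures actually provide.

```python
# Sketch of the test pattern described above; names and paths are placeholders.
import os

from smartsim import Experiment

# Stand-in for the per-test output directory that the test fixtures provide.
test_dir = "./tests/test_output/test_colocated_db_model_tf"
os.makedirs(test_dir, exist_ok=True)

# Rooting the experiment under the test output directory keeps telemetry
# artifacts (the `.smartsim` directory) out of the current working directory.
exp = Experiment("colocated-db-model-tf", launcher="local", exp_path=test_dir)
```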
Great work so far! Just requesting a couple of minor fixes to docs and maybe some more tests for files for which coverage seems a bit low.
smartsim/_core/_cli/__main__.py (outdated)
    return smart_cli.execute(sys.argv)
except SmartSimCLIActionCancelled as ssi:
    logger.debug(ssi, exc_info=True)
    logger.info(ssi)
Don't lines L44-45 lead to a duplicate message if `SMARTSIM_LOG_LEVEL>="debug"`? I understand that the `debug` version has exception info attached, but in that case it looks like we don't really need the `info` one? Or am I missing something?
Nope, that is absolutely the case! I will give the debug logger call a generic "here is the traceback" message.
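To picture the change, the pattern might look roughly like this self-contained sketch; the CLI exception type is stubbed out here and this is not the exact committed code.

```python
# Self-contained sketch of the logging pattern; the real CLI types are stubbed.
import logging

logger = logging.getLogger("smartsim")


class SmartSimCLIActionCancelled(Exception):
    """Stand-in for SmartSim's CLI exception type."""


def run_cli_action() -> None:
    try:
        raise SmartSimCLIActionCancelled("action cancelled")  # simulate a cancel
    except SmartSimCLIActionCancelled as ssi:
        # The traceback is attached only at debug level behind a generic message,
        # so the info line below is the single user-facing copy of the error text.
        logger.debug("Traceback:", exc_info=True)
        logger.info(ssi)
```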
smartsim/_core/_cli/__main__.py (outdated)
    logger.info(ssi)
except KeyboardInterrupt:
    msg = "SmartSim was terminated by user"
    logger.debug(msg, exc_info=True)
Don't lines L48-49 lead to a duplicate message if `SMARTSIM_LOG_LEVEL>="debug"`? I understand that the `debug` version has exception info attached, but in that case it looks like we don't really need the `info` one? Or am I missing something?
ditto here
smartsim/_core/_cli/plugin.py (outdated)
    if spec is None:
        raise AttributeError()
except (ModuleNotFoundError, AttributeError):
    print(not_found)
Do we want to resort to `print` instead of using a better logger?
Good call!
smartsim/_core/_cli/plugin.py (outdated)
stdout, _ = process.communicate()

plugin_stdout = stdout.decode("utf-8")
print(plugin_stdout)
Do we want to resort to `print` instead of using a better logger?
On second read of this, I'm not sure why we capture the output and then print it to the screen immediately after the process dies. It seems like we could just let the plugin decide what to do with its output and simply not capture it.

I replaced this block with a `subprocess.run` call without capturing any IO. Let me know if you can think of a good reason that we should instead capture the output and send it through a logger!
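For reference, the replacement behaves roughly like the sketch below; the function name and the example command are placeholders, not the actual plugin code.

```python
# Sketch of the approach described above; names and the command are placeholders.
import subprocess
import sys


def run_plugin(plugin_cmd: list) -> int:
    # Let the plugin inherit stdout/stderr rather than capturing and re-printing,
    # so the plugin decides how to present its own output.
    completed = subprocess.run(plugin_cmd, check=False)
    return completed.returncode


if __name__ == "__main__":
    # Stand-in command; a real plugin entry point would be substituted here.
    sys.exit(run_plugin([sys.executable, "-c", "print('hello from a plugin')"]))
```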
@@ -204,6 +204,17 @@ def test_account(self) -> t.Optional[str]:  # pragma: no cover
         # no account by default
         return os.environ.get("SMARTSIM_TEST_ACCOUNT", None)

+    @property
+    def telemetry_frequency(self) -> int:
+        return int(os.environ.get("SMARTSIM_TELEMETRY_FREQUENCY", 5))
Looks like an easy test to add to make Codecov happy.
Test added!
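A test along these lines could look roughly like the sketch below; the `CONFIG` import path and fixture usage are assumptions for illustration and may not match the test that was actually added.

```python
# Hypothetical sketch; the import path is an assumption, not the committed test.
import pytest

from smartsim._core.config import CONFIG  # assumed location of the config object


def test_telemetry_frequency_default(monkeypatch: pytest.MonkeyPatch) -> None:
    # With the env var unset, the default of 5 seconds should apply.
    monkeypatch.delenv("SMARTSIM_TELEMETRY_FREQUENCY", raising=False)
    assert CONFIG.telemetry_frequency == 5


def test_telemetry_frequency_override(monkeypatch: pytest.MonkeyPatch) -> None:
    # The property reads the env var at access time and coerces it to an int.
    monkeypatch.setenv("SMARTSIM_TELEMETRY_FREQUENCY", "30")
    assert CONFIG.telemetry_frequency == 30
```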
Two super-minor fixes to docs (I missed on first review) and then it looks good to me!
self.task_id: str = ""
self.type: str = ""
self.timestamp: int = 0
self.status_dir: str = ""
Looks good!
exp_name: str
exp_path: str
launcher_name: str
run_id: str = field(default_factory=_helpers.create_short_id_str)
Thanks! Possibly not needed after your explanation, but I really appreciate you making this clearer for devs.
LGTM! Thanks for addressing the very last remarks!
Bumps the required number of nodes in the test docs from 3 to 4 as required by the tests in #381 and #426. [ committed by @MattToast ] [ reviewed by @ashao ]
Add support for producing & consuming dashboard outputs.
* `telemetry monitor` to check for updates and produce events for the dashboard
* `controller` to conditionally start the `telemetry monitor`
* `controller` to produce a runtime manifest to trigger telemetry collection
* `indirect proxy` to produce events for the dashboard for unmanaged tasks