v0.6.0 release #443

al-rigazzi · 2023-12-18T18:39:45Z

Bring master up to date with develop for v0.6.0 release.

@MattToast

Updates the Python package and documentation from 0.5.0 to 0.5.1. [ committed by @MattToast ] [ reviewed by @al-rigazzi ]

@MattToast

Update version names for v0.5.1. Removed unintentional ignore of `docs/`. Ignore `.DS_Store`. [ committed by @MattToast ] [ reviewed by @ashao ]

@al-rigazzi

Adds explicit shutdown of DB shards. Previously, DBs were terminated by simply terminating their processes, but that does not work in certain settings. [ committed by @al-rigazzi ] [ reviewed by @MattToast ]

@ashao

When trying to debug tests on the GitHub CI, the output from pytest is often insufficient because some errors are written to the redirected stderr and stdout of launched applications. This PR uploads the test_output directory as an artifact of the workflow on failure. [ committed by @ashao ] [ reviewed by @al-rigazzi, @MattToast ]

@MattToast

The 'Install' link in the README was pointing to a page on the Cray Labs docs that has since been moved and renamed. It now points to the basic installation instructions for SmartSim. [ committed by @MattToast ] [ reviewed by @al-rigazzi ]

@MattToast

Make `EntityList` a generic collection. Add an internal covariant `EntitySequence` type for internal type hints, with a note explaining why it is not truly covariant and should only be used for internal type checking. [ committed by @MattToast ] [ reviewed by @ankona ]
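As a rough illustration of the idea (not the actual SmartSim code), a generic mutable collection can sit on top of a covariant, hint-only base type. In the sketch below, `Entity`, `Model`, and `names` are hypothetical stand-ins, and the class bodies are assumptions made purely for illustration.

```python
from typing import Generic, Iterator, List, Sequence, TypeVar


class Entity:
    """Hypothetical base class standing in for a SmartSim entity."""

    def __init__(self, name: str) -> None:
        self.name = name


class Model(Entity):
    """Hypothetical entity subclass."""


_T = TypeVar("_T", bound=Entity)
_T_co = TypeVar("_T_co", bound=Entity, covariant=True)


class EntitySequence(Generic[_T_co]):
    """Covariant view intended for internal type hints only.

    The storage underneath is a mutable list, so the type is not truly
    covariant; it is only safe when callers do not mutate through it.
    """

    def __init__(self, entities: Sequence[_T_co]) -> None:
        self._entities: List[_T_co] = list(entities)

    def __iter__(self) -> Iterator[_T_co]:
        return iter(self._entities)

    def __len__(self) -> int:
        return len(self._entities)


class EntityList(EntitySequence[_T]):
    """Generic, mutable collection of a single entity type."""

    def append(self, entity: _T) -> None:
        self._entities.append(entity)


def names(seq: EntitySequence[Entity]) -> List[str]:
    # A type checker accepts EntityList[Model] here because
    # EntitySequence is declared covariant in its type parameter.
    return [entity.name for entity in seq]


models: EntityList[Model] = EntityList([Model("a")])
models.append(Model("b"))
model_names = names(models)
```

The covariant base lets internal helpers accept a list of any entity subtype, while the invariant `EntityList` keeps mutation type-safe for users.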

@al-rigazzi

This PR adds log files reporting the values assigned to parameters during the generation phase. A file named smartsim_params.txt is added to the root directory of each generated entity, where the user can see which values were assigned to parameters for its execution. [ committed by @al-rigazzi ] [ reviewed by @MattToast ]

@al-rigazzi

Pin pylint version to 2.x, as 3.0.0 has some bugs. [ committed by @al-rigazzi ] [ reviewed by @MattToast ]

@al-rigazzi

Drops support for RedisAI 1.2.5. The only supported RedisAI version is now 1.2.7. Since the officially released RedisAI 1.2.7 has a bug which breaks the build process on macOS, it was decided to use commit RedisAI/RedisAI@634916c from RedisAI's GitHub repository, where this bug has been fixed. This applies to all operating systems. [ committed by @al-rigazzi ] [ reviewed by @MattToast @mellis13 ]

@al-rigazzi

This PR improves the way colocated DBs are launched. An erroneous warning has been suppressed, and colocated launcher script names now contain the name of the entity being launched, which avoids race conditions. Colocated launcher scripts have been moved to a hidden directory named `.smartsim` to reduce clutter in user directories. [ committed by @al-rigazzi ] [ reviewed by @MattToast @mellis13 @ashao @ankona ]

@MattToast

A general refactor of the DBNode class to:

- Remove the duplicate methods with near identical functionality where one was intended to be called if the underlying RunSettings were standard and the other if they were MPMD. This removes undefined behavior when the "wrong" method was called.
- Allow MPMD DBNodes to map information "per shard" by scraping output files for a serializable LaunchedShardData class.

[ committed by @MattToast ] [ reviewed by @ashao ]

@billschereriii

Add support for MINBATCHTIMEOUT [ committed by @billschereriii ] [ reviewed by @ashao @al-rigazzi ]

@ashao

Enables experiments to have multiple databases running. Users pass a database identifier to the methods create_database, colocate_db_tcp and colocate_db_uds, and SmartSim sets the environment variables SSDB and SR_DB_TYPE with the identifier as a suffix. This PR requires an up-to-date SmartRedis. [ committed by @julia.putko ] [ reviewed by @ashao @ankona @al-rigazzi @billschereriii ]
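To make the suffixing concrete, here is a minimal sketch of how per-database variable names could be derived. The helper `db_env_vars` is hypothetical, and the underscore separator and the example values are assumptions; the description above only states that the identifier is used as a suffix.

```python
from typing import Dict, Optional


def db_env_vars(
    address: str, db_type: str, db_identifier: Optional[str] = None
) -> Dict[str, str]:
    """Build the environment entries for one database (hypothetical helper).

    Assumption: the identifier is appended after an underscore. Without an
    identifier the plain variable names are used.
    """
    suffix = f"_{db_identifier}" if db_identifier else ""
    return {
        f"SSDB{suffix}": address,
        f"SR_DB_TYPE{suffix}": db_type,
    }


# Two databases in one experiment get distinct, non-colliding variables.
env: Dict[str, str] = {}
env.update(db_env_vars("10.0.0.1:6379", "Standalone", "db_a"))
env.update(db_env_vars("10.0.0.2:6379", "Standalone", "db_b"))
```

Because each identifier yields its own pair of variables, a client can select a database by identifier without the entries overwriting each other.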

@billschereriii

Expose first_device parameter for setting models, scripts, functions (#394) [ committed by @billschereriii ] [ reviewed by @al-rigazzi @MattToast ]

@al-rigazzi

Changes `format` argument of `Experiment.summary()` to `style`, to avoid redefinition of builtin `format`. [ committed by @al-rigazzi ] [ reviewed by @mellis13 ]
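The motivation can be shown with a small standalone sketch. Both functions below are hypothetical and unrelated to the real `Experiment.summary()` body; they only demonstrate what happens when an argument named `format` hides the builtin, which linters such as pylint report as `redefined-builtin`.

```python
def summary_shadowing(rows, format="github"):
    # Inside this function `format` is the string argument, so the
    # builtin format() can no longer be reached by its name.
    try:
        return ", ".join(format(value, ".2f") for value in rows)
    except TypeError as err:
        return f"TypeError: {err}"


def summary(rows, style="github"):
    # With the argument renamed, the builtin format() is available again.
    return f"{style}: " + ", ".join(format(value, ".2f") for value in rows)


broken = summary_shadowing([1.5, 2.25])
fixed = summary([1.5, 2.25], style="plain")
```

Callers that passed the argument by keyword need to switch from `format=` to `style=` after this change.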

@rickybalin

Adds an option to specify an affinity script for `PALSMpiexecSettings`. The script is added to the run command after the run arguments, with no "--" prefix, and optional arguments can be passed after it. [ committed by @rickybalin ] [ reviewed by @al-rigazzi ]

@al-rigazzi

Fixes a bug which resulted in multiple DB addresses being mixed up in the Job Manager. [ committed by @al-rigazzi @amandarichardsonn ] [ reviewed by @MattToast ]

An email address was added to the documentation footer and to README.md. The Slack link was also added to the documentation footer.

@juliaputko

Fixed typo in Experiment.create_model() documentation. Added db_identifier to Experiment.create_database() documentation (#408). [ committed by @juliaputko ] [ reviewed by @al-rigazzi ]

@al-rigazzi

The version of `types-tensorflow` was pinned to be compatible with the features of TensorFlow used by SmartSim. [ committed by @al-rigazzi ] [ reviewed by @ankona @MattToast ]

@billschereriii

Split tests into groups for parallel execution in CI/CD runners [ committed by @billschereriii ] [ reviewed by @al-rigazzi @ankona ]

@ankona

Mitigate pytest warnings due to unregistered test groups [ committed by @ankona ] [ reviewed by @billschereriii ]
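For context, pytest warns about any `@pytest.mark.<name>` that has not been registered. One common way to register markers is the `pytest_configure` hook in a `conftest.py`, sketched below. The group names are hypothetical, and the PR may register its markers differently (for example in a configuration file).

```python
# conftest.py (sketch)

# Hypothetical group names used to split the suite across CI runners.
TEST_GROUPS = ("group_a", "group_b")


def pytest_configure(config):
    """Register one marker per test group so pytest does not warn."""
    for group in TEST_GROUPS:
        config.addinivalue_line(
            "markers", f"{group}: tests assigned to {group}"
        )
```

A test decorated with `@pytest.mark.group_a` can then be selected on a runner with `pytest -m group_a`.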

@ankona

Add support for producing & consuming telemetry outputs.

- Adds telemetry monitor to check for updates and produce events for the dashboard
- Updates controller to conditionally start telemetry monitor
- Updates controller to produce a runtime manifest to trigger telemetry collection
- Adds indirect proxy to produce events for the dashboard for unmanaged tasks
- Adds CLI capability to launch dashboard

[ committed by @ankona, @MattToast, @AlyssaCote ] [ reviewed by @al-rigazzi, @ashao ]

---------

Co-authored-by: Matt Drozt <drozt@hpe.com>
Co-authored-by: Alyssa Cote <46540273+AlyssaCote@users.noreply.github.com>

@al-rigazzi

A number of defects were found in the testing that primarily affected the tests run on HPC platforms. Many of these were uncovered during the recent introduction of major features that touched on various aspects of the test suite. Most fixes focus on changes to the tests themselves rather than fundamental changes in the underlying code base. Some of the major changes are:

- Most tests are run in their own experiment directory, avoiding the overwrite of directories between tests
- When attempting to run multi-GPU tests, a bug (presumably in RedisAI) was found that prevents the setting of multiple GPUs when using the TensorFlow backend. These tests now only use a single GPU regardless of the value of `SMARTSIM_TEST_NUM_GPUS`
- Tests that spin up an `Orchestrator` now always stop it before exiting, whether the test succeeds or a different component of the test fails

Lastly, changes were also made to `QsubBatchSettings` to add support for PBS-like platforms that use the `resources` tag to define additional resources and/or otherwise customize PBS batch jobs.

[ committed by @al-rigazzi and @ashao ] [ reviewed by @MattToast ]

Co-authored-by: Alessandro Rigazzi <al.rigazzi@hpe.com>
Co-authored-by: Andrew Shao <andrew.shao@hpe.com>
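The "always stop the database" behavior described in this PR is the classic try/finally cleanup pattern. A minimal sketch follows, with `FakeOrchestrator` and `running` as hypothetical stand-ins rather than SmartSim classes or fixtures.

```python
from contextlib import contextmanager


class FakeOrchestrator:
    """Hypothetical stand-in for a database that must be stopped."""

    def __init__(self) -> None:
        self.running = False

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False


@contextmanager
def running(db: FakeOrchestrator):
    db.start()
    try:
        yield db
    finally:
        # Executed whether the test body passes or raises.
        db.stop()


def run_test_body(db: FakeOrchestrator, fail: bool) -> bool:
    """Run a fake test body and report whether the database was left running."""
    try:
        with running(db):
            if fail:
                raise AssertionError("a different component of the test failed")
    except AssertionError:
        pass
    return db.running
```

In both the passing and the failing case the function returns `False`, meaning no database is left running to interfere with later tests.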

@amandarichardsonn

SmartSim multi Orchestrator Example (#409)

This PR merges in an example demonstrating setting up multiple Orchestrators and connecting to the databases from within an application and the driver script. The example guides the user through the newly released multi Orchestrator functionality. [ committed by @amandarichardsonn ] [ reviewed by @ashao @billschereriii @juliaputko ]

@al-rigazzi

This PR adds a section named ML Features to the documentation. The section contains multiple examples of how ML models can be uploaded to and executed on the database. The PR also adds a section to the Online Analysis tutorial. The section uses TorchScript functions to post-process the simulation data. [ committed by @al-rigazzi ] [ reviewed by @amandarichardsonn @mellis13 ]

@al-rigazzi

This PR makes `sacct` and `sstat` errors result in an exception when running Slurm-based workflows. Previously, the errors were ignored and this could result in SmartSim's state becoming inconsistent and unstable. [ committed by @al-rigazzi ] [ reviewed by @MattToast ]
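The general idea of failing loudly instead of ignoring a scheduler command error can be sketched as follows. `SchedulerCommandError` and `run_checked` are hypothetical names, not SmartSim APIs, and the Python interpreter is used as a stand-in command so the sketch runs without Slurm installed.

```python
import subprocess
import sys
from typing import List


class SchedulerCommandError(RuntimeError):
    """Hypothetical exception raised when a scheduler query fails."""


def run_checked(cmd: List[str]) -> str:
    """Run a command and raise instead of silently ignoring a failure."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise SchedulerCommandError(
            f"{cmd[0]} exited with code {proc.returncode}: {proc.stderr.strip()}"
        )
    return proc.stdout


# Stand-in for a successful status query.
ok_output = run_checked([sys.executable, "-c", "print('ok')"])
```

Raising at the point of failure keeps the caller from updating job state based on empty or partial output.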

@MattToast

Fixes some conflicting directives in the SmartSim packaging instructions:

- `setup.py` manually listed an incomplete list of packages to include in SmartSim while the `setup.cfg` was using `find_packages`. This commit defaults to `setup.cfg` with a slightly refined `include` directive.
- `setup.py` manually listed `package_data` directives while the `setup.cfg` set `include_package_data=True`. This commit keeps both strategies but lists both in the `setup.cfg`.
- In order to exclude the `__pycache__` created from SmartSim modules imported during the `setup.py` script, `__pycache__` and related files were explicitly ignored in the `MANIFEST.in`.
- In order to ensure that the `smartsim._core.launcher.local` package was found by `find_packages` without reverting to `find_namespace_packages`, an `__init__.py` module was added to the directory.

[ committed by @MattToast ] [ reviewed by @ankona @al-rigazzi ]

@al-rigazzi

This PR makes the style uniform across the code base. `make check-style` now passes. [ committed by @al-rigazzi ] [ reviewed by @MattToast ]

@MattToast

Bump manifest version number to match SmartSim Dashboard [ committed by @MattToast ] [ reviewed by @AlyssaCote ]

@MattToast

Bumps the required number of nodes in the test docs from 3 to 4 as required by the tests in #381 and #426. [ committed by @MattToast ] [ reviewed by @ashao ]

@al-rigazzi

This PR updates version numbers for SmartSim (now v0.6.0) and the SmartRedis dependency (now v0.5.0). Minor changes were also made to the tutorial containers. [ committed by @al-rigazzi ] [ reviewed by @MattToast ]

codecov · 2023-12-18T18:47:53Z

Codecov Report

Merging #443 (e3aa517) into master (1f78e4c) will increase coverage by 1.79%.
The diff coverage is 96.41%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #443      +/-   ##
==========================================
+ Coverage   88.54%   90.34%   +1.79%     
==========================================
  Files          59       60       +1     
  Lines        3580     3748     +168     
==========================================
+ Hits         3170     3386     +216     
+ Misses        410      362      -48

| Files | Coverage Δ | |
|---|---|---|
| smartsim/_core/config/config.py | `98.75% <100.00%> (+0.33%)` | ⬆️ |
| smartsim/_core/control/job.py | `94.94% <100.00%> (+1.52%)` | ⬆️ |
| smartsim/_core/control/jobmanager.py | `94.19% <100.00%> (+0.90%)` | ⬆️ |
| smartsim/_core/generation/generator.py | `96.00% <100.00%> (+2.06%)` | ⬆️ |
| smartsim/_core/generation/modelwriter.py | `100.00% <100.00%> (+1.38%)` | ⬆️ |
| smartsim/_core/launcher/__init__.py | `100.00% <100.00%> (ø)` | |
| smartsim/_core/launcher/colocated.py | `97.82% <100.00%> (+3.38%)` | ⬆️ |
| smartsim/_core/launcher/launcher.py | `100.00% <100.00%> (ø)` | |
| smartsim/_core/launcher/local/local.py | `95.45% <100.00%> (+1.45%)` | ⬆️ |
| smartsim/_core/launcher/step/__init__.py | `100.00% <100.00%> (ø)` | |

... and 37 more

... and 2 files with indirect coverage changes

MattToast

LGTM pending tests!!

al-rigazzi and others added 30 commits September 15, 2023 18:53

Update to v0.5.1

4aadac9

Update version names for v0.5.1 (#370)

a332823

Send shutdown command to DB (#355)

bc7b232

Fix broken 'Install' link in README (#374)

318333e

Pin pylint version to 2.x

933e7bd

Add support for MINBATCHTIMEOUT (#387)

f00c426

Expose first_device parameter for setting models, scripts, functions (#394)

b509efd

Rename Experiment.summary() argument (#391)

63d1fa5

Multi-db address fix (#406)

ee9662d

Add contact information (#403)

9ecfbed

Pin version of types-tensorflow (#415)

96765d9

Split tests into groups for parallel execution in CI/CD runners (#417)

15ba145

Register test groups, mitigate warning (#424)

508cba3

Uniform style across codebase (#438)

93b9ab2

Manifest Version Bump (#440)

d1a5e19

MattToast and others added 2 commits December 15, 2023 17:52

Test Docs: Update Number of Required Nodes (#442)

91224d7

Release 0.6.0 (#441)

e3aa517

al-rigazzi requested review from MattToast and mellis13 December 18, 2023 18:40

MattToast approved these changes Dec 18, 2023

al-rigazzi merged commit 9d97397 into master Dec 18, 2023
51 checks passed
