v0.6.0 release #443
Merged
Updates the Python package and documentation from 0.5.0 to 0.5.1. [ committed by @MattToast ] [ reviewed by @al-rigazzi ]
Update version names for v0.5.1. Removed unintentional ignore of `docs/`. Ignore `.DS_Store`. [ committed by @MattToast ] [ reviewed by @ashao ]
Adds explicit shutdown of DB shards. Previously, DBs were terminated by simply terminating their processes, but that does not work in certain settings. [ committed by @al-rigazzi ] [ reviewed by @MattToast ]
When trying to debug tests on the GitHub CI, the output from pytest is often insufficient because some errors are written to the redirected stderr and stdout of launched applications. This PR uploads the test_output directory as an artifact of the workflow on failure. [ committed by @ashao ] [ reviewed by @al-rigazzi , @MattToast ]
The 'Install' link in the README was pointing to a page on the Cray Labs docs that has since been moved and renamed. It now points to the basic installation of SmartSim. [ committed by @MattToast ] [ reviewed by @al-rigazzi ]
Make `EntityList` a generic collection. Add an internal covariant `EntitySequence` type for internal type hints, with a note explaining why it is not truly covariant and should only be used for internal type checking. [ committed by @MattToast ] [ reviewed by @ankona ]
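A minimal sketch of the generic/covariant split described above, not the SmartSim implementation. The `Entity` and `Model` classes, the constructor signature, and the `add` method are hypothetical stand-ins. The idea illustrated is that a read-only view can be annotated with a covariant type variable, while the mutable user-facing list stays invariant.

```python
from typing import Generic, Iterator, List, Sequence, TypeVar


class Entity:
    """Hypothetical stand-in for a SmartSim entity."""

    def __init__(self, name: str) -> None:
        self.name = name


class Model(Entity):
    """Hypothetical entity subtype."""


# Invariant type variable for the mutable, user-facing collection.
T = TypeVar("T", bound=Entity)
# Covariant type variable, meant only for read-only internal annotations.
T_co = TypeVar("T_co", bound=Entity, covariant=True)


class EntitySequence(Generic[T_co]):
    """Read-only view: exposes iteration and length, no mutation."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.entities: Sequence[T_co] = []

    def __iter__(self) -> Iterator[T_co]:
        return iter(self.entities)

    def __len__(self) -> int:
        return len(self.entities)


class EntityList(EntitySequence[T]):
    """Mutable collection; a list is invariant in its element type."""

    def __init__(self, name: str) -> None:
        super().__init__(name)
        self.entities: List[T] = []

    def add(self, entity: T) -> None:
        self.entities.append(entity)


def total_name_length(seq: EntitySequence[Entity]) -> int:
    """Internal helper that only reads, so a covariant annotation is safe."""
    return sum(len(entity.name) for entity in seq)


models = EntityList[Model]("ensemble")
models.add(Model("m1"))
models.add(Model("m22"))
print(total_name_length(models))
```

Because the underlying storage in this sketch is still a mutable list, the covariance holds only as long as callers treat the `EntitySequence` annotation as read-only, which is one plausible reading of why the commit restricts it to internal type checking.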
This PR adds log files reporting the value set to parameters during generation phase. A file named smartsim_params.txt is added in the root directory of each generated entity, where the user can see what values were assigned to parameters for its execution. [ committed by @al-rigazzi ] [ reviewed by @MattToast ]
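As a rough illustration of the idea, the sketch below writes the assigned parameter values to a `smartsim_params.txt` file in an entity's directory. The function name and the `name = value` line layout are assumptions made for this example; the actual file format produced by SmartSim is not described in this PR text.

```python
from pathlib import Path
from typing import Dict, Union

ParamValue = Union[str, int, float]


def write_params_log(entity_dir: Path, params: Dict[str, ParamValue]) -> Path:
    """Write one 'name = value' line per parameter (hypothetical layout)."""
    log_path = entity_dir / "smartsim_params.txt"
    lines = [f"{name} = {value}" for name, value in sorted(params.items())]
    log_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return log_path
```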
Pin the pylint version, since 3.0.0 has some bugs. [ committed by @al-rigazzi ] [ reviewed by @MattToast ]
Drops support for RedisAI 1.2.5. The only supported RedisAI version is now 1.2.7. Since the officially released RedisAI 1.2.7 has a bug which breaks the build process on Mac OSX, it was decided to use commit RedisAI/RedisAI@634916c from RedisAI's GitHub repository, where the bug has been fixed. This applies to all operating systems. [ committed by @al-rigazzi ] [ reviewed by @MattToast @mellis13 ]
This PR improves the way colocated DBs are launched. An erroneous warning has been suppressed, and colocated launcher script names now contain the name of the entity being launched, which avoids race conditions. Colocated launcher scripts have been moved to a hidden directory named `.smartsim` to reduce clutter in user directories. [ committed by @al-rigazzi ] [ reviewed by @MattToast @mellis13 @ashao @ankona ]
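The naming scheme can be sketched as follows. The helper name and the exact file name pattern are hypothetical; only the `.smartsim` directory name and the use of the entity name come from the description above. Embedding the entity name gives each entity its own script path, so two entities sharing a directory would not overwrite each other's script.

```python
from pathlib import Path


def colocated_script_path(base_dir: Path, entity_name: str) -> Path:
    """Return a per-entity launcher script path inside a hidden directory."""
    hidden_dir = base_dir / ".smartsim"
    hidden_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical pattern: the entity name makes the file name unique.
    return hidden_dir / f"colocated_launcher_{entity_name}.sh"
```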
A general refactor of the DBNode class to:
- Remove the duplicate methods with near identical functionality where one was intended to be called if the underlying RunSettings were standard and the other if they were MPMD. This removes undefined behavior when the "wrong" method was called.
- Allow MPMD DBNodes to map information "per shard" by scraping output files for a serializable LaunchedShardData class.
[ committed by @MattToast ] [ reviewed by @ashao ]
Add support for MINBATCHTIMEOUT [ committed by @billschereriii ] [ reviewed by @ashao @al-rigazzi ]
Enables experiments to have multiple databases running. Users pass a database identifier to the methods create_database, colocate_db_tcp and colocate_db_uds. SmartSim then sets the environment variables SSDB and SR_DB_TYPE with the identifier as a suffix. This PR requires an up-to-date SmartRedis. [ committed by @julia.putko ] [ reviewed by @ashao @ankona @al-rigazzi @billschereriii ]
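The "scrape output files for a serializable class" approach might look roughly like the sketch below. Everything here is an assumption for illustration: the `ShardInfo` dataclass stands in for `LaunchedShardData` (whose real fields are not listed in this PR text), and the `SHARD_INFO:` marker and JSON encoding are invented for the example.

```python
import json
import re
from dataclasses import asdict, dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ShardInfo:
    """Hypothetical stand-in for a serializable per-shard record."""

    name: str
    hostname: str
    port: int


# Hypothetical marker that a launched shard would print to its output file.
_MARKER = re.compile(r"^SHARD_INFO: (\{.*\})$")


def to_marker_line(info: ShardInfo) -> str:
    """Serialize a record as a single line that is easy to find later."""
    return "SHARD_INFO: " + json.dumps(asdict(info))


def parse_shard_info(lines: Iterable[str]) -> List[ShardInfo]:
    """Collect every serialized record found among unrelated output lines."""
    shards = []
    for line in lines:
        match = _MARKER.match(line.strip())
        if match:
            shards.append(ShardInfo(**json.loads(match.group(1))))
    return shards
```

With one record per shard in the output, an MPMD launch that starts several shards from one command can still be mapped shard by shard.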
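A small sketch of how identifier-suffixed variables could be assembled. The helper name, the underscore separator, and the example values are assumptions for illustration; the commit text only states that the identifier is appended as a suffix to SSDB and SR_DB_TYPE.

```python
from typing import Dict, Optional


def db_env_vars(
    address: str, db_type: str, db_identifier: Optional[str] = None
) -> Dict[str, str]:
    """Build the environment variables for one database.

    Assumes the identifier is joined with an underscore; with no
    identifier the plain variable names are used.
    """
    suffix = f"_{db_identifier}" if db_identifier else ""
    return {f"SSDB{suffix}": address, f"SR_DB_TYPE{suffix}": db_type}


# Two databases in one experiment get distinct variable names.
env = {}
env.update(db_env_vars("10.0.0.1:6379", "Standalone", "checkpoint_db"))
env.update(db_env_vars("10.0.0.2:6379", "Standalone", "inference_db"))
print(sorted(env))
```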
…394) Expose first_device parameter for setting models, scripts, functions [ committed by @billschereriii ] [ reviewed by @al-rigazzi @MattToast ]
Changes `format` argument of `Experiment.summary()` to `style`, to avoid redefinition of builtin `format`. [ committed by @al-rigazzi ] [ reviewed by @mellis13 ]
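To illustrate why the rename helps, the sketch below uses two hypothetical functions (neither is SmartSim code). When a parameter is called `format`, the builtin `format()` is shadowed inside the function body; naming the parameter `style` keeps the builtin usable.

```python
from typing import Iterable


def summary_shadowed(values: Iterable[float], format: str = "plain") -> str:
    """Parameter named 'format' hides the builtin inside this function."""
    # Here 'format' is the string argument, so calling it raises TypeError.
    return " ".join(format(value, ".2f") for value in values)


def summary_with_style(values: Iterable[float], style: str = "plain") -> str:
    """Parameter named 'style' leaves the builtin format() available."""
    separator = " | " if style == "github" else " "
    return separator.join(format(value, ".2f") for value in values)


print(summary_with_style([1, 2.5], style="github"))
```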
Adds an option to specify an affinity script for `PALSMpiexecSettings`. The script is added to the run command after the run arguments, with no "--" prefix, and optional arguments can be passed after the affinity script. [ committed by @rickybalin ] [ reviewed by @al-rigazzi ]
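The resulting command ordering can be sketched with a hypothetical builder function. This is not the `PALSMpiexecSettings` API; the function name and the way run arguments are rendered are assumptions. It only shows the placement described above: run arguments first, then the affinity script and its optional arguments without a "--" prefix, then the executable.

```python
from typing import Dict, List, Optional, Sequence


def build_mpiexec_command(
    exe: str,
    exe_args: Sequence[str],
    run_args: Dict[str, Optional[object]],
    affinity_script: Optional[Sequence[str]] = None,
) -> List[str]:
    """Assemble a command list in the order described in the commit."""
    cmd = ["mpiexec"]
    for name, value in run_args.items():
        cmd.append(f"--{name}")  # run arguments keep their "--" prefix
        if value is not None:
            cmd.append(str(value))
    if affinity_script:
        # Script path followed by its optional arguments, no prefix added.
        cmd.extend(affinity_script)
    cmd.append(exe)
    cmd.extend(exe_args)
    return cmd
```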
Fixes a bug which resulted in multiple DB addresses being mixed up in the Job Manager. [ committed by @al-rigazzi @amandarichardsonn ] [ reviewed by @MattToast ]
An email address was added to documentation footer and to README.md. The Slack link was also added to the documentation footer.
…ate_database() documentation (#408) Fixed typo in Experiment.create_model() documentation. Added db_identifier to Experiment.create_database() documentation. [ committed by @juliaputko ] [ reviewed by @al-rigazzi ]
The version of `types-tensorflow` was pinned to be compatible with the features of TensorFlow used by SmartSim. [ committed by @al-rigazzi ] [ reviewed by @ankona @MattToast ]
Split tests into groups for parallel execution in CI/CD runners [ committed by @billschereriii ] [ reviewed by @al-rigazzi @ankona ]
Mitigate pytest warnings due to unregistered test groups [ committed by @ankona ] [ reviewed by @billschereriii ]
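One common way to silence warnings about unregistered marks is to register the group markers in a `conftest.py` hook, sketched below. The group names and descriptions are assumptions for illustration; `pytest_configure` and `config.addinivalue_line("markers", ...)` are standard pytest hooks, but this PR text does not say which mechanism SmartSim actually uses.

```python
# conftest.py (sketch)

# Hypothetical group names; each group can run in a separate CI job.
TEST_GROUPS = {
    "group_a": "tests assigned to parallel runner A",
    "group_b": "tests assigned to parallel runner B",
}


def pytest_configure(config):
    """Register each group marker so pytest does not warn about it."""
    for name, description in TEST_GROUPS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

A test would then be tagged with `@pytest.mark.group_a`, and a runner would select its share with `pytest -m group_a`.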
Add support for producing & consuming telemetry outputs.
- Adds telemetry monitor to check for updates and produce events for the dashboard
- Updates controller to conditionally start telemetry monitor
- Updates controller to produce a runtime manifest to trigger telemetry collection
- Adds indirect proxy to produce events for the dashboard for unmanaged tasks
- Adds CLI capability to launch dashboard
[ committed by @ankona, @MattToast, @AlyssaCote ] [ reviewed by @al-rigazzi, @ashao ]
---------
Co-authored-by: Matt Drozt <drozt@hpe.com>
Co-authored-by: Alyssa Cote <46540273+AlyssaCote@users.noreply.github.com>
A number of defects were found in the testing that primarily affected the tests run on HPC platforms. Many of these were uncovered during the recent introductions of major features that touched on various aspects of the testing suite. Most fixes are focused primarily on changes to the actual tests as opposed to fundamental changes in the underlying code base. Some of the major changes are:
- Most tests are run in their own experiment directory, avoiding the overwrite of directories between tests
- When attempting to run multi-GPU tests, a bug (presumably in RedisAI) was found that prevents the setting of multiple GPUs when using the TensorFlow backend. These tests now only use a single GPU regardless of the value of `SMARTSIM_TEST_NUM_GPUS`
- Ensures that tests that spin up an `Orchestrator` always stop it before exiting, either due to success or failure of a different component of the test
Lastly, changes were also made to `QsubBatchSettings` to add support for PBS-like platforms that use the `resources` tag to define additional resources and/or otherwise customize PBS batch jobs.
[ committed by @al-rigazzi and @ashao ] [ reviewed by @MattToast ]
Co-authored-by: Alessandro Rigazzi <al.rigazzi@hpe.com>
Co-authored-by: Andrew Shao <andrew.shao@hpe.com>
SmartSim multi Orchestrator Example (#409) This PR merges in an example demonstrating setting up multiple Orchestrators and connecting to the databases from within an application and the driver script. The example guides the user through the newly released multi Orchestrator functionality. [ committed by @amandarichardsonn ] [ reviewed by @ashao @billschereriii @juliaputko ]
This PR adds a section named ML Features to the documentation. The section contains multiple examples of how ML models can be uploaded to and executed on the database. The PR also adds a section to the Online Analysis tutorial. The section uses TorchScript functions to post-process the simulation data. [ committed by @al-rigazzi ] [ reviewed by @amandarichardsonn @mellis13 ]
This PR makes `sacct` and `sstat` errors result in an exception when running Slurm-based workflows. Previously, the errors were ignored and this could result in SmartSim's state becoming inconsistent and unstable. [ committed by @al-rigazzi ] [ reviewed by @MattToast ]
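The change in behavior can be sketched as a fail-fast check on the result of a scheduler query. The exception class and function below are hypothetical and not SmartSim's code; they only illustrate raising on a nonzero exit code instead of silently continuing with missing status data.

```python
class SchedulerQueryError(RuntimeError):
    """Hypothetical error raised when a scheduler query command fails."""


def check_query_result(command: str, returncode: int, stdout: str, stderr: str) -> str:
    """Return stdout on success; raise instead of ignoring a failure."""
    if returncode != 0:
        raise SchedulerQueryError(
            f"{command} failed with exit code {returncode}: {stderr.strip()}"
        )
    return stdout
```

Surfacing the failure at the point of the query keeps a caller from recording a stale or empty job status.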
Fixes some conflicting directives in the SmartSim packaging instructions:
- `setup.py` manually listed an incomplete list of packages to include in SmartSim while the `setup.cfg` was using `find_packages`. This commit defaults to `setup.cfg` with a slightly refined `include` directive.
- `setup.py` manually listed `package_data` directives while the `setup.cfg` set `include_package_data=True`. This commit keeps both strategies but lists both in the `setup.cfg`.
- In order to exclude the `__pycache__` created from SmartSim modules imported during the `setup.py` script, `__pycache__` and related files were explicitly ignored in the `MANIFEST.in`.
- In order to ensure that the `smartsim._core.launcher.local` package was found by `find_packages` without reverting to `find_namespace_packages`, an `__init__.py` module was added to the directory.
[ committed by @MattToast ] [ reviewed by @ankona @al-rigazzi ]
This PR makes the style uniform across the code base. `make check-style` now passes. [ committed by @al-rigazzi ] [ reviewed by @MattToast ]
Bump manifest version number to match SmartSim Dashboard [ committed by @MattToast ] [ reviewed by @AlyssaCote ]
Bumps the required number of nodes in the test docs from 3 to 4 as required by the tests in #381 and #426. [ committed by @MattToast ] [ reviewed by @ashao ]
This PR updates version numbers for SmartSim (now v0.6.0) and the SmartRedis dependency (now v0.5.0). Minor changes to tutorial containers were also added. [ committed by @al-rigazzi ] [ reviewed by @MattToast ]
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #443      +/-   ##
==========================================
+ Coverage   88.54%   90.34%   +1.79%
==========================================
  Files          59       60       +1
  Lines        3580     3748     +168
==========================================
+ Hits         3170     3386     +216
+ Misses        410      362      -48
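The figures in the coverage report can be cross-checked with a few lines of arithmetic. This sketch assumes coverage is hits divided by lines and that the displayed percentages are truncated to two decimals rather than rounded (an assumption inferred from the numbers: 3170 / 3580 is about 88.547%, which would round to 88.55 but is shown as 88.54).

```python
import math


def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage of tracked lines."""
    return 100.0 * hits / lines


def truncate2(value: float) -> float:
    """Drop everything after the second decimal place (no rounding)."""
    return math.floor(value * 100) / 100


before = coverage_percent(3170, 3580)  # master
after = coverage_percent(3386, 3748)   # this PR

print(truncate2(before), truncate2(after), truncate2(after - before))
```

The hit and miss counts are also consistent with the line totals: 3170 + 410 = 3580 and 3386 + 362 = 3748.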
MattToast
approved these changes
Dec 18, 2023
LGTM pending tests!!
Bring master up to date with develop for v0.6.0 release.