
v0.6.0 release #443

Merged
merged 32 commits into from
Dec 18, 2023

Conversation

al-rigazzi
Collaborator

Bring master up to date with develop for v0.6.0 release.

al-rigazzi and others added 30 commits September 15, 2023 18:53
Updates the python package and documentation from 0.5.0 to 0.5.1.

[ committed by @MattToast   ]
[ reviewed by @al-rigazzi  ]
Update version names for v0.5.1.
Removed unintentional ignore of `docs/`. Ignore `.DS_Store`

[ committed by @MattToast ]
[ reviewed by @ashao ]
Adds explicit shutdown of DB shards. Previously, DBs
were terminated by simply terminating their processes,
but that does not work in certain settings.

[ committed by @al-rigazzi ]
[ reviewed by @MattToast ]
When trying to debug tests on the GitHub CI, the output from
pytest is often insufficient because some errors are written to
the redirected stderr and stdout of launched applications. This
PR uploads the test_output directory as an artifact of the
workflow on failure.

[ committed by @ashao ]
[ reviewed by @al-rigazzi , @MattToast ]
The 'Install' link in README was pointing to a since moved and
renamed page on the Cray Labs docs. It has been updated to now
point to the basic installation of SmartSim.

[ committed by @MattToast ]
[ reviewed by @al-rigazzi ]
Make `EntityList` a generic collection. Add an internal covariant
`EntitySequence` type for internal type hints, with a note explaining
why it is not truly covariant and should only be used for internal
type checking.

[ committed by @MattToast ]
[ reviewed by @ankona ]
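
The generic `EntityList` and covariant `EntitySequence` idea described above can be sketched as follows. This is a minimal illustration, not SmartSim's actual implementation: the `Entity` and `Model` classes, the `add` method, and the `names` helper are hypothetical stand-ins. One plausible reading of "not truly covariant" is that the covariant view and the mutable list are the same object, so the covariant annotation is only sound while callers restrict themselves to reading.

```python
from typing import Generic, Iterator, List, Sequence, TypeVar


class Entity:
    """Hypothetical base entity; stands in for whatever the list holds."""

    def __init__(self, name: str) -> None:
        self.name = name


class Model(Entity):
    """Hypothetical entity subclass."""


T = TypeVar("T", bound=Entity)
T_co = TypeVar("T_co", bound=Entity, covariant=True)


class EntitySequence(Generic[T_co]):
    """Read-oriented view, declared covariant for type hints only."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.entities: Sequence[T_co] = []

    def __iter__(self) -> Iterator[T_co]:
        return iter(self.entities)

    def __len__(self) -> int:
        return len(self.entities)


class EntityList(EntitySequence[T]):
    """Generic, mutable collection; invariant in T because it can be written to."""

    def add(self, entity: T) -> None:
        self.entities = [*self.entities, entity]


def names(seq: EntitySequence[Entity]) -> List[str]:
    # Reading through the covariant view is safe for any Entity subclass.
    return [entity.name for entity in seq]


models: EntityList[Model] = EntityList("models")
models.add(Model("m1"))
models.add(Model("m2"))
print(names(models))
```

A type checker accepts `EntityList[Model]` where `EntitySequence[Entity]` is expected because of the covariant type variable. Nothing stops other code from mutating the same object, which is why such a type is best kept to internal type hints.
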
This PR adds log files that report the values assigned to parameters during the generation phase.
A file named smartsim_params.txt is added to the root directory of each generated entity,
where the user can see which values were assigned to parameters for its execution.

[ committed by @al-rigazzi ]
[ reviewed by @MattToast ]
Pin pylint version as 3.0.0 has some bugs.

[ committed by @al-rigazzi ]
[ reviewed by @MattToast ]
Drops support for RedisAI 1.2.5. The only supported RedisAI version
is now 1.2.7. Since the officially released RedisAI 1.2.7 has a
bug which breaks the build process on Mac OSX, it was decided to
use commit RedisAI/RedisAI@634916c
from RedisAI's GitHub repository, in which the
bug has been fixed. This applies to all operating systems.

[ committed by @al-rigazzi ]
[ reviewed by @MattToast @mellis13 ]
This PR improves the way colocated DBs are launched. An erroneous
warning has now been suppressed, and colocated launcher script names
now contain the name of the entity being launched, which avoids race conditions.
Colocated launcher scripts have been moved to a hidden directory
named `.smartsim`, to reduce clutter in user directories.

[ committed by @al-rigazzi ]
[ reviewed by @MattToast @mellis13 @ashao @ankona ]
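
As a rough sketch of why per-entity script names avoid races: if every entity writes its launcher script to a path derived from its own name, two entities in the same directory never write to the same file. The helper name `colocated_script_path` and the file name pattern are assumptions made for this example. Only the `.smartsim` directory name comes from the description above.

```python
import tempfile
from pathlib import Path


def colocated_script_path(base_dir: Path, entity_name: str) -> Path:
    """Return a launcher script path that is unique per entity.

    The file name pattern is hypothetical; only `.smartsim` is from the PR text.
    """
    hidden_dir = base_dir / ".smartsim"
    hidden_dir.mkdir(parents=True, exist_ok=True)
    return hidden_dir / f"colocated_launcher_{entity_name}.sh"


with tempfile.TemporaryDirectory() as tmp:
    first = colocated_script_path(Path(tmp), "model_0")
    second = colocated_script_path(Path(tmp), "model_1")
    # Distinct entities get distinct files inside the same hidden directory.
    print(first.name, second.name, first.parent.name)
```
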
A general refactor of the DBNode class to:
- Remove the duplicate methods with near identical functionality where one was intended to be called if the underlying RunSettings were standard and the other if they were MPMD. This removes undefined behavior when the "wrong" method was called
- Allow MPMD DBNodes to map information "per shard" by scraping output files for a serializable LaunchedShardData class.

[ committed by @MattToast ]
[ reviewed by @ashao ]
Add support for MINBATCHTIMEOUT
[ committed by @billschereriii ]
[ reviewed by @ashao @al-rigazzi]
Enables experiments to have multiple databases running. This involves
users passing a database identifier to methods create_database,
colocate_db_tcp and colocate_db_uds. SmartSim will set environment
variables SSDB and SR_DB_TYPE with the identifier as a suffix. This PR
requires an up to date SmartRedis.

[ committed by @juliaputko ]
[reviewed by @ashao @ankona @al-rigazzi @billschereriii ]
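
The suffixing scheme described above can be illustrated with a small helper. This is a sketch under stated assumptions: the function name `db_env_vars` is hypothetical, and the underscore separator between the variable name and the identifier is assumed, since the description only says the identifier is used "as a suffix".

```python
from typing import Dict


def db_env_vars(address: str, db_type: str, db_identifier: str = "") -> Dict[str, str]:
    """Build the SSDB / SR_DB_TYPE variables for one database.

    Assumption: the identifier is appended after an underscore. With no
    identifier, the plain variable names are used.
    """
    suffix = f"_{db_identifier}" if db_identifier else ""
    return {
        f"SSDB{suffix}": address,
        f"SR_DB_TYPE{suffix}": db_type,
    }


env: Dict[str, str] = {}
env.update(db_env_vars("10.0.0.1:6379", "Clustered", "db_one"))
env.update(db_env_vars("10.0.0.2:6380", "Standalone", "db_two"))
print(sorted(env))
```

Because each database gets its own pair of variables, a client can select one database by identifier without the addresses overwriting each other.
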
…394)

Expose first_device parameter for setting models, scripts, functions
[ committed by @billschereriii ]
[ reviewed by @al-rigazzi @MattToast ]
Changes `format` argument of `Experiment.summary()` to `style`,
to avoid redefinition of builtin `format`.

[ committed by @al-rigazzi ]
[ reviewed by @mellis13  ]
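
To show why an argument named `format` is worth renaming, here is a toy comparison. Both functions are hypothetical and are not `Experiment.summary()` itself. The `"github"` default is only a placeholder value.

```python
def summary_old(format: str = "github") -> str:
    # The parameter shadows the builtin, so calling format(...) here
    # would try to call a string and raise TypeError.
    return format(1.5, ".1f")  # type: ignore[operator]


def summary(style: str = "github") -> str:
    # With the parameter renamed, the builtin format() is still usable.
    sample = format(3.14159, ".2f")
    return f"style={style}, sample={sample}"


print(summary("github"))
try:
    summary_old("github")
except TypeError as err:
    print("shadowed builtin:", err)
```
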
Adds an option to specify an affinity script for `PALSMpiexecSettings`.
The script is added to the run command after the run arguments, with no "--" prefix.
Optional arguments can be passed after the affinity script.

[ committed by @rickybalin ]
[ reviewed by @al-rigazzi ]
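
The argument ordering described above can be sketched as a plain command builder. This is not the `PALSMpiexecSettings` API: the function `build_mpiexec_cmd`, its parameters, and the example flag and script names are all assumptions chosen for illustration.

```python
from typing import Dict, List, Optional, Sequence


def build_mpiexec_cmd(
    run_args: Dict[str, object],
    exe: str,
    exe_args: Sequence[str] = (),
    affinity_script: Optional[str] = None,
    affinity_args: Sequence[str] = (),
) -> List[str]:
    """Assemble a command line in the order the PR text describes."""
    cmd = ["mpiexec"]
    # Run arguments come first, each with a "--" prefix.
    for key, value in run_args.items():
        cmd += [f"--{key}", str(value)]
    # The affinity script follows the run arguments with no "--" prefix,
    # and its optional arguments come right after it.
    if affinity_script is not None:
        cmd += [affinity_script, *affinity_args]
    # The executable and its own arguments come last.
    cmd += [exe, *exe_args]
    return cmd


print(build_mpiexec_cmd({"np": 4, "ppn": 2}, "./sim", ["input.yaml"], "./set_affinity.sh", ["2"]))
```
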
Fixes a bug which resulted in multiple DB addresses being mixed up in the Job Manager.

[ committed by @al-rigazzi @amandarichardsonn ]
[ reviewed by @MattToast ]
An email address was added to the documentation footer and to README.md.
The Slack link was also added to the documentation footer.
…ate_database() documentation (#408)

Fixed typo in Experiment.create_model() documentation. Added db_identifier to Experiment.create_database() documentation. 

[ committed by @juliaputko ]
[ reviewed by @al-rigazzi ]
The version of `types-tensorflow` was pinned to be compatible with the features of TensorFlow used by SmartSim.

[ committed by @al-rigazzi ]
[ reviewed by @ankona @MattToast ]
Split tests into groups for parallel execution in CI/CD runners

[ committed by @billschereriii ]
[ reviewed by @al-rigazzi @ankona ]
Mitigate pytest warnings due to unregistered test groups

[ committed by @ankona ]
[ reviewed by @billschereriii ]
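
A common way to silence pytest's unknown-marker warnings for test groups is to register the markers in `conftest.py`. The sketch below assumes that approach. The group names `group_a` and `group_b` are hypothetical and the actual change may register its groups differently (for example in a configuration file).

```python
# conftest.py (sketch)
TEST_GROUPS = ("group_a", "group_b")  # hypothetical group names


def pytest_configure(config):
    """Register one marker per test group so pytest recognizes it."""
    for group in TEST_GROUPS:
        config.addinivalue_line(
            "markers", f"{group}: tests assigned to CI runner {group}"
        )
```

With markers registered, a CI job can select one group with `pytest -m group_a`, and several jobs can run the groups in parallel.
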
Add support for producing & consuming telemetry outputs.

- Adds telemetry monitor to check for updates and produce events for the dashboard
- Updates controller to conditionally start telemetry monitor
- Updates controller to produce a runtime manifest to trigger telemetry collection
- Adds indirect proxy to produce events for the dashboard for unmanaged tasks
- Adds CLI capability to launch dashboard

[ committed by @ankona, @MattToast, @AlyssaCote ]
[ reviewed by @al-rigazzi, @ashao ]

---------

Co-authored-by: Matt Drozt <drozt@hpe.com>
Co-authored-by: Alyssa Cote <46540273+AlyssaCote@users.noreply.github.com>
A number of defects were found in the testing that primarily affected
the tests run on HPC platforms. Many of these were uncovered during
the recent introductions of major features that touched on various
aspects of the testing suite. Most fixes are focused primarily on changes
to the actual tests as opposed to fundamental changes in the underlying
code base. Some of the major changes are:
- Most tests are run in their own experiment directory avoiding the
  overwrite of directories between tests
- When attempting to run multi-gpu tests, a bug (presumably in RedisAI)
   was found that prevents the setting of multiple GPUs when using the
   TensorFlow backend. These tests now only use a single GPU regardless
   of the value of `SMARTSIM_TEST_NUM_GPUS`
- Ensures that tests that spin up an `Orchestrator` always stop it before
   exiting, either due to success or failure of a different component of the
   test

Lastly, changes were also made to `QsubBatchSettings` to add support
for PBS-like platforms that use the `resources` tag to define additional
resources and/or otherwise customize PBS batch jobs

[ committed by @al-rigazzi and @ashao ]
[ reviewed by: @MattToast ]

Co-authored-by: Alessandro Rigazzi <al.rigazzi@hpe.com>
Co-authored-by: Andrew Shao <andrew.shao@hpe.com>
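
The guarantee that a test always stops its `Orchestrator`, whether the test passes or another component fails, is typically expressed with `try`/`finally`. The sketch below uses a hypothetical `FakeOrchestrator` stand-in and a hypothetical `run_with_db` helper so that it runs without SmartSim. It is not the code from the test suite.

```python
class FakeOrchestrator:
    """Hypothetical stand-in for a database that must be shut down."""

    def __init__(self) -> None:
        self.running = False

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False


def run_with_db(db: FakeOrchestrator, body):
    """Start the database, run the test body, and always stop the database."""
    db.start()
    try:
        return body(db)
    finally:
        # Runs on success and on failure, so no database is left behind.
        db.stop()


def failing_body(db: FakeOrchestrator) -> None:
    raise RuntimeError("a different component of the test failed")


db = FakeOrchestrator()
try:
    run_with_db(db, failing_body)
except RuntimeError:
    pass
print("still running:", db.running)
```
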
SmartSim multi Orchestrator Example (#409)

This PR merges in an example demonstrating setting up
multiple Orchestrators and connecting to the databases
from within an application and the driver script. The example
guides the user through the newly released multi Orchestrator
functionality.

[ committed by @amandarichardsonn ]
[ reviewed by @ashao @billschereriii @juliaputko  ]
This PR adds a section named ML Features to the documentation. The section contains multiple examples of how ML models can be uploaded to and executed on the database. The PR also adds a section to the Online Analysis tutorial. The section uses TorchScript functions to post-process the simulation data.

[ committed by @al-rigazzi ]
[ reviewed by @amandarichardsonn @mellis13 ]
This PR makes `sacct` and `sstat` errors result in an exception when running Slurm-based workflows. Previously, the errors were ignored and this could result in SmartSim's state becoming inconsistent and unstable.

[ committed by @al-rigazzi ]
[ reviewed by @MattToast ]
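
The behavior change for `sacct` and `sstat` amounts to turning a failed command into an exception instead of silently continuing. A minimal sketch, assuming a hypothetical `SlurmQueryError` exception and `check_slurm_result` helper (neither name is taken from the PR):

```python
class SlurmQueryError(RuntimeError):
    """Hypothetical error raised when a Slurm query command fails."""


def check_slurm_result(command: str, returncode: int, stdout: str, stderr: str) -> str:
    """Return the command output, or raise if the command reported an error."""
    if returncode != 0:
        # Failing loudly keeps job state from being derived from bad output.
        raise SlurmQueryError(
            f"{command} failed with exit code {returncode}: {stderr.strip()}"
        )
    return stdout


print(check_slurm_result("sacct", 0, "123|COMPLETED\n", ""))
```
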
Fixes some conflicting directives in the SmartSim packaging instructions:

- `setup.py` manually listed an incomplete list of packages to include in SmartSim while the
   `setup.cfg` was using `find_packages`. This commit defaults to `setup.cfg` with a slightly
   refined `include` directive.
- `setup.py` manually listed `package_data` directives while the `setup.cfg` set
   `include_package_data=True`. This commit keeps both strategies but lists both in the
    `setup.cfg`.
- In order to exclude the `__pycache__` created from SmartSim modules imported during the
   `setup.py` script, `__pycache__` and related files were explicitly ignored in the `MANIFEST.in`.
- In order to ensure that the `smartsim._core.launcher.local` package was found by `find_packages`
   without reverting to `find_namespace_packages`, an `__init__.py` module was added to the directory.

[ committed by @MattToast ]
[ reviewed by @ankona @al-rigazzi ]
This PR makes the style uniform across the code base. `make check-style` now passes.

[ committed by @al-rigazzi ]
[ reviewed by @MattToast ]
Bump manifest version number to match SmartSim Dashboard

[ committed by @MattToast ]
[ reviewed by @AlyssaCote ]
MattToast and others added 2 commits December 15, 2023 17:52
Bumps the required number of nodes in the test docs from 3 to 4 as required by the tests in #381 and #426.

[ committed by @MattToast ]
[ reviewed by @ashao ]
This PR updates version numbers for SmartSim (now v0.6.0) and the SmartRedis dependency (now v0.5.0). Minor changes to tutorial containers were also added.

[ committed by @al-rigazzi ]
[ reviewed by @MattToast ]

codecov bot commented Dec 18, 2023

Codecov Report

Merging #443 (e3aa517) into master (1f78e4c) will increase coverage by 1.79%.
The diff coverage is 96.41%.

Additional details and impacted files


@@            Coverage Diff             @@
##           master     #443      +/-   ##
==========================================
+ Coverage   88.54%   90.34%   +1.79%     
==========================================
  Files          59       60       +1     
  Lines        3580     3748     +168     
==========================================
+ Hits         3170     3386     +216     
+ Misses        410      362      -48     
Files Coverage Δ
smartsim/_core/config/config.py 98.75% <100.00%> (+0.33%) ⬆️
smartsim/_core/control/job.py 94.94% <100.00%> (+1.52%) ⬆️
smartsim/_core/control/jobmanager.py 94.19% <100.00%> (+0.90%) ⬆️
smartsim/_core/generation/generator.py 96.00% <100.00%> (+2.06%) ⬆️
smartsim/_core/generation/modelwriter.py 100.00% <100.00%> (+1.38%) ⬆️
smartsim/_core/launcher/__init__.py 100.00% <100.00%> (ø)
smartsim/_core/launcher/colocated.py 97.82% <100.00%> (+3.38%) ⬆️
smartsim/_core/launcher/launcher.py 100.00% <100.00%> (ø)
smartsim/_core/launcher/local/local.py 95.45% <100.00%> (+1.45%) ⬆️
smartsim/_core/launcher/step/__init__.py 100.00% <100.00%> (ø)
... and 37 more

... and 2 files with indirect coverage changes

Member

@MattToast MattToast left a comment

LGTM pending tests!!

@al-rigazzi al-rigazzi merged commit 9d97397 into master Dec 18, 2023
51 checks passed

8 participants