Backlog
Johannes Klicpera edited this page Nov 22, 2021
·
78 revisions
Possible and planned features:
- Proper documentation via readthedocs
- Tests
- Define multiple (redundant) MongoDB instances to leverage our MongoDB replica set (https://pymongo.readthedocs.io/en/stable/examples/high_availability.html)
- Raise error when setting both a value and a sub-value, e.g. `a` and `a.b`, except when one of them is from a sub-config, then show the usual "special overwrites general" warning.
- Warn and confirm if user is deleting or resetting a submitted job (before cancelling it)
- Pass "Batch job submission failed" errors to user
- Detect Slurm state instead of only whether the job got killed, reflect in database (raw (last seen), seml-equivalent); potentially remove KILLED state, add "reason" field instead; remove detect-killed, make `seml status` the primary way of detecting Slurm states
- `seml pause` command. Detecting paused experiments requires parsing the REASON field. We could print the REASON for pending experiments also with `seml status`.
- SEML portable mode for publishing source code: Start local experiment directly from config (no MongoDB and Slurm, only Sacred)
- `include` for including SEML base configs (which are merged into other configs)
- Integrate with Tensorboard HParams for nicer evaluation
- Pausing experiments (hold/stop/suspend)
- Override config parameters via command line argument
- Ability to manually select a `mongodb.config` in the command
- Recommend using separate DB for each user. Maybe provide installation instructions?
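
The value/sub-value check from the list above (setting both `a` and `a.b`) could look roughly like the following. This is only a sketch, assuming parameters are available as dotted key strings; `find_prefix_conflicts` and `check_config_keys` are hypothetical helper names, not existing SEML functions, and the sub-config exception is left out.

```python
def find_prefix_conflicts(keys):
    """Return sorted (parent, child) pairs where a key and one of its sub-keys are both set."""
    key_set = set(keys)
    conflicts = []
    for key in key_set:
        parts = key.split(".")
        # Check every proper prefix of the dotted key, e.g. "a" for "a.b".
        for i in range(1, len(parts)):
            parent = ".".join(parts[:i])
            if parent in key_set:
                conflicts.append((parent, key))
    return sorted(conflicts)


def check_config_keys(keys):
    """Raise ValueError if any parameter is set both as a value and via a sub-value."""
    conflicts = find_prefix_conflicts(keys)
    if conflicts:
        described = ", ".join(f"'{p}' and '{c}'" for p, c in conflicts)
        raise ValueError(f"Conflicting parameter definitions: {described}")
```

Comparing whole dotted components rather than raw string prefixes avoids false positives such as `ab` versus `a.b`.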
Low priority:
- Job chaining via sbatch `--dependency`
- Suspend (and then restart) experiments
- Integrate with PyTorch Lightning (what would this even mean? Some convenience functions?)
- Automatic hyperparameter optimization (via Sherpa, hyperopt, Optuna?) -> parallel, on Cluster
- Make Sacred optional (makes SEML easier for beginners, and Sacred might be discontinued at some point)
- Detect local experiments that failed outside Python by using the heartbeat
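
The heartbeat idea in the last item could be sketched as below. It assumes each experiment is a dict with `_id`, `status` and `heartbeat` (a `datetime`) fields, modeled on the run entries Sacred's MongoDB observer writes; the function name `find_stale_experiments` and the 5-minute timeout are illustrative assumptions, not SEML behavior.

```python
from datetime import datetime, timedelta


def find_stale_experiments(experiments, now, timeout=timedelta(minutes=5)):
    """Return the ids of RUNNING experiments whose last heartbeat is older than `timeout`."""
    stale = []
    for exp in experiments:
        if exp.get("status") != "RUNNING":
            continue
        heartbeat = exp.get("heartbeat")
        # Without a heartbeat we cannot decide (the run may have just started), so skip it.
        if heartbeat is None:
            continue
        if now - heartbeat > timeout:
            stale.append(exp["_id"])
    return stale
```

An experiment flagged this way most likely died outside Python (for example through a crash of the process or machine), since a Python-level failure would normally have updated its status.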