[air][tune] Aim logger #30674
Thank you so much for your contribution!
I left a couple comments.
@SGevorg can you take a look to see if the aim usage looks right?
@richardliaw sure, thanks for the follow up.
Minor improvements to the first implementation
Changes are implemented.
Hi @ju2ez, thank you for the contribution, this is looking good!
I've left some comments, let me know if you need any clarification or some more guidance. I am also happy to help out if needed!
2 other things:
- We'll need to add `aim` as a Tune dependency for the CI. Add it here: https://github.com/ray-project/ray/blob/master/python/requirements/ml/requirements_tune.txt
- Another thing that we should add to increase the visibility of the Aim logger: add it to the docs here (I'm thinking right under the "How to log to Tensorboard?" section). Basically just showing the user how to add the callback and how to view logs locally with `aim up` at the experiment/repo directory.
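As a rough illustration of the docs snippet suggested above, attaching the callback might look like the following. This is only a sketch, not the PR's final code: the import path `ray.tune.logger.aim.AimLoggerCallback` and the `repo` keyword are assumptions based on the file under review in this PR, `ray.air.RunConfig` assumes a Ray 2.x release from around the time of this PR, and `run_with_aim_logging` is a hypothetical helper name.

```python
from typing import Any, Callable, Dict, Optional


def run_with_aim_logging(
    trainable: Callable[[Dict[str, Any]], Any],
    param_space: Dict[str, Any],
    repo: Optional[str] = None,
):
    """Run a Tune experiment with the Aim logger callback attached.

    Hypothetical helper: the import path and the `repo` keyword are
    assumptions, not confirmed by this PR conversation.
    """
    # Imported lazily so that defining this helper does not require
    # ray or aim to be installed.
    from ray import air, tune
    from ray.tune.logger.aim import AimLoggerCallback  # assumed location

    tuner = tune.Tuner(
        trainable,
        param_space=param_space,
        # Assumption: repo=None lets the callback choose its default repo.
        run_config=air.RunConfig(callbacks=[AimLoggerCallback(repo=repo)]),
    )
    return tuner.fit()
```

After the run finishes, running `aim up` from the directory that contains the `.aim` repo should open the Aim UI locally.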
python/ray/tune/logger/aim.py
Returns: Run
"""
run = self._run_cls(
    repo=self._repo_path,
It seems like aim logs of all trials within a Tune experiment should be written to a shared .aim repo (concurrent jobs writing to the same repo is supported).
This works if the user specifies a _repo_path manually, but we should default to having it written to the experiment directory. Otherwise it would get written to the user's original working directory, which may be confusing.
Ex:
~/ray_results/experiment_name/
    .aim/
    trial0_logdir/
    trial1_logdir/
    ...
We should default the logs to be written to the experiment directory, and we can also fill in a default experiment name:
experiment_dir = str(Path(trial.logdir).parent)
run = self._run_cls(
    repo=self._repo_path or experiment_dir,
    experiment=self._experiment or trial.experiment_dir_name,
    **self._aim_run_kwargs,
)
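The fallback logic in this suggestion can be tried in isolation. The sketch below is a standalone illustration, not code from the PR: `resolve_aim_target` is a hypothetical helper, and it uses the experiment directory's name as a stand-in for `trial.experiment_dir_name`, which is an assumption about what that attribute holds.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_aim_target(
    trial_logdir: str,
    repo_path: Optional[str] = None,
    experiment: Optional[str] = None,
) -> Tuple[str, str]:
    """Pick the Aim repo path and experiment name for one trial.

    Explicit user settings win; otherwise both default to values
    derived from the experiment directory (the trial logdir's parent).
    """
    experiment_dir = Path(trial_logdir).parent
    repo = repo_path or str(experiment_dir)
    # Assumption: the directory name stands in for trial.experiment_dir_name.
    name = experiment or experiment_dir.name
    return repo, name


# Both trials of the same experiment resolve to the same shared repo.
repo0, name0 = resolve_aim_target("/tmp/ray_results/experiment_name/trial0_logdir")
repo1, name1 = resolve_aim_target("/tmp/ray_results/experiment_name/trial1_logdir")

# A user-supplied repo path overrides the default.
custom_repo, _ = resolve_aim_target(
    "/tmp/ray_results/experiment_name/trial0_logdir", repo_path="/data/aim"
)
```

Because every trial's logdir shares the same parent, all trials land in one `.aim` repo without the user configuring anything.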
Approving PR since all comments from my side are addressed.
Hey @ju2ez, I am Hovhannes from Aim 😊. I would love to help you finalize and push the changes. 🚀
@ju2ez also we can add a flag:
Hey @tmynn, thanks for your support! I think the Best,
c1d3e3a to 7430873
@justinvyu thanks for catching up. Let's try to get it done by the end of next week!
@ju2ez please let me know if you need my help with any of these tasks.
Thanks! @tmynn Also it would be great if you could help with the tests, since I don't really know your best practices.
@ju2ez nice, I will start with the docs first. :)
There is no ubuntu20.04 image for cuda 10.2 and 10.1. Signed-off-by: Kornél Csernai <749306+csko@users.noreply.github.com> Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
…ay-project#31493) Signed-off-by: Kai Fricke <kai@anyscale.com> Release tests are currently failing with an error on file upload (botocore.exceptions.DataNotFoundError: Unable to load data for: ec2/2016-11-15/endpoint-rule-set-1). This is likely because some tests are using an anyscale push-based API to upload files. By switching to the job-based filemanager for all tests the upload issue should be mitigated. Please note that execution will still happen with SDK commands for those tests that haven't specified to use jobs for execution, so actual test execution should be unaffected. Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
…31495) Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
…oject#31468) Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
This PR is to enable lazy execution by default. See ray-project/enhancements#19 for motivation. The change includes:
* Change `Dataset` constructor: `Dataset.__init__(lazy: bool = True)`. Also remove `defer_execution` field, as it's no longer needed.
* `read_api.py:read_datasource()` returns a lazy `Dataset` with computing the first input block.
* Add `ds.fully_executed()` calls to required unit tests, to make sure they are passing.
TODO:
- [x] Fix all unit tests
- [x] ray-project#31459
- [x] ray-project#31460
- [ ] Remove the behavior to eagerly compute first block for read
- [ ] ray-project#31417
- [ ] Update documentation
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
…r events (ray-project#31489)
Signed-off-by: praveeng <praveeng@anyscale.com>
# Why are these changes needed?
Autoscaler event logs are prefixed with (scheduler) which is misleading. This PR changes the prefix to be (autoscaler)
Tested building ray locally and running an application (see attached logs). Added unit tests.
# Related issue number
Closes ray-project#24807
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
…oject#31195) This took over ray-project#27578 We add an option to warn user when Deprecated API is used. Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
moved rl_optimizer logic into rl_trainer Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
Preliminary PR for adding Python 3.11 support, mapping out various dependencies, fixing issues. ### Main changes - Upgrade cython to 0.29.32 - Add CI/CD steps for 3.11 wheel - Change cython code to not use `recursion_depth` - Update cloudpickle to latest - Use newer manylinux2014 which has python3.11 - Condition certain python packages in requirements.txt on <3.11 that don't yet have a 3.11 version ### Checklist: - cython - [x] remove deprecated `recursion_depth` - [x] exc_type cython/cython#4500 - [ ] package dependencies https://pyreadiness.org/3.11/ - [ ] llvmlite numba/llvmlite#869 - [ ] numba numba/numba#8304 - [ ] pyarrow - [ ] scikit-learn - [ ] pydantic - [x] cloudpickle - [x] upgrade cython to 0.29.32 - [ ] tensorflow tensorflow/tensorflow#58032 - [ ] torch pytorch/pytorch#86566 - [ ] miniconda conda/conda#11170 - [ ] `docker/base-deps/Dockerfile` - [x] claim to support 3.11 in setup.py - [ ] cicd - [ ] .buildkite/ - [ ] .buildkite/pipeline.build.yml - [ ] ci/ - [ ] ci/build/test-wheels.sh - [ ] ci/build/build-docker-images.py - [ ] release tests - [ ] docker/retag-lambda/python_versions.txt - [ ] download_wheels.sh - [ ] wheels - [ ] `python/build-wheel-macos.sh` - [ ] `python/build-wheel-windows.sh` - [ ] Tests - [ ] pytest ray/serve/tests - [ ] python python/ray/serve/examples/echo_full.py - [ ] bazel test //:core_worker_test - [ ] bazel test --build_tests_only //:all - [ ] //python/ray/tests:test_pydantic_serialization fastapi/fastapi#5048 - [ ] //python/ray/train:test_torch_utils - [ ] Documentation - [x] installation.rst Current status: Linux and mac wheels build in CICD. Docker images will come in a separate PR. Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
Signed-off-by: Julian <julianhatzky@googlemail.com> Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
Signed-off-by: Julian <JulianHatzky@googlemail.com> Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
hey, this PR history seems pretty messed up -- can we reopen with a new PR?
@richardliaw @tmynn Please see #32041
Co-authored-by: Justin Yu <justinvyu@anyscale.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Co-authored-by: tmynn <hovhannes.tamoyan@gmail.com> Co-authored-by: Justin Yu <justinvyu@berkeley.edu> Closes ray-project#30537 Fixes ray-project#30674 Signed-off-by: Jack He <jackhe2345@gmail.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Co-authored-by: tmynn <hovhannes.tamoyan@gmail.com> Co-authored-by: Justin Yu <justinvyu@berkeley.edu> Closes ray-project#30537 Fixes ray-project#30674 Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Co-authored-by: tmynn <hovhannes.tamoyan@gmail.com> Co-authored-by: Justin Yu <justinvyu@berkeley.edu> Closes ray-project#30537 Fixes ray-project#30674
Co-authored-by: Justin Yu <justinvyu@anyscale.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Co-authored-by: tmynn <hovhannes.tamoyan@gmail.com> Co-authored-by: Justin Yu <justinvyu@berkeley.edu> Closes ray-project#30537 Fixes ray-project#30674 Signed-off-by: elliottower <elliot@elliottower.com>
Why are these changes needed?
The default TensorBoardX logger, though very powerful, has some downsides, such as poor scalability (it is slow with many experiments). With this PR I want to offer an alternative, powerful logger in the form of the Aim logger.
Related issue number
Closes #30537
Checks
- Signed off every commit (`git commit -s`) in this PR.
- Ran `scripts/format.sh` to lint the changes in this PR.