Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[air][tune] Aim logger #30674

Closed
wants to merge 422 commits into from
Closed

[air][tune] Aim logger #30674

wants to merge 422 commits into from

Conversation

ju2ez
Copy link
Contributor

@ju2ez ju2ez commented Nov 26, 2022

Why are these changes needed?

The default tensorboardx logger, though very powerful has some downsides to it like poor scalability (slow with many experiments). With this PR I want to give the option for an alternative (powerful) logger in form of the aim logger.

Related issue number

Closes #30537

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • proper unit testing of the written aim metrics
    • Release tests

Copy link
Contributor

@richardliaw richardliaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for your contribution!

I left a couple comments.

python/ray/tune/logger/aim.py Outdated Show resolved Hide resolved
python/ray/tune/logger/aim.py Outdated Show resolved Hide resolved
python/ray/tune/logger/aim.py Outdated Show resolved Hide resolved
python/ray/tune/logger/aim.py Outdated Show resolved Hide resolved
python/ray/tune/tests/test_logger.py Outdated Show resolved Hide resolved
@richardliaw
Copy link
Contributor

@SGevorg can you take a look to see if the aim usage looks right?

@richardliaw richardliaw changed the title Aim logger [air][tune] Aim logger Dec 14, 2022
@SGevorg
Copy link

SGevorg commented Dec 14, 2022

@richardliaw sure, thanks for the follow up.
Also adding @gorarakelyan and @alberttorosyan to this thread to take it away from here.

Copy link
Contributor Author

@ju2ez ju2ez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor improvements to the first implementation

python/ray/tune/logger/aim.py Outdated Show resolved Hide resolved
python/ray/tune/logger/aim.py Outdated Show resolved Hide resolved
Copy link
Contributor Author

@ju2ez ju2ez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes are implemented.

Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ju2ez, thank you for the contribution, this is looking good!

I've left some comments, let me know if you need any clarification or some more guidance. I am also happy to help out if needed!

2 other things:

  1. We'll need to add aim as a Tune dependency for the CI. Add it here: https://github.com/ray-project/ray/blob/master/python/requirements/ml/requirements_tune.txt
  2. Another thing that we should add to increase the visibility of the Aim logger: add it to the docs here (I'm thinking right under the How to log to Tensorboard? section). Basically just showing the user how to add the callback and how to view logs locally with aim up at the experiment/repo directory.

python/ray/tune/logger/aim.py Outdated Show resolved Hide resolved
python/ray/tune/logger/aim.py Outdated Show resolved Hide resolved
python/ray/tune/logger/aim.py Outdated Show resolved Hide resolved
python/ray/tune/logger/aim.py Outdated Show resolved Hide resolved
python/ray/tune/logger/aim.py Outdated Show resolved Hide resolved
python/ray/tune/logger/aim.py Outdated Show resolved Hide resolved
Returns: Run
"""
run = self._run_cls(
repo=self._repo_path,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like aim logs of all trials within a Tune experiment should be written to a shared .aim repo (concurrent jobs writing to the same repo is supported).

This works if the user specifies a _repo_path manually, but we should default to having it written to the experiment directory. Otherwise it would get written to the user's original working directory, which may be confusing.

Ex:

~/ray_results/experiment_name/
    .aim/
    trial0_logdir/
    trial1_logdir/
    ...

We should default the logs to be written to the experiment directory, and we can also fill in a default experiment name:

    experiment_dir = str(Path(trial.logdir).parent)
    run = self._run_cls(
            repo=self._repo_path or experiment_dir,
            experiment=self._experiment or trial.experiment_dir_name,
            **self._aim_run_kwargs,
    )

python/ray/tune/logger/aim.py Show resolved Hide resolved
python/ray/tune/tests/test_logger.py Outdated Show resolved Hide resolved
python/ray/tune/tests/test_logger.py Outdated Show resolved Hide resolved
Copy link

@alberttorosyan alberttorosyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving PR since all comments from my side are addressed.

@tamohannes
Copy link

Hey @ju2ez, I am Hovhannes from Aim 😊. I will love to help you finalize and push the changes. 🚀
Can you please add me as a collaborator with write permissions to your ju2ez/ray repo?

@tamohannes
Copy link

@ju2ez also we can add a flag: as_multirun, which will allow the users to track each trial as a single Aim.Run or to keep them all in the same Aim.Run. I already have a code snippet for this functionality, will love to share it with you!

@ju2ez
Copy link
Contributor Author

ju2ez commented Dec 28, 2022

Hey @ju2ez, I am Hovhannes from Aim 😊. I will love to help you finalize and push the changes. 🚀 Can you please add me as a collaborator with write permissions to your ju2ez/ray repo?

Hey @tmynn,

thanks for your support!
FYI: I'm on vacation until 5th of January and will work on finalizing this PR once I'm back.

I think the as_multirun flag is a great idea!

Best,
Julian

@justinvyu
Copy link
Contributor

@ju2ez @tmynn
Just bumping this thread to see if any assistance is needed on my end! Changes look good and I think what's mostly left is to finish writing the tests and to add an example to the docs. Let me know if there are any questions on what needs to be done, looking forward to this addition!

@ju2ez
Copy link
Contributor Author

ju2ez commented Jan 7, 2023

@ju2ez @tmynn Just bumping this thread to see if any assistance is needed on my end! Changes look good and I think what's mostly left is to finish writing the tests and to add an example to the docs. Let me know if there are any questions on what needs to be done, looking forward to this addition!

@justinvyu thanks for catching up.
Yes, you are right.

Let's try to get it done until the end of next week!

@tamohannes
Copy link

@ju2ez please let me know if you need my help with any of these tasks.

@ju2ez
Copy link
Contributor Author

ju2ez commented Jan 7, 2023

@ju2ez please let me know if you need my help with any of these tasks.

Thanks! @tmynn
It would be great if you could maybe create the docs since you probably much more familiar with aim and how to describe it.

Also it would be great if you could help with the tests, since i don't really know your best practices.

@tamohannes
Copy link

@ju2ez nice, I will start with the docs first. :)

csko and others added 16 commits January 25, 2023 16:26
There is no ubuntu20.04 image for cuda 10.2 and 10.1.

Signed-off-by: Kornél Csernai <749306+csko@users.noreply.github.com>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
…ay-project#31493)

Signed-off-by: Kai Fricke <kai@anyscale.com>

Release tests are currently failing with an error on file upload (botocore.exceptions.DataNotFoundError: Unable to load data for: ec2/2016-11-15/endpoint-rule-set-1). This is likely because some tests are using an anyscale push-based API to upload files. By switching to the job-based filemanager for all tests the upload issue should be mitigated.

Please note that execution will still happen with SDK commands for those tests that haven't specified to use jobs for execution, so actual test execution should be unaffected.

Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
…oject#31468)

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
This PR is to enable lazy execution by default. See ray-project/enhancements#19 for motivation. The change includes:
* Change `Dataset` constructor: `Dataset.__init__(lazy: bool = True)`. Also remove `defer_execution` field, as it's no longer needed.
* `read_api.py:read_datasource()` returns a lazy `Dataset` with computing the first input block.
* Add `ds.fully_executed()` calls to required unit tests, to make sure they are passing.

TODO:
- [x] Fix all unit tests
- [x] ray-project#31459
- [x] ray-project#31460 
- [ ] Remove the behavior to eagerly compute first block for read
- [ ] ray-project#31417
- [ ] Update documentation

Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
…r events (ray-project#31489)

Signed-off-by: praveeng <praveeng@anyscale.com>

# Why are these changes needed?

Autoscaler event logs are prefixed with (scheduler) which is misleading. This PR changes the prefix to be (autoscaler)

Tested building ray locally and running an application (see attached logs). Added unit tests.

# Related issue number

Closes ray-project#24807


Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
…oject#31195)

This took over ray-project#27578 
We add an option to warn user when Deprecated API is used.

Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
moved rl_optimizer logic into rl_trainer

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
Preliminary PR for adding Python 3.11 support, mapping out various dependencies, fixing issues.

### Main changes

- Upgrade cython to 0.29.32
- Add CI/CD steps for 3.11 wheel
- Change cython code to not use `recursion_depth`
- Update cloudpickle to latest
- Use newer manylinux2014 which has python3.11
- Condition certain python packages in requirements.txt on <3.11 that don't yet have a 3.11 version

### Checklist:

- cython
  - [x] remove deprecated `recursion_depth`
  - [x] exc_type cython/cython#4500
- [ ] package dependencies https://pyreadiness.org/3.11/
  - [ ] llvmlite numba/llvmlite#869
  - [ ] numba numba/numba#8304
  - [ ] pyarrow
  - [ ] scikit-learn
  - [ ] pydantic
  - [x] cloudpickle
  - [x] upgrade cython to 0.29.32
  - [ ] tensorflow tensorflow/tensorflow#58032
  - [ ] torch pytorch/pytorch#86566
  - [ ] miniconda conda/conda#11170
    - [ ] `docker/base-deps/Dockerfile`
- [x] claim to support 3.11 in setup.py
- [ ] cicd
  - [ ] .buildkite/
    - [ ] .buildkite/pipeline.build.yml
  - [ ] ci/
    - [ ] ci/build/test-wheels.sh
    - [ ] ci/build/build-docker-images.py
  - [ ] release tests
  - [ ] docker/retag-lambda/python_versions.txt
  - [ ] download_wheels.sh
  - [ ] wheels
    - [ ] `python/build-wheel-macos.sh`
    - [ ] `python/build-wheel-windows.sh`
- [ ] Tests
  - [ ] pytest ray/serve/tests
  - [ ] python python/ray/serve/examples/echo_full.py
  - [ ] bazel test //:core_worker_test
  - [ ] bazel test --build_tests_only //:all
  - [ ] //python/ray/tests:test_pydantic_serialization fastapi/fastapi#5048
  - [ ] //python/ray/train:test_torch_utils
- [ ] Documentation
  - [x] installation.rst

Current status: 
Linux and mac wheels build in CICD. Docker images will come in a separate PR.

Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
Signed-off-by: Julian <julianhatzky@googlemail.com>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
Signed-off-by: Julian <julianhatzky@googlemail.com>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
Signed-off-by: Julian <JulianHatzky@googlemail.com>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
@richardliaw
Copy link
Contributor

hey, this PR history seems pretty messed up -- can we reopen with a new PR?

@ju2ez ju2ez mentioned this pull request Jan 30, 2023
7 tasks
@ju2ez
Copy link
Contributor Author

ju2ez commented Jan 30, 2023

hey, this PR history seems pretty messed up -- can we reopen with a new PR?

@richardliaw @tmynn
Wow - don't know what happened there. Good idea - I opened a new PR to wrap things up there.

Please see #32041

@ju2ez ju2ez closed this Jan 30, 2023
richardliaw pushed a commit that referenced this pull request Mar 3, 2023
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: tmynn <hovhannes.tamoyan@gmail.com>
Co-authored-by: Justin Yu <justinvyu@berkeley.edu>
Closes #30537
Fixes #30674
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request Mar 21, 2023
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: tmynn <hovhannes.tamoyan@gmail.com>
Co-authored-by: Justin Yu <justinvyu@berkeley.edu>
Closes ray-project#30537
Fixes ray-project#30674

Signed-off-by: Jack He <jackhe2345@gmail.com>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: tmynn <hovhannes.tamoyan@gmail.com>
Co-authored-by: Justin Yu <justinvyu@berkeley.edu>
Closes ray-project#30537
Fixes ray-project#30674

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: tmynn <hovhannes.tamoyan@gmail.com>
Co-authored-by: Justin Yu <justinvyu@berkeley.edu>
Closes ray-project#30537
Fixes ray-project#30674
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: tmynn <hovhannes.tamoyan@gmail.com>
Co-authored-by: Justin Yu <justinvyu@berkeley.edu>
Closes ray-project#30537
Fixes ray-project#30674

Signed-off-by: elliottower <elliot@elliottower.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: tmynn <hovhannes.tamoyan@gmail.com>
Co-authored-by: Justin Yu <justinvyu@berkeley.edu>
Closes ray-project#30537
Fixes ray-project#30674

Signed-off-by: Jack He <jackhe2345@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Integration of aim metric tracker as an alternative to tensorboardx in ray tune