[RLlib] Enable cloud checkpointing. #47682

Merged: 20 commits, Sep 25, 2024
Commits
dd040de
Interchanged local filesystem with PyArrow filesystem to be able to s…
simonsays1980 Sep 13, 2024
1ce7acf
Interchanged local filesystem with PyArrow filesystem to be able to r…
simonsays1980 Sep 13, 2024
0c51f7e
Merge branch 'master' into enable-cloud-checkpointing
simonsays1980 Sep 16, 2024
6416f08
Added filesystem to all subcomponent calls and added conversion to st…
simonsays1980 Sep 16, 2024
2367b56
Added pyarrow FileSystem to 'from_checkpoint' and 'get_checkpoint_info'.
simonsays1980 Sep 16, 2024
147f1ea
Added suggestions from @sven1977's review and fixed a small path error.
simonsays1980 Sep 16, 2024
39b5619
Fixed a unit test in 'checkpoint_utils'.
simonsays1980 Sep 16, 2024
8013d7b
Fixed bug in doctests of 'rllib-learner.rst'.
simonsays1980 Sep 17, 2024
9fa4ed9
[spark] Refine comment in Starting ray worker spark task (#47670)
WeichenXu123 Sep 16, 2024
8a4fe7a
[Data] Add `SERVICE_UNAVAILABLE` to list of retried transient errors …
bveeramani Sep 16, 2024
05fd902
[Data] Fix bug where Ray Data incorrectly emits progress bar warning …
bveeramani Sep 16, 2024
3efd47f
[serve] Additional metadata and context (#47652)
zcin Sep 16, 2024
2216f2d
[Core][aDAG] Set buffer size to 1 for regression (#47639)
rkooo567 Sep 16, 2024
3929ce6
[core][aDAG] Fix microbenchmark regression adag 2 (#47683)
rkooo567 Sep 16, 2024
b52a38f
[RLlib] Fix/remove some CI tests and many_ppo release test. (#47686)
sven1977 Sep 16, 2024
d5f1a01
Add perf metrics for 2.36.0 (#47574)
khluu Sep 16, 2024
05c866c
[Core] Added spaces to disallowed char for working dir (#46767)
prithvi081099 Sep 17, 2024
75ddea8
Indented code in docs as CI tests were raising an error.
simonsays1980 Sep 17, 2024
7d35bf7
Merged Master
simonsays1980 Sep 25, 2024
306df5b
Removed indentation in 'rllib-learner.rst'.
simonsays1980 Sep 25, 2024
8 changes: 4 additions & 4 deletions doc/source/rllib/rllib-learner.rst
@@ -319,12 +319,12 @@ Getting and setting state


.. testcode::
-:hide:
+:hide:

-import tempfile
+import tempfile

-LEARNER_CKPT_DIR = str(tempfile.TemporaryDirectory())
-LEARNER_GROUP_CKPT_DIR = str(tempfile.TemporaryDirectory())
+LEARNER_CKPT_DIR = tempfile.mkdtemp()
+LEARNER_GROUP_CKPT_DIR = tempfile.mkdtemp()


Checkpointing
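
For context on the doctest change above: `str()` of a `tempfile.TemporaryDirectory` object returns the object's repr rather than a directory path, whereas `tempfile.mkdtemp()` returns the path of a directory that persists until it is removed. A minimal stdlib-only sketch of the difference (the variable names are illustrative and not taken from the PR):

```python
import os
import shutil
import tempfile

# Old pattern: str() of the object gives its repr, not a usable directory path.
tmp_obj = tempfile.TemporaryDirectory()
old_value = str(tmp_obj)
old_is_dir = os.path.isdir(old_value)

# New pattern: mkdtemp() returns the path string of an existing directory.
new_value = tempfile.mkdtemp()
new_is_dir = os.path.isdir(new_value)

print(old_value, old_is_dir)
print(new_value, new_is_dir)

# Clean up both directories; mkdtemp() leaves removal to the caller.
tmp_obj.cleanup()
shutil.rmtree(new_value)
```

The real path of a `TemporaryDirectory` is available as its `.name` attribute, but the directory is deleted once the object is cleaned up or garbage collected, which is probably why `mkdtemp()` is the more robust choice for a doctest fixture.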
6 changes: 5 additions & 1 deletion rllib/algorithms/algorithm.py
@@ -12,6 +12,7 @@
import os
from packaging import version
import pathlib
+import pyarrow.fs
import re
import tempfile
import time
@@ -305,6 +306,7 @@ class Algorithm(Checkpointable, Trainable, AlgorithmBase):
def from_checkpoint(
cls,
path: Optional[Union[str, Checkpoint]] = None,
+filesystem: Optional["pyarrow.fs.FileSystem"] = None,
*,
# @OldAPIStack
policy_ids: Optional[Collection[PolicyID]] = None,
@@ -324,6 +326,8 @@ def from_checkpoint(
Args:
path: The path (str) to the checkpoint directory to use
or an AIR Checkpoint instance to restore from.
+filesystem: PyArrow FileSystem to use to access data at the `path`. If not
+specified, this is inferred from the URI scheme of `path`.
policy_ids: Optional list of PolicyIDs to recover. This allows users to
restore an Algorithm with only a subset of the originally present
Policies.
@@ -371,7 +375,7 @@ def from_checkpoint(
)
# New API stack -> Use Checkpointable's default implementation.
elif checkpoint_info["checkpoint_version"] >= version.Version("2.0"):
-return super().from_checkpoint(path, **kwargs)
+return super().from_checkpoint(path, filesystem=filesystem, **kwargs)

# This is a msgpack checkpoint.
if checkpoint_info["format"] == "msgpack":
2 changes: 1 addition & 1 deletion rllib/algorithms/tests/test_algorithm.py
@@ -33,7 +33,7 @@
class TestAlgorithm(unittest.TestCase):
@classmethod
def setUpClass(cls):
-ray.init()
+ray.init(local_mode=True)
register_env("multi_cart", lambda cfg: MultiAgentCartPole(cfg))

@classmethod