All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Tests and documentation for `abstract_arrays`.
- [Experimental Feature] Support empty `NamedTuple` leaf when `PyTreeMetadataOptions.support_rich_types=true`.
- Enable `replica_parallel` saving. Uses cache to avoid recomputing `devices_indices_map` for arrays with identical shape/sharding.
- Add Layout support to args.StandardRestore
- Fix namedtuple empty value typestr when experimental support_rich_types is disabled again after enabling it.
- Fix metadata ser/deser of jax registered container nodes like flax.struct.
- [emergency checkpoint] Fix local restore by re-mapping device ids directly instead of inferring them from how process indexes changed across restarts with some false assumptions.
- Coordination service now supports barrier reuse - eliminate some barrier name complexity, including counters.
- [emergency checkpoint] Add ReplicatorCheckpointManager implementation for interoperating with replicator service provided by GKE (or theoretically, any other similar service).
- Correct issue where emergency checkpoint debug logging was not OSS-ed.
- Add `RootMetadata` and `StepMetadata` classes as ways for the user to interface with checkpoint metadata at various levels.
- Add `root_metadata_serialization` and `step_metadata_serialization` modules that contain utilities to perform de/serialization for `RootMetadata` and `StepMetadata`.
- `ReplicaSlice`/`ReplicaSlices` construct to facilitate saving replicated arrays.
- Added restoring with custom jax.experimental.layout.Layout support
- Add experimental `PyTreeMetadataOptions` to manage rich types in pytree checkpoint metadata.
- Create a separate namespace package `orbax.checkpoint.testing` for exporting test objects.
- `replica_parallel` saving that allows arrays with replicated shards to be saved cooperatively by multiple hosts.
- [Experimental Feature] Support `NamedTuple` and `Tuple` nodes in PyTree metadata.
- Add validation to prevent loading an array index that was never written to.
- Add logging to detect missing chunks to emergency checkpointing to facilitate local checkpoint debugging.
- Refactor metadata/tree_test.py and move common test types to `test_tree_utils.py` for better reusability.
- [emergency checkpoint] Break out mesh construction and process ID metadata utils into a separate file.
- Rename `CheckpointMetadataStore` to `MetadataStore`, and change methods to accept and return metadata as dictionaries.
- Move `Checkpointer` implementations to `_src`.
- Add/Update tests for `is_empty_or_leaf` and `is_empty_node`.
- Refactor and restructure constructs from `type_handlers.py` and `metadata` package to avoid circular dependencies.
- Introduce `CheckpointManagerOptions.should_keep_fn` as an alternative to `CheckpointManagerOptions.keep_period`.
- Fix readthedoc build failures on source files in `_src`
- Create `Composite` class, which `CompositeArgs` now subclasses.
- Move `tree` to `_src`.
- Move `serialization` to `_src`.
- Move `type_handlers` to `_src/serialization`
- Add notes to Barrier error `XlaRuntimeError(DEADLINE_EXCEEDED)` with actionable info.
- Make `NameFormat.find_all` impls concurrent.
- Move `path` package under `_src` package.
- Updated readthedoc
- Emergency checkpoint: use JAX for global_max and combine multiple broadcasts into one for `saved` bool broadcast. This should alleviate concerns about broadcasting using the distributed system at large scale.
- Emergency checkpoint: compile broadcast function once at init.
- Fix emergency checkpointing issue arising when repeatedly restoring from a local checkpoint. Process ID remapping may differ from one restore to the next, so new process metadata must be saved with every checkpoint in order to recover the process ID mapping used to save that checkpoint.
- bytes_per_sec calculation needs different values for read and write.
- Add a `SaveArgs` option that allows disabling pinned host transfer on a per-array basis. UPDATE: Modify `enable_pinned_host_transfer` option to be provided once for the entire pytree, since it's not really reasonable to customize this on a per-array level.
- Rename `CheckpointManager._single_item` to `CheckpointManager._default_item`.
- Add `strict` option in `ArrayRestoreArgs`, defaulting to True. This prevents arrays from being accidentally padded or truncated when restoring.
- Move `multihost` implementations to `_src`. Commonly used symbols are still exported in the same way.
- Use `Fragments` for serialization.
- Set `AsyncOptions.timeout_secs` default value to 10 minutes.
- De-duplicate `get_ts_context` usages and move to ts_utils.
- Move `logging` to `_src`.
- Move `metadata` to `_src`.
- Support for Python 3.9.
- Modernize type annotations using `from __future__ import annotations`.
- Adjust `CompositeCheckpointHandler` behavior for unregistered items and empty `args`. Now, `metadata` only returns entries for which an item actually exists in the checkpoint. `restore` will raise an error if a requested item does not exist, and will attempt to restore all existing items if empty `args` are provided.
- Allow registering an item in `DefaultCheckpointHandlerRegistry` without providing the actual handler, as long as the provided args correspond to a globally registered handler. This allows for slightly reduced verbosity if we just want to ensure an association between an item name and args/handler.
- Refactor to extract a separate module, `asyncio_utils`, for asyncio helper functions from `path/async_utils` module.
- Rename `CheckpointMetadata` to `StepMetadata`.
- Support nested `asyncio.run` with `nest_asyncio` library.
- Support empty tuple values in checkpointable params.
- Return empty dict (instead of None) as metadata for empty Mapping typed unregistered types.
- Restore to user mesh after emergency checkpoint local restoration.
- Removed `write_chunk_shape` and `read_chunk_shape` from `SaveArgs`.
- Rename `is_supported_empty_aggregation_type` and `is_supported_aggregation_type` functions.
- Adjust `CompositeCheckpointHandler` behavior for unregistered items and empty `args`. Now, `metadata` only returns entries for which an item actually exists in the checkpoint. `restore` will raise an error if a requested item does not exist, and will attempt to restore all existing items if empty `args` are provided.
- Use `CheckpointHandlerRegistry` when `CheckpointManager` is default-constructed, e.g. `CheckpointManager(directory)`. Previously this usage pattern would default to single-item mode, but the new behavior will default to lazy initialization. The change should be a no-op for existing users.
- Support for `replica_axis_index = 0,1` in `broadcast_one_replica_to_all`.
- Only allow main-thread to reset Save Finalize error.
- Allow Orbax save to run from a non-main thread but continue to support concurrent access to wait_until_finished().
- Refactor `merge_ocdbt_per_process_files()` to return explicitly without merging if none of the child subdir names match `ocdbt.process_` pattern.
- Moved `_choose_chunk_shape` from `type_handlers.py` to a new internal module.
- Validate merged and saved checkpoint params by comparing it against its `.zarray` counterpart.
- Added private utilities to work with array fragments (`_src/arrays`).
- Log `orbax-checkpoint version` when Checkpointer instances are created.
- NamedShardingMetadata includes full device mesh which will be used for restoration when sharding in restore_args is not provided.
- Common public types to annotate pytrees.
- Add utils that act as a heuristic for detecting standard Orbax checkpoints.
- Add memory-based rate limiting support during save.
- Metrics to track bytes/sec during save and restore.
- Added `handler_registration.HandlerRegistry` to store custom mappings between checkpointable items and checkpoint args to a checkpoint handler.
- Added `handler_registration.HandlerRegistry` argument to `CompositeCheckpointHandler`. This replaces the `item_names` and `items_and_handlers` arguments for configuring items and handlers.
- Added `handler_registration.HandlerRegistry` argument to `CheckpointManager`. This replaces the `checkpointers`, `item_names` and `items_and_handlers` arguments for configuring items and handlers.
- Raise error if `AsyncOptions.post_finalization_callback` is given but `CheckpointManager._checkpointer` is not async.
- Add Tensorstore memory debug logs when debug message is turned on, e.g. `--vmodule=type_handlers=1`.
- Fix callsites of handler.async_save to handle returned None.
- Improve logging by adding jax_process, error logs in threads and more...
- Improvements to blocking save time, as a result of moving file open operations into the background.
- Consolidate usages of `MultiprocessingOptions` and `AsyncOptions`.
- Formalize `StandardCheckpointer` as an `AsyncCheckpointer` that doesn't require a `CheckpointArgs` object, and instead allows directly passing the state and extra args (breaking change).
- Add log messages to improve debugging with jax process and threads.
- Improve logging to turn on DEBUG log per source file.
- Update bytes/sec metrics to be per-host.
- Parallelize many directory creations using asyncio. Also reduce the number of `asyncio.run` calls by moving `async` functions higher in the stack.
- Allow CheckpointManager.wait_until_finished to be called concurrently.
- Add utils that act as a heuristic for detecting standard Orbax checkpoints.
- Add memory-based rate limiting support during save.
- Metrics to track bytes/sec during save and restore.
- Preliminary concurrent creation of parameter directories from `PyTreeCheckpointHandler` (in non-OCDBT case).
- In `CheckpointManager.restore`, allow specifying `step=None`, which automatically restores from latest.
- Emergency checkpointing bug-fixes for CPU backend.
- Move `get_param_names` to tree utils.
- Move all work for `_write_metadata_file` into a background thread to avoid O(n) computation in building metadata.
- Adjust user metadata construction so that older checkpoints with some values in the msgpack file can still return metadata (though the metadata for these entries will not return information about array properties).
- Rolled forward change to improve TensorStore I/O efficiency.
- Memory efficient broadcasting from one model replica to others.
- Allow one directory creation request per item rather than 1 per item per host.
- Make atomicity logic configurable, and encapsulate it within a class.
- Refactor ts.Context usage to be per-operation (save/restore) rather than a global object. This helps fix edge cases where a user repeatedly overwrites a single checkpoint.
- Move module-level counters to one place for greater compartmentalization. Add logic to the barrier-compatible test fixture that allows each test case to have its own module-level counter, to avoid problems that arise when multiprocess tests run in inconsistent orders.
- Ensure D2H transfers are parallelized.
- Fork JAX serialization library into orbax/checkpoint.
- updated required Jax version to fix PY3.9 build
- Earlier change relying on a flag that was not available in all environments.
- Rolled back change in previous release to improve TensorStore I/O efficiency. This change caused some unexpected failures on certain storage systems.
- Add memory-based rate limiting support during save.
- Checkpoint format guide for RTD page.
- Modify `_write_metadata_file` to Async.
- Added `tree` package to contain tree-related utilities.
- Improve TensorStore I/O efficiency through use of TensorStore transactions for the OCDBT storage format, and specify the new `can_reference_source_data_indefinitely=True` option to avoid a redundant copy when writing into the TensorStore chunk cache.
- Stop writing msgpack file for new checkpoints and update empty nodes handling so that it no longer depends on this file.
- Introduce `FileOptions` as CheckpointManagerOptions attribute.
- Support non blocking CheckpointMetadataStore.write.
- Deadlock observed when using multiple AsyncCheckpointers at once.
- Delegate to BasePyTreeCheckpointHandler rather than inheriting from it.
- Emergency checkpointing bug-fixes
- Introduce `should_save_fn` in Orbax `CheckpointManagerOptions`.
- Introduce `StepAlreadyExistsError` to be raised on save with existing step.
- Modify `JsonCheckpointHandler` to Async.
- Fix empty metadata file error: Expecting value: line 1 column 1 (char 0)
- Implement restoration in emergency.CheckpointManager.
- Separate `metadata` package to encapsulate metadata-related utils and constructs.
- Add step utils to get metadata of latest checkpoint and from checkpoint path.
- Support checkpoint metadata at step level.
- Use checkpoint metadata module to store commit timestamp.
- In `checkpoints_iterator`/`wait_for_new_checkpoint`, ensure that if steps are present, they will be yielded even if `timeout_fn` already returns True.
- Refactor and move path and step utils to new path/utils and step modules respectively.
- Refactor Tensorstore-related codes in type_handlers.py.
- Update `NameFormat.find_step` logic to exclude uncommitted checkpoints.
- Abstract `jax.process_index` into `multihost.process_index`.
- Factor out core PyTree checkpointing logic into a `PyTreeCheckpointHandlerImpl` class.
- Use unique count instead of timestamp in tmp directory construction.
- Tidy up `NameFormat` API: remove `build_metadata` and `find_metadata` methods from the public API.
- `ocdbt_merge` option and unused `restore_with_serialized_types` option from `PyTreeCheckpointHandler`.
- OCDBT coordinator code. These functions are no longer needed.
- `write_tree_metadata` option, as there is no real reason to disable this now.
- Add path package to export symbols, also add Step rst docs.
- Create composite step `NameFormat`.
- Docs on working with PyTrees/arrays.
- Improve step lookup error message by adding expected names to it.
- Error messages when sharding is not specified.
- For `broadcast_one_replica_to_all`, delete input arrays as soon as possible to conserve memory.
- Added timeout_fn arg to `wait_for_new_checkpoint` and `_wait_for_new_checkpoint`
- Support for Python 3.9
- Add CheckpointManagerOptions.enable_background_delete to avoid blocking the manager.save() code path
- `broadcast_one_to_some` function.
- Allow running Orbax code on a subset of processes. Note that this currently does not work on async code.
- Add `SingleReplicaArrayRestoreArgs.primary_replica_index` to select which replica to load checkpoint and broadcast when `SingleReplicaArrayHandler` is used.
- CheckpointManager is defined as a ContextManager directly
- `checkpoint_manager_context` is deprecated
- Checkpointer is defined as a ContextManager directly
- `checkpointer_context` is deprecated
- AsyncCheckpointer is defined as a ContextManager directly
- `async_checkpointer_context` is deprecated
- Refactored to create `multihost_utils` module.
- Remove unnecessary barriers in CheckpointHandlers.
- Added Zarr3 support for numpy array
- Added PyTreeSaveArgs.ocdbt_target_data_file_size to control the target_data_file_size when OCDBT is enabled
- Expanded skeleton of `emergency.CheckpointManager`.
- Fix TypeHandler Registry errors for `empty ([], {}, None)` values with `SaveArgs(aggregate=False)`.
- Add new `step` module with step naming and query support.
- Add new option `enable_background_delete` to CheckpointManager so that old checkpoints can be deleted in the background.
- Use step `NameFormat` and `StandardNameFormat` in Orbax `CheckpointManager`, `utils` and `checkpoint_utils` modules.
- Add JAX version guards to changes that require newer version of JAX
- Issue in which `CompositeCheckpointHandler` would create item directories repeatedly in parallel, resulting in multiple requests per host, for every host.
- Log `SaveArgs.aggregate` deprecation warning message once in 12 hours.
- If CheckpointManagerOptions.read_only=True then automatically reset "write" options (instead of raising error).
- Modify `create_empty` to be `erase_and_create_empty` instead to make its potential dangers a bit more apparent.
- Single replica broadcasting when training on multiple hosts/pods. By default, replica zero along the first axis dimension reads the checkpoint then broadcasts to other replicas.
- Added experimental_emergency_checkpoint
- `ShardingMetadata` class that represents `jax.sharding.Sharding` properties but does not require accessing real devices.
- Fix broken logging issue.
- Add `reload` method in `CheckpointManager` which resets internal properties.
- Add checkpoint announcements RTD page.
- Deprecate `read` option in `all_steps`.
- Deprecate `SaveArgs.aggregate`.
- Update copyright to use 2024. Also, allow notebook cells to continue if last cell raises error.
- Improve RTD by using `ocp.utils.to_shape_dtype_struct` instead of `jax.eval_shape`.
- Ensure timeout passed via `AsyncCheckpointer` in `CheckpointManager` is propagated, for legacy compatibility.
- Modified sharding file writes to use `tensorstore.Transaction` due to recent tensorstore change. This resolves a slowdown in save speed recently observed.
- Add documentation associated with `item_handlers`, the new arguments introduced in `CheckpointManager` constructor.
- Update documentation of the new `CheckpointArgs` based `CheckpointHandler` API.
- Stop blocking on previous save when `should_save` is False.
- Expose `AsyncOptions` for `CheckpointManager` users.
- Introduce item_handlers to CheckpointManager ctor to allow configurable Handler setup.
- Add JaxRandomKeyCheckpointHandler to store Jax random key generated from jax.random.key() or jax.random.PRNGKey()
- Add NumpyRandomKeyCheckpointHandler to store Numpy random state from numpy.random.get_state()
- Deprecation warning for users of the old `CheckpointManager` API.
- `CheckpointManager` API migration guide.
- New backwards-compatible API for `CheckpointManager` making use of `CheckpointArgs` and `CompositeCheckpointHandler`.
- Documentation for new APIs.
- Provide better support for custom `CheckpointHandler`s without registered `CheckpointArgs` by providing a wrapper `CheckpointHandler` as a fallback. This class is introduced for backwards compatibility, and will eventually be removed.
- New parameter `chunk_byte_size` in `SaveArgs`. A convenient way to choose the write and read chunk shapes using Zarr3.
- Issues interpreting `ValueMetadata` in `construct_restore_args`.
- Refactor bad `wait_until_finished` design in `CheckpointManager` where `wait_until_finished` would try to join a thread, which itself called `wait_until_finished`.
- `CompositeCheckpointHandler` API. This will soon replace `CheckpointManager`'s handling of distinct items, and allows users to work with separate items at the `Checkpointer` layer.
- `use_ocdbt` `PyTreeCheckpointHandler` option is no longer kept as a mutable global state on type handlers and is now passed around with the state to save or restore.
- Bug arising when an error occurs in the background thread, and we try to remove a non-existent checkpoint from `_interval_preserved_checkpoints` in `CheckpointManager`.
- Zarr Version 3 allowing custom write and read chunk configurations
- Bug where errors in the background thread of `AsyncCheckpointer` would not be raised in `CheckpointManager`, causing later errors when trying to remove non-existent checkpoints.
- Support for `CheckpointArgs` in core `CheckpointHandler` implementations. Allowed specifying either `CheckpointArgs` or keyword args in `Checkpointer`.
- Introduce CheckpointManagerOptions.read_only to control save/delete behaviors.
- Use `json` directly instead of `JsonCheckpointHandler` to write and read `PyTreeCheckpointHandler`'s metadata.
- Enable OCDBT-Merge by default
- Introduce CheckpointManagerOptions.todelete_subdir option to rename deletable dirs.
- Support jax.sharding.SingleDeviceSharding in self-describing PyTree checkpoints.
- `name` and `directory` properties in value `Metadata`.
- Introduce AbstractCheckpointManager protocol for the already existing CheckpointManager concrete class.
- Turn on self-describing PyTree checkpoints by default.
- `CheckpointArgs` arguments for save and restore in `StandardCheckpointHandler` and `PyTreeCheckpointHandler`.
- Return empty dict if the CheckpointManager level metadata is not available. Currently it raises error due to missing metadata dir.
- Remove unfinalized checkpoint directory from previous runs for GCS.
- Custom `finalize` callback for `CheckpointHandler`.
- Merge/finalize logic for Tensorstore when using OCDBT driver.
- Barrier synchronization in `AsyncCheckpointer` refactored to allow plugging in alternative implementations.
- Added `parent_dir` under `ParamInfo` and modified sharding file to be saved under `ParamInfo.parent_dir`.
- In `all_steps`, minimize disk queries on non-leader processes.
- Put `all_steps` load on single host and broadcast feature behind an option that defaults to False.
- Rename StandardCheckpointHandler.save/restore named arg, 'state' to 'item'.
- OCDBT coordinator.
- `CompositeCheckpointHandler`
- `CheckpointArgs`
- Forked `AsyncManager` into Orbax and replaced `AsyncCheckpointer`'s inheritance from it with composition for easier customization.
- Added support for automatically inferring sharding if not provided by `RestoreArgs`.
- Fix sync error with removing old checkpoints.
- Fix missing checkpoint benchmarks images in RTD page. https://orbax.readthedocs.io/en/latest/optimized_checkpointing.html#introducing-ocdbt
- Use `nest_asyncio` by default. This allows users to make calls to orbax from within `async` functions.
- Modified `StringHandler` serialization and deserialization to use Tensorstore Json driver for async file reads and writes.
- Marked `transform_utils` as deprecated.
- Change `PyTreeCheckpointHandler`'s parameter `use_ocdbt` default to `True`
- Modified sharding property in value_metadata to `jax.sharding.Sharding`
- User metadata when dealing with empty nodes that follow non-empty nodes in a list.
- Modify `_get_user_metadata` to exclude empty nodes.
- Fix GCS issue where an error would be encountered when trying to save a step with the same number as an existing tmp directory. This scenario arises when restarting after preemption, without enabling `cleanup_tmp_directories` in `CheckpointManager`.
- Fix `create_coordinator_server_and_context()` breaking old codes that expect it to return a tuple. The function also prints a deprecation warning.
- `StandardCheckpointHandler`.
- Fix concurrency issue when `are_locked` check lags behind checkpoint deletion.
- Fully self-describing PyTree checkpoints with type information, stored in metadata using JSON format (not currently enabled by default).
- PyTree leaf metadata returned by `metadata` function (dependent on the above.)
- Correctly set TYPESTR_REGISTRY, to account for OCDBT option.
- Deprecated `restore_args_from_target` function.
- Option to not port `global_shape` to `ArrayRestoreArgs`.
- Protobuf metadata saved by PyTreeCheckpointHandler.
- `close` method for Checkpointer and CheckpointHandler.
- Context manager helper functions for Checkpointer and CheckpointManager.
- Protobuf metadata saved by PyTreeCheckpointHandler.
- Allow calling `create_coordinator_server_and_context` without an initialized JAX coordinator server and create TS metadata for numpy arrays on a single process.
- Refactor TypeHandler to operate over batches of values, rather than individual ones.
- Removed support for lazy restoration. Supported via transformations.
- ProtoCheckpointHandler.
- Fix issue with `multi_value_fn` in restoration where an input value could be loaded many times unnecessarily.
- Eliminates hosts sync on background thread and fixes issue with reading lockfile before checkpoint is finalized.
- Fix unlocking function, which may fail if there are multiple evaluators running concurrently.
- Locking mechanism for `checkpoints_iterator` to prevent checkpoints from being cleaned up by a `CheckpointManager` in another process while they are being read.
- Allow saving the aggregated file asynchronously.
- Slightly different behavior for `wait_for_new_checkpoint`, allowing it to wait until a certain checkpoint, rather than strictly after a given step.
- Allow value_fn and multi_value_fn to accept RestoreArgs as an argument when they are used during restore, so that the user may customize the returned value based on what is requested by RestoreArgs.
- Support creating sharded array when ArrayRestoreArgs is passed and the value was originally aggregated.
- Support `max_to_keep=0`.
- Support for PyTree keys with '/' (à la Haiku).
- GCS error when `cleanup_tmp_directories=False` which caused an internal assertion to be raised when saving over an existing temporary directory.
- `merge_trees` function in `transform_utils`.
- Explicit Python version support for 3.9, 3.10, 3.11
- Raise ValueError when trying to save jax.Array to the aggregate file if it is not fully replicated.
- Raise error message when the user tries to save host local arrays that are typically obtained using pmap.
- Option to allow users to disable automatic temporary directory cleanup upon CheckpointManager initialization.
- Error message when metadata file ('.zarray') is missing.
- `reached_preemption` function to allow the user to detect if a preemption signal has been received.
- Set `create` option to True by default.
- Msgpack encoding of tuples.
- Asyncio issue affecting python<=3.9.
- Tensorstore options to improve OCDBT performance.
- Add support for `value_fn` transformations during restore.
- Support for `multi_value_fn` transformations during restore.
- Reworked transformations logic in restore to happen in a more intuitive order, with lazy loading to avoid materializing unnecessary arrays.
- Slow repeated calls to check whether a checkpoint is OCDBT format or not.
- Increased minimum tensorstore version to what's needed for OCDBT.
- `orbax-checkpoint` is introduced, a namespace package under `orbax`. Importing this package takes the form `import orbax.checkpoint` or `from orbax import checkpoint`.
- Support for OCDBT driver in Tensorstore.
- Small bug fixes.
- Use a more precise timestamp when generating temporary directory names to permit more than one concurrent checkpointing attempt per second.
- Automatic import of nest_asyncio.
- Support for generic transformation function in PyTreeCheckpointHandler.
- Support n-digit checkpoint step format.
- Eliminate Flax dependency to fix circular dependency problem.
- `sharding` option on `ArrayRestoreArgs`
- Add "standard user recipe" to documentation.
- Add unit tests using mock to simulate preemption.
- Logging to increase transparency around why checkpoints are kept vs. deleted.
- Expand on uses of restore_args in colab.
- Expose utils_test.
- Add msgpack_utils to move toward eliminating Flax dependency.
- CheckpointManager starts a background thread to finalize checkpoints so that checkpoints are finalized as soon as possible in async case.
- Remove CheckpointManager update API.
- Remove support for deprecated GDA.
- Add tmp suffix on step directory creation in CheckpointManager.save.
- Preemption when using keep_time_interval caused the most recent steps before preemption to be kept, despite not falling on the keep time interval.
- A util function that constructs restore_args from a target PyTree.
- CheckpointManager `delete` API, which allows deleting an existing step.
- Made dev dependencies optional to minimize import overhead.
- Refactored higher-level utils in checkpoint_utils, which provides user-convenience functions.
- Guard option to create top-level directory behind `create` option.
- Remove support for Python 3.7.
- Check for metric file in addition to item directory in CheckpointManager.
- Additional logs to indicate save/restore completion.
- Support for None leaves in PyTree save/restore.
- ArrayCheckpointHandler for individual arrays/scalars.
- `read: bool` option on all_steps to force read from storage location instead of using cached steps.
- Simplified "Getting Started" section in the docs.
- CheckpointManager creates the top level directory if it does not yet exist.
- Write msgpack bytes asynchronously.
- Removed some unused test_utils methods for filtering empty nodes.
- Update docs on `PyTreeCheckpointHandler`.
- Removed unneeded AbstractCheckpointManager.
- Usage of bytes_limiter to prevent too many bytes from being read during a single restore call.
- Temp checkpoint cleanup when using a step prefix (i.e. 'checkpoint_0').
- Option to customize metadata file name for Tensorstore.
- Restore failure on GCS due to misidentification of checkpoint as "not finalized".
- Added CHANGELOG.md for version updates (additions and changes), ingested by auto-publish functionality.
- Fix mistaken usages of placeholder "AGGREGATED" where "NOT-AGGREGATED" would be more appropriate. Ensure backwards compatibility is maintained.
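
Several entries above mention checkpoint retention options (`max_to_keep`, `keep_period`, and `should_keep_fn` as an alternative to `keep_period`). The following is a minimal, library-free sketch of how such options could interact. It is not Orbax's implementation: the helper names `steps_to_keep` and `keep_every` are hypothetical, and the sketch assumes that a step is retained when it is among the most recent `max_to_keep` steps or when the predicate returns True for it.

```python
from typing import Callable, Iterable, List, Optional


def keep_every(period: int) -> Callable[[int], bool]:
    """Hypothetical predicate mimicking a keep_period-style rule."""
    return lambda step: step % period == 0


def steps_to_keep(
    steps: Iterable[int],
    max_to_keep: Optional[int] = None,
    should_keep_fn: Optional[Callable[[int], bool]] = None,
) -> List[int]:
    """Returns the steps that would survive cleanup under the assumed policy."""
    ordered = sorted(steps)
    if max_to_keep is None:
        # No limit configured: nothing is deleted.
        return ordered
    # Guard max_to_keep=0 explicitly, since ordered[-0:] is the whole list.
    recent = set(ordered[-max_to_keep:]) if max_to_keep > 0 else set()
    return [
        s
        for s in ordered
        if s in recent or (should_keep_fn is not None and should_keep_fn(s))
    ]


kept = steps_to_keep(range(11), max_to_keep=2, should_keep_fn=keep_every(5))
print(kept)
```

With steps 0 through 10, the two most recent steps (9 and 10) are retained along with the steps the predicate selects (0, 5 and 10). Passing an arbitrary callable in place of `keep_every(5)` shows why a predicate is more flexible than a fixed period.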