23 Jan 10:02

aslonnie

021baf7

Ray-2.41.0

Highlights

Major update of RLlib docs and example scripts for the new API stack.

Ray Libraries

Ray Data

🎉 New Features:

Expression support for filters (#49016)
Support partition_cols in write_parquet (#49411)
Implement multi-directional sort over Ray Data datasets (#49281)
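
To illustrate what multi-directional sorting (#49281) means, the sketch below orders rows by several keys, each with its own direction. It is plain Python and not Ray Data's implementation; the helper `multi_directional_sort` and the sample rows are hypothetical. In Ray Data the same intent is presumably expressed through `Dataset.sort` with one direction per key, but treat that as an assumption and check the API reference for the exact signature.

```python
from operator import itemgetter


def multi_directional_sort(rows, keys, descending):
    """Sort dict rows by several keys, each with its own direction (hypothetical helper)."""
    if len(keys) != len(descending):
        raise ValueError("keys and descending must have the same length")
    result = list(rows)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields the combined ordering.
    for key, desc in reversed(list(zip(keys, descending))):
        result.sort(key=itemgetter(key), reverse=desc)
    return result


rows = [
    {"team": "a", "score": 1},
    {"team": "b", "score": 3},
    {"team": "a", "score": 2},
    {"team": "b", "score": 0},
]

# team ascending, score descending within each team
ordered = multi_directional_sort(rows, ["team", "score"], [False, True])
for row in ordered:
    print(row)
```

Within each team the highest score comes first, while the teams themselves stay in ascending order.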

💫 Enhancements:

Use dask 2022.10.2 (#48898)
Clarify schema validation error (#48882)
Raise ValueError when the data sort key is None (#48969)
Provide more informative error messages when the webdataset format is invalid (#48643)
Upgrade Arrow version from 17 to 18 (#48448)
Update hudi version to 0.2.0 (#48875)
webdataset: expand JSON objects into individual samples (#48673)
Support passing kwargs to map tasks. (#49208)
Add ExecutionCallback interface (#49205)
Add seed for read files (#49129)
Make select_columns and rename_columns use Project operator (#49393)

🔨 Fixes:

Fix partial function name parsing in map_groups (#48907)
Always launch one task for read_sql (#48923)
Reimplement the pandas memory fix (#48970)
webdataset: flatten return args (#48674)
Handle numpy > 2.0.0 behaviour in _create_possibly_ragged_ndarray (#48064)
Fix DataContext sealing for multiple datasets. (#49096)
Fix to_tf for List types (#49139)
Fix type mismatch error while mapping nullable column (#49405)
Datasink: support passing write results to on_write_complete (#49251)
Fix groupby hang when value contains np.nan (#49420)
Fix bug where file_extensions doesn't work with compound extensions (#49244)
Fix map operator fusion when concurrency is set (#49573)
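
The compound-extension fix (#49244) is easier to follow with a small example. The sketch below is plain Python, not Ray Data's code; `has_extension` is a hypothetical helper. It shows why looking only at the last suffix misses an extension such as `tar.gz`, and how matching on the end of the path handles it.

```python
import os


def has_extension(path, extensions):
    """Return True if path ends with any of the given extensions (hypothetical helper)."""
    lowered = path.lower()
    return any(
        lowered.endswith("." + ext.lower().lstrip("."))
        for ext in extensions
    )


path = "shards/train-0001.tar.gz"

# os.path.splitext only returns the final suffix, so a compound
# extension like "tar.gz" never matches it.
last_suffix = os.path.splitext(path)[1]
print(last_suffix)

# Matching on the end of the path handles simple and compound extensions.
print(has_extension(path, ["tar.gz"]))
print(has_extension("shards/train-0001.gz", ["tar.gz"]))
```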

Ray Train

🎉 New Features:

Output JSON structured log files for system and application logs (#49414)
Add support for AMD ROCR_VISIBLE_DEVICES (#49346)

💫 Enhancements:

Implement Train Tune API Revamp REP (#49376, #49467, #49317, #49522)

🏗 Architecture refactoring:

LightGBM: Rewrite get_network_params implementation (#49019)

Ray Tune

🎉 New Features:

Update optuna_search to allow users to configure optuna storage (#48547)

🏗 Architecture refactoring:

Make changes to support Train Tune API Revamp REP (#49308, #49317, #49519)

Ray Serve

💫 Enhancements:

Improved request_id generation to reduce proxy CPU overhead (#49537)
Tune GC threshold by default in proxy (#49720)
Use pickle.dumps for faster serialization from proxy to replica (#49539)

🔨 Fixes:

Handle nested ‘=’ in serve run arguments (#49719)
Fix bug when ray.init() is called multiple times with different runtime_envs (#49074)
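
For the nested '=' fix (#49719), the underlying parsing issue can be sketched in plain Python. This is an illustration under assumptions, not Serve's actual parser: `parse_key_value_args` is a hypothetical helper that splits each `key=value` argument on the first '=' only, so any further '=' characters stay in the value.

```python
def parse_key_value_args(args):
    """Parse key=value strings, splitting on the first '=' only (hypothetical helper)."""
    parsed = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {arg!r}")
        parsed[key] = value
    return parsed


args = ["url=http://host/path?q=1", "name=demo"]

# A naive split on every '=' breaks the first argument into three pieces.
naive_parts = args[0].split("=")
print(len(naive_parts))

parsed = parse_key_value_args(args)
print(parsed)
```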

🗑️ Deprecations:

Added a warning that the default behavior for sync methods will change in a future release: they will run in a threadpool by default. To opt into this behavior early, set RAY_SERVE_RUN_SYNC_IN_THREADPOOL=1. (#48897)
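
Why this matters can be shown with plain asyncio. The sketch below does not use Ray Serve; `blocking_handler`, `heartbeat`, and `handle_request` are hypothetical names. It compares calling a blocking synchronous function directly on the event loop with running it in a threadpool, and counts how much other work the loop gets done in the meantime.

```python
import asyncio
import time


def blocking_handler():
    # Stand-in for a synchronous request handler that blocks (for example on I/O).
    time.sleep(0.2)
    return "done"


async def heartbeat(ticks):
    # Stand-in for other work that shares the event loop.
    for _ in range(5):
        await asyncio.sleep(0.02)
        ticks.append(1)


async def handle_request(use_threadpool):
    ticks = []
    background = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # give the heartbeat a chance to start
    if use_threadpool:
        loop = asyncio.get_running_loop()
        result = await loop.run_in_executor(None, blocking_handler)
    else:
        result = blocking_handler()  # runs on, and blocks, the event loop
    ticks_during_call = len(ticks)
    await background
    return result, ticks_during_call


inline_result, inline_ticks = asyncio.run(handle_request(use_threadpool=False))
pooled_result, pooled_ticks = asyncio.run(handle_request(use_threadpool=True))
print("ticks while handler ran inline:", inline_ticks)
print("ticks while handler ran in a threadpool:", pooled_ticks)
```

When the handler runs inline, the heartbeat makes no progress until the handler returns; in the threadpool case the loop keeps serving other work. Code that relies on sync methods running on the event loop thread (for example, non-thread-safe state) may need review before opting in.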

RLlib

🎉 New Features:

Add support for external Envs to new API stack: New example script and custom tcp-capable EnvRunner. (#49033)

💫 Enhancements:

Offline RL:
- Add sequence sampling to EpisodeReplayBuffer. (#48116)
- Allow incomplete SampleBatch data and fully compressed observations. (#48699)
- Add option to customize OfflineData. (#49015)
- Enable offline training without specifying an environment. (#49041)
- Various fixes: #48309, #49194, #49195
APPO/IMPALA acceleration (new API stack):
- Add support for AggregatorActors per Learner. (#49284)
- Auto-sleep time AND thread-safety for MetricsLogger. (#48868)
- Activate APPO cont. actions release- and CI tests (HalfCheetah-v1 and Pendulum-v1 new in tuned_examples). (#49068)
- Add "burn-in" period setting to the training of stateful RLModules. (#49680)
Callbacks API: Add support for individual lambda-style callbacks. (#49511)
Other enhancements: #49687, #49714, #49693, #49497, #49800, #49098

📖 Documentation:

New example scripts:
- How to write a custom algorithm (VPG) from scratch. (#49536)
- How to customize an offline data pipeline. (#49046)
- GPUs on EnvRunners. (#49166)
- Hierarchical training. (#49127)
- Async gym vector env. (#49527)
- Other fixes and enhancements: #48988, #49071
New/rewritten html pages:
- Rewrite checkpointing page. (#49504)
- New scaling guide. (#49528)
- New callbacks page. (#49513)
- Rewrite RLModule page. (#49387)
- New AlgorithmConfig page and redo package_ref page for algo configs. (#49464)
- Rewrite offline RL page. (#48818)
- Rewrite "key concepts" rst page. (#49398)
- Rewrite RL environments pages. (#49165, #48542)
- Fixes and enhancements: #49465, #49037, #49304, #49428, #49474, #49399, #49713, #49518

🔨 Fixes:

Add on_episode_created callback to SingleAgentEnvRunner. (#49487)
Fix train_batch_size_per_learner problems. (#49715)
Various other fixes: #48540, #49363, #49418, #49191

🏗 Architecture refactoring:

RLModule: Introduce Default[algo]RLModule classes (#49366, #49368)
Remove RLlib dependencies from setup.py; add ormsgpack (#49489)

🗑️ Deprecations:

#49488, #49144

Ray Core and Ray Clusters

Ray Core

💫 Enhancements:

Add task_name, task_function_name and actor_name in Structured Logging (#48703)
Support redis/valkey authentication with username (#48225)
Add v6e TPU Head Resource Autoscaling Support (#48201)
compiled graphs: Support all driver and actor read combinations (#48963)
compiled graphs: Add ascii based CG visualization (#48315)
compiled graphs: Add ray[cg] pip install option (#49220)
Allow uv cache at installation (#49176)
Support != Filter in GCS for Task State API (#48983)
compiled graphs: Add CPU-based NCCL communicator for development (#48440)
Support gcs and raylet log rotation (#48952)
compiled graphs: Support nsight.nvtx profiling (#49392)

🔨 Fixes:

autoscaler: Health check logs are not visible in the autoscaler container's stdout (#48905)
Only publish WORKER_OBJECT_EVICTION when the object is out of scope or manually freed (#47990)
autoscaler: Autoscaler doesn't scale up correctly when the KubeRay RayCluster is not in the goal state (#48909)
autoscaler: Fix incorrectly terminating nodes misclassified as idle in autoscaler v1 (#48519)
compiled graphs: Fix the missing dependencies when num_returns is used (#49118)
autoscaler: Fuse scaling requests together to avoid overloading the Kubernetes API server (#49150)
Fix bug to support S3 pre-signed url for .whl file (#48560)
Fix data race on gRPC client context (#49475)
Make sure draining node is not selected for scheduling (#49517)

Ray Clusters

💫 Enhancements:

Azure: Enable accelerated networking as a flag in azure vms (#47988)

📖 Documentation:

Kuberay: Logging: Add Fluent Bit DaemonSet and Grafana Loki to "Persist KubeRay Operator Logs" (#48725)
Kuberay: Logging: Specify the Helm chart version in "Persist KubeRay Operator Logs" (#48937)

Dashboard

💫 Enhancements:

Add instance variable to many default dashboard graphs (#49174)
Display duration in milliseconds if under 1 second. (#49126)
Add RAY_PROMETHEUS_HEADERS env for carrying additional headers to Prometheus (#49353)
Document the RAY_PROMETHEUS_HEADERS env var for carrying additional headers to Prometheus (#49700)

🏗 Architecture refactoring:

Move memray dependency from default to observability (#47763)
Move StateHead's methods into free functions. (#49388)

Thanks

@raulchen, @alanwguo, @omatthew98, @xingyu-long, @tlinkin, @yantzu, @alexeykudinkin, @andrewsykim, @win5923, @csy1204, @dayshah, @richardliaw, @stephanie-wang, @gueraf, @rueian, @davidxia, @fscnick, @wingkitlee0, @KPostOffice, @GeneDer, @MengjinYan, @simonsays1980, @pcmoritz, @petern48, @kashiwachen, @pfldy2850, @zcin, @scottjlee, @Akhil-CM, @Jay-ju, @JoshKarpel, @edoakes, @ruisearch42, @gorloffslava, @jimmyxie-figma, @bthananjeyan, @sven1977, @bnorick, @jeffreyjeffreywang, @ravi-dalal, @matthewdeng, @angelinalg, @ivanthewebber, @rkooo567, @srinathk10, @maresb, @gvspraveen, @akyang-anyscale, @mimiliaogo, @bveeramani, @ryanaoleary, @kevin85421, @richardsliu, @hartikainen, @coltwood93, @mattip, @Superskyyy, @justinvyu, @hongpeng-guo, @ArturNiederfahrenhorst, @jecsand838, @Bye-legumes, @hcc429, @WeichenXu123, @martinbomio, @HollowMan6, @MortalHappiness, @dentiny, @zhe-thoughts, @anyadontfly, @smanolloff, @richo-anyscale, @khluu, @xushiyan, @rynewang, @japneet-anyscale, @jjyao, @sumanthratna, @saihaj, @aslonnie

Many thanks to all those who contributed to this release!

04 Dec 00:01

dayshah

ray-2.40.0

22541c3

Ray-2.40.0

Ray Libraries

Ray Data

🎉 New Features:

Added read_hudi (#46273)

💫 Enhancements:

Improved performance of DelegatingBlockBuilder (#48509)
Improved memory accounting of pandas blocks (#46939)

🔨 Fixes:

Fixed bug where you can’t specify a schema with write_parquet (#48630)
Fixed bug where to_pandas errors if your dataset contains Arrow and pandas blocks (#48583)
Fixed bug where map_groups doesn’t work with pandas data (#48287)
Fixed bug where write_parquet errors if your data contains nullable fields (#48478)
Fixed bug where the “Iteration Blocked Time” chart looks incorrect (#48618)
Fixed bug where unique fails with null values (#48750)
Fixed bug where “Rows Outputted” is 0 in the Data dashboard (#48745)
Fixed bug where methods like drop_columns cause spilling (#48140)
Fixed bug where async map tasks hang (#48861)

🗑️ Deprecations:

Deprecated read_parquet_bulk (#48691)
Deprecated iter_tf_batches (#48693)
Deprecated meta_provider parameter of read functions (#48690)
Deprecated to_torch (#48692)

Ray Train

🔨 Fixes:

Fix StartTracebackWithWorkerRank serialization (#48548)

📖 Documentation:

Add example for fine-tuning Llama3.1 with AWS Trainium (#48768)

Ray Tune

🔨 Fixes:

Remove the clear_checkpoint function during Trial restoration error handling. (#48532)

Ray Serve

🎉 New Features:

Initial version of local_testing_mode (#48477)

💫 Enhancements:

Handle multiple changed objects per LongPollHost.listen_for_change RPC (#48803)
Add more nuanced checks for http proxy status errors (#47896)
Improve replica access log messages to include HTTP status info and better resemble standard log format (#48819)
Propagate replica constructor error to deployment status message and print num retries left (#48531)

🔨 Fixes:

Pending requests that are cancelled before they were assigned to a replica now also return a serve.RequestCancelledError (#48496)

RLlib

💫 Enhancements:

Release test enhancements. (#45803, #48681)
Make opencv-python-headless default over opencv-python (#48776)
Reverse learner queue behavior of IMPALA/APPO (consume oldest batches first, instead of newest, BUT drop oldest batches if queue full). (#48702)
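
The queue behavior described in #48702 (consume the oldest batches first, but drop the oldest when the queue is full) matches what a bounded `collections.deque` does. The sketch below is a minimal plain-Python illustration, not RLlib's learner queue; the capacity of 3 and the integer "batches" are made up.

```python
from collections import deque

# Bounded FIFO: appending to a full deque silently discards the oldest item.
learner_queue = deque(maxlen=3)

for batch_id in range(5):
    learner_queue.append(batch_id)

# Batches 0 and 1 were dropped when the queue overflowed.
print(list(learner_queue))

# The consumer takes the oldest remaining batch first.
consumed = [learner_queue.popleft() for _ in range(len(learner_queue))]
print(consumed)
```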

🔨 Fixes:

Fix torch scheduler stepping and reporting. (#48125)
Fix accumulation of results over n training_step calls within same iteration (new API stack). (#48136)
Various other fixes: #48563, #48314, #48698, #48869.

📖 Documentation:

Upgrade examples script overview page (new API stack). (#48526)
Enable RLlib + Serve example in CI and translate to new API stack. (#48687)

🏗 Architecture refactoring:

Switch new API stack on by default for APPO, IMPALA, BC, MARWIL, and CQL. (#48516, #48599)
Various APPO enhancements (new API stack): Circular buffer (#48798), minor loss math fixes (#48800), target network update logic (#48802), smaller cleanups (#48844).
Remove rllib_contrib from repo. (#48565)

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

[Core] uv runtime env support (#48479, #48486, #48611, #48619, #48632, #48634, #48637, #48670, #48731)
[Core] GCS FT with redis sentinel (#47335)

💫 Enhancements:

[CompiledGraphs] Refine schedule visualization (#48594)

🔨 Fixes:

[CompiledGraphs] Don't persist input_nodes in _CollectiveOperation to avoid wrong understanding about DAGs (#48463)
[Core] Fix Ascend NPU discovery to support 8+ cards per node (#48543)
[Core] Make Placement Group Wildcard and Indexed Resource Assignments Consistent (#48088)
[Core] Stop the gRPC server before shutting down the object store (#48572)

Ray Clusters

🔨 Fixes:

[KubeRay]: Fix ConnectionError on Autoscaler CR lookups in K8s clusters with custom DNS for Kubernetes API. (#48541)

Dashboard

💫 Enhancements:

Add global UTC timezone button in navbar with local storage (#48510)
Add memory graphs optimized for OOM debugging (#48530)
Improve tasks/actors metric naming and add graph for running tasks (#48528)
Add actor PID to dashboard (#48791)

🔨 Fixes:

Fix cell overflow in the Placement Group table (#47323)
Fix Rows Outputted being zero on Ray Data Dashboard (#48745)
Fix confusing dataset operator name (#48805)

Thanks

Thanks to all those who contributed to this release!
@rynewang, @rickyyx, @bveeramani, @marwan116, @simonsays1980, @dayshah, @dentiny, @KepingYan, @mimiliaogo, @kevin85421, @SeaOfOcean, @stephanie-wang, @mohitjain2504, @azayz, @xushiyan, @richardliaw, @can-anyscale, @xingyu-long, @kanwang, @aslonnie, @MortalHappiness, @jjyao, @SumanthRH, @matthewdeng, @alexeykudinkin, @sven1977, @raulchen, @andrewsykim, @zcin, @nadongjun, @hongpeng-guo, @miguelteixeiraa, @saihaj, @khluu, @ArturNiederfahrenhorst, @ryanaoleary, @ltbringer, @pcmoritz, @JoshKarpel, @akyang-anyscale, @frances720, @BeingGod, @edoakes, @Bye-legumes, @Superskyyy, @liuxsh9, @MengjinYan, @ruisearch42, @scottjlee, @angelinalg

13 Nov 19:50

jjyao

ray-2.39.0

5a6c335

Ray-2.39.0

Ray Libraries

Ray Data

🔨 Fixes:

Fixed InvalidObjectError edge case with Dataset.split() (#48130)
Made Concatenator preserve order of concatenated columns (#47997)

📖 Documentation:

Improved documentation around Parquet column and predicate pushdown (#48095)
Marked num_rows_per_file parameter of write APIs as experimental (#48208)
One hot encoder now returns an encoded vector (#48173)
transform_batch no longer fails on missing columns (#48137)

🏗 Architecture refactoring:

Dataset.count() now uses a Count logical operator (#48126)

🗑 Deprecations:

Removed long-deprecated set_progress_bars (#48203)

Ray Train

🔨 Fixes:

Safely check if the storage filesystem is pyarrow.fs.S3FileSystem (#48216)

Ray Tune

🔨 Fixes:

Safely check if the storage filesystem is pyarrow.fs.S3FileSystem (#48216)

Ray Serve

💫 Enhancements:

Cancelled requests now return a serve.RequestCancelledError (#48444)
Exposed application source in app details model (#45522)

🔨 Fixes:

Basic HTTP deployments will now return “Internal Server Error” instead of a traceback to match FastAPI behavior (#48491)
Fixed an issue where high values of max_ongoing_requests couldn’t be reached due to an interaction with core’s max_concurrency (#48274)
Fixed an edge case where pending requests were not canceled properly (#47873)
Removed deprecated API to set route_prefix per-deployment (#48223)

📖 Documentation:

Added ProxyStatus model to reference docs (#48299)
Added ApplicationStatus model to reference docs (#48220)

RLlib

💫 Enhancements:

Upgrade to gymnasium==1.0.0 (support new API for vector env resets). (#48443, #45328)
Add off-policy'ness metric to new API stack. (#48227)
Validate episodes before adding them to the buffer. (#48083)

📖 Documentation:

New example script for custom metrics on EnvRunners (using MetricsLogger API on the new stack). (#47969)
Do-over: New RLlib index page. (#48285, #48442)
Do-over: Example script for AutoregressiveActionsRLM. (#47972)

🏗 Architecture refactoring:

New API stack on by default for PPO. (#48284)
Change config.fault_tolerance default behavior (from recreate_failed_env_runners=False to True). (#48286)

🔨 Fixes:

Various bug and CI fixes: #47993, #48450, #48213
Cleanup evaluation folder (#48493)

Ray Core

🎉 New Features:

[CompiledGraphs] Support all reduce collective in aDAG (#47621)
[CompiledGraphs] Add visualization of compiled graphs (#47958)

💫 Enhancements:

[Distributed Debugger] The distributed debugger can now be used without having to set RAY_DEBUG=1, see #48301 and https://docs.ray.io/en/latest/ray-observability/ray-distributed-debugger.html. If you want to restore the previous behavior and use the CLI based debugger, you need to set RAY_DEBUG=legacy.
[Core] Add more info to each breakpoint for ray debug CLI (#48202)
[Core] Add demands info to GCS debug state (#48115)
[Core] Add PENDING_ACTOR_TASK_ARGS_FETCH and PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY TaskStatus (#48242)
[Core] Add metrics ray_io_context_event_loop_lag_ms. (#47989)
[Core] Better log format when showing the disk size (#46869)
[CompiledGraphs] Support asyncio.gather on multiple CompiledDAGFutures (#47860)
[CompiledGraphs] Raise an exception if a leaf node is found during compilation (#47757)

🔨 Fixes:

[Core] Posts CoreWorkerMemoryStore callbacks onto io_context to fix deadlock (#47833)

Dashboard

🔨 Fixes:

[Dashboard] Reworking dashboard_max_actors_to_cache to RAY_maximum_gcs_destroyed_actor_cached_count (#48229)

Thanks

Many thanks to all those who contributed to this release!

@akyang-anyscale, @rkooo567, @bveeramani, @dayshah, @martinbomio, @khluu, @justinvyu, @slfan1989, @alexeykudinkin, @simonsays1980, @vigneshka, @ruisearch42, @rynewang, @scottjlee, @jjyao, @JoshKarpel, @win5923, @MengjinYan, @MortalHappiness, @ujjawal-khare-27, @zcin, @ccoulombe, @Bye-legumes, @dentiny, @stephanie-wang, @LeoLiao123, @dengwxn, @richo-anyscale, @pcmoritz, @sven1977, @omatthew98, @GeneDer, @srinathk10, @can-anyscale, @edoakes, @kevin85421, @aslonnie, @jeffreyjeffreywang, @ArturNiederfahrenhorst

23 Oct 21:57

aslonnie

ray-2.38.0

385ee46

Ray-2.38.0

Ray Libraries

Ray Data

🎉 New Features:

Add Dataset.rename_columns (#47906)
Basic structured logging (#47210)

💫 Enhancements:

Add partitioning parameter to read_parquet (#47553)
Add SERVICE_UNAVAILABLE to list of retried transient errors (#47673)
Re-phrase the streaming executor current usage string (#47515)
Remove ray.kill in ActorPoolMapOperator (#47752)
Simplify and consolidate progress bar outputs (#47692)
Refactor OpRuntimeMetrics to support properties (#47800)
Refactor plan_write_op and Datasinks (#47942)
Link PhysicalOperator to its LogicalOperator (#47986)
Allow specifying both num_cpus and num_gpus for map APIs (#47995)
Allow specifying insertion index when registering custom plan optimization Rules (#48039)
Add a better framework for substituting logging handlers (#48056)
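
Regarding the retried transient errors mentioned above (#47673), the general pattern is to retry an operation only when the error message contains a known transient marker. The sketch below is a generic plain-Python illustration, not Ray Data's retry logic; `call_with_retries`, `flaky_read`, the error text, and the backoff values are all hypothetical.

```python
import time


def call_with_retries(fn, transient_markers, max_attempts=3, base_delay_s=0.0):
    """Call fn, retrying only on errors whose message contains a transient marker."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            is_transient = any(marker in str(exc) for marker in transient_markers)
            if not is_transient or attempt == max_attempts:
                raise
            # Exponential backoff between attempts (zero by default in this sketch).
            time.sleep(base_delay_s * 2 ** (attempt - 1))


attempts = {"count": 0}


def flaky_read():
    # Simulated read that fails twice with a transient error, then succeeds.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise OSError("SERVICE_UNAVAILABLE: please try again")
    return "block"


result = call_with_retries(flaky_read, ["SERVICE_UNAVAILABLE"])
print(result, "after", attempts["count"], "attempts")
```

Errors that match no marker are re-raised immediately, so genuine failures are not hidden behind retries.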

🔨 Fixes:

Fix bug where Ray Data incorrectly emits progress bar warning (#47680)
Yield remaining results from async map_batches (#47696)
Fix event loop mismatch with async map (#47907)
Make sure num_gpus provided to Ray Data is appropriately passed to ray.remote call (#47768)
Fix unequal partitions when grouping by multiple keys (#47924)
Fix reading multiple parquet files with ragged ndarrays (#47961)
Remove unneeded test case (#48031)
Add better JSON checking in test logging (#48036)
Fix bug with inserting custom optimization rule at index 0 (#48051)
Fix logging output from write_xxx APIs (#48096)

📖 Documentation:

Add docs section for Ray Data progress bars (#47804)
Add reference to parquet predicate pushdown (#47881)
Add tip about how to understand map_batches format (#47394)

Ray Train

🏗 Architecture refactoring:

Remove deprecated mosaic and sklearn trainer code (#47901)

Ray Tune

🔨 Fixes:

Fix WandbLoggerCallback to reuse actors upon restore (#47985)

Ray Serve

🔨 Fixes:

Stop scheduling task early when requests have been canceled (#47847)

RLlib

🎉 New Features:

Enable cloud checkpointing. (#47682)

💫 Enhancements:

PPO on new API stack now shuffles batches properly before each epoch. (#47458)
Other enhancements: #47705, #47501, #47731, #47451, #47830, #47970, #47157

🔨 Fixes:

Fix spot node preemption problem (RLlib now runs stably with EnvRunner workers on spot nodes) (#47940)
Fix action masking example. (#47817)
Various other fixes: #47973, #46721, #47914, #47880, #47304, #47686

🏗 Architecture refactoring:

Switch on new API stack by default for SAC and DQN. (#47217)
Remove Tf support on new API stack for PPO/IMPALA/APPO (only DreamerV3 on new API stack remains with tf now). (#47892)
Discontinue support for "hybrid" API stack (using RLModule + Learner, but still on RolloutWorker and Policy) (#46085)
RLModule (new API stack) refinements: #47884, #47885, #47889, #47908, #47915, #47965, #47775

📖 Documentation:

Add new API stack migration guide. (#47779)
New API stack example script: BC pre training, then PPO finetuning using same RLModule class. (#47838)
New API stack: Autoregressive actions example. (#47829)
Remove old API stack connector docs entirely. (#47778)

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

CompiledGraphs: support multi readers in multi node when DAG is created from an actor (#47601)

💫 Enhancements:

Add a flag to raise exception for out of band serialization of ObjectRef (#47544)
Store each GCS table in its own Redis Hash (#46861)
Decouple create worker vs pop worker request. (#47694)
Add metrics for GCS jobs (#47793)

🔨 Fixes:

Fix broken dashboard cluster page when there are dead nodes (#47701)
Fix the ray_tasks{State="PENDING_ARGS_FETCH"} metric counting (#47770)
Separate the attempt_number from the task_status in memory summary and object list (#47818)
Fix object reconstruction hang on arguments pending creation (#47645)
Fix check failure: sync_reactors_.find(reactor->GetRemoteNodeID()) == sync_reactors_.end() (#47861)
Fix check failure RAY_CHECK(it != current_tasks_.end()); (#47659)

📖 Documentation:

KubeRay docs: Add docs for YuniKorn Gang scheduling (#47850)

Dashboard

💫 Enhancements:

Performance improvements for large scale clusters (#47617)

🔨 Fixes:

Placement group and required resources not showing correctly in dashboard (#47754)

Thanks

Many thanks to all those who contributed to this release!
@GeneDer, @rkooo567, @dayshah, @saihaj, @nikitavemuri, @bill-oconnor-anyscale, @WeichenXu123, @can-anyscale, @jjyao, @edoakes, @kekulai-fredchang, @bveeramani, @alexeykudinkin, @raulchen, @khluu, @sven1977, @ruisearch42, @dentiny, @MengjinYan, @Mark2000, @simonsays1980, @rynewang, @PatricYan, @zcin, @sofianhnaide, @matthewdeng, @dlwh, @scottjlee, @MortalHappiness, @kevin85421, @win5923, @aslonnie, @prithvi081099, @richardsliu, @milesvant, @omatthew98, @Superskyyy, @pcmoritz

24 Sep 23:37

khluu

ray-2.37.0

1b620f2

Ray-2.37.0

Ray Libraries

Ray Data

💫 Enhancements:

Simplify custom metadata provider API (#47575)
Change counts of metrics to rates of metrics (#47236)
Throw exception for non-streaming HF datasets with "override_num_blocks" argument (#47559)
Refactor custom optimizer rules (#47605)

🔨 Fixes:

Remove ineffective retry code in plan_read_op (#47456)
Fix incorrect pending task size if outputs are empty (#47604)

Ray Train

💫 Enhancements:

Update run status and add stack trace to TrainRunInfo (#46875)

Ray Serve

💫 Enhancements:

Allow control of some serve configuration via env vars (#47533)
Faster detection of dead replicas (#47237)

🔨 Fixes:

Fix component id logging field (#47609)

RLlib

💫 Enhancements:

New API stack:
- Add restart-failed-env option to EnvRunners. (#47608)
- Offline RL: Store episodes in state form. (#47294)
- Offline RL: Replace GAE in MARWILOfflinePreLearner with GeneralAdvantageEstimation connector in learner pipeline. (#47532)
- Off-policy algos: Add episode sampling to EpisodeReplayBuffer. (#47500)
- RLModule APIs: Add SelfSupervisedLossAPI for RLModules that bring their own loss and InferenceOnlyAPI. (#47581, #47572)

Ray Core

💫 Enhancements:

[aDAG] Allow custom NCCL group for aDAG (#47141)
[aDAG] support buffered input (#47272)
[aDAG] Support multi node multi reader (#47480)
[Core] Make is_gpu, is_actor, root_detached_id fields late bind to workers. (#47212)
[Core] Reconstruct actor to run lineage reconstruction triggered actor task (#47396)
[Core] Optimize GetAllJobInfo API for performance (#47530)

🔨 Fixes:

[aDAG] Fix ranks ordering for custom NCCL group (#47594)

Ray Clusters

📖 Documentation:

[KubeRay] add a guide for deploying vLLM with RayService (#47038)

Thanks

Many thanks to all those who contributed to this release!
@ruisearch42, @andrewsykim, @timkpaine, @rkooo567, @WeichenXu123, @GeneDer, @sword865, @simonsays1980, @angelinalg, @sven1977, @jjyao, @woshiyyya, @aslonnie, @zcin, @omatthew98, @rueian, @khluu, @justinvyu, @bveeramani, @nikitavemuri, @chris-ray-zhang, @liuxsh9, @xingyu-long, @peytondmurray, @rynewang

23 Sep 18:47

khluu

ray-2.36.1

999f766

Ray-2.36.1

Ray Core

🔨 Fixes:

Fix broken dashboard cluster page when there are dead nodes (#47701)
Fix broken dashboard worker page (#47714)

17 Sep 18:30

GeneDer

ray-2.36.0

85d98e1

Ray-2.36.0

Ray Libraries

Ray Data

💫 Enhancements:

Remove limit on number of tasks launched per scheduling step (#47393)
Allow user-defined Exception to be caught. (#47339)

🔨 Fixes:

Display pending actors separately in the progress bar and not count them towards running resources (#46384)
Fix bug where arrow_parquet_args aren't used (#47161)
Skip empty JSON files in read_json() (#47378)
Remove remote call for initializing Datasource in read_datasource() (#47467)
Remove dead from_*_operator modules (#47457)
Release test fixes:
- Add AWS ACCESS_DENIED as retryable exception for multi-node Data+Train benchmarks (#47232)
- Get AWS credentials with boto (#47352)
- Use worker node instead of head node for read_images_comparison_microbenchmark_single_node release test (#47228)

📖 Documentation:

Add docstring to explain Dataset.deserialize_lineage (#47203)
Add a comment explaining the bundling behavior for map_batches with default batch_size (#47433)

Ray Train

💫 Enhancements:

Decouple device-related modules and add Huawei NPU support to Ray Train (#44086)

🔨 Fixes:

Update TORCH_NCCL_ASYNC_ERROR_HANDLING env var (#47292)

📖 Documentation:

Add missing Train public API reference (#47134)

Ray Tune

📖 Documentation:

Add missing Tune public API references (#47138)

Ray Serve

💫 Enhancements:

Mark proxy as unready when its routers are aware of zero replicas (#47002)
Setup default serve logger (#47229)

🔨 Fixes:

Allow get_serve_logs_dir to run outside of Ray's context (#47224)
Use serve logger name for logs in serve (#47205)

📖 Documentation:

[HPU] [Serve] [experimental] Add vllm HPU support in vllm example (#45893)

🏗 Architecture refactoring:

Remove support for nested DeploymentResponses (#47209)

RLlib

🎉 New Features:

New API stack: Add CQL algorithm. (#47000, #47402)
New API stack: Enable GPU and multi-GPU support for DQN/SAC/CQL. (#47179)

💫 Enhancements:

New API stack: Offline RL enhancements: #47195, #47359
Enhance new API stack stability: #46324, #47196, #47245, #47279
Fix large batch size for synchronous algos (e.g. PPO) after EnvRunner failures. (#47356)
Add torch.compile config options to old API stack. (#47340)
Add kwargs to torch.nn.parallel.DistributedDataParallel (#47276)
Enhanced CI stability: #47197, #47249

📖 Documentation:

New API stack example scripts:
- Float16 training example script. (#47362)
- Mixed precision training example script (#47116)
- ModelV2 -> RLModule wrapper for migrating to new API stack. (#47425)
Remove "new API stack experimental" hint from docs. (#47301)

🏗 Architecture refactoring:

Remove 2nd Learner ConnectorV2 pass from PPO (#47401)
Add separate learning rates for policy and alpha to SAC. (#47078)

🔨 Fixes:

Various bug fixes: #47401, #47194, #47259, #47271, #47277, #47382

Ray Core

💫 Enhancements:

[ADAG] Raise proper error message for nccl within the same actor (#47250)
[ADAG] Support multi-read of the same shm channel (#47311)
Log why core worker is not idle during HandleExit (#47300)
Add PREPARED state for placement groups in GCS for better fault tolerance. (#46858)

🔨 Fixes:

Fix ray_unintentional_worker_failures_total to only count unintentional worker failures (#47368)
Fix runtime env race condition when uploading the same package concurrently (#47482)

Dashboard

🔨 Fixes:

Performance optimizations for dashboard backend logic (#47392, #47367, #47160, #47213)
Refactor to simplify dashboard backend logic (#47324)

Docs

💫 Enhancements:

Add sphinx-autobuild and documentation for make local, which speeds up local docs builds (#47275)
Add Algolia search to docs (#46477)
Update PyTorch Mnist Training doc for KubeRay 1.2.0 (#47321)
Document the life-cycle policy of Ray APIs

Thanks

Many thanks to all those who contributed to this release!
@GeneDer, @Bye-legumes, @nikitavemuri, @kevin85421, @MortalHappiness, @LeoLiao123, @saihaj, @rmcsqrd, @bveeramani, @zcin, @matthewdeng, @raulchen, @mattip, @jjyao, @ruisearch42, @scottjlee, @can-anyscale, @khluu, @aslonnie, @rynewang, @edoakes, @zhanluxianshen, @venkatram-dev, @c21, @allenyin55, @alexeykudinkin, @snehakottapalli, @BitPhinix, @hongchaodeng, @dengwxn, @liuxsh9, @simonsays1980, @peytondmurray, @KepingYan, @bryant1410, @woshiyyya, @sven1977

28 Aug 00:11

khluu

ray-2.35.0

c5d536d

Ray-2.35.0

Notice: Starting with this release, pip install ray[all] no longer includes ray[cpp] and does not install the corresponding ray-cpp package. To install everything including ray-cpp, use pip install ray[cpp-all] instead.
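
After upgrading, one way to confirm whether the ray-cpp package is present in an environment is to query the installed distributions. The sketch below uses only the Python standard library; `installed_version` is a hypothetical helper, and the distribution names `ray` and `ray-cpp` are taken from the notice above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


for name in ("ray", "ray-cpp"):
    version = installed_version(name)
    print(f"{name}: {version or 'not installed'}")
```

If `ray-cpp` is reported as not installed and you need it, reinstall with the `cpp-all` extra mentioned in the notice.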

Ray Libraries

Ray Data

🎉 New Features:

Upgrade supported Arrow version from 16 to 17 (#47034)
Add support for reading from Iceberg (#46889)

💫 Enhancements:

Various Progress Bar UX improvements (#46816, #46801, #46826, #46692, #46699, #46974, #46928, #47029, #46924, #47120, #47095, #47106)
Try get size_bytes from metadata and consolidate metadata methods (#46862)
Improve warning message when read task is large (#46942)
Extend API to enable passing sample weights via ray.dataset.to_tf (#45701)
Add a parameter to allow overriding LanceDB scanner options (#46975)
Add failure retry logic for read_lance (#46976)
Clarify warning for reading old Parquet data (#47049)
Move datasource implementations to _internal subpackage (#46825)
Handle logs from tensor extensions (#46943)

🔨 Fixes:

Change type of DataContext.retried_io_errors from tuple to list (#46884)
Make Parquet tests more robust and expose Parquet logic (#46944)
Change pickling log level from warning to debug (#47032)
Add validation for shuffle arg (#47055)
Fix validation bug when size=0 in ActorPoolStrategy (#47072)
Fix exception in async map (#47110)
Fix wrong metrics group for Object Store Memory metrics on Ray Data Dashboard (#47170)
Handle errors in SplitCoordinator when generating a new epoch (#47176)

📖 Documentation:

Auto-gen GroupedData api (#46925)
Fix signature of Rule.plan (#47094)

Ray Train

💫 Enhancements:

Updates to support xgboost==2.1.0 (#46667)
Add hardware stats (#46719)

Ray Tune

🔨 Fixes:

[RLlib; Tune] Fix WandB metric overlap after restore from checkpoint. (#46897)

Ray Serve

💫 Enhancements:

Improved handling of replica death and replica unavailability in deployment handle routers before controller restarts replica (#47008)
Eagerly create routers in proxy for better GCS fault tolerance (#47031)
Immediately send ping in router when receiving new replica set (#47053)

🏗 Architecture refactoring:

Deprecate passing arguments that contain DeploymentResponses in nested objects to downstream deployment handle calls (#46806)

RLlib

🎉 New Features:

Offline RL on the new API stack:
- Record offline data (#46818, #47046, #47133, #47155) and support to directly read from episodes. (#46865)
- RLUnplugged example. (#46792)
- Progress on BC/MARWIL migration: #44970, #47154, #46799
- Progress on CQL migration: #46969, #47105

💫 Enhancements:

Add ObservationPreprocessor (ConnectorV2). (#47077)

🔨 Fixes:

New API stack: Fix IMPALA/APPO + LSTM for single- and multi-GPU. (#47132, #47158)
Various bug fixes: #46898, #47047, #46963, #47021, #46897
Add more control to Algorithm.add_module/policy methods. (#46932, #46836)

📖 Documentation:

Example scripts for new API stack:
- Curiosity (inverse dynamics model-based) RLModule example. (#46841)
- Add example script for Env with protobuf observation space. (#47071)
New API stack documentation:
- Cleanup old API stack docs (rllib-dev.rst). (#47172)
- Episodes (SingleAgentEpisode). (#46985)
- Redo rllib-algorithms.rst page. (#46916)

🏗 Architecture refactoring:

Rename MultiAgent...RLModule... into MultiRL...Module for more generality. (#46840)
Add learner_only flag to RLModuleConfig/Spec and simplify creation of RLModule specs from algo-config. (#46900)

Ray Core

💫 Enhancements:

Emit total lineage bytes metrics (#46725)
Adding accelerator type H100 (#46823)
More structured logging in core worker (#46906)
Change all callbacks to be moved rather than copied, to save copies. (#46971)
Add ray[adag] option to pip install (#47009)

🔨 Fixes:

Fix dashboard process reporting on Windows (#45578)
Fix Ray-on-Spark cluster crashing bug when user cancels cell execution (#46899)
Fix PinExistingReturnObject segfault by passing owner_address (#46973)
Fix raylet CHECK failure from runtime env creation failure. (#46991)
Fix typo in memray command (#47006)
[ADAG] Fix for asyncio outputs (#46845)

📖 Documentation:

Clarify behavior of placement_group_capture_child_tasks in docs (#46885)
Update ray.available_resources() docstring (#47018)

🏗 Architecture refactoring:

Async APIs for the New GcsClient. (#46788)
Replace GCS stubs in the dashboard to use NewGcsAioClient. (#46846)

Dashboard

💫 Enhancements:

Polish and minor improvements to the Serve page (#46811)

🔨 Fixes:

Fix CPU/GPU/RAM not being reported correctly on Windows (#44578)

Docs

💫 Enhancements:

Add more information about developer tooling for docs contributions, including an esbonio section (#46636)

🔨 Fixes:

Use PyData Sphinx theme version switcher (#46936)

Thanks

Many thanks to all those who contributed to this release!
@simonsays1980, @bveeramani, @tungh2, @zcin, @xingyu-long, @WeichenXu123, @aslonnie, @MaxVanDijck, @can-anyscale, @galenhwang, @omatthew98, @matthewdeng, @raulchen, @sven1977, @shrekris-anyscale, @deepyaman, @alexeykudinkin, @stephanie-wang, @kevin85421, @ruisearch42, @hongchaodeng, @khluu, @alanwguo, @hongpeng-guo, @saihaj, @Superskyyy, @tespent, @slfan1989, @justinvyu, @rynewang, @nikitavemuri, @amogkam, @mattip, @dev-goyal, @ryanaoleary, @peytondmurray, @edoakes, @venkatajagannath, @jjyao, @cristianjd, @scottjlee, @Bye-legumes

31 Jul 18:02

can-anyscale

ray-2.34.0

fc87217

Release 2.34.0 Notes

Ray Libraries

Ray Data

💫 Enhancements:

Add better support for UDFs that return lists of datetime objects (#46762)

🔨 Fixes:

Remove read task warning if size bytes not set in metadata (#46765)

📖 Documentation:

Fix read_tfrecords() docstring to display tfx-bsl tip (#46717)
Update Dataset.zip() docs (#46757)

Ray Train

🔨 Fixes:

Sort workers by node ID rather than by node IP (#46163)

🏗 Architecture refactoring:

Remove dead RayDatasetSpec (#46764)

RLlib

🎉 New Features:

Offline RL support on new API stack:
- Initial design for Ray-Data based offline RL Algos (on new API stack). (#44969)
- Add user-defined schemas for data loading. (#46738)
- Make the data pipeline more configurable and tunable for users. (#46777)

💫 Enhancements:

Move DQN into the TargetNetworkAPI (and deprecate RLModuleWithTargetNetworksInterface). (#46752)

🔨 Fixes:

NumPy version fix: Rename all np.product usage to np.prod (#46317)

📖 Documentation:

Examples for new API stack: Add 2 (count-based) curiosity examples. (#46737)
Remove RLlib CLI from docs (soon to be deprecated and replaced by the Python API). (#46724)

🏗 Architecture refactoring:

Cleanup, rename, clarify: Algorithm.workers/evaluation_workers, local_worker(), etc. (#46726)

Ray Core

🏗 Architecture refactoring:

New Python GcsClient binding (#46186)

Many thanks to all those who contributed to this release! @KyleKoon, @ruisearch42, @rynewang, @sven1977, @saihaj, @aslonnie, @bveeramani, @akshay-anyscale, @kevin85421, @omatthew98, @anyscalesam, @MaxVanDijck, @justinvyu, @simonsays1980, @can-anyscale, @peytondmurray, @scottjlee

25 Jul 20:28

jjyao

ray-2.33.0

914af09

Ray-2.33.0

Ray Libraries

Ray Core

💫 Enhancements:

Add "last exception" to error message when GCS connection fails in ray.init() (#46516)

🔨 Fixes:

Add object back to memory store when object recovery is skipped (#46460)
Task status should start with PENDING_ARGS_AVAIL when a task is retried (#46494)
Fix ObjectFetchTimedOutError (#46562)
Make working_dir support files created before 1980 (#46634)
Allow full path in conda runtime env. (#45550)
Fix worker launch time formatting in state api (#43516)
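For context on the pre-1980 working_dir fix: the ZIP format cannot store timestamps earlier than 1980, so packaging such files with Python's zipfile module fails by default. The following is a minimal sketch of one way to tolerate old files, using the standard library's strict_timestamps=False option. It is an illustration only and is not claimed to be the change made in #46634; the function name and file names are hypothetical.

```python
import os
import tempfile
import zipfile


def zip_old_file(directory: str) -> tuple:
    """Zip a file with a pre-1980 mtime and return the stored timestamp."""
    src = os.path.join(directory, "old.txt")  # hypothetical file name
    with open(src, "w") as f:
        f.write("legacy data")
    # Backdate the file to 1973, which the ZIP format cannot represent.
    os.utime(src, (100_000_000, 100_000_000))

    archive = os.path.join(directory, "pkg.zip")
    # strict_timestamps=False clamps pre-1980 timestamps to 1980-01-01
    # instead of raising ValueError.
    with zipfile.ZipFile(archive, "w", strict_timestamps=False) as zf:
        zf.write(src, arcname="old.txt")
    with zipfile.ZipFile(archive) as zf:
        return zf.getinfo("old.txt").date_time


with tempfile.TemporaryDirectory() as tmp:
    stamp = zip_old_file(tmp)
```

The trade-off is that the original modification time is lost in the archive, which is usually acceptable for a packaged working directory.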

Ray Data

🎉 New Features:

Deprecate Dataset.get_internal_block_refs() (#46455)
Add read API for reading Databricks table with Delta Sharing (#46072)
Add support for objects to Arrow blocks (#45272)

💫 Enhancements:

Change offsets to int64 and change to LargeList for ArrowTensorArray (#45352)
Prevent from_pandas from combining input blocks (#46363)
Update Dataset.count() to avoid unnecessarily keeping BlockRefs in-memory (#46369)
Use Set to fix inefficient iteration over Arrow table columns (#46541)
Add AWS Error UNKNOWN to list of retried write errors (#46646)
Always print traceback for internal exceptions (#46647)
Allow unknown estimate of operator output bundles and ProgressBar totals (#46601)
Improve filesystem retry coverage (#46685)

🔨 Fixes:

Replace lambda mutable default arguments (#46493)

📖 Documentation:

Auto-generate Dataset API documentation (#46557)
Update outdated ExecutionPlan docstring (#46638)

Ray Train

💫 Enhancements:

Update run status and actor status for train runs. (#46395)

🔨 Fixes:

Replace lambda default arguments (#46576)

📖 Documentation:

Add MNIST training using KubeRay doc page (#46123)
Add example of pre-training Llama model on Intel Gaudi (#45459)
Fix tensorflow example by using ScalingConfig (#46565)

Ray Tune

🔨 Fixes:

Replace lambda default arguments (#46596)

Ray Serve

🎉 New Features:

Fully deprecate target_num_ongoing_requests_per_replica and max_concurrent_queries, replaced by target_ongoing_requests and max_ongoing_requests, respectively (#46392 and #46427)
Configure the task launched by the controller to build an application with Serve’s logging config (#46347)
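For users migrating existing Serve configs, the rename maps max_concurrent_queries to max_ongoing_requests and target_num_ongoing_requests_per_replica to target_ongoing_requests. The sketch below rewrites the old names in a plain config dictionary. The helper migrate_options and the RENAMED_OPTIONS table are hypothetical and not part of Ray Serve; they assume the config is an ordinary nested dict.

```python
# Hypothetical migration helper; not part of the Ray Serve API.
RENAMED_OPTIONS = {
    "max_concurrent_queries": "max_ongoing_requests",
    "target_num_ongoing_requests_per_replica": "target_ongoing_requests",
}


def migrate_options(options: dict) -> dict:
    """Return a copy of `options` with deprecated keys renamed, recursively."""
    migrated = {}
    for key, value in options.items():
        if isinstance(value, dict):
            # Nested sections (for example an autoscaling section) are migrated too.
            value = migrate_options(value)
        new_key = RENAMED_OPTIONS.get(key, key)
        if new_key in migrated:
            # Both the old and the new name were supplied; refuse to guess.
            raise ValueError(f"conflicting settings for {new_key!r}")
        migrated[new_key] = value
    return migrated


old_config = {
    "max_concurrent_queries": 5,
    "autoscaling_config": {
        "target_num_ongoing_requests_per_replica": 2,
        "min_replicas": 1,
    },
}
new_config = migrate_options(old_config)
```

Raising on a conflict rather than silently preferring one name keeps a half-migrated config from changing behavior unnoticed.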

RLlib

💫 Enhancements:

Move sampling coordination for batch_mode=complete_episodes to synchronous_parallel_sample. (#46321)
Enable complex action spaces with stateful modules. (#46468)

🏗 Architecture refactoring:

Enable multi-learner setup for hybrid stack BC. (#46436)
Introduce Checkpointable API for RLlib components and subcomponents. (#46376)

🔨 Fixes:

Replace Mapping type hint with Dict (#46474)

📖 Documentation:

More example scripts for new API stack: two separate optimizers with different learning rates (#46540) and a custom loss function (#46445)

Dashboard

🔨 Fixes:

Fix task end time showing the incorrect time (#46439)
Fix poor row spacing in the Events table (#46701)
Fix UI bugs in the Serve dashboard page (#46599)

Thanks

Many thanks to all those who contributed to this release!

@alanwguo, @hongchaodeng, @anyscalesam, @brucebismarck, @bt2513, @woshiyyya, @terraflops1048576, @lorenzoritter, @omrishiv, @davidxia, @cchen777, @nono-Sang, @jackhumphries, @aslonnie, @JoshKarpel, @zjregee, @bveeramani, @khluu, @Superskyyy, @liuxsh9, @jjyao, @ruisearch42, @sven1977, @harborn, @saihaj, @zcin, @can-anyscale, @veekaybee, @chungen04, @WeichenXu123, @GeneDer, @sergey-serebryakov, @Bye-legumes, @scottjlee, @rynewang, @kevin85421, @cristianjd, @peytondmurray, @MortalHappiness, @MaxVanDijck, @simonsays1980, @mjovanovic9999
