Search before asking
I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
What Happened
The Ray cluster has worker max replicas set to 10, and each pod takes up an entire GPU instance with 1 GPU. I submit two jobs, each of which:
- creates a placement group with the bundles [{"CPU":1,"GPU":1}]*10 and strategy STRICT_SPREAD, then calls pg.wait(timeout_seconds=600) before calling map_batches;
- calls Dataset.map_batches with num_gpus=1 using the placement group scheduling strategy (a minimal sketch of this pattern follows below).
Together the two jobs request 20 GPU nodes. The autoscaler container fails and the entire head pod crashes, bringing down the whole cluster.
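For concreteness, this is a hedged, minimal sketch of the per-job pattern described above; the dataset, function, and variable names are illustrative, not the actual job code.
# Hypothetical per-job sketch (illustrative only); the bundle count and
# resources match the description above, not a verified reproduction.
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# 10 bundles with STRICT_SPREAD -> 10 distinct 1-GPU nodes per job.
pg = placement_group([{"CPU": 1, "GPU": 1}] * 10, strategy="STRICT_SPREAD")
pg.wait(timeout_seconds=600)

ds = ray.data.from_items([{"col1": i} for i in range(1000)])
ds = ds.map_batches(
    lambda batch: batch,  # placeholder transform
    num_gpus=1,
    scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg),
)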
What I expect to happen
The autoscaler fails to acquire the requested resources but does not crash, so the placement group continues waiting until the resources become available on the cluster.
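For illustration, a hedged sketch of the client-side behavior I would expect under that assumption: pg.wait() simply returns False when the timeout expires, so the job keeps waiting instead of the head pod going down.
# Sketch of the *expected* behavior (not current behavior): when the cluster
# cannot satisfy every bundle, pg.wait() times out and returns False, letting
# the caller keep waiting or retry, while the autoscaler logs the
# unsatisfiable request instead of crashing.
import ray
from ray.util.placement_group import placement_group

ray.init()
pg = placement_group([{"CPU": 1, "GPU": 1}] * 10, strategy="STRICT_SPREAD")
while not pg.wait(timeout_seconds=600):
    print("Placement group not ready; resources unavailable, still waiting...")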
Reproduction script
Set up a KubeRay cluster with workerGroupSpec.maxReplicas=2, where each node has only one available GPU.
Run the following:
import ray
from ray.util.placement_group import (
    placement_group,
    remove_placement_group,
)
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

# concurrency only used to simulate two jobs being submitted at once asynchronously
from concurrent.futures import ThreadPoolExecutor, wait

pool = ThreadPoolExecutor(max_workers=2)

ds1 = ray.data.from_items(
    [{"col1": i, "col2": i * 2} for i in range(1000)]
)
ds2 = ray.data.from_items(
    [{"col1": i, "col2": i * 3} for i in range(1000)]
)

def run_transform(batch):
    batch["col2"] *= 2
    return batch

# First job: 1 bundle; each bundle needs a whole 1-GPU node under STRICT_SPREAD.
pg_bundle_1 = [{"CPU": 1, "GPU": 1}] * 1
pg1 = placement_group(pg_bundle_1, strategy="STRICT_SPREAD")
scheduling_strategy_1 = PlacementGroupSchedulingStrategy(pg1)

jobs = []
jobs.append(pool.submit(
    ds1.map_batches,
    run_transform,
    batch_size=1000,
    num_cpus=1,
    num_gpus=1,
    concurrency=3,
    scheduling_strategy=scheduling_strategy_1,
))

# Second job: 2 bundles -> 2 more whole nodes (1 + 2 = 3 > maxReplicas = 2).
pg_bundle_2 = [{"CPU": 1, "GPU": 1}] * 2
pg2 = placement_group(pg_bundle_2, strategy="STRICT_SPREAD")
scheduling_strategy_2 = PlacementGroupSchedulingStrategy(pg2)

jobs.append(pool.submit(
    ds2.map_batches,
    run_transform,
    batch_size=500,
    num_cpus=1,
    num_gpus=1,
    concurrency=3,
    scheduling_strategy=scheduling_strategy_2,
))

wait(jobs)
The key is that each bundle item must take up a whole node: no single bundle requests more nodes than max_replicas, but the sum of all bundles does (here 1 + 2 = 3 bundles versus maxReplicas = 2).
Anything else
I did notice that KubeRay does not seem to use the max_workers cluster config anywhere; perhaps it needs to be added to the operator when launching the head pod and exposed in the manifest.
Are you willing to submit a PR?
Yes I am willing to submit a PR!
Hello @HarryCaveMan, I'm trying to replicate this issue. Can you tell me which version of Ray you used, and how you ran the reproduction script? Did you use something like "kubectl exec"? Thanks a lot!
I will try to get you a minimal repo with a minimal manifest that you can run with kubectl apply.