Search before asking
I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
What Happened
The Ray cluster has worker max replicas set to 10, and each pod takes up an entire GPU instance with 1 GPU. I submit two jobs, each of which:
- creates a placement group with the bundles [{"CPU":1,"GPU":1}]*10 and strategy STRICT_SPREAD, then calls pg.wait(timeout_seconds=600) before calling map_batches;
- calls Dataset.map_batches with num_gpus=1 using the placement group scheduling strategy (a minimal sketch of this pattern follows below).
Together the two jobs request 20 GPU nodes. The autoscaler container fails and the entire head pod crashes, bringing down the whole cluster.
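For concreteness, this is a hedged, minimal sketch of the per-job pattern described above; the dataset, function, and variable names are illustrative, not the actual job code.
# Hypothetical per-job sketch (illustrative only); the bundle count and
# resources match the description above, not a verified reproduction.
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# 10 bundles with STRICT_SPREAD -> 10 distinct 1-GPU nodes per job.
pg = placement_group([{"CPU": 1, "GPU": 1}] * 10, strategy="STRICT_SPREAD")
pg.wait(timeout_seconds=600)

ds = ray.data.from_items([{"col1": i} for i in range(1000)])
ds = ds.map_batches(
    lambda batch: batch,  # placeholder transform
    num_gpus=1,
    scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg),
)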
What I expect to happen
The autoscaler fails to acquire the requested resources but does not crash, so the placement group continues waiting until the resources become available on the cluster.
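For illustration, a hedged sketch of the client-side behavior I would expect under that assumption: pg.wait() simply returns False when the timeout expires, so the job keeps waiting instead of the head pod going down.
# Sketch of the *expected* behavior (not current behavior): when the cluster
# cannot satisfy every bundle, pg.wait() times out and returns False, letting
# the caller keep waiting or retry, while the autoscaler logs the
# unsatisfiable request instead of crashing.
import ray
from ray.util.placement_group import placement_group

ray.init()
pg = placement_group([{"CPU": 1, "GPU": 1}] * 10, strategy="STRICT_SPREAD")
while not pg.wait(timeout_seconds=600):
    print("Placement group not ready; resources unavailable, still waiting...")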
Reproduction script
Set up a KubeRay cluster with workerGroupSpec.maxReplicas=2, where each node has only one available GPU.
Run the following:
import ray
from ray.util.placement_group import (
    placement_group,
    remove_placement_group,
)
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

# concurrency only used to simulate two jobs being submitted at once asynchronously
from concurrent.futures import ThreadPoolExecutor, wait

pool = ThreadPoolExecutor(max_workers=2)

ds1 = ray.data.from_items(
    [{"col1": i, "col2": i * 2} for i in range(1000)]
)
ds2 = ray.data.from_items(
    [{"col1": i, "col2": i * 3} for i in range(1000)]
)

def run_transform(batch):
    batch["col2"] *= 2
    return batch

# First job: 1 bundle; each bundle needs a whole 1-GPU node under STRICT_SPREAD.
pg_bundle_1 = [{"CPU": 1, "GPU": 1}] * 1
pg1 = placement_group(pg_bundle_1, strategy="STRICT_SPREAD")
scheduling_strategy_1 = PlacementGroupSchedulingStrategy(pg1)

jobs = []
jobs.append(pool.submit(
    ds1.map_batches,
    run_transform,
    batch_size=1000,
    num_cpus=1,
    num_gpus=1,
    concurrency=3,
    scheduling_strategy=scheduling_strategy_1,
))

# Second job: 2 bundles -> 2 more whole nodes (1 + 2 = 3 > maxReplicas = 2).
pg_bundle_2 = [{"CPU": 1, "GPU": 1}] * 2
pg2 = placement_group(pg_bundle_2, strategy="STRICT_SPREAD")
scheduling_strategy_2 = PlacementGroupSchedulingStrategy(pg2)

jobs.append(pool.submit(
    ds2.map_batches,
    run_transform,
    batch_size=500,
    num_cpus=1,
    num_gpus=1,
    concurrency=3,
    scheduling_strategy=scheduling_strategy_2,
))

wait(jobs)
The key is that each bundle item must take up a whole node: no single bundle requests more nodes than max_replicas, but the sum of all bundles does (here 1 + 2 = 3 bundles versus maxReplicas = 2).
Anything else
I did notice that KubeRay does not seem to use the max_workers cluster config anywhere; perhaps it needs to be added to the operator when launching the head pod and exposed in the manifest.
Are you willing to submit a PR?
Yes I am willing to submit a PR!
Hello @HarryCaveMan, I'm trying to replicate this issue. Can you tell me which version of Ray you used, and how you ran the reproduction script? Did you use something like "kubectl exec"? Thanks a lot!
I will try to get you a minimal repo with a minimal manifest that you can run with kubectl apply.