[Core] GPU placement group doesn't honor the bundle index #29811

Closed
jjyao opened this issue Oct 28, 2022 · 7 comments · Fixed by #48088
Assignees

Labels
api-bug: Bug in which APIs behavior is wrong
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
core-scheduler
P2: Important issue, but not time-critical

Comments

jjyao (Collaborator) commented Oct 28, 2022

What happened + What you expected to happen

For the reproduction script below, I would expect each bundle to be tied to a distinct GPU, so the output should be "0, 1, 2, 3", but currently it is "0, 0, 1, 1".

Versions / Dependencies

master

Reproduction script

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(num_gpus=4)

@ray.remote(num_gpus=0.5, num_cpus=0)
def gpu_task():
    import os
    print(f"task is running on GPU {os.environ['CUDA_VISIBLE_DEVICES']}")

pg = placement_group([{'GPU': 1}] * 4, strategy="SPREAD")
ray.get(pg.ready())
futures = [gpu_task.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg, placement_group_bundle_index=i)).remote() for i in range(4)]
ray.get(futures)
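
As a side note, ray.get_gpu_ids() is an equivalent way to check the assignment without reading CUDA_VISIBLE_DEVICES directly; it returns the IDs of the GPUs Ray assigned to the running task. A minimal variant of the task (only a sketch; the script above is the actual repro):

@ray.remote(num_gpus=0.5, num_cpus=0)
def gpu_task_ids():
    # ray.get_gpu_ids() lists the GPUs assigned to this task by Ray.
    print(f"task is running on GPU(s) {ray.get_gpu_ids()}")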

Issue Severity

No response

jjyao added the bug, triage, and core labels Oct 28, 2022
jjyao (Collaborator, Author) commented Oct 28, 2022

cc @rkooo567

cadedaniel (Member) commented Oct 28, 2022

Isn't this expected unless using STRICT_SPREAD? i.e. no strong guarantees on spread, just an attempt?

edit: I see the slack discussion now, will read there.
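
For reference, a minimal sketch of the STRICT_SPREAD variant mentioned above (assuming a cluster with at least four GPU nodes, since STRICT_SPREAD requires every bundle to be placed on a different node):

from ray.util.placement_group import placement_group

# STRICT_SPREAD forces each bundle onto a separate node, so pg.ready()
# only resolves once four GPU nodes are available.
pg_strict = placement_group([{'GPU': 1}] * 4, strategy="STRICT_SPREAD")
ray.get(pg_strict.ready())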

wuisawesome (Contributor) commented

Btw, how are we supposed to interpret the @ray.remote(num_gpus=0.5, num_cpus=0)? Is that supposed to be overridden by the placement group? I actually would have thought this should hang forever because there aren't enough resources (we ask for 0.5 GPU in addition to the placement group, but all the GPUs are already reserved).

cadedaniel added the P1 (Issue that should be fixed within a few weeks) label and removed the triage label Oct 28, 2022
cadedaniel removed their assignment Oct 28, 2022
rkooo567 (Contributor) commented Oct 28, 2022

Right now, this means the task needs 0.5 GPU from the given placement group bundle (if specified). If no bundle index is given, it takes 0.5 GPU from any bundle in the group.
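
In other words, under the current semantics the two calls below differ only in which bundle the 0.5 GPU is accounted against, not in which physical GPU the task ends up on (a minimal sketch reusing gpu_task and pg from the repro script above):

# Charge the 0.5 GPU against bundle 2 specifically.
pinned = gpu_task.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg, placement_group_bundle_index=2)).remote()

# No bundle index: the 0.5 GPU may be taken from any bundle in the group.
unpinned = gpu_task.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg)).remote()

ray.get([pinned, unpinned])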

rkooo567 added the api-bug and P1.5 (Issues that will be fixed in a couple releases) labels and removed the P1 label Mar 24, 2023
rkooo567 changed the title from "[Core] GPU placement group doesn't honer the bundle index" to "[Core] GPU placement group doesn't honor the bundle index" Apr 6, 2023
stale bot commented Aug 10, 2023

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity within the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

stale bot added the stale label Aug 10, 2023
rkooo567 removed the stale label Aug 10, 2023
Basasuya (Contributor) commented

@rkooo567 any progress on this problem? We are hitting the same issue but have no idea why it happens.

rkooo567 added the P1 label and removed the P1.5 label Aug 23, 2024
rkooo567 self-assigned this Sep 15, 2024
MengjinYan (Collaborator) commented

Update: I'm actively working on this issue. We have a path forward to fix the bug, and I'm finishing up the code changes now; I'll have a PR soon.

jjyao added the P2 label and removed the P1 label Oct 30, 2024