[data] fix a bug that may cause async map tasks to hang #48861

Merged
merged 4 commits into from
Nov 23, 2024

Conversation

raulchen
Contributor

Why are these changes needed?

Fix a bug that may cause async map tasks to hang. See code comments for details.

This issue can be reproduced with an existing test test_map_batches_async_generator on slow machines.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Hao Chen <chenh1024@gmail.com>
@raulchen raulchen force-pushed the fix-async-map-stuck branch from dab4793 to 5af4576 Compare November 22, 2024 06:58
loop = ray.data._map_actor_context.udf_map_asyncio_loop
tasks = [loop.create_task(process_batch(x)) for x in input_iterable]
try:
loop = ray.data._map_actor_context.udf_map_asyncio_loop
Member

Out of scope for this PR, but do you know why we need to use a global variable for the loop? Seems like the actor context is specific to a UDF?

Contributor Author

I guess it's because we already have a global variable, _MapActorContext, that caches other state.
Technically, the loop could also be bound to the actor object.
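
For illustration only, here is a minimal sketch of the alternative mentioned in this reply: an object that owns its event loop instead of reading it from a module-level context. The class `LoopOwner` and the coroutine `add_one` are hypothetical names and are not part of Ray; the sketch assumes the loop runs on a dedicated background thread.

```python
import asyncio
import threading


class LoopOwner:
    """Hypothetical stand-in for an actor object that owns its event loop."""

    def __init__(self):
        # The loop lives on the instance rather than in a global variable.
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro):
        # Schedule the coroutine on the owned loop and block until it finishes.
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result()

    def close(self):
        # Stop the loop from its own thread, then wait for the thread to exit.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def add_one(x):
    await asyncio.sleep(0)
    return x + 1


owner = LoopOwner()
value = owner.run(add_one(41))
owner.close()
```

With this shape, the lifetime of the loop follows the lifetime of the object, which may make cleanup easier to reason about than a shared global.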

Contributor

@alexeykudinkin alexeykudinkin left a comment

@raulchen please hold on merging

Contributor

@alexeykudinkin alexeykudinkin left a comment

LGTM

@@ -389,6 +394,12 @@ async def process_all_batches():
# from the async generator, corresponding to a
Contributor

Let's also simplify the conditional to just:

while True:
  # Blocks until an item is available
  batch = q.get()
  if batch is sentinel:
    break
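
As a runnable illustration of the loop suggested in this comment, the sketch below drains a `queue.Queue` with a blocking `get()` and stops when it sees a sentinel object. The names `producer`, `drain`, and `sentinel` are hypothetical and chosen for the example; this is not the code from the PR.

```python
import queue
import threading

# A unique object that can never be confused with a real batch.
sentinel = object()


def producer(q, items):
    for item in items:
        q.put(item)
    # Enqueue the sentinel last so the consumer knows when to stop.
    q.put(sentinel)


def drain(q):
    while True:
        batch = q.get()  # blocks until an item is available
        if batch is sentinel:
            break
        yield batch


q = queue.Queue()
t = threading.Thread(target=producer, args=(q, [1, 2, 3]))
t.start()
result = list(drain(q))
t.join()
```

Because the consumer only exits on the sentinel, it does not need a separate "are all tasks done?" check, which is the simplification the comment asks for. The identity check (`is`) matters: comparing with `==` could match a batch that happens to define equality loosely.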

@@ -352,6 +352,8 @@ def transform_fn(
# generators, and in the main event loop, yield them from
# the queue as they become available.
output_batch_queue = queue.Queue()
# Use a special object to signal the end of the queue.
end_of_queue = object()
Contributor

nit: Sentinel is a pretty common term for it
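
To show how an end-of-queue sentinel fits the overall pattern described in the diff comments (async generators run on a separate event loop, and their outputs are yielded from a queue as they become available), here is a simplified sketch. It is an assumption-laden illustration, not Ray's implementation: `run_async_generator` and `double_twice` are hypothetical names, and the event loop is created locally rather than taken from an actor context.

```python
import asyncio
import queue
import threading


def run_async_generator(async_gen_fn, inputs):
    """Yield outputs of async_gen_fn(x) for each x, in completion order."""
    output_queue = queue.Queue()
    sentinel = object()  # marks the end of the queue

    async def process_one(x):
        async for out in async_gen_fn(x):
            output_queue.put(out)

    async def process_all():
        try:
            await asyncio.gather(*(process_one(x) for x in inputs))
        finally:
            # Always signal completion, even on failure, so the
            # consumer below can never block forever on get().
            output_queue.put(sentinel)

    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
    future = asyncio.run_coroutine_threadsafe(process_all(), loop)
    try:
        while True:
            item = output_queue.get()  # blocking
            if item is sentinel:
                break
            yield item
        future.result()  # re-raise any exception from the async side
    finally:
        loop.call_soon_threadsafe(loop.stop)
        thread.join()
        loop.close()


async def double_twice(x):
    for _ in range(2):
        await asyncio.sleep(0)
        yield x * 2


results = sorted(run_async_generator(double_twice, [1, 2, 3]))
```

Putting the sentinel in a `finally` block is the key design choice in this sketch: the consumer's only exit condition is the sentinel, so the producer side must emit it on every path.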

@raulchen raulchen enabled auto-merge (squash) November 23, 2024 00:21
@github-actions github-actions bot disabled auto-merge November 23, 2024 00:21
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Nov 23, 2024
@raulchen raulchen merged commit 4b4f3c6 into ray-project:master Nov 23, 2024
5 of 7 checks passed
@raulchen raulchen deleted the fix-async-map-stuck branch November 23, 2024 01:24
jecsand838 pushed a commit to jecsand838/ray that referenced this pull request Dec 4, 2024
…48861)

Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Connor Sanders <connor@elastiflow.com>
dentiny pushed a commit to dentiny/ray that referenced this pull request Dec 7, 2024
…48861)

Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: hjiang <dentinyhao@gmail.com>
Labels
go add ONLY when ready to merge, run all tests
3 participants