[core] Fault tolerance for compiled DAGs #41943
Conversation
```
            debugger_breakpoint,
        )
        values = self.deserialize_objects(data_metadata_pairs, object_refs)
        for i, value in enumerate(values):
```
Copied from non-experimental path.
Force-pushed from fd3459a to 477cadf.
Force-pushed from 477cadf to 13e8bd7.
```
@@ -751,8 +756,14 @@ Status PlasmaClient::Impl::EnsureGetAcquired(
  }

  int64_t version_read = 0;

  // Need to unlock the client mutex since ReadAcquire() is blocking.
  // TODO(ekl) is this entirely thread-safe?
```
We are probably assured the object cannot be deallocated while a reader has the reference. @stephanie-wang @rkooo567 does this seem right?
Right now mutable objects are never deallocated :D
But yeah in general this should be okay as long as we make sure to increment the PlasmaClient's local ref count for the object before we unlock.
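To make the ordering concrete, here is a minimal Python sketch of the "pin before unlock" pattern being discussed; the class and method names are illustrative stand-ins, not Ray's actual C++ PlasmaClient code.

```python
import threading


class PlasmaClientSketch:
    """Illustrative only: shows the ordering, not the real implementation."""

    def __init__(self):
        self._client_mutex = threading.Lock()
        self._local_ref_counts = {}  # object_id -> local reference count

    def read_acquire(self, object_id, blocking_read):
        with self._client_mutex:
            # Pin the object locally *before* dropping the client mutex, so it
            # cannot be deallocated while we block in the read below.
            self._local_ref_counts[object_id] = (
                self._local_ref_counts.get(object_id, 0) + 1
            )
        # The blocking ReadAcquire() analogue runs outside the client mutex.
        return blocking_read(object_id)
```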
I think there is an edge case here where an actor dies after WriteAcquire but before WriteRelease. For that case, we would need to make sure to write the exception and have one process Release (or make Release safe for multiple writers).
python/ray/experimental/channel.py (outdated)
```
                try_wait=True,
            )
        except Exception as e:
            if not _is_write_acquire_failed_error(e):
```
Hmm I didn't quite understand this condition. It seems to fail silently if we fail to acquire?
Also, could you comment on what cases we expect to fail to acquire?
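For context, here is how I read the pattern in the diff above, as a self-contained sketch; the predicate body and the `write(..., try_wait=True)` call are assumptions standing in for the PR's actual helpers.

```python
def _is_write_acquire_failed_error(e: Exception) -> bool:
    # Hypothetical check; the real predicate presumably inspects the error
    # raised when a non-blocking WriteAcquire cannot take the write lock.
    return "write acquire failed" in str(e).lower()


def best_effort_write(channel, value):
    try:
        channel.write(value, try_wait=True)
    except Exception as e:
        if not _is_write_acquire_failed_error(e):
            # Unexpected failure: surface it to the caller.
            raise
        # Otherwise the acquire failed (e.g., the channel is being torn down),
        # and the write is silently dropped, which is the behavior questioned above.
```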
python/ray/dag/compiled_dag_node.py (outdated)
```diff
@@ -75,10 +73,21 @@ def do_exec_compiled_task(
             channel.end_read()
 
     except Exception as e:
-        logging.warn(f"Compiled DAG task aborted with exception: {e}")
+        logging.info(f"Compiled DAG task exited with exception: {e}")
```
For non-Ray exceptions, I wonder if we should instead store the error and keep looping?
+1
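A rough sketch of that alternative: catch non-Ray exceptions from application code, write the exception into the output channel as the result, and keep looping instead of exiting the task. The channel method names (`begin_read`, `end_read`, `write`) follow the snippets in this PR, but this is not the actual implementation.

```python
def exec_compiled_task_loop(input_channels, output_channel, func):
    while True:
        args = [chan.begin_read() for chan in input_channels]
        try:
            result = func(*args)       # application code
        except Exception as e:         # non-Ray exception from user code
            result = e                 # store the error instead of exiting
        for chan in input_channels:
            chan.end_read()
        output_channel.write(result)   # the reader can re-raise the exception
```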
```
@@ -176,6 +192,32 @@ def f(x):
    dag.experimental_compile()


@pytest.mark.parametrize("num_actors", [1, 4])
def test_dag_fault_tolerance(ray_start_regular, num_actors):
```
Can you also add a test for worker process dying?
I see, would we need a timeout here to force release of the lock?
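A hedged sketch of the kind of worker-death test being asked for; the exact test added in this PR may differ, and the `RaySystemError` on read is an assumption based on the PR description.

```python
import pytest
import ray
from ray.dag import InputNode
from ray.exceptions import RaySystemError


@ray.remote
class Echo:
    def echo(self, x):
        return x


def test_dag_fault_tolerance_worker_death(ray_start_regular):
    actor = Echo.remote()
    with InputNode() as inp:
        dag = actor.echo.bind(inp)
    compiled_dag = dag.experimental_compile()

    # Sanity check: the DAG works before the failure.
    chan = compiled_dag.execute(1)
    assert chan.begin_read() == 1
    chan.end_read()

    # Kill the actor process and check that subsequent use fails instead of hanging.
    ray.kill(actor, no_restart=True)
    with pytest.raises(RaySystemError):
        chan = compiled_dag.execute(1)
        chan.begin_read()
    compiled_dag.teardown()
```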
If we know that the original writer has definitely died, it should be okay to directly write the exception object into the plasma buffer and WriteRelease. But yeah, it is a bit tricky if the process fails while holding the pthread mutex in WriteAcquire or WriteRelease. I think we need to rethink that concurrency mechanism...
A long enough timeout on the mutex acquire (pthread_mutex_timedlock rather than a plain pthread_mutex_lock) seems okay for now; we can probably improve it later.
By the way, it seems like we need something like this to support multi-node too, so that we have a way to signal that we should stop waiting for values to send to the other node. I think it'd be best if we can send a special value like "EOF" instead of storing an exception, so that it works for both Python and C++ readers.
So how about this: we can switch the error writing path of WriteAcquire to…
Can you explain more?
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Force-pushed from 96b78b6 to d12ef99.
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Force-pushed from d12ef99 to 94c5ba4.
From offline discussion, this should be refactored so that EOS will be implemented as an "error bit" that can be set on the channel without locking. This allows error handling to avoid race conditions while still preserving exception messages in most cases.
What's missing to support "all errors" in this case?
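A minimal sketch of the error-bit idea, written as a hypothetical in-process Python channel (the real channel is backed by plasma shared memory, and all names here are made up): the bit is monotonic and can be set without taking the write lock, so it remains safe even if a writer died mid-write.

```python
import threading


class ChannelSketch:
    def __init__(self):
        self._value = None
        self._error_bit = False          # set once, never cleared
        self._ready = threading.Event()

    def write(self, value):
        self._value = value
        self._ready.set()

    def set_error(self):
        # No lock needed: we only flip a monotonic flag and wake any readers,
        # so this works even if the writer crashed while holding its lock.
        self._error_bit = True
        self._ready.set()

    def begin_read(self):
        self._ready.wait()
        if self._error_bit:
            raise RuntimeError("channel closed")  # stands in for RaySystemError
        return self._value
```

A bare bit cannot carry the original exception message on its own; preserving messages "in most cases" presumably means writing the exception through the normal path when the lock is still usable and falling back to the bit otherwise.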
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Force-pushed from 0a2c8fd to 2e7d2a4.
```
@@ -68,17 +69,41 @@ def do_exec_compiled_task(
    for idx, channel in input_channel_idxs:
        resolved_inputs[idx] = channel.begin_read()
```
It might be good to explicitly try-catch the channel calls so that we can differentiate between expected errors (channel closed), application code errors, and anything else that might error in this loop (most likely system bugs). The try-catch at the end can be for system errors only.
Hmm I played around with this and deciding the semantics is tricky, so I think we should tackle this later on for productionization.
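For the record, a sketch of the classification being suggested; the exception type is hypothetical, and as noted the actual semantics were deferred.

```python
import logging


class ChannelClosedError(Exception):
    """Hypothetical marker for the expected "channel closed" condition."""


def read_inputs(input_channel_idxs, resolved_inputs):
    for idx, channel in input_channel_idxs:
        try:
            resolved_inputs[idx] = channel.begin_read()
        except ChannelClosedError:
            # Expected during teardown: propagate so the task loop exits cleanly.
            raise
        except Exception:
            # Anything else raised here is most likely a system bug, not user code.
            logging.exception("Unexpected error while reading DAG input %s", idx)
            raise
```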
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Force-pushed from 2e7d2a4 to 482296d.
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu> Signed-off-by: Eric Liang <ekhliang@gmail.com>
Comments addressed.
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Force-pushed from 1c32efb to 1e19048.
Tests look good, retrying the flaky one.
Why are these changes needed?

This adds fault tolerance and a `teardown` method for compiled DAGs. When a participating worker dies or the DAG is shut down, operations on the DAG's channels raise `RaySystemError("channel closed")`. The DAG can be shut down explicitly with `compiled_dag.teardown()`.

Note: cancellation is best-effort and currently requires running a task on the actor's main concurrency group. If the actor is busy with some other task submitted by the user, cancellation will be delayed.
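A rough usage sketch based on the API names that appear in this PR (`experimental_compile`, `execute`, `begin_read`/`end_read`, `teardown`); exact signatures and error behavior may differ.

```python
import ray
from ray.dag import InputNode
from ray.exceptions import RaySystemError


@ray.remote
class Doubler:
    def double(self, x):
        return 2 * x


actor = Doubler.remote()
with InputNode() as inp:
    dag = actor.double.bind(inp)

compiled_dag = dag.experimental_compile()
chan = compiled_dag.execute(21)
print(chan.begin_read())  # 42
chan.end_read()

# Shut the DAG down; per the description above, later channel operations
# raise RaySystemError("channel closed").
compiled_dag.teardown()
try:
    chan = compiled_dag.execute(1)
    chan.begin_read()
except RaySystemError:
    print("channel closed")
```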
Related issue number
#41769
Checks
- I've signed off every commit (using `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference: for example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.