Graceful failure for parallelism #1017

JamesArruda · 2024-07-05T18:17:04Z

This PR updates the GracefulErrorAdapter to handle several situations for Parallelizable nodes, as described in #1009.

It also includes a wish-list item (sentinel input) for a broader use case, which is motivated here: #742

This PR allows for sentinel injection into tagged functions, which will support pre-Collect blocks that use hamilton's resolve to allow runtime-definable DAGs that still aggregate as expected.

Changes

Using the newly available is_expand input, modify the adapter's run logic to:

Iterate over the node's generator function
Gather successful generator results
On an error, push one sentinel into the results list and return
Or, treat the generator as pass/fail and any failure returns a single sentinel only
Allow sentinel value injection to a node, rather than skipping a node entirely. This is keyed on a node tag.
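
For readers skimming the thread, the run logic above can be sketched roughly as follows. This is a library-free illustration, not the adapter's actual code: `run_generator_gracefully`, `numbers`, and the `SENTINEL` placeholder are hypothetical names, and only `error_to_catch` and `fail_all_parallel` mirror parameters mentioned in this PR.

```python
from typing import Any, Callable, Iterator, List

SENTINEL = None  # assumed placeholder; the real adapter lets you configure this


def run_generator_gracefully(
    gen_fn: Callable[..., Iterator[Any]],
    *,
    error_to_catch: type = Exception,
    fail_all_parallel: bool = False,
    **kwargs: Any,
) -> List[Any]:
    """Collect yielded items; on a caught error, substitute a sentinel."""
    results: List[Any] = []
    try:
        for item in gen_fn(**kwargs):
            results.append(item)
    except error_to_catch:
        if fail_all_parallel:
            # Treat the generator as pass/fail: discard partial successes.
            return [SENTINEL]
        # Keep what succeeded, then mark the failure point exactly once.
        results.append(SENTINEL)
    return results


def numbers(total: int, fail_at: int) -> Iterator[int]:
    """Hypothetical generator that fails at a chosen iteration."""
    for i in range(total):
        if i == fail_at:
            raise ValueError(f"iteration {i} failed")
        yield i
```

Calling `run_generator_gracefully(numbers, error_to_catch=ValueError, total=5, fail_at=3)` keeps the three successful items and appends one sentinel, while passing `fail_all_parallel=True` collapses the whole result to a single sentinel.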

How I tested this

The test_parallel_graceful.py test file checks for:

Two kinds of failure in the Parallelizable node (early failure and failure with successful yields)
Setting fail_all_parallel even with successful yields
No Parallelizable failure, but checks that failures within the sub-dag are collected.
Sentinel injection rules are followed

Notes

The Parallelizable node gets a pre-treatment from _modify_callable that handles the list creation from the generator. This makes it hard to specifically handle per-iteration success/fail in the adapter. I opted to pull the original node callable from the partial object out to avoid touching the base implementation. This seems reasonable since it's an optional adapter, but if _modify_callable gets more advanced for an EXPAND node, this may break the expectations of the adapter.

It also only pulls out the callable if the user wants to get each successful output, otherwise it goes back to the original behavior.
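
To make the "pull the original callable out of the partial" step concrete, here is a minimal sketch. It assumes the executor wraps the generator in a `functools.partial` that carries it under a `_callable` keyword (the `new_callable` signature quoted later in this thread); `unwrap_generator` and `countdown` are hypothetical names, and the wrapper body is a stand-in rather than Hamilton's implementation.

```python
import functools
from typing import Any, Callable, Iterator, List, Optional


def new_callable(
    *args: Any, _callable: Optional[Callable[..., Iterator[Any]]] = None, **kwargs: Any
) -> List[Any]:
    # Stand-in for the executor's wrapper: it exhausts the generator eagerly,
    # which hides per-iteration success/failure from any caller.
    return list(_callable(*args, **kwargs))


def unwrap_generator(node_callable: Callable[..., Any]) -> Callable[..., Any]:
    """Return the wrapped generator function if one is present, else the input."""
    if isinstance(node_callable, functools.partial):
        inner = node_callable.keywords.get("_callable")
        if inner is not None:
            return inner
    return node_callable


def countdown(start: int) -> Iterator[int]:
    """Hypothetical Parallelizable-style generator."""
    while start > 0:
        yield start
        start -= 1


wrapped = functools.partial(new_callable, _callable=countdown)
```

Here `unwrap_generator(wrapped)` hands back `countdown` itself, so a caller can iterate it step by step instead of receiving the already-materialized list. The coupling concern raised in the notes applies to this sketch too: it breaks if the wrapper stops using the `_callable` keyword.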

If "inject" is a loaded term (such as in the resolve feature), please suggest a better one!

My intent, once this is accepted, is to make a more advanced version with a custom sentinel class that stores error traceback information and is easy to type-check when the sentinel is injected.
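
One possible shape for such a sentinel class, purely as a sketch of the idea: `ErrorSentinel`, `from_exception`, and `is_sentinel` are hypothetical names and nothing here is part of this PR.

```python
import traceback
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ErrorSentinel:
    """Marks a failed node and remembers why it failed."""

    node_name: str
    error: BaseException
    traceback_text: str

    @classmethod
    def from_exception(cls, node_name: str, error: BaseException) -> "ErrorSentinel":
        # Render the traceback to text so it survives pickling or logging.
        text = "".join(
            traceback.format_exception(type(error), error, error.__traceback__)
        )
        return cls(node_name=node_name, error=error, traceback_text=text)


def is_sentinel(value: Any) -> bool:
    """Type-based check, so real values such as None are never mistaken for failures."""
    return isinstance(value, ErrorSentinel)
```

A dedicated type avoids the ambiguity of using `None` as the sentinel: a downstream node that accepts sentinels can check `is_sentinel(x)` and still treat `None` as ordinary data.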

Checklist

PR has an informative and human-readable title (this will be pulled into the release notes)
Changes are limited to a single goal (no scope creep)
Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
Any change in functionality is tested
New functions are documented (with a description, list of inputs, and expected output)
Placeholder code is flagged / future TODOs are captured in comments
Project documentation has been updated if adding/changing functionality.

elijahbenizzy · 2024-07-05T23:58:35Z

This is looking great, I'll be able to look in detail tomorrow. If you feel like adding an example (in the examples/ directory), that would be much appreciated! Otherwise no worries, I'm happy to add one later. I also think this would make a nice blog post (we are doing guest authors), so let me know if you're interested!

Going to download and play around a bit to get a feel for it.

elijahbenizzy · 2024-07-07T18:26:55Z

OK, played around, this is really cool. Here's my testing script btw -- you'll have to replace the project/name with your own and run hamilton ui. It's a fun way to explore what they're doing:

from hamilton.htypes import Parallelizable, Collect
from hamilton import driver
from hamilton.lifecycle import GracefulErrorAdapter
from hamilton_sdk import adapters

import logging

logger = logging.getLogger(__name__)


class SpecificException(Exception):
    pass


def first_node(fail_first_node: bool = False) -> int:
    # First node that can fail
    if fail_first_node:
        raise SpecificException()
    return 1


def parallelizable_block(
        first_node: int,
        total_iterations: int,
        fail_iterations: list[int] = None) -> Parallelizable[int]:
    if fail_iterations is None:
        fail_iterations = []
    # Parallelizable block that can fail
    for i in range(total_iterations):
        if i in fail_iterations:
            raise SpecificException
        yield i


def computed_1(parallelizable_block: int, fail_computed_1: bool = False) -> int:
    # Computed node that can fail
    if fail_computed_1:
        raise SpecificException()
    import time
    time.sleep(1)
    return parallelizable_block + 1


def computed_2(computed_1: int, fail_computed_2: bool = False) -> int:
    # Computed node that can fail
    if fail_computed_2:
        raise SpecificException()
    import time
    time.sleep(1)
    return computed_1 + 2


def computed_3(computed_2: int, fail_computed_3: bool = False) -> int:
    # Computed node that can fail
    if fail_computed_3:
        raise SpecificException()
    import time
    time.sleep(1)
    return computed_2 + 3


def collected(computed_3: Collect[int], fail_collected: bool = False) -> int:
    # Drop the sentinel values (None in this script) left by failed branches
    filtered = [item for item in computed_3 if item is not None]
    if fail_collected:
        raise SpecificException()
    # fail in advance
    return sum(filtered)


def final_node(collected: int, fail_final: bool = False) -> int:
    if fail_final:
        raise SpecificException()
    return collected


if __name__ == "__main__":
    import __main__

    TOTAL_ITERATIONS = 10
    for params in [
        {"fail_first_node": True},
        {"fail_iterations": [3, 5]},
        {"fail_computed_1": True},
        {"fail_computed_2": True},
        {"fail_computed_3": True},
        {"fail_collected": True},
        {"fail_final": True},
    ]:
        try:
            fail_all_parallel = True
            dag_name = str(params.items()).replace("[", "").replace("]", "").replace(", ", "_").replace("'", "")
            tags = {
                **{key: str(value) for key, value in params.items()},
                **{"test_run": "2", "with_adapter": "true", "fail_all_parallel": str(fail_all_parallel).lower()}
            }
            print(f"Running with params: {params}")
            params["total_iterations"] = TOTAL_ITERATIONS
            tracker = adapters.HamiltonTracker(
                project_id=10,  # modify this as needed
                username="elijah@dagworks.io",
                dag_name="initial_test",
                tags=tags
            )
            dr = (
                driver
                .Builder()
                .enable_dynamic_execution(allow_experimental_mode=True)
                .with_adapters(
                    tracker,
                    GracefulErrorAdapter(
                        error_to_catch=SpecificException
                    )
                )
                .with_modules(__main__)
                .build()
            )
            dr.execute(["final_node"], inputs=params)
        except Exception as e:
            logger.exception(e)

elijahbenizzy

This looks really great, thank you. Made quite a few comments on ideas/clarifications. You've already done a ton, so let me know if there's anything you want to hand off!

Otherwise I'd love to co-author a blog post on this, this is very cool and a really fun technical story.

elijahbenizzy · 2024-07-07T18:30:08Z

hamilton/lifecycle/default.py

@@ -563,11 +569,19 @@ def never_reached(wont_proceed: int) -> int:

        Note you can customize the error you want it to fail on and the sentinel value to use in place of a node's result if it fails.

+        For Parallelizable nodes, this adapter will attempt to iterate over the node outputs. If an error occurs, the sentinel value is
+        returned and no more iterations over the node will occur. If you set ``fail_all_parallel`` to be True, it only sends on sentinel


on -> one

Suggested change

returned and no more iterations over the node will occur. If you set ``fail_all_parallel`` to be True, it only sends on sentinel

returned and no more iterations over the node will occur. If you set ``fail_all_parallel`` to be True, it only sends one sentinel

elijahbenizzy · 2024-07-07T18:30:44Z

hamilton/lifecycle/default.py

        :param error_to_catch: The error to catch
        :param sentinel_value: The sentinel value to use in place of a node's result if it fails
+        :param fail_all_parallel: Treat a Parallelizable as 1 failure (True) or allow the successful ones to go through (False).
+        :param sentinel_injection_tags: Node tag key:value pairs that allow sentinel injection


A little unclear what these tags do to me, mind expanding a bit on why they're necessary/how to use them?

Also, is there any reason you would want to set specific ones? Why not just have a fixed tag? Then you could create a simple decorator that calls out to it:

@drop_errors(...)

(still not 100% sure how to use these so this might be wrong...)

Attempted to make a decorator that creates a specific tag to allow sentinel injection.

4308d0a

Awesome, thanks! Would you mind adding a little documentation on that decorator with an example? I think it's a little confusing what it does.
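
As a rough illustration of what a tag-based opt-in could look like, here is a self-contained sketch. It does not use Hamilton's `@tag` machinery: `tag_accepts_sentinels`, `call_or_skip`, the `SENTINEL_TAG` key, and the plain `tags` attribute are all hypothetical stand-ins for whatever the adapter does internally.

```python
from typing import Any, Callable

SENTINEL_TAG = "accepts_error_sentinel"  # hypothetical tag key
SENTINEL = object()  # hypothetical unique sentinel value


def tag_accepts_sentinels(func: Callable[..., Any]) -> Callable[..., Any]:
    """Mark a function as willing to receive sentinels as inputs."""
    tags = dict(getattr(func, "tags", {}))
    tags[SENTINEL_TAG] = True
    func.tags = tags
    return func


def call_or_skip(func: Callable[..., Any], **kwargs: Any) -> Any:
    """Skip untagged functions when an input failed; otherwise inject and run."""
    has_failed_input = any(value is SENTINEL for value in kwargs.values())
    accepts = getattr(func, "tags", {}).get(SENTINEL_TAG, False)
    if has_failed_input and not accepts:
        return SENTINEL  # propagate the failure without running the node
    return func(**kwargs)


def plus_one(x: int) -> int:
    return x + 1


@tag_accepts_sentinels
def count_failures(x: Any, y: Any) -> int:
    # This node opted in, so it sees the sentinel and can react to it.
    return sum(1 for value in (x, y) if value is SENTINEL)
```

With this sketch, `call_or_skip(plus_one, x=SENTINEL)` returns the sentinel without running `plus_one`, whereas `call_or_skip(count_failures, x=SENTINEL, y=3)` runs the function and returns 1. That is the difference between skipping a node and injecting the sentinel into it.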

elijahbenizzy · 2024-07-07T18:35:45Z

hamilton/lifecycle/default.py

+
+        # Grab the partial-ized function that is a parallelizable.
+        # Be very specific...
+        if len(node_callable.keywords) == 1 and "_callable" in node_callable.keywords:


This is specifically coupling to this, right?

hamilton/hamilton/execution/executors.py, line 63 in 87c5de0:

def new_callable(*args, _callable=None, **kwargs):

Would it be better to just check if it's an expand and make that assumption? Not perfect, but I think it cleans up a bit of the coupling and makes it clearer. Still coupled, but with slightly cleaner conditions.

elijahbenizzy · 2024-07-08T18:25:31Z

@JamesArruda nice work, almost there! Added some comments. Do let me know if you want me to take over getting it the last little bit (don't want to bog you down with nitpicks...).

elijahbenizzy · 2024-07-08T18:22:05Z

hamilton/lifecycle/default.py

+def accept_error_sentinels(func: Callable):
+    """Tag a function to allow passing in error sentinels.
+
+    For use with ``GracefulErrorAdapter``.


Nit -- it's nice to add this to the docs -- here (hamilton/docs/reference/lifecycle-hooks/GracefulErrorAdapter.rst, line 4 in 87c5de0). I'm happy to add it in afterwards cause it takes a bit of dev work to test out. Plus side is we can also cross-link the reference.

If you don't mind adding that, I would appreciate the help. You'll get it done much faster, I think!

You got it!

elijahbenizzy · 2024-07-08T18:23:00Z

hamilton/function_modifiers/metadata.py

@@ -57,6 +57,7 @@ def my_function(...) -> ...:
        "ccpa",
        "dag",
        "module",
+        "ERROR_SENTINEL",


I don't think this is a namespace? I'd set it as hamilton.error_sentinel=True, then keep the bypassing of reserved namespaces. That said, we can always change it later as it's an internal contract...

elijahbenizzy · 2024-07-08T21:22:26Z

@JamesArruda -- turns out the test failure is coming from the inclusion of @tag as an import in default.py (yes, really, I downloaded it and messed around).

I think this is due to a circular dependency, but I haven't been able to dig in. So, the cleanest way to do this is to move the decorator to somewhere in the function_modifiers package, and it'll be loosely coupled. Could also do a dynamic (inline) import. I think that might be cleaner with a TODO + reference to this conversation.
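
For anyone unfamiliar with the inline-import workaround, a minimal sketch of the pattern follows. The standard-library `json` module stands in for the module that would otherwise be imported at the top of the file and trigger the cycle; `format_payload` is a hypothetical function.

```python
from typing import Any, Dict


def format_payload(payload: Dict[str, Any]) -> str:
    """Serialize a payload, importing the dependency only when called."""
    # TODO: move this back to module level once the circular reference is fixed.
    # Deferred import: resolved on first call rather than when this module
    # loads, so the importing module is fully initialized by then.
    import json

    return json.dumps(payload, sort_keys=True)
```

The trade-off is that the dependency is hidden from the top of the file, which is why leaving a TODO with a pointer back to this conversation seems worthwhile.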

That said, I think this is looking pretty much ready, so I'm happy to merge this and then I'll take this the last mile. Sounds good?

JamesArruda · 2024-07-08T23:47:55Z

@elijahbenizzy Sounds great, thanks!

There's an issue that we have not solved yet -- we'll need to audit circular references. For now an inline import is OK, but see PR + comment here for more details #1017. We'll need to fix this later.

Initial issue description as a test.
Modifying GracefulErrorAdpater for Parallelizable blocks.
Added docs to adapter. Updated for simpler passthrough.
Added sentinel injection feature and test. Updated gitignore for vscode settings folder.
Fix for tag error
Updating adapter test for proper tags
Fixing types for 3.8
Type fixing for 3.8
Fixing sentinel equality testing to avoid incomparible types.
Adding example of paralellism and GracefulErrorAdapter
Added decorator for sentinel acceptance as input. Updated tests and example. Added DAG image to example. Typo.
Simplifying node callable retrieval.
Adding docstring clarification on try_all_parallel.
Parametrizing tests. Moving test module to resources.
Added docs to decorator. Changed node tag key.

JamesArruda added 8 commits July 5, 2024 12:20

Initial issue description as a test.

c24d2d7

Modifying GracefulErrorAdpater for Parallelizable blocks.

17ffb36

Added docs to adapter. Updated for simpler passthrough.

60d1787

Added sentinel injection feature and test. Updated gitignore for vscode settings folder.

e2f8e62

Fix for tag error

04ef17c

Updating adapter test for proper tags

775f5ed

Fixing types for 3.8

bd7d6b3

Type fixing for 3.8

f836643

JamesArruda added 2 commits July 5, 2024 23:14

Fixing sentinel equality testing to avoid incomparible types.

43b4037

Adding example of paralellism and GracefulErrorAdapter

6a08e98

elijahbenizzy reviewed Jul 7, 2024

JamesArruda added 4 commits July 7, 2024 21:38

Added decorator for sentinel acceptance as input. Updated tests and example. Added DAG image to example. Typo.

4308d0a

Simplifying node callable retrieval.

f6c7339

Adding docstring clarification on try_all_parallel.

f7c0e94

Parametrizing tests. Moving test module to resources.

a299289

elijahbenizzy reviewed Jul 8, 2024

Added docs to decorator. Changed node tag key.

96d9c47

elijahbenizzy merged commit 683bc3a into DAGWorks-Inc:graceful-failure-parallelism Jul 9, 2024
21 of 22 checks passed
