Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Override nodes at module import - Feature Sketch #1134

Conversation

jernejfrank
Copy link
Contributor

@jernejfrank jernejfrank commented Sep 11, 2024

This addresses #1120 and adds the capability to override nodes from later imported modules. It also suggests how to handle the case when we import the same module twice.

Changes

Adds allow_module_overrides as a method in the Builder and as an attribute in the Driver. If set true, it allows the node with the same name to be overwritten.

How I tested this

There are only two small mock scripts to have a running sketch; I will add unit tests.

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

@zilto
Copy link
Collaborator

zilto commented Sep 12, 2024

Sharing Slack comments for lineage

Thanks for the contribution! I see the thread of issues and Slack convo.
I think it's worth discussing the feature a bit more before you commit additional time to it!
For one, relying on ordering of arguments can be brittle. For example .with_modules(foo, bar, allow_override=True) might not equal .with_modules(bar, foo, allow_override=True). It's good to use the flag allow_override to prevent users accidentally overriding functions though
Since this change involves many core internals (FunctionGraph, Driver , Builder) I'm wondering if there are other ways to support this feature. Hamilton couples a lot of thing to the module Python construct (for better or worse). But concretely, the issue seems to be overriding functions. It might be a valuable change to catch duplicate functions "closer to the user" (e.g., Builder.build() rather than in internal components like FunctionGraph

An idea would be to use find_functions() , which is a public-facing API part of hamilton.graph_utils. Ideally, it would allow us to only make "surface level" changes closer to the Builder.

Default behavior

We can keep .with_modules() as-is and avoid changes to other internals (Driver, FunctionGraph). This ensures safety for all Hamilton users

Override behavior

When specifying .with_modules(..., allow_overrides=True),
the logic of .with_modules() could directly call find_functions() and dynamically removed duplicate functions

def with_modules(*modules):
   if allow_overrides:
      functions_seen = set()
      for module in modules:
         functions = graph_utils.find_functions(module)
         for name, callabl in functions:
            if name in functions_seen
               del getattr(module, name)
            else:
               functions_seen.add(name)
   
    self.modules.extend(modules)

@elijahbenizzy
Copy link
Collaborator

This addresses #1120 and adds the capability to override nodes from later imported modules. It also suggests how to handle the case when we import the same module twice.

@zilto this specifically doesn't allow for node overriding, this is only function overriding. Burt also, there is no inherent problem changing the core functionality -- it shouldn't be static. In this case the graph is responsible for assembling the nodes. It also has the current error message that breaks, so it makes sense to change that. We have more flexibility if we adjust the internal API to change later than using user-facing functions, which tie us to the external API.

Copy link
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Left some comments on specifics. High-level:

  1. You'll want docs (see slack for how to add/develop). Make sure to reference caveats (why you might/might not want to do this)
  2. Wondering if we want more in the example -- a README/notebook might be nice. If you have a compelling version of why this feature is useful that woudl be nice, otherwise don't want to upscope this! You're not the one that asked for this feature, you're the one building it :)

@@ -0,0 +1,22 @@
import module_a
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a README in the examples module with how to run/what we learn? Maybe a notebook?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am on the fence about having this in the examples as I don't really have a use case for it. Maybe we remove it for now and ask Yijun Tang who opened the issue if he can give a rough sketch of how he will use it. Happy to implement it, just don't want it to be a very contrived example.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! So just a simple README then might be nice -- one that says what this does/a quick description of the feature. Can ignore a notebook until we have it ready.

examples/module_overrides/z_double_import_problem.py Outdated Show resolved Hide resolved
hamilton/driver.py Show resolved Hide resolved
hamilton/driver.py Outdated Show resolved Hide resolved
hamilton/driver.py Outdated Show resolved Hide resolved
hamilton/driver.py Outdated Show resolved Hide resolved
@@ -2010,13 +2031,25 @@ def build(self) -> Driver:
adapter=lifecycle_base.LifecycleAdapterSet(*adapter),
)

if not self.module_overrides: # if override on than this doesn't matter
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, interesting... This is actually a different type of module override -- those that are the same. Can you elaborate more on the case that it broke? Feels like an upstream thing... I'm not 100% sure I trust the module in module_set -- identity is a little weird/what happens if you make an update but keep an old reference around?

Copy link
Contributor Author

@jernejfrank jernejfrank Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, if you do something like

.with_modules(module_A, module_A)

hamilton will just add it to the list of modules and treat them as two different ones. If you print the attribute, you get two pointers in the list

[<module 'module_A' from 'path to module A'>,<module 'module_A' from 'path to module A'>, <module 'module_A' from 'path to module A'>,<module 'module_A' from 'path to module A'>]

Then when you try to build the driver, because they are treated as separate you run into the issue that you have same-named functions and get an error that functions have to have different names.

The module_set tries to get create an ordered set out of the list we use for storing the module references.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So yeah, what I"m saying is thyis feels like a different use-case, and has a different problem that informed it?

I guess I'm just worried that it might hide unintended behavior...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, I admit it's probably a niche thing that may not come up again, so I'm happy to take it out. In case someone else flags it, it's not hard to add it later on.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I mean, maybe a separate PR? Or an issue with the proposed fix?

hamilton/graph.py Show resolved Hide resolved
@jernejfrank
Copy link
Contributor Author

I feel like the README is the most concise description of this functionality, so I am wondering if we should link to it more?

@elijahbenizzy
Copy link
Collaborator

Looking good -- to get it out:

  • Let's decide whether we want the additional clashing modules in -- if so, let's add a driver level test
  • Add a driver-level test folr this feature (we have driver code)
    See this file for E2E tests -- good enough for this IMO:
    def test_driver_validate_with_overrides():

@elijahbenizzy
Copy link
Collaborator

I feel like the README is the most concise description of this functionality, so I am wondering if we should link to it more?

Yep, agreed. Can link from docstring and/or from concepts section.

Copy link
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, thank you!

@elijahbenizzy elijahbenizzy merged commit d878595 into DAGWorks-Inc:main Sep 13, 2024
24 checks passed
@jernejfrank jernejfrank deleted the feature/override_nodes_at_module_import branch September 13, 2024 21:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants