-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mutate Decorator #1160
Mutate Decorator #1160
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Reviewed everything up to 754964f in 12 seconds
More details
- Looked at
500
lines of code in9
files - Skipped
2
files when reviewing. - Skipped posting
4
drafted comments based on config settings.
1. hamilton/function_modifiers/macros.py:1167
- Draft comment:
Ensure that thetransform
attribute is initialized for the target function if it doesn't already exist before appending to it. - Reason this comment was not posted:
Comment did not seem useful.
2. hamilton/function_modifiers/macros.py:1168
- Draft comment:
Ensure that thetransform
attribute is initialized for the target function if it doesn't already exist before appending to it. - Reason this comment was not posted:
Marked as duplicate.
3. docs/reference/decorators/pipe.rst:30
- Draft comment:
Ensure that themutate
decorator is fully documented here, as it is a new addition. The PR already adds documentation for it, which is good. - Reason this comment was not posted:
Confidence changes required:20%
The mutate decorator is being introduced, and it should be documented in the sphinx documentation. The PR already adds documentation for the mutate decorator in thedocs/reference/decorators/pipe.rst
file, which is appropriate.
4. hamilton/function_modifiers/macros.py:1070
- Draft comment:
Themutate
class is a new addition and should be documented in the sphinx documentation. Ensure that it is included in the relevant documentation files. - Reason this comment was not posted:
Confidence changes required:20%
The mutate decorator is being introduced, and it should be documented in the sphinx documentation. The PR already adds documentation for the mutate decorator in thedocs/reference/decorators/pipe.rst
file, which is appropriate.
Workflow ID: wflow_nny9Up2c9Q0OqvXU
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
Just documenting some thoughts while implementing to keep a record:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fun! Great stuff!
Edge cases / things to double check or think through:
- Does this work with both variants of using Ray? i.e. checking SERDE.
- What should we do if someone does
@mutate
def no_leading_underscore(...) -> ...:
- I assume the Hamilton UI has no issue with this?
- How could someone screw this up? Do we need anything for error handling?
- What guardrails do we have? or should we think about adding?
Otherwise the commit for this should include any assumptions/constraints so we can catalog design decisions and anything that might help our future selves --> so if it's in this PR message it's easy then to squash merge with it...
|
||
def __init__( | ||
self, | ||
*target_functions: Callable, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might need to support a string -- since it could come from extract_fields for instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is an interesting edge case and also one for pipe_output
that we haven't addressed; right now if we try
def _pre_step(something:int)->int:
return something + 10
def _post_step(something:int)->int:
return something + 100
def a()->int:
return 10
@pipe_output(
step(_pre_step)
)
@extract_fields(
{"field_1":int, "field_2":int}
)
@pipe_input(
step(_pre_step)
)
def foo(a:int)->Dict[str,int]:
return {"field_1":1, "field_2":2}
we run into an InvalidDecoratorException: Expected a single node to transform, but got 2.
when trying to use post_pipe
. Since mutate
is an extension of post_pipe
we get the same issue.
I think this should be solved upstream in post_pipe
and added as as allowed option, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also opens up the discussion from my notes, using with other decorators we need to decide on an individual basis when mutate
is meant to be applied. For example check_output
should be applied to the result of mutate
, but here mutate
should be applied to the result of exteact_fields
. Sometimes you may even want to mutate
the function first and then extract fields?
This ordering is somewhat tricky since we are not stacking decorators in the usual way on top of a single function where the order is clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is difficult to get right and I'd prefer not to support it for that reason. @jernejfrank your original implementation would have made this easier (it worked on strings), but we're really just reusing the pieces we have to not add additional concerns inside the function graph (as we have not really thought about he implication).
Specifically:
- Decorators operate on a per-function basis
- This adds the function-level decorators to functions
- We don't know the node until after we process the. functions
While we could change that, it gets really tricky, and I think it'll spaghettify the code.
To unblock this I'd say we add a target for @mutate
-- simliar to how we do others:
@mutate(other_function, target_=...)
def ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm just to make sure we're on the same wave length I'm thinking about this:
@extract_columns("foo", "bar")
def data_set(...) -> pd.DataFrame:
...
# but I want to use mutate on `foo` and ` bar` -- how do I do that?
@mutate("foo", "bar")
def _normalize(col: pd.Series) -> pd.Series:
...
@extract_fields({"train": pd.DataFrame, "test": pd.DataFrame})
def training_data( ...) -> dict:
...
# but I want to apply the same transforms to each data set
@mutate("train", "test")
def _transforms(df: pd.DataFrame) -> pd.DataFrame:
...
That is -- you want to apply mutate to some prior valid Hamilton function/node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like @skrawcz's suggestion: @mutate((funtion_pointer, node_string),...)
.
It allows us to get out of the spaghetti mess with pure strings. I couldn't find a reasonable way of doing it without either adding another mutate_functions
collector at the graph.py
level or letting the DAG be generated as is and then adding a second transversal through nodes searching for the specific string names. The main problem is that without the function pointer, we lose any locality where the node will be generated, so it is impossible to add them to the current node resolution without the above ways that get around that problem.
API-wise we can have @mutate(Callable | Tuple(Callable, str))
and if it is only a function pointer it gets added as a classical pipe_output
, if it is a Tuple it delays construction until we reach the node level that can be added to it (now we get this info, because the function_pointer has it stored!)
Re: config -- I think that's fine. We just say this is the API
# but I want to use mutate on `foo` and ` bar` -- how do I do that?
@mutate(
(data_set__y, "foo"),
(data_set__y, "bar"),
(data_set__z, "foo"),
(data_set__z, "bar")
)
def _normalize(col: pd.Series) -> pd.Series:
return ...
or to make it less verbose:
@mutate(
(data_set__y, ["foo","bar"], some_dataset_specif_variable=value(10) ),
(data_set__z, ["foo","bar"], some_dataset_specif_variable=source('upstream_node'))
)
def _normalize(col: pd.Series, some_dataset_specif_variable:int) -> pd.Series:
return col*some_dataset_specif_variable
and the fully flexible option:
@mutate(
(data_set__y, "foo", some_dataset_specif_variable=value(10)),
(data_set__y, "bar", some_dataset_specif_variable=value(20)),
(data_set__z, "foo", some_dataset_specif_variable=source('upstream_node1')),
(data_set__z, "bar", some_dataset_specif_variable=source('upstream_node2'))
)
def _normalize(col: pd.Series, some_dataset_specif_variable:int) -> pd.Series:
return ...
Practically, I think this would then add a tag on the decorator level and we just need to be sure that it is the last NodeTransformer
decorator during the node resolution. Then the above use-cases can be handles easily since we get nodes passed into and have all the info we need / the config stuff has already been resolved.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@elijahbenizzy
target
has other meaning with other decorators, so I wouldn't use that here.One attempt at your point:
@extract_columns("foo", "bar") def data_set(...) -> pd.DataFrame: ... # but I want to use mutate on `foo` and ` bar` -- how do I do that? @mutate((dataset, "foo"), (dataset, "bar")) def _normalize(col: pd.Series) -> pd.Series: ... @extract_fields({"train": pd.DataFrame, "test": pd.DataFrame}) def training_data( ...) -> dict: ... # but I want to apply the same transforms to each data set @mutate((training_data, "train"), (training_data, "test")) def _transforms(df: pd.DataFrame) -> pd.DataFrame: ...but function pointers break down with
@config.when
which is a common decorator:@extract_columns("foo", "bar") @config.when(x="y") def data_set__y(...) -> pd.DataFrame: ... @extract_columns("foo", "bar") @config.when(x="z") def data_set__z(...) -> pd.DataFrame: ... # but I want to use mutate on `foo` and ` bar` -- how do I do that? @mutate((???, "foo"), (???, "bar")) def _normalize(col: pd.Series) -> pd.Series: ...Otherwise Hamilton should process the nodes in order they were encountered. So I don't think it needs to be two pass if that assumption is true. This feature is only 50% useful to Hamilton if it can only depend on function pointers.
Edge cases to consider:
- what happens if node it depends on doesn't exist? e.g. via
@config.when
or user error, etc.
@skrawcz @jernejfrank target
is explicitly meant for this. Semantically it's identical. If we append a chain, target
provides the node on which the chain is appended/nodes are modified.
Furthermore, it's partially wired in. @post_pipe
is a SingleNodeNodeTransformer
. This, by default, means that it will apply to all the "sink" nodes (target=None
). We limit it to just a single sink node, so we don't have to have target.
Instead, we should be able to alter these decorators to be a NodeTransformer
(superclass of SingleNodeNodeTransformer
) and thus take in target. And wiring through the target is then the exact same parameter. See this.
So IMO target is more consistent, and pretty much the same as tuples above. Should be straightforward to get this to work, if we want to support this.
Then there's the question of what to do if a target isn't applied -- my guess (and we'd have to test) is that if we make pipe_input
and pipe_output
a NodeTransformer
, and use the default target (of None
-- implies all "sink" nodes get decorated"), then we'll actually be able to just apply the same pipe operation to every column. Cool!
E.G.
@extract_fields({"train": pd.DataFrame, "test": pd.DataFrame})
def training_data( ...) -> dict:
...
# but I want to apply the same transforms to each data set
@mutate(training_data) # will, by default, apply to each field
def _transforms(df: pd.DataFrame) -> pd.DataFrame:
...
Regarding config
, I think we just need to specify both functions. We may be able to use a string for a function and look for anything in the same module to see if it matches (E.G. without the double underscore), but we should save this edge case for a bit later.
Otherwise, re: second pass, it's subtle as to why it has to be done. It's not because it carries references to other nodes, it's because running this decorator requires changing the other node's names. E.G. you have a node foo
. Then you pipe_output
foo
so you have foo_raw
-> foo
. Changing the name of foo
to foo_raw
is something that has to be done outside of the current decorator framework.
Conceptually, this is equivalent to allowing decorators to modify any part of the entire DAG, rather than individual subdags produced by functions, which makes it tougher to reason about, and requires n passes (one for each decorator). I'd prefer to avoid this for now, and I actually quite like the function-pointer approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thought -- target
is actually overused in Hamilton. We have:
target
being the parameter we apply to (see data loaders)target
being the output node we use -- see data savers, data validators, check_output, tag, etc...)
(2) is the more used case, and corresponds internally, but we don't have to name it target
if it's confusing. Just that we should use the internal concept of target
to implement, and that'll make it easier.
An idea, re: config. I think this should work:
@extract_columns("foo", "bar")
@config.when(x="y")
def data_set__y(...) -> pd.DataFrame:
...
@extract_columns("foo", "bar")
@config.when(x="z")
def data_set__z(...) -> pd.DataFrame:
...
# but I want to use mutate on `foo` and ` bar` -- how do I do that?
@mutate('data_set', "foo", "bar") # however you want to specify foo/bar, we have a few different patterns
def _normalize(col: pd.Series) -> pd.Series:
...
Specifically, if we have a string, it can refer to:
- The name of the function in the same module as the current one
- The name of the function without
__
(I think we might even track it...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@skrawcz @jernejfrank
target
is explicitly meant for this. Semantically it's identical. If we append a chain,target
provides the node on which the chain is appended/nodes are modified.Furthermore, it's partially wired in.
@post_pipe
is aSingleNodeNodeTransformer
. This, by default, means that it will apply to all the "sink" nodes (target=None
). We limit it to just a single sink node, so we don't have to have target.Instead, we should be able to alter these decorators to be a
NodeTransformer
(superclass ofSingleNodeNodeTransformer
) and thus take in target. And wiring through the target is then the exact same parameter. See this.So IMO target is more consistent, and pretty much the same as tuples above. Should be straightforward to get this to work, if we want to support this.
Awesome, thanks @elijahbenizzy for drawing my attention to the superclass. That is a very clean way to resolve this dilemma and does exactly what we want to achieve with tuples. I'll go ahead and implement it quickly to have a concrete case that we can further discuss if something feels off and we initially missed.
Then there's the question of what to do if a target isn't applied -- my guess (and we'd have to test) is that if we make
pipe_input
andpipe_output
aNodeTransformer
, and use the default target (ofNone
-- implies all "sink" nodes get decorated"), then we'll actually be able to just apply the same pipe operation to every column. Cool!E.G.
@extract_fields({"train": pd.DataFrame, "test": pd.DataFrame}) def training_data( ...) -> dict: ... # but I want to apply the same transforms to each data set @mutate(training_data) # will, by default, apply to each field def _transforms(df: pd.DataFrame) -> pd.DataFrame: ...
pipe_input I don't think we need to touch (it's a NodeInjector now) and I am not sure how we would use it with mutate, can you expand where it comes into play @elijahbenizzy ?
But I did change pipe_output to the superclass NodeTransformer and happy to report that your above hunch is correct. In case no target is specified, at least for extract_field, it gets applied to both "train" and "test" (some minor caveats around adopting the name from the node instead of the function in case namespace is "...", but that should be ok).
Regarding
config
, I think we just need to specify both functions. We may be able to use a string for a function and look for anything in the same module to see if it matches (E.G. without the double underscore), but we should save this edge case for a bit later.
I will start off simple that you need to input both functions and if everything works/we are happy with the API, this can be changed at the end if that's alright with you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! You are correct, @pipe_input
does not apply to this, although it does apply to the other definition of target
-- adding a parameter target
would make sense here IMO:
def __init__( |
But that's a separate issue, so I created this: #1161.
LMK what you need/when you want a review!
Re Edge cases -- 1., 3. are ok! For 2.,4.,5. -- we could add the flag Basic:
Advanced
|
I think it's OK to document caveats here. This is new functionality, and as long as the API is stable I'm happy. Cross-module, to me, might be another decorator Agreed that transforms should be automatically hidden, whether you use underscores or not (E.G. marked to ignore by hamilton). Re: additional config, that should be easy enough to add, we do the same thing with This should work the same way pipe_output should layer decorators -- but re: decorating the actual |
Nice work! Given that this is an in-progress PR, I'll leave the following feedback:
|
Ok, this is very much WIP and my stab at this -- very open to just hearing what I did is ridiculous and I shouldn't complicate things so much :) Effectively, I tried to mimick the pipe API and wiring things in the background. Before I go deeper down the rabbit hole with edge cases, tests, docs, etc., would be good to hear your thoughts on it. No real decision around naming yet, just put some placeholders that allow me to keep things apart. |
@jernejfrank happy to look at the implenentation next, first took a look at the API. I really like it! Only question I have -- the To me, it would be clearer as two separate decorators (this should be allowed, right?). Think you had it. OTOH, it's a bit weird if it's split out between decorators like it is here ( Thoughts? @pipe_output(
step(_pre_step).named(name="b")
target=["field_1"]
)
@pipe_output(
step(_pre_step).named(name="b").targeted("field_1"),
target=["field_3", "field_2"]
)
@extract_fields(
{"field_1":int, "field_2":int, "field_3":int}
)
def foo(a:int)->Dict[str,int]:
return {"field_1":1, "field_2":2, "field_3":3} |
Nice, glad it is going in the right direction.
Re: two separate decorators -- actually we cannot stack two Also note that this would not work (added comment to your code): @pipe_output(
step(_pre_step).named(name="b")
target=["field_1"]
)
"field_1")
@pipe_output(
step(_pre_step).named(name="b").targeted("field_1"),
target=["field_3", "field_2"] #this make the global subset to "field_2", "field_3" which means that "_pre_step.target("field_1")" gets ignored (no global
)
@extract_fields(
{"field_1":int, "field_2":int, "field_3":int}
)
def foo(a:int)->Dict[str,int]:
return {"field_1":1, "field_2":2, "field_3":3} The functionality ATM is the following: global @pipe_output(
step(_foo),
step(_bar),
# no global target here means `foo`, `bar` get applied to all output nodes "field_1", "field_2", "field_3"
@extract_fields(
{"field_1":int, "field_2":int, "field_3":int}
)
def fn(a:int)->Dict[str,int]:
return {"field_1":1, "field_2":2, "field_3":3} @pipe_output(
step(_foo),
step(_bar),
target=["field_1","field_2"]
# global target here means `foo`, `bar` get applied to "field_1" and "field_2"
@extract_fields(
{"field_1":int, "field_2":int, "field_3":int}
)
def fn(a:int)->Dict[str,int]:
return {"field_1":1, "field_2":2, "field_3":3} @pipe_output(
step(_foo).targeted("field_1"), # this makes `foo` only applied to "field_1"
step(_bar), # this gets applied to "field_1", "field_2", "field_3"
# no global target here means `foo`, `bar` get applied to all output nodes "field_1", "field_2", "field_3"
@extract_fields(
{"field_1":int, "field_2":int, "field_3":int}
)
def fn(a:int)->Dict[str,int]:
return {"field_1":1, "field_2":2, "field_3":3} @pipe_output(
step(_foo).targeted("field_1"), # this makes `foo` only applied to "field_1"
step(_bar).targeted("field_2"), # this makes `bar` only applied to "field_2"
target=["field_1","field_2"]
# global target here means `foo`, `bar` get applied to "field_1" and "field_2"
@extract_fields(
{"field_1":int, "field_2":int, "field_3":int}
)
def fn(a:int)->Dict[str,int]:
return {"field_1":1, "field_2":2, "field_3":3} @pipe_output(
step(_foo).targeted("field_1"), # this makes `foo` only applied to "field_1", but there is not "field_1" since only "field_3" available, so it is skipped
step(_bar).targeted("field_2"), # this makes `bar` only applied to "field_2", but there is not "field_2" since only "field_3" available, so it is skipped
target="field_3"
# global target here means `foo`, `bar` get applied to "field_3"
@extract_fields(
{"field_1":int, "field_2":int, "field_3":int}
)
def fn(a:int)->Dict[str,int]:
return {"field_1":1, "field_2":2, "field_3":3} I hope this is not too confusing. |
These two functionalities achieve the same thing %%cell_to_module -m pipe_output_with_target --display
from typing import Dict
from hamilton.function_modifiers import extract_fields, pipe_input, pipe_output, step
def _pre_step(something:int)->int:
return something + 10
def _post_step(something:int)->int:
return something + 100
def _something_else(something:int)->int:
return something + 1000
def a()->int:
return 10
@pipe_output(
step(_something_else),
step(_pre_step).named(name="b").targeted("field_1"),
step(_post_step).named(name="a").targeted(["field_1", "field_3"]),
)
@extract_fields(
{"field_1":int, "field_2":int, "field_3":int}
)
def foo(a:int)->Dict[str,int]:
return {"field_1":1, "field_2":2, "field_3":3}
@pipe_output(
step(_something_else),
target="b_field_2"
)
@pipe_output(
step(_something_else),
step(_post_step).named(name="a"),
target="b_field_3"
)
@pipe_output(
step(_something_else),
step(_pre_step).named(name="b"),
step(_post_step).named(name="a"),
target="b_field_1"
)
@extract_fields(
{"b_field_1":int, "b_field_2":int, "b_field_3":int}
)
def bar(a:int)->Dict[str,int]:
return {"b_field_1":1, "b_field_2":2, "b_field_3":3} Either we have one |
Interesting... So I'm a fan of the second, but I'm curious if we can do these modifications:
I'm not 100% following the point about erroring out with duplicates. Do you have an example of where that would happen? |
Funny, I prefer the first one :) I would rename I also added a restriction that we can either use target the level of pipe and it applies it globally or if we apply it for a step, then we cannot have the global target anymore.
This is already allowed -- you can pass in list of nodes or if target=None it gets applied to all sink nodes. The same rules apply as NodeTransformer
I am not sure what you mean by that. I did have to change namespace already a bit (not fully tested) and use node_.name to avoid collisions (maybe we are thinking the same thing here). I think there re two collision issues, one on the level of nodes and the other if we call multiple pipe_outputs how to handle the original_node_raw --> this one I find problematic because now you don't know that there is another pipe coming because tey get applied sequentially.
This breaks down with extract_field because of type mismatch, the original node will give you dict and the extracted field is a different type.
This would not work wherever we place @pipe_output(
step(_something_else),
)
@pipe_output(
step(_post_step).named(name="a"),
target="b_field_3"
)
@pipe_output(
step(_pre_step).named(name="b"),
step(_post_step).named(name="a"),
target="b_field_1"
)
# @pipe_output(
# step(_something_else),
# )
@extract_fields(
{"b_field_1":int, "b_field_2":int, "b_field_3":int}
)
def bar(a:int)->Dict[str,int]:
return {"b_field_1":1, "b_field_2":2, "b_field_3":3} But here you get it applied to all fields without needing repetition and then the other ones just limit to nodes they should target @pipe_output(
step(_something_else),
step(_pre_step).named(name="b").targeted("field_1"),
step(_post_step).named(name="a").targeted(["field_1", "field_3"]),
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good -- I think it's worth thinking about how to subclass the Applicable
stuff so we can share code -- a lot of commonalities between them.
""" | ||
|
||
@classmethod | ||
def _single_target_level(cls, target: base.TargetType, transforms: Tuple[Applicable]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd call this validate_single_target_level
applicable = step(mutating_fn, **remote_applicable_builder.mutating_fn_kwargs) | ||
|
||
if remote_applicable_builder.when_key_value_pairs is not None: | ||
applicable = applicable.when(**remote_applicable_builder.when_key_value_pairs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, so I think if we make applicable allowed to have a None
function it might just solve this? Otherwise it's the same, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to take another look at the implementation with fresh eyes, but it's looking good. My general take is:
- The example is a bit abstract -- it serves more as documentation. Not sure what time you have, but if you come up with a motivating example (E.G. feature engineering) it might ring true. It's a good test/dev env/smoke screen, but I think that you may want to come up with a narrative that helps justify it.
- Worth documenting assumptions -- the code is clean but there are some clever things (E.G. the
Applicable
stuff) that we want to make sure is clear as to why - Related -- I'd try to highlight the value in the comments/docstring/have them relate to the code examples (E.G. draw on your own experience). I think you can use hte scikit-learn example you did for post-pipe
- Also, focus on the fact that we can do cool inetractive, ad-hoc stuff now!
Should this just be added into the code via comments/docstrings?
Here is a crazy idea: I feel
I played around a bit with cross-module and I think it will take a bit longer to get there (some hickups around namespace/naming stage of nodes in pipe_output) and a bit more thought how we want that API to be (like would mutate from another module always be appended at the end or can we push it somewhere between other existing pipe steps). What are your thought on limiting this to same module and having the cross-module a seperate PR? (We alrady tinkered with |
@elijahbenizzy @jernejfrank my thoughts:
So commit messages & doc strings -- where appropriate. Commits are nice because they can cover multiple places in code, where as doc strings cover just that point.
Yes more dev --> but the idea is that we can provide prod introspection, so ideally you wouldn't need to change things. So I wouldn't expect people to translate it to pipe_output. So any example I think would just focus on how you might use this functionality to get to a place you're happy with quickly and easily. Side note: we need to think through how this would impact being used with caching -- I don't think it should be just something to test.
Agreed. Let's limit scope to just within a module. We can always expand it later; easier to add things than to remove them. |
@skrawcz I mostly agree with it, but you have some limits with mutate that made me think it might not be a bad idea to suggest that once mutate achieves a satisfactory result, it is rational to refactor it into pipe_output to remain flexible; two cases come to mind:
@pipe_output(
normalise,
something_funky,
normalise
)
def foo():
...
@pipe_output(
something_funky,
normalise
)
def bar():
... |
Right -- but the following should work though right? @mutate(foo)
def _normalize(df: pd.DataFrame) -> pd.DataFrame:
...
@mutate(foo)
def _something_funky(df: pd.DataFrame) -> pd.DataFrame:
...
@mutate(foo)
def _normalize(df: pd.DataFrame) -> pd.DataFrame:
... If you haven't been keeping track already (haven't reviewed the code sorry), there should be a set of tests/examples showing edge cases and how we want to handle them... |
So I think the edge-case of the same can be supported -- happy to help you figure this out. But I would argue that this is genearlly a good dev feature (for the use-case of mutating inside a notebook), but also should work for prod as well. That said, for anything that we don't expect to change, it's good to push people towards |
Oh wow @skrawcz, this actually works, nice!
There's an abstract example under new folder mutate at the moment where all these edge cases are in seperate scripts / notebook and I add to it as we go along. There's a more realistic example where I converted the pipe_output to mutate from the sci-kit script I used last time. Also planning to play around more with this on some ML things, I'll open another PR to get it added once finished/polished. |
When multiple end nodes are availible we can choose onto which we will apply pipe_output. Can be multiple. Changes: - Applicable gets the method on_output that allows to specify target nodes - pipe_output inherits from NodeTransformer - pipe_output the namespace is resolved in Node level name to avoid collisions - if on_output is global it limits the all the steps to the same subset, we prohibit additionally for steps to have clear specification of the end nodes Tested and documented.
We can apply functions to nodes without touching it Mutate recieves functions as arguments and builds for those target functions a pipe_output. If the target function already has a pipe_output it will append the mutate function as the last step.
fdaab48
to
325e613
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth adding a quick header to this notebook to dmeo what you'll learn (also saying it's a bit more abstract). That's repeated info from before, but definitely valuable to have.
@@ -883,7 +922,11 @@ def __init__( | |||
# super(flow, self).__init__(*transforms, collapse=collapse, _chain=False) | |||
|
|||
|
|||
class pipe_output(base.SingleNodeNodeTransformer): | |||
class SingleTargetError(Exception): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment on this to know when it should be raised
return Applicable(fn=None, args=(), kwargs=mutating_fn_kwargs, target_fn=fn_, _resolvers=[]) | ||
|
||
|
||
class NotSameModuleError(Exception): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto -- docstring sto show where it's being used
.. code-block:: python | ||
:name: Simple @mutate example with multiple arguments | ||
|
||
@mutate(A,B,arg2=step('upstream_node'),arg3=value(some_literal),...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit -- spaces in parameters for formatting for @mutate
call
_chain: bool = False, | ||
**mutating_function_kwargs: Union[SingleDependency, Any], | ||
): | ||
"""Instantiates a `\@mutate` decorator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that RST wants double-backtick
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! A few minor points but let's merge tomorrow + ship in the next release.
Something to think about/a caveat:
- If
@mutate
runs once it's good - If you run that again, it might double-up
So in the juptyer notebook/cross-module case (which, admittedly, is not supported now), we'll want to guard against it.
The problem is how we remove the old one once we change it... Thoughts? Not critical (although document now), but will be for the cool ad-hoc stuff we want to do.
325e613
to
683fec0
Compare
So just to clarify, will this impact the notebook dev experience? E.g. I use the same cell with multiple mutates and continually add to it -- would this impact things? If so I think that should be addressed before we release, since that's a core way people use Hamilton... |
I tried this and didn't run into any duplication issues, but I don't dev in Jupyter so not sure I replicated a typical workflow. I tried the following: %%cell_to_module -m mutate_on_output_2
from typing import Any,Dict, List
import pandas as pd
from hamilton.function_modifiers import extract_fields, extract_columns, apply_to, mutate, value, source
def data_1() -> pd.DataFrame:
df = pd.DataFrame.from_dict({"col_1": [3, 2, pd.NA, 0], "col_2": ["a", "b", pd.NA, "d"]})
return df
# # data1 and data2
# @mutate(data_1)
# def _filter(some_data:pd.DataFrame)->pd.DataFrame:
# return some_data.dropna()
# @mutate(data_1, missing_row=value(['c', 145]))
# def add_missing_value(some_data:pd.DataFrame, missing_row:List[Any])->pd.DataFrame:
# some_data.loc[-1] = missing_row
# return some_data
# @mutate(data_1,)
# def sort(some_data:pd.DataFrame)->pd.DataFrame:
# columns = some_data.columns
# return some_data.sort_values(by=columns[0]) Uncommented one at a time and re-run the cell (didn't restart notebook or re-import module) and re-built the driver in the next cell. This makes me think it is contained to cross module. If I try to have mutate in a different cell (like it is a different module) I get blocked by the error that it has to be in the same module. |
Yep, agreed, definitely contained to cross-module. The reason that works is because |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Only nit is that the docs formatting is a bit off, DMed @jernejfrank
Testing with unit tests inidividual pieces and examples have a new smoke test / abstract exampe how to use mutate. In that example are also edge cases and how they are handled or if they are supported. Added a more user friendly example by converting the existing pipe_output example and replaced the pipe_output implementation by mutate. Light refactoring / improved naming and added TODOs.
683fec0
to
6f35ab9
Compare
So I figured out that if we use function pointers to pass into the decorator we can append to them within this decorator.
Addressing: #922
I think this makes life much easier, but let me know if I simplified my life too much and this can lead to potential troubles down the line.
Changes
How I tested this
species_distribution_modeling
Checklist