Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[speculative idea] enable node redefinitions #65

Closed
HamiltonRepoMigrationBot opened this issue Feb 26, 2023 · 3 comments
Closed

[speculative idea] enable node redefinitions #65

HamiltonRepoMigrationBot opened this issue Feb 26, 2023 · 3 comments
Labels
enhancement New feature or request migrated-from-old-repo Migrated from old repository product idea

Comments

@HamiltonRepoMigrationBot
Copy link
Collaborator

Issue by skrawcz
Sunday Jan 29, 2023 at 06:49 GMT
Originally opened as stitchfix/hamilton#290


Is your feature request related to a problem? Please describe.
it is common outside of hamilton for people to "redefine columns" without mutation in value, e.g. in the case of filtering & joining data sets, the values wont change, but the indexes might. This right now forces one to rename things in hamilton if you want
to stitch together a single DAG.

Describe the solution you'd like
Idea:

  1. module crawling order matters. e.g. data_loaders, transforms, model_pipelines, etc. vs model_pipelines, transforms, data_loaders...
  2. enable one to explicitly "redefine" columns via some sort of decorator. If it's not explicit, Hamilton should error.
  3. then any downstream functions that depend on those columns, would depend on the least previously defined version.
  4. this means you can effectively also treat your modules as stages in a pipeline too.
@redefinition
def date(date:pd.Series) -> pd.Series:
     return pd.datetime(date)

@redefinition
@extract_columns("date", "bar")
def some_filter_step(date:pd.Series, bar: pd.Series, filter: str) -> pd.Series:
     df = pd.DataFrame({"date": date, "bar" : bar})
     df = df[data < filter]
     return df

Describe alternatives you've considered
You rename columns in these cases.

Additional context
Not sure how good an idea this is. But it's something to potentially try out and see how it looks.

@HamiltonRepoMigrationBot
Copy link
Collaborator Author

Comment by skrawcz
Sunday Jan 29, 2023 at 07:05 GMT


Some of this could also be mitigated by reusing a driver/sub-dag reuse e.g.
stitchfix/hamilton#86 (comment)

@HamiltonRepoMigrationBot
Copy link
Collaborator Author

Comment by elijahbenizzy
Sunday Jan 29, 2023 at 20:52 GMT


OK, so this makes implementation/onboarding easier, but doesn't help readability/maintainability of pipeline... However, its nice to not have to draw up a bunch of different functions.

The cool thing about this is this could actually be syntactic sugar for merging nodes together/composing -- doesn't even have to be its own nodes.

Implementation could be tricky -- with the current decorator model (decorator => multiple nodes, no context), we'd have to have some notion of state, otherwise we'd have no idea what to call them. OTOH, we could take a click-like approach, which might make this easier to implement:

What this does is it allows us to attach this function to the end of the date function

def date() -> pd.Series:
    return ['20230101', '20230102', ...']

@date.redefine
def date(date:pd.Series) -> pd.Series:
     return pd.datetime(date)

@date.redefine
@extract_columns("date", "bar")
def some_filter_step(date:pd.Series, bar: pd.Series, filter: str) -> pd.Series:
     df = pd.DataFrame({"date": date, "bar" : bar})
     df = df[data < filter]
     return df

This also clarifies what is being redefined...
The way this would work is:
(1) the date function (all but the last) don't actually get outputted by the module -- they turn into None or get name-mangled in some way
(2) Each date refers to the prior one -- we could use __ if we want them around for later
(3) Each date effectively prepends the DAG created by the upstream node to the overall DAG created by that node.

This gets ugly when we have extract_columns/multiple outputs that get redefined, but I'm happy for that not to be allowed. That said, we could always do:

@some_filter_step.bar.redefine
// or
@some_filter_step.redefine('bar')
def bar(bar: pd.Series) -> pd.Series:
    return bar*2

Which would give us the information we need.

@elijahbenizzy
Copy link
Collaborator

Closing this (for now) as pipe does this in, what I believe to be, a much cleaner way. It gaurentees immutability, but allows chaining/redefinition. see #398

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request migrated-from-old-repo Migrated from old repository product idea
Projects
None yet
Development

No branches or pull requests

2 participants