Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

subdag decorator: Variable not recognized #115

Closed
gms101 opened this issue Mar 18, 2023 · 5 comments
Closed

subdag decorator: Variable not recognized #115

gms101 opened this issue Mar 18, 2023 · 5 comments
Labels
triage label for issues that need to be triaged.

Comments

@gms101
Copy link

gms101 commented Mar 18, 2023

When I add a subdag decorator to a function, the function itself is not able to recognise variables that comes prior in the dag(as declared by the main driver).

Stack Traces

https://fdm-bm77109.slack.com/archives/C04RWSYME79/p1679167875228639

Screenshots

Assume this the main driver function, here config_fetching module fetches a variable called backtesting_lookback_period

if __name__ == "__main__":
    adapter = base.SimplePythonGraphAdapter(base.DictResult())
    app_config = AppConfig(".FDm/config.toml")
    strategy = "L1R"
    app_config.set_selected_strategy(strategy)
    dr = driver.Driver({"app_config":app_config}, config_fetching,
                       construct_optimisation_window_data,
                       adapter=adapter)
    result = dr.execute(["optimisation_window_data"])
    print(result)
    # dr.display_all_functions("graph.dot", {})

And the construct_optimisation_window_data module looks like this

@subdag(load_price_data,inputs={})
def optimisation_window_data(end_to_end_price_for_all_ticker: pd.DataFrame,
                             backtesting_lookback_period: int) -> int:
    return backtesting_lookback_period

When I run the driver module, I get this error

ValueError: 1 errors encountered:
  Error: Required input optimisation_window_data.backtesting_lookback_period not provided for nodes: ['optimisation_window_data'].

So, essentially, the decorated function is not able to read backtesting_lookback_period node - which comes before in the DAG compilation.

@gms101 gms101 added the triage label for issues that need to be triaged. label Mar 18, 2023
@elijahbenizzy
Copy link
Collaborator

OK, so I think this is something that one should be allowed to do, but I think we want to be careful about how we enable it.

Reasons to do it

  • intuitively it should work
  • this will match the notion of "adapters" we're planning (where we inject some inputs based on some set of external nodes)
  • reduce verbosity

Reasons not to do it

  • Not super readable as to where nodes come from -- hard to tell if they come from inside or outside the DAG
  • There could be ambiguity. E.G. if the subdag is also part of the DAG (probably not good practice), its doubly unclear which one its accessing

Possibilities

  1. Allow it by default
  2. Allow it with a flag (search_externally_for_inputs, but with a better name)
  3. Allow it with a set of sources flagged for external_inputs (@subdag(..., external_inputs=["backtesting_lookback_period"])
  4. Allow an inject parameter that can be used as one would inject: @subdag(..., inject={'backtest_lookback_period' : source('backtest_lookback_period')

I'm voting for (3), as it doesn't require much complexity, adds significant readability, and unblocks the user.

@skrawcz
Copy link
Collaborator

skrawcz commented Mar 18, 2023

Just to clarify we're talking about how one would interpret this function -- where would the declared function dependencies come from, right? In particular as the error states "backtesting_lookback_period" right?

now:

@subdag(load_price_data,inputs={})
def optimisation_window_data(end_to_end_price_for_all_ticker: pd.DataFrame,
                             backtesting_lookback_period: int) -> int:
    return backtesting_lookback_period

This assumes both declared dependencies come from the subdag.

pass it in as an input

@subdag(load_price_data,inputs={"backtesting_lookback_period": source("backtesting_lookback_period")})
def optimisation_window_data(end_to_end_price_for_all_ticker: pd.DataFrame,
                             backtesting_lookback_period: int) -> int:
    return backtesting_lookback_period

This I think works today, right?

allow by default

Hard to know what comes from where. What if the subdag has one defined and the current one also has backtesting_lookback_period defined?

allow with flag

@subdag(load_price_data,inputs={}, fill_with_external_sources=True)
def optimisation_window_data(end_to_end_price_for_all_ticker: pd.DataFrame,
                             backtesting_lookback_period: int) -> int:
    return backtesting_lookback_period

This is clear -- but not clear which one is satisfied. Would silently change behavior if the underlying DAG suddenly included such a node.

specify what parameter(s) do not come from the subdag

@subdag(load_price_data,inputs={}, external=["backtest_lookback_period"])
def optimisation_window_data(end_to_end_price_for_all_ticker: pd.DataFrame,
                             backtesting_lookback_period: int) -> int:
    return backtesting_lookback_period

This seems pretty clear. It will get as long as the argument list though.

@subdag(load_price_data,inputs={}, inject={"backtest_lookback_period": source("backtest_lookback_period")])
def optimisation_window_data(end_to_end_price_for_all_ticker: pd.DataFrame,
                             backtesting_lookback_period: int) -> int:
    return backtesting_lookback_period

This seems like inputs above. So likely not useful to have separate.

@skrawcz
Copy link
Collaborator

skrawcz commented Mar 19, 2023

More ideas:

Use * to delineate what's from the sub-DAG and what's from the outer DAG.

@subdag(load_price_data,inputs={})
def optimisation_window_data(
                             end_to_end_price_for_all_ticker: pd.DataFrame, *, 
                             backtesting_lookback_period: int) -> int:
    return backtesting_lookback_period

We don't have this construct now. So this would be new. But at least from a signature it would be clear. It would mean
that you couldn't request two things that are named the same from the sub-DAG and outer-DAG, but that seems fine?

Don't allow "joining" of DAGs via the parameters if @subdag is used.

Make another function that combines/transforms outputs from the sub-DAG.

@subdag(load_price_data,inputs={})
def optimisation_window_data_full(
                             end_to_end_price_for_all_ticker: pd.DataFrame) -> pd.DataFrame:
    return end_to_end_price_for_all_ticker

# second function to do the joining (could also have this function within the subdag too)
def optimisation_window_data(
                             optimisation_window_data_full: pd.DataFrame, backtesting_lookback_period: int) -> int:
    return optimisation_window_data_full[backtesting_lookback_period]  # or whatever the logic should be

This is more explicit. Maybe this is the simpler way to go for now?

@elijahbenizzy
Copy link
Collaborator

More ideas:

Use * to delineate what's from the sub-DAG and what's from the outer DAG.

@subdag(load_price_data,inputs={})
def optimisation_window_data(
                             end_to_end_price_for_all_ticker: pd.DataFrame, *, 
                             backtesting_lookback_period: int) -> int:
    return backtesting_lookback_period

We don't have this construct now. So this would be new. But at least from a signature it would be clear. It would mean that you couldn't request two things that are named the same from the sub-DAG and outer-DAG, but that seems fine?

Don't allow "joining" of DAGs via the parameters if @subdag is used.

Make another function that combines/transforms outputs from the sub-DAG.

@subdag(load_price_data,inputs={})
def optimisation_window_data_full(
                             end_to_end_price_for_all_ticker: pd.DataFrame) -> pd.DataFrame:
    return end_to_end_price_for_all_ticker

# second function to do the joining (could also have this function within the subdag too)
def optimisation_window_data(
                             optimisation_window_data_full: pd.DataFrame, backtesting_lookback_period: int) -> int:
    return optimisation_window_data_full[backtesting_lookback_period]  # or whatever the logic should be

This is more explicit. Maybe this is the simpler way to go for now?

* for delineation is cute but I think a little too magical. Not allowing joins is the current approach, but as @gms101 points out, there's a subtle internal inconsistency... E.G. the internal nodes can refer to something outside the subDAG, but the final subDAG node can't (which is, in essence, part of the DAG). That said, I think that, given that we're using the parameters to specify the outputs of the subDAG about which we care, its worth being explicit. The other obvious workaround is to refer to that node inside the DAG -- E.G. create the joining node.

I'd almost rather flip the script (although I'm not sure this makes sense now that we've already done it):

@subdag(load_price_data,inputs={}, from_subdag=["end_to_end_price_for_all_ticker"])
def optimisation_window_data(end_to_end_price_for_all_ticker: pd.DataFrame,
                             backtesting_lookback_period: int) -> int:
    return backtesting_lookback_period

I find this clearer to read, as the non-subdag components will have obvious sources, and the subDAG components will still be easily identifiable. It could be either/or (from_subdag/from_external), allowing one or the other? Another reason I like this is that a lot of decorators serve the purpose of injecting inputs into a function -- subDAG does this, @parameterize, and @inject all do this. So maybe this is the consistent way? The vars from within the subDAG are the weird ones (they refer to a namespace...).

@elijahbenizzy
Copy link
Collaborator

This is now complete, released in 1.21.0!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
triage label for issues that need to be triaged.
Projects
None yet
Development

No branches or pull requests

3 participants