ENH: Add support for exogenous variables in utils.aggregate #294
…riables in creation of Y and Summation dataframes
@KuriaMaingi Thanks for your work! I'll happily take a look :) We use nbdev, which means changes to the code should be made in the source notebooks. To best set up your environment to work on this, I'd advise to:
Now make your changes to the notebook, in your case
@KuriaMaingi to add to the great summary above, you can also use the commands below before exporting.
Thank you for helping with the project, it is nice to have more contributors!
I left a few thoughts before you make the changes in the notebooks and export them.
# Add exog_vars to the aggregation dictionary if it is not None
if exog_vars is not None:
    agg_dict.update({key: (key, exog_vars[key]) for key in exog_vars.keys()})
Could you please give an example usage of exog_vars in this context?
I have not used pandas much lately, but I think that given your type signature of Dict[str, str], you intend to have exog_vars = {"col_a": "sum", "col_b": "sum"}. However, this does not support multiple functions to aggregate a particular column, as you are going down the named aggregation route.
A way around this would be either exog_vars = {"col_a": ("sum", "mean")}, which will create a MultiIndex, or alternatively something like exog_vars = {"col_a_sum": ("col_a", "sum"), "col_a_mean": ("col_a", "mean")}. Either way, the output column name for the aggregation and the name of the column being aggregated need to be kept distinct when inserting into agg_dict, to avoid overwriting anything in this case.
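To make the second suggestion concrete, here is a minimal sketch using plain pandas named aggregation. The DataFrame, the base agg_dict entry, and the overlap check are assumptions made for illustration and are not taken from the PR's actual code.

```python
import pandas as pd

# Toy data; column names are assumptions for illustration only.
df = pd.DataFrame({
    "ds": [1, 2, 1, 2],
    "y": [1.0, 2.0, 3.0, 4.0],
    "col_a": [10.0, 20.0, 30.0, 40.0],
})

# Hypothetical base aggregation: output name -> (source column, function).
agg_dict = {"y": ("y", "sum")}

# Output names are kept separate from source column names, so one column
# can be aggregated with several functions without overwriting anything.
exog_vars = {
    "col_a_sum": ("col_a", "sum"),
    "col_a_mean": ("col_a", "mean"),
}

# Guard against an exogenous output name clobbering an existing entry.
overlap = set(agg_dict) & set(exog_vars)
if overlap:
    raise ValueError(f"Output column names already in use: {sorted(overlap)}")
agg_dict.update(exog_vars)

# Named aggregation: each keyword becomes one flat output column.
result = df.groupby("ds").agg(**agg_dict)
print(result)
```

The resulting columns are flat (y, col_a_sum, col_a_mean), so no MultiIndex is created.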
# Define acceptable aggregation functions
acceptable_aggregations = {
    'sum', 'mean', 'median', 'min', 'max', 'count', 'std', 'var', 'first', 'last'
}
I do not think this is needed; we can just let pandas raise an AttributeError when aggregating rather than raising our own ValueError.
Plus, this gives us the flexibility to use custom (anonymous) functions rather than just string function names.
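As a rough illustration of both points, the sketch below passes a lambda through named aggregation and then tries an unknown string name. The data and column names are made up, and the exact exception type pandas raises may differ between versions, so the example only records it.

```python
import pandas as pd

# Toy data; column names are assumptions for illustration only.
df = pd.DataFrame({
    "ds": [1, 2, 1, 2],
    "col_a": [10.0, 20.0, 30.0, 40.0],
})
grouped = df.groupby("ds")

# A callable works in named aggregation without any allow-list.
result = grouped.agg(col_a_range=("col_a", lambda s: s.max() - s.min()))
print(result)

# An unknown string name is rejected by pandas itself.
try:
    grouped.agg(col_a_bad=("col_a", "not_a_real_aggregation"))
    error_type = None
except Exception as exc:  # exact type may vary across pandas versions
    error_type = type(exc).__name__
print(error_type)
```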
Thanks all for the comments. I will close this and replace it with a new PR following the preferred approach. Thanks
This change to the utility function will assist in instances where you need to generate your summation and Y_df but also want to retain any exogenous vars required for your forecast.
You will need to pass in a dictionary containing your exogenous vars and the pandas agg functions you want applied to them.
I have currently hardcoded the list of acceptable agg_funcs, but I am open to hearing if there's a better way.
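For reference, a small sketch of what such a dictionary might look like and how the dict comprehension from the diff turns it into named-aggregation entries. It uses plain pandas on made-up data rather than utils.aggregate itself, and the base agg_dict entry and column names are assumptions.

```python
import pandas as pd

# Toy data; "price" stands in for an exogenous variable (assumed name).
df = pd.DataFrame({
    "ds": [1, 1, 2, 2],
    "y": [1.0, 2.0, 3.0, 4.0],
    "price": [5.0, 7.0, 6.0, 8.0],
})

# Dict[str, str] form: column name -> pandas aggregation name.
exog_vars = {"price": "mean"}

# Hypothetical base aggregation for the target column.
agg_dict = {"y": ("y", "sum")}

# Same transformation as in the diff: each exogenous column keeps its
# own name as the output name and is aggregated with the given function.
if exog_vars is not None:
    agg_dict.update({key: (key, exog_vars[key]) for key in exog_vars})

result = df.groupby("ds").agg(**agg_dict)
print(result)
```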