Vector functions vs elementwise function #94
Comments
This came up in Slack the other day, and some of the discussion there helped me clarify my thinking a bit, so I wanted to preserve it for posterity. The underlying issue is that 1) we want to support functions like So, if the design of the API precludes treating column-level transformations as "standard" Julia functions, then we're left with a design choice:
The status quo is 2., which I still prefer because I think users would be very confused if some functions operated "normally" but others were "special". They would get frustrated when they write a function that takes a whole column, try to use it in a formula, and get weird (or even worse, invisibly wrong) results when they e.g. try to generate model predictions on held-out data. And even though the conventions I've adopted in #71 differ from idiomatic Julia, the formula DSL is a DSL, and it differs from idiomatic Julia in many other ways (e.g., #99). I think the ergonomics of the current situation could be much improved, e.g. by adding a "column-level" wrapper term type that would encapsulate a function call that can safely be applied to a whole column, and maybe even by defining an API for how column-level terms that need data invariants (like categorical, which needs to know the unique levels, or
Note that we could also go with 1., but throw an error when a non-elementwise function is called, unless it's special-cased. That would probably be the clearest and the safest solution: otherwise some functions may appear to work, but be applied to each element instead of the whole vector (e.g. But of course the drawback of that approach is that it would be inconvenient:
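To make the "appears to work, but is applied to each element" failure mode concrete, here is a minimal sketch in plain Julia. It does not use the formula machinery at all; `center` is a hypothetical user-written column-level function, and the comprehension only imitates what elementwise application would do.

```julia
# Hypothetical column-level function: subtract the column mean.
# It is written with sum/length so that it also accepts a single number.
center(x) = x .- sum(x) / length(x)

col = [1.0, 2.0, 6.0]

# Applied to the whole column, as the author of `center` intended.
whole = center(col)

# Applied one element at a time, imitating elementwise application.
# The "mean" of a single number is that number, so every result is zero
# and no error is raised.
per_element = [center(v) for v in col]
```

Both calls succeed, which is exactly why the second result is easy to miss.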
Patsy (a Python "formula" package) makes a distinction between functions (which can be applied elementwise) and "stateful transforms", which need to know some invariants of the data (e.g., center or standardize/scale). There's a good discussion of why this distinction is necessary in the docs, which is much clearer than my arguments above: https://patsy.readthedocs.io/en/latest/stateful-transforms.html That might be a useful abstraction, especially for developing first-class support for streaming data while still being extensible. Note that
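A rough sketch of what the stateful-transform idea could look like in Julia, assuming a two-step "learn the invariant, then apply it" design. The names `Center`, `fit_center`, and `apply_center` are hypothetical and are not part of any existing package API.

```julia
# Hypothetical stateful transform: remembers the mean seen at fit time.
struct Center
    mean::Float64
end

# Fit step: learn the invariant (the mean) from the training column.
fit_center(x::AbstractVector{<:Real}) = Center(sum(x) / length(x))

# Apply step: works on a single value or on a whole column,
# always using the stored training mean rather than recomputing it.
apply_center(c::Center, x::Real) = x - c.mean
apply_center(c::Center, x::AbstractVector{<:Real}) = x .- c.mean

train = [1.0, 2.0, 6.0]
heldout = [3.0, 5.0]

c = fit_center(train)
centered_heldout = apply_center(c, heldout)  # centered on the training mean
one_row = apply_center(c, 4.0)               # also usable row by row
```

Because the invariant is stored once, the same transform can be applied row by row to streaming or held-out data without recomputing the mean on the new data.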
In the current implementation, transformations (such as `log`) are applied elementwise. AFAIU, this allows `StatModels` to work with any streaming interface, not just `DataFrame`. However, this has two drawbacks:
log.(x)
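For reference, a small sketch of the two styles being compared, in plain Julia. The row representation (a vector of `NamedTuple`s) is only an assumption standing in for a streaming source; it is not how the package necessarily represents rows.

```julia
# Column-oriented data: idiomatic Julia broadcasts over the vector.
x = [1.0, 10.0, 100.0]
by_column = log.(x)

# Row-oriented (streaming) data: each row arrives as a NamedTuple,
# so the scalar function is applied to one value at a time.
rows = [(x = v,) for v in x]
by_row = [log(row.x) for row in rows]
```

For a purely elementwise function like `log`, both routes give the same values; the difference only matters for functions that need the whole column.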
This may be fine, but I just wanted to have a discussion on whether it was the right path going forward. See also:
#75 (comment)
#71 (comment)