
Vector functions vs elementwise function #94

Open
matthieugomez opened this issue Mar 22, 2019 · 3 comments

Comments

@matthieugomez
Contributor

matthieugomez commented Mar 22, 2019

In the current implementation, transformations (such as log) are applied elementwise. AFAIU, this allows StatsModels to work with any streaming interface, not just DataFrames. However, this has two drawbacks:

  • a lot of useful transformations, such as creating lagged variables (Implementing First-Difference #86) or converting continuous variables to categorical variables (Specify certain variables as categorical #93), cannot be expressed in the current implementation because they require the entire vector;
  • the syntax is somewhat at odds with the rest of Julia, where one would typically write log.(x).
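To make the first drawback concrete, here is a minimal Julia sketch (lag1 is a toy helper, not anything from StatsModels): log can be applied one element at a time, but a lag needs access to the whole column.

```julia
# An elementwise transformation only ever sees one value at a time:
log.([1.0, 2.0, 3.0])            # each element is independent of the others

# A lag, by contrast, needs neighboring observations, so it cannot be
# written as a scalar function applied row by row (lag1 is a toy helper):
lag1(v) = [missing; v[1:end-1]]
lag1([1.0, 2.0, 3.0])            # [missing, 1.0, 2.0]
```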

This may be fine, but I just wanted to have a discussion on whether it was the right path going forward. See also:
#75 (comment)
#71 (comment)

@kleinschmidt
Member

This came up in Slack the other day, and some of the discussion there helped me clarify my thinking a bit, so I wanted to preserve it for posterity. The underlying issue is that 1) we want to support functions like categorical, scale, and lag/lead/diff, which operate at the level of an entire column, but 2) there are a number of contexts where we can't just apply a standard Julia function to the entire column: row-wise (or even batched streaming) data are one obvious case (since you only have a subset of the observations at once), but another is predict, since you're getting new data.

One of the primary design motivations for #71 Terms 2.0 Son of Terms was to abstract information about the column-level transformations into terms (<:AbstractTerm), which can then work for any amount of data (hence the separation of syntax/schema/data time in the API). This allows us to, e.g., avoid special-casing the handling of categorical values in predict, instead providing an API that anyone can plug into to support arbitrary transformations.

So, if the design of the API precludes treating column-level transformations as "standard" julia functions, then we're left with a design choice:

  1. special-case some column-level functions like scale, categorical, etc., that depend on invariants of the data available at schema time, and change the "function term" syntax to require broadcasting for elementwise application, making column-wise application of normal Julia functions the standard; or
  2. require that all column-level transformations be special-cased using custom terms (as categorical data currently are), and interpret function calls in formulae as elementwise.

The status quo is 2, which I still prefer because I think users would be very confused if some functions operated "normally" but others were "special", and would get frustrated when they write a function that takes a whole column, try to use it in a formula, and get weird (or, even worse, invisibly wrong) results when they e.g. try to generate model predictions on held-out data. And even though the conventions I've adopted in #71 differ from idiomatic Julia, the formula DSL is a DSL, and it differs from idiomatic Julia in many other ways (e.g., #99).

I think the ergonomics of the current situation could be much improved, e.g. by adding a "column-level" wrapper term type that would encapsulate a function call that can safely be applied to a whole column, and maybe even defining an API for how column-level terms that need data invariants (like categorical, which needs to know the unique levels, or scale, which needs to know the summary statistics to use in scaling) can store, extract, and access them. But I don't have time at the moment to really push on that (maybe this summer?).
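One possible shape for such a wrapper, sketched in plain Julia (all names here are hypothetical illustrations, not an actual StatsModels API): the invariants are computed once from the training column and then reused on any amount of new data.

```julia
using Statistics

# Hypothetical sketch, not the StatsModels API: a column-level term that
# stores the data invariants captured at "schema time".
struct ScaleTerm
    mean::Float64
    std::Float64
end

# Compute the invariants from the training column once...
fit_scale(col) = ScaleTerm(mean(col), std(col))

# ...and apply them to any amount of data, e.g. a single row at predict time.
apply(t::ScaleTerm, col) = (col .- t.mean) ./ t.std

t = fit_scale([1.0, 2.0, 3.0])
apply(t, [4.0])   # uses the training mean/std, not statistics of the new data
```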

@nalimilan
Member

Note that we could also go with 1., but throw an error when a non-elementwise function is called, unless it's special-cased. That would probably be the clearest and the safest solution: otherwise some functions may appear to work, but be applied to each element instead of the whole vector (e.g. scale would do that without special casing, and that can be the case for any custom function).

But of course the drawback of that approach is that it would be inconvenient: y ~ x + x^2 + log(z) would become y ~ x + x.^2 + log.(z). Also the mix between + and .^ would be weird.

@kleinschmidt
Member

Patsy (a Python "formula" package) makes a distinction between functions (which can be applied elementwise) and "stateful transforms", which need to know some invariants of the data (e.g., center or standardize/scale). There's a good discussion in the docs of why this distinction is necessary, which is much clearer than my arguments above: https://patsy.readthedocs.io/en/latest/stateful-transforms.html

That might be a useful abstraction, especially for developing first-class support for streaming data while still being extensible. Note that ContinuousTerm is basically a function, while CategoricalTerm is stateful (since it needs to know the number and values of the unique levels of the data).
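The CategoricalTerm case can be sketched the same way in plain Julia (again with hypothetical names, and ignoring contrasts entirely): the levels seen at schema time are remembered, so new data is always coded into the same columns.

```julia
# Hypothetical sketch: a "stateful" categorical term remembers the unique
# levels observed at schema time, so new data is coded against the same
# columns even if it contains only a subset of the levels.
struct LevelsTerm
    levels::Vector{String}
end

fit_levels(col) = LevelsTerm(sort(unique(col)))

# Full dummy coding against the *remembered* levels (contrasts omitted):
code(t::LevelsTerm, col) = [x == l for x in col, l in t.levels]

t = fit_levels(["a", "b", "a", "c"])
code(t, ["b"])    # 1×3 matrix with columns for a, b, c, though only "b" appears
```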


3 participants