
Vector functions vs elementwise function #94

Open
matthieugomez opened this issue Mar 22, 2019 · 3 comments

Comments

@matthieugomez
Contributor

matthieugomez commented Mar 22, 2019

In the current implementation, transformations (such as log) are applied elementwise. AFAIU, this allows StatsModels to work with any streaming interface, not just DataFrames. However, this has two drawbacks:

  • a lot of useful transformations, such as creating lagged variables (Implementing First-Difference #86) or converting continuous variables to categorical variables (Specify certain variables as categorical #93), cannot be expressed in the current implementation because they require the entire vector;
  • the syntax is somewhat at odds with the rest of Julia, where one would typically write log.(x).
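To make the first drawback concrete, here is a minimal Julia sketch (lag1 is a toy helper, not anything from StatsModels): log can be applied one element at a time, but a lag needs access to the whole column.

```julia
# An elementwise transformation only ever sees one value at a time:
log.([1.0, 2.0, 3.0])            # each element is independent of the others

# A lag, by contrast, needs neighboring observations, so it cannot be
# written as a scalar function applied row by row (lag1 is a toy helper):
lag1(v) = [missing; v[1:end-1]]
lag1([1.0, 2.0, 3.0])            # [missing, 1.0, 2.0]
```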

This may be fine, but I just wanted to have a discussion on whether it was the right path going forward. See also:
#75 (comment)
#71 (comment)

@kleinschmidt
Member

This came up in Slack the other day, and some of the discussion there helped me clarify my thinking a bit, so I wanted to preserve it for posterity. The underlying issue is that 1) we want to support functions like categorical, scale, and lag/lead/diff, which operate at the level of an entire column, but 2) there are a number of contexts where we can't just apply a standard Julia function to the entire column: row-wise (or even batched streaming) data are one obvious case (since you only have a subset of the observations at once), but another is predict, since you're getting new data.

One of the primary design motivations for #71 Terms 2.0 Son of Terms was to abstract information about the column-level transformations into terms (<:AbstractTerm), which can then work for any amount of data (hence the separation of syntax/schema/data time in the API). This allows us to, e.g., avoid special-casing the handling of categorical values in predict, instead providing an API that anyone can plug into to support arbitrary transformations.

So, if the design of the API precludes treating column-level transformations as "standard" julia functions, then we're left with a design choice:

  1. special-case some column-level functions like scale, categorical, etc., that depend on invariants of the data available at schema time, and change the "function term" syntax to require broadcasting for elementwise application, making column-wise application of normal Julia functions the standard; or
  2. require that all column-level transformations be special-cased using custom terms (as categorical data currently are), and interpret function calls in formulae as elementwise.

The status quo is 2, which I still prefer because I think users would be very confused if some functions operated "normally" but others were "special", and would get frustrated when they write a function that takes a whole column, try to use it in a formula, and get weird (or, even worse, invisibly wrong) results when they e.g. try to generate model predictions on held-out data. And even though the conventions I've adopted in #71 differ from idiomatic Julia, the formula DSL is a DSL, and it differs from idiomatic Julia in many other ways (e.g., #99).

I think the ergonomics of the current situation could be much improved, e.g. by adding a "column-level" wrapper term type that would encapsulate a function call that can safely be applied to a whole column, and maybe even defining an API for how column-level terms that need data invariants (like categorical, which needs to know the unique levels, or scale, which needs to know the summary statistics to use in scaling) can store, extract, and access them. But I don't have time at the moment to really push on that (maybe this summer?).
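One possible shape for such a wrapper, sketched in plain Julia (all names here are hypothetical illustrations, not an actual StatsModels API): the invariants are computed once from the training column and then reused on any amount of new data.

```julia
using Statistics

# Hypothetical sketch, not the StatsModels API: a column-level term that
# stores the data invariants captured at "schema time".
struct ScaleTerm
    mean::Float64
    std::Float64
end

# Compute the invariants from the training column once...
fit_scale(col) = ScaleTerm(mean(col), std(col))

# ...and apply them to any amount of data, e.g. a single row at predict time.
apply(t::ScaleTerm, col) = (col .- t.mean) ./ t.std

t = fit_scale([1.0, 2.0, 3.0])
apply(t, [4.0])   # uses the training mean/std, not statistics of the new data
```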

@nalimilan
Member

Note that we could also go with 1., but throw an error when a non-elementwise function is called, unless it's special-cased. That would probably be the clearest and the safest solution: otherwise some functions may appear to work, but be applied to each element instead of the whole vector (e.g. scale would do that without special casing, and that can be the case for any custom function).

But of course the drawback of that approach is that it would be inconvenient: y ~ x + x^2 + log(z) would become y ~ x + x.^2 + log.(z). Also the mix between + and .^ would be weird.

@kleinschmidt
Member

Patsy (a Python "formula" package) makes a distinction between functions (which can be applied elementwise) and "stateful transforms", which need to know some invariants of the data (e.g., center or standardize/scale). There's a good discussion in the docs of why this distinction is necessary, which is much clearer than my arguments above: https://patsy.readthedocs.io/en/latest/stateful-transforms.html

That might be a useful abstraction, especially for developing first-class support for streaming data while still being extensible. Note that ContinuousTerm is basically a function, while CategoricalTerm is stateful (since it needs to know the number and values of the unique levels of the data).
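The CategoricalTerm case can be sketched the same way in plain Julia (again with hypothetical names, and ignoring contrasts entirely): the levels seen at schema time are remembered, so new data is always coded into the same columns.

```julia
# Hypothetical sketch: a "stateful" categorical term remembers the unique
# levels observed at schema time, so new data is coded against the same
# columns even if it contains only a subset of the levels.
struct LevelsTerm
    levels::Vector{String}
end

fit_levels(col) = LevelsTerm(sort(unique(col)))

# Full dummy coding against the *remembered* levels (contrasts omitted):
code(t::LevelsTerm, col) = [x == l for x in col, l in t.levels]

t = fit_levels(["a", "b", "a", "c"])
code(t, ["b"])    # 1×3 matrix with columns for a, b, c, though only "b" appears
```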


3 participants