
Standardized parameters #75

Closed · DominiqueMakowski opened this issue Sep 24, 2018 · 20 comments

@DominiqueMakowski commented Sep 24, 2018

Follow-up to this issue, to discuss the possibility of, and methods for, obtaining standardized parameters from models, with a general solution at the StatsModels level.

Several methods exist to compute standardized coefficients:

  1. Coefficient transformation (usually involving the outcome's SD)
  2. Data standardization and model re-fitting

What would be the best way to obtain results equivalent to method 2? (See the sketch below.)

Related to #73
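
For concreteness, a minimal sketch of both approaches using GLM.jl; the data and variable names (df, x1, x2, y) are purely illustrative:

using GLM, DataFrames, Statistics

df = DataFrame(x1 = randn(100), x2 = randn(100), y = randn(100))
m = lm(@formula(y ~ x1 + x2), df)

# Method 1: transform the fitted coefficients (skipping the intercept)
# using the predictors' SDs and the outcome's SD.
std_coefs = coef(m)[2:end] .* [std(df.x1), std(df.x2)] ./ std(df.y)

# Method 2: standardize the data and re-fit.
zscore(v) = (v .- mean(v)) ./ std(v)
m_z = lm(@formula(y ~ x1 + x2), mapcols(zscore, df))
coef(m_z)[2:end] ≈ std_coefs  # true for OLS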

@nalimilan (Member)

Model re-fitting is clearly #73. If you need to access the original data, we need to define a consistent interface, and that depends on #32. But I'm not sure we want to keep all the data used to fit the model in the model itself, since that uses a lot of memory which isn't necessarily useful. So maybe we should store the means and standard deviations of variables somewhere?

@DominiqueMakowski (Author) commented Sep 24, 2018

I agree that storing the whole data seems inefficient.

Storing only characteristics of variables (such as the mean and SD) would do the trick for, I believe, almost all use cases, but there is always the possibility that someone would, for instance, want the coefficients based on a robust standardization (median and MAD), or on normalization (fixed 0–1 range)... Storing all the useful indices (mean, median, SD, MAD, min, max) would work, but feels kind of clunky 😕

Another possibility is an optional, non-default argument (transformation) that would, at fitting time (when the data is available), transform the data and fit the appropriate model. This technique is used in several ML packages (R's caret, for one). The trick would be an extension of this argument (additional_transformation?) that would trigger fitting of both models (with and without the transformation) and return both. This could be quite flexible:

my_transformation(df) = create_personalised_transformation(df)

m = fit(Model, @formula(y ~ X), data, additional_transformation=my_transformation)

# Access both models
m.raw
m.transformed

Or

m = fit(Model, @formula(y ~ X), data, transformation=my_transformation, additional=true)

@nalimilan (Member)

Asking users to supply the required transformation doesn't sound very convenient. We might as well pass the input data again to the function that standardizes the coefficients.

@DominiqueMakowski (Author) commented Sep 24, 2018

Well, it would still be possible for such a transformation argument to accept either a function or a symbol from a set of predefined routines, such as :standardize.

e.g.:
m = fit(Model, @formula(y ~ X), data, transformation=:standardize)

It seems fairly transparent, explicit, and flexible. As for whether it is the most convenient way, I don't know 😄

@kleinschmidt (Member)

I think the best solution is something along the lines of the tidymodels recipes package. In a way, the formula is also a data transformation step. I think it would be good to make StatsModels "transformation agnostic" as much as possible, and allow multiple transformation steps to be chained into a pipeline.
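
A rough sketch of what chaining could look like with plain function composition; standardize_cols and drop_missings are hypothetical names, not a proposed StatsModels API:

using DataFrames, Statistics

# Each hypothetical step is just a table -> table function.
standardize_cols(cols...) =
    df -> transform(df, [c => (v -> (v .- mean(v)) ./ std(v)) => c for c in cols]...)
drop_missings() = df -> dropmissing(df)

# ∘ composes right-to-left: drop missings first, then standardize.
pipeline = standardize_cols(:x1, :x2) ∘ drop_missings()
prepared = pipeline(data)  # then: fit(Model, @formula(y ~ x1 + x2), prepared)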

@kleinschmidt (Member)

Note that if we end up with something like what I've proposed in #71, the schema for the data source there includes the mean/std of continuous variables (which in turn is inspired by JuliaDB.ML). It wouldn't be crazy to also include min/max, although quantiles are harder if you assume the data isn't all available at once.
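
On the streaming point: the mean and standard deviation can be accumulated one observation at a time (Welford's algorithm, sketched below), whereas exact quantiles need the whole column, which is what makes them harder:

mutable struct RunningStats
    n::Int
    mean::Float64
    m2::Float64
end
RunningStats() = RunningStats(0, 0.0, 0.0)

# Welford's online update: a single pass, one observation at a time.
function update!(s::RunningStats, x::Real)
    s.n += 1
    delta = x - s.mean
    s.mean += delta / s.n
    s.m2 += delta * (x - s.mean)
    return s
end

running_std(s::RunningStats) = sqrt(s.m2 / (s.n - 1))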

@kleinschmidt (Member)

Lastly, there are other situations where you might want to transform a model after it's been fit (for instance, to estimate simple effects, or look at the coefficients under different contrast coding schemes). Many of these are linear transformations, and so there might be room for some kind of abstraction here that captures whether a model can be re-expressed with a linear transformation of some kind (and if so, how) or requires re-fitting.
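
For the linear case, a small sketch of such an abstraction (the function name is made up): if the design matrix is re-expressed as X * A, e.g. under a different contrast coding, the coefficients and their covariance can be mapped without re-fitting.

using LinearAlgebra

# If Xnew = X * A, then X * b == Xnew * bnew requires bnew = inv(A) * b,
# and the coefficient covariance transforms as inv(A) * V * inv(A)'.
function reexpress(b::AbstractVector, V::AbstractMatrix, A::AbstractMatrix)
    Ainv = inv(A)
    return Ainv * b, Ainv * V * Ainv'
end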

@Nosferican (Contributor)

In the Terms 2.0 era it would be,

m = fit(Model, @formula(y ~ scale(X)), data) # scale doing z-scoring for instance

no?

@nalimilan (Member)

Terms 2.0 stores the mean and standard deviation of continuous variables, so we could provide a function to recompute coefficients without re-fitting the model.
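
A sketch of what that function might look like, assuming the schema exposes each continuous variable's stored mean and SD (var_std here is a hypothetical accessor, not an existing API):

# Hypothetical: var_std(schema, name) returns the SD recorded at fit time.
function standardized_coefs(m, schema, predictors::Vector{Symbol}, outcome::Symbol)
    sds = [var_std(schema, p) for p in predictors]
    return coef(m)[2:end] .* sds ./ var_std(schema, outcome)
end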

@kleinschmidt (Member)

I think scale would need to be implemented as a special term, since it can't work elementwise (at least not without special logic). "Non-special" function calls are interpreted like f.(args...) so that they work with both streaming and columnar data stores.
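
The distinction in a nutshell: log works one element at a time, while z-scoring needs column-level statistics, so an elementwise scale.(x) would have nothing to compute the mean and SD from:

using Statistics

x = rand(10)
log.(x)                                             # elementwise: fine row by row
scale(v::AbstractVector) = (v .- mean(v)) ./ std(v)
scale(x)                                            # needs the whole column at once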

@Nosferican (Contributor)

Aye. Should I open a separate issue to address that? The behavior could be a simple check for most cases:

hasmethod(*, Tuple{AbstractVector})     # false
hasmethod(log, Tuple{AbstractVector})   # false
hasmethod(abs2, Tuple{AbstractVector})  # false
hasmethod(scale, Tuple{AbstractVector}) # true, assuming a column-wise scale is defined

@nalimilan (Member)

Thinking about it, wouldn't it be better to require writing e.g. log.(x) in formulas? That would be more consistent with standard Julia syntax and would avoid weird situations where some functions are automatically vectorized and others are not.

@Nosferican (Contributor)

I think it would just be hard to parse... maybe the syntax could make it broadcast inside the anonymous function, though.

@kleinschmidt (Member)

What would the behavior be then if there's no . (e.g., how should log(x) be interpreted)? As a parsing error? As a function that operates on the whole column? And how would you make the whole-column functions work smoothly with streaming data?

@Nosferican (Contributor)

A MethodError, just like in standard Julia syntax? Operations that need the whole column by definition can't work with streaming data, so we're not missing out on anything there.

@nalimilan (Member)

Yes, that would have to throw an error. So maybe it's better to keep the current behavior, where the formula is interpreted as a row-wise operation. Maybe that's OK as long as there's a limited number of special operations (scale, lead, lag, diff...). But scale doesn't work with streaming either, right?

@kleinschmidt (Member)

scale would if it's implemented as a special term, since then it has access to the schema. Also, lag might be workable with streaming data: you just have to hold onto the last n observations in the term struct. Same with diff.
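
A sketch of that idea for a lag term holding onto the last n observations (the struct and function names are hypothetical):

mutable struct StreamingLag
    n::Int                    # lag order
    buffer::Vector{Float64}   # up to the last n observations seen
end
StreamingLag(n::Int) = StreamingLag(n, Float64[])

# Returns the observation from n steps back, or missing until the buffer fills.
function observe!(t::StreamingLag, x::Real)
    push!(t.buffer, x)
    length(t.buffer) > t.n || return missing
    return popfirst!(t.buffer)
end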

@Nosferican (Contributor)

There is still the matter of dimension mismatches / introduced missings. For group-by operations I use

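# Build the model-matrix rows group by group, then stack them back together: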
X = mapreduce(df -> modelcols(ft.rhs, df), vcat, groupby(data, :ID))

Just want to caution against anything in a streaming-data API that might break that.

@kleinschmidt (Member)

Yeah, lead/lag does present problems that way. Ideally I'd also like to have tools for table-level transformations (e.g., dropping rows, omitting missings, etc.) that the formula can plug into. That aspect of the whole table-to-matrix pipeline is just kind of a kludge right now.

@palday (Member) commented May 20, 2022

@DominiqueMakowski this is now implemented in StandardizedPredictors.jl!

I'm closing this for now -- the original question is now handled via a different package and most of the tangential issues are discussed in separate issues.

@palday closed this as completed May 20, 2022