
Standardized parameters #75

Closed · DominiqueMakowski opened this issue Sep 24, 2018 · 20 comments

@DominiqueMakowski commented Sep 24, 2018

Follow-up to this issue, to discuss the possibility of, and methods for, obtaining standardized parameters from models, with a general solution at the StatsModels level.

Several methods exist to compute standardized coefficients:

  1. Coefficient transformation (usually involving the outcome's SD)
  2. Data standardization and model re-fitting

What would be the best way to obtain results equivalent to method 2? (See the sketch below.)

Related to #73
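
For concreteness, a minimal sketch of both approaches using GLM.jl; the data and variable names (df, x1, x2, y) are purely illustrative:

using GLM, DataFrames, Statistics

df = DataFrame(x1 = randn(100), x2 = randn(100), y = randn(100))
m = lm(@formula(y ~ x1 + x2), df)

# Method 1: transform the fitted coefficients (skipping the intercept)
# using the predictors' SDs and the outcome's SD.
std_coefs = coef(m)[2:end] .* [std(df.x1), std(df.x2)] ./ std(df.y)

# Method 2: standardize the data and re-fit.
zscore(v) = (v .- mean(v)) ./ std(v)
m_z = lm(@formula(y ~ x1 + x2), mapcols(zscore, df))
coef(m_z)[2:end] ≈ std_coefs  # true for OLS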

@nalimilan (Member)

Model re-fitting is clearly #73. If you need to access the original data, we need to define a consistent interface, and that depends on #32. But I'm not sure we want to keep all the data used to fit the model in the model itself, since that uses a lot of memory which isn't necessarily useful. So maybe we should store the means and standard deviations of variables somewhere?

@DominiqueMakowski (Author) commented Sep 24, 2018

I agree that storing the whole data seems inefficient.

Storing only characteristics of variables (such as the mean and SD) would do the trick for, I believe, almost all use cases, but there is always the possibility that someone would, for instance, want the coefficients based on a robust standardization (median and MAD), or on normalization (fixed 0–1 range)... Storing all the useful indices (mean, median, SD, MAD, min, max) would work, but feels kind of clunky 😕

Another possibility is an optional, non-default argument (transformation) that would, at fitting time (when the data is available), transform the data and fit the appropriate model. This technique is used in several ML packages (R's caret, for one). The trick would be an extension of this argument (additional_transformation?) that would trigger fitting of both models (with and without the transformation) and return both. This could be quite flexible:

my_transformation(df) = create_personalised_transformation(df)

m = fit(Model, @formula(y ~ X), data, additional_transformation=my_transformation)

# Access both models
m.raw
m.transformed

Or

m = fit(Model, @formula(y ~ X), data, transformation=my_transformation, additional=true)

@nalimilan (Member)

Asking users to supply the required transformation doesn't sound very convenient. We might as well pass the input data again to the function that standardizes the coefficients.

@DominiqueMakowski (Author) commented Sep 24, 2018

Well, it would still be possible for such a transformation argument to accept either a function or a symbol from a set of predefined routines, such as :standardize.

e.g.:
m = fit(Model, @formula(y ~ X), data, transformation=:standardize)

It seems fairly transparent, explicit, and flexible. As for whether it is the most convenient way, I don't know 😄

@kleinschmidt (Member)

I think the best solution is something along the lines of the tidymodels recipes package. In a way, the formula is also a data transformation step. I think it would be good to make StatsModels "transformation agnostic" as much as possible, and allow multiple transformation steps to be chained into a pipeline.
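
A rough sketch of what chaining could look like with plain function composition; standardize_cols and drop_missings are hypothetical names, not a proposed StatsModels API:

using DataFrames, Statistics

# Each hypothetical step is just a table -> table function.
standardize_cols(cols...) =
    df -> transform(df, [c => (v -> (v .- mean(v)) ./ std(v)) => c for c in cols]...)
drop_missings() = df -> dropmissing(df)

# ∘ composes right-to-left: drop missings first, then standardize.
pipeline = standardize_cols(:x1, :x2) ∘ drop_missings()
prepared = pipeline(data)  # then: fit(Model, @formula(y ~ x1 + x2), prepared)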

@kleinschmidt (Member)

Note that if we end up with something like what I've proposed in #71, the schema for the data source there includes the mean/std of continuous variables (which in turn is inspired by JuliaDB.ML). It wouldn't be crazy to also include min/max, although quantiles are harder if you assume the data isn't all available at once.
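
On the streaming point: the mean and standard deviation can be accumulated one observation at a time (Welford's algorithm, sketched below), whereas exact quantiles need the whole column, which is what makes them harder:

mutable struct RunningStats
    n::Int
    mean::Float64
    m2::Float64
end
RunningStats() = RunningStats(0, 0.0, 0.0)

# Welford's online update: a single pass, one observation at a time.
function update!(s::RunningStats, x::Real)
    s.n += 1
    delta = x - s.mean
    s.mean += delta / s.n
    s.m2 += delta * (x - s.mean)
    return s
end

running_std(s::RunningStats) = sqrt(s.m2 / (s.n - 1))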

@kleinschmidt (Member)

Lastly, there are other situations where you might want to transform a model after it's been fit (for instance, to estimate simple effects, or look at the coefficients under different contrast coding schemes). Many of these are linear transformations, and so there might be room for some kind of abstraction here that captures whether a model can be re-expressed with a linear transformation of some kind (and if so, how) or requires re-fitting.
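
For the linear case, a small sketch of such an abstraction (the function name is made up): if the design matrix is re-expressed as X * A, e.g. under a different contrast coding, the coefficients and their covariance can be mapped without re-fitting.

using LinearAlgebra

# If Xnew = X * A, then X * b == Xnew * bnew requires bnew = inv(A) * b,
# and the coefficient covariance transforms as inv(A) * V * inv(A)'.
function reexpress(b::AbstractVector, V::AbstractMatrix, A::AbstractMatrix)
    Ainv = inv(A)
    return Ainv * b, Ainv * V * Ainv'
end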

@Nosferican (Contributor)

In the Terms 2.0 era it would be,

m = fit(Model, @formula(y ~ scale(X)), data) # scale doing z-scoring for instance

no?

@nalimilan (Member)

Terms 2.0 stores the mean and standard deviation of continuous variables, so we could provide a function to recompute coefficients without re-fitting the model.
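
A sketch of what that function might look like, assuming the schema exposes each continuous variable's stored mean and SD (var_std here is a hypothetical accessor, not an existing API):

# Hypothetical: var_std(schema, name) returns the SD recorded at fit time.
function standardized_coefs(m, schema, predictors::Vector{Symbol}, outcome::Symbol)
    sds = [var_std(schema, p) for p in predictors]
    return coef(m)[2:end] .* sds ./ var_std(schema, outcome)
end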

@kleinschmidt (Member)

I think scale would need to be implemented as a special term, since it can't work elementwise (at least not without special logic). "Non-special" function calls are interpreted like f.(args...) so that they work with both streaming and columnar data stores.
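
The distinction in a nutshell: log works one element at a time, while z-scoring needs column-level statistics, so an elementwise scale.(x) would have nothing to compute the mean and SD from:

using Statistics

x = rand(10)
log.(x)                                             # elementwise: fine row by row
scale(v::AbstractVector) = (v .- mean(v)) ./ std(v)
scale(x)                                            # needs the whole column at once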

@Nosferican (Contributor)

Aye. Should I open a separate issue to address that? The behavior could be a simple check for most cases:

hasmethod(*, Tuple{AbstractVector})     # false
hasmethod(log, Tuple{AbstractVector})   # false
hasmethod(abs2, Tuple{AbstractVector})  # false
hasmethod(scale, Tuple{AbstractVector}) # true, assuming a column-wise scale is defined

@nalimilan (Member)

Thinking about it, wouldn't it be better to require writing e.g. log.(x) in formulas? That would be more consistent with standard Julia syntax and would avoid weird situations where some functions are automatically vectorized and others are not.

@Nosferican (Contributor)

I think it would just be hard to parse... maybe the syntax could make it broadcast inside the anonymous function, though.

@kleinschmidt (Member)

What would the behavior be then if there's no . (e.g., how should log(x) be interpreted)? As a parsing error? As a function that operates on the whole column? And how would you make the whole-column functions work smoothly with streaming data?

@Nosferican (Contributor)

A MethodError, just like in standard Julia syntax? Operations that need the whole column by definition can't work with streaming data, so we're not missing out on anything there.

@nalimilan (Member)

Yes, that would have to throw an error. So maybe it's better to keep the current behavior, where the formula is interpreted as a row-wise operation. Maybe that's OK as long as there's a limited number of special operations (scale, lead, lag, diff...). But scale doesn't work with streaming either, right?

@kleinschmidt (Member)

scale would if it's implemented as a special term, since then it has access to the schema. Also, lag might be workable with streaming data: you just have to hold onto the last n observations in the term struct. Same with diff.
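
A sketch of that idea for a lag term holding onto the last n observations (the struct and function names are hypothetical):

mutable struct StreamingLag
    n::Int                    # lag order
    buffer::Vector{Float64}   # up to the last n observations seen
end
StreamingLag(n::Int) = StreamingLag(n, Float64[])

# Returns the observation from n steps back, or missing until the buffer fills.
function observe!(t::StreamingLag, x::Real)
    push!(t.buffer, x)
    length(t.buffer) > t.n || return missing
    return popfirst!(t.buffer)
end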

@Nosferican (Contributor)

There is still the matter of dimension mismatches / introduced missings. For group-by operations I use

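# Build the model-matrix rows group by group, then stack them back together: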
X = mapreduce(df -> modelcols(ft.rhs, df), vcat, groupby(data, :ID))

Just want to caution against anything in a streaming-data API that might break that.

@kleinschmidt (Member)

Yeah, lead/lag does present problems that way. Ideally I'd also like to have tools for table-level transformations (e.g., dropping rows, omitting missings, etc.) that the formula can plug into. That aspect of the whole table-to-matrix pipeline is just kind of a kludge right now.

@palday (Member) commented May 20, 2022

@DominiqueMakowski this is now implemented in StandardizedPredictors.jl!

I'm closing this for now -- the original question is now handled via a different package and most of the tangential issues are discussed in separate issues.

@palday closed this as completed May 20, 2022