Standardized parameters #75
Comments
Model re-fitting is clearly #73. If you need to access the original data, we need to define a consistent interface, and that depends on #32. But I'm not sure we want to keep all the data used to fit the model in the model itself, as it uses a lot of memory which is not necessarily useful. So maybe we should store the means and standard deviations of variables somewhere? |
I agree that storing the whole data seems inefficient. Storing only characteristics of variables (such as mean and SD) would do the trick for, I believe, almost all use-cases, but there is always the possibility that someone would for instance get the coefs based on a robust standardization (with median and MAD), or based on normalization (fixed range 0-1)... Storing all useful indices (mean, median, SD, MAD, min, max) would work, but feels kinda clunky 😕 Another possibility is to have an optional, non-default argument:
my_transformation(df) = create_personalised_transformation(df)
m = fit(Model, @formula(y ~ X), data, additional_transformation=my_transformation)
# Access both models
m.raw
m.transformed
Or:
m = fit(Model, @formula(y ~ X), data, transformation=my_transformation, additional=true) |
Asking users to supply the required transformation doesn't sound very convenient. We could as well pass the input data again to the function which standardizes coefficients. |
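A minimal sketch of the "pass the data again" idea, assuming a plain linear model with an intercept and continuous predictors only. `standardize_coefs` is a hypothetical helper introduced here for illustration, not a StatsModels function; it rescales already-fitted slopes using the predictor matrix and response, without re-fitting.

```julia
using Statistics

# Hypothetical helper (not part of StatsModels): rescale raw slopes to
# standardized slopes using the data passed in again after fitting.
# Assumes a linear model with an intercept; `β` holds the slopes only
# (intercept excluded), one per column of `X`.
function standardize_coefs(β::AbstractVector, X::AbstractMatrix, y::AbstractVector)
    sx = vec(std(X; dims=1))   # SD of each predictor column
    sy = std(y)                # SD of the response
    return β .* sx ./ sy
end

# Toy data where y = 1 + 2x holds exactly, so the raw slope is 2
X = reshape([1.0, 2.0, 3.0, 4.0], :, 1)
y = 1 .+ 2 .* X[:, 1]
β_std = standardize_coefs([2.0], X, y)
```

With a single predictor and an exact linear relationship, the standardized slope equals the correlation between `x` and `y`, which is a quick sanity check for a helper like this.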
Well, it would still be possible for such cases, e.g.:
It seems fairly transparent, explicit and flexible. As for whether it is the most convenient way, I don't know 😄 |
I think the best solution is something along the lines of the tidymodels |
Note that if we end up with something like what I've proposed in #71, the schema for the data source there includes the mean/std of continuous variables (which in turn is inspired by JuliaDB.ML); it wouldn't be crazy to also include min/max (although quantiles are harder if you assume that the data isn't all available at once). |
Lastly, there are other situations where you might want to transform a model after it's been fit (for instance, to estimate simple effects, or look at the coefficients under different contrast coding schemes). Many of these are linear transformations, and so there might be room for some kind of abstraction here that captures whether a model can be re-expressed with a linear transformation of some kind (and if so, how) or requires re-fitting. |
Terms 2.0 era would be,
no? |
Terms 2.0 stores the mean and standard deviation of continuous variables, so we could provide a function to recompute coefficients without re-fitting the model. |
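Why the stored means and standard deviations are enough can be sketched with a change of variables. This assumes a linear model with an intercept and continuous predictors; the symbols are notation introduced here for illustration, with $\bar{x}_j, s_j$ the stored mean and SD of predictor $j$ and $\bar{y}, s_y$ those of the response.

```latex
% Raw model and standardized predictors
y = \beta_0 + \sum_j \beta_j x_j, \qquad z_j = \frac{x_j - \bar{x}_j}{s_j}

% Substitute x_j = \bar{x}_j + s_j z_j
y = \Big(\beta_0 + \sum_j \beta_j \bar{x}_j\Big) + \sum_j (\beta_j s_j)\, z_j

% Also standardizing the response, y^* = (y - \bar{y}) / s_y
\beta_j^{*} = \beta_j \, \frac{s_j}{s_y}, \qquad
\beta_0^{*} = \frac{\beta_0 + \sum_j \beta_j \bar{x}_j - \bar{y}}{s_y}
```

For ordinary least squares with an intercept, $\beta_0^{*}$ is zero because the fitted plane passes through the variable means. Robust standardization (median and MAD) follows the same algebra with different location and scale constants, but models with a non-identity link on the response do not reduce to a linear rescaling like this.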
I think |
Aye. Should I open a separate issue to address that? The behavior can do a simple check for most cases,
|
Thinking about it, wouldn't it be better to require writing e.g. |
I think it is just hard to parse it... maybe the syntax can make it broadcast in the anonymous function though. |
What would the behavior be then if there's no |
|
Yes that would have to throw an error. So maybe it's better to keep the current behavior where the formula is interpreted as a row-wise operation. Maybe that's OK as long as there's a limited number of special operations ( |
|
There is still the matter of dimensional mismatch / introduced
Just wanna caution against something that might break that through a streaming data API. |
Yeah lead/lag does present problems that way. Ideally I'd also like to have tools to do table-level transformations (e.g., dropping rows, omitting missings, etc.) that the formula can plug into. That aspect of the whole table-to-matrix pipeline is just kind of a kludge right now. |
@DominiqueMakowski this is now implemented in StandardizedPredictors.jl! I'm closing this for now -- the original question is now handled via a different package and most of the tangential issues are discussed in separate issues. |
Follow-up to this issue, to discuss the possibility and methods of obtaining standardized parameters from models, with a general solution at the StatsModels level.
What would be the best way to obtain them similarly to method 2?
Related to #73