Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add compatibility with pre-contrasts ModelFrame constructor #1042

Merged
merged 2 commits into from
Sep 14, 2016

Conversation

GordStephen
Copy link

@GordStephen GordStephen commented Aug 28, 2016

Some downstream packages (like FixedEffectModels) call the direct ModelFrame constructor rather than use the outer constructor functions. This PR provides compatibility with the pre-#870 signature for that constructor method.

@kleinschmidt
Copy link
Contributor

LGTM

@nalimilan
Copy link
Member

Just curious, why do you need to do that?

@kleinschmidt You're the most qualified to review this now that you've rewritten most of that code, so feel free to merge after leaving a chance for others to comment.

@GordStephen
Copy link
Author

Not sure why FixedEffectModels needs the raw constructor specifically, @matthieugomez would be the one to ask. My guess would be that it's to support explicitly specifying a subset of the input dataframe rows for use in the model?

@matthieugomez
Copy link
Contributor

matthieugomez commented Aug 31, 2016

I do need to remove more observations than what ModelFrame usual removes:
missing obs for variables used in fixed effects (not part of the formula given to ModelFrame — they are treated separately), missing obs for variables used in standard errors (like clustered standard errors), missing obs for weight. Also I have a subset option.

Tbh, I'm a little bit confused about what I did and what has changed here, so I'm not sure if it's exact reason.

df, msng = na_omit(DataFrame(map(x -> d[x], trms.eterms)))
names!(df, convert(Vector{Symbol}, map(string, trms.eterms)))
for c in eachcol(df) dropunusedlevels!(c[2]) end
function evalcontrasts(df::AbstractDataFrame, contrasts::Dict = Dict())

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove empty line at top of function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And BTW, move the comments below before the signature, as they clearly document the function.

@nalimilan
Copy link
Member

@matthieugomez When you'll have the time to look at this more deeply, it would be good to find a higher-level interface which would work for your use case. With the recent changes, we support arbitrary contrasts (not just dummy coding), but your custom evaluation function probably doesn't.

@matthieugomez
Copy link
Contributor

matthieugomez commented Sep 5, 2016

I used this constructor because I already had a dataframe without missing values after doing a bunch of operations (in particular excluding obs with missing weights or missing variables used to compute standard errors).
I wanted to avoid a new copy of the dataframe, which is done in ModelFrame with the na_omit function.
I do think that ModelFrame and ModelMatrix are doing excessive amount of copies (in particular it's impossible to do regressions with datasets more than 1/10 my RAM). For instance, in the future, it would be great to for df to be a subdataframe. #818.
Now, this particular line is just one more copy and I'm not sure it changes anything, so I can just switch back to the exported function.

@nalimilan
Copy link
Member

If that's the only problem, then we can probably change this line (or drop_na) to avoid making a copy when there are no missing values. (That code is know to be particularly inefficient WRT copying, and there are some low-hanging fruits waiting for someone to fix.)

@matthieugomez
Copy link
Contributor

Honestly it's ok. There will probably be a better fix someday (using subdataframes). A temporary fix based on the number of missing values would just complicate the function (df would sometimes be a reference to the original dataframes, sometimes not).

@matthieugomez
Copy link
Contributor

matthieugomez commented Sep 5, 2016

Sorry: I actually do need a way to specify a msng vector that is not restricted to the missing obs for variables in the formula. I think other may use it too. What about splitting the function ModelFrame as follows

function ModelFrame(trms::Terms, d::AbstractDataFrame; kwargs...)
   select = complete_cases(DataFrame(map(x -> d[x], trms.eterms)))
   ModelFrame(trms, d, select; kwargs...)
end

function ModelFrame(trms::Terms, d::AbstractDataFrame, select::AbstractVector; contrasts::Dict = Dict())
    df = DataFrame(map(x -> d[select, :x], trms.eterms))
    names!(df, convert(Vector{Symbol}, map(string, trms.eterms)))
    for c in eachcol(df) dropunusedlevels!(c[2]) end

    ## Set up contrasts:
    ## Combine actual DF columns and contrast types if necessary to compute the
    ## actual contrasts matrices, levels, and term names (using DummyCoding
    ## as the default)
    evaledContrasts = Dict()
    for (term, col) in eachcol(df)
        is_categorical(col) || continue
        evaledContrasts[term] = ContrastsMatrix(haskey(contrasts, term) ?
                                                contrasts[term] :
                                                DEFAULT_CONTRASTS(),
                                                col)
    end

    ## Check for non-redundant terms, modifying terms in place
    check_non_redundancy!(trms, df)

    ModelFrame(df, trms, select, evaledContrasts)
end

@nalimilan
Copy link
Member

Yeah, makes sense as a public API. Please make a PR (with tests for the new method). Though I'm not sure DataFrame(map(x -> d[x], trms.eterms)) is the most efficient approach: why not index d with a vector of symbols?

Also, would this replace or complement the present PR?

@matthieugomez
Copy link
Contributor

Sorry I'm not familiar with the package, why isn't this efficient?

DataFrame(map(x -> d[x], trms.eterms))

@nalimilan
Copy link
Member

@GordStephen Thanks. Unfortunately, recent changes on master have introduced conflicts. Could you rebase? (I would have done it myself, but I don't have the rights to push to your branch...)

@matthieugomez I'm not sure it really inefficient, but if the simpler d[trms.eterms] works, better use it. Anyway, please open a PR an we can discuss that.

@GordStephen GordStephen force-pushed the gs/modelframe-constructor branch from 82e0958 to 88def09 Compare September 13, 2016 23:30
@GordStephen GordStephen force-pushed the gs/modelframe-constructor branch from 88def09 to a444fd6 Compare September 13, 2016 23:39
@GordStephen
Copy link
Author

Ah, sorry. I'll check that "Allow edits from maintainers" box. Rebased now.

@nalimilan nalimilan changed the title RFC: Add compatibility with pre-contrasts ModelFrame constructor Add compatibility with pre-contrasts ModelFrame constructor Sep 14, 2016
@nalimilan nalimilan merged commit 968e980 into JuliaData:master Sep 14, 2016
@nalimilan
Copy link
Member

Thanks!

@matthieugomez
Copy link
Contributor

matthieugomez commented Sep 14, 2016

For reference, I ended up using the basic ModelFrame function, replacing
the msng field by a custom vector at the end. Code here

On Wed, Sep 14, 2016 at 4:13 AM, Milan Bouchet-Valat <
notifications@github.com> wrote:

Thanks!


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#1042 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AF733br-gH6N-2kQacAhGtc_ZEWRGleMks5qp6zBgaJpZM4Ju6-L
.

@nalimilan
Copy link
Member

Can you make a PR to add this method? You shouldn't have to carry things like that in your package.

maximerischard pushed a commit to maximerischard/DataFrames.jl that referenced this pull request Sep 28, 2016
…iaData#1042)

Add compatibility with pre-contrasts ModelFrame constructor
ararslan pushed a commit that referenced this pull request Oct 7, 2016
Add compatibility with pre-contrasts ModelFrame constructor
(cherry picked from commit 968e980)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants