rip out formula parsing and replace it with something sane #54

Merged
merged 20 commits into from
Apr 22, 2018

Conversation

kleinschmidt
Member

This PR replaces the spaghetti code that handled the expression re-writes that transform formula surface syntax into a bunch of main effects and interactions. It works by specifying re-write rules (represented by subtypes of FormulaRewrite) for the associative rule, the distributive rule, "star expansion" (a*b to a+b+a&b), and subtraction (replacing x-1 with x + -1). These rules are implemented via methods on two functions: applies(ex::Expr, child_idx::Int, ::Type{T<:FormulaRewrite}) and rewrite!(ex::Expr, child_idx::Int, ::Type{T<:FormulaRewrite}). The first checks whether the child of ex at child_idx should be re-written under rule T, and the second modifies ex in place and returns the index of the next child to check. For example, the associative rule checks whether ex and ex.args[child_idx] are both calls to the same associative operator, and if so splices the args of the child into ex in place of the child.
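
As a rough illustration of the applies/rewrite! pattern described above, here is a minimal sketch of an associative rule. It is not the code from this PR: the AssociativeRule type, the ASSOCIATIVE tuple, and the is_call helper are hypothetical names chosen for the example, and it is written for current Julia syntax.

```julia
# Hypothetical, simplified stand-ins for the types described above.
abstract type FormulaRewrite end
struct AssociativeRule <: FormulaRewrite end

# Operators treated as associative in this sketch (an assumption).
const ASSOCIATIVE = (:+, :&)

# True when ex is a call expression whose function is op.
is_call(ex, op::Symbol) = ex isa Expr && ex.head == :call && ex.args[1] == op

# Does the child at child_idx call the same associative operator as ex?
function applies(ex::Expr, child_idx::Int, ::Type{AssociativeRule})
    ex.head == :call || return false
    op = ex.args[1]
    op in ASSOCIATIVE && is_call(ex.args[child_idx], op)
end

# Splice the child's arguments into ex in place of the child, and return
# the index of the next child to check (the first spliced-in argument).
function rewrite!(ex::Expr, child_idx::Int, ::Type{AssociativeRule})
    child = ex.args[child_idx]
    splice!(ex.args, child_idx, child.args[2:end])
    child_idx
end

ex = :(a + (b + c))
applies(ex, 3, AssociativeRule) && rewrite!(ex, 3, AssociativeRule)
# ex is now :(a + b + c)
```

Returning the index of the first spliced-in argument means a nested child such as `a + ((b + c) + d)` would be re-checked at the same position on the next pass.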

The goal here is to make the code easier to read, maintain, and extend. By expressing all the operations on the formula DSL in the same terms, it'll be easier, I hope, to expand it or for package authors to implement custom extensions (although that may require using custom macros with the current implementation). In the future I plan to change how the terms themselves are represented to make things even more composable and extensible, but I think the current PR should be considered regardless of that.

The only thing that's currently broken is the drop_term! tests: the Terms are equivalent but the actual Formulae are not, because I wasn't sure what the best way is to modify the whole expression of the formula (which I now store on the Formula struct, so that it holds onto both the original (un-parsed) expression that's passed to the macro and the whole parsed expression).

@@ -2,3 +2,4 @@ julia 0.6
DataFrames 0.11.5
StatsBase 0.20.1
Compat 0.63
ArgCheck
Member

I'd really prefer we not add a dependency on ArgCheck, since @argcheck thing "ahhh" is identical to thing || throw(ArgumentError("ahhh")), and the latter is both clearer (IMO) and will provide a cleaner backtrace (nothing about inlined macro expansion from @argcheck).
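
For reference, the manual form of the check mentioned above looks roughly like this. The checked_sqrt function is a hypothetical example and is not part of this package.

```julia
# Hypothetical helper used only to illustrate the manual check.
function checked_sqrt(x::Real)
    # Short-circuit: the throw only runs when the condition is false.
    x >= 0 || throw(ArgumentError("x must be non-negative, got $x"))
    sqrt(x)
end
```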

Member Author

It's more of a habit than anything for me so I'm fine to do the checks manually. At this point that wouldn't add too much boilerplate. Your objection is just that argcheck isn't strictly necessary here, right?

Member

Yeah. And I always favor fewer dependencies when possible.

Member

AFAICT you removed all uses of @argcheck but still added the dependency.

Member Author

Yes, I caught this right after I tagged a release so I pushed another commit on master to remove it. Thanks attobot for showing the requires diff.

@kleinschmidt
Member Author

I don't know what's happening with the nightly tests; they pass locally for me (but Pkg3 seems to pick up CategoricalArrays 0.3.7 locally and not 0.3.8, which is what Travis seems to pick). Something is badly wrong with 0.6, so I need to get that sorted.

## not a :call to s, return condensed version of ex
return condense(ex)
# recursively sort children
sort_terms!.(ex.args)
Member

foreach(sort_terms!, ex.args) may be more efficient here since it avoids allocations from broadcasting.
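
A small sketch of the difference, using a hypothetical sort_args! function rather than the sort_terms! from this PR: broadcasting `f.(xs)` collects the return values into a new array, while `foreach(f, xs)` calls f only for its side effects and returns nothing.

```julia
# Hypothetical stand-in for a recursive in-place pass over an expression tree.
function sort_args!(ex::Expr)
    foreach(sort_args!, ex.args)   # recurse into children; no result array is built
    if ex.head == :call && ex.args[1] == :+
        # Sort the operands of + by their printed form.
        ex.args[2:end] = sort(ex.args[2:end], by=string)
    end
    ex
end
sort_args!(x) = x   # leaves (symbols, numbers) are returned unchanged
```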

@kleinschmidt
Member Author

Ah CategoricalArrays 0.3.8 is going to break this whole package because we rely on unique returning levels order, not order of occurrence.

@kleinschmidt
Member Author

Okay I think #55 will fix the categorical array issue.

@codecov-io

codecov-io commented Apr 17, 2018

Codecov Report

Merging #54 into master will decrease coverage by 0.73%.
The diff coverage is 98.92%.

@@            Coverage Diff             @@
##           master      #54      +/-   ##
==========================================
- Coverage   94.59%   93.85%   -0.74%     
==========================================
  Files           5        6       +1     
  Lines         333      358      +25     
==========================================
+ Hits          315      336      +21     
- Misses         18       22       +4
Impacted Files      Coverage Δ
src/deprecated.jl   100% <100%> (ø)
src/formula.jl      96.96% <98.9%> (-3.04%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9bbc2a7...f1b98c2.

Member

nalimilan left a comment

You're by far the one who knows this code best, so if you think it's OK I can only agree! Though, like @ararslan, I don't think requiring ArgCheck is a good idea.

src/formula.jl Outdated
getterms(a::Any) = Any[a]

ord(ex::Expr) = (ex.head == :call && ex.args[1] == :&) ? length(ex.args)-1 : 1
ord(a::Any) = 1
# ord(ex::Expr) = (ex.head == :call && ex.args[1] == :&) ? length(ex.args)-1 : 1
Member

Remove?


## totally empty
t = Terms(Formula(nothing, 0))
t = Terms(@eval @formula $(:($nothing ~ 0)))
Member

This is intriguing...

Member Author

well since I've moved parsing into the @formula macro and there's not currently syntactic support for one-sided formulae this is what's necessary :)

test/formula.jl Outdated
@@ -101,16 +104,16 @@
## @test t.terms == [:x2, :(x1&x2)] # == [:(1 & x1)]
## @test t.eterms == [:y, :x1, :x2]

@test dropterm(@formula(foo ~ 1 + bar + baz), :bar) == @formula(foo ~ 1 + baz)
@test dropterm(@formula(foo ~ 1 + bar + baz), 1) == @formula(foo ~ 0 + bar + baz)
@test_broken dropterm(@formula(foo ~ 1 + bar + baz), :bar) == @formula(foo ~ 1 + baz)
Member

I guess you plan to fix this?

@kleinschmidt
Member Author

I've removed the ArgCheck dependency and fixed those tests. I also added a few more parser checks (for & with 1 and one-element interactions) that fix a few tests that were broken before. Maybe @dmbates wants to weigh in before this is merged? I don't think this should cause any problems with MixedModels but just in case...

@dmbates
Contributor

dmbates commented Apr 18, 2018

I do have a problem with one of the tests in MixedModels because of the way that I explicitly construct a one-sided formula. It is at MixedModels/src/pls.jl:90 and I am trying to build up a model matrix from individual terms. I use an expression of the form ModelFrame(Formula(nothing, rhs), fr) where rhs is an expression.

I think it may be best if I rewrite this section of the MixedModels code. It seems to me that this is probably where JuliaStats/MixedModels.jl#100 originates. However, I should have a temporary fix so that this PR in StatsModels can go forward. Is there an explicit constructor for a one-sided formula?

@kleinschmidt
Member Author

You might want to wait to do that re-write because I have bigger changes planned here about how Terms are represented (which may or may not change how best to handle things in MixedModels).

This PR actually fixes that issue, since the parsing rules are applied to the arguments of | as well:

julia> @formula( y ~ 1 + x + (a*b | c))
Formula: y ~ 1 + x + ((a + b + a & b) | c)
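
The star expansion visible in the output above can be sketched as a single recursive pass. This is only an illustration: expand_star is a hypothetical function, it handles just the two-argument form of `*`, and it skips the associative and distributive clean-up that the real parser also applies.

```julia
# Hypothetical one-rule sketch: expand a*b into a + b + a&b, recursively.
expand_star(x) = x   # symbols and numbers are left alone
function expand_star(ex::Expr)
    # Expand children first so that nested products are handled bottom-up.
    args = map(expand_star, ex.args)
    if ex.head == :call && args[1] == :* && length(args) == 3
        a, b = args[2], args[3]
        return Expr(:call, :+, a, b, Expr(:call, :&, a, b))
    end
    Expr(ex.head, args...)
end

expand_star(:(y ~ 1 + x + (a*b | c)))
# gives :(y ~ 1 + x + ((a + b + a & b) | c))
```

Because the pass recurses into every call, the arguments of `|` are expanded in the same way as top-level terms.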

Your issue is due to the fact that the formula holds onto the original expression. You could probably get away with just constructing a Formula with empty ex and ex_orig fields for now.

@dmbates
Contributor

dmbates commented Apr 18, 2018

I had forgotten that the representation of a Formula had changed. Would it be possible to add another Formula constructor like

Formula(lhs::Union{Expr, Symbol, Void}, rhs::Union{Expr, Integer, Symbol}) = Formula(:(), :(), lhs, rhs)

to make this case work in the short term? I could do something in the MixedModels package but I think it would be easier to add the method here.

@kleinschmidt
Member Author

Yes, I'd be okay with that, as long as there's a deprecation warning so that people who have been constructing formulae manually are aware. That might help avoid breakage from this change, too.

The other option is to do something like @eval @formula($(nothing) ~ $rhs).

Either way, it's probably good to upper-bound to the current version of StatsModels for anyone who relies on the internal structure of Formulas.

@kleinschmidt
Member Author

I added a Formula(lhs, rhs) method that creates a .ex field by splicing in the two arguments, and sets ex_orig to be :(), and gives a deprecation warning suggesting the 4-argument form or the @eval @formula solution if parsing is needed. This ensures backwards compatibility except in cases where you expect the lhs/rhs fields to be un-parsed, which...serves them right I guess.
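
The shape of that method is roughly as follows. This is a sketch in current Julia syntax on a hypothetical DemoFormula struct, not the actual Formula type; the field names follow the description in this thread, while the field types and the warning text are assumptions.

```julia
# Hypothetical stand-in for the four-field Formula described in this thread.
struct DemoFormula
    ex_orig::Expr
    ex::Expr
    lhs::Union{Symbol, Expr, Nothing}
    rhs::Union{Symbol, Expr, Integer}
end

# Deprecated two-argument form: warn, build ex by splicing lhs and rhs
# into a ~ call, and leave ex_orig as the empty expression :().
function DemoFormula(lhs, rhs)
    Base.depwarn("DemoFormula(lhs, rhs) is deprecated; use the 4-argument form",
                 :DemoFormula)
    DemoFormula(:(), :($lhs ~ $rhs), lhs, rhs)
end

f = DemoFormula(nothing, :(1 + x))
```

Note that recent Julia versions hide deprecation warnings unless started with `--depwarn=yes`.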

@kleinschmidt
Member Author

The failure here is because the way warnings are printed is different between 0.6 and 0.7, so using @test_warn "deprecated" ... doesn't work on 0.7, and @test_logs isn't available on 0.6. So maybe it's better to just leave out a test for this, or do something like

    if VERSION > v"0.7.0-DEV"
        f2 = @eval(@test_logs (:warn, r"Formula\(lhs, rhs\) is deprecated") Formula(f.lhs, f.rhs))
    else        
        f2 = @eval(@test_warn "deprecated" Formula(f.lhs, f.rhs))
    end

@@ -1,7 +1,12 @@
@testset "Deprecations" begin
f = @formula y ~ 1 + a*b

f2 = @test_warn "deprecated" Formula(f.lhs, f.rhs)
if VERSION > v"0.7.0-DEV"
Member

It would be nice to use a more precise version here.

Member Author

I mostly want to split 0.6/0.7... I don't know when @test_logs appeared but I think it's fairly safe to assume that anyone who's on 0.7.something will want the other test. So something like if VERSION < v"0.6.9999"?

Member

Should be

@static if VERSION >= v"0.7.0-DEV.2988"
    @test_logs ...
else
    # other thing
end

The @static will prevent Julia from trying to expand @test_logs when it doesn't exist.
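
A short sketch of the two points in this suggestion: how pre-release version numbers compare, and how `@static` picks a branch when the macro is expanded. The has_test_logs variable is a hypothetical name used only here.

```julia
# Pre-release versions sort below the release they lead up to, and a bare
# "-DEV" sorts below any numbered "-DEV.N" build.
v"0.6.9999" < v"0.7.0-DEV"          # true
v"0.7.0-DEV" < v"0.7.0-DEV.2988"    # true
v"0.7.0-DEV.2988" < v"0.7.0"        # true

# @static evaluates the condition during macro expansion and keeps only the
# chosen branch, so macros in the other branch are never expanded.
has_test_logs = @static if VERSION >= v"0.7.0-DEV.2988"
    true
else
    false
end
```

With a plain `if`, both branches are macro-expanded before the condition is evaluated, which is why a macro that does not exist on the running version causes an error even in the branch that is never taken.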

Member Author

kleinschmidt Apr 19, 2018

I see now that v"0.7.0" is higher than any v"0.7.0-DEV", so this makes sense to me.

@kleinschmidt
Member Author

I fixed the version check and I think that's the last thing.

@matthieugomez
Contributor

matthieugomez commented Apr 26, 2018

I am trying to update my package for this pull request.
I have an issue for formulas with no left hand side:

using DataFrames, StatsModels
df = DataFrames.DataFrame(x1 = [1])
StatsModels.ModelFrame(StatsModels.Terms(StatsModels.@formula(nothing~x1)), df)
#> ERROR: KeyError: key :nothing not found

How can I adapt this code to make it work?

@kleinschmidt
Member Author

kleinschmidt commented Apr 26, 2018

you can do @eval @formula($nothing ~ x1). If you look at the expression :(nothing ~ x1) you'll see that nothing is actually the symbol :nothing:

julia> dump( :(nothing ~ x1) )
Expr
  head: Symbol call
  args: Array{Any}((3,))
    1: Symbol ~
    2: Symbol nothing
    3: Symbol x1
  typ: Any

whereas you want

julia> dump(:($nothing ~ x1))
Expr
  head: Symbol call
  args: Array{Any}((3,))
    1: Symbol ~
    2: Void nothing
    3: Symbol x1
  typ: Any
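
The same distinction can be checked directly without reading the dump output. This is a small sketch; the variable names quoted and interpolated are chosen for the example. On Julia 0.7 and later the type of nothing is called Nothing rather than Void.

```julia
quoted = :(nothing ~ x1)          # `nothing` is parsed as a name
interpolated = :($nothing ~ x1)   # the value `nothing` is spliced in

quoted.args[2] isa Symbol          # true: the symbol :nothing
interpolated.args[2] === nothing   # true: the value nothing
```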

@matthieugomez
Contributor

matthieugomez commented Apr 26, 2018

Ok — it works now. Thanks a lot for your help.

6 participants