rip out formula parsing and replace it with something sane #54

kleinschmidt · 2018-04-16T22:05:29Z

This PR replaces the spaghetti code that handled the expression re-writes that transform formula surface syntax into a bunch of main effects and interactions. It works by specifying re-write rules (represented by subtypes of FormulaRewrite) for the associative rule, the distributive rule, "star expansion" (a*b to a+b+a&b), and subtraction (replacing x-1 with x + -1). These rules are implemented via methods on two functions: applies(ex::Expr, child_idx::Int, ::Type{T<:FormulaRewrite}) and rewrite!(ex::Expr, child_idx::Int, ::Type{T<:FormulaRewrite}). The first checks whether the child of ex at child_idx should be re-written under rule T, and the second modifies ex in place and returns the index of the next child to check. For example, the associative rule checks whether ex and ex.args[child_idx] are both calls to the same associative operator, and if so splices the args of the child into ex in place of the child.
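
To make the applies/rewrite! pair concrete, here is a minimal sketch of the associative rule in plain Julia. It is not the code in this PR: the AssociativeRule name, the ASSOCIATIVE_OPS set, the is_call helper, and the choice to return child_idx (so the spliced-in children get re-checked) are all assumptions made for illustration.

```julia
# Illustrative sketch only; names and bodies are assumptions, not the PR's code.
abstract type FormulaRewrite end
struct AssociativeRule <: FormulaRewrite end

# Assumed set of operators treated as associative in this sketch.
const ASSOCIATIVE_OPS = (:+, :&)

# True when x is a call expression to the operator op.
is_call(x, op) = x isa Expr && x.head == :call && x.args[1] == op

# Does the child at child_idx call the same associative operator as its parent?
function applies(ex::Expr, child_idx::Int, ::Type{AssociativeRule})
    ex.head == :call || return false
    op = ex.args[1]
    return op in ASSOCIATIVE_OPS && is_call(ex.args[child_idx], op)
end

# Splice the child's arguments into the parent in place of the child.
function rewrite!(ex::Expr, child_idx::Int, ::Type{AssociativeRule})
    child = ex.args[child_idx]
    splice!(ex.args, child_idx, child.args[2:end])
    return child_idx  # assumed convention: re-check the same position next
end

# a + (b + c) + d, built explicitly so the nesting is unambiguous
ex = Expr(:call, :+, :a, Expr(:call, :+, :b, :c), :d)
if applies(ex, 3, AssociativeRule)
    rewrite!(ex, 3, AssociativeRule)
end
```

After the rewrite, ex is a single flat call to + with four arguments.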

The goal here is to make the code easier to read, maintain, and extend. By expressing all the operations on the formula DSL in the same terms, it'll be easier, I hope, to expand it or for package authors to implement custom extensions (although that may require using custom macros with the current implementation). In the future I plan to change how the terms themselves are represented to make things even more composable and extensible, but I think the current PR should be considered regardless of that.

The only thing that's currently broken is the drop_term! tests: their Terms are equivalent but the actual Formulae are not, because I wasn't sure what the best way is to modify the whole expression of the formula (which I now store on the Formula struct, to hold onto both the original (un-parsed) expression that's passed to the macro and the whole parsed expression).

ararslan · 2018-04-16T22:23:02Z

REQUIRE

@@ -2,3 +2,4 @@ julia 0.6
 DataFrames 0.11.5
 StatsBase 0.20.1
 Compat 0.63
+ArgCheck


I'd really prefer we not add a dependency on ArgCheck, since @argcheck thing "ahhh" is identical to thing || throw(ArgumentError("ahhh")), and the latter is both clearer (IMO) and will provide a cleaner backtrace (nothing about inlined macro expansion from @argcheck).
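
For reference, the manual form of such a check looks like the sketch below. The function name check_child_idx and the condition it tests are hypothetical and chosen only to show the `cond || throw(ArgumentError(...))` pattern; they are not taken from this PR.

```julia
# Hypothetical helper: validate an index into ex.args without ArgCheck.
function check_child_idx(ex::Expr, child_idx::Int)
    # Short-circuit ||: the throw only runs when the condition is false.
    1 <= child_idx <= length(ex.args) ||
        throw(ArgumentError("child_idx $child_idx is out of range for ex.args"))
    return child_idx
end

check_child_idx(:(a + b), 2)
```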

It's more of a habit than anything for me so I'm fine to do the checks manually. At this point that wouldn't add too much boilerplate. Your objection is just that argcheck isn't strictly necessary here, right?

Yeah. And I always favor fewer dependencies when possible.

AFAICT you removed all uses of @argcheck but still added the dependency.

Yes, I caught this right after I tagged a release so I pushed another commit on master to remove it. Thanks attobot for showing the requires diff.

kleinschmidt · 2018-04-16T22:23:44Z

I don't know what's happening with the nightly tests; they pass locally for me (but Pkg3 seems to pick up CategoricalArrays 0.3.7 and not 0.3.8, which is what Travis seems to pick). Something is badly wrong with 0.6 so I need to get that sorted.

ararslan · 2018-04-16T22:25:20Z

src/formula.jl

-        ## not a :call to s, return condensed version of ex
-        return condense(ex)
+        # recursively sort children
+        sort_terms!.(ex.args)


foreach(sort_terms!, ex.args) may be more efficient here since it avoids allocations from broadcasting.
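
A small sketch of the difference, using a stand-in sort_terms! (the sorting criterion here, by string representation, is an assumption for illustration and not what the PR does): broadcasting collects the return values into a new array, while foreach runs the function purely for its side effects and returns nothing.

```julia
# Stand-in for the PR's sort_terms!; the ordering rule is assumed.
sort_terms!(x) = x  # leaves (symbols, numbers) are left untouched

function sort_terms!(ex::Expr)
    # Recurse for side effects only; no result array is built.
    foreach(sort_terms!, ex.args)
    if ex.head == :call && ex.args[1] in (:+, :&)
        # Sort the operands (everything after the operator) in place.
        ex.args[2:end] = sort(ex.args[2:end], by = string)
    end
    return ex
end

# c + (b & a) + a, built explicitly
ex = Expr(:call, :+, :c, Expr(:call, :&, :b, :a), :a)
sort_terms!(ex)

broadcast_result = sort_terms!.(ex.args)        # allocates a new array
foreach_result = foreach(sort_terms!, ex.args)  # returns nothing
```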

kleinschmidt · 2018-04-16T22:36:48Z

Ah CategoricalArrays 0.3.8 is going to break this whole package because we rely on unique returning levels order, not order of occurrence.

kleinschmidt · 2018-04-17T00:25:53Z

Okay I think #55 will fix the categorical array issue.

codecov-io · 2018-04-17T13:02:24Z

Codecov Report

Merging #54 into master will decrease coverage by 0.73%.
The diff coverage is 98.92%.

@@            Coverage Diff             @@
##           master      #54      +/-   ##
==========================================
- Coverage   94.59%   93.85%   -0.74%     
==========================================
  Files           5        6       +1     
  Lines         333      358      +25     
==========================================
+ Hits          315      336      +21     
- Misses         18       22       +4

Impacted Files	Coverage Δ
src/deprecated.jl	`100% <100%> (ø)`
src/formula.jl	`96.96% <98.9%> (-3.04%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

nalimilan

You're by far the one who knows this code best, so if you think it's OK I can only agree! Though, like @ararslan, I don't think requiring ArgCheck is a good idea.

nalimilan · 2018-04-18T09:34:03Z

src/formula.jl

 getterms(a::Any) = Any[a]

-ord(ex::Expr) = (ex.head == :call && ex.args[1] == :&) ? length(ex.args)-1 : 1
-ord(a::Any) = 1
+# ord(ex::Expr) = (ex.head == :call && ex.args[1] == :&) ? length(ex.args)-1 : 1


nalimilan · 2018-04-18T09:34:52Z

test/formula.jl


    ## totally empty
-    t = Terms(Formula(nothing, 0))
+    t = Terms(@eval @formula $(:($nothing ~ 0)))


This is intriguing...

well since I've moved parsing into the @formula macro and there's not currently syntactic support for one-sided formulae this is what's necessary :)

nalimilan · 2018-04-18T09:35:07Z

test/formula.jl

@@ -101,16 +104,16 @@
    ## @test t.terms == [:x2, :(x1&x2)]    # == [:(1 & x1)]
    ## @test t.eterms == [:y, :x1, :x2]

-    @test dropterm(@formula(foo ~ 1 + bar + baz), :bar) == @formula(foo ~ 1 + baz)
-    @test dropterm(@formula(foo ~ 1 + bar + baz), 1) == @formula(foo ~ 0 + bar + baz)
+    @test_broken dropterm(@formula(foo ~ 1 + bar + baz), :bar) == @formula(foo ~ 1 + baz)


I guess you plan to fix this?

some have been fixed by this PR, some have been broken (but maintain functionality so passing versions have been added alongside).

kleinschmidt · 2018-04-18T17:01:06Z

I've removed the ArgCheck dependency and fixed those tests. I also added a few more parser checks (for & with 1 and one-element interactions) that fix a few tests that were broken before. Maybe @dmbates wants to weigh in before this is merged? I don't think this should cause any problems with MixedModels but just in case...

dmbates · 2018-04-18T19:31:32Z

I do have a problem with one of the tests in MixedModels because of the way that I explicitly construct a one-sided formula. It is at MixedModels/src/pls.jl:90 and I am trying to build up a model matrix from individual terms. I use an expression of the form ModelFrame(Formula(nothing, rhs), fr) where rhs is an expression.

I think it may be best if I rewrite this section of the MixedModels code. It seems to me that this is probably where JuliaStats/MixedModels.jl#100 originates. However, I should have a temporary fix so that this PR in StatsModels can go forward. Is there an explicit constructor for a one-sided formula?

kleinschmidt · 2018-04-18T20:13:55Z

You might want to wait to do that re-write because I have bigger changes planned here about how Terms are represented (which may or may not change how best to handle things in MixedModels).

This PR actually fixes that issue, since the parsing rules are applied to the arguments of | as well:

julia> @formula( y ~ 1 + x + (a*b | c))
Formula: y ~ 1 + x + ((a + b + a & b) | c)

Your issue is due to the fact that the formula holds onto the original expression. You could probably get away with just constructing a Formula with empty ex and ex_orig fields for now.
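
As a rough illustration of why the arguments of | get expanded too, here is a standalone sketch of star expansion that simply recurses into every call argument. The expand_star function is hypothetical (it is not the rewrite machinery in this PR) and only handles the binary a*b case.

```julia
# Hypothetical helper: rewrite binary a*b as a + b + a&b everywhere in ex.
expand_star(x) = x  # symbols and numbers are returned unchanged

function expand_star(ex::Expr)
    # Expand the children first, so nested stars (e.g. inside |) are handled.
    args = map(expand_star, ex.args)
    if ex.head == :call && args[1] == :* && length(args) == 3
        a, b = args[2], args[3]
        return Expr(:call, :+, a, b, Expr(:call, :&, a, b))
    end
    return Expr(ex.head, args...)
end

expanded = expand_star(:(y ~ 1 + x + (a*b | c)))
```

Because the recursion does not treat | specially, the a*b on its left-hand side is rewritten like any other sub-expression.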

dmbates · 2018-04-18T20:33:48Z

I had forgotten that the representation of a Formula had changed. Would it be possible to add another Formula constructor like

Formula(lhs::Union{Expr, Symbol, Void}, rhs::Union{Expr, Integer, Symbol}) = Formula(:(), :(), lhs, rhs)

to make this case work in the short term? I could do something in the MixedModels package but I think it would be easier to add the method here.

kleinschmidt · 2018-04-19T00:22:02Z

Yes, I'd be okay with that, as long as there's a deprecation warning so that people who have been constructing formulae manually are aware. That might help avoid breakage from this change, too.

The other option is to do something like @eval @formula($(nothing) ~ $rhs).

Either way, it's probably good to upper-bound to the current version of StatsModels for anyone who relies on the internal structure of Formulas.

kleinschmidt · 2018-04-19T00:43:32Z

I added a Formula(lhs, rhs) method that creates a .ex field by splicing in the two arguments, and sets ex_orig to be :(), and gives a deprecation warning suggesting the 4-argument form or the @eval @formula solution if parsing is needed. This ensures backwards compatibility except in cases where you expect the lhs/rhs fields to be un-parsed, which...serves them right I guess.
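
Roughly, the shape of that method is sketched below on a stand-in type. SketchFormula, its field types, and the warning text are assumptions for illustration (written for Julia 0.7 and later, where Void is called Nothing); the field order follows the 4-argument call suggested above.

```julia
# Stand-in for the Formula struct; field names follow the discussion above.
struct SketchFormula
    ex_orig::Expr
    ex::Expr
    lhs::Union{Expr, Symbol, Nothing}
    rhs::Union{Expr, Integer, Symbol}
end

# Deprecated 2-argument form: build .ex by splicing lhs and rhs into a ~ call.
function SketchFormula(lhs, rhs)
    Base.depwarn("SketchFormula(lhs, rhs) is deprecated; use the 4-argument " *
                 "constructor, or @eval @formula if parsing is needed.",
                 :SketchFormula)
    return SketchFormula(:(), Expr(:call, :~, lhs, rhs), lhs, rhs)
end

f = SketchFormula(nothing, :(1 + x))
```

Note that lhs and rhs are stored exactly as passed in, so callers who hand over un-parsed expressions get un-parsed fields back.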

kleinschmidt · 2018-04-19T12:21:38Z

The failure here is because the way warnings are printed is different between 0.6 and 0.7, so using @test_warn "deprecated" ... doesn't work on 0.7, and @test_logs isn't available on 0.6. So maybe it's better to just leave out a test for this, or do something like

    if VERSION > v"0.7.0-DEV"
        f2 = @eval(@test_logs (:warn, r"Formula\(lhs, rhs\) is deprecated") Formula(f.lhs, f.rhs))
    else        
        f2 = @eval(@test_warn "deprecated" Formula(f.lhs, f.rhs))
    end

ararslan · 2018-04-19T15:20:53Z

test/deprecated.jl

@@ -1,7 +1,12 @@
 @testset "Deprecations" begin
    f = @formula y ~ 1 + a*b

-    f2 = @test_warn "deprecated" Formula(f.lhs, f.rhs)
+    if VERSION > v"0.7.0-DEV"


It would be nice to use a more precise version here.

I mostly want to split 0.6/0.7...I don't know when @test_logs appeared but I think it's fairly safe to assume that anyone who's on 0.7.something will want the other test. So something like if VERSION < v"0.6.9999"?

Should be

@static if VERSION >= v"0.7.0-DEV.2988"
    @test_logs ...
else
    # other thing
end

The @static will prevent Julia from trying to expand @test_logs when it doesn't exist.

I see now that v"0.7.0" is higher than any v"0.7.0-DEV", so this makes sense to me.
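
A self-contained sketch of the pattern and of the version ordering it relies on. The old_constructor function is a hypothetical stand-in that emits a warning with @warn; it is not the deprecated Formula method itself.

```julia
using Test

# Hypothetical stand-in for a deprecated method that logs a warning.
function old_constructor(x)
    @warn "old_constructor(x) is deprecated"
    return x
end

# @static picks a branch at macro-expansion time, so the macro in the
# untaken branch is never expanded.
@static if VERSION >= v"0.7.0-DEV.2988"
    result = @test_logs (:warn, r"deprecated") old_constructor(1)
else
    result = @test_warn "deprecated" old_constructor(1)
end

# Pre-release versions sort below the corresponding release.
release_is_newer = v"0.7.0" > v"0.7.0-DEV.2988" > v"0.7.0-DEV" > v"0.6.9999"
```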

kleinschmidt · 2018-04-19T23:03:49Z

I fixed the version check and I think that's the last thing.

matthieugomez · 2018-04-26T20:16:41Z

I am trying to update my package for this pull request.
I have an issue for formulas with no left hand side:

using DataFrames, StatsModels
df = DataFrames.DataFrame(x1 = [1])
StatsModels.ModelFrame(StatsModels.Terms(StatsModels.@formula(nothing~x1)), df)
#> ERROR: KeyError: key :nothing not found

How can I adapt this code to make it work?

kleinschmidt · 2018-04-26T20:26:14Z

you can do @eval @formula($nothing ~ x1). If you look at the expression :(nothing ~ x1) you'll see that nothing is actually the symbol :nothing:

julia> dump( :(nothing ~ x1) )
Expr
  head: Symbol call
  args: Array{Any}((3,))
    1: Symbol ~
    2: Symbol nothing
    3: Symbol x1
  typ: Any

whereas you want

julia> dump(:($nothing ~ x1))
Expr
  head: Symbol call
  args: Array{Any}((3,))
    1: Symbol ~
    2: Void nothing
    3: Symbol x1
  typ: Any
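
The same difference can be checked directly on the expression arguments, without StatsModels. This is just a sketch in plain Julia; on 0.7 and later the Void shown in the dump is named Nothing.

```julia
quoted = :(nothing ~ x1)    # the left-hand side is the Symbol :nothing
spliced = :($nothing ~ x1)  # the left-hand side is the value nothing

lhs_quoted = quoted.args[2]
lhs_spliced = spliced.args[2]

lhs_quoted isa Symbol    # the name, which gets looked up as a column
lhs_spliced === nothing  # the actual value, i.e. no left-hand side
```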

matthieugomez · 2018-04-26T20:43:28Z

Ok — it works now. Thanks a lot for your help.

ararslan reviewed Apr 16, 2018


kleinschmidt added 8 commits April 17, 2018 08:42

drop in new parsing code

38a5134

use typemax(Int) instead of Inf

6ed79ba

fix onesided formulae

2c72081

more tests are broken

969c5bc

guess debug macro isn't exported

b162f95

get rid of debugging code

94acf5b

using argcheck in formula: cut the crap and indent

ff277c8

forgot a Compat.findfirst

0f88bfa

kleinschmidt force-pushed the formula2 branch from 24ec99b to 0f88bfa Compare April 17, 2018 12:42

kleinschmidt requested review from andreasnoack and nalimilan April 18, 2018 00:39

nalimilan reviewed Apr 18, 2018


kleinschmidt added 8 commits April 18, 2018 07:58

remove argcheck dependency

5918628

updating tests

d2e0ba0

some have been fixed by this PR, some have been broken (but maintain functionality so passing versions have been added alongside).

actually just use Terms to get functional version of formula

99a411b

rewrite rule for interactions with number terms

e28da1b

make revise-friendly, deleteat ex.args, missing ||

c96f6e4

typo

e379f25

docstrings for new rewrites and update tests

cca0e6f

remove dead code

0090e78

add (deprecated) 2-arg Formula constructor, and tests

8da82bc

kleinschmidt mentioned this pull request Apr 19, 2018

Julia 0.7-DEV version #52

Closed

use different warning tests depending on the version

643bc8c

ararslan reviewed Apr 19, 2018


kleinschmidt added 2 commits April 19, 2018 18:36

use @ static for version check

bc8c314

check version against 0.7.0-DEV.2988

f1b98c2

kleinschmidt merged commit dc070dc into master Apr 22, 2018

ararslan deleted the formula2 branch April 22, 2018 23:17

kleinschmidt mentioned this pull request Apr 24, 2018

interactions in random effect specification JuliaStats/MixedModels.jl#100

Closed

kleinschmidt mentioned this pull request May 9, 2018

parse one-sided formulas correctly and add tests #20

Closed

kleinschmidt mentioned this pull request Aug 7, 2018

Terms 2.0: son of Terms #71

Merged

kleinschmidt mentioned this pull request Mar 10, 2019

RFC use Term type for parsing Formulas #4

Closed
