-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rip out formula parsing and replace it with something sane #54
Conversation
@@ -2,3 +2,4 @@ julia 0.6 | |||
DataFrames 0.11.5 | |||
StatsBase 0.20.1 | |||
Compat 0.63 | |||
ArgCheck |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd really prefer we not add a dependency on ArgCheck, since @argcheck thing "ahhh"
is identical to thing || throw(ArgumentError("ahhh"))
, and the latter is both clearer (IMO) and will provide a cleaner backtrace (nothing about inlined macro expansion from @argcheck
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's more of a habit than anything for me so I'm fine to do the checks manually. At this point that wouldn't add too much boilerplate. Your objection is just that argcheck isn't strictly necessary here, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. And I always favor fewer dependencies when possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT you removed all uses of @argcheck
but still added the dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I caught this right after I tagged a release so I pushed another commit on master to remove it. Thanks attobot for showing the requires diff.
I don't know what's happening with the nightly tests; they pass locally for me (but the version that Pkg3 seems to pick up CategoricalArrays 0.3.7 and not 0.3.8 which is what Travis seems to pick). Something is badly wrong with 0.6 so I need to get that sorted. |
## not a :call to s, return condensed version of ex | ||
return condense(ex) | ||
# recursively sort children | ||
sort_terms!.(ex.args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
foreach(sort_terms!, ex.args)
may be more efficient here since it avoids allocations from broadcasting.
Ah CategoricalArrays 0.3.8 is going to break this whole package because we rely on unique returning levels order, not order of occurrence. |
Okay I think #55 will fix the categorical array issue. |
Codecov Report
@@ Coverage Diff @@
## master #54 +/- ##
==========================================
- Coverage 94.59% 93.85% -0.74%
==========================================
Files 5 6 +1
Lines 333 358 +25
==========================================
+ Hits 315 336 +21
- Misses 18 22 +4
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're by far the one who knows this code best, so if you think it's OK I can only agree! Though as @ararslan I don't think requiring ArgCheck is a good idea.
src/formula.jl
Outdated
getterms(a::Any) = Any[a] | ||
|
||
ord(ex::Expr) = (ex.head == :call && ex.args[1] == :&) ? length(ex.args)-1 : 1 | ||
ord(a::Any) = 1 | ||
# ord(ex::Expr) = (ex.head == :call && ex.args[1] == :&) ? length(ex.args)-1 : 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove?
|
||
## totally empty | ||
t = Terms(Formula(nothing, 0)) | ||
t = Terms(@eval @formula $(:($nothing ~ 0))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is intriguing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well since I've moved parsing into the @formula
macro and there's not currently syntactic support for one-sided formulae this is what's necessary :)
test/formula.jl
Outdated
@@ -101,16 +104,16 @@ | |||
## @test t.terms == [:x2, :(x1&x2)] # == [:(1 & x1)] | |||
## @test t.eterms == [:y, :x1, :x2] | |||
|
|||
@test dropterm(@formula(foo ~ 1 + bar + baz), :bar) == @formula(foo ~ 1 + baz) | |||
@test dropterm(@formula(foo ~ 1 + bar + baz), 1) == @formula(foo ~ 0 + bar + baz) | |||
@test_broken dropterm(@formula(foo ~ 1 + bar + baz), :bar) == @formula(foo ~ 1 + baz) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you plan to fix this?
some have been fixed by this PR, some have been broken (but maintain functionality so passing versions have been added alongside).
I've removed the argparse dependency and fixed those tests. I also added a few more parser checks (for |
I do have a problem with one of the tests in I think it may be best if I rewrite this section of the MixedModels code. It seems to me that this is probably where JuliaStats/MixedModels.jl#100 originates. However, I should have a temporary fix so that this PR in |
You might want to wait to do that re-write because I have bigger changes planned here about how Terms are represented (which may or may not change how best to handle things in MixedModels). This PR actually fixes that issue, since the parsing rules are applied to the arguments of
Your issue is due to the fact that the formula holds onto the original expression. You could probably get away with just constructing a |
I had forgotten that the representation of a
to make this case work in the short term? I could do something in the |
Yes, I'd be okay with that, as long as there's a deprecation warning so that people who have been constructing formulae manually are aware. That might help avoid breakage from this change, too. The other option is to do something like Either way probably good to upperbound to the current version of statsmodels for anyone who relies on the internal structure of |
I added a |
The failure here is because the way warnings are printed is different between 0.6 and 0.7, so using
|
test/deprecated.jl
Outdated
@@ -1,7 +1,12 @@ | |||
@testset "Deprecations" begin | |||
f = @formula y ~ 1 + a*b | |||
|
|||
f2 = @test_warn "deprecated" Formula(f.lhs, f.rhs) | |||
if VERSION > v"0.7.0-DEV" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to use a more precise version here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mostly want to split 0.6/0.7...I don't know when the @warn_logs
appeared but I think it's fairly safe to assume that anyone who's on 0.7.something will want the other test. So somthing like if VERSION < v"0.6.9999"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be
@static if VERSION >= v"0.7.0-DEV.2988"
@test_logs ...
else
# other thing
end
The @static
will prevent Julia from trying to expand @test_logs
when it doesn't exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now that v"0.7.0"
is higher than any v"0.7.0-DEV"
so that makes sense to me so this makes sense.
I fixed the version check and I think that's the last thing. |
I am trying to update my package for this pull request. using DataFrames, StatsModels
df = DataFrames.DataFrame(x1 = [1])
StatsModels.ModelFrame(StatsModels.Terms(StatsModels.@formula(nothing~x1)), df)
#> ERROR: KeyError: key :nothing not found How can I adapt this code to make it work? |
you can do julia> dump( :(nothing ~ x1) )
Expr
head: Symbol call
args: Array{Any}((3,))
1: Symbol ~
2: Symbol nothing
3: Symbol x1
typ: Any whereas you want julia> dump(:($nothing ~ x1))
Expr
head: Symbol call
args: Array{Any}((3,))
1: Symbol ~
2: Void nothing
3: Symbol x1
typ: Any |
Ok — it works now. Thanks a lot for your help. |
This PR replaces the spaghetti code that handled the expression re-writes that transform formula surface syntax into a bunch main effects and interactions. It works by specifying re-write rules (represented by subtypes of
FormulaRewrite
) for associative rule, distributive rule, "star expansion" (a*b
toa+b+a&b
), and subtraction (replacingx-1
withx + -1
). These rules are implemented via methods on two functions:applies(ex::Expr, child_idx::Int, ::Type{T<:FormulaRewrite})
andrewrite!(ex::Expr, child_idx::Int, ::Type{T<:FormulaRewrite})
. The first checks whether the child ofex
atchild_idx
should be re-written under ruleT
, and the second modifiesex
in place and returns the index of the next child to check. For example, the associative rule checks whetherex
andex.args[child_idx]
are both calls to the same associative operator, and splices the args of the child intoex
in place of the child.The goal here is to make the code easier to read, maintain, and extend. Be expressing all the operations on the formula DSL in the same terms it'll be easier, I hope, to expand it or for package authors to implement custom extensions (although that may require using custom macros with the current implementation). In the future I plan to change how the terms themselves are represented to make things even more composable and extensible, but I think the current PR should be considered regardless of that.
The only thing that's currently broken is the
drop_term!
tests: they'reTerms
are equivalent but the actualFormulae
are not because I wasn't sure what the best way to modify the whole expression of the formula is (which I now store on theFormula
struct, to hold onto both the original (un-parsed) expression that's passed to the macro and the whole parsed expression).