-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC use Term type for parsing Formulas #4
Conversation
function Base.show(io::IO, f::Formula) | ||
print(io, | ||
string("Formula: ", | ||
f.lhs == nothing ? "" : f.lhs, " ~ ", f.rhs)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could put this all on one line and still be under the usual cutoff. Also I think it's typically preferred to use ===
when comparing against nothing
.
|
||
## Integers to intercept terms | ||
function Term(i::Integer) | ||
i == 0 || i == -1 || i == 1 || error("Can't construct term from Integer $i") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
throw
ing an ArgumentError
might be more appropriate in this situation.
else | ||
## not a :call to s, return condensed version of ex | ||
return condense(ex) | ||
haslhs = f.lhs != nothing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
!==
I'm kind of wondering whether we would benefit from having the right- and left-hand sides of the formula be Nullable
, so these checks are just isnull
.
## always return an ARRAY of terms | ||
getterms(ex::Expr) = (ex.head == :call && ex.args[1] == :+) ? ex.args[2:end] : Expr[ex] | ||
getterms(a::Any) = Any[a] | ||
evalterm_sets = [Set(x) for x in evalterms] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can do Set.(evalterms)
, though that doesn't really matter
getterms(ex::Expr) = (ex.head == :call && ex.args[1] == :+) ? ex.args[2:end] : Expr[ex] | ||
getterms(a::Any) = Any[a] | ||
evalterm_sets = [Set(x) for x in evalterms] | ||
evalterms = unique(vcat(evalterms...)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reduce(vcat, evalterms)
avoids the performance penalty associated with splatting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reduce
will grow the vector repeatedly, which isn't great either. Here you don't actually need to build the full vector just to extract the unique values. I think repeated calls to union!
should do what you want most efficiently. Or if performance doesn't really matter, reduce(union, evalterms)
.
evalterms = unique(vcat(evalterms...)) | ||
|
||
factors = Int8[t in s for t in evalterms, s in evalterm_sets] | ||
non_redundants = fill(false, size(factors)) # initialize to false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fill(false, n)
-> falses(n)
?
* clean up show * argument error for intercept Term * === nothing * reduce instead of ...
Current coverage is 91.60% (diff: 91.66%)@@ master #4 diff @@
==========================================
Files 5 5
Lines 307 286 -21
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
- Hits 290 262 -28
- Misses 17 24 +7
Partials 0 0
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't really tell what's the best approach, you're in the best position to know whether this makes the code simpler or not.
Though I wonder whether relying on dispatch for this is really a good idea: the types will only be known at run time, so there's no performance gain to expect. Extensibility isn't really a concern either: it packages need new operators, I'd rather add centralized support here rather than having each package introduce variants which may end up conflicting with each other.
The fact that you need to wrap evaluation terms inside a :eval
node may be an indication that another approach would be more natural. Without changing the spirit of the code, you could replace method definitions with a single method containing an if
sequence. That may sound un-Julian at first, but it should be actually more efficient, and not longer. For example, several evt
methods are identical and could go to the same branch. I think what matters is the clean separation of operations into specific functions; whether they are implemented via type dispatch or manual within-method dispatch doesn't affect the clarity of the code.
end | ||
|
||
## no-op constructor | ||
Term{H}(t::Term{H}) = t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for H
here.
Term{:+}([add_children!(deepcopy(t), c, others) for c in new_child.children]) | ||
|
||
## Expansion of a*b -> a + b + a&b | ||
expand_star(a::Term,b::Term) = Term{:+}([a, b, Term{:&}([a,b])]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing spaces after commas.
Term{:+}([children[1], Term{-1}()]) : | ||
error("invalid subtraction of $(children[2]); subtraction only supported for -1") | ||
|
||
## sorting term by the degree of its children: order is 1 for everything except |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"sorting a term's children by their degree" would be more accurate.
end | ||
|
||
################################################################################ | ||
## This duplicates the functionality of the DataFrames.Terms type: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DataFrames type is supposed to be removed, isn't it? If so, no need to mention it in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was just a note to myself that this is the "old bit". I think, ultimately, that we can replace the Terms
type and work with the tree of Term
directly when constructing model matrices.
|
||
## printing Terms: | ||
@test string(Term(:a)) == "a" | ||
@test string(Term(:(a+b))) == "+(a, b)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be better to use infix operators? Or is that too much work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better, too. Marginally more work but probably worth it, this is just what I hacked together initially...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, this can wait for another PR.
evalterm_sets = [Set(x) for x in evalterms] | ||
evalterms = unique(reduce(vcat, [], evalterms)) | ||
|
||
factors = Int8[t in s for t in evalterms, s in evalterm_sets] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not Bool
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Holdover from the DataFrames.jl version...
evalterms = unique(reduce(vcat, [], evalterms)) | ||
|
||
factors = Int8[t in s for t in evalterms, s in evalterm_sets] | ||
non_redundants = falses(size(factors)) # initialize to false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
falses
creates a BitArray
, which is going to be converted by the constructor. So fill(falses, size(factors))
is actually better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh interesting, I didn't realize that. My bad for suggesting falses
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's a bit confusing and stands out in the API. I'd rather do this via falses(BitArray, ...)
.
|
||
ord(ex::Expr) = (ex.head == :call && ex.args[1] == :&) ? length(ex.args)-1 : 1 | ||
ord(a::Any) = 1 | ||
Terms(terms, evalterms, factors, non_redundants, degrees, haslhs, hasintercept) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Useless line break.
print(io, string("Formula: ", f.lhs === nothing ? "" : f.lhs, " ~ ", f.rhs)) | ||
|
||
|
||
## Define Terms type that manages formula parsing and extension. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Term
type (no plural).
|
||
typealias InterceptTerm Union{Term{0}, Term{-1}, Term{1}} | ||
|
||
## equality of Terms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
Bump! |
I've thought a bit more about this in the context of generating model matrices. My long-term vision for this has roughly been that a It would still be possible to use an abstraction like this without immediately parsing expressions into Terms, but I don't really see the advantage of that (other than possibly performance), especially since there's already a fair amount of transformations that happen in that parsing process (e.g., applying distributive property, expanding |
This PR changes how Formulas are parsed. Instead of manually manipulating the
Expr
, this converts it into a tree ofTerm
s. These are parametric on the type of node, e.g.Term{:+}
. These are combined into a reduced tree, adding children one at a time, and using dispatch to implement special rules like the associative and distributive properties, expansion of*
, etc.This results in cleaner parsing code, which is also (I think) more extensible, since people need only add additional
add_children!
methods to implement whatever behavior they'd like. It also, I hope, will make for writing more flexibleModelMatrix
code, since you can dispatch on each of the Terms and the data source type to create the actual columns. But I haven't really dug into that yet.I'm not 100% satisfied with this design but it passes all the tests and I think is ready for input. The main thing I'm unhappy with is how leaf nodes ("evaluation terms") are handled. These implemented using a special
:eval
place holder, with the variable name symbol as the only child. Originally I had used the symbol itself as the parameter, but this is a terrible idea because you need to compile special methods to handled every variable (and so performance goes to crap for formulas with many variables).