RFC use Term type for parsing Formulas #4

kleinschmidt · 2016-10-18T21:38:08Z

This PR changes how Formulas are parsed. Instead of manually manipulating the Expr, this converts it into a tree of Terms. These are parametric on the type of node, e.g. Term{:+}. These are combined into a reduced tree, adding children one at a time, and using dispatch to implement special rules like the associative and distributive properties, expansion of *, etc.

This results in cleaner parsing code, which is also (I think) more extensible, since people need only add additional add_children! methods to implement whatever behavior they'd like. I also hope it will make it easier to write more flexible ModelMatrix code, since you can dispatch on each of the Terms and the data source type to create the actual columns. But I haven't really dug into that yet.
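
A minimal sketch of the dispatch-based tree building described above, written for current Julia (1.x) rather than the 0.5-era syntax in the diff. The Term struct, the build helper, and the single associative rule shown here are simplified assumptions for illustration, not the PR's actual code.

```julia
# Hypothetical, simplified Term type: the head symbol is a type parameter.
struct Term{H}
    children::Vector{Any}
end
Term{H}() where {H} = Term{H}(Any[])

# Default rule: append the new child as-is.
add_children!(t::Term, c) = (push!(t.children, c); t)

# Associative rule: a + (b + c) is flattened to +(a, b, c).
function add_children!(t::Term{:+}, c::Term{:+})
    append!(t.children, c.children)
    return t
end

# Build a term for head H by adding children one at a time (hypothetical helper).
function build(H::Symbol, children)
    t = Term{H}()
    for c in children
        add_children!(t, c)
    end
    return t
end

inner = build(:+, [:b, :c])
outer = build(:+, [:a, inner])   # children are flattened to a, b, c
```

New behavior is added by defining another add_children! method for the relevant pair of node types, which is the extension point the description refers to.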

I'm not 100% satisfied with this design but it passes all the tests and I think is ready for input. The main thing I'm unhappy with is how leaf nodes ("evaluation terms") are handled. These are implemented using a special :eval placeholder, with the variable name symbol as the only child. Originally I had used the symbol itself as the parameter, but this is a terrible idea because you need to compile special methods to handle every variable (and so performance goes to crap for formulas with many variables).
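
To make the trade-off concrete, here is a small sketch (current Julia syntax, with a simplified Term struct that is an assumption, not the PR's definition). Using the variable name as the type parameter yields one concrete type per variable, while the :eval placeholder keeps a single type and stores the name as data.

```julia
# Hypothetical, simplified Term type.
struct Term{H}
    children::Vector{Any}
end

vars = (:x1, :x2, :x3)

# Variable name as the type parameter: every variable gets its own concrete
# type, so methods taking a Term are specialized separately for each one.
per_variable = [Term{v}(Any[]) for v in vars]

# :eval placeholder: one concrete type, with the variable name as the only child.
eval_terms = [Term{:eval}(Any[v]) for v in vars]

n_types_per_variable = length(unique(typeof.(per_variable)))
n_types_eval = length(unique(typeof.(eval_terms)))
```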

ararslan · 2016-10-18T22:14:28Z

src/formula.jl

+function Base.show(io::IO, f::Formula)
+    print(io,
+          string("Formula: ",
+                 f.lhs == nothing ? "" : f.lhs, " ~ ", f.rhs))


You could put this all on one line and still be under the usual cutoff. Also I think it's typically preferred to use === when comparing against nothing.
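
A sketch of what the suggested one-line version might look like in current Julia. The Formula struct below is a hypothetical stand-in (the field types are assumptions), and the redundant string(...) wrapper is dropped since print accepts multiple arguments.

```julia
# Hypothetical stand-in for the package's Formula type.
struct Formula
    lhs::Union{Symbol, Expr, Nothing}
    rhs::Union{Symbol, Expr, Integer}
end

# `=== nothing` is an identity check that cannot be overloaded, unlike `==`.
function Base.show(io::IO, f::Formula)
    print(io, "Formula: ", f.lhs === nothing ? "" : f.lhs, " ~ ", f.rhs)
end

two_sided = sprint(show, Formula(:y, :x))
one_sided = sprint(show, Formula(nothing, :x))
```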

ararslan · 2016-10-18T22:16:03Z

src/formula.jl

+
+## Integers to intercept terms
+function Term(i::Integer)
+    i == 0 || i == -1 || i == 1 || error("Can't construct term from Integer $i")


throwing an ArgumentError might be more appropriate in this situation.
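
For illustration, a hedged sketch of the same check throwing an ArgumentError instead of calling error (which raises a generic ErrorException). The InterceptTerm struct and the intercept_term function name are hypothetical, used only to keep the example self-contained.

```julia
# Hypothetical container for an intercept term.
struct InterceptTerm
    value::Int
end

function intercept_term(i::Integer)
    # Only -1, 0 and 1 are meaningful as intercept terms.
    i in (-1, 0, 1) || throw(ArgumentError("Can't construct term from Integer $i"))
    return InterceptTerm(i)
end

ok = intercept_term(1)
```

Callers can then catch ArgumentError specifically rather than matching on the message text.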

ararslan · 2016-10-18T22:19:58Z

src/formula.jl

-    else
-        ## not a :call to s, return condensed version of ex
-        return condense(ex)
+    haslhs = f.lhs != nothing


!==

I'm kind of wondering whether we would benefit from having the right- and left-hand sides of the formula be Nullable, so these checks are just isnull.

ararslan · 2016-10-18T22:21:54Z

src/formula.jl

-## always return an ARRAY of terms
-getterms(ex::Expr) = (ex.head == :call && ex.args[1] == :+) ? ex.args[2:end] : Expr[ex]
-getterms(a::Any) = Any[a]
+    evalterm_sets = [Set(x) for x in evalterms]


I think you can do Set.(evalterms), though that doesn't really matter
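
A quick check that the two spellings agree, using made-up evalterms data (an assumption for illustration):

```julia
# Made-up example: the evaluation terms of a, b, and a&b.
evalterms = [[:a], [:b], [:a, :b]]

by_comprehension = [Set(x) for x in evalterms]
by_broadcast = Set.(evalterms)   # broadcasts the Set constructor over the outer vector

same = by_comprehension == by_broadcast
```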

ararslan · 2016-10-18T22:22:30Z

src/formula.jl

-getterms(ex::Expr) = (ex.head == :call && ex.args[1] == :+) ? ex.args[2:end] : Expr[ex]
-getterms(a::Any) = Any[a]
+    evalterm_sets = [Set(x) for x in evalterms]
+    evalterms = unique(vcat(evalterms...))


reduce(vcat, evalterms) avoids the performance penalty associated with splatting

reduce will grow the vector repeatedly, which isn't great either. Here you don't actually need to build the full vector just to extract the unique values. I think repeated calls to union! should do what you want most efficiently. Or if performance doesn't really matter, reduce(union, evalterms).
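
The alternatives discussed in this thread can be compared side by side. This is only a sketch on made-up data; it shows that the results agree, not how they perform.

```julia
# Made-up example data: evaluation terms for a, b, a&b, c&a.
evalterms = [[:a], [:b], [:a, :b], [:c, :a]]

# Original: splat everything into one vcat call, then deduplicate.
splatted = unique(vcat(evalterms...))

# First suggestion: avoid splatting.
reduced = unique(reduce(vcat, evalterms))

# Second suggestion: accumulate unique values directly with union!.
accumulated = Symbol[]
for terms in evalterms
    union!(accumulated, terms)
end

# One-liner if performance does not matter.
one_liner = reduce(union, evalterms)
```

All four keep elements in order of first appearance, so they can be swapped without changing the column order of the factors matrix built from them.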

ararslan · 2016-10-18T22:23:06Z

src/formula.jl

+    evalterms = unique(vcat(evalterms...))
+
+    factors = Int8[t in s for t in evalterms, s in evalterm_sets]
+    non_redundants = fill(false, size(factors)) # initialize to false


fill(false, n) -> falses(n)?

codecov-io · 2016-10-19T00:40:36Z

Current coverage is 91.60% (diff: 91.66%)

Merging #4 into master will decrease coverage by 2.85%

@@             master         #4   diff @@
==========================================
  Files             5          5          
  Lines           307        286    -21   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits            290        262    -28   
- Misses           17         24     +7   
  Partials          0          0

Powered by Codecov. Last update db6dc9c...4b59c93

nalimilan

I can't really tell what the best approach is; you're in the best position to know whether this makes the code simpler or not.

Though I wonder whether relying on dispatch for this is really a good idea: the types will only be known at run time, so there's no performance gain to expect. Extensibility isn't really a concern either: if packages need new operators, I'd rather add centralized support here than have each package introduce variants which may end up conflicting with each other.

The fact that you need to wrap evaluation terms inside a :eval node may be an indication that another approach would be more natural. Without changing the spirit of the code, you could replace method definitions with a single method containing an if sequence. That may sound un-Julian at first, but it should actually be more efficient, and not longer. For example, several evt methods are identical and could go to the same branch. I think what matters is the clean separation of operations into specific functions; whether they are implemented via type dispatch or manual within-method dispatch doesn't affect the clarity of the code.
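
A rough sketch of the "single method with an if sequence" alternative, in current Julia. The Node type, which stores the head as a field instead of a type parameter, and the add_child! function are hypothetical and cover only the flattening rule.

```julia
# Hypothetical node type: the head is ordinary data, not a type parameter.
struct Node
    head::Symbol
    children::Vector{Any}
end

function add_child!(t::Node, c)
    if c isa Node && c.head == t.head && t.head in (:+, :&)
        # Associative operators share one branch: flatten the nested node.
        append!(t.children, c.children)
    else
        # Everything else (leaf symbols, nodes with a different head) is appended.
        push!(t.children, c)
    end
    return t
end

sum_node = Node(:+, Any[])
add_child!(sum_node, :a)
add_child!(sum_node, Node(:+, Any[:b, :c]))    # flattened into a, b, c

inter = Node(:&, Any[:a])
add_child!(inter, Node(:+, Any[:b, :c]))       # kept as a nested node
```

Rules that would be identical methods under dispatch collapse into one branch here, and leaf variables need no wrapper node because the head is checked at run time.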

nalimilan · 2016-10-19T12:06:46Z

src/formula.jl

+end
+
+## no-op constructor
+Term{H}(t::Term{H}) = t


No need for H here.
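
In current Julia syntax the simplification would look roughly like this (the Term struct is a simplified assumption):

```julia
# Hypothetical, simplified Term type.
struct Term{H}
    children::Vector{Any}
end

# No-op constructor: the type parameter does not need to be named,
# because the method returns its argument unchanged for any Term.
Term(t::Term) = t

leaf = Term{:eval}(Any[:a])
same = Term(leaf)
```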

nalimilan · 2016-10-19T12:09:34Z

src/formula.jl

+    Term{:+}([add_children!(deepcopy(t), c, others) for c in new_child.children])
+
+## Expansion of a*b -> a + b + a&b
+expand_star(a::Term,b::Term) = Term{:+}([a, b, Term{:&}([a,b])])


Missing spaces after commas.
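
With the spacing fixed, a self-contained version of the expansion might read as follows in current Julia (the Term struct is a simplified assumption, not the PR's definition):

```julia
# Hypothetical, simplified Term type.
struct Term{H}
    children::Vector{Any}
end

## Expansion of a*b -> a + b + a&b
expand_star(a::Term, b::Term) = Term{:+}(Any[a, b, Term{:&}(Any[a, b])])

a = Term{:eval}(Any[:a])
b = Term{:eval}(Any[:b])
expanded = expand_star(a, b)
```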

nalimilan · 2016-10-19T12:11:07Z

src/formula.jl

+    Term{:+}([children[1], Term{-1}()]) :
+    error("invalid subtraction of $(children[2]); subtraction only supported for -1")
+
+## sorting term by the degree of its children: order is 1 for everything except


"sorting a term's children by their degree" would be more accurate.

nalimilan · 2016-10-19T12:11:46Z

src/formula.jl

+end
+
+################################################################################
+## This duplicates the functionality of the DataFrames.Terms type:


The DataFrames type is supposed to be removed, isn't it? If so, no need to mention it in the code.

This was just a note to myself that this is the "old bit". I think, ultimately, that we can replace the Terms type and work with the tree of Term directly when constructing model matrices.

nalimilan · 2016-10-19T12:16:06Z

test/term.jl

+
+## printing Terms:
+@test string(Term(:a)) == "a"
+@test string(Term(:(a+b))) == "+(a, b)"


Wouldn't it be better to use infix operators? Or is that too much work?

I think it would be better, too. Marginally more work but probably worth it, this is just what I hacked together initially...

OK, this can wait for another PR.
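
For reference, infix printing could be sketched along these lines in current Julia. This is an assumption about how it might be done, not code from the PR; it ignores operator precedence and parenthesization.

```julia
# Hypothetical, simplified Term type.
struct Term{H}
    children::Vector{Any}
end

# Leaves print as the bare variable name.
Base.show(io::IO, t::Term{:eval}) = print(io, t.children[1])

# Operator nodes print their children joined by the operator symbol.
function Base.show(io::IO, t::Term{H}) where {H}
    join(io, t.children, " $H ")
end

a = Term{:eval}(Any[:a])
b = Term{:eval}(Any[:b])
printed = string(Term{:+}(Any[a, b]))
```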

nalimilan · 2016-10-19T12:18:41Z

src/formula.jl

+    evalterm_sets = [Set(x) for x in evalterms]
+    evalterms = unique(reduce(vcat, [], evalterms))
+
+    factors = Int8[t in s for t in evalterms, s in evalterm_sets]


Why not Bool?

Holdover from the DataFrames.jl version...
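
A small sketch of the same comprehension with Bool as the element type, on made-up inputs (the example data are assumptions). Rows correspond to evaluation terms and columns to terms.

```julia
# Made-up example: variables a and b, terms a, b, and a&b.
evalterms = [:a, :b]
evalterm_sets = [Set([:a]), Set([:b]), Set([:a, :b])]

# factors[i, j] is true when evalterms[i] appears in term j.
factors = Bool[t in s for t in evalterms, s in evalterm_sets]
```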

nalimilan · 2016-10-19T12:19:33Z

src/formula.jl

+    evalterms = unique(reduce(vcat, [], evalterms))
+
+    factors = Int8[t in s for t in evalterms, s in evalterm_sets]
+    non_redundants = falses(size(factors)) # initialize to false


falses creates a BitArray, which is going to be converted by the constructor. So fill(false, size(factors)) is actually better.

Oh interesting, I didn't realize that. My bad for suggesting falses.

Yeah, it's a bit confusing and stands out in the API. I'd rather do this via falses(BitArray, ...).
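
The difference is easy to check in current Julia. The assumption here is that the target field is a Matrix{Bool}, which is what makes the conversion necessary.

```julia
dims = (3, 2)

bits = falses(dims)        # packed BitArray (BitMatrix)
bools = fill(false, dims)  # ordinary Matrix{Bool}

# Equal element-wise, but stored differently; passing the BitMatrix to a
# Matrix{Bool} field forces a conversion like this one.
converted = Matrix{Bool}(bits)
```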

nalimilan · 2016-10-19T12:19:55Z

src/formula.jl


-ord(ex::Expr) = (ex.head == :call && ex.args[1] == :&) ? length(ex.args)-1 : 1
-ord(a::Any) = 1
+    Terms(terms, evalterms, factors, non_redundants, degrees, haslhs, hasintercept)



Useless line break.

nalimilan · 2016-10-19T12:28:25Z

src/formula.jl

+    print(io, string("Formula: ", f.lhs === nothing ? "" : f.lhs, " ~ ", f.rhs))
+
+
+## Define Terms type that manages formula parsing and extension.


Term type (no plural).

nalimilan · 2016-10-19T12:28:45Z

src/formula.jl

+
+typealias InterceptTerm Union{Term{0}, Term{-1}, Term{1}}
+
+## equality of Terms


ararslan · 2016-11-12T18:49:32Z

Bump!

kleinschmidt · 2016-11-15T00:45:25Z

I've thought a bit more about this in the context of generating model matrices. My long-term vision for this has roughly been that a Term is the atomic unit that describes how column(s) are generated from data, or from other Terms. For instance, a Term{:+} says, generate columns for each of my children and then horizontally concatenate them. A Term tree could then replace the concept of the old Terms type, and possibly also ModelFrame (transforming leaf nodes to point to the corresponding data source, although it's less clear to me that this is a good idea). We'd handle things like encoding contrasts and checking redundancy by actually transforming the underlying Terms themselves. This will obviously take some doing to even see whether it works.
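
A toy sketch of that idea in current Julia: each Term knows how to produce its own columns, and a Term{:+} concatenates the columns of its children. The columns function name, the NamedTuple data source, and the simplified Term struct are all assumptions for illustration; contrasts and redundancy checks are left out.

```julia
# Hypothetical, simplified Term type.
struct Term{H}
    children::Vector{Any}
end

# Leaf: look the variable up in the data and return it as an n-by-1 matrix.
columns(t::Term{:eval}, data) = reshape(float.(data[t.children[1]]), :, 1)

# Interaction: elementwise product of the children's columns
# (this sketch assumes every child yields a single column).
columns(t::Term{:&}, data) =
    reduce((x, y) -> x .* y, [columns(c, data) for c in t.children])

# Sum: generate columns for each child, then concatenate horizontally.
columns(t::Term{:+}, data) = reduce(hcat, [columns(c, data) for c in t.children])

data = (x = [1, 2, 3], z = [4, 5, 6])
x = Term{:eval}(Any[:x])
z = Term{:eval}(Any[:z])
mm = columns(Term{:+}(Any[x, z, Term{:&}(Any[x, z])]), data)
```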

It would still be possible to use an abstraction like this without immediately parsing expressions into Terms, but I don't really see the advantage of that (other than possibly performance), especially since there's already a fair amount of transformations that happen in that parsing process (e.g., applying distributive property, expanding *).

kleinschmidt · 2019-03-10T16:50:58Z

This is superseded by #54 and #71

kleinschmidt added 5 commits October 16, 2016 23:25

[WIP] use Term types for parsing formulas

c1b98c2

fix bug in * expansion

9b52e47

test Term directly

28e0e08

hash Term; use unique fixed effects in Terms

2fb5804

clarify which parts are analogous to DataFrames

4b59c93

ararslan reviewed Oct 18, 2016

code review round 1

fc0e82a

* clean up show * argument error for intercept Term * === nothing * reduce instead of ...

nalimilan reviewed Oct 19, 2016

kleinschmidt mentioned this pull request Nov 23, 2016

An alternative representation of Terms? #8

Closed

kleinschmidt mentioned this pull request Aug 7, 2018

Terms 2.0: son of Terms #71

Merged

kleinschmidt closed this Mar 10, 2019

ararslan deleted the dkf/term-type branch March 10, 2019 17:31
