Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC use Term type for parsing Formulas #4

Closed
wants to merge 6 commits into from
Closed

Conversation

kleinschmidt
Copy link
Member

This PR changes how Formulas are parsed. Instead of manually manipulating the Expr, this converts it into a tree of Terms. These are parametric on the type of node, e.g. Term{:+}. These are combined into a reduced tree, adding children one at a time, and using dispatch to implement special rules like the associative and distributive properties, expansion of *, etc.

This results in cleaner parsing code, which is also (I think) more extensible, since people need only add additional add_children! methods to implement whatever behavior they'd like. It also, I hope, will make for writing more flexible ModelMatrix code, since you can dispatch on each of the Terms and the data source type to create the actual columns. But I haven't really dug into that yet.

I'm not 100% satisfied with this design but it passes all the tests and I think is ready for input. The main thing I'm unhappy with is how leaf nodes ("evaluation terms") are handled. These implemented using a special :eval place holder, with the variable name symbol as the only child. Originally I had used the symbol itself as the parameter, but this is a terrible idea because you need to compile special methods to handled every variable (and so performance goes to crap for formulas with many variables).

function Base.show(io::IO, f::Formula)
print(io,
string("Formula: ",
f.lhs == nothing ? "" : f.lhs, " ~ ", f.rhs))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could put this all on one line and still be under the usual cutoff. Also I think it's typically preferred to use === when comparing against nothing.


## Integers to intercept terms
function Term(i::Integer)
i == 0 || i == -1 || i == 1 || error("Can't construct term from Integer $i")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throwing an ArgumentError might be more appropriate in this situation.

else
## not a :call to s, return condensed version of ex
return condense(ex)
haslhs = f.lhs != nothing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!==

I'm kind of wondering whether we would benefit from having the right- and left-hand sides of the formula be Nullable, so these checks are just isnull.

## always return an ARRAY of terms
getterms(ex::Expr) = (ex.head == :call && ex.args[1] == :+) ? ex.args[2:end] : Expr[ex]
getterms(a::Any) = Any[a]
evalterm_sets = [Set(x) for x in evalterms]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can do Set.(evalterms), though that doesn't really matter

getterms(ex::Expr) = (ex.head == :call && ex.args[1] == :+) ? ex.args[2:end] : Expr[ex]
getterms(a::Any) = Any[a]
evalterm_sets = [Set(x) for x in evalterms]
evalterms = unique(vcat(evalterms...))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reduce(vcat, evalterms) avoids the performance penalty associated with splatting

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reduce will grow the vector repeatedly, which isn't great either. Here you don't actually need to build the full vector just to extract the unique values. I think repeated calls to union! should do what you want most efficiently. Or if performance doesn't really matter, reduce(union, evalterms).

evalterms = unique(vcat(evalterms...))

factors = Int8[t in s for t in evalterms, s in evalterm_sets]
non_redundants = fill(false, size(factors)) # initialize to false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fill(false, n) -> falses(n)?

* clean up show
* argument error for intercept Term
* === nothing
* reduce instead of ...
@codecov-io
Copy link

Current coverage is 91.60% (diff: 91.66%)

Merging #4 into master will decrease coverage by 2.85%

@@             master         #4   diff @@
==========================================
  Files             5          5          
  Lines           307        286    -21   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits            290        262    -28   
- Misses           17         24     +7   
  Partials          0          0          

Powered by Codecov. Last update db6dc9c...4b59c93

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't really tell what's the best approach, you're in the best position to know whether this makes the code simpler or not.

Though I wonder whether relying on dispatch for this is really a good idea: the types will only be known at run time, so there's no performance gain to expect. Extensibility isn't really a concern either: it packages need new operators, I'd rather add centralized support here rather than having each package introduce variants which may end up conflicting with each other.

The fact that you need to wrap evaluation terms inside a :eval node may be an indication that another approach would be more natural. Without changing the spirit of the code, you could replace method definitions with a single method containing an if sequence. That may sound un-Julian at first, but it should be actually more efficient, and not longer. For example, several evt methods are identical and could go to the same branch. I think what matters is the clean separation of operations into specific functions; whether they are implemented via type dispatch or manual within-method dispatch doesn't affect the clarity of the code.

end

## no-op constructor
Term{H}(t::Term{H}) = t
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for H here.

Term{:+}([add_children!(deepcopy(t), c, others) for c in new_child.children])

## Expansion of a*b -> a + b + a&b
expand_star(a::Term,b::Term) = Term{:+}([a, b, Term{:&}([a,b])])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing spaces after commas.

Term{:+}([children[1], Term{-1}()]) :
error("invalid subtraction of $(children[2]); subtraction only supported for -1")

## sorting term by the degree of its children: order is 1 for everything except
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"sorting a term's children by their degree" would be more accurate.

end

################################################################################
## This duplicates the functionality of the DataFrames.Terms type:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DataFrames type is supposed to be removed, isn't it? If so, no need to mention it in the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just a note to myself that this is the "old bit". I think, ultimately, that we can replace the Terms type and work with the tree of Term directly when constructing model matrices.


## printing Terms:
@test string(Term(:a)) == "a"
@test string(Term(:(a+b))) == "+(a, b)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to use infix operators? Or is that too much work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better, too. Marginally more work but probably worth it, this is just what I hacked together initially...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this can wait for another PR.

evalterm_sets = [Set(x) for x in evalterms]
evalterms = unique(reduce(vcat, [], evalterms))

factors = Int8[t in s for t in evalterms, s in evalterm_sets]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not Bool?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Holdover from the DataFrames.jl version...

evalterms = unique(reduce(vcat, [], evalterms))

factors = Int8[t in s for t in evalterms, s in evalterm_sets]
non_redundants = falses(size(factors)) # initialize to false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

falses creates a BitArray, which is going to be converted by the constructor. So fill(falses, size(factors)) is actually better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh interesting, I didn't realize that. My bad for suggesting falses.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's a bit confusing and stands out in the API. I'd rather do this via falses(BitArray, ...).


ord(ex::Expr) = (ex.head == :call && ex.args[1] == :&) ? length(ex.args)-1 : 1
ord(a::Any) = 1
Terms(terms, evalterms, factors, non_redundants, degrees, haslhs, hasintercept)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Useless line break.

print(io, string("Formula: ", f.lhs === nothing ? "" : f.lhs, " ~ ", f.rhs))


## Define Terms type that manages formula parsing and extension.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Term type (no plural).


typealias InterceptTerm Union{Term{0}, Term{-1}, Term{1}}

## equality of Terms
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

@ararslan
Copy link
Member

Bump!

@kleinschmidt
Copy link
Member Author

I've thought a bit more about this in the context of generating model matrices. My long-term vision for this has roughly been that a Term is the atomic unit that describes how column(s) are generated from data, or from other Terms. For instance, a Term{:+} says, generate columns for each of my children and then horizontally concatenate them. A Term tree could then replace the concept of the old Terms type, and possibly also ModelFrame (transforming leaf nodes to point to the corresponding data source, although it's less clear to me that this is a good idea). We'd handle things like encoding contrasts and checking redundancy by actually transforming the underlying Terms themselves. This will obviously take some doing to even see whether it works.

It would still be possible to use an abstraction like this without immediately parsing expressions into Terms, but I don't really see the advantage of that (other than possibly performance), especially since there's already a fair amount of transformations that happen in that parsing process (e.g., applying distributive property, expanding *).

@kleinschmidt
Copy link
Member Author

This is superseded by #54 and #71

@ararslan ararslan deleted the dkf/term-type branch March 10, 2019 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants