-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: introduced typed expressions, separate AST and IR #12604
Comments
I think I don't quite understand why explicitly carrying the type in |
Thanks @jayzhan211 @findepi. Youguys are right, currently we using What I'm trying to say is to make this large change it might be a very good reason for doing it. I think we need to list the exact benefits of having this refactoring |
correct, unless
The type will be needed sooner or later. No need to have implicit cache for something that is needed and can be "cached" explicitly. Explicitness helps understand the contract -- if something is a field it implies it won't change. Also, can Expr be used against two different DFSchemas? Currently it can (and may produce different types), but i assume this capability is incidental more than intentional.
That's a good question. Not sure yet how these should be handled.
@comphead that's a very good call. |
I think this is the
FWIW I don't think i have seen get_type appear in the planning profiling traces (this may be because the overhead is mostly dominated by copies so the type calculations are overshadowed. There are some times when the type of an intermedidate expression is needed (like the simplifier) but it doesn't seem to have been a bottleneck so far
I agree with @comphead -- I think this propsoal would be stronger if it was in the context of
In other words, if we were designing a system from scratch this proposal makes sense to me, but I think at this point we already have a system that seems to work reasonably well for our existing users so just changing the design to be better without some practical improvement is a hard one for me
I think making an API for user-defined coercion rules in DataFusion would be really nice (not just to support the usecase, but I think forcing the internal implementaiton into a thoughtful API would make the code easier to reason about and better strutured) |
Lack of this looks like a blocker for Comet 44 update #13717 (comment)
|
I thought this ticket related to Logical exprs (which Comet doesn't use 🤔 ) |
Good point. i jumped the gun. |
For additional context on the Comet use case, we use Spark for query planning and then translate Spark's physical plan to a DataFusion physical plan. We do not use any SQL planning or logical planning from DataFusion. |
The introduction of a new IR and new Expressions has these challenges
While we didn't make progress on the first two, it seems that at least that we made good progress on the last (but not least!), and there is general agreement that a better structured IR makes sense. |
Currently DataFusion expressions (
Expr
) do not know their type. The type can be calculated via<Expr as ExprSchemable::get_type
function.This has drawbacks unfortunately
I believe the current design of expressions come from the fact they are used during initial AST (abstract syntax tree) processing where types aren't known yet. The pair
(AST, AST → LogicalPlan conversion function)
is an implementation of a language. DF implements its own variant of SQL (#12357) and systems building on top of DF may or may not use it (e.g. Ballista probably uses it and Comet wants to be closely aligned with Spark SQL's semantics).Proposal plan
Expr
carry the type explicitly (that would be "logical type" from [Proposal] Decouple logical from physical types #11513)The text was updated successfully, but these errors were encountered: