Fix `INTERVAL` parsing to support expressions and units via dialect
#1398
@alamb please could you take a look at this.
@@ -56,15 +56,6 @@ macro_rules! parser_err {
    };
}

// Returns a successful result if the optional expression is some
drive-by change, this seemed really ugly, I couldn't resist fixing it.
conflicts fixed. This is ready to review.
After trying to adopt this in datafusion, I changed it to be more backwards compatible, e.g.
This is tested in datafusion in apache/datafusion#12222.
Okay, having thought about this more, I'm not convinced the workaround I have now on the generic dialect of

fn allow_interval_expressions(&self) -> bool {
    true
}
fn require_interval_units(&self) -> bool {
    false
}

makes sense. It reduces the effect of the change, but at the expense of some incorrect behaviour, e.g.:
I think
I don't mind which (we're using

I've spent long enough on this with no feedback, so I'll wait for feedback before proceeding. But this is fairly urgent for us since interval precedence is pretty broken in datafusion right now. cc @alamb @andygrove @git-hulk
Thank you @samuelcolvin -- and I very much apologize for the delay in reviewing this PR.
I think this PR is well documented and well tested, and the overall changes make sense to me. The biggest thing I think we should do (alluded to by @git-hulk, I think) is ensure the changes for end users of this crate are well understood:
- Are there queries that would have parsed previously that now will not?
- If "yes", how could users get the old behavior if they wanted?
I have updated the PR title to be a little more specific -- can you double check this?
@@ -830,16 +829,14 @@ fn parse_typed_struct_syntax_bigquery() {
        expr_from_projection(&select.projection[3])
    );

-    let sql = r#"SELECT STRUCT<INTERVAL>(INTERVAL '1-2 3 4:5:6.789999'), STRUCT<JSON>(JSON '{"class" : {"students" : [{"name" : "Jane"}]}}')"#;
+    let sql = r#"SELECT STRUCT<INTERVAL>(INTERVAL '2' HOUR), STRUCT<JSON>(JSON '{"class" : {"students" : [{"name" : "Jane"}]}}')"#;
Could you please
- leave the existing test as is to show what the effect of the changes is on this query? (I think it would error?)
- add a new test for this new `'2' HOUR` variant
Looking at the BigQuery syntax in https://cloud.google.com/bigquery/docs/reference/standard-sql/interval_functions I actually think the new test is more useful / correct -- all the intervals appear to have a unit after them, such as:
...
UNNEST([INTERVAL '1-2 3 4:5:6.789999' YEAR TO SECOND,
INTERVAL '0-13 370 48:61:61' YEAR TO SECOND]) AS i
...
yup, BigQuery is MySQL flavoured when it comes to intervals, so intervals without units are not permitted.
Great PR! Maybe a nitpick about the name:
Okay I think this is ready @alamb @lovasoa.
I believe this PR will cause minimal changes to end users now. The following changes will be introduced:
(I've updated my comment above to correct a very confusing "not" -> "now")
I'll plan to merge this PR tomorrow Sep 5 unless anyone else would like time to review or otherwise comment
🚀
Replaces #1396.
Background
I've done some more research, and basically there are three groups of databases with pretty distinct ways of parsing interval expressions:
1. PostgreSQL-like (here `require_interval_qualifier => false`)
Expect `INTERVAL` expressions to be in the form
These bind `interval` to the literal tightly.
There's pretty good support for parsing these interval expressions in arrow-rs (added in apache/arrow-rs#6211).
These dialects also support syntax like `INTERVAL 1 SECOND`, but notably do not support syntax of the form `INTERVAL 1 + 1 SECOND`.
Databases: Postgres, DuckDB, Redshift, Snowflake
2. MySQL-like (here `require_interval_qualifier => true`)
Expect `INTERVAL` expressions to be in the form
They also support `INTERVAL 1 + 1 SECOND`, so I presume the logic is "keep parsing literals until you meet one of the following units"; there's a good list of these units here. Note these aren't all included in sqlparser-rs; that would be a good follow-up PR.
These dialects do not support syntax like `INTERVAL '1 second'`, i.e. they require the unit/qualifier to be provided.
Databases: MySQL, BigQuery, Databricks, Hive, ClickHouse
3. SQLite and MsSql
These don't seem to support interval syntax at all.
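The "keep parsing until you meet a unit" rule presumed for the MySQL-like group can be sketched roughly as below. This is only an illustration over pre-split string tokens: `UNITS` and `split_at_unit` are hypothetical names, the unit list is a small made-up subset, and none of this is sqlparser-rs code.

```rust
/// Hypothetical subset of MySQL-style interval units; real dialects accept more.
const UNITS: &[&str] = &["SECOND", "MINUTE", "HOUR", "DAY", "WEEK", "MONTH", "YEAR"];

/// Splits the tokens that follow `INTERVAL` into (value expression, unit).
/// Returns `None` when no unit keyword is found or the value is empty,
/// mirroring dialects that require a qualifier.
fn split_at_unit<'a>(tokens: &[&'a str]) -> Option<(Vec<&'a str>, &'a str)> {
    // Find the first token that matches a known unit, ignoring ASCII case.
    let pos = tokens
        .iter()
        .position(|t| UNITS.iter().any(|u| u.eq_ignore_ascii_case(t)))?;
    if pos == 0 {
        // A unit with nothing before it has no value expression.
        return None;
    }
    Some((tokens[..pos].to_vec(), tokens[pos]))
}
```

Under these assumptions, `split_at_unit(&["1", "+", "1", "SECOND"])` yields the value tokens `1 + 1` with unit `SECOND`, while `split_at_unit(&["'1 second'"])` yields `None` because no unit keyword follows the literal.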
The issue
Currently `sqlparser-rs` seems to partially support both syntaxes, with a preference for the MySQL style.
This meant that `INTERVAL '1 second' > x` was wrongly being interpreted as `INTERVAL ('1 second' > x)`.
Datafusion then tries to hack around the problem, but it wasn't a watertight solution.
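To make the precedence problem concrete, here is a toy parser that handles only `INTERVAL`, string literals, identifiers and `>`. It is a sketch, not the sqlparser-rs implementation: `Expr`, `Parser`, `parse` and the `bind_tightly` flag are all hypothetical names chosen for illustration.

```rust
#[derive(Debug, PartialEq)]
enum Expr {
    Ident(String),
    Str(String),
    Interval(Box<Expr>),
    Gt(Box<Expr>, Box<Expr>),
}

struct Parser<'a> {
    tokens: &'a [&'a str],
    pos: usize,
    /// true: INTERVAL takes only the next primary (PostgreSQL-like).
    /// false: INTERVAL swallows a whole expression (the problematic grouping).
    bind_tightly: bool,
}

impl<'a> Parser<'a> {
    fn peek(&self) -> Option<&'a str> {
        self.tokens.get(self.pos).copied()
    }

    fn advance(&mut self) -> Option<&'a str> {
        let tok = self.peek();
        self.pos += 1;
        tok
    }

    /// Parses `prefix [> expr]`.
    fn parse_expr(&mut self) -> Option<Expr> {
        let left = self.parse_prefix()?;
        if self.peek() == Some(">") {
            self.pos += 1;
            let right = self.parse_expr()?;
            return Some(Expr::Gt(Box::new(left), Box::new(right)));
        }
        Some(left)
    }

    /// Parses an INTERVAL, a quoted string, or an identifier.
    fn parse_prefix(&mut self) -> Option<Expr> {
        let tok = self.advance()?;
        if tok.eq_ignore_ascii_case("INTERVAL") {
            let value = if self.bind_tightly {
                self.parse_prefix()?
            } else {
                self.parse_expr()?
            };
            Some(Expr::Interval(Box::new(value)))
        } else if tok.starts_with('\'') {
            Some(Expr::Str(tok.trim_matches('\'').to_string()))
        } else {
            Some(Expr::Ident(tok.to_string()))
        }
    }
}

fn parse(tokens: &[&str], bind_tightly: bool) -> Option<Expr> {
    Parser { tokens, pos: 0, bind_tightly }.parse_expr()
}
```

With the tokens `["INTERVAL", "'1 second'", ">", "x"]`, `bind_tightly = true` produces a `Gt` whose left side is the interval, whereas `bind_tightly = false` wraps the whole comparison inside `Interval`, which is the misparse described above.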
Change Proposed
This PR adds `fn require_interval_qualifier(&self) -> bool` to dialects and uses that to decide how to parse interval expressions.
`require_interval_qualifier => false` also means expressions within intervals (e.g. `INTERVAL 1 + 1 DAY`) are forbidden.
The `GenericDialect` is set to `require_interval_qualifier => false` since intervals without units are vastly more common than intervals with expressions.
The changes required are actually smaller than I feared.
Once this is used in datafusion, it should allow us to remove the `interval` hack completely.
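As a rough illustration of how a dialect could opt in, the sketch below defines a simplified stand-in trait locally. Only the method name and signature come from this PR; the `false` default, the local `GenericDialect` and `MySqlDialect` structs, and the `interval_strategy` helper are assumptions made for the example and are not the crate's actual definitions.

```rust
/// Simplified stand-in for a dialect trait; only the method from this PR is shown.
trait Dialect {
    /// Whether `INTERVAL` expressions must carry a unit, e.g. `INTERVAL 1 DAY`.
    /// The `false` default here is an assumption for this sketch.
    fn require_interval_qualifier(&self) -> bool {
        false
    }
}

// Local illustrative types, not the real sqlparser-rs structs.
struct GenericDialect;
struct MySqlDialect;

// Keeps the default: units optional, expressions inside intervals forbidden.
impl Dialect for GenericDialect {}

// MySQL-like: a unit is required, so expressions before it can be parsed.
impl Dialect for MySqlDialect {
    fn require_interval_qualifier(&self) -> bool {
        true
    }
}

/// Hypothetical helper describing which parsing strategy a dialect selects.
fn interval_strategy(dialect: &dyn Dialect) -> &'static str {
    if dialect.require_interval_qualifier() {
        "parse an expression, then require a unit"
    } else {
        "bind INTERVAL tightly to the next literal; unit optional"
    }
}
```

A parser would consult this single flag at the point where it has just consumed `INTERVAL`, which keeps the dialect-specific behaviour out of the rest of the expression grammar.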