Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix INTERVAL parsing to support expressions and units via dialect #1398

Merged
merged 16 commits into from
Sep 6, 2024

Conversation

samuelcolvin
Copy link
Contributor

@samuelcolvin samuelcolvin commented Aug 23, 2024

Replaces #1396.

Background

I've done some more research, and basically there are three groups of databases with pretty distinct ways of parsing interval expressions:

1. PostgreSQL-like (here require_interval_qualifier => false)

Expect INTERVAL expressions to be in the form

SELECT INTERVAL '1 second'

These bind interval to the literal tightly.

There's pretty good support for parsing these interval expressions in arrow-rs (added in apache/arrow-rs#6211).

These dialects also support syntax like INTERVAL 1 SECOND, but notably do not support syntax of the form INTERVAL 1 + 1 SECOND.

Databases: Postgres, DuckDB, Redshift, Snowflake

2. MySQL-like aka (here require_interval_qualifier => true)

Expect INTERVAL expressions to be in the form

SELECT INTERVAL 1 SECOND

they also support INTERVAL 1 + 1 SECOND, so I presume the logic is "keep parsing literals until you meet one of the following units", there's a good list of these units here. Note these aren't all included in sqlparser-rs, that would be a good follow-up PR.

These dialects do not support syntax like INTERVAL '1 second' — e.g. they require the unit/qualifier to be provided

Databases: MySQL, BigQuery, Databricks, Hive, ClickHouse

Sqlite and MsSql

Don't seem to support interval syntax at all.

The issue

Currently sqlparser-rs seems to partially support both syntaxes, with a preference for the MySQL style.

This meant that INTERVAL '1 second' > x was wrongly being interpreted as INTERVAL ('1 second' > x).

Datafusion then tries to hack around the problem but it wasn't a water tight solution.

Change Proposed

This PR adds fn require_interval_qualifier(&self) -> bool to dialects and uses that to decide how to parse interval expressions. require_interval_qualifier => false also means expressions within intervals (e.g. INTERVAL 1 + 1 DAY) are forbidden.

The GenericDialect is set to require_interval_qualifier => false since intervals without units are vastly more common than intervals with expressions.

The changes required are actually smaller than I feared.

Once this is used in datafusion, it should allow us to remove the interval hack completely.

Copy link
Contributor Author

@samuelcolvin samuelcolvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb please could you take a look at this.

@@ -56,15 +56,6 @@ macro_rules! parser_err {
};
}

// Returns a successful result if the optional expression is some
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drive-by change, this seemed really ugly, I couldn't resist fixing it.

src/parser/mod.rs Show resolved Hide resolved
tests/sqlparser_bigquery.rs Outdated Show resolved Hide resolved
tests/sqlparser_common.rs Outdated Show resolved Hide resolved
@coveralls
Copy link

coveralls commented Aug 28, 2024

Pull Request Test Coverage Report for Build 10703453851

Details

  • 185 of 199 (92.96%) changed or added relevant lines in 10 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.004%) to 89.255%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/dialect/ansi.rs 1 2 50.0%
src/dialect/bigquery.rs 1 2 50.0%
src/dialect/databricks.rs 1 2 50.0%
src/dialect/hive.rs 1 2 50.0%
src/dialect/mod.rs 1 2 50.0%
src/dialect/mysql.rs 1 2 50.0%
src/dialect/clickhouse.rs 0 2 0.0%
tests/sqlparser_common.rs 152 158 96.2%
Files with Coverage Reduction New Missed Lines %
tests/sqlparser_common.rs 1 89.75%
Totals Coverage Status
Change from base Build 10653671802: 0.004%
Covered Lines: 28850
Relevant Lines: 32323

💛 - Coveralls

@samuelcolvin
Copy link
Contributor Author

conflicts fixed. This is ready to review.

@samuelcolvin
Copy link
Contributor Author

samuelcolvin commented Aug 28, 2024

After trying to adopt this in datafusion, I changed it to be more backwards compatible, e.g. interval '1 month' (without the units) is still allowed even where the dialect wouldn't allow it.

@samuelcolvin
Copy link
Contributor Author

This is tested in datafusion in apache/datafusion#12222.

@samuelcolvin
Copy link
Contributor Author

Okay, having thought about this more, I'm not convinced the workaround I have now on the generic dialect of

    fn allow_interval_expressions(&self) -> bool {
        true
    }

    fn require_interval_units(&self) -> bool {
        false
    }

Makes sense.

It reduces the effect of the change, but at the expense of some incorrect behaviour, e.g.: SELECT interval '1 day' > interval '1 hour' is interpreted as SELECT interval ('1 day' > interval '1 hour'). This is the original behaviour I came to fix on Postgres.

I think GenericDialect should choose one flavor and stick to it, either:

  • MySQL flavor — expressions are allowed, units are required
  • or, PostgreSQL flavor — expressions are not allowed, units are not required

I don't mind which (we're using PostgresDialect), but I will say that the number of changes required in datafusion tests will be much smaller if we go with Postgres flavor, and my instinct is that "no units" (e.g. INTERVAL '1 day') is much more commonly used than expressions like INTERVAL '1' + '1' day.

I've spent long enough on this with no feedback, so I'll wait for feedback before proceeding.

But this is fairly urgent for us since interval precedence is pretty broken in datafusion right now. cc @alamb @andygrove @git-hulk

@git-hulk
Copy link
Member

git-hulk commented Sep 2, 2024

This fix looks correct to me, but it might affect a lot for the downstream. To see if @iffyio @alamb feels good about this change?

@alamb alamb changed the title Refactor INTERVAL parsing Fix INTERVAL parsing to support expressions and units via dialect Sep 3, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @samuelcolvin -- and I very much apologize for the delay in reviewing this PR.

I think this PR is well documented and well tested and the overall changes make sense to me. The biggest thing I think that we should do (I think alluded to by @git-hulk) is ensure the change for end users of this crate are well understood:

  1. Are there queries that would have parsed previously that now will not?
  2. If "yes", how could users get the old behavior if they wanted?

I have updated the PR title to be a little more specific -- can you double check this?

cc @iffyio @lovasoa and @jmhain

src/parser/mod.rs Outdated Show resolved Hide resolved
@@ -830,16 +829,14 @@ fn parse_typed_struct_syntax_bigquery() {
expr_from_projection(&select.projection[3])
);

let sql = r#"SELECT STRUCT<INTERVAL>(INTERVAL '1-2 3 4:5:6.789999'), STRUCT<JSON>(JSON '{"class" : {"students" : [{"name" : "Jane"}]}}')"#;
let sql = r#"SELECT STRUCT<INTERVAL>(INTERVAL '2' HOUR), STRUCT<JSON>(JSON '{"class" : {"students" : [{"name" : "Jane"}]}}')"#;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please

  1. leave the existing test as is to show what the effect of the changes are on this query? (I think it would error?)
  2. Add a new test for this new '2' HOUR variant

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the BigQuery syntax in https://cloud.google.com/bigquery/docs/reference/standard-sql/interval_functions I actually think the new test is more useful / correct -- all the intervals appear to have a unit after them, such as:

...
  UNNEST([INTERVAL '1-2 3 4:5:6.789999' YEAR TO SECOND,
          INTERVAL '0-13 370 48:61:61' YEAR TO SECOND]) AS i
...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, BigQuery is MySQL flavoured when it comes to intervals, so intervals without units are not permitted.

@lovasoa
Copy link
Contributor

lovasoa commented Sep 3, 2024

Great PR ! Maybe a nitpick about the name: allow_interval_expressions is probably a little bit confusing. I would have named it the other way around: allow_expressions_in_interval...

@samuelcolvin
Copy link
Contributor Author

samuelcolvin commented Sep 4, 2024

Okay I think this is ready @alamb @lovasoa.

The biggest thing I think that we should do (I think alluded to by @git-hulk) is ensure the change for end users of this crate are well understood:

I believe this PR will cause minimal changes to end users now.

The following changes will be introduced:

  • With the generic dialect expressions within intervals (e.g. INTERVAL 1 + 1 DAYS) will now be disallowed
  • Same with Postgres, snowflake, duckdb, redshift — From testing Postgres and duckdb, this matches their actual behaviour where expressions within intervals have never been valid
  • With MySQL, BigQuery, databricks, hive — units are now required (e.g. INTERVAL 1 day works, but interval '1 day' does not) — From testing MySQL and BigQuery, this matches their actual behaviour

@samuelcolvin
Copy link
Contributor Author

(I've updated my comment above to correct a very confusing "not" -> "now")

@alamb
Copy link
Contributor

alamb commented Sep 5, 2024

I'll plan to merge this PR tomorrow Sep 5 unless anyone else would like time to review or otherwise comment

@alamb
Copy link
Contributor

alamb commented Sep 6, 2024

🚀

@alamb alamb merged commit aa714e3 into apache:main Sep 6, 2024
10 checks passed
@samuelcolvin samuelcolvin deleted the interval-refactor branch September 6, 2024 20:44
ayman-sigma pushed a commit to sigmacomputing/sqlparser-rs that referenced this pull request Nov 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants