Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Make aggregation statement compilation robust #2767

Open
LantaoJin opened this issue Jun 21, 2024 · 1 comment · May be fixed by #2781
Open

[RFC] Make aggregation statement compilation robust #2767

LantaoJin opened this issue Jun 21, 2024 · 1 comment · May be fixed by #2781
Labels
enhancement New feature or request

Comments

@LantaoJin
Copy link
Member

LantaoJin commented Jun 21, 2024

Is your feature request related to a problem?
Aggregation statement, query with aggregate function, group-by clause that we've supported, grouping set, cube that we haven't supported, has many restrictions based on SQL standard. OpenSearch SQL adapts MySQL grammar as it initial grammar, but the query compilation check should be done in query resolution/analyzing. Current query compilation check is weak, especially for aggregation statement.

Here is the result comparison of query compilation check between OpenSearch SQL, Spark, Postgres and Oracle.

  • Example data:
POST _bulk
{ "index" : { "_index" : "test_bulk1", "_id" : "1" } }
{ "name": "Clark", "query_id": 1 }
{ "index" : { "_index" : "test_bulk1", "_id" : "2" } }
{ "name": "Dave", "query_id": 2 }
{ "index" : { "_index" : "test_bulk1", "_id" : "3" } }
{ "name": "Ava", "query_id": 3 }
name query_id
Clark 1
Dave 2
Ava 3
  • Example queries:
ID Query
1 SELECT query_id FROM test_bulk1 GROUP BY name
2 SELECT query_id, max(query_id) FROM test_bulk1
3 SELECT query_id FROM test_bulk1 GROUP BY max(query_id)
4 SELECT max(query_id) FROM test_bulk1 GROUP BY 1
5 SELECT max(query_id) FILTER(WHERE query_id) FROM test_bulk1
6 SELECT max(query_id) FILTER(WHERE max(query_id)>10) FROM test_bulk1
7 SELECT * FROM test_bulk1 WHERE ROW_NUMBER() OVER(ORDER BY name) > 0
8 SELECT name, rank() OVER (PARTITION BY a) FROM test_bulk1
  • Result of query compilation:
ID Description OS-SQL Spark Postgres Oracle
1 column not in group-by clause return null value AnalysisException: The non-aggregating expression "query_id" is based on columns which are not participating in the GROUP BY clause. ERROR: column "query_id" must appear in the GROUP BY clause or be used in an aggregate function ORA-00979: "QUERY_ID": must appear in the GROUP BY clause or be used in an aggregate function
2 group-by is missing Explicit GROUP BY clause is required because expression [query_id] contains non-aggregated column The query does not include a GROUP BY clause. Add GROUP BY or turn it into the window functions using OVER clauses ditto^ ORA-00937: not a single-group group function
3 aggregate functions in group-by Server side error during query execution Aggregate functions are not allowed in GROUP BY, but found max(query_id) ERROR: aggregate functions are not allowed in GROUP BY ORA-00934: group function is not allowed here
4 group-by position refers to aggregate function Server side error during query execution GROUP BY 1 refers to an expression that contains an aggregate function. Aggregate functions are not allowed in GROUP BY ditto^ return 3
5 non boolean filter in aggregate Client side error during query execution: [filter] must not be null FILTER expression is not of type boolean. It cannot be used in an aggregate function ERROR: argument of FILTER must be type boolean, not type integer ORA-00923: FROM keyword not found where expected
6 aggregate in aggregate-filter error Server side error during query execution FILTER expression contains aggregate. It cannot be used in an aggregate function ERROR: aggregate functions are not allowed in FILTER ditto^
7 window function in where-clause Server side error during query execution It is not allowed to use window functions inside WHERE clause ERROR: window functions are not allowed in WHERE ORA-30483: window functions are not allowed here
8 ranking window function misses order clause return Ava,1; Clark,1; Dave,1 Window function rank() requires window to be ordered, please add ORDER BY clause return Ava,1; Clark,1; Dave,1 missing ORDER BY expression in the window specification

We should fix above query compilation error to make aggregation more robust.

What solution would you like?
This RFC is to address the weakness of query compilation for aggregation statement, including query with aggregate function, window function, group-by clause etc.
We will align with semantics of Spark and PostgresSQL as possible as we can. If the behaviours between Spark and PostgresSQL are different, we will choose the option which one could mitigate the impact or reduces breaking change.

What alternatives have you considered?
A clear and concise description of any alternative solutions or features you've considered.

Do you have any additional context?
Add any other context or screenshots about the feature request here.

@dblock
Copy link
Member

dblock commented Jul 15, 2024

[Catch All Triage - 1, 2]

@dblock dblock removed the untriaged label Jul 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants