Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INSERT/REPLACE can omit clustering when catalog has default #16260

Merged
merged 26 commits into from
Apr 26, 2024

Conversation

zachjsh
Copy link
Contributor

@zachjsh zachjsh commented Apr 10, 2024

Description

This PR contains a portion of the changes from the inactive draft PR for integrating the catalog with the Calcite planner #13686 from @paul-rogers, allowing for tables that are defined in the catalog to have any defined clustering columns used in DML INSERT/REPLACE operations without needing to be specified at query time. If the user specified a clustering columns at query time, these columns are preferred to the catalog defined clustering columns.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@zachjsh zachjsh requested a review from kgyrtkirk April 17, 2024 19:10
Copy link
Member

@kgyrtkirk kgyrtkirk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 ; just left some questions :)

Comment on lines +474 to +478
final SqlIdentifier colIdent = new SqlIdentifier(
Collections.singletonList(keyCol.expr()),
null, SqlParserPos.ZERO,
Collections.singletonList(SqlParserPos.ZERO)
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering what will happen in the following case:

  • say colunmn c is a clusterKey
  • we are selecting from a join which has column c on both sides

but it seems like the column in the select list will take precedence.

one more thing I was wondering about: do we have a check that all keyCols are present in the selected column list?

Copy link
Contributor Author

@zachjsh zachjsh Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About whether there is a check that al keyCols are present in the selected column list, see the following tests:

testInsertTableWithClusteringWithClusteringOnNewColumnFromQuery
testInsertTableWithClusteringWithClusteringOnBadColumn

Do these cover the cases you are talking about?

About the join issue, do you have a concrete query in example, just to clarify?

final IdentifierNamespace insertNs = (IdentifierNamespace) targetNamespace;
SqlIdentifier identifier = insertNs.getId();
SqlValidatorTable catalogTable = getCatalogReader().getTable(identifier.names);
if (catalogTable != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't the fall-thru from this condtional will cause that the CLUSTER BY on the ingestNode will not be applied (line399 right now); even if its there - is that okay?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the ingestNode already has the clustering columns, they will be used. There are existing tests which test that the clustering columns are used in the plan returned from dml query, when clustering is defined at query time, and the table is / it not in catalog. Let me know if this covers the issue that think could occur.

@zachjsh zachjsh merged commit 365cd7e into apache:master Apr 26, 2024
87 checks passed
@zachjsh zachjsh deleted the use-catalog-clustering-columns branch April 26, 2024 14:19
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants