Add a limit to the number of columns in the CLUSTERED BY clause #13352

Merged (6 commits) on Nov 15, 2022

@LakshSingla LakshSingla commented Nov 11, 2022

If a very large number of columns is passed to the CLUSTERED BY clause while ingesting via MSQ, the worker tasks can run out of memory (OOM). (With sequential merge in place, controller tasks shouldn't OOM.)
This PR adds a limit on the number of CLUSTERED BY columns that can be passed in a query and throws a fault when that limit is exceeded.
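
As a rough illustration of the check described above, here is a minimal sketch in plain Java. Everything in it is a hypothetical stand-in rather than the PR's actual code: the class name `ClusteredByLimitSketch`, the exception type, and in particular the limit value of 4, which is chosen only to keep the example small.

```java
import java.util.List;

/** Hypothetical stand-in for the validation described above; not Druid's actual classes. */
final class ClusteredByLimitSketch
{
  // Assumed limit, chosen only for illustration; the real value is defined in the MSQ code.
  static final int MAX_CLUSTERED_BY_COLUMNS = 4;

  /** Unchecked exception standing in for an MSQ fault. */
  static final class TooManyClusteredByColumnsException extends RuntimeException
  {
    final int numColumns;
    final int maxColumns;

    TooManyClusteredByColumnsException(int numColumns, int maxColumns)
    {
      super("Too many CLUSTERED BY columns: " + numColumns + " (max = " + maxColumns + ")");
      this.numColumns = numColumns;
      this.maxColumns = maxColumns;
    }
  }

  /** Rejects the query up front instead of letting a worker run out of memory later. */
  static void validateClusteredBy(List<String> clusteredByColumns)
  {
    if (clusteredByColumns.size() > MAX_CLUSTERED_BY_COLUMNS) {
      throw new TooManyClusteredByColumnsException(
          clusteredByColumns.size(),
          MAX_CLUSTERED_BY_COLUMNS
      );
    }
  }
}
```

In this sketch, a list with exactly `MAX_CLUSTERED_BY_COLUMNS` entries passes, and one more entry raises the exception, which carries both counts so the message can tell the user how far over the limit the query is.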

Release note

There is a limit to the number of columns that can be passed in the CLUSTERED BY clause while ingesting via MSQ.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@adarshsanjeev adarshsanjeev left a comment

Thanks for the PR! Had one comment.

@@ -55,6 +57,15 @@ public static void validateQueryDef(final QueryDefinition queryDef)
throw new ISE("Number of workers must be greater than 0");
}
}

// Check that the number of columns in the query's CLUSTERED BY clause does not exceed the limit
ClusterBy queryClusteredBy = queryDef.getFinalStageDefinition().getClusterBy();
Contributor

Does only the final stage lead to an OOM? Wouldn't it be possible for more cluster by columns to be present in earlier stages than the final one?

Contributor Author

The cluster by columns in the earlier stages might not have a 1:1 correspondence with the query that the user has written, so raising a cluster by error in that case wouldn't be actionable for the user, IMO. Hence I only added the limit to the final stage (which corresponds to the original query that the user has written). With the sequential merge mode on as well, I think there should be enough guardrails in place to prevent an OOM.

However, we can add a limit on the cluster by in the other stages if we rephrase the error message as something like "Enough grouping keys present in stage [xx], the query might OOM". Those cluster by keys can correspond to something present in the GROUP BY clause, for example. WDYT?

Contributor Author

Looking at the TooManyColumnsFault, I think we can also go ahead with the second proposition, since that fault is also imposed at a per-stage level, which might not correspond to the final result that the user expects. (The wording might need to change though.)
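
The per-stage alternative discussed in this thread could look roughly like the sketch below. It is only an illustration under stated assumptions: `Stage` is a hypothetical, simplified stand-in for a stage definition, the limit is passed in as a parameter, and the message reuses the wording floated above rather than any wording that was actually merged.

```java
import java.util.List;
import java.util.OptionalInt;

/** Hypothetical per-stage variant of the check; not Druid's actual classes. */
final class PerStageClusterByCheckSketch
{
  /** Simplified stage: a stage number plus its cluster-by key columns. */
  static final class Stage
  {
    final int stageNumber;
    final List<String> clusterByColumns;

    Stage(int stageNumber, List<String> clusterByColumns)
    {
      this.stageNumber = stageNumber;
      this.clusterByColumns = clusterByColumns;
    }
  }

  /** Returns the number of the first stage whose cluster-by key exceeds the limit, if any. */
  static OptionalInt firstOffendingStage(List<Stage> stages, int maxColumns)
  {
    for (Stage stage : stages) {
      if (stage.clusterByColumns.size() > maxColumns) {
        return OptionalInt.of(stage.stageNumber);
      }
    }
    return OptionalInt.empty();
  }

  /** Stage-level wording, since intermediate keys may not map to the user's CLUSTERED BY clause. */
  static String message(int stageNumber)
  {
    return "Enough grouping keys present in stage [" + stageNumber + "], the query might OOM";
  }
}
```

Reporting the stage number instead of the CLUSTERED BY clause reflects the concern raised here: keys in intermediate stages may come from a GROUP BY or another part of the query, so the message should not point the user at a clause they may not have written.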

import java.util.Objects;

@JsonTypeName(TooManyClusteredByColumnsFault.CODE)
public class TooManyClusteredByColumnsFault extends BaseMSQFault
Contributor

Let's document this fault as well.

Contributor Author

Thanks for pointing it out, updated!
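
For readers unfamiliar with the shape of such a fault class, the following is a simplified sketch and not the merged code. The diff snippet above shows only the `@JsonTypeName(TooManyClusteredByColumnsFault.CODE)` annotation, the `BaseMSQFault` parent, and an import of `java.util.Objects`. The sketch assumes the fault carries the observed and maximum column counts, assumes the `CODE` value shown, and leaves out the Jackson annotations so that it compiles with the standard library alone. `FaultSketch` is a hypothetical stand-in for the base class.

```java
import java.util.Objects;

/** Hypothetical base: every fault exposes a stable error code and a message. */
abstract class FaultSketch
{
  private final String errorCode;
  private final String errorMessage;

  FaultSketch(String errorCode, String errorMessage)
  {
    this.errorCode = errorCode;
    this.errorMessage = errorMessage;
  }

  String getErrorCode()
  {
    return errorCode;
  }

  String getErrorMessage()
  {
    return errorMessage;
  }
}

/** Sketch of a fault reporting too many CLUSTERED BY columns. */
final class TooManyClusteredByColumnsFaultSketch extends FaultSketch
{
  // Assumed code value; in the real class this constant also feeds @JsonTypeName.
  static final String CODE = "TooManyClusteredByColumns";

  private final int numColumns;
  private final int maxColumns;

  TooManyClusteredByColumnsFaultSketch(int numColumns, int maxColumns)
  {
    super(CODE, "Too many CLUSTERED BY columns: " + numColumns + " (max = " + maxColumns + ")");
    this.numColumns = numColumns;
    this.maxColumns = maxColumns;
  }

  @Override
  public boolean equals(Object o)
  {
    if (this == o) {
      return true;
    }
    if (o == null || getClass() != o.getClass()) {
      return false;
    }
    TooManyClusteredByColumnsFaultSketch that = (TooManyClusteredByColumnsFaultSketch) o;
    return numColumns == that.numColumns && maxColumns == that.maxColumns;
  }

  @Override
  public int hashCode()
  {
    return Objects.hash(numColumns, maxColumns);
  }
}
```

Value-based `equals` and `hashCode` make it straightforward for a test to compare a reported fault against an expected one.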

@cryptoe cryptoe added Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Release Notes labels Nov 13, 2022

@adarshsanjeev adarshsanjeev left a comment

LGTM after resolving merge conflict!

@cryptoe cryptoe left a comment

LGTM. Will merge once the conflicts are resolved.
Thanks @LakshSingla


import java.util.Objects;

@JsonTypeName(TooManyClusteredByColumnsFault.CODE)
Contributor

We might need to add this to MSQIndexingModule.java

Contributor Author

Thanks for pointing it out, I added it to the module.

@LakshSingla

Test failures seem unrelated/flaky. Can the second stage of the CI/CD be run again?

@cryptoe cryptoe merged commit 9e938b5 into apache:master Nov 15, 2022

cryptoe commented Nov 15, 2022

Failures look unrelated.
Thanks for the PR @LakshSingla.
