Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Window function on msq #15470

Merged
merged 54 commits into from
Mar 28, 2024
Merged

Window function on msq #15470

merged 54 commits into from
Mar 28, 2024

Conversation

somu-imply
Copy link
Contributor

@somu-imply somu-imply commented Dec 1, 2023

This PR aims to introduce Window functions on MSQ by doing the following:

  1. Introduce a Window querykit for handling window queries along with its factory and a processor for window queries
  2. If a window operator is present with a partition by clause, pushes the partition as a shuffle spec of the previous stage
  3. In presence of empty OVER() clause lets all operators loose on a single rac
  4. In presence of no empty OVER() clause, breaks down each window into individual stages
  5. Associated machinery to handle window functions in MSQ
  6. Introduced a separate hidden engine feature WINDOW_LEAF_OPERATOR which is set only for MSQ engine. In presence of this feature, the planner plans without the leaf operators by creating a window query over an inner scan query. In case of native this is set to false and the planner generates the leafOperators
  7. Guardrails around materialization
  8. Comprehensive UTs

Release notes

Add support in MSQE to run window functions using a context flag enableWindowing:true. In the native engine, we need a group by clause to enable window functions. In the MSQE the requirement of providing a mandatory group by clause to enable window functions is removed.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions bot added Area - Batch Ingestion Area - Querying Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Dec 1, 2023
@soumyava soumyava added the WIP label Dec 1, 2023
catch (IOException e) {
throw new RuntimeException(e);
}
return Operator.Signal.GO;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of returning GO check if the frames can be paused. In such a case return that. Also need to test pausing frames through the MSQ framework correctly

.inputs(new StageInputSpec(firstStageNumber - 1))
.signature(rowSignature)
.maxWorkerCount(maxWorkerCount)
.shuffleSpec(null)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the shuffle spec is null. Tell the previous stage to shuffle by the appropriate partition here so that the data comes correctly. For example if previous stage is a groupByPostShuffle, find a way to tell it to set a shuffle spec for the next stage. Since the inner query has no idea of the outer operators, we can use the context to pass the information

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shuffle spec for a stage tells it how to partition the data for the next stage. Therefore it should use a combination of the resultShuffleSpecFactory to construct the final shuffleSpec.
If you want the data in a particular format inside a stage, its input should always be a stage, and the shuffle spec of that stage should be set accordingly. Hash Shuffle uses similar logic.

@somu-imply somu-imply marked this pull request as ready for review January 9, 2024 05:54
@somu-imply somu-imply marked this pull request as draft January 15, 2024 14:02
Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

{
super(
CODE,
"Too many rows in a window (requested = %d, max = %d). Try creating a window with a higher cardinality column or change the query shape.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also mention the user can set MAX_ROWS_MATERIALIZED_IN_WINDOW config in the query context. We should also tell the user that setting this config can lead to OOM errors so use with caution.

.inputs(new StageInputSpec(firstStageNumber))
.signature(stageSignature)
.maxWorkerCount(maxWorkerCount)
.shuffleSpec(nextShuffleWindowSpec)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of limit, this should not be nextShuffeWIndowSpec no ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of a limit on the inner query, the window is going to operate on the result of the limit, so I think it should be the nextShuffleSpec as it contains the partition by for the next window

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add a UT for this if its already not there

Copy link
Contributor

@cryptoe cryptoe Mar 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the limit and offset should be applied on the grouping key. So it should be shuffleSpecFactoryPostAggregation != null ? : null
Also we can actually short circuit the shuffle spec of the OffsetLimitProcessor to null since limit always gets applied on 1 worker and 1 partition. So we would be okay in case a window processor is the next stage since the data would already be sorted :)

Copy link
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel the PR is almost there. Left some comments.
Thanks for working on this.


private final List<OperatorFactory> operatorFactoryList;
private final ObjectMapper jsonMapper;
private final ArrayList<RowsAndColumns> frameRowsAndCols;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

frameRowsAndCols who clears this array list, I was expecting after we add stuff to the result, the frameRowsAndCols can be cleared no ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is being cleared once the result is written

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be cleared after we are done writing results to the frames which seems suspect.
Shouldn't it be cleared once we have added stuff to resultRowsAndCols ?

.inputs(new StageInputSpec(firstStageNumber))
.signature(stageSignature)
.maxWorkerCount(maxWorkerCount)
.shuffleSpec(nextShuffleWindowSpec)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add a UT for this if its already not there

Copy link
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM. Thanks for the patience @somu-imply !!.

@cryptoe cryptoe merged commit 524842a into apache:master Mar 28, 2024
85 checks passed
@adarshsanjeev adarshsanjeev added this to the 30.0.0 milestone May 6, 2024
@cryptoe
Copy link
Contributor

cryptoe commented May 22, 2024

Updated the release notes taking into account the follow up PR as well #16229

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - Batch Ingestion Area - Documentation Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - Querying
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants