Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSQ: Report the warning directly as an error if none of it is allowed by the user #13198

Merged
merged 10 commits into from
Oct 20, 2022

Conversation

LakshSingla
Copy link
Contributor

@LakshSingla LakshSingla commented Oct 10, 2022

Description

In MSQ, there can be an upper limit to the number of worker warnings. For example, for parseExceptions encountered while parsing the external data, the user can specify an upper limit to the number of parse exceptions that can be allowed before it throws an error of type TooManyWarnings.

This PR makes it so that if the user disallows warnings of a certain type i.e. the limit is 0 (or is executing in strict mode), instead of throwing an error of type TooManyWarnings, we can directly surface the warning as the error, saving the user from the hassle of going throw the warning reports.


Key changed/added classes in this PR
  • ControllerImpl

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@kfaraz kfaraz added the Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 label Oct 10, 2022
Copy link
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @LakshSingla. Left some comments.

msqErrorReport.getFault().getCodeWithMessage()
);
}
if (expectedMSQFaultClass != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we always assert the MSQ fault?
Is there any specific reason to assert on the MSQ class?

Copy link
Contributor Author

@LakshSingla LakshSingla Oct 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the unparseable exceptions, the exception message contained nonstandard characters (from the erroneous line) which I didn't want to use in the code file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We change the get message call at line 863 to be a pattern matcher. That way you need not code out the whole string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or change the source test file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This got me thinking, if we have a malformed line, we can have a parseException with x mb's of input.
Multiple such lines would blow up our report.
Should we chomp the input to lets day 300 bytes or something in ParseException#75
cc @jon-wei

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stack traces of innocuous warnings and errors can also cause a similar issue. Should this be taken up as a separate PR? Also, 300 bytes seem less to me, something like 4KB might work better, wdyt? (I think we should be fine with an even larger limit as the number of warnings sent to the controller are limited).

Copy link
Contributor

@cryptoe cryptoe Oct 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup We can do that outside this PR.

workerError(errorReport1);
break;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also throw an illegal state exception if we donot find the error code in the workerWarnings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, or else the controller would fail to stop.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking, let the worker impl directly hit worker error endpoint in case the limit is 0 wdyt ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying out the changes, that seems like a much better alternative.

ImmutableMap.of(CannotParseExternalDataFault.CODE, maxVerboseParseExceptions),
disallowedWarningCode,
controllerClient,
id(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: id, host, controllerClient should be the first arguments

try {
controllerClient.postWorkerError(workerId, MSQErrorReport.fromException(workerId, host, stageNumber, e));
}
catch (IOException e2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we throw a new RE with the message failed to post the worker error xyz to the controller. That way the worker will get terminated with a relevant error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: rename from e2 to postException?

final ConcurrentHashMap<String, Long> errorCodeToCurrentCount = new ConcurrentHashMap<>();
private final MSQWarningReportPublisher delegate;
private final long totalLimit;
private final Map<String, Long> errorCodeToVerboseCountLimit;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: errorCodeToLimit seems like a better variable to me.
we really do not need verboseCountLimt as that's a concept outside this class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the errorCodeToLimit was slightly misleading. This is because there can be more than those error codes can be present in the controller. However, I do get your point. Maybe I will add clearer javadocs so that this ambiguity isn't present and revert the change.

msqErrorReport.getFault().getCodeWithMessage()
);
}
if (expectedMSQFaultClass != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This got me thinking, if we have a malformed line, we can have a parseException with x mb's of input.
Multiple such lines would blow up our report.
Should we chomp the input to lets day 300 bytes or something in ParseException#75
cc @jon-wei

Copy link
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments.
LGTM (+1 non binding)


/**
* Creates a publisher which publishes the warnings to the controller if they have not yet exceeded the allowed limit.
* Moreover, if a warning is disallowed, i.e. it's limit is set to 0, then the publisher directly reports the warning
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: It would be much better if we could explain each variable. We really do not need to mention the tooManyWarningsFault piece as it's just confusing.

try {
controllerClient.postWorkerError(workerId, MSQErrorReport.fromException(workerId, host, stageNumber, e));
}
catch (IOException e2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: rename from e2 to postException?

@abhishekagarwal87 abhishekagarwal87 merged commit fc262df into apache:master Oct 20, 2022
@kfaraz kfaraz added this to the 25.0 milestone Nov 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants