
io.camunda.zeebe.broker.client.api.BrokerErrorException: Received error from broker (INTERNAL_ERROR): Processing paused for partition '3' #22928

Closed
korthout opened this issue Oct 1, 2024 · 2 comments · Fixed by #23654
Labels: component/zeebe (Related to the Zeebe component/team), kind/bug (Categorizes an issue or PR as a bug), severity/low (Marks a bug as having little to no noticeable impact for the user), version:8.7.0-alpha1 (Label that represents issues released on version 8.7.0-alpha1)

Comments


korthout commented Oct 1, 2024

Describe the bug

io.camunda.zeebe.broker.client.api.BrokerErrorException: Received error from broker (INTERNAL_ERROR): Processing paused for partition '3'

This error was logged about 40 times in a short timespan.
(Screenshot attached: "Screenshot 2024-10-01 at 16 03 44")

To Reproduce

Not sure, but it likely involves pausing a partition while it is processing.

Expected behavior

This is probably just noise. We should log at INFO level that the partition is paused, and:

  • no longer accept requests for a paused partition
  • log at WARN level information about the requests that can no longer be handled (a rough sketch follows below)
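For illustration only, a minimal sketch of that behavior. The class and method names here are hypothetical, not the actual Zeebe request-handler API; it just shows "log the pause at INFO, reject further requests for that partition and log them at WARN":

```java
// Hypothetical sketch only: PausedPartitionGuard is not a real Zeebe class.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

final class PausedPartitionGuard {
  private static final Logger LOG = LoggerFactory.getLogger(PausedPartitionGuard.class);

  private final Set<Integer> pausedPartitions = ConcurrentHashMap.newKeySet();

  void onProcessingPaused(final int partitionId) {
    pausedPartitions.add(partitionId);
    LOG.info("Processing paused for partition '{}'", partitionId);
  }

  void onProcessingResumed(final int partitionId) {
    pausedPartitions.remove(partitionId);
    LOG.info("Processing resumed for partition '{}'", partitionId);
  }

  /** Returns true if the request may be handled; otherwise logs the rejected request at WARN. */
  boolean accept(final int partitionId, final String requestType) {
    if (pausedPartitions.contains(partitionId)) {
      LOG.warn(
          "Rejecting '{}' request: processing is paused for partition '{}'",
          requestType,
          partitionId);
      return false;
    }
    return true;
  }
}
```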

Log/Stacktrace

Full logs are available on Google Drive.

Full Stacktrace

io.camunda.zeebe.broker.client.api.BrokerErrorException: Received error from broker (INTERNAL_ERROR): Processing paused for partition '3'

	at io.camunda.zeebe.broker.client.impl.BrokerRequestManager.handleResponse(BrokerRequestManager.java:195)
	at io.camunda.zeebe.broker.client.impl.BrokerRequestManager.lambda$sendRequestInternal$2(BrokerRequestManager.java:144)
	at io.camunda.zeebe.scheduler.future.FutureContinuationRunnable.run(FutureContinuationRunnable.java:28)
	at io.camunda.zeebe.scheduler.ActorJob.invoke(ActorJob.java:85)
	at io.camunda.zeebe.scheduler.ActorJob.execute(ActorJob.java:42)
	at io.camunda.zeebe.scheduler.ActorTask.execute(ActorTask.java:122)
	at io.camunda.zeebe.scheduler.ActorThread.executeCurrentTask(ActorThread.java:130)
	at io.camunda.zeebe.scheduler.ActorThread.doWork(ActorThread.java:108)
	at io.camunda.zeebe.scheduler.ActorThread.run(ActorThread.java:227)

Environment:

  • OS:
  • Zeebe Version:
  • Configuration:
korthout added the kind/bug and component/zeebe labels on Oct 1, 2024

npepinpe commented Oct 3, 2024

INTERNAL_ERROR is also likely the wrong error code, honestly. We probably want to return something like UNAVAILABLE, indicating that the system is currently unavailable but may become available again eventually. That, or something like INVALID_STATE or FAILED_PRECONDITION. I would opt for unavailable, personally.

I would propose introducing a new error code, mapped to 503/SERVICE UNAVAILABLE (for REST) and 14/UNAVAILABLE (for gRPC), and logged at debug rather than as an error (as with other temporarily unavailable conditions).

I would also argue this is more of a rejection than an error. There is no error here, and the command may be perfectly well formed; we're simply declining to process it. However, I understand we only have an ErrorResponseWriter in the command API, and it would be quite a bit of refactoring to return a rejection here 🤷

So acceptance criteria are:

  • Add a new PARTITION_UNAVAILABLE error code (as in, to the SBE generated ErrorCode enum), which is documented as meaning that the command cannot be processed because the processor is temporarily unavailable.
  • Update the CommandApiRequestHandler to return this error, instead of the current INTERNAL_ERROR.
  • Map the PARTITION_UNAVAILABLE error code in GrpcErrorMapper such that the error is logged as debug, and the mapped gRPC error is UNAVAILABLE.
  • Map the PARTITION_UNAVAILABLE error code in RestErrorMapper such that the error is logged as debug, and the mapped HTTP code is 503 (SERVICE_UNAVAILABLE). (Both mappings are sketched below.)
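The two mapper bullets could roughly look like the following. This is only a sketch under simplified assumptions: the real GrpcErrorMapper and RestErrorMapper switch over the full ErrorCode enum and have different signatures than shown here.

```java
// Sketch only: shows the intended PARTITION_UNAVAILABLE -> UNAVAILABLE / 503 mapping,
// logged at debug; the real Zeebe mappers have different shapes.
import io.grpc.Status;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.http.HttpStatus;
import org.springframework.http.ProblemDetail;

final class PartitionUnavailableMappingSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(PartitionUnavailableMappingSketch.class);

  // gRPC side: PARTITION_UNAVAILABLE -> 14/UNAVAILABLE.
  static Status toGrpcStatus(final String brokerErrorMessage) {
    LOG.debug("Partition is temporarily unavailable: {}", brokerErrorMessage);
    return Status.UNAVAILABLE.withDescription(brokerErrorMessage);
  }

  // REST side: PARTITION_UNAVAILABLE -> 503/SERVICE_UNAVAILABLE.
  static ProblemDetail toProblemDetail(final String brokerErrorMessage) {
    LOG.debug("Partition is temporarily unavailable: {}", brokerErrorMessage);
    return ProblemDetail.forStatusAndDetail(HttpStatus.SERVICE_UNAVAILABLE, brokerErrorMessage);
  }
}
```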

As far as tests go, you should write a (parameterized) integration/QA regression test which verifies the above behavior (e.g. create a node, pause processing, send a request, make sure you get the appropriate code — see the skeleton below), and some unit tests for the mappers. Unit tests alone are likely not enough here, as we want to ensure that pausing the processing actually causes such errors to be returned. Use the RegressionTest annotation :)
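A skeleton of the kind of regression test meant here; all harness calls are placeholders, since the real test would use Zeebe's QA test utilities and the RegressionTest annotation:

```java
// Skeleton only: the cluster/pause helpers mentioned below are placeholders, not real Zeebe QA APIs.
import org.junit.jupiter.api.Test;

final class PausedPartitionRegressionTest {

  @Test // the real test would instead carry Zeebe's RegressionTest annotation referencing #22928
  void shouldRejectRequestsWithUnavailableWhenPartitionIsPaused() {
    // given: a running broker with processing paused on the target partition
    //   e.g. cluster.pausePartitionProcessing(partitionId)  <- placeholder helper
    // when: a command is sent to that partition via the gRPC or REST client
    // then: the gRPC client surfaces UNAVAILABLE (14), the REST API returns 503,
    //       and the broker no longer logs the rejection at ERROR level
  }
}
```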

npepinpe added the severity/low label on Oct 3, 2024
filipecampos self-assigned this on Oct 4, 2024
github-merge-queue bot pushed a commit that referenced this issue Oct 22, 2024
…essing (#23654)

## Description
Create a new `PARTITION_UNAVAILABLE` error code corresponding to when a
partition pauses processing requests.

## Checklist

- [x] Add a new `PARTITION_UNAVAILABLE` error code (as in, to the SBE
generated `ErrorCode` enum), which is documented as meaning that the
command cannot be processed because the processor is temporarily
unavailable.
- [x] Update the `CommandApiRequestHandler` to return this error,
instead of the current `INTERNAL_ERROR`.
- [x] Map `PARTITION_UNAVAILABLE` error code in `GrpcErrorMapper` such
that the error is logged as debug, and the mapped gRPC error
is `UNAVAILABLE`.
- [x] Map `PARTITION_UNAVAILABLE` error code in `RestErrorMapper` such
that the error is logged as debug, and the mapped HTTP code is 503
(`SERVICE_UNAVAILABLE`).
- [x] Write test for `GrpcErrorMapper`
- [x] Write test for `RestErrorMapper` in `ErrorMapperTest`
- [x] Updated integration/QA regression tests

## Related issues

closes #22928
camundait added the version:8.7.0-alpha1 label on Nov 5, 2024
ana-vinogradova-camunda (Contributor) commented:

Happened again here. Please feel free to let me know if you think it is a different issue.
