Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a non-blocking handler for SmallRye Health Status #35099

Closed
ahus1 opened this issue Jul 29, 2023 · 8 comments · Fixed by #36434 or #37352
Closed

Use a non-blocking handler for SmallRye Health Status #35099

ahus1 opened this issue Jul 29, 2023 · 8 comments · Fixed by #36434 or #37352
Milestone

Comments

@ahus1
Copy link
Contributor

ahus1 commented Jul 29, 2023

Description

The Quarkus handlers for the SmallRye Health are blocking request handlers.

This requires them to enqueue in the blocking queue even for the simplest non-blocking requests, which might delay them if there is a high load.

For the Keycloak project, a simple liveness probe is all-local, and there is no need to queue the request. So the request should return the result immediately. We're experimenting with limiting the queue of the executor pool for load shedding, which would then shed also the liveness probes, which is undesired.

Implementation ideas

SmallRye Health now supports async responses:

https://github.com/smallrye/smallrye-health/blob/25d4f0bd75f00054ad5adba103fecce5c83eecbd/implementation/src/main/java/io/smallrye/health/SmallRyeHealthReporter.java#L252-L256

I'll prepare a PR for this.

@ahus1 ahus1 added the kind/enhancement New feature or request label Jul 29, 2023
@quarkus-bot
Copy link

quarkus-bot bot commented Jul 29, 2023

/cc @Ladicek (smallrye), @jmartisk (health,smallrye), @phillip-kruger (smallrye), @radcortez (smallrye), @xstefank (health)

@ahus1
Copy link
Contributor Author

ahus1 commented Jul 29, 2023

PR created: #35100

@ahus1
Copy link
Contributor Author

ahus1 commented Nov 24, 2023

Adding some comments here for people to see the status: This issue has been reverted from 3.5.1, see #36977
Can someone please remove the milestone 3.5.1?

@ngc4579
Copy link

ngc4579 commented Aug 29, 2024

I'm experiencing issues with health checks under high load situations in a Quarkus Kafka Streams application where liveness probes to /q/health/live would fail with either statuscode 503 or run into timeouts, eventually having the Pod killed. Could this be related? Quarkus version is 3.14.1 - have there been regressions in this area lately?

@ahus1
Copy link
Contributor Author

ahus1 commented Aug 29, 2024

To benefit from the non-blocking probes, the probes themselves need to need to be non-blocking aka asynchronous.

Please review your setup for both your own probes and those provided by frameworks. If they implement AsyncHealthCheck they are async, otherwise they are executed synchronously. If they are executed synchronously, they will be queue in the general blocking executor, which might lead to timeouts and errors for those probes.

@ngc4579
Copy link

ngc4579 commented Aug 29, 2024

@ahus1 Ah, I see. So with io.quarkus.kafka.streams.runtime.health.KafkaStreamsStateHealthCheck implementing org.eclipse.microprofile.health.HealthCheck those health checks are executed synchronously, right?

@ahus1
Copy link
Contributor Author

ahus1 commented Aug 29, 2024

@ngc4579 - yes, this explains the behavior you're seeing.

As you've seen the contract is the following: if the liveness probe fails, your Pod will be killed.

For the way forward you have different options:

  • Rethink if you really want you Pod to be killed if Kafka isn't "running or balancing" (is there a chance to have this solved after a Pod restart?)
    If not, disabling the liveness probe might be an option. See https://quarkus.io/guides/smallrye-health how to disable a specific check. With Keycloak, we think that having an empty Liveness probe is good enough as this would trigger a Pod restart only when the Pod is not answering to HTTP requests on the management port any more.
    If you think this generally not a good idea, ask for the check to be changed or removed in a Quarkus issue.
  • If you think the check should stay, it could implement AsyncHealthCheck instead, but then it is only allowed to call non-blocking methods. I'm not familiar enough with the Kafka interface to know if this is currently the case.

@ngc4579
Copy link

ngc4579 commented Aug 30, 2024

@ahus1 Yeah, thanks for clarification. I'll have an eye on production performance and tend to disable Kafka Streams specific health checks in case they actually turn out to be an issue. Meanwhile, I've filed an enhancement request to have the health check changed to an async one if possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment