Skip to content
This repository has been archived by the owner on Apr 10, 2024. It is now read-only.

IOException with message GOAWAY received causes readiness check to fail fast #155

Closed
adriansuarez opened this issue Jan 17, 2024 · 6 comments

Comments

@adriansuarez
Copy link
Contributor

When invoking KubeAPIServer.start(), the readiness check fails with the following stacktrace:

io.javaoperatorsdk.jenvtest.JenvtestException: java.io.IOException: /127.0.0.1:37288: GOAWAY received
	at io.javaoperatorsdk.jenvtest.process.ProcessReadinessChecker.ready(ProcessReadinessChecker.java:118)
	at io.javaoperatorsdk.jenvtest.process.ProcessReadinessChecker.lambda$waitUntilReady$1(ProcessReadinessChecker.java:84)
	at io.javaoperatorsdk.jenvtest.process.ProcessReadinessChecker.pollWithTimeout(ProcessReadinessChecker.java:92)
	at io.javaoperatorsdk.jenvtest.process.ProcessReadinessChecker.waitUntilReady(ProcessReadinessChecker.java:84)
	at io.javaoperatorsdk.jenvtest.process.KubeAPIServerProcess.waitUntilReady(KubeAPIServerProcess.java:92)
	at io.javaoperatorsdk.jenvtest.KubeAPIServer.start(KubeAPIServer.java:48)
	...
Caused by: java.io.IOException: /127.0.0.1:37288: GOAWAY received
	at java.net.http/jdk.internal.net.http.HttpClientImpl.send(HttpClientImpl.java:565)
	at java.net.http/jdk.internal.net.http.HttpClientFacade.send(HttpClientFacade.java:119)
	at io.javaoperatorsdk.jenvtest.process.ProcessReadinessChecker.ready(ProcessReadinessChecker.java:108)
	... 20 common frames omitted
Caused by: java.io.IOException: /127.0.0.1:37288: GOAWAY received
	at java.net.http/jdk.internal.net.http.Http2Connection.handleGoAway(Http2Connection.java:999)
	at java.net.http/jdk.internal.net.http.Http2Connection.handleConnectionFrame(Http2Connection.java:867)
	at java.net.http/jdk.internal.net.http.Http2Connection.processFrame(Http2Connection.java:738)
	at java.net.http/jdk.internal.net.http.frame.FramesDecoder.decode(FramesDecoder.java:155)
	at java.net.http/jdk.internal.net.http.Http2Connection$FramesController.processReceivedData(Http2Connection.java:232)
	at java.net.http/jdk.internal.net.http.Http2Connection.asyncReceive(Http2Connection.java:663)
	at java.net.http/jdk.internal.net.http.Http2Connection$Http2TubeSubscriber.processQueue(Http2Connection.java:1292)
	at java.net.http/jdk.internal.net.http.common.SequentialScheduler$SynchronizedRestartableTask.run(SequentialScheduler.java:175)
	at java.net.http/jdk.internal.net.http.common.SequentialScheduler$CompleteRestartableTask.run(SequentialScheduler.java:147)
	at java.net.http/jdk.internal.net.http.common.SequentialScheduler$SchedulableTask.run(SequentialScheduler.java:198)
	at java.net.http/jdk.internal.net.http.common.SequentialScheduler.runOrSchedule(SequentialScheduler.java:271)
	at java.net.http/jdk.internal.net.http.common.SequentialScheduler.runOrSchedule(SequentialScheduler.java:224)
	at java.net.http/jdk.internal.net.http.Http2Connection$Http2TubeSubscriber.runOrSchedule(Http2Connection.java:1310)
	at java.net.http/jdk.internal.net.http.Http2Connection$Http2TubeSubscriber.onNext(Http2Connection.java:1336)
	at java.net.http/jdk.internal.net.http.Http2Connection$Http2TubeSubscriber.onNext(Http2Connection.java:1270)
	at java.net.http/jdk.internal.net.http.common.SSLTube$DelegateWrapper.onNext(SSLTube.java:202)
	at java.net.http/jdk.internal.net.http.common.SSLTube$SSLSubscriberWrapper.onNext(SSLTube.java:484)
	at java.net.http/jdk.internal.net.http.common.SSLTube$SSLSubscriberWrapper.onNext(SSLTube.java:287)
	at java.net.http/jdk.internal.net.http.common.SubscriberWrapper$DownstreamPusher.run1(SubscriberWrapper.java:318)
	at java.net.http/jdk.internal.net.http.common.SubscriberWrapper$DownstreamPusher.run(SubscriberWrapper.java:261)
	at java.net.http/jdk.internal.net.http.common.SequentialScheduler$SynchronizedRestartableTask.run(SequentialScheduler.java:175)
	at java.net.http/jdk.internal.net.http.common.SequentialScheduler$CompleteRestartableTask.run(SequentialScheduler.java:147)
	at java.net.http/jdk.internal.net.http.common.SequentialScheduler$SchedulableTask.run(SequentialScheduler.java:198)
	at java.net.http/jdk.internal.net.http.common.SequentialScheduler.runOrSchedule(SequentialScheduler.java:271)
	at java.net.http/jdk.internal.net.http.common.SequentialScheduler.runOrSchedule(SequentialScheduler.java:224)
	at java.net.http/jdk.internal.net.http.common.SubscriberWrapper.outgoing(SubscriberWrapper.java:234)
	at java.net.http/jdk.internal.net.http.common.SubscriberWrapper.outgoing(SubscriberWrapper.java:200)
	at java.net.http/jdk.internal.net.http.common.SSLFlowDelegate$Reader.processData(SSLFlowDelegate.java:403)
	at java.net.http/jdk.internal.net.http.common.SSLFlowDelegate$Reader$ReaderDownstreamPusher.run(SSLFlowDelegate.java:264)
	at java.net.http/jdk.internal.net.http.common.SequentialScheduler$SynchronizedRestartableTask.run(SequentialScheduler.java:175)
	at java.net.http/jdk.internal.net.http.common.SequentialScheduler$CompleteRestartableTask.run(SequentialScheduler.java:147)
	at java.net.http/jdk.internal.net.http.common.SequentialScheduler$SchedulableTask.run(SequentialScheduler.java:198)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

Looking into the GOAWAY received message, it looks like this is part of the HTTP/2 protocol, and this is just the server destroying a connection and forcing the client to establish a new one. I don't see why this should cause the readiness check to fail fast, since it could simply be an indication of the server not being ready.

I'm actually not sure why any IOException should cause the polling loop to fail fast, since those could be an indication of the server not being ready. My proposal would be to update the polling loop from this:

  private boolean ready(HttpClient client, HttpRequest request, String processName, int port) {
    try {
      var response = client.send(request, HttpResponse.BodyHandlers.ofString());
      log.debug("Ready Response message:{} code: {} for {} on Port: {}", response.body(),
          response.statusCode(), processName,
          port);
      return response.statusCode() == 200;
    } catch (ConnectException e) {
      // still want to retry
      log.warn("Cannot connect to the server", e);
      return false;
    } catch (IOException e) {
      throw new JenvtestException(e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new JenvtestException(e);
    }
  }

to this (i.e. generalize the retry condition from ConnectException to IOException):

  private boolean ready(HttpClient client, HttpRequest request, String processName, int port) {
    try {
      var response = client.send(request, HttpResponse.BodyHandlers.ofString());
      log.debug("Ready Response message:{} code: {} for {} on Port: {}", response.body(),
          response.statusCode(), processName,
          port);
      return response.statusCode() == 200;
    } catch (IOException e) {
      // still want to retry
      log.warn("Cannot connect to the server", e);
      return false;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new JenvtestException(e);
    }
  }

Thoughts?

@csviri
Copy link
Collaborator

csviri commented Jan 17, 2024

Hi @adriansuarez ,
will take a look, note the issue already there for this: #153

The version 0.9.6 released very recently (soon to be in maven central), just skips the health check until that is improved.

Will take a look on this approach you suggests and see if helps.

thx!!

This issue is technically a duplicate but will keep it open to discuss your suggestion - feel free also to do a PR for you suggestion

@adriansuarez
Copy link
Contributor Author

@csviri: Thank you for the extra context. I was very confused about why tests started failing in our repo without any code changes related to jenvtest, but it looks like it is because we are now pulling down K8s 1.29, which exposes this issue. That suggests a workaround that we could implement in our repo of pinning the version to 1.28.

@csviri
Copy link
Collaborator

csviri commented Jan 17, 2024

Yes, might be an option, but hopefully this will be somehow fixed soon. (Note that pinning is also possible with an env var for all tests.)

@adriansuarez
Copy link
Contributor Author

adriansuarez commented Jan 17, 2024

Suppressing and retrying on IOException did not fix the issue. Initially I thought the problem was that the HTTP/2 connection was not being re-established, but I added plumbing to rebuild the connection (and confirmed that the port numbers appearing in the exception message were changing) and we still receive the GOAWAY message repeatedly.

I think this is due to the feature gate UnauthenticatedHTTP2DOSMitigation being enabled by default on 1.29.

From the feature gates documentation:

Feature Default Stage Since Until
UnauthenticatedHTTP2DOSMitigation false Beta 1.28 1.28
UnauthenticatedHTTP2DOSMitigation true Beta 1.29

UnauthenticatedHTTP2DOSMitigation: Enables HTTP/2 Denial of Service (DoS) mitigations for unauthenticated clients. Kubernetes v1.28.0 through v1.28.2 do not include this feature gate.

It doesn't look like the client is having any credentials passed to it when it invokes the readyz endpoint, so I think this is the issue.

An easy fix is to just use HTTP/1.1 to invoke GET readyz. I'll put out a PR.

@BramMeerten
Copy link
Contributor

Is this fix included in release 0.9.7, or is this still an open issue?
@adriansuarez @csviri

@csviri
Copy link
Collaborator

csviri commented Feb 21, 2024

Ah, yes it si included, sorry, will close the issue.

@csviri csviri closed this as completed Feb 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants