
Increase the HTTPMultiBin server timeout on flaky test #3545

Closed · joanlopez wants to merge 1 commit

Conversation

@joanlopez (Contributor) commented Jan 16, 2024

What?

It increases the HTTPMultiBin server timeout for a specific flaky test that fails from time to time (for instance, here), from 2*time.Second to 5*time.Second.
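For context, a minimal sketch of the kind of change this describes, assuming the test server is a standard net/http/httptest server whose per-connection timeouts are set explicitly; the helper and field usage below are illustrative, not k6's actual HTTPMultiBin code:

```go
package testutils

import (
	"net/http"
	"net/http/httptest"
	"time"
)

// newSlowToleratingServer is a hypothetical helper showing the change:
// bump the per-connection server timeouts from 2s to 5s so slower CI
// runners don't hit the server-side deadline before the test finishes.
func newSlowToleratingServer(handler http.Handler) *httptest.Server {
	srv := httptest.NewUnstartedServer(handler)
	srv.Config.ReadTimeout = 5 * time.Second  // previously 2 * time.Second
	srv.Config.WriteTimeout = 5 * time.Second // previously 2 * time.Second
	srv.StartTLS()
	return srv
}
```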

Why?

After having looked at the code involved and tried a few possible solutions, it seemed to me that the root cause was the server timing out from time to time, depending on the conditions under which the test is executed (runner resources, etc.).

In fact, it's difficult to say whether it will ever fail again, because it's not easy to reproduce. To reproduce it, I ran the specific test multiple times (-count=X), with the race detector enabled (-race), with limited resources (GOMAXPROCS=1), and on multiple runners (windows, macos, etc.). At the very least, the change seems to greatly reduce the chances of failure (previously the test always failed under those conditions).
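A reproduction attempt along those lines might look like the following; the test name pattern and the repetition count are placeholders, not the exact values used:

```sh
# Run the flaky test repeatedly, with the race detector enabled and a
# single CPU, to increase the chances of hitting the timing-dependent failure.
GOMAXPROCS=1 go test -race -count=100 -run 'TestGRPC' ./...
```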

Checklist

  • I have performed a self-review of my code.
  • I have added tests for my changes.
  • I have run linter locally (make lint) and all checks pass.
  • I have run tests locally (make tests) and all tests pass.
  • I have commented on my code, particularly in hard-to-understand areas.

Related PR(s)/Issue(s)

Contributes to #2144

@codecov-commenter commented Jan 16, 2024

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (d031d2b) 72.97% compared to head (2027fb8) 72.93%.
The report is 1 commit behind head on master.

❗ Current head 2027fb8 differs from pull request most recent head b5993dc. Consider uploading reports for the commit b5993dc to get more accurate results

Files       Patch %   Lines
cmd/ui.go   91.66%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3545      +/-   ##
==========================================
- Coverage   72.97%   72.93%   -0.04%     
==========================================
  Files         280      278       -2     
  Lines       20941    20944       +3     
==========================================
- Hits        15281    15276       -5     
- Misses       4690     4694       +4     
- Partials      970      974       +4     
Flag      Coverage Δ
macos     ?
ubuntu    72.93% <94.44%> (+0.01%) ⬆️
windows   ?

Flags with carried forward coverage won't be shown.


@joanlopez joanlopez changed the title from Experimental to Increase the HTTPMultiBin server timeout on flaky test Jan 16, 2024
@joanlopez joanlopez marked this pull request as ready for review January 16, 2024 15:39
@joanlopez joanlopez requested a review from a team as a code owner January 16, 2024 15:39
@joanlopez joanlopez requested review from codebien and olegbespalov and removed request for a team January 16, 2024 15:39
@olegbespalov (Collaborator) commented Jan 16, 2024

I don't mind increasing the timeout of the server, but unfortunately, I bet it doesn't fix the issue of the flaky test that you've mentioned.

Long story short, the cause of the issue is not in our code base but in the gRPC package, which gives no guarantee that connect will always return the same error, since many concurrent things are happening inside.

For instance, if we look closer at the CI job logs (in fact, they are direct logs from the gRPC package), we see that the expected message is in the CI output, but it wasn't propagated by grpc, and as a result the test fails with "context ended" since it simply can't wait any longer.

(screenshot: CI job logs showing the expected error message)

I even opened an issue against grpc-go and planned to work on it, but gave up since I have neither the capacity nor a good understanding of gRPC's internals. On our side, the issue is #3551.

@joanlopez (Contributor, Author) commented Jan 17, 2024

I don't mind increasing the timeout of the server, but unfortunately, I bet it doesn't fix the issue of the flaky test that you've mentioned
...
...

Aha! I think that unfortunately you're right (see https://github.com/grafana/k6/actions/runs/7543208013/job/20546713199), so I guess it's just a matter of probability 🤷🏻

Thanks for the context and explanations!

PS: Although I have to admit that I'm curious about why it's flaky, or even more, why it's almost impossible to make it fail locally? If it's purely a matter of a bug in the code (that it considers the certificate errors as temporary and just waits for the context to expire), shouldn't it be more deterministic?

PS2: Indeed, I feel like if we would increase both timeouts enough (client & server), it would never fail, because it seems to remain waiting here and then follow this path, which ends up returning the connection error. That probably explains why increasing the server timeout at least seems to make the test less prone to failure. But yeah, I'm not super convinced that's what we really want, because increasing the timeouts means the server keeps waiting at the aforementioned line until the context deadline is exceeded anyway (even if the test finishes successfully...).

@olegbespalov (Collaborator)

PS: Although I have to admit that I'm curious about why it's flaky, or even more, why it's almost impossible to make it fail locally?

I guess it's possible that we'd have to run the tests with something like go test -count=1000 ... 🙈

@olegbespalov (Collaborator) commented Jan 17, 2024

Indeed, I feel like if we would increase both timeouts enough (client & server), it would never fail, because it seems to remain waiting here and then follow this path, which ends up returning the connection error.

I think the predominant problem is that the connection error is stored in a structure and can be overwritten, and keeping in mind that gRPC internally manages things concurrently, we may not find what we expect there. This theory is supported by the fact that we see the bad certificate error in the logs at the beginning.

The gRPC team says that we shouldn't rely on these connection errors; it's even considered an anti-pattern there. I believe they are looking at them from the regular customer's perspective, but from our testing/load-testing perspective, having a chance to catch these connection errors is useful.
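As a toy illustration of that theory (this is not grpc-go's actual code, just a sketch of the pattern being described): when a "last error" field is written from multiple goroutines, a later, less specific error such as the context deadline can simply replace the earlier bad-certificate one.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// conn loosely mimics a connection object that remembers only the most
// recent error; names are illustrative, not grpc-go's internals.
type conn struct {
	mu      sync.Mutex
	lastErr error
}

func (c *conn) setErr(err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lastErr = err // whichever write happens last wins
}

func main() {
	c := &conn{}
	var wg sync.WaitGroup
	wg.Add(2)
	// One goroutine reports the interesting TLS failure...
	go func() { defer wg.Done(); c.setErr(errors.New("remote error: tls: bad certificate")) }()
	// ...while another reports the deadline; depending on scheduling,
	// the deadline error may be the one the test ends up observing.
	go func() { defer wg.Done(); c.setErr(context.DeadlineExceeded) }()
	wg.Wait()
	fmt.Println("last error:", c.lastErr)
}
```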

@joanlopez (Contributor, Author)

I think the predominant problem is that the connection error is stored in a structure and can be overwritten, and keeping in mind that gRPC internally manages things concurrently, we may not find what we expect there. This theory is supported by the fact that we see the bad certificate error in the logs at the beginning.

Okay, I see now what you mean. So, lastConnectionError could be overwritten by context deadline exceeded, even if there was a bad certificate there before. In that case, I guess it makes no sense to try to merge this PR, as increasing the timeout just makes the test slower (because the client waits until the context deadline is exceeded) without increasing the chances of success.

The gRPC team says that we shouldn't rely on these connection errors; it's even considered an anti-pattern there. I believe they are looking at them from the regular customer's perspective, but from our testing/load-testing perspective, having a chance to catch these connection errors is useful.

Yeah, I get that error generalization is tricky; it's the reason why Temporary and Timeout have been largely discussed (and deprecated/removed) in Go, for instance. However, I think that ideally, in this case, the tls package should return a tls.CertificateVerificationError, so we could do something similar to what you suggested for grpc-go but specifically for certificate errors (so we fail fast in that case, which I think would make sense).

However, it's not as easy as that: although the returned error models a badCertificate, there's no proper way to assert that (e.g. with errors.As/Is) and act accordingly, and achieving that is either impossible or would take too much effort (way out of the scope of our project).
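To make that concrete, this is roughly the kind of check that would allow failing fast, if the dial error actually wrapped a typed certificate error; as noted above, in practice it doesn't, so the wrapping below is done by hand purely to show the pattern:

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
)

// isCertificateError reports whether err wraps a TLS certificate
// verification failure. This only works if the error chain actually
// contains a *tls.CertificateVerificationError, which is precisely
// what the gRPC connection errors discussed above don't guarantee.
func isCertificateError(err error) bool {
	var certErr *tls.CertificateVerificationError
	return errors.As(err, &certErr)
}

func main() {
	// Hand-built example error; in the real scenario it would come back
	// from the gRPC dial/connect call.
	err := fmt.Errorf("connection error: %w", &tls.CertificateVerificationError{})
	fmt.Println(isCertificateError(err)) // true
}
```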

@joanlopez joanlopez closed this Jan 17, 2024
@olegbespalov (Collaborator) commented Jan 17, 2024

My issue is that this test case sounds a bit artificial to me. When I tried to investigate, I found that the error we're getting is the kind you get when you try to use a valid certificate, but one that was issued for email signing.

I also believe that in that case we are actually testing not k6's functionality but rather TLS or gRPC, with the caveat that we are "proxying" the errors back to the client. Keeping in mind the comments above, maybe it's better to remove that test case and note in the connect documentation that, in some cases, we can't guarantee that an error will be returned, and it's better to handle errors coming from Invoke.

@joanlopez joanlopez mentioned this pull request Jan 18, 2024