
Increase the HTTPMultiBin server timeout on flaky test #3545

Closed · joanlopez wants to merge 1 commit

Conversation

@joanlopez (Contributor) commented Jan 16, 2024

What?

It increases the HTTPMultiBin server timeout for a specific flaky test that fails from time to time (for instance, here), from 2*time.Second to 5*time.Second.
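For context, a minimal sketch of the kind of change this describes, assuming the test server is a standard net/http/httptest server whose per-connection timeouts are set explicitly; the helper and field usage below are illustrative, not k6's actual HTTPMultiBin code:

```go
package testutils

import (
	"net/http"
	"net/http/httptest"
	"time"
)

// newSlowToleratingServer is a hypothetical helper showing the change:
// bump the per-connection server timeouts from 2s to 5s so slower CI
// runners don't hit the server-side deadline before the test finishes.
func newSlowToleratingServer(handler http.Handler) *httptest.Server {
	srv := httptest.NewUnstartedServer(handler)
	srv.Config.ReadTimeout = 5 * time.Second  // previously 2 * time.Second
	srv.Config.WriteTimeout = 5 * time.Second // previously 2 * time.Second
	srv.StartTLS()
	return srv
}
```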

Why?

After having looked at the code involved and tried a few possible solutions, it seemed to me that the root cause was the server timing out from time to time, depending on the conditions under which the test is executed (runner resources, etc.).

In fact, it's difficult to say whether it will ever fail again, because it's not easy to reproduce. To reproduce it, I ran the specific test multiple times (-count=X), with the race detector enabled (-race), with limited resources (GOMAXPROCS=1), and on multiple runners (windows, macos, etc.). At the very least, the change seems to greatly reduce the chances of failure (previously the test always failed under those conditions).
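A reproduction attempt along those lines might look like the following; the test name pattern and the repetition count are placeholders, not the exact values used:

```sh
# Run the flaky test repeatedly, with the race detector enabled and a
# single CPU, to increase the chances of hitting the timing-dependent failure.
GOMAXPROCS=1 go test -race -count=100 -run 'TestGRPC' ./...
```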

Checklist

  • I have performed a self-review of my code.
  • I have added tests for my changes.
  • I have run linter locally (make lint) and all checks pass.
  • I have run tests locally (make tests) and all tests pass.
  • I have commented on my code, particularly in hard-to-understand areas.

Related PR(s)/Issue(s)

Contributes to #2144

@codecov-commenter commented Jan 16, 2024

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (d031d2b) 72.97% compared to head (2027fb8) 72.93%.
The report is 1 commit behind head on master.

❗ Current head 2027fb8 differs from pull request most recent head b5993dc. Consider uploading reports for the commit b5993dc to get more accurate results

Files       Patch %   Lines
cmd/ui.go   91.66%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3545      +/-   ##
==========================================
- Coverage   72.97%   72.93%   -0.04%     
==========================================
  Files         280      278       -2     
  Lines       20941    20944       +3     
==========================================
- Hits        15281    15276       -5     
- Misses       4690     4694       +4     
- Partials      970      974       +4     
Flag      Coverage Δ
macos     ?
ubuntu    72.93% <94.44%> (+0.01%) ⬆️
windows   ?

Flags with carried forward coverage won't be shown.


@joanlopez joanlopez changed the title from Experimental to Increase the HTTPMultiBin server timeout on flaky test Jan 16, 2024
@joanlopez joanlopez marked this pull request as ready for review January 16, 2024 15:39
@joanlopez joanlopez requested a review from a team as a code owner January 16, 2024 15:39
@joanlopez joanlopez requested review from codebien and olegbespalov and removed request for a team January 16, 2024 15:39
@olegbespalov (Collaborator) commented Jan 16, 2024

I don't mind increasing the timeout of the server, but unfortunately, I bet it doesn't fix the issue of the flaky test that you've mentioned.

Long story short, the cause of the issue is not in our code base but in the gRPC package, which gives no guarantee that connect will always return the same error, since many concurrent things are happening inside.

For instance, if we look closer at the CI job logs (in fact, they are direct logs from the gRPC package), we see that the expected message is in the CI output, but it wasn't propagated by grpc, and as a result the test fails with "context ended" since it simply can't wait any longer.

(screenshot: CI job logs showing the expected error message)

I even opened an issue against grpc-go and planned to work on it, but gave up since I have neither the capacity nor a good understanding of gRPC's internals. On our side, the issue is #3551.

@joanlopez (Contributor, Author) commented Jan 17, 2024

I don't mind increasing the timeout of the server, but unfortunately, I bet it doesn't fix the issue of the flaky test that you've mentioned
...
...

Aha! I think that unfortunately you're right (see https://github.com/grafana/k6/actions/runs/7543208013/job/20546713199), so I guess it's just a matter of probability 🤷🏻

Thanks for the context and explanations!

PS: Although I have to admit that I'm curious about why it's flaky, or even more, why it's almost impossible to make it fail locally? If it's purely a matter of a bug in the code (that it considers the certificate errors as temporary and just waits for the context to expire), shouldn't it be more deterministic?

PS2: Indeed, I feel like if we would increase both timeouts enough (client & server), it would never fail, because it seems to remain waiting here and then follow this path, which ends up returning the connection error. That probably explains why increasing the server timeout at least seems to make the test less prone to failure. But yeah, I'm not super convinced that's what we really want, because increasing the timeouts means the server keeps waiting at the aforementioned line until the context deadline is exceeded anyway (even if the test finishes successfully...).

@olegbespalov (Collaborator)

PS: Although I have to admit that I'm curious about why it's flaky, or even more, why it's almost impossible to make it fail locally?

I guess it's possible that we'd have to run the tests with something like go test -count=1000 ... 🙈

@olegbespalov (Collaborator) commented Jan 17, 2024

Indeed, I feel like if we would increase both timeouts enough (client & server), it would never fail, because it seems to remain waiting here and then follow this path, which ends up returning the connection error.

I think the predominant problem is that the connection error is stored in a structure and can be overwritten, and keeping in mind that gRPC internally manages things concurrently, we may not find what we expect there. This theory is supported by the fact that we see the bad certificate error in the logs at the beginning.

The gRPC team says that we shouldn't rely on these connection errors; it's even considered an anti-pattern there. I believe they are looking at them from the regular customer's perspective, but from our testing/load-testing perspective, having a chance to catch these connection errors is useful.
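As a toy illustration of that theory (this is not grpc-go's actual code, just a sketch of the pattern being described): when a "last error" field is written from multiple goroutines, a later, less specific error such as the context deadline can simply replace the earlier bad-certificate one.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// conn loosely mimics a connection object that remembers only the most
// recent error; names are illustrative, not grpc-go's internals.
type conn struct {
	mu      sync.Mutex
	lastErr error
}

func (c *conn) setErr(err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lastErr = err // whichever write happens last wins
}

func main() {
	c := &conn{}
	var wg sync.WaitGroup
	wg.Add(2)
	// One goroutine reports the interesting TLS failure...
	go func() { defer wg.Done(); c.setErr(errors.New("remote error: tls: bad certificate")) }()
	// ...while another reports the deadline; depending on scheduling,
	// the deadline error may be the one the test ends up observing.
	go func() { defer wg.Done(); c.setErr(context.DeadlineExceeded) }()
	wg.Wait()
	fmt.Println("last error:", c.lastErr)
}
```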

@joanlopez (Contributor, Author)

I think the predominant problem is that the connection error is stored in a structure and can be overwritten, and keeping in mind that gRPC internally manages things concurrently, we may not find what we expect there. This theory is supported by the fact that we see the bad certificate error in the logs at the beginning.

Okay, I see now what you mean. So, lastConnectionError could be overwritten by context deadline exceeded, even if there was a bad certificate there before. In that case, I guess it makes no sense to try to merge this PR, as increasing the timeout just makes the test slower (because the client waits until the context deadline is exceeded) without increasing the chances of success.

The gRPC team says that we shouldn't rely on these connection errors; it's even considered an anti-pattern there. I believe they are looking at them from the regular customer's perspective, but from our testing/load-testing perspective, having a chance to catch these connection errors is useful.

Yeah, I get that error generalization is tricky; it's the reason why Temporary and Timeout have been largely discussed (and deprecated/removed) in Go, for instance. However, I think that ideally, in this case, the tls package should return a tls.CertificateVerificationError, so we could do something similar to what you suggested for grpc-go but specifically for certificate errors (so we fail fast in that case, which I think would make sense).

However, it's not as easy as that: although the returned error models a badCertificate, there's no proper way to assert that (e.g. with errors.As/Is) and act accordingly, and achieving that is either impossible or would take too much effort (way out of the scope of our project).
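To make that concrete, this is roughly the kind of check that would allow failing fast, if the dial error actually wrapped a typed certificate error; as noted above, in practice it doesn't, so the wrapping below is done by hand purely to show the pattern:

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
)

// isCertificateError reports whether err wraps a TLS certificate
// verification failure. This only works if the error chain actually
// contains a *tls.CertificateVerificationError, which is precisely
// what the gRPC connection errors discussed above don't guarantee.
func isCertificateError(err error) bool {
	var certErr *tls.CertificateVerificationError
	return errors.As(err, &certErr)
}

func main() {
	// Hand-built example error; in the real scenario it would come back
	// from the gRPC dial/connect call.
	err := fmt.Errorf("connection error: %w", &tls.CertificateVerificationError{})
	fmt.Println(isCertificateError(err)) // true
}
```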

@joanlopez joanlopez closed this Jan 17, 2024
@olegbespalov (Collaborator) commented Jan 17, 2024

My issue is that this test case sounds a bit artificial to me. When I tried to investigate, I found that the error we're getting is the kind you get when you try to use a valid certificate, but one that was issued for email signing.

I also believe that in that case we are actually testing not k6's functionality but rather TLS or gRPC, with the caveat that we are "proxying" the errors back to the client. Keeping in mind the comments above, maybe it's better to remove that test case and note in the connect documentation that, in some cases, we can't guarantee that an error will be returned, and it's better to handle errors coming from Invoke.

@joanlopez joanlopez mentioned this pull request Jan 18, 2024