Flakiness on integration tests #1402
Comments
I've been running the tests relevant to […]. As I can replicate the flakiness locally, it seems safe to assume that they are not just related to a lack of resources on the Kind cluster. Instead, it seems that occasionally Istio and Ambassador may lose track of all the […]. Whenever it happens, Istio returns a […].
In #1481 I increased the flakiness max_retries parameter to 3. I have a feeling that introducing a short sleep to […] might also help.
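As a minimal sketch (not the repo's actual helper), the kind of retry-with-a-short-sleep logic hinted at here could look like the following; the function name, URL handling, and parameters are hypothetical.

```python
# Hypothetical retry helper: retry a prediction request a few times, pausing
# briefly between attempts so the ingress has time to settle.
import time

import requests


def predict_with_retries(url, payload, max_retries=3, sleep_seconds=1.0):
    response = None
    for _ in range(max_retries + 1):
        response = requests.post(url, json=payload)
        if response.status_code == 200:
            return response
        time.sleep(sleep_seconds)  # give Istio / Ambassador a moment before retrying
    return response
```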
My assumption is that it's the Ambassador tests, due to the change setting Ambassador retries to zero. Will change the tests to do 1 retry. Am testing locally.
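If the suite uses the `flaky` pytest plugin for these retries (an assumption, not confirmed in this thread), allowing a single retry on an Ambassador test could look like the sketch below; the test name is a placeholder.

```python
# max_runs=2 means "run at most twice", i.e. one retry on failure.
from flaky import flaky


@flaky(max_runs=2, min_passes=1)
def test_rolling_update_with_ambassador():
    ...  # the actual Ambassador rolling-update assertions would go here
```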
Context
Sometimes the integration tests behave flakily and raise false errors. We have noticed two different cases where this usually happens:
- `503` or `504` errors on one or two of the tests in `testing/scripts/test_rolling_updates.py`. An example can be seen in the log output for PR #1400 (`seldon-core/jenkins-x/logs/SeldonIO/seldon-core/PR-1400/3.log`, lines 18139 to 18170 at `f21fa8e`). This has happened in the past either due to issues with Ambassador / Istio failing over to the new deployment (an actual red flag) or due to spamming the `SeldonDeployment` pod with too many requests.
- The `istio-ingressgateway` pod failing to come up. Instead of failing early, the tests continue, which in turn causes (almost) all of them to fail (see the readiness-check sketch after this list). An example can be seen in the log output for PR #1382 (`seldon-core/jenkins-x/logs/SeldonIO/seldon-core/PR-1382/4.log`, lines 340 to 344 at `f21fa8e`). This has happened in the past whenever the cluster ran out of ephemeral storage (e.g. #1322, "Internal build issue: PR builds intermittently fail with no space left on device").
In both cases the failures seem to be intermittent.
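As a rough illustration of the "fail early" idea for the second case, a pre-flight check along these lines could abort the run when the `istio-ingressgateway` deployment never becomes ready. This is a sketch, not the project's actual test harness code; it assumes the official `kubernetes` Python client and the default `istio-system` namespace.

```python
# Abort early if the Istio ingress gateway never becomes ready, instead of
# letting the whole suite run and fail with confusing errors.
import time

from kubernetes import client, config


def wait_for_istio_gateway(namespace="istio-system", timeout_seconds=300):
    config.load_kube_config()
    apps = client.AppsV1Api()
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        dep = apps.read_namespaced_deployment("istio-ingressgateway", namespace)
        if (dep.status.ready_replicas or 0) >= (dep.spec.replicas or 1):
            return
        time.sleep(5)
    raise RuntimeError("istio-ingressgateway did not become ready in time")
```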
Workaround
Until these issues are properly fixed, we can verify whether a failure is a red flag by re-running the affected tests locally (and adding a note about it on the PR for the reviewers).
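For reference, a local re-run can be as simple as invoking the suspect module via `pytest.main` (assuming a local cluster and the test dependencies are already set up; the exact setup steps are repo-specific).

```python
import sys

import pytest

# Run the suspect module with verbose output; a consistent local pass suggests
# the CI failure was a flake rather than a genuine regression.
sys.exit(pytest.main(["testing/scripts/test_rolling_updates.py", "-v"]))
```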