
Flakiness on integration tests #1402

Closed
adriangonz opened this issue Feb 4, 2020 · 4 comments · Fixed by #1518

adriangonz (Contributor) commented Feb 4, 2020

Context

Sometimes the integration tests behave flakily and raise false errors. We have noticed two different cases where this usually happens:

  1. 503 or 504 errors on one or two of the tests in testing/scripts/test_rolling_updates.py. An example can be seen in the log output for PR #1400. This has happened in the past either due to issues with Ambassador / Istio failing over to the new deployment (an actual red flag) or due to spamming the SeldonDeployment pod with too many requests.

_________________ TestRollingHttp.test_rolling_update8[ambas] __________________
[gw0] linux -- Python 3.6.0 /usr/local/bin/python

self = <test_rolling_updates.TestRollingHttp object at 0x7f9535cffc88>
namespace = 'test-rolling-update8-ambas', api_gateway = 'localhost:8003'

    @with_api_gateways
    # Test updating a model with a new resource request but same image
    def test_rolling_update8(self, namespace, api_gateway):
        if api_gateway == API_ISTIO_GATEWAY:
            retry_run(
                f"kubectl create -f ../resources/seldon-gateway.yaml -n {namespace}"
            )
        retry_run(f"kubectl apply -f ../resources/graph1svc.json -n {namespace}")
        wait_for_status("mymodel", namespace)
        wait_for_rollout("mymodel", namespace, expected_deployments=2)
        r = initial_rest_request("mymodel", namespace, endpoint=api_gateway)
        assert r.status_code == 200
        assert r.json()["data"]["tensor"]["values"] == [1.0, 2.0, 3.0, 4.0]
        retry_run(f"kubectl apply -f ../resources/graph4svc.json -n {namespace}")
        r = initial_rest_request("mymodel", namespace, endpoint=api_gateway)
        assert r.status_code == 200
        assert r.json()["data"]["tensor"]["values"] == [1.0, 2.0, 3.0, 4.0]
        i = 0
        for i in range(50):
            r = rest_request_ambassador("mymodel", namespace, api_gateway)
>           assert r.status_code == 200
E           assert 503 == 200
E             -503
E             +200

test_rolling_updates.py:287: AssertionError

  2. An error installing Istio, in particular a timeout waiting for the istio-ingressgateway pod to come up. Instead of failing early, the tests continue, which in turn causes (almost) all of them to fail. An example can be seen in the log output for PR #1382. This has happened in the past whenever the cluster ran out of ephemeral storage (e.g. #1322: PR builds intermittently fail with no space left on device).

kubectl rollout status deployment.apps/istio-ingressgateway -n istio-system
Waiting for deployment "istio-ingressgateway" rollout to finish: 0 of 1 updated replicas are available...
error: deployment "istio-ingressgateway" exceeded its progress deadline
make: *** [Makefile:45: install_istio] Error 1
make: Entering directory '/workspace/source/python'

In both cases the failures seem to be intermittent.

Workaround

Until these issues are fixed properly, we can verify whether a failure is a real red flag by running the tests locally (and adding a note about it on the PR for the reviewers).

adriangonz added the bug and triage (Needs to be triaged and prioritised accordingly) labels on Feb 4, 2020
ukclivecox removed the triage (Needs to be triaged and prioritised accordingly) label on Feb 6, 2020
ukclivecox added this to the 1.2 milestone on Feb 6, 2020
adriangonz (Contributor, Author) commented:

As a temporary workaround, in PR #1415 I'm marking the tests relevant to case 1 above as "flaky" using the flaky package. This means that, upon failure, they will be retried one more time.

Note that this is temporary: a test can still fail (if both runs fail) and the underlying issue should be fixed properly.
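For reference, a minimal sketch of how the flaky package can mark a test for one retry (the decorator parameters are flaky's max_runs/min_passes; the test name and body below are placeholders, not the actual change in #1415):

from flaky import flaky

# With max_runs=2 a failing test gets one extra attempt; a single
# passing run (min_passes=1) is enough for it to be reported green.
@flaky(max_runs=2, min_passes=1)
def test_rolling_update_example():
    # ... the real rolling-update requests and assertions would go here ...
    assert True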

adriangonz (Contributor, Author) commented:

I've been running the tests relevant to case 1 above (i.e. the suite in test_rolling_updates.py) locally to see if I could find any kind of pattern.

Since I can replicate the flakiness locally, it seems safe to assume that it is not just related to a lack of resources on the Kind cluster. Instead, it seems that occasionally Istio and Ambassador may lose track of all the Pods in a Service. Every time I saw this, it happened during the update from the old to the new version of the test model.

Whenever it happens, Istio returns a 503 and Ambassador returns a 504. It's also much more frequent with Istio than with Ambassador (at least when I run the tests locally).

RafalSkolasinski (Contributor) commented:

In #1481 I increased the flaky max_retries parameter to 3. This was enough to get a successful run of test_rolling_updates locally on the first try.

I have a feeling that introducing a short sleep in assert_model_during_op could help reduce the problem, though I haven't tested it yet.
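As an illustration only, the suggested pause could sit between probes in the helper's polling loop. This is a rough sketch under assumptions: assert_model_during_op's real signature lives in the test helpers, and the op/probe callables, poll count and 0.5 s delay below are made up for the example:

import time

def assert_model_during_op(op, probe, polls=50, delay=0.5):
    # Start the operation (e.g. applying the new SeldonDeployment graph).
    op()
    for _ in range(polls):
        # probe() is expected to assert on the response status/payload.
        probe()
        # The short sleep is the proposed change: it avoids hammering the
        # gateway while the rolling update is still settling.
        time.sleep(delay)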

ukclivecox (Contributor) commented:

The assumption is that it's the Ambassador tests, and that it's due to the change setting Ambassador retries to zero. Will change the tests to do 1 retry. Am testing locally.
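A hypothetical sketch of what a single retry at the test level could look like (the helper name, the use of the requests library and the 5xx check are illustrative assumptions, not the actual change):

import time
import requests

def rest_request_with_retry(url, payload, retries=1, backoff=1.0):
    # Retry once on a 5xx response to compensate for Ambassador's own
    # retries having been set to zero.
    for attempt in range(retries + 1):
        r = requests.post(url, json=payload)
        if r.status_code < 500 or attempt == retries:
            return r
        time.sleep(backoff)  # brief pause before the single retry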
