
Flakiness on integration tests #1402

Closed
adriangonz opened this issue Feb 4, 2020 · 4 comments · Fixed by #1518

adriangonz (Contributor) commented Feb 4, 2020

Context

Sometimes the integration tests behave flakily and raise false errors. We have noticed two different cases where this usually happens:

  1. 503 or 504 errors on one or two of the tests in testing/scripts/test_rolling_updates.py. An example can be seen in the log output for PR #1400. This has happened in the past either due to issues with Ambassador / Istio failing over to the new deployment (an actual red flag) or due to spamming the SeldonDeployment pod with too many requests.

_________________ TestRollingHttp.test_rolling_update8[ambas] __________________
[gw0] linux -- Python 3.6.0 /usr/local/bin/python

self = <test_rolling_updates.TestRollingHttp object at 0x7f9535cffc88>
namespace = 'test-rolling-update8-ambas', api_gateway = 'localhost:8003'

    @with_api_gateways
    # Test updating a model with a new resource request but same image
    def test_rolling_update8(self, namespace, api_gateway):
        if api_gateway == API_ISTIO_GATEWAY:
            retry_run(
                f"kubectl create -f ../resources/seldon-gateway.yaml -n {namespace}"
            )
        retry_run(f"kubectl apply -f ../resources/graph1svc.json -n {namespace}")
        wait_for_status("mymodel", namespace)
        wait_for_rollout("mymodel", namespace, expected_deployments=2)
        r = initial_rest_request("mymodel", namespace, endpoint=api_gateway)
        assert r.status_code == 200
        assert r.json()["data"]["tensor"]["values"] == [1.0, 2.0, 3.0, 4.0]
        retry_run(f"kubectl apply -f ../resources/graph4svc.json -n {namespace}")
        r = initial_rest_request("mymodel", namespace, endpoint=api_gateway)
        assert r.status_code == 200
        assert r.json()["data"]["tensor"]["values"] == [1.0, 2.0, 3.0, 4.0]
        i = 0
        for i in range(50):
            r = rest_request_ambassador("mymodel", namespace, api_gateway)
>           assert r.status_code == 200
E           assert 503 == 200
E             -503
E             +200

test_rolling_updates.py:287: AssertionError

  2. An error installing Istio, in particular a timeout waiting for the istio-ingressgateway pod to come up. Instead of failing early, the tests continue, which in turn causes (almost) all of them to fail. An example can be seen in the log output for PR #1382. This has happened in the past whenever the cluster ran out of ephemeral storage (e.g. #1322: PR builds intermittently fail with no space left on device).

kubectl rollout status deployment.apps/istio-ingressgateway -n istio-system
Waiting for deployment "istio-ingressgateway" rollout to finish: 0 of 1 updated replicas are available...
error: deployment "istio-ingressgateway" exceeded its progress deadline
make: *** [Makefile:45: install_istio] Error 1
make: Entering directory '/workspace/source/python'

In both cases the failures seem to be intermittent.

Workaround

Until these issues are fixed properly, we can verify whether a failure is a real red flag by running the tests locally (and adding a note about it on the PR for the reviewers).

adriangonz added the bug and triage (Needs to be triaged and prioritised accordingly) labels on Feb 4, 2020
ukclivecox removed the triage (Needs to be triaged and prioritised accordingly) label on Feb 6, 2020
ukclivecox added this to the 1.2 milestone on Feb 6, 2020
adriangonz (Contributor, Author) commented:

As a temporary workaround, in PR #1415 I'm marking the tests relevant to case 1 above as "flaky" using the flaky package. This means that, upon failure, they will be retried one more time.

Note that this is temporary: a test can still fail (if both runs fail) and the underlying issue should be fixed properly.
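For reference, a minimal sketch of how the flaky package can mark a test for one retry (the decorator parameters are flaky's max_runs/min_passes; the test name and body below are placeholders, not the actual change in #1415):

from flaky import flaky

# With max_runs=2 a failing test gets one extra attempt; a single
# passing run (min_passes=1) is enough for it to be reported green.
@flaky(max_runs=2, min_passes=1)
def test_rolling_update_example():
    # ... the real rolling-update requests and assertions would go here ...
    assert True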

adriangonz (Contributor, Author) commented:

I've been running the tests relevant to case 1 above (i.e. the suite in test_rolling_updates.py) locally to see if I could find any kind of pattern.

Since I can replicate the flakiness locally, it seems safe to assume that it is not just related to a lack of resources on the Kind cluster. Instead, it seems that occasionally Istio and Ambassador may lose track of all the Pods in a Service. Every time I saw this, it happened during the update from the old to the new version of the test model.

Whenever it happens, Istio returns a 503 and Ambassador returns a 504. It's also much more frequent with Istio than with Ambassador (at least when I run the tests locally).

RafalSkolasinski (Contributor) commented:

In #1481 I increased the flaky max_retries parameter to 3. This was enough to get a successful run of test_rolling_updates locally on the first try.

I have a feeling that introducing a short sleep in assert_model_during_op could help reduce the problem, though I haven't tested it yet.
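As an illustration only, the suggested pause could sit between probes in the helper's polling loop. This is a rough sketch under assumptions: assert_model_during_op's real signature lives in the test helpers, and the op/probe callables, poll count and 0.5 s delay below are made up for the example:

import time

def assert_model_during_op(op, probe, polls=50, delay=0.5):
    # Start the operation (e.g. applying the new SeldonDeployment graph).
    op()
    for _ in range(polls):
        # probe() is expected to assert on the response status/payload.
        probe()
        # The short sleep is the proposed change: it avoids hammering the
        # gateway while the rolling update is still settling.
        time.sleep(delay)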

ukclivecox (Contributor) commented:

The assumption is that it's the Ambassador tests, and that it's due to the change setting Ambassador retries to zero. Will change the tests to do 1 retry. Am testing locally.
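A hypothetical sketch of what a single retry at the test level could look like (the helper name, the use of the requests library and the 5xx check are illustrative assumptions, not the actual change):

import time
import requests

def rest_request_with_retry(url, payload, retries=1, backoff=1.0):
    # Retry once on a 5xx response to compensate for Ambassador's own
    # retries having been set to zero.
    for attempt in range(retries + 1):
        r = requests.post(url, json=payload)
        if r.status_code < 500 or attempt == retries:
            return r
        time.sleep(backoff)  # brief pause before the single retry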
