
[Test Flake] argo_client wait_for_workflows doesn't handle retryable errors correctly #204

Closed
jlewi opened this issue Sep 20, 2018 · 0 comments


jlewi commented Sep 20, 2018

Test flake:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/kubeflow_kubeflow/1579/kubeflow-presubmit/3421/?log#log

Here's the exception:

ERROR|2018-09-20T03:41:37|/src/kubeflow/testing/py/kubeflow/testing/run_e2e_workflow.py|257| Exception occurred: (401)
Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'Date': 'Thu, 20 Sep 2018 03:41:37 GMT', 'Audit-Id': '565e8792-b2a6-4df3-86ec-9a0e0b46ba16', 'Content-Length': '129', 'Content-Type': 'application/json', 'Www-Authenticate': 'Basic realm="kubernetes-master"'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}

Traceback (most recent call last):
  File "/src/kubeflow/testing/py/kubeflow/testing/run_e2e_workflow.py", line 243, in run
    status_callback=argo_client.log_status)
  File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/src/kubeflow/testing/py/kubeflow/testing/argo_client.py", line 70, in wait_for_workflows
    GROUP, VERSION, namespace, PLURAL, n)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/apis/custom_objects_api.py", line 697, in get_namespaced_custom_object
    (data) = self.get_namespaced_custom_object_with_http_info(group, version, namespace, plural, name, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/apis/custom_objects_api.py", line 797, in get_namespaced_custom_object_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py", line 321, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py", line 155, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py", line 342, in request
    headers=headers)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/rest.py", line 231, in GET
    query_params=query_params)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
ApiException: (401)

This should be a retryable error; the stack trace shows the retrying module is invoked.
We added retries to wait_for_workflows:

@retry(wait_exponential_multiplier=1000, wait_exponential_max=10000,

The problem is that the retry decorator wraps the entire wait_for_workflows function, with a maximum retry window of 20 minutes.
That means a retryable error won't be retried if it occurs after wait_for_workflows has already been running for more than 20 minutes.

We should be applying the retry just to the call to the K8s API:

results = crd_api.get_namespaced_custom_object(
