Support Kubernetes python live v12 in KubernetesExecutor - ApiException/410 errors #11841

Closed
jkinkead opened this issue Oct 25, 2020 · 11 comments
Labels
kind:feature Feature Requests provider:cncf-kubernetes Kubernetes provider related issues

Comments

@jkinkead
Contributor

Apache Airflow version: 1.10.11

Kubernetes version (if you are using kubernetes) (use kubectl version):

Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-eks-065dce", GitCommit:"065dcecfcd2a91bd68a17ee0b5e895088430bd05", GitTreeState:"clean", BuildDate:"2020-07-16T01:44:47Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}

What happened:

We've been seeing occasional issues in our logs where the Kubernetes executor throws an API exception on this stream call:

[2020-10-25 15:59:15,636] {{kubernetes_executor.py:277}} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 271, in run
    self.worker_uuid, self.kube_config)
  File "/usr/local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 299, in _run
    **kwargs):
  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 177, in stream
    status=obj['code'], reason=reason)
kubernetes.client.exceptions.ApiException: (410)
Reason: Gone: too old resource version: 46672510 (46702381)

This is a normal response (one already handled in the process_error method) and should be handled gracefully, probably the way the event is: by catching it and resetting self.resource_version.
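A minimal sketch of the handling suggested here, assuming the watcher tracks a resource version and restarts the stream from "0" when the server answers 410 Gone. `ApiException` below is a local stand-in for `kubernetes.client.exceptions.ApiException` (which exposes a `status` attribute) so the snippet runs without the client installed; `watch_pods` and its `stream` argument are hypothetical names, not Airflow code.

```python
class ApiException(Exception):
    """Stand-in for kubernetes.client.exceptions.ApiException (assumption)."""

    def __init__(self, status, reason=""):
        super().__init__(f"({status}) Reason: {reason}")
        self.status = status
        self.reason = reason


def watch_pods(stream, resource_version="0", max_restarts=3):
    """Consume events from stream(resource_version); restart from "0" on 410.

    `stream` is a hypothetical callable returning an iterator of
    (resource_version, payload) tuples; it may raise ApiException.
    """
    seen = []
    restarts = 0
    while True:
        try:
            for version, payload in stream(resource_version):
                resource_version = version  # remember where we got to
                seen.append(payload)
            return resource_version, seen
        except ApiException as exc:
            if exc.status != 410 or restarts >= max_restarts:
                raise  # anything other than "Gone" is still fatal
            # 410 Gone: the stored resource version is too old, start over.
            resource_version = "0"
            restarts += 1
```

The `max_restarts` cap is a design choice of this sketch: it keeps a server that always answers 410 from causing an endless loop.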

Anything else we need to know:

This seems to be triggered by having very long-running (multiple days old) task pods in our system. These aren't normal operations, but were the result of some deadlocking bugs.

@jkinkead jkinkead added the kind:bug This is a clearly a bug label Oct 25, 2020
@boring-cyborg

boring-cyborg bot commented Oct 25, 2020

Thanks for opening your first issue here! Be sure to follow the issue template!

@alaiou

alaiou commented Oct 26, 2020

We also encountered this issue. Turns out the root cause was the newest release of the k8s python client https://github.com/kubernetes-client/python/releases/tag/v12.0.0. Code was added to handle the 410 status code and raise an Exception in this PR. In Airflow, however, the KubernetesJobWatcher is expecting an event, which it would then handle gracefully by resetting the resource_version number in process_error. Because the client now raises, it never actually gets to that point in the code. As you can see, the Exception thrown is from here.
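To make the difference concrete, here is a hedged sketch of the two ways a 410 can reach the watcher: as an `ERROR` event (what the watcher expects) or as a raised exception (what client v12 does, per the traceback). The event shape (`type`, `raw_object` with a `code`, and an illustrative `resource_version` key) and the `process_error` body are assumptions modelled on the description in this thread, not copied from Airflow; `ApiException` is a local stand-in for the client's class.

```python
class ApiException(Exception):
    """Local stand-in for kubernetes.client.exceptions.ApiException."""

    def __init__(self, status, reason=""):
        super().__init__(f"({status}) {reason}")
        self.status = status
        self.reason = reason


def process_error(event):
    """Event path: return the resource version to resume from."""
    raw = event["raw_object"]
    if raw["code"] == 410:
        return "0"  # too old resource version, reset
    raise RuntimeError(f"Unhandled watch error: {raw}")


def next_resource_version(events):
    """Drain events; treat a 410 the same whether it is yielded or raised."""
    last = None
    try:
        for event in events:
            if event["type"] == "ERROR":
                return process_error(event)
            last = event["resource_version"]  # hypothetical key
    except ApiException as exc:
        if exc.status != 410:
            raise
        # Re-shape the exception as the event the old code path expects.
        return process_error(
            {"raw_object": {"code": exc.status, "message": exc.reason}}
        )
    return last
```

Routing the exception through the same function keeps a single place that decides what a 410 means.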

Our workaround was to explicitly use the previous version of the k8s python client, by installing the appropriate constrained ("known-to-be-working") versions of Airflow and its libraries:

pip install \
 apache-airflow[kubernetes]==1.10.12 \
 --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-1.10.12/constraints-3.7.txt"
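As a rough way to confirm that the pin took effect, the installed client version can be checked at startup. This is only a sketch: the helper names are hypothetical, the "below 12" threshold comes from this thread, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib import metadata


def major_version(version):
    """Return the leading integer of a dotted version string."""
    return int(version.split(".")[0])


def kubernetes_client_is_pre_v12(installed=None):
    """True if the kubernetes client is older than 12, None if not installed.

    Pass `installed` explicitly to check a version string without
    looking at the current environment.
    """
    if installed is None:
        try:
            installed = metadata.version("kubernetes")
        except metadata.PackageNotFoundError:
            return None
    return major_version(installed) < 12
```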

@rigogsilva

🤔 So, I am not sure what the apache-airflow part is, since we install the kubernetes python library ourselves. Should I just change the pin from kubernetes==12.0.0 to kubernetes==11.0.0 (pip install kubernetes==11.0.0)?

@potiuk
Member

potiuk commented Nov 29, 2020

As a community, we heartily recommend using the official constraints of Airflow to install it.

You can see the constraints described in https://airflow.apache.org/docs/stable/installation.html (if you just care about the user story) as well as some details on how and why it works in https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pinned-constraint-files

Those constraint files contain a set of "known to be working" versions for Airflow - those are automatically upgraded by our test harness when we find them passing the tests and consistent with other limitations. While we cannot block you from upgrading, using the versions from the constraints is the safest way to proceed. We are just about to release a bugfix 1.10.14 release and we are also upgrading the constraints there.

As pointed out by @alaiou - you can use the --constraint from GitHub, or download the constraint file and use it locally. At your own risk, you can also modify and use other versions. You can also try the latest 1-10 version of the constraints (candidate to 1.10.14) - just specify constraints-1-10 instead of the full version:

pip install \
 apache-airflow[kubernetes]==1.10.12 \
 --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-1-10/constraints-3.7.txt"
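The constraint URLs in this thread follow a visible pattern: a `constraints-<ref>` branch or tag, then a `constraints-<python version>.txt` file. A small sketch that builds the URL and the matching pip arguments is shown below; the function names are hypothetical and the pattern is inferred only from the URLs quoted here.

```python
def constraints_url(ref, python_version):
    """Build the raw GitHub URL for an Airflow constraint file.

    `ref` is the part after "constraints-" in the branch or tag name,
    e.g. "1.10.12"; `python_version` is e.g. "3.7".
    """
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{ref}/constraints-{python_version}.txt"
    )


def pip_install_args(airflow_version, python_version, ref=None):
    """Return pip arguments that install Airflow with a constraint file."""
    url = constraints_url(ref or airflow_version, python_version)
    return [
        "pip",
        "install",
        f"apache-airflow[kubernetes]=={airflow_version}",
        "--constraint",
        url,
    ]
```

Returning an argument list rather than a shell string means it can be passed to `subprocess.run` without quoting the extras bracket.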

And while we can provide that "known to be working" set of versions, if you have your own libraries/requirements - you can modify them yourself. In the upcoming version we will make sure that the constraints are fully consistent (so `pip check` does not complain when you install all dependencies) - and we recommend you do the same in your installation.

BTW. In both 2.0.0beta (constraints-master branch) and the upcoming 1.10.14 (constraints-1-10) the kubernetes version is set to 11.0.0.

@potiuk
Member

potiuk commented Nov 29, 2020

I am closing the issue as it is clearly about a newer version of kubernetes that is not supported.

@potiuk potiuk closed this as completed Nov 29, 2020
@ashb
Member

ashb commented Nov 30, 2020

I'm going to re-open this, and change the title to match.

@ashb ashb reopened this Nov 30, 2020
@ashb ashb changed the title KubernetesExecutor occasionally throws exception on 410 responses Support Kubernetes python live v12 in KubernetesExecutor - ApiException/410 errors Nov 30, 2020
@ashb ashb added kind:feature Feature Requests and removed kind:bug This is a clearly a bug labels Nov 30, 2020
@dimberman
Contributor

@alaiou would you be interested in PRing the fix to this? I'd be glad to help get you set up with breeze/review :)

@kaxil kaxil added this to To Do in Kubernetes Issues - Sprint via automation Feb 24, 2021
@jedcunningham
Member

kubernetes-client/python#1304 originally led to us pinning to 11, so it'd need to be handled (either fixed or worked around) before we can support 12.

@kaxil
Member

kaxil commented Sep 17, 2021

We should fix this soon as 11.0 is more than 1.5 years old - https://pypi.org/project/kubernetes/#history

@jedcunningham
Member

This was resolved with #18797.

@jedcunningham
Member

2.3.0 will allow 12+.
