Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-5448] Handle istio-proxy for Kubernetes Pods #62

Merged
merged 1 commit into from
Sep 12, 2019

Conversation

sjmiller609
Copy link

@sjmiller609 sjmiller609 commented Sep 11, 2019

There is an existing problem in airflow where pods running under Istio are not cleaned up properly. This PR handles the cleanup logic for Istio. Tested working with master branch of Astronomer components

@sjmiller609 sjmiller609 force-pushed the jira/AIRFLOW-5448-1.10.5.backport branch from 069a425 to 6e5eaa9 Compare September 11, 2019 23:40
Istio service mesh is not compatible by default with Kubernetes Jobs.
The normal behavior is that a Job will be started, get an istio-proxy
sidecar attached to it via the istio mutating webhook, run until
completion, then the 'main' container in the pod stops, but istio-proxy
hangs around indefinitely. This change handles cleanly exiting the
Istio sidecar 'istio-proxy' when a Kubernetes Executor task completes.
@sjmiller609 sjmiller609 force-pushed the jira/AIRFLOW-5448-1.10.5.backport branch from 6e5eaa9 to 56bddbd Compare September 11, 2019 23:45
@sjmiller609
Copy link
Author

upstream: apache#6079

Copy link
Member

@schnie schnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! In the interest of getting more people banging on this, I'm going to go ahead and merge in and cut an alpha. @dimberman definitely give this a review, we can squash more fixes in tomorrow!

# which is where /quitquitquit is implemented.
# Default to 15020.
status_port = "15020"
for i in range(len(container.args)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sjmiller609 think there's any risk of this logic breaking in the future? I'm just thinking of the scenario where an istio upgrade changes the way its configured, like using a ConfigMap instead of a flag or something. I don't have any specific ideas on how to do this any better but just thought it'd be worth a discussion.

Copy link
Author

@sjmiller609 sjmiller609 Sep 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, we are version locking istio. I would think the best solution would be to add a system test for this scenario so we don't forget it when we upgrade

@schnie schnie merged commit 8598b7d into v1-10-5 Sep 12, 2019
kaxil pushed a commit that referenced this pull request Oct 30, 2019
Istio service mesh is not compatible by default with Kubernetes Jobs.
The normal behavior is that a Job will be started, get an istio-proxy
sidecar attached to it via the istio mutating webhook, run until
completion, then the 'main' container in the pod stops, but istio-proxy
hangs around indefinitely. This change handles cleanly exiting the
Istio sidecar 'istio-proxy' when a Kubernetes Executor task completes.
kaxil pushed a commit that referenced this pull request Dec 31, 2019
Istio service mesh is not compatible by default with Kubernetes Jobs.
The normal behavior is that a Job will be started, get an istio-proxy
sidecar attached to it via the istio mutating webhook, run until
completion, then the 'main' container in the pod stops, but istio-proxy
hangs around indefinitely. This change handles cleanly exiting the
Istio sidecar 'istio-proxy' when a Kubernetes Executor task completes.
kaxil pushed a commit that referenced this pull request Feb 28, 2020
Istio service mesh is not compatible by default with Kubernetes Jobs.
The normal behavior is that a Job will be started, get an istio-proxy
sidecar attached to it via the istio mutating webhook, run until
completion, then the 'main' container in the pod stops, but istio-proxy
hangs around indefinitely. This change handles cleanly exiting the
Istio sidecar 'istio-proxy' when a Kubernetes Executor task completes.

(cherry picked from commit 84fa48f)
kaxil pushed a commit that referenced this pull request Mar 19, 2020
Istio service mesh is not compatible by default with Kubernetes Jobs.
The normal behavior is that a Job will be started, get an istio-proxy
sidecar attached to it via the istio mutating webhook, run until
completion, then the 'main' container in the pod stops, but istio-proxy
hangs around indefinitely. This change handles cleanly exiting the
Istio sidecar 'istio-proxy' when a Kubernetes Executor task completes.

(cherry picked from commit 84fa48f)
kaxil pushed a commit that referenced this pull request Mar 23, 2020
Istio service mesh is not compatible by default with Kubernetes Jobs.
The normal behavior is that a Job will be started, get an istio-proxy
sidecar attached to it via the istio mutating webhook, run until
completion, then the 'main' container in the pod stops, but istio-proxy
hangs around indefinitely. This change handles cleanly exiting the
Istio sidecar 'istio-proxy' when a Kubernetes Executor task completes.

(cherry picked from commit 84fa48f)
kaxil pushed a commit that referenced this pull request Mar 30, 2020
Istio service mesh is not compatible by default with Kubernetes Jobs.
The normal behavior is that a Job will be started, get an istio-proxy
sidecar attached to it via the istio mutating webhook, run until
completion, then the 'main' container in the pod stops, but istio-proxy
hangs around indefinitely. This change handles cleanly exiting the
Istio sidecar 'istio-proxy' when a Kubernetes Executor task completes.

(cherry picked from commit 84fa48f)
danielhoherd pushed a commit that referenced this pull request Jan 5, 2023
Istio service mesh is not compatible by default with Kubernetes Jobs.
The normal behavior is that a Job will be started, get an istio-proxy
sidecar attached to it via the istio mutating webhook, run until
completion, then the 'main' container in the pod stops, but istio-proxy
hangs around indefinitely. This change handles cleanly exiting the
Istio sidecar 'istio-proxy' when a Kubernetes Executor task completes.

(cherry picked from commit 84fa48f)
(cherry picked from commit 6ed59bf)
(cherry picked from commit ba60ede)
(cherry picked from commit 80ac218)

Handle Istio containers with Kubernetes Executor Pod adoption (#1318)

closes https://github.com/astronomer/issues/issues/3030

>This edge case deals specifically with task instances that ended in the UP_FOR_RETRY state when a scheduler is adopting orphaned task. Generally, this issue does not affec OSS Airflow since the template kubernetes worker pods spawned doesn't have additional containers that would prevent the pod from going into the Succeeded pod state. Those pods in the Succeeded state are handled by the scheduler's adoption process in _adopt_completed_pods().

Since Astronomer's kubernetes worker pods have an additional container (istio-proxy), they are in the NotReady state when tasks are not killed and they are not eligible for adoption.

This can also happen for "completed" pods that have sidecars. Same process though, just a slightly different scenario: If a worker finishes while not being watched by a scheduler, it never gets adopted by another scheduler in _adopt_completed_pods() as the pod is still 'Running', but the TI also isn't in a resettable state so scheduler_job never asks the executor to adopt it! It's in limbo - "complete" in Airflows view (based on TI state) but "Running" in k8s view (since the sidecar is still running).

This commit re-uses current Istio code and handles those pods.

(cherry picked from commit 3f309b0)
(cherry picked from commit 58cfc68)
(cherry picked from commit 92a8289)

[astro] Fix istio sidecar shutdown on newer GKE

Newer GKE verions have started to emit multiple running events for a
given pod with the sidecar still being shown as running. We will put
retries around shutting down the sidecar and also check the current
status of the sidecar, not just the status at the time of the event.

e.g: GKE > 1.18.20.901

(cherry picked from commit cbd50ef)
(cherry picked from commit d1025e1)
(cherry picked from commit d56ba74)
(cherry picked from commit 11a80ae)
(cherry picked from commit 1f0e8be)
(cherry picked from commit 20b0bad)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants