
Improve kube deploy process. #13397

Merged
merged 4 commits into master from lmossman/fix-kube-deploys on Jun 2, 2022
Conversation

lmossman
Contributor

@lmossman lmossman commented Jun 1, 2022

What

Resolves #13144

As the issue linked above describes, if a user tries to perform a rolling update of a kube deployment of Airbyte, they may run into two issues: kubectl may throw an error saying that the bootloader pod cannot be edited, and if a new db pod spins up while the old one is still running, the underlying database can be left permanently broken. Even if users follow the upgrade instructions in our docs, i.e. upgrade in a non-rolling fashion, they can still hit both issues depending on how quickly they execute the commands.
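
For reference, the non-rolling flow is roughly of the following shape; the kustomize overlay path is an assumption for illustration and may not match the docs exactly:

    # Tear the existing deployment down completely, then bring up the new version,
    # instead of letting kubectl roll pods over in place.
    kubectl delete -k kube/overlays/stable
    kubectl apply -k kube/overlays/stable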

This PR attempts to fix both issues.

How

I tried a few strategies to fix the bootloader problem:

  • I tried using generateName instead of name so that the bootloader pod would always have a unique name. This didn't work because kubectl apply cannot be used on a kube resource that does not have a name field, and kubectl apply is what our docs currently instruct users to use and may be important in the future for rolling deploys.
  • I tried switching the bootloader to be of kind Deployment instead of Pod. This was bad because it caused the bootloader to be run repeatedly; Deployment is not the right resource type for a one-time process.
  • I tried switching the bootloader to be of kind Job instead of Pod. This still had the issue where running kubectl apply with some changes to env variables would throw this error:
    The Job "airbyte-bootloader" is invalid: spec.template: Invalid value: core.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"controller-uid":"96239008-b40f-432c-bcdc-fb258208b81e", "job-name":"airbyte-bootloader"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:core.PodSpec{Volumes:[]core.Volume(nil), InitContainers:[]core.Container(nil), Containers:[]core.Container{core.Container{Name:"airbyte-bootloader-container", Image:"airbyte/bootloader:dev", Command:[]string(nil), Args:[]string(nil), WorkingDir:"", Ports:[]core.ContainerPort(nil), EnvFrom:[]core.EnvFromSource(nil), Env:[]core.EnvVar{core.EnvVar{Name:"AIRBYTE_VERSION", Value:"", ValueFrom:(*core.EnvVarSource)(0xc00cca1e60)}, core.EnvVar{Name:"DATABASE_HOST", Value:"", ValueFrom:(*core.EnvVarSource)(0xc00cca1e80)}, core.EnvVar{Name:"DATABASE_PORT", Value:"", ValueFrom:(*core.EnvVarSource)(0xc00cca1ea0)}, core.EnvVar{Name:"DATABASE_PASSWORD", Value:"", ValueFrom:(*core.EnvVarSource)(0xc00cca1ec0)}, core.EnvVar{Name:"DATABASE_URL", Value:"", ValueFrom:(*core.EnvVarSource)(0xc00cca1ee0)}, core.EnvVar{Name:"DATABASE_USER", Value:"", ValueFrom:(*core.EnvVarSource)(0xc00cca1f00)}}, Resources:core.ResourceRequirements{Limits:core.ResourceList(nil), Requests:core.ResourceList(nil)}, VolumeMounts:[]core.VolumeMount(nil), VolumeDevices:[]core.VolumeDevice(nil), LivenessProbe:(*core.Probe)(nil), ReadinessProbe:(*core.Probe)(nil), StartupProbe:(*core.Probe)(nil), Lifecycle:(*core.Lifecycle)(nil), TerminationMessagePath:"/dev/termination-log", TerminationMessagePolicy:"File", ImagePullPolicy:"IfNotPresent", SecurityContext:(*core.SecurityContext)(nil), Stdin:false, StdinOnce:false, TTY:false}}, EphemeralContainers:[]core.EphemeralContainer(nil), RestartPolicy:"Never", TerminationGracePeriodSeconds:(*int64)(0xc00889bb38), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"ClusterFirst", NodeSelector:map[string]string(nil), ServiceAccountName:"", AutomountServiceAccountToken:(*bool)(nil), NodeName:"", SecurityContext:(*core.PodSecurityContext)(0xc00c4cc700), ImagePullSecrets:[]core.LocalObjectReference(nil), Hostname:"", Subdomain:"", SetHostnameAsFQDN:(*bool)(nil), Affinity:(*core.Affinity)(nil), SchedulerName:"default-scheduler", Tolerations:[]core.Toleration(nil), HostAliases:[]core.HostAlias(nil), PriorityClassName:"", Priority:(*int32)(nil), PreemptionPolicy:(*core.PreemptionPolicy)(nil), DNSConfig:(*core.PodDNSConfig)(nil), ReadinessGates:[]core.PodReadinessGate(nil), RuntimeClassName:(*string)(nil), Overhead:core.ResourceList(nil), EnableServiceLinks:(*bool)(nil), TopologySpreadConstraints:[]core.TopologySpreadConstraint(nil)}}: field is immutable
    
  • I tried adding a ttlSecondsAfterFinished to the bootloader job, so that the bootloader pod is automatically deleted after it completes. This fixed the above issue and allowed me to use kubectl apply to freely switch between stable and dev (as long as I waited for the bootloader pod to be automatically deleted). A rough sketch of the resulting Job manifest is shown below.
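
For reference, a minimal sketch of what the bootloader Job roughly looks like with this approach (the container name, image, and secret reference are taken from the fragments elsewhere in this PR; everything else is illustrative):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: airbyte-bootloader
    spec:
      # Delete the Job (and its pod) shortly after completion so that a later
      # `kubectl apply` creates a fresh Job instead of trying to mutate the
      # immutable spec.template of the finished one.
      ttlSecondsAfterFinished: 5
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: airbyte-bootloader-container
              image: airbyte/bootloader:dev
              env:
                - name: DATABASE_USER
                  valueFrom:
                    secretKeyRef:
                      name: airbyte-secrets
                      key: DATABASE_USER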

For the db pod issue, adding the Recreate strategy to the db deployment manifest seems to have fully fixed the issue, as it causes kube to first terminate the existing db pod before spinning up a new one.
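
A minimal sketch of that change on the db Deployment; the resource name, labels, and image are illustrative, and only the strategy block is the actual fix:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: airbyte-db
    spec:
      # Recreate terminates the existing db pod before the new one is created,
      # so the old and new db pods never run at the same time.
      strategy:
        type: Recreate
      selector:
        matchLabels:
          airbyte: db
      template:
        metadata:
          labels:
            airbyte: db
        spec:
          containers:
            - name: airbyte-db-container
              image: airbyte/db:dev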

Recommended reading order

any

@github-actions github-actions bot added the area/platform (issues related to the platform) and kubernetes labels Jun 1, 2022
secretKeyRef:
  name: airbyte-secrets
  key: DATABASE_USER
ttlSecondsAfterFinished: 5
Contributor Author

As mentioned in the PR description, this is necessary to avoid an error about the airbyte-bootloader job being immutable. This isn't an ideal solution, because it means the bootloader pod is deleted after it completes, making its logs inaccessible through kube.

This was the only solution I could come up with that allowed us to still use kubectl apply without issue though, so it may be a worthwhile tradeoff.

Definitely open to feedback here if there are any other options I haven't considered.
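
(For what it's worth, while the finished job still exists, i.e. only for the few seconds the 5-second TTL allows, the logs can still be pulled the usual way:)

    # Assumes the default namespace and the job name used in this PR.
    kubectl logs job/airbyte-bootloader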

Contributor

can you add a comment explaining why we do this, please?

Contributor Author

Sure thing, done!

@lmossman lmossman requested a review from davinchia June 1, 2022 23:37
Contributor

@cgardens cgardens left a comment


nice!

-apiVersion: v1
-kind: Pod
+apiVersion: batch/v1
+kind: Job
Contributor

Kube jobs only have best effort parallelism guarantees, which is why I don't really like using them for crucial workflows. Took a look and confirmed this is probably the best way of doing this with Kustomize. Can we add a comment here that we generally want to use Pod (our Helm charts use Pod) for the best exactly-once execution guarantees and cannot do so because Kustomize does not support generateName? Want to prevent confusion in the future.

If Kustomize did support generateName, we should be able to instruct users to run kubectl create on initial create and replace on subsequent runs.

This happens relatively infrequently so risk is low. In the long term, I think we'll consolidate the Kube deploys into Helm so I think this is fine for now.
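
For context, a rough sketch of the generateName alternative referred to above; this stays hypothetical because Kustomize does not support generateName and kubectl apply requires a fixed name:

    apiVersion: batch/v1
    kind: Job
    metadata:
      # Each `kubectl create -f` would generate a unique name such as
      # airbyte-bootloader-x7k2p, sidestepping the immutable-spec error,
      # but kubectl apply (and Kustomize) cannot handle the missing name.
      generateName: airbyte-bootloader-
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: airbyte-bootloader-container
              image: airbyte/bootloader:dev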

Contributor

@davinchia davinchia left a comment

I appreciate the thoroughness and the detailed PR description. One note to explain why we are using a job here. Otherwise looks good!

@lmossman lmossman merged commit 88390f2 into master Jun 2, 2022
@lmossman lmossman deleted the lmossman/fix-kube-deploys branch June 2, 2022 22:27
chebalski added a commit to BluestarGenomics/airbyte that referenced this pull request Jul 25, 2022
Labels
area/platform (issues related to the platform), kubernetes
Development

Successfully merging this pull request may close these issues.

Prevent broken db state / improve OSS kube deployment process
4 participants