
Improve kube deploy process. #13397

Merged
merged 4 commits into master from lmossman/fix-kube-deploys on Jun 2, 2022
Conversation

lmossman
Contributor

@lmossman lmossman commented Jun 1, 2022

What

Resolves #13144

As the issue linked above describes, if a user tries to perform a rolling update of a kube deployment of Airbyte, they may run into two issues: kubectl may throw an error saying that the bootloader pod cannot be edited, and if a new db pod spins up while the old one is still running, the underlying database can be left permanently broken. Even if users follow the upgrade instructions in our docs, i.e. upgrade in a non-rolling fashion, they can still hit both issues depending on how quickly they execute the commands.
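
For reference, the non-rolling flow is roughly of the following shape; the kustomize overlay path is an assumption for illustration and may not match the docs exactly:

    # Tear the existing deployment down completely, then bring up the new version,
    # instead of letting kubectl roll pods over in place.
    kubectl delete -k kube/overlays/stable
    kubectl apply -k kube/overlays/stable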

This PR attempts to fix both issues.

How

I tried a few strategies to fix the bootloader problem:

  • I tried using generateName instead of name so that the bootloader pod would always have a unique name. This didn't work because kubectl apply cannot be used on a kube resource that does not have a name field, and kubectl apply is what our docs currently instruct users to use and may be important in the future for rolling deploys.
  • I tried switching the bootloader to be of kind Deployment instead of Pod. This was bad because it caused the bootloader to be run repeatedly; Deployment is not the right resource type for a one-time process.
  • I tried switching the bootloader to be of kind Job instead of Pod. This still had the issue where running kubectl apply with some changes to env variables would throw this error:
    The Job "airbyte-bootloader" is invalid: spec.template: Invalid value: core.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"controller-uid":"96239008-b40f-432c-bcdc-fb258208b81e", "job-name":"airbyte-bootloader"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:core.PodSpec{Volumes:[]core.Volume(nil), InitContainers:[]core.Container(nil), Containers:[]core.Container{core.Container{Name:"airbyte-bootloader-container", Image:"airbyte/bootloader:dev", Command:[]string(nil), Args:[]string(nil), WorkingDir:"", Ports:[]core.ContainerPort(nil), EnvFrom:[]core.EnvFromSource(nil), Env:[]core.EnvVar{core.EnvVar{Name:"AIRBYTE_VERSION", Value:"", ValueFrom:(*core.EnvVarSource)(0xc00cca1e60)}, core.EnvVar{Name:"DATABASE_HOST", Value:"", ValueFrom:(*core.EnvVarSource)(0xc00cca1e80)}, core.EnvVar{Name:"DATABASE_PORT", Value:"", ValueFrom:(*core.EnvVarSource)(0xc00cca1ea0)}, core.EnvVar{Name:"DATABASE_PASSWORD", Value:"", ValueFrom:(*core.EnvVarSource)(0xc00cca1ec0)}, core.EnvVar{Name:"DATABASE_URL", Value:"", ValueFrom:(*core.EnvVarSource)(0xc00cca1ee0)}, core.EnvVar{Name:"DATABASE_USER", Value:"", ValueFrom:(*core.EnvVarSource)(0xc00cca1f00)}}, Resources:core.ResourceRequirements{Limits:core.ResourceList(nil), Requests:core.ResourceList(nil)}, VolumeMounts:[]core.VolumeMount(nil), VolumeDevices:[]core.VolumeDevice(nil), LivenessProbe:(*core.Probe)(nil), ReadinessProbe:(*core.Probe)(nil), StartupProbe:(*core.Probe)(nil), Lifecycle:(*core.Lifecycle)(nil), TerminationMessagePath:"/dev/termination-log", TerminationMessagePolicy:"File", ImagePullPolicy:"IfNotPresent", SecurityContext:(*core.SecurityContext)(nil), Stdin:false, StdinOnce:false, TTY:false}}, EphemeralContainers:[]core.EphemeralContainer(nil), RestartPolicy:"Never", TerminationGracePeriodSeconds:(*int64)(0xc00889bb38), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"ClusterFirst", NodeSelector:map[string]string(nil), ServiceAccountName:"", AutomountServiceAccountToken:(*bool)(nil), NodeName:"", SecurityContext:(*core.PodSecurityContext)(0xc00c4cc700), ImagePullSecrets:[]core.LocalObjectReference(nil), Hostname:"", Subdomain:"", SetHostnameAsFQDN:(*bool)(nil), Affinity:(*core.Affinity)(nil), SchedulerName:"default-scheduler", Tolerations:[]core.Toleration(nil), HostAliases:[]core.HostAlias(nil), PriorityClassName:"", Priority:(*int32)(nil), PreemptionPolicy:(*core.PreemptionPolicy)(nil), DNSConfig:(*core.PodDNSConfig)(nil), ReadinessGates:[]core.PodReadinessGate(nil), RuntimeClassName:(*string)(nil), Overhead:core.ResourceList(nil), EnableServiceLinks:(*bool)(nil), TopologySpreadConstraints:[]core.TopologySpreadConstraint(nil)}}: field is immutable
    
  • I tried adding a ttlSecondsAfterFinished to the bootloader job, so that the bootloader pod is automatically deleted after it completes. This fixed the above issue and allowed me to use kubectl apply to freely switch between stable and dev (as long as I waited for the bootloader pod to be automatically deleted). A rough sketch of the resulting Job manifest is shown below.
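
For reference, a minimal sketch of what the bootloader Job roughly looks like with this approach (the container name, image, and secret reference are taken from the fragments elsewhere in this PR; everything else is illustrative):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: airbyte-bootloader
    spec:
      # Delete the Job (and its pod) shortly after completion so that a later
      # `kubectl apply` creates a fresh Job instead of trying to mutate the
      # immutable spec.template of the finished one.
      ttlSecondsAfterFinished: 5
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: airbyte-bootloader-container
              image: airbyte/bootloader:dev
              env:
                - name: DATABASE_USER
                  valueFrom:
                    secretKeyRef:
                      name: airbyte-secrets
                      key: DATABASE_USER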

For the db pod issue, adding the Recreate strategy to the db deployment manifest seems to have fully fixed the issue, as it causes kube to first terminate the existing db pod before spinning up a new one.
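
A minimal sketch of that change on the db Deployment; the resource name, labels, and image are illustrative, and only the strategy block is the actual fix:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: airbyte-db
    spec:
      # Recreate terminates the existing db pod before the new one is created,
      # so the old and new db pods never run at the same time.
      strategy:
        type: Recreate
      selector:
        matchLabels:
          airbyte: db
      template:
        metadata:
          labels:
            airbyte: db
        spec:
          containers:
            - name: airbyte-db-container
              image: airbyte/db:dev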

Recommended reading order

any

@github-actions github-actions bot added the area/platform (issues related to the platform) and kubernetes labels Jun 1, 2022
secretKeyRef:
  name: airbyte-secrets
  key: DATABASE_USER
ttlSecondsAfterFinished: 5
Contributor Author

As mentioned in the PR description, this is necessary to avoid an error about the airbyte-bootloader job being immutable. This isn't an ideal solution, because it means the bootloader pod is deleted after it completes, making its logs inaccessible through kube.

This was the only solution I could come up with that allowed us to still use kubectl apply without issue though, so it may be a worthwhile tradeoff.

Definitely open to feedback here if there are any other options I haven't considered.
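
(For what it's worth, while the finished job still exists, i.e. only for the few seconds the 5-second TTL allows, the logs can still be pulled the usual way:)

    # Assumes the default namespace and the job name used in this PR.
    kubectl logs job/airbyte-bootloader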

Contributor

can you add a comment explaining why we do this, please?

Contributor Author

Sure thing, done!

@lmossman lmossman requested a review from davinchia June 1, 2022 23:37
Contributor

@cgardens cgardens left a comment


nice!

-apiVersion: v1
-kind: Pod
+apiVersion: batch/v1
+kind: Job
Contributor

Kube jobs only have best effort parallelism guarantees, which is why I don't really like using them for crucial workflows. Took a look and confirmed this is probably the best way of doing this with Kustomize. Can we add a comment here that we generally want to use Pod (our Helm charts use Pod) for the best exactly-once execution guarantees and cannot do so because Kustomize does not support generateName? Want to prevent confusion in the future.

If Kustomize did support generateName, we should be able to instruct users to run kubectl create on initial create and replace on subsequent runs.

This happens relatively infrequently so risk is low. In the long term, I think we'll consolidate the Kube deploys into Helm so I think this is fine for now.
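
For context, a rough sketch of the generateName alternative referred to above; this stays hypothetical because Kustomize does not support generateName and kubectl apply requires a fixed name:

    apiVersion: batch/v1
    kind: Job
    metadata:
      # Each `kubectl create -f` would generate a unique name such as
      # airbyte-bootloader-x7k2p, sidestepping the immutable-spec error,
      # but kubectl apply (and Kustomize) cannot handle the missing name.
      generateName: airbyte-bootloader-
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: airbyte-bootloader-container
              image: airbyte/bootloader:dev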

Contributor

@davinchia davinchia left a comment

I appreciate the thoroughness and the detailed PR description. One note to explain why we are using a job here. Otherwise looks good!

@lmossman lmossman merged commit 88390f2 into master Jun 2, 2022
@lmossman lmossman deleted the lmossman/fix-kube-deploys branch June 2, 2022 22:27
chebalski added a commit to BluestarGenomics/airbyte that referenced this pull request Jul 25, 2022
Labels
area/platform (issues related to the platform), kubernetes
Development

Successfully merging this pull request may close these issues.

Prevent broken db state / improve OSS kube deployment process
4 participants