Don't create an impossible disruption budget for smaller clusters #384
Conversation
I disagree here, because our disruption budget is always impossible to resolve by design: it always covers only the master pod, of which there is always exactly one. The intent is that the operator coordinates moving and failing over database pods, rather than the "Kubernetes cluster" manager. The operator also contains code to handle single-master pods with no replicas; you can trigger a migration by setting or removing labels on nodes. Our Kubernetes infrastructure does have a timeout set: if the operator has not managed to move the master by that point, the pod gets killed.
We also allow single-node clusters in our test environment; master nodes there should not go away voluntarily despite hosting test PG clusters, hence the need for small disruption budgets.
If that's the case, then the operator needs to watch for evictions (nodes being set unschedulable) or something and implement its own version of draining nodes marked for maintenance :) Otherwise this will block
@coderanger please watch our KubeCon talk "Continuously Deliver your Kubernetes Infrastructure" on how we do rolling cluster upgrades: it's not a naive rolling node upgrade, but involves a bit more coordination (for PostgreSQL).
@hjacobs That doesn't really apply to things like GKE and EKS though, and there should at least be options to work in those environments since they are very common :) I can move this elsewhere; maybe if the pod disruption budget name is …
Anyone have a strong feeling on how to structure it? Happy to do either the object-name-template approach above or a new flag, or something else that's better :)
At the moment there is no setting in the operator to turn off creation of the PDB for the Postgres cluster. Normally, even for a cluster with one pod, the operator should do the right thing and delete the master pod when the node is being drained and the 'ready' label is removed. In short, the procedure we use is: remove the ready label from all old Kubernetes nodes, create the new N nodes, drain the old N nodes; the operator will pick up all master pods from those old N nodes and move them to one of the new ones. How is that specific to AWS, and what should be done differently on GKE or EKS? The legitimate case I see here is when one wants to opt out completely from our node rolling update scheme for some reason. We could have a workaround of updating the PDB with "MinAvailable: 0" if and only if the cluster is declared with the number of instances set to 0.
pkg/cluster/k8sres.go (outdated)
@@ -1052,6 +1052,11 @@ func (c *Cluster) generateCloneEnvironment(description *acidv1.CloneDescription)
func (c *Cluster) generatePodDisruptionBudget() *policybeta1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(1)

	// Avoid creating an unsatisfiable budget.
	if c.Spec.NumberOfInstances <= 1 {
I'd change it to < 1. A minAvailable: 1 for a cluster with one pod is easy to satisfy.
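For illustration, here is how the suggested "< 1" condition would behave. This is a standalone sketch, not the PR's code: numberOfInstances stands in for c.Spec.NumberOfInstances, and it assumes (per the MinAvailable: 0 idea above) that the relaxed budget is expressed as minAvailable: 0.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

// pdbMinAvailable mirrors the suggested check: only a cluster with zero
// instances gets a relaxed budget, while a one-pod cluster keeps
// minAvailable: 1, which its single master pod can satisfy.
func pdbMinAvailable(numberOfInstances int32) intstr.IntOrString {
	if numberOfInstances < 1 {
		return intstr.FromInt(0)
	}
	return intstr.FromInt(1)
}

func main() {
	for _, n := range []int32{0, 1, 2} {
		m := pdbMinAvailable(n)
		fmt.Printf("instances=%d -> minAvailable=%s\n", n, m.String())
	}
}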
@alexeyklyukin When GKE runs an upgrade, it runs the whole process itself. Having to manually cycle pods is technically doable but creates substantial complexity that doesn't need to be there. The expected behavior for most applications is: if they run multiple replicas, set a disruption budget with a minimum of 1 so that at least one pod will be migrated before the node cycles. For a single-replica service there will have to be downtime during a node cycle, so a pod disruption budget just blocks the upgrade forever (or until a human goes in and manually kills the pod). Requiring a human operator to manage these things is kind of against the point of automating the cluster. In a multi-replica database I would expect the normal failover system to kick in and fix things, so I don't want node drains to get blocked pending human interaction. If you have very high uptime requirements I can see the argument for manually triggering the failover, I guess, but for most of my databases I'm cool with a second or two of downtime as a failover runs because one of the pods was killed, and in the case of a single instance I've accepted that there may be momentary outages when pods get moved around :)
@coderanger So what about setting numberOfInstances to 0 explicitly for the clusters that are expected to have downtime during the upgrade? If the operator reacts by setting minAvailable to 0 on the PDB, I am fine with that, as the user gave explicit consent for the downtime by setting the cluster size to 0.
I would expect numberOfInstances: 0 to scale down the existing database pods? That would also have to be done on every database object every time there is a Kubernetes upgrade to run (or any node maintenance that requires draining).
(it's very possible I misunderstand how the operator sets the number of replicas in the statefulset :)
Yes, setting the number of instances to 0 terminates all pods of the existing cluster. You can do this for every cluster just before the upgrade if you are fine with upgrade-related downtime. Otherwise, this would be needed only for those clusters running just one pod; for the rest, you can either wait for the replica pods to terminate and start on new nodes and then switch over to them manually, or use the existing upgrade procedure in the operator and figure out how to make Google remove the readiness label from the nodes that are about to be drained and assign it to the new nodes. In any case, it will not make things worse, as a naive upgrade on GKE will almost certainly lead to at least one downtime (when the master is terminated without any prior notice it may take up to 30 seconds to fail over to the replica) and in many cases more than one (if the master pod fails over, one or more times, to a node that is itself about to be decommissioned). By devising a special upgrade procedure we avoid exactly that issue of having multiple downtimes for a cluster during a single Kubernetes upgrade.
I would be fine with a feature toggle that sets the minimum number of instances for a PDB to be created, if this helps the GKE deployment, which we have not had enough time to test right now (e.g. the default value would be 1 pod, but you could change it to 2 and then skip the PDB for 1-pod clusters). Otherwise we need some time to explain how to trigger the "pod migration" the operator does; to my memory you just need to set the end-of-life label. As Oleksii points out, the important part is to reduce the number of failovers for clusters with 2+ pods: you do not want to fail over to a replica still running on an old node, but to a replica that has already been moved to an updated node.
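A rough sketch of what such a threshold toggle could compute. The parameter name minInstancesForPDB is an assumption for illustration, not an existing operator option.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

// minAvailableForCluster sketches the proposed toggle: clusters with fewer
// pods than the configured threshold get a budget of 0 (so their master pod
// may be evicted during a drain); larger clusters keep minAvailable: 1.
// A default threshold of 1 would preserve today's behaviour.
func minAvailableForCluster(numberOfInstances, minInstancesForPDB int32) intstr.IntOrString {
	if numberOfInstances < minInstancesForPDB {
		return intstr.FromInt(0)
	}
	return intstr.FromInt(1)
}

func main() {
	// Threshold raised to 2: a 1-pod cluster no longer blocks node drains.
	m := minAvailableForCluster(1, 2)
	fmt.Println("minAvailable:", m.String())
}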
A quick look at the call sites for "moveMasterPodsOffNode" would probably help. It seems we watch for nodes becoming unschedulable and for the removal of the ready label. (The absence of the ready label makes Postgres pods start on new nodes, given the affinity config.)
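As an illustration of that check, a sketch of how "end of life" detection along those lines could look. The helper name and the label key are assumptions, not the operator's actual identifiers.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// nodeIsEndOfLife is a hypothetical helper mirroring the description above:
// a node counts as going away when it has been cordoned (marked unschedulable)
// or when its readiness label has been removed.
func nodeIsEndOfLife(node *corev1.Node, readyLabelKey string) bool {
	if node.Spec.Unschedulable {
		return true
	}
	_, hasReadyLabel := node.Labels[readyLabelKey]
	return !hasReadyLabel
}

func main() {
	node := &corev1.Node{}
	node.Spec.Unschedulable = true
	// "node-ready" is a placeholder label key for this example.
	fmt.Println("end of life:", nodeIsEndOfLife(node, "node-ready"))
}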
While I'm 100% on board that the Zalando-style node upgrade/maintenance workflow is better, things like GKE and Kops upgrades are what they are :) I can rework this using some new options. I think the clearest way would be a …
…ich shouldn't matter anyway but be nicer to the scheduler).
Updated to add a new …
Anything further I should tweak on this?
Looks good, although I'm a bit concerned about the name of this option - I'm not sure I would personally trust a project with an "allow to screw my db" option :) What about being flexible to the full extent and just have …
Heh, there was a test for that back in the previous implementation. I can switch the option to direct control over the budget minimum, though :)
What happened to it?
Yes, that would be nice. Other than that I don't see any issues with this proposal.
What should it do if someone puts a value other than 0 or 1?
Since the whole purpose of this feature is to add a bit more flexibility, I would argue that in this case the operator just needs to log a warning and proceed with the provided value (but of course the default value should be 1). Any other opinions?
I mean, it's a hard error if it's more than 1, I think? The disruption budget only targets the master pod, and there should only ever be one of those other than during a failover? 2 or more would be a permanently unsatisfiable budget in that case.
With the current setup, yes, but my vivid imagination helps me think of the operator managing a sharded cluster with more than one master. At the same time it's pretty far-fetched for now, so you're probably right: let's throw a validation error if min availability is more than 1.
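A minimal sketch of that validation, assuming the value arrives as a plain integer; the function name is hypothetical, not taken from the PR.

package main

import (
	"fmt"
	"log"
)

// validatePDBMinAvailable follows the conclusion above: since the budget only
// covers the single master pod, anything above 1 can never be satisfied and is
// rejected; 0 and 1 are accepted.
func validatePDBMinAvailable(minAvailable int) error {
	if minAvailable < 0 || minAvailable > 1 {
		return fmt.Errorf("pod disruption budget minAvailable must be 0 or 1, got %d", minAvailable)
	}
	return nil
}

func main() {
	for _, v := range []int{0, 1, 2} {
		if err := validatePDBMinAvailable(v); err != nil {
			log.Printf("validation error: %v", err)
			continue
		}
		fmt.Printf("minAvailable=%d accepted\n", v)
	}
}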
@coderanger @erthalion do you have any updates on this PR?
Looks like this PR would also resolve #547.
@@ -1052,6 +1052,11 @@ func (c *Cluster) generateCloneEnvironment(description *acidv1.CloneDescription)
func (c *Cluster) generatePodDisruptionBudget() *policybeta1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(1)

	// If master disruption is enabled or there is no master, set the budget to 0.
	if c.Spec.AllowMasterDisruption || c.Spec.NumberOfInstances <= 0 {
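For context, the rest of the function presumably turns the computed value into the budget object, roughly along these lines. This is a sketch against the policy/v1beta1 API; the budget name and label keys are chosen for illustration rather than taken from the operator.

package main

import (
	"fmt"

	policybeta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// buildPDB shows the general shape: the computed minAvailable is attached to a
// budget whose selector matches only the master pod of the given cluster.
func buildPDB(clusterName, namespace string, minAvailable intstr.IntOrString) *policybeta1.PodDisruptionBudget {
	return &policybeta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "postgres-" + clusterName + "-pdb", // naming scheme is illustrative
			Namespace: namespace,
		},
		Spec: policybeta1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				// Only the current master pod is covered by the budget.
				MatchLabels: map[string]string{
					"cluster-name": clusterName,
					"spilo-role":   "master",
				},
			},
		},
	}
}

func main() {
	pdb := buildPDB("acid-minimal-cluster", "default", intstr.FromInt(1))
	fmt.Println(pdb.Name, "minAvailable:", pdb.Spec.MinAvailable.String())
}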
Do we create the pod disruption budget when we scale up?
I'll test with kind
So, simply scaling to 0 and then back to 1 doesn't work out of the box. Syncing gets interrupted while syncing roles:
time="2019-05-24T14:01:14Z" level=debug msg="updating statefulset" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2019-05-24T14:01:14Z" level=debug msg="updating statefulset annotations" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2019-05-24T14:01:14Z" level=debug msg="syncing roles" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2019-05-24T14:01:15Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.110.86.179:5432: connect: connection refused" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2019-05-24T14:01:30Z" level=error msg="could not connect to PostgreSQL database: dial tcp 10.110.86.179:5432: connect: connection refused" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2019-05-24T14:01:44Z" level=debug msg="closing database connection" cluster-name=default/acid-minimal-cluster pkg=cluster worker=0
time="2019-05-24T14:01:44Z" level=info msg="cluster has been updated" cluster-name=default/acid-minimal-cluster pkg=controller worker=0
Deleting the operator pod helps to sync the roles, so that the PDB is synced too (and set to minAvailable: 1).
time="2019-05-24T14:10:53Z" level=debug msg="statefulset's rolling update annotation has been set to false" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2019-05-24T14:10:53Z" level=debug msg="syncing roles" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2019-05-24T14:10:53Z" level=debug msg="closing database connection" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2019-05-24T14:10:53Z" level=debug msg="syncing databases" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2019-05-24T14:10:53Z" level=debug msg="closing database connection" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2019-05-24T14:10:53Z" level=debug msg="syncing pod disruption budgets" cluster-name=default/acid-minimal-cluster pkg=cluster
time="2019-05-24T14:10:53Z" level=info msg="cluster has been synced" cluster-name=default/acid-minimal-cluster pkg=controller worker=0
Tolerations []v1.Toleration `json:"tolerations,omitempty"`
Sidecars []Sidecar `json:"sidecars,omitempty"`
PodPriorityClassName string `json:"pod_priority_class_name,omitempty"`
AllowMasterDisruption bool `json:"allow_master_disruption,omitempty"`
is "allow" a good word? should we not just stick with "enable"?
And then it should also default to "enabled=True"
Agree. And should it be introduced to all the different config targets?
Thanks @coderanger for your contribution. I have followed up on your work and created a configuration toggle for enabling/disabling the PDB instead of having a manifest parameter. Hope it helps your workflows.
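Such an operator-level toggle would presumably reduce to something like the following. The option name, field, and struct tag are illustrative assumptions; check the operator's configuration reference for the actual parameter.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

// OperatorConfig holds a hypothetical operator-wide toggle; defaulting it to
// true keeps today's behaviour unless an administrator opts out.
type OperatorConfig struct {
	EnablePodDisruptionBudget bool `json:"enable_pod_disruption_budget"`
}

// effectiveMinAvailable combines the toggle with the cluster size: the budget
// is relaxed to 0 when the toggle is off or the cluster has no pods at all.
func effectiveMinAvailable(cfg OperatorConfig, numberOfInstances int32) intstr.IntOrString {
	if !cfg.EnablePodDisruptionBudget || numberOfInstances <= 0 {
		return intstr.FromInt(0)
	}
	return intstr.FromInt(1)
}

func main() {
	cfg := OperatorConfig{EnablePodDisruptionBudget: false}
	m := effectiveMinAvailable(cfg, 1)
	fmt.Println("minAvailable:", m.String())
}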
When the cluster has 1 node (or probably 0 nodes, but that wasn't the case I hit), there is still a disruption budget with minimum=1, so the single existing node can never be killed. I hit this because it left those database nodes blocking a pre-upgrade node drain :) If you've already opted to create a single-node cluster, it seems fair to allow it to be disrupted if needed.