Design proposal of stable scheduling in TiDB #466
(force-pushed from 67348bc to f218b78)
> In new predicate `StableScheduling`, we filter out other nodes for TiDB pod if …
Can I specify the scheduling policy (whether to enable `StableScheduling`) for a `TidbCluster` instance?
Yes, but I'd like to add a global feature switch to control this scheduling behavior for all TiDB cluster instances managed by tidb-operator. Because this is a best-effort policy, there is no harm if it fails.
I'm wondering whether we should fail the pod (with a switch to control this) if it cannot be scheduled to its previous node?
One drawback is that its previous node may not be the best node for the new pod, e.g. another node may have more available CPU/memory resources. However, whether to use this scheduling policy depends on the cluster setup (using NodePort with the Local policy and needing to configure node IP addresses in the LB or in applications), so I think controlling this behavior globally is enough. Furthermore, we can bypass this policy automatically if the TiDB service does not use the `Local` externalTrafficPolicy.
I agree with adding a switch for this scheduling policy, as this feature is not suitable for all users.
One way is to control it globally:
- A flexible feature gate: `tidb-scheduler --features StableScheduling,FeatureA,FeatureB`, which works like the Kubernetes feature gate but simpler (all features disabled by default).
- Or a dedicated flag: `tidb-scheduler --enable-stable-scheduling`

Another way is to control it per TidbCluster, by adding a field to the TidbCluster CRD.
What do you prefer?
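The feature-gate flavor could be parsed with a minimal sketch like this (all names here are hypothetical; the real tidb-scheduler flag handling may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// featureGates tracks which optional scheduler features are enabled.
// All features are disabled by default, as proposed above.
type featureGates map[string]bool

// parseFeatures parses a comma-separated --features flag value,
// e.g. "StableScheduling,FeatureA,FeatureB".
func parseFeatures(flagValue string) featureGates {
	gates := featureGates{}
	for _, name := range strings.Split(flagValue, ",") {
		if name = strings.TrimSpace(name); name != "" {
			gates[name] = true
		}
	}
	return gates
}

// Enabled reports whether a named feature was requested.
func (g featureGates) Enabled(name string) bool { return g[name] }

func main() {
	gates := parseFeatures("StableScheduling,FeatureA")
	fmt.Println(gates.Enabled("StableScheduling")) // true
	fmt.Println(gates.Enabled("FeatureB"))         // false
}
```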
I think it should be a global feature controlled by the operator, not by the TidbCluster CRD.
I suppose both options are valuable:
- feature gate of the scheduler: controlled by the cluster admin, determines whether the current cluster is willing to support stable scheduling.
- field of the `TidbCluster` CRD: controlled by the user, who makes the decision whether to use local mode and stable scheduling.
> set `externalTrafficPolicy` of the service to `Local`. A side effect is that the
> TiDB service will be accessible only on nodes where a TiDB pod is running. To avoid
> manual intervention to update IP addresses in the load balancer when performing
> a rolling update, we prefer to schedule the new pod of a TiDB member to its previous node.
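The best-effort behavior described above can be sketched as a simple node filter (a hypothetical function, not the actual scheduler-extender code): keep only the previous node when it is still among the candidates, otherwise fall back to all candidates so that default scheduling proceeds unharmed.

```go
package main

import "fmt"

// stableSchedulingFilter keeps only the pod's previous node when it is
// among the candidate nodes; otherwise it returns all candidates so
// scheduling falls back to the default behavior (best effort, no harm
// if the previous node is gone or unschedulable).
func stableSchedulingFilter(previousNode string, candidates []string) []string {
	for _, node := range candidates {
		if node == previousNode {
			return []string{previousNode}
		}
	}
	return candidates
}

func main() {
	// Previous node still schedulable: pin the pod to it.
	fmt.Println(stableSchedulingFilter("kube-node-2", []string{"kube-node-1", "kube-node-2"}))
	// Previous node unavailable: leave the candidate list untouched.
	fmt.Println(stableSchedulingFilter("kube-node-9", []string{"kube-node-1", "kube-node-2"}))
}
```

In the real extender this filtering would happen inside the `StableScheduling` predicate, after kube-scheduler has already applied its own predicates to produce the candidate list.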
I think it's not necessary to make sure the new TiDB pod is deployed to its previous node: we can add all nodes of the k8s cluster to the LB's backend, and the LB can automatically remove the nodes without TiDB pods through its health-check function. There is then no need to manually update the LB's backend IPs when a TiDB pod is deployed to another node.
In some scenarios, users may need `Local` mode for a `NodePort` service and want to bind only the exact nodes, rather than all of the nodes, to the external load balancer. When the k8s cluster scales large enough, binding all nodes to the load balancer and relying on LB health checks is too heavy for the LB.
The best solution is to implement an operator for the load balancer, but that is beyond the scope of this document. Adding all nodes to the load balancer is a good alternative, but it depends on the load balancer:
- may be too heavy for the LB if the Kubernetes cluster is large (described by @tennix)
- NumberOfTiDBCluster x NumberOfNodes ports must be health-checked
- every new node must be added to the LB backend
- hard to monitor the LB (not all failed backends need to be fixed)

Trying to schedule the new pod of a TiDB member to its previous node is the best we can do in tidb-operator, and it will work in certain circumstances. There is no harm if it fails.
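To make the health-check overhead concrete (illustrative numbers only, assuming the LB probes one NodePort per TiDB cluster on every attached node):

```go
package main

import "fmt"

func main() {
	const clusters, nodes, tidbPodsPerCluster = 10, 1000, 3

	// Attaching every node: each cluster's NodePort is probed on all nodes.
	allNodesChecks := clusters * nodes
	// Attaching only nodes actually running TiDB pods.
	tidbNodesChecks := clusters * tidbPodsPerCluster

	fmt.Println(allNodesChecks, tidbNodesChecks) // 10000 30
}
```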
However, I think we can improve this by restricting TiDB pods to a fixed set of nodes (using NodeSelector/PodAffinity/Taints & Tolerations). Then we only need to add part of the nodes to the load balancer, with no need to update them in the future.
In my understanding, the proposed solution in this doc is similar, except that it does not require the user to pre-select nodes for TiDB pods; the fixed set of nodes is limited (equal to the number of TiDB pod instances) and may change if it is not possible to run the new pod there, in which case manual intervention is needed.
Also, the stable scheduling policy is easier to understand for users not familiar with Kubernetes.
(force-pushed from c7b4056 to d3995a2)
There is an issue here of loss of availability during the rolling update. A Deployment RollingUpdate will "surge" and add a new pod, temporarily increasing the replica set's replication by 1; then an old pod is taken down, returning replication to its original setting N. So the replication is always N or N + 1. Our operator could increase the StatefulSet capacity by 1 before the rollout to maintain the same level of availability. However, the current proposal's goal is to avoid automatically updating a load balancer, so that would not help. To avoid a loss of availability we would need to convince Kubernetes to schedule the new TiDB pod onto the same node while the previous TiDB is still running, and wait to cut over to the new TiDB when it is ready. However, it doesn't seem possible to schedule two TiDB pods to the same node due to capacity limitations, and there could be an OOM during the cutover. I am wondering what the specific issue is with the scalability of the load balancer. Why can't we announce just the nodes running TiDB to the load balancer?
In my testing, the operator does not increase the TiDB StatefulSet capacity by 1 (cc @tennix to confirm). Avoiding a loss of availability is not the purpose of this design. The purpose is to add functionality to our scheduler extender (tidb-scheduler) to schedule the new pod of a TiDB member back to its original node if possible. One of our clients needs to configure TiDB instances in an existing load balancer. They do rolling updates manually (lowering the weight and removing the backend in the LB before terminating a TiDB instance, to reduce the impact of connection disruptions; related feature: pause the rolling update) and want to avoid manual intervention as much as possible. Workflow:
We have a discussion about this here.
@gregwebs @cofyc Yes, using … However, … Besides, upgrading tidb-server will definitely close client connections, which results in a service degradation. So it's reasonable to use …
For bare-metal deployment, there is usually no automatic load balancer to use. These users have to use …
So in these scenarios, users would prefer to announce only the nodes where TiDB pods are running.
LGTM
LGTM
I still don't understand the health-check overhead issue. But it seems the issue comes down to the load balancer not knowing which nodes have TiDB? I would think it is possible for software to solve this, but maybe users don't want to configure API access to update the load balancer? Maybe it is possible for the load balancer to make an outward request for an updated configuration rather than do its own health checks?
Users would think it's unreasonable to attach all the k8s nodes to the load balancer. Suppose there are a thousand nodes (of course there are no such users right now) and one of the TiDB clusters runs only one TiDB pod; if we announce all nodes, the load balancer has to continuously health-check those thousand nodes, which sounds absurd to them. Besides, not all users can or want to let Kubernetes automatically configure their external load balancer. These users need to configure the load balancer backends manually according to where TiDB pods are placed.
LGTM
> meets all criteria to `demo-tidb-2`
> - tidb-scheduler does nothing if the original node does not exist in these
>   nodes (e.g. not enough resources left for demo-tidb-2 if another pod is
>   assigned to kube-node-2 after kube-demo-2 is deleted), kube-scheduler will …
s/kube-demo-2/demo-tidb-2?
Fixed, Thanks!
> ### Cannot schedule new pod of TiDB member back to its node if the node does not meet new requirements
>
> If we upgrade TiDB pods to request more resources, it is possible that its node node …
s/its node node/its node
Fixed too.
LGTM
/run-e2e-tests
Hi @weekface, the PR has been rebased on master, PTAL again.
lgtm
What problem does this PR solve?
This expands the design proposal of #332.
https://github.com/cofyc/tidb-operator/blob/fix332-doc/docs/design-proposals/tidb-stable-scheduling.md
What is changed and how it works?
Check List
Has document change
Does this PR introduce a user-facing change?: