
Design proposal of stable scheduling in TiDB #466

Merged (5 commits) on May 15, 2019

Conversation

@cofyc (Contributor) commented May 8, 2019

What problem does this PR solve?

This expands the design proposal of #332.

https://github.com/cofyc/tidb-operator/blob/fix332-doc/docs/design-proposals/tidb-stable-scheduling.md

What is changed and how it works?

Check List

Has documents change

Does this PR introduce a user-facing change?:

NONE

@cofyc cofyc force-pushed the fix332-doc branch 4 times, most recently from 67348bc to f218b78 Compare May 8, 2019 10:00

In new predicate `StableScheduling`, we filter out other nodes for TiDB pod if
Contributor:

Can I specify scheduling policy (whether enable StableScheduling) for a TidbCluster instance?

@cofyc (Contributor Author), May 9, 2019:

Yes, but I'd like to add a global feature switch to control this scheduling behavior for all TiDB cluster instances managed by tidb-operator. Because this is a best-effort policy, there is no harm if it fails.

I'm wondering whether we should fail the pod (with a switch to control this) if it cannot be scheduled to its previous node?

@cofyc (Contributor Author), May 9, 2019:

One drawback is that its previous node may not be the best node for the new pod, e.g. another node may have more available CPU/memory resources. However, whether to use this scheduling policy depends on the cluster setup (using NodePort with the Local policy and needing to configure node IP addresses in the LB or in applications), so I think controlling this behavior globally is enough. Furthermore, we can intelligently bypass this policy if the TiDB service does not use the Local externalTrafficPolicy.

Member:

I agree with adding a switch for this scheduling policy, as this feature is not suitable for all users.

Contributor Author:

One way is to control globally:

A flexible feature gate like this:

tidb-scheduler --features StableScheduling,FeatureA,FeatureB

It works like the Kubernetes feature gates, but simpler (all features are disabled by default).
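
For illustration only, a minimal sketch of such a flag in Go (the flag name matches the example above; the feature names and the parsing details are assumptions, not the actual tidb-scheduler code):

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// featureGates holds the set of features enabled via --features.
// Anything not listed on the command line stays disabled.
type featureGates map[string]bool

func (f featureGates) String() string { return "" }

// Set parses a comma-separated list such as "StableScheduling,FeatureA".
func (f featureGates) Set(value string) error {
	for _, name := range strings.Split(value, ",") {
		if name = strings.TrimSpace(name); name != "" {
			f[name] = true
		}
	}
	return nil
}

func (f featureGates) Enabled(name string) bool { return f[name] }

func main() {
	gates := featureGates{}
	flag.Var(gates, "features", "comma-separated list of features to enable, e.g. StableScheduling")
	flag.Parse()

	fmt.Println("StableScheduling enabled:", gates.Enabled("StableScheduling"))
}
```

Invoked as `tidb-scheduler --features StableScheduling,FeatureA,FeatureB`, only the listed features report as enabled.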

Or a dedicated flag:

tidb-scheduler --enable-stable-scheduling

Another way is to control it per TidbCluster, by adding a field to the TidbCluster CRD.

What do you prefer?

Member:

I think it should be a global feature controlled by the operator, not by the TidbCluster CRD.

Contributor:

I suppose both options are valuable:

  • feature gate of the scheduler: controlled by the cluster admin; determines whether the current cluster is willing to support stable scheduling.
  • field of the TidbCluster CRD: controlled by the user, who makes the decision of whether to use Local mode and stable scheduling.

set `externalTrafficPolicy` of service to `Local`. A side-effect is the service of
TiDB will be accessible only on the nodes which have a running TiDB pod. To avoid
manual intervention to update IP addresses in load balancer when performing
a rolling update, we prefer to schedule new pod of TiDB member to its previous node.
@onlymellb (Contributor), May 8, 2019:

I think it's not necessary to make sure that the new tidb pod is deployed to the previous node, because we can add all the nodes of the k8s cluster to the LB's backend; the LB can automatically remove the nodes without tidb pods through its health check function. There is no need to manually update the backend IPs of the LB when the tidb pod is deployed to other nodes.

Member:

In some scenarios, users may need Local mode for the NodePort service and want to bind only the exact nodes, rather than all of the nodes, to the external load balancer. When the k8s cluster grows large enough, binding all nodes to the load balancer and relying on the LB health check is too heavy for the LB.

@cofyc (Contributor Author), May 9, 2019:

The best solution is to implement an operator for the load balancer, but that's beyond the scope of this document. Adding all nodes into the load balancer is a good alternative solution, but it depends on the load balancer:

  • maybe too heavy for the LB if the Kubernetes cluster is large (as described by @tennix)
    • NumberOfTiDBClusters x NumberOfNodes ports must be health checked
  • need to add every new node into the backend of the LB
  • hard to monitor the LB (not every failed backend needs to be fixed)

Trying to schedule the new pod of a TiDB member to its previous node is the best we can do in tidb-operator, and it will work in certain circumstances. There is no harm if it fails.

Contributor Author:

However, I think we can improve this by restricting TiDB pods to a fixed set of nodes (by using NodeSelector/PodAffinity/Taints&Tolerations). Then, we only need to add part of the nodes into the load balancer and don't need to update them in the future.

In my understanding, the proposed solution in this doc is similar, except that it does not require the user to pre-select nodes for TiDB pods, the fixed set of nodes is smaller (equal to the number of TiDB pod instances), and the set may change if it's not possible to run the new pod on its previous node, in which case we need manual intervention.
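
As a rough sketch of the NodeSelector idea above (the `dedicated=tidb` label and the image tag are made up for illustration; node affinity or taints and tolerations would work similarly):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Only nodes carrying the hypothetical label dedicated=tidb are eligible
	// for this pod, so only those nodes ever need to be registered as
	// backends in the external load balancer.
	podSpec := corev1.PodSpec{
		NodeSelector: map[string]string{
			"dedicated": "tidb",
		},
		Containers: []corev1.Container{
			// Image tag is illustrative only.
			{Name: "tidb", Image: "pingcap/tidb:v2.1.8"},
		},
	}
	fmt.Printf("node selector: %+v\n", podSpec.NodeSelector)
}
```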

Member:

Also, the stable scheduling policy is easier to understand for users who are not familiar with Kubernetes.

@cofyc cofyc force-pushed the fix332-doc branch 2 times, most recently from c7b4056 to d3995a2 Compare May 9, 2019 06:01
@cofyc cofyc requested review from tennix, aylei, onlymellb and xiaojingchen and removed request for aylei, onlymellb and xiaojingchen May 9, 2019 11:03
@gregwebs (Contributor) commented May 9, 2019

There is an issue here of loss of availability during the rolling update. A Deployment RollingUpdate will "surge" and add a new pod, temporarily increasing the replica set replication by 1. Then an old pod will be taken down, and the replication will be at its original setting N. So the replication is always N or N + 1.
For a stateful set, I believe the replication will be N or N - 1 during the rollout, meaning a loss of availability.

Our operator could increase the statefulset capacity by 1 before the rollout to maintain the same level of availability. However, the current proposal's goal is to avoid automatically updating a load balancer, so that would not help. To avoid a loss of availability we would need to convince Kubernetes to schedule the new TiDB pod onto the same node while the previous TiDB is still running and wait to cut over to the new TiDB when the new TiDB is ready. However, it doesn't seem possible to schedule two TiDB pods to the same node due to capacity limitations, and there could be OOM during the cutover.

I am wondering what the specific issue is with the scalability of the load balancer? Why can't we announce just the nodes running TiDB to the load balancer?

@cofyc (Contributor Author) commented May 10, 2019

Our operator could increase the statefulset capacity by 1 before the rollout to maintain the same level of availability. However, the current proposal's goal is to avoid automatically updating a load balancer, so that would not help. To avoid a loss of availability we would need to convince Kubernetes to schedule the new TiDB pod onto the same node while the previous TiDB is still running and wait to cut over to the new TiDB when the new TiDB is ready. However, it doesn't seem possible to schedule two TiDB pods to the same node due to capacity limitations, and there could be OOM during the cutover.

In my testing, the operator does not increase the TiDB statefulset capacity by 1 (cc @tennix to confirm). Avoiding a loss of availability is not the purpose of this design. The purpose of this design is to add functionality to our scheduler extender (tidb-scheduler) to schedule the new pod of a TiDB member back to its original node if possible. One of our clients needs to configure TiDB instances in an existing load balancer. They do rolling updates manually (lowering the weight and removing the backend in the LB before terminating a TiDB instance to reduce the impact of connection disruptions; related feature: pausing the rolling update) and want to avoid manual intervention as much as possible.

Workflow:

  • demo-tidb-2 is running on the node kube-node-2
  • the user performs an update
  • after the pod of demo-tidb-2 is terminated, the new pod of demo-tidb-2 is created
  • kube-scheduler sends feasible nodes which can run the pod demo-tidb-2 to tidb-scheduler (scheduler extender)
  • tidb-scheduler filters out the other nodes if the original node exists among them; kube-scheduler will then choose kube-node-2 to run demo-tidb-2
    • note that if kube-node-2 exists in the nodes sent from kube-scheduler, it meets all the criteria for demo-tidb-2
  • tidb-scheduler does nothing if the original node does not exist in these nodes (e.g. not enough resources are left for demo-tidb-2 because another pod was assigned to kube-node-2 after the old demo-tidb-2 was deleted); kube-scheduler will prioritize all feasible nodes to find the best match (a sketch of this filtering logic follows)
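
A simplified sketch of that filtering step, using plain strings rather than the real scheduler-extender request/response types (where the previous node is recorded is left aside here; in this sketch it is just a parameter):

```go
package main

import "fmt"

// filterStableScheduling mirrors the behavior described above: if the pod's
// previous node is among the feasible nodes sent by kube-scheduler, keep only
// that node; otherwise return the feasible nodes unchanged (best effort).
func filterStableScheduling(previousNode string, feasibleNodes []string) []string {
	for _, node := range feasibleNodes {
		if node == previousNode {
			return []string{previousNode}
		}
	}
	return feasibleNodes
}

func main() {
	// demo-tidb-2 previously ran on kube-node-2, which is still feasible,
	// so the extender narrows the candidates down to that single node.
	fmt.Println(filterStableScheduling("kube-node-2", []string{"kube-node-1", "kube-node-2", "kube-node-3"}))
	// If kube-node-2 is no longer feasible, the candidate list is left as-is.
	fmt.Println(filterStableScheduling("kube-node-2", []string{"kube-node-1", "kube-node-3"}))
}
```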

I am wondering what the specific issue is with the scalability of the load balancer? Why can't we announce just the nodes running TiDB to the load balancer?

We have a discussion about this here.

@tennix (Member) commented May 10, 2019

@gregwebs @cofyc Yes, using StatefulSet means there will be a pod unavailable during the upgrade. We can change this behavior by using Deployment. The rolling update of Deployment ensures no service degradation.

However, Deployment doesn't guarantee the upgrade order, which is required for tidb-server. Tidb-server internally has a DDL owner, and this requires a special upgrade order which cannot be fulfilled by Deployment.

Besides, upgrading tidb-server will definitely close client connections, which results in a service degradation. So it's reasonable to use StatefulSet for tidb-server.

@tennix (Member) commented May 10, 2019

For bare-metal deployments, there is usually no automatic load balancer to use. These users have to use a NodePort type service. And TiDB has a source-IP-based permission system; to use this feature we have to use Local mode for the NodePort service. However, this mode has some drawbacks:

  • imbalanced traffic spreading, as documented here
  • it's too heavy for the external load balancer to do health checks when there are too many k8s nodes and NodePort services.

So in these scenarios, users would prefer to announce only the nodes on which tidb pods are running.
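
For context, the Local mode discussed here is the `externalTrafficPolicy` field on the TiDB Service. A minimal sketch using the core Kubernetes API types (the service name, selector labels, and port layout are illustrative assumptions):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// With Local, traffic arriving at a node's NodePort is delivered only to
	// pods on that node and the client source IP is preserved, which is why
	// only nodes actually running a tidb pod are useful LB backends.
	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "demo-tidb"}, // name is illustrative
		Spec: corev1.ServiceSpec{
			Type:                  corev1.ServiceTypeNodePort,
			ExternalTrafficPolicy: corev1.ServiceExternalTrafficPolicyTypeLocal,
			Selector:              map[string]string{"app.kubernetes.io/component": "tidb"}, // assumed label
			Ports: []corev1.ServicePort{
				{Name: "mysql-client", Port: 4000},
			},
		},
	}
	fmt.Println("externalTrafficPolicy:", svc.Spec.ExternalTrafficPolicy)
}
```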

weekface previously approved these changes May 10, 2019

@weekface (Contributor) left a comment:

LGTM

tennix previously approved these changes May 10, 2019

@tennix (Member) left a comment:

LGTM

@gregwebs (Contributor):

Besides, upgrading tidb-server will definitely close client connections, which results in a service degradation. So it's reasonable to use StatefulSet for tidb-server.
We will eventually need to support connection draining. If someone is running a long analytic query, it will be interrupted, but we can guarantee that transactions will not be.

I still don't understand the health check overhead issue. But it seems the issue comes down to the load balancer not knowing which nodes have TiDB? I would think that it is possible for software to solve this, but maybe users don't want to configure API access to update the load balancer? Maybe it is possible for the load balancer to make an outward request for an updated configuration rather than do its own health checks?

@tennix (Member) commented May 10, 2019

Users would think it's unreasonable to attach all the k8s nodes to the load balancer. Suppose there are a thousand nodes (of course there are no such users right now) and one of the tidb clusters runs only one tidb pod; if we announce all nodes, then the load balancer has to do health checks on these thousand nodes every now and then. This sounds absurd to them.

Besides, not all users can or want to let Kubernetes automatically configure their external load balancer. These users need to configure the load balancer backends manually according to where the tidb pods are placed.

@LinuxGit (Contributor) left a comment:

LGTM

meets all criteria to `demo-tidb-2`
- tidb-scheduler does nothing if the original node does not exist in these
nodes (e.g. not enough resources left for demo-tidb-2 if another pod is
assigned to kube-node-2 after kube-demo-2 is deleted), kube-scheduler will
Contributor:

s/kube-demo-2/demo-tidb-2 ?

Contributor Author:

Fixed, Thanks!


### Cannot schedule new pod of TiDB member back to its node if the node does not meet new requirements

If we upgrade TiDB pods to request more resources, it is possible that its node node
Contributor:

s/its node node/its node

Contributor Author:

Fixed too.

@tennix (Member) left a comment:

LGTM

@cofyc (Contributor Author) commented May 15, 2019

/run-e2e-tests

@cofyc (Contributor Author) commented May 15, 2019

Hi @weekface, the PR has been rebased on master, PTAL again.

@weekface (Contributor) left a comment:

lgtm

@weekface weekface merged commit 4f574f8 into pingcap:master May 15, 2019