
Design proposal of stable scheduling in TiDB #466

Merged · 5 commits · May 15, 2019

Conversation

@cofyc (Contributor) commented May 8, 2019

What problem does this PR solve?

This expands the design proposal of #332.

https://github.com/cofyc/tidb-operator/blob/fix332-doc/docs/design-proposals/tidb-stable-scheduling.md

What is changed and how it works?

Check List

Has documents change

Does this PR introduce a user-facing change?:

NONE

@cofyc force-pushed the fix332-doc branch 4 times, most recently from 67348bc to f218b78 (May 8, 2019 10:00)

Review thread on the proposal text:

> In new predicate `StableScheduling`, we filter out other nodes for TiDB pod if ...

Contributor:

Can I specify the scheduling policy (i.e. whether StableScheduling is enabled) for a TidbCluster instance?

@cofyc (Contributor Author), May 9, 2019:

Yes, but I'd like to add a global feature switch to control this scheduling behavior for all TiDB cluster instances managed by tidb-operator. Because this is a best-effort policy, there is no harm if it fails.

I'm also wondering whether we should fail the pod (with a switch to control this) if it cannot be scheduled to its previous node.

@cofyc (Contributor Author), May 9, 2019:

One drawback is that its previous node may not be the best node for the new pod, e.g. another node may have more available CPU/memory resources. However, whether to use this scheduling policy depends on the cluster setup (using NodePort with the Local policy and needing to configure node IP addresses in the LB or in applications), so I think controlling this behavior globally is enough. Furthermore, we can intelligently bypass this policy if the TiDB service does not use the local externalTrafficPolicy.

Member:

I agree with adding a switch for this scheduling policy, as this feature is not suitable for all users.

@cofyc (Contributor Author):

One way is to control it globally, either with a flexible feature gate like this:

tidb-scheduler --features StableScheduling,FeatureA,FeatureB

which works like the Kubernetes feature gate, but simpler (all features disabled by default), or with a dedicated flag:

tidb-scheduler --enable-stable-scheduling

Another way is to control it per TidbCluster, by adding a field to the TidbCluster spec.

What do you prefer?
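For illustration only, here is a minimal sketch of what such a simple flag-based gate could look like. The `--features` flag name matches the example above; everything else (types, helper names) is hypothetical and not the actual tidb-scheduler code:

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// featureGate is a simplified, hypothetical gate: every feature is disabled
// unless it is explicitly listed on the command line.
type featureGate map[string]bool

// parseFeatures turns "StableScheduling,FeatureA" into a lookup set.
func parseFeatures(s string) featureGate {
	gate := featureGate{}
	for _, name := range strings.Split(s, ",") {
		if name = strings.TrimSpace(name); name != "" {
			gate[name] = true
		}
	}
	return gate
}

func (g featureGate) Enabled(name string) bool { return g[name] }

func main() {
	features := flag.String("features", "", "comma-separated list of features to enable")
	flag.Parse()

	gate := parseFeatures(*features)
	if gate.Enabled("StableScheduling") {
		fmt.Println("StableScheduling predicate is active")
	}
}
```

Running the binary as `tidb-scheduler --features StableScheduling` would then turn the predicate on, and omitting the flag would leave every feature off.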

Member:

I think it should be a global feature controlled by the operator, not by the TidbCluster CRD.

Contributor:

I suppose both options are valuable:

  • feature gate of the scheduler: controlled by the cluster admin; determines whether the current cluster is willing to support stable scheduling.
  • field of the TidbCluster CRD: controlled by the user, who makes the decision of whether to use local mode and stable scheduling.

Review thread on the proposal text:

> set `externalTrafficPolicy` of the service to `Local`. A side-effect is that the
> TiDB service will be accessible only on the nodes which have a running TiDB pod.
> To avoid manual intervention to update IP addresses in the load balancer when
> performing a rolling update, we prefer to schedule the new pod of a TiDB member
> to its previous node.
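For reference, a minimal client-go sketch of a NodePort service using the `Local` traffic policy mentioned in the quoted text; the service name, selector and port are illustrative and are not the manifests tidb-operator actually generates:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// tidbService builds a NodePort service whose traffic only reaches nodes
// that host a TiDB pod, because ExternalTrafficPolicy is set to Local.
func tidbService() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "demo-tidb", Namespace: "default"},
		Spec: corev1.ServiceSpec{
			Type:     corev1.ServiceTypeNodePort,
			Selector: map[string]string{"app.kubernetes.io/component": "tidb"},
			Ports:    []corev1.ServicePort{{Name: "mysql-client", Port: 4000}},
			// Local preserves the client source IP and skips the extra hop,
			// but the NodePort only answers on nodes with a running TiDB pod,
			// which is why the load balancer must know where those pods are.
			ExternalTrafficPolicy: corev1.ServiceExternalTrafficPolicyTypeLocal,
		},
	}
}

func main() {
	fmt.Println(tidbService().Spec.ExternalTrafficPolicy) // Local
}
```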
@onlymellb (Contributor), May 8, 2019:

I think it's not necessary to make sure that the new TiDB pod is deployed to the previous node, because we can add all the nodes of the Kubernetes cluster to the LB's backend; the LB can automatically remove the nodes without TiDB pods through its health check function. There is no need to manually update the backend IPs of the LB when the TiDB pod is deployed to another node.

Member:

In some scenarios, users may need Local mode for a NodePort service and want to bind only the exact nodes, rather than all of the nodes, to the external load balancer. When the Kubernetes cluster scales large enough, binding all nodes to the load balancer and relying on LB health checks is too heavy for the LB.

@cofyc (Contributor Author), May 9, 2019:

The best solution is to implement an operator for the load balancer, but that is beyond the scope of this document. Adding all nodes into the load balancer is a good alternative, but it depends on the load balancer:

  • it may be too heavy for the LB if the Kubernetes cluster is large (as described by @tennix)
    • NumberOfTiDBClusters x NumberOfNodes ports must be health-checked
  • every new node needs to be added into the backend of the LB
  • it is hard to monitor the LB (not all failed backends must be fixed)

Trying to schedule the new pod of a TiDB member to its previous node is the best we can do in tidb-operator, and it will work in certain circumstances. There is no harm if it fails.

@cofyc (Contributor Author):

However, I think we can improve this by restricting TiDB pods to a fixed set of nodes (by using NodeSelector/PodAffinity/Taints & Tolerations). Then we only need to add part of the nodes into the load balancer and never need to update them afterwards.

In my understanding, the solution proposed in this doc is similar, except that it does not require the user to pre-select nodes for TiDB pods, the fixed set of nodes is limited (it equals the number of TiDB pod instances), and the set may change if it is not possible to run the new pod on its previous node, in which case we need manual intervention.
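As a rough illustration of the node-pinning alternative described above (the node label and taint are made-up examples and are not part of this proposal):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// tidbPodSpec pins TiDB pods to a pre-selected group of nodes so that only
// those nodes ever need to be registered as backends of the load balancer.
func tidbPodSpec() corev1.PodSpec {
	return corev1.PodSpec{
		// Hypothetical label applied by the cluster admin to the chosen nodes.
		NodeSelector: map[string]string{"example.com/tidb-lb-backend": "true"},
		// Optionally tolerate a taint that keeps other workloads off those nodes.
		Tolerations: []corev1.Toleration{{
			Key:      "dedicated",
			Operator: corev1.TolerationOpEqual,
			Value:    "tidb",
			Effect:   corev1.TaintEffectNoSchedule,
		}},
		Containers: []corev1.Container{{Name: "tidb", Image: "pingcap/tidb"}},
	}
}

func main() {
	fmt.Println(tidbPodSpec().NodeSelector)
}
```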

Member:

Also, the stable scheduling policy is easier to understand for users who are not familiar with Kubernetes.

@cofyc force-pushed the fix332-doc branch 2 times, most recently from c7b4056 to d3995a2 (May 9, 2019 06:01)
@cofyc requested review from tennix, aylei, onlymellb and xiaojingchen and removed the request for aylei, onlymellb and xiaojingchen (May 9, 2019 11:03)
docs/design-proposals/tidb-stable-scheduling.md: six review threads marked outdated and resolved.
@gregwebs (Contributor) commented May 9, 2019

There is an issue here of loss of availability during the rolling update. A Deployment RollingUpdate will "surge" and add a new pod, temporarily increasing the replica set's replication by 1. Then an old pod will be taken down, and the replication will return to its original setting N. So the replication is always N or N + 1.
For a StatefulSet, I believe the replication will be N or N - 1 during the rollout, meaning a loss of availability.

Our operator could increase the StatefulSet capacity by 1 before the rollout to maintain the same level of availability. However, the current proposal's goal is to avoid automatically updating a load balancer, so that would not help. To avoid a loss of availability we would need to convince Kubernetes to schedule the new TiDB pod onto the same node while the previous TiDB is still running, and wait to cut over to the new TiDB when it is ready. However, it doesn't seem possible to schedule two TiDB pods to the same node due to capacity limitations, and there could be an OOM during the cutover.
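For illustration, the surge behavior described above maps onto the Deployment rolling-update knobs in this sketch; the values are illustrative (they express the N / N + 1 pattern) and are not settings that tidb-operator applies:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// deploymentStrategy shows why a Deployment keeps N or N+1 replicas during a
// rollout: it may surge one extra pod before an old pod is taken down.
func deploymentStrategy() appsv1.DeploymentStrategy {
	maxSurge := intstr.FromInt(1)       // temporarily allow one pod above the desired count
	maxUnavailable := intstr.FromInt(0) // never drop below the desired count
	return appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxSurge:       &maxSurge,
			MaxUnavailable: &maxUnavailable,
		},
	}
}

func main() {
	fmt.Printf("%+v\n", deploymentStrategy())
}
```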

I am wondering what the specific issue is with the scalability of the load balancer? Why can't we announce just the nodes running TiDB to the load balancer?

@cofyc (Contributor Author) commented May 10, 2019

> Our operator could increase the StatefulSet capacity by 1 before the rollout to maintain the same level of availability. However, the current proposal's goal is to avoid automatically updating a load balancer, so that would not help. To avoid a loss of availability we would need to convince Kubernetes to schedule the new TiDB pod onto the same node while the previous TiDB is still running, and wait to cut over to the new TiDB when it is ready. However, it doesn't seem possible to schedule two TiDB pods to the same node due to capacity limitations, and there could be an OOM during the cutover.

In my testing, the operator does not increase the TiDB StatefulSet capacity by 1 (cc @tennix to confirm). Avoiding a loss of availability is not the purpose of this design. The purpose of this design is to add functionality to our scheduler extender (tidb-scheduler) to schedule the new pod of a TiDB member back to its original node if possible. One of our clients needs to configure TiDB instances in an existing load balancer. They do the rolling update manually (lowering the weight and removing the backend in the LB before terminating a TiDB instance, to reduce the impact of connection disruptions; related feature: pausing the rolling update) and want to avoid manual intervention as much as possible.

Workflow:

  • demo-tidb-2 is running on the node kube-node-2
  • the user performs an update
  • after the pod of demo-tidb-2 is terminated, the new pod of demo-tidb-2 is created
  • kube-scheduler sends the feasible nodes which can run the pod demo-tidb-2 to tidb-scheduler (the scheduler extender)
  • tidb-scheduler filters out the other nodes if the original node is among these nodes, so kube-scheduler will choose kube-node-2 to run demo-tidb-2
    • note that if kube-node-2 is among the nodes sent from kube-scheduler, it meets all criteria for demo-tidb-2
  • tidb-scheduler does nothing if the original node is not among these nodes (e.g. not enough resources are left for demo-tidb-2 because another pod was assigned to kube-node-2 after demo-tidb-2 was deleted), and kube-scheduler will prioritize all feasible nodes to find the best match
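For illustration, a heavily simplified sketch of the filter step in the workflow above. This is not the actual tidb-scheduler code, and how the previous node is looked up is assumed to be solved elsewhere:

```go
package main

import "fmt"

// filterStableScheduling keeps only the pod's previous node when that node is
// still among the feasible nodes returned by kube-scheduler's predicates;
// otherwise it leaves the candidate list untouched (best-effort behavior).
func filterStableScheduling(previousNode string, feasibleNodes []string) []string {
	for _, node := range feasibleNodes {
		if node == previousNode {
			// The previous node already satisfies every other predicate,
			// so restricting the result to it pins the pod back in place.
			return []string{previousNode}
		}
	}
	// The previous node is not feasible (e.g. not enough resources left),
	// so fall back to normal scheduling over all feasible nodes.
	return feasibleNodes
}

func main() {
	feasible := []string{"kube-node-1", "kube-node-2", "kube-node-3"}
	fmt.Println(filterStableScheduling("kube-node-2", feasible)) // [kube-node-2]
	fmt.Println(filterStableScheduling("kube-node-9", feasible)) // all three nodes
}
```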

> I am wondering what the specific issue is with the scalability of the load balancer? Why can't we announce just the nodes running TiDB to the load balancer?

We have a discussion about this in the review thread above.

@cofyc requested a review from weekface (May 10, 2019 04:06)
@tennix (Member) commented May 10, 2019

@gregwebs @cofyc Yes, using a StatefulSet means one pod will be unavailable during the upgrade. We could change this behavior by using a Deployment: the rolling update of a Deployment ensures no service degradation.

However, a Deployment does not guarantee the upgrade order, which is required for tidb-server. tidb-server internally has the DDL owner, and this requires a special upgrading order which cannot be fulfilled by a Deployment.

Besides, upgrading tidb-server will definitely close client connections, which results in a service degradation. So it's reasonable to use a StatefulSet for tidb-server.
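For illustration, a client-go sketch of the ordered StatefulSet rollout being referred to; StatefulSet's RollingUpdate strategy replaces pods one at a time in reverse ordinal order (demo-tidb-2, then demo-tidb-1, ...), which gives the deterministic upgrade order mentioned above. The partition value is an arbitrary example showing how a rollout can be held back:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// statefulSetStrategy returns a RollingUpdate strategy for a StatefulSet.
func statefulSetStrategy(partition int32) appsv1.StatefulSetUpdateStrategy {
	return appsv1.StatefulSetUpdateStrategy{
		Type: appsv1.RollingUpdateStatefulSetStrategyType,
		RollingUpdate: &appsv1.RollingUpdateStatefulSetStrategy{
			// Only pods with an ordinal >= partition are updated; lowering the
			// partition step by step effectively pauses/resumes the rollout.
			Partition: &partition,
		},
	}
}

func main() {
	fmt.Printf("%+v\n", statefulSetStrategy(2))
}
```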

@tennix (Member) commented May 10, 2019

For bare-metal deployments, there is usually no automatic load balancer to use, so these users have to use a NodePort-type service. TiDB also has a source-IP-based permission system; to use this feature we have to use the Local mode of the NodePort service. However, this mode has some drawbacks:

  • imbalanced traffic spreading, as documented here
  • it is too heavy for the external load balancer to do health checks when there are too many Kubernetes nodes and NodePort services

So in these scenarios, users would prefer to announce only the nodes on which TiDB pods are running.

@weekface (Contributor) previously approved these changes May 10, 2019:

LGTM

@tennix (Member) previously approved these changes May 10, 2019:

LGTM

@gregwebs (Contributor):

> Besides, upgrading tidb-server will definitely close client connections, which results in a service degradation. So it's reasonable to use a StatefulSet for tidb-server.

We will eventually need to support connection draining. If someone is running a long analytic query, it will be interrupted, but we can guarantee that transactions will not be.

I still don't understand the health check overhead issue. But it seems the issue comes down to the load balancer not knowing which nodes have TiDB? I would think it is possible for software to solve this, but maybe users don't want to configure API access to update the load balancer? Maybe it is possible for the load balancer to make an outward request for an updated configuration rather than do its own health checks?

@tennix (Member) commented May 10, 2019

Users would think it's unreasonable to attach all the Kubernetes nodes to the load balancer. Suppose there are a thousand nodes (of course there are no such users right now) and one of the TiDB clusters runs only one TiDB pod; if we announce all nodes, the load balancer has to health-check those thousand nodes every now and then. This sounds absurd to them.

Besides, not all users can or want to let Kubernetes automatically configure their external load balancer. These users need to configure the load balancer backends manually according to where we place the TiDB pods.

@LinuxGit (Contributor) left a comment:

LGTM

Review thread on the proposal text:

> meets all criteria to `demo-tidb-2`
> - tidb-scheduler does nothing if the original node does not exist in these
>   nodes (e.g. not enough resources left for demo-tidb-2 if another pod is
>   assigned to kube-node-2 after kube-demo-2 is deleted), kube-scheduler will

Contributor:

s/kube-demo-2/demo-tidb-2 ?

@cofyc (Contributor Author):

Fixed, thanks!


Review thread on the proposal text:

> ### Cannot schedule new pod of TiDB member back to its node if the node does not meet new requirements
>
> If we upgrade TiDB pods to request more resources, it is possible that its node node

Contributor:

s/its node node/its node

@cofyc (Contributor Author):

Fixed too.

@cofyc dismissed the stale reviews from tennix and weekface via 1a9b392 (May 13, 2019 02:25)
@tennix (Member) left a comment:

LGTM

@cofyc (Contributor Author) commented May 15, 2019

/run-e2e-tests

@cofyc (Contributor Author) commented May 15, 2019

Hi @weekface, the PR has been rebased on master, PTAL again.

@weekface (Contributor) left a comment:

lgtm

@weekface merged commit 4f574f8 into pingcap:master on May 15, 2019