
Add Kubernetes topo implementation #5703

Merged: 2 commits merged into vitessio:master on Mar 18, 2020

Conversation

carsonoid
Contributor

@carsonoid carsonoid commented Jan 13, 2020

I wanted to get some initial feedback on my implementation of the Kubernetes topology service. This resolves #5701.

The code currently works to bring up a basic cluster on a Linux host using:

TOPO=k8s ./101_initial_cluster.sh

Known Issues that warrant discussion:

  • The current deployment uses k3s running locally to provide a minimal Kubernetes API. This means that the topology is not available to anyone but Linux users.
  • The binaries are much larger now thanks to the size of client-go.
  • Scale concerns:
    • All components currently maintain a cache of all the ConfigMaps because they all start the topo server. If there is too much topology data, this may not scale, and it could slow down vtctl or other short-lived processes that operate against the topology (a minimal sketch of this informer-style cache follows this list).
      • This may actually be fine. I'm not sure how much data is actually in the topology service of a large Vitess installation.
    • Client-go watching does not support objects > 1MB in size. How big can individual documents get in the topology?
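
To make the caching concern above concrete, here is a minimal, hypothetical sketch (not the PR's code; the namespace, kubeconfig path, and names are illustrative) of the client-go shared-informer pattern being described: each process that opens the topo builds an in-memory cache of every ConfigMap in the namespace and serves reads from that cache.

```go
// Hypothetical sketch of a per-process ConfigMap cache via client-go informers.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from a kubeconfig (the k3s default path, for illustration).
	config, err := clientcmd.BuildConfigFromFlags("", "/etc/rancher/k3s/k3s.yaml")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// The shared informer factory lists and watches every ConfigMap in the
	// namespace; this is the per-process cache the comment is concerned about.
	factory := informers.NewSharedInformerFactoryWithOptions(
		clientset, 30*time.Second, informers.WithNamespace("vitess"))
	cmInformer := factory.Core().V1().ConfigMaps()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	if !cache.WaitForCacheSync(stop, cmInformer.Informer().HasSynced) {
		panic("ConfigMap cache never synced")
	}

	// Reads are now served from the local cache, not the API server.
	cms, err := cmInformer.Lister().ConfigMaps("vitess").List(labels.Everything())
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d ConfigMaps cached locally\n", len(cms))
}
```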

@carsonoid
Contributor Author

@derekperkins feedback and discussion appreciated.

Contributor

@morgo morgo left a comment


A couple of small comments.

Review comments (resolved): bootstrap.sh, examples/local/k3s-up.sh
@derekperkins
Member

I think this is awesome, and all the getting-started docs should be updated after this merges to remove the etcd-operator dependency.

I'm very unfamiliar with the topo, so I'm not the best person to comment on the actual implementation, but I've got a couple of questions:

  1. How much larger are the binaries? Is it worth the effort to use a smaller library like https://github.com/ericchiang/k8s?
  2. Should this and the other topo implementations be put behind build tags? I think we would have to include the existing clients by default to avoid breaking backwards compatibility, but it would be nice to be able to exclude them from a custom build (a hypothetical sketch of a build-tag-gated plugin file follows this list).
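
As an illustration of the build-tag idea in question 2: a Vitess topo plugin file is just a blank import that registers the implementation, so gating it could be as small as a build constraint on that file. This is a hypothetical sketch; the file name, tag name, and import path are assumptions, not part of this PR.

```go
// Hypothetical build-tag-gated plugin file (e.g. go/cmd/vtctl/plugin_kubernetestopo.go).
// With this constraint, the k8s topo (and its client-go dependency) is only
// compiled in when building with: go build -tags k8stopo

// +build k8stopo

package main

import (
	// Blank import for side effects: the package's init() registers the
	// Kubernetes topo factory with the topo registry.
	_ "vitess.io/vitess/go/vt/topo/kubernetestopo"
)
```

Leaving the constraint off the existing plugin files would keep today's default builds unchanged, which addresses the backwards-compatibility concern above.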

Contributor Author

@carsonoid carsonoid left a comment


To answer @derekperkins' question: it adds ~20MB to each binary that includes client-go. Here are the size increases in detail.

| Component | Before | After | Increase |
| --- | --- | --- | --- |
| bin/vtcombo | 42M | 63M | 67% |
| bin/vtctl | 41M | 61M | 67% |
| bin/vtctld | 47M | 67M | 70% |
| bin/vtgate | 30M | 50M | 60% |
| bin/vttablet | 41M | 62M | 66% |
| bin/vtworker | 33M | 53M | 62% |

node.value = string(contents)

// Create "file" object
return &corev1.ConfigMap{


Rather than using ConfigMaps, I think you should probably define a CustomResourceDefinition for these. Doing it via a custom resource definition means that:

  • You can easily define separate ACLs for ConfigMaps and the Vitess Topology objects
  • You could create a schema to define your data if it is well-defined
  • You could make kubectl display useful default columns for vitess topology objects
  • Cluster administrators can shard each object type into separate etcd clusters behind the scenes (ex: massive kube clusters may actually put events objects in a different etcd cluster due to the update volume)
  • You don't pollute the output of kubectl get configmaps with thousands of topology-related objects

Member


For context, there's a separate Vitess operator that uses CRDs to manage the whole cluster, so not using them is a choice, not a lack of understanding. @carsonoid actually spoke about operators at Kubecon in December.

The purpose of this is to have a dead simple topo available for free, especially for quick onboarding and smaller Vitess installations. If you care to shard etcd or do anything else, I think you would then just use the direct etcd topo implementation and manage it yourself, which is what the current helm implementation uses via etcd-operator.

vtctlclient is the Vitess way to connect to the topo, so managing resources via kubectl seems unnecessary here. Maybe I'm incorrect, but I believe that using CRDs would require us to deploy a separate controller to manage them, which is what this PR is specifically trying to avoid.

All that being said, maybe there is a good enough reason to use CRDs or change how we're using ConfigMaps, I just wanted to step back to make sure we're looking at the overall reasons for adding a new topo implementation.

Contributor Author


Rather than using ConfigMaps, I think you should probably define a CustomResourceDefinition for these. ...

I 100% agree that there are a lot of benefits that come with using a CRD for the resource instead of a ConfigMap. In fact, I have gone back and forth on using them here. As @derekperkins stated: the goal is to allow for a dead-simple topology server for small Kubernetes-based Vitess clusters. But we do need to balance that with scale and stability concerns.

There is added overhead when using a CRD for the objects:

  • We would have to use https://github.com/kubernetes/code-generator to create the API and client code, and fit that into the current build system (a rough types.go sketch follows this comment).
  • We would need to figure out how to create the CRD definition when deploying Vitess to a new cluster. This may be added work for the end-user or for the code maintainers, depending on who ends up owning this process, i.e. do we have a Vitess component create and maintain the CRD definition automatically, or do we require that it be done as part of the first-time work of running Vitess in a cluster?

TLDR: The CM/CRD decision is still up in the air
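
For reference on the first bullet above, the input to kubernetes/code-generator is a hand-written types.go carrying generator markers; the deepcopy functions, typed clientset, informers, and listers are generated from it. A rough, hypothetical sketch for a topo node type (package path, field names, and schema are illustrative, not the PR's final design):

```go
// Hypothetical apis/topo/v1beta1/types.go sketch used as code-generator input.
package v1beta1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// +genclient
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// VitessTopoNode stores a single topology path and its serialized value.
type VitessTopoNode struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Data VitessTopoNodeData `json:"data,omitempty"`
}

// VitessTopoNodeData is an illustrative payload: the topo key plus its contents.
type VitessTopoNodeData struct {
	Key   string `json:"key,omitempty"`
	Value string `json:"value,omitempty"`
}

// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// VitessTopoNodeList is a list of VitessTopoNode objects.
type VitessTopoNodeList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`

	Items []VitessTopoNode `json:"items"`
}
```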


For context, there's a separate Vitess operator that uses CRDs to manage the whole cluster, so not using them is a choice, not a lack of understanding. @carsonoid actually spoke about operators at Kubecon in December.

Yeah, for background on my end, I lead the team that owns the Kubernetes clusters over at HubSpot so I definitely keep up with vitess development a bit.

The purpose of this is to have a dead simple topo available for free, especially for quick onboarding and smaller Vitess installations. If you care to shard etcd or do anything else, I think you would then just use the direct etcd topo implementation and manage it yourself, which is what the current helm implementation uses via etcd-operator.

There's no real reason this implementation cannot scale, however, and even for larger clusters there is a lot of benefit of doing this via the kubernetes API versus etcd. The etcd-operator alone isn't suitable to give larger orgs production-grade etcd clusters with appropriate auditing, backups, and authentication. The SQL team must then either own their own topology server, or depend on an etcd, zookeeper, or consul team. Removing that requirement and using the kubernetes API directly eliminates that burden.

vtctlclient is the Vitess way to connect to the topo, so managing resources via kubectl seems unnecessary here.

It's not unprecedented for us to go into the underlying topology store to debug or fix issues by hand, though. But yeah, the fields to display as kubectl columns would be pretty limited (maybe just the keyspace and tablet id).

Maybe I'm incorrect, but I believe that using CRDs would require us to deploy a separate controller to manage them, which is what this PR is specifically trying to avoid.

Creating a new CRD definition in the cluster is just a kubectl apply away, and then it's just another object you can put data in, so no additional controller code is necessary.
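
To illustrate the "no controller needed" point: once the CRD exists, any client can read and write the custom objects directly, exactly as it would with ConfigMaps. A minimal, hypothetical sketch using client-go's dynamic client (the group, version, resource, and field names are assumptions, and a recent client-go where these calls take a context is assumed):

```go
// Hypothetical sketch: storing topo data in a custom resource with the
// dynamic client. Nothing watches these objects; they are just stored data.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/etc/rancher/k3s/k3s.yaml")
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Illustrative guesses at the CRD's identity.
	gvr := schema.GroupVersionResource{
		Group: "topo.vitess.io", Version: "v1beta1", Resource: "vitesstoponodes",
	}

	node := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "topo.vitess.io/v1beta1",
		"kind":       "VitessTopoNode",
		"metadata":   map[string]interface{}{"name": "example-topo-node"},
		"data": map[string]interface{}{
			"key":   "/keyspaces/commerce/Keyspace",
			"value": "serialized topo contents go here",
		},
	}}

	created, err := dyn.Resource(gvr).Namespace("vitess").
		Create(context.Background(), node, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("stored", created.GetName())
}
```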

We would need to figure out how to create the CRD definition when deploying Vitess to a new cluster. This may be added work for the end-user or for the code maintainers, depending on who ends up owning this process, i.e. do we have a Vitess component create and maintain the CRD definition automatically, or do we require that it be done as part of the first-time work of running Vitess in a cluster?

For the case of creating the CRD definitions in the API, it seems reasonable enough to make someone kubectl apply it, but it also seems pretty reasonable to automatically create it if it doesn't exist and tell the user to do it manually if they hit a permission error. They already have to apply other things, so it shouldn't be too much of a burden on the user's end (and far, far less work than setting up etcd or zk) as long as the error messaging is clear when it hasn't been created yet.
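
That "create it if it doesn't exist, and give a clear message on a permission error" flow could look roughly like the sketch below. This is not this PR's code: the CRD name, group, and schema are illustrative, and a recent client-go/apiextensions clientset (where Create takes a context) is assumed.

```go
// Hypothetical sketch: ensure the topo CRD exists, tolerating "already exists"
// and turning an RBAC denial into an actionable message for the user.
package main

import (
	"context"
	"fmt"
	"log"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
)

func ensureTopoCRD(ctx context.Context, config *rest.Config) error {
	client, err := apiextensionsclient.NewForConfig(config)
	if err != nil {
		return err
	}

	preserve := true
	crd := &apiextensionsv1.CustomResourceDefinition{
		ObjectMeta: metav1.ObjectMeta{Name: "vitesstoponodes.topo.vitess.io"},
		Spec: apiextensionsv1.CustomResourceDefinitionSpec{
			Group: "topo.vitess.io",
			Scope: apiextensionsv1.NamespaceScoped,
			Names: apiextensionsv1.CustomResourceDefinitionNames{
				Plural: "vitesstoponodes", Singular: "vitesstoponode", Kind: "VitessTopoNode",
			},
			Versions: []apiextensionsv1.CustomResourceDefinitionVersion{{
				Name: "v1beta1", Served: true, Storage: true,
				Schema: &apiextensionsv1.CustomResourceValidation{
					// Accept arbitrary content; a real CRD could use a stricter schema.
					OpenAPIV3Schema: &apiextensionsv1.JSONSchemaProps{
						Type: "object", XPreserveUnknownFields: &preserve,
					},
				},
			}},
		},
	}

	_, err = client.ApiextensionsV1().CustomResourceDefinitions().Create(ctx, crd, metav1.CreateOptions{})
	switch {
	case err == nil, apierrors.IsAlreadyExists(err):
		return nil // created just now, or someone already applied it
	case apierrors.IsForbidden(err):
		// Clear message instead of a raw RBAC error, as suggested above.
		return fmt.Errorf("not allowed to create the topo CRD; ask a cluster admin to kubectl apply it: %w", err)
	default:
		return err
	}
}

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	if err := ensureTopoCRD(context.Background(), config); err != nil {
		log.Fatal(err)
	}
}
```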

Anyway, I think we probably agree that CRDs are the correct technical solution. I do understand that ConfigMaps are slightly easier to use, since they don't require the code generation or applying the CRD definition into the cluster, but if you can nail those two pieces down, those are really the only cons.

Member


There's no real reason this implementation cannot scale, however, and even for larger clusters there is a lot of benefit of doing this via the kubernetes API versus etcd. The etcd-operator alone isn't suitable to give larger orgs production-grade etcd clusters with appropriate auditing, backups, and authentication. The SQL team must then either own their own topology server, or depend on an etcd, zookeeper, or consul team. Removing that requirement and using the kubernetes API directly eliminates that burden.

That's great feedback. I hadn't expected someone at HubSpot's scale to be interested in putting the topo into k8s, so we can definitely revisit the expectations for this topo implementation. Questions I have around that:

  • Curious how that scales when using GKE or another hosted system that controls the masters
  • What about global vs separate zone topos? Unless you did something to shard those under the hood, you'd have a SPOF. (to be fair, that's how helm does it with everything in a single etcd-operator cluster)
  • Where do we draw the line between this topo implementation and the actual operator?

Member


Another important question - what happens when you're upgrading a cluster? How does that impact topo availability?


Curious how that scales when using GKE or another hosted system that controls the masters

What is your update frequency on these objects? Our kubernetes clusters push infinitely more QPS through etcd than vitess does to its topology backend. I doubt we'd even perceive a load increase. Consider that every pod or node status change or event incurs a write to the kubernetes API. We don't even run particularly large etcd servers and often push 100-400MiB/sec through them from kubernetes without breaking a sweat. A given pod is likely more load on the kubernetes API than your topology data for that pod. Best of all, in this case someone else owns your production grade etcd cluster and the ops aren't your problem.

What about global vs separate zone topos? Unless you did something to shard those under the hood, you'd have a SPOF. (to be fair, that's how helm does it with everything in a single etcd-operator cluster)

As it currently stands, we do not make use of global vs local because our current intention for multi-region is that every region is a fully independent shard of our entire stack. However, we actually do operate a "global" kubernetes API which is distributed across regions - it's etcd plus a kubernetes apiserver with no kubernetes nodes, which we use to stash custom resources that we then replicate into region-local kubernetes clusters. So in our case we actually would be capable of using a global kubernetes API for topology if we wanted to, but as long as the global topology can use a different backend configuration than the local topology, I think you'd cover all use cases.

Where do we draw the line between this topo implementation and the actual operator?

The topology backend should definitely be independent from the operator. You wouldn't want to force people onto the operator in order to use it. In general, I perceive the kubernetes API as a CRUD service on typed objects that all other components then use to stash their states in, so it's as good of a topology store as any other. When you create a pod (or anything) against the kubernetes API, nothing occurs synchronously - the scheduler, controller, and nodes all just watch the API to see if there are any declarative objects they are supposed to take action on.

Another important question - what happens when you're upgrading a cluster? How does that impact topo availability?

Kubernetes cluster upgrades are zero API downtime. The controller-manager and scheduler may have a blip in activity while all masters aren't on the same version, but the API is up the whole time.

Member

@derekperkins derekperkins Jan 22, 2020


as long as the global topology can use a different backend configuration than local topology, I think you'd cover all use cases.

If someone wanted to use different topology types, I'm pretty sure that would require an upstream change since there's only the one topo flag I'm aware of that controls the global and local plugin: -topo_implementation. In your case where you'd still be using the k8s topo for both global and local, that shouldn't require any extra work.

This all sounds great. I think I'm on team CRD now. Whether someone is getting started with helm or with the operator when it's more available, we should be able to register the CRDs silently so that the user doesn't have to know about the topo configuration. More advanced users can just apply them directly.

Member


@enisoc any feedback on this discussion? Anything that we're not considering?

Contributor Author


FWIW I'm also convinced about the CRD change (I was already leaning that way anyway). I will work on moving the current implementation over. Using ConfigMaps worked as a quick test of the concept, but the CRD path will be much better now that this is getting more traction.

@carsonoid carsonoid force-pushed the kubernetestopo branch 6 times, most recently from b6dcc42 to c238e28, on February 5, 2020 at 00:15
@carsonoid
Contributor Author

I've got the ConfigMap-based implementation working. I'm going to start work on converting it to a CustomResource.

@carsonoid
Contributor Author

I've got it completely moved over to CRDs now. I just want to run through the local example and make sure everything works and then we can talk about next steps.

@carsonoid carsonoid changed the title from "[WIP] Add Kubernetes topo implementation" to "Add Kubernetes topo implementation" on Feb 29, 2020
@carsonoid
Contributor Author

The full test is done! I've been able to run through the entire local example using the k8s api as a topology store and all of the examples worked!

@deepthi
Member

deepthi commented Mar 11, 2020

@carsonoid can you update the copyright on all the files to 2020, and also fix the DCO (--signoff)? That will kick the build so that we can get test status (it doesn't seem to have run properly the last time).

@carsonoid carsonoid force-pushed the kubernetestopo branch 2 times, most recently from 86c592e to 7b272ea, on March 11, 2020 at 15:54
@carsonoid
Contributor Author

@deepthi updated as requested. It looks like some of the tests are currently failing due to an unrelated maven repo timeout though. I'll retry them in a bit.

@carsonoid carsonoid force-pushed the kubernetestopo branch 6 times, most recently from 3385ff2 to 29a0fe1, on March 11, 2020 at 22:36
Member

@deepthi deepthi left a comment


Very nice work.
I understand that we need to run some kind of k8s server to act as the topo, and k3s is the most lightweight choice, though it is limited to Linux.
It's great that you have included all the topo tests that are available for the other flavors.
My comments are mostly nits and/or naming except for the incorrect copyright notices.

Review comments (all resolved) on: go/cmd/topo2topo/plugin_kubernetestopo.go, go/vt/topo/kubernetestopo/VitessTopoNodes-crd.yaml, go/vt/topo/kubernetestopo/config.go, go/vt/topo/kubernetestopo/directory.go, vendor/vendor.json
@carsonoid carsonoid force-pushed the kubernetestopo branch 3 times, most recently from 4cb91aa to eec1c82, on March 13, 2020 at 22:18
@carsonoid
Contributor Author

@deepthi thanks for all the nits :) I have updated nearly all of them as requested. I have also explicitly set up the tests to report as skipped when running on a non-Linux machine.
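
For reference, the usual Go pattern for that kind of platform skip is just a GOOS check at the top of the test; a generic sketch (not the PR's exact test code):

```go
// Generic sketch: skip on non-Linux platforms, since the k3s server used by
// these topo tests only runs on Linux.
package kubernetestopo

import (
	"runtime"
	"testing"
)

func TestKubernetesTopo(t *testing.T) {
	if runtime.GOOS != "linux" {
		t.Skipf("k3s-backed topo tests require Linux, running on %s", runtime.GOOS)
	}
	// ... start k3s and run the shared topo test suite here ...
}
```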

Member

@deepthi deepthi left a comment


Looks good with the changes (except for one copyright notice).
I can approve once tests are all passing.

Review comment (resolved): go/vt/vtctl/plugin_kubernetestopo.go
Signed-off-by: Carson Anderson <ca@carsonoid.net>
Signed-off-by: Carson Anderson <ca@carsonoid.net>
@carsonoid
Contributor Author

@deepthi updated as requested and the check page shows all green.

@morgo
Contributor

morgo commented Mar 16, 2020

Note to all: there was one check showing as yellow, the "local example on Ubuntu" job. This is because that check is marked as required and was renamed in this PR to "local example using etcd on ubuntu".

I have temporarily disabled the required check. We can re-enable it a couple of days after this PR merges, to give in-flight PRs a chance to merge without having to rebase on master (and pick up the rename).

Member

@derekperkins derekperkins left a comment


LGTM. I'm excited to have this merged!

@derekperkins
Member

@PaulFurtado Any last feedback before we merge this?

@derekperkins derekperkins merged commit f97eef3 into vitessio:master Mar 18, 2020
ajm188 added a commit to tinyspeck/vitess that referenced this pull request Jun 22, 2020
This reverts commit f97eef3, reversing changes made to 9e74b07.

Signed-off-by: Andrew Mason <amason@slack-corp.com>