backend/kubernetes: Add Kubernetes as a backend #19525
Conversation
(force-pushed from c7b959b to f6b4be6)
@apparentlymart Can you (or someone) take a look at the build failure? I can't reproduce the failure locally and I am unsure how it relates to my changes. Thanks.
You don't need lots of infra for this kind of test. Simply running …
Any chance of getting this in 0.12?
The v0.12.0 release cycle is already active, so only bug fixes will be merged until it is released. We will review other proposals once v0.12 is released and stable. Sorry for the delay!
First of all, I love it, and it's exactly what I need now that the kubernetes-provider has proper support for ingress and affinity.
Nonetheless I have to say these changes finally make Terraform a viable option for bare-metal clusters (especially now that version 1.7 of the Kubernetes provider has been released as well), and I hope that a release including this feature will happen in the near future now that 0.12.0 has been released. Also, since it makes testing easier for others, here's a simple …
here's the secret …
and just to be complete, here's the …
(force-pushed from a0e49aa to 5f1e983)
@Bobonium Thanks for the feedback. To address your points: …
@dramich thanks for considering my two cents on the matter, it's absolutely awesome. @apparentlymart now that Terraform 0.12.x has been released, can we expect this to be reviewed/merged in the near future, or is there anything I can do to help get this merged soon? I'd like to go back to the official release, but I (and probably everyone else who uses the kubernetes provider and found this PR) just can't go back without this feature.
Hi @dramich, Thanks for the PR! This looks good on a first pass. For backends we are still manually running the tests which require external resources. You can add the verbose test output for this package to the initial comment of the PR (you can strip the logs and keep just the test output itself if you want).
Hey @jbardin, I have appended the test output to the initial comment; let me know if you need anything else.
(force-pushed from 39efc27 to 22012ae)
@jbardin Hello, any updates on this?
Hi @dramich, We're currently evaluating how we want to handle backend contributions and the ensuing support of the backend code. While we would like to be able to separate them from the core code, there is a lot of work to do before that can happen. I think the biggest concern with this particular backend is the >300k lines of dependency code it brings along, which is a significant surface area and also impacts all Terraform releases with a 30% increase in binary size. Do you know if this could be done with a simpler client implementation of some sort, one that doesn't rely on the bulk of the k8s.io packages?
Hey @jbardin, I understand your concerns. I have updated the PR to use a different client, which has significantly cut down the PR size while still keeping some of the niceties of dealing with the kubeconfig and some help from the client libraries. The tests are still passing as well. Let me know what you think. (I will squash commits after you take a look.)
Thanks for the extra effort here @dramich! @alexsomesan, could I get another 👀 from someone more familiar with k8s? Thanks!
Sure, I'm having a look. |
Overall the K8s client code looks good and clean. I've dropped a couple of comments, but most importantly I'd like to understand the handling of the result of Create calls (which aren't atomic).
    if !k8serrors.IsAlreadyExists(err) {
        return "", err
    }
    // The secret was created between the get and create, grab it and keep going
I'm a bit confused by this. It's a known fact that the Create operation in K8s' client-go is not atomic, so it makes sense to check whether "our" Create call actually produced the object we're later manipulating. What confuses me is why the next line just reads the already existing object (presumably produced outside of this client) and moves on using it.
If we attempt to create the secret and it already exists, we just need to fetch it and then fall back to the same flow as if we had found the secret on the first get.
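For illustration, a minimal sketch of the create-or-fetch flow being discussed, written against a recent client-go with context-taking signatures; the function and variable names here are placeholders, not the PR's actual code.

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureSecret tries to create the secret and, if another client created it
// first, fetches the existing object and continues with the normal flow.
func ensureSecret(ctx context.Context, client kubernetes.Interface, ns, name string) (*corev1.Secret, error) {
	secret := &corev1.Secret{ObjectMeta: metav1.ObjectMeta{Name: name}}

	created, err := client.CoreV1().Secrets(ns).Create(ctx, secret, metav1.CreateOptions{})
	if err == nil {
		return created, nil
	}
	if !k8serrors.IsAlreadyExists(err) {
		return nil, err
	}
	// Someone else created the secret between our Get and Create: fetch it
	// and carry on as if it had existed all along.
	return client.CoreV1().Secrets(ns).Get(ctx, name, metav1.GetOptions{})
}
```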
    }

    // Overriding with static configuration
    cfg.UserAgent = fmt.Sprintf("HashiCorp/1.0 Terraform/%s", terraform.VersionString())
There's a "TODO" comment around terraform.VersionString()
calling for future use-cases to just directly use the Version
string in the github.com/hashicorp/terraform/version
package (unless I'm misunderstanding it).
Maybe it's worth doing that. @jbardin might know more about the lifespans of these APIs.
Thanks Alex. That's probably copied from other backends, which haven't been updated. Those strings have been moved around as we've refactored, but we haven't gotten to the point of actually removing the old bits. It can simply be switched out for version.String
Updated to use version.String
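For reference, the suggested switch might look roughly like this, assuming a client-go rest.Config-style client configuration (the PR's actual client type may differ):

```go
package sketch

import (
	"fmt"

	"github.com/hashicorp/terraform/version"
	"k8s.io/client-go/rest"
)

// setUserAgent stamps the Terraform version onto the client configuration
// using the version package directly, rather than the older
// terraform.VersionString() helper.
func setUserAgent(cfg *rest.Config) {
	cfg.UserAgent = fmt.Sprintf("HashiCorp/1.0 Terraform/%s", version.String())
}
```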
@jrhouston We've recently updated the Contributing guide to clarify that coverage won't necessarily block a PR: https://github.com/hashicorp/terraform/blob/master/.github/CONTRIBUTING.md#pr-checks. So no, that is not blocking here.
LGTM. I think we should test the locking process a bit more closely. I have an uneasy feeling it might not behave quite atomically.
        DefaultFunc: schema.EnvDefaultFunc("KUBE_NAMESPACE", "default"),
        Description: "Namespace to store the secret in.",
    },
    "in_cluster_config": {
Eventually we should align the providers with this change too, but since it is a breaking change we will have to plan it for the next major release.
return "", err | ||
} | ||
|
||
if li != nil { |
This unfortunately does not provide a safe lock. The big clue is that the lock is handled in multiple operations: "check then create" (or the reverse) cannot work with multiple concurrent clients. Some form of single test-and-set operation is needed to create a lock (we don't require blocking locks for mutual exclusion, so more advanced primitives are not required).
Multiple clients can reach this point with the expectation that their lock info is going to be saved. The following Update call will then update the secret for both, with the last client's info being saved but both clients assuming they hold the lock.
@alexsomesan, the comments above lead me to believe that kubernetes secrets may not be usable as a lock in this way, but I'm not familiar enough with the data model of the secrets storage to know for sure.
@jbardin That is not how the k8s API works. If you hold an object, in this case a secret, which has a version ID, and then attempt to make an update to that secret, the k8s API will validate that you have the most recent version of the secret and, if you do not, reject your update request. The lock is enforced by the API by ensuring the client is only able to make updates to the most recent version of the object.
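As a rough sketch of that behaviour (using client-go and a hypothetical lockInfo data key, not the PR's actual code): the Update below carries the resourceVersion returned by the Get, so the API server rejects it with a Conflict error if another client modified the secret in between.

```go
package sketch

import (
	"context"
	"fmt"

	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tryLock illustrates optimistic concurrency on a Secret: the Update is only
// accepted if the Secret still has the resourceVersion we read.
func tryLock(ctx context.Context, client kubernetes.Interface, ns, name string, lockInfo []byte) error {
	secret, err := client.CoreV1().Secrets(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if len(secret.Data["lockInfo"]) > 0 {
		return fmt.Errorf("state is already locked")
	}
	if secret.Data == nil {
		secret.Data = map[string][]byte{}
	}
	secret.Data["lockInfo"] = lockInfo

	// secret still carries the resourceVersion from the Get above.
	_, err = client.CoreV1().Secrets(ns).Update(ctx, secret, metav1.UpdateOptions{})
	if k8serrors.IsConflict(err) {
		// Another client changed the secret first; we did not get the lock.
		return fmt.Errorf("state was locked by another client: %w", err)
	}
	return err
}
```

A client that receives the Conflict knows it lost the race and must re-read the secret before retrying.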
Thanks @dramich, I'll look into that. Testing this has consistently resulted in multiple clients obtaining locks concurrently, so I'll have to investigate further.
Just to briefly update on this: I realized that client-go exposes the same locking mechanism used by kubernetes' controller-manager to do leader election, so I'm reworking this PR to use that.
Sorry, but this cannot be merged in its current state, because the lock implementation is not safe. We will have to review whether this is possible to implement with kubernetes secrets.
Co-authored-by: Dan Ramich <danold215@gmail.com>
OK, so after doing a bunch of sleuthing in the k8s code, I have reworked the backend to use the Lease resource from the …
Initially I looked to see if it was feasible to take the leaderelection or resourcelock packages from client-go (this is what kube-controller-manager uses internally) and just use those, but they seem to assume that you are going to use them in a long-running service, so I am using the coordination API directly here. What this now does is create a Lease resource and use the …
For my own peace of mind, I created a soak test function alongside the usual locking test. This creates 1000 concurrent backend clients and fails if more than one succeeds in getting a lock.
It is worth noting that if the API server of the cluster is not using a strongly consistent store like etcd (e.g. k3s, which IIRC allows you to use sqlite and postgres), then this locking mechanism may not be reliable.
running …
References: https://kubernetes.io/blog/2016/01/simple-leader-election-with-kubernetes/
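As a compressed sketch of the Lease-based approach (using client-go's coordination client; the identifiers and the exact lock/unlock flow here are illustrative, not the PR's final code): creating the Lease acts as the atomic test-and-set, since only one client's Create can succeed for a given name.

```go
package sketch

import (
	"context"
	"fmt"

	coordinationv1 "k8s.io/api/coordination/v1"
	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// acquireLease attempts to take the lock by creating a Lease; if the Lease
// already exists, it reports who currently holds it.
func acquireLease(ctx context.Context, client kubernetes.Interface, ns, name, holder string) (*coordinationv1.Lease, error) {
	lease := &coordinationv1.Lease{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: coordinationv1.LeaseSpec{
			HolderIdentity: &holder,
		},
	}

	created, err := client.CoordinationV1().Leases(ns).Create(ctx, lease, metav1.CreateOptions{})
	if err == nil {
		return created, nil // we hold the lock
	}
	if !k8serrors.IsAlreadyExists(err) {
		return nil, err
	}

	// The Lease already exists: inspect its holder to report who owns the lock.
	existing, getErr := client.CoordinationV1().Leases(ns).Get(ctx, name, metav1.GetOptions{})
	if getErr != nil {
		return nil, getErr
	}
	if existing.Spec.HolderIdentity != nil && *existing.Spec.HolderIdentity != "" {
		return nil, fmt.Errorf("state locked by %s", *existing.Spec.HolderIdentity)
	}
	return nil, fmt.Errorf("lease %q exists but has no holder; retry", name)
}
```

Subsequent updates to the Lease (re-claiming or releasing it) remain guarded by the resourceVersion returned here, which is what makes the scheme safe on a strongly consistent store.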
I have updated the Unlock function to catch those error cases, and changed the soak test to create 100 clients that each make 100 attempts on the lock and error if Unlock() produces a failure.
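The soak test described above might look roughly like the following (a sketch only; lockFn and unlockFn stand in for the backend's real Lock/Unlock methods and are assumptions, not the PR's API):

```go
package sketch

import (
	"sync"
	"sync/atomic"
	"testing"
)

// soakLock hammers the same lock from many concurrent "clients" and fails the
// test if more than one client ever believes it holds the lock at once.
func soakLock(t *testing.T, lockFn func() (bool, error), unlockFn func() error) {
	const clients = 100
	const attempts = 100

	var holders int64
	var wg sync.WaitGroup

	for c := 0; c < clients; c++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < attempts; i++ {
				ok, err := lockFn()
				if err != nil || !ok {
					continue // someone else holds the lock; try again
				}
				if atomic.AddInt64(&holders, 1) > 1 {
					t.Error("more than one client holds the lock")
				}
				atomic.AddInt64(&holders, -1)
				if err := unlockFn(); err != nil {
					t.Errorf("unlock failed: %v", err)
				}
			}
		}()
	}
	wg.Wait()
}
```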
Would you be interested in / willing to write a blog article for https://k8s.io/blog/ about that, @jrhouston? Or if not writing it yourself, you could sketch out how you'd like the article to look and I could help author something (I'm part of Kubernetes SIG Docs). What do you think?
Thanks @jrhouston!
@sftim Sure! I'll drop you a message on the Kubernetes Slack and we can chat.
Thanks @dramich and all, sorry this took so long to include. This is now merged, and we will be shipping this with the next 0.13 beta release 🚀
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
This PR adds Kubernetes as a backend.
Details:
The state is stored gzipped in a Kubernetes secret. The secret name is terrastate-<workspace>-<key>; if issues arise with locking or state, the secret name is included in the errors. (A rough sketch of this layout follows after the details.)
Locking is handled in the same secret used to hold the state. This is done so we can use references to ensure the secret isn't being changed while we make state changes.
This pulls in k8s.io/client-go along with all its friends.
Added docs for the website
Example of the secret
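As a rough, hypothetical illustration of the secret layout described in the details above (the "tfstate" data key and the label are guesses for illustration, not the PR's actual schema):

```go
package sketch

import (
	"bytes"
	"compress/gzip"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// stateSecret gzip-compresses the state payload and places it in a Secret
// named after the workspace and key, matching the naming scheme above.
func stateSecret(workspace, key string, state []byte) (*corev1.Secret, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(state); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}

	return &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "terrastate-" + workspace + "-" + key,
			Labels: map[string]string{"app": "terraform-backend"},
		},
		Data: map[string][]byte{
			"tfstate": buf.Bytes(),
		},
	}, nil
}
```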
Open Questions:
How to run the tests? Is there a k8s infra already available in the builds?
Issue: #19371
Test output: