
etcd v3 backend with lock support. #15680

Merged (16 commits), Oct 3, 2017

Conversation
Conversation

@bmcustodio (Contributor) commented Aug 1, 2017

As a newcomer to Terraform, and having a few etcd clusters under my supervision, I thought it would be a nice idea to add an etcdv3 backend with lock support. This is my first attempt at writing one, done mainly as an exercise and by replicating the existing consul backend and making whatever changes are required.

However, I need a little guidance regarding locking, because the lock APIs of Consul and etcd are somewhat different and give different guarantees. Besides, I'm not really familiar with Consul.

Could anyone please be so kind as to have a look at the attached code and point out what can/must be changed in the Lock / Unlock methods? What is the expected "behaviour" of each method?

@bmcustodio bmcustodio force-pushed the etcdv3-backend branch 3 times, most recently from cb3fbfd to 19b61d3 Compare August 2, 2017 17:25
@jbardin jbardin self-assigned this Aug 2, 2017
@jbardin (Member) commented Aug 2, 2017

Hi @brunomcustodio,

Thanks for putting this together.
I'll take a look as soon as I can, and we can go over the semantics of the Lock methods then.

One question to start, since I'm not up to date on the current state of etcd: can we "update" the existing etcd remote state, or is there something fundamentally different about etcdv3?

@bmcustodio (Contributor, Author) commented Aug 2, 2017

@jbardin thank you very much for taking the time :-) I'm looking forward to it.

Yes, there are some fundamental differences between v2 and v3 of etcd. For instance, the API's underlying protocol changed from HTTP to gRPC, and the internal data model changed from a file-system-like structure to a flat binary key space. IMHO the existing etcd backend can be kept, of course, but only for compatibility with existing code.

@bmcustodio bmcustodio force-pushed the etcdv3-backend branch 5 times, most recently from a3da2b0 to 2a2dc89 Compare August 3, 2017 17:10
@jbardin (Member) left a comment:

This looks like a great start! I made a number of comments inline, some of which may not apply to etcd.

s := &schema.Backend{
	Schema: map[string]*schema.Schema{
		"endpoints": &schema.Schema{
			Type: schema.TypeString,
@jbardin (Member):

Since we're now using a schema, it might be nice to make this a TypeList, rather than relying on splitting a string. We would still need to split the string from an env variable, but the config would be cleaner.

@bmcustodio (Contributor, Author) commented Sep 8, 2017:

Right, many thanks 👍 70aad79

for _, kv := range res.Kvs {
	result = append(result, strings.TrimPrefix(string(kv.Key), prefix))
}

@jbardin (Member):

It's probably a good idea to sort.Strings(result[1:]) to make sure we get deterministic results.

@bmcustodio (Contributor, Author):

Sure 👍 b896348

	return stateMgr, nil
}

func (b *Backend) determineKey(name string) string {
@jbardin (Member):

If the changes in etcd mean that there isn't going to be a direct upgrade from etcd2, you can probably get rid of the legacy default state path, assume everything has a "workspace", and get rid of the conditionals around keyEnvPrefix and such.

@bmcustodio (Contributor, Author):

(...) you can probably get rid of the legacy default state path (...) and get rid of the conditionals around keyEnvPrefix and such.

@jbardin I'm not sure I understand what you mean. 😶 Could you please clarify/exemplify?

@jbardin (Member) commented Sep 8, 2017:

I meant that the reason the other backends have the "default" state handled separately is solely for backwards compatibility with state files that existed before envs/workspaces.

So rather than dealing with the keyEnvPrefix you've adopted from the others, you could use the same hierarchy in all cases of prefix/name, and just include "default" in there.
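A minimal sketch of that uniform scheme. The function name determineKey comes from the diff above, but the prefix value and concatenation here are illustrative assumptions, not the PR's actual code:

```go
package main

import "fmt"

const prefix = "terraform-state/"

// determineKey maps every workspace, including "default", through the same
// prefix/name hierarchy, so no legacy special case is needed.
func determineKey(name string) string {
	return prefix + name
}

func main() {
	fmt.Println(determineKey("default")) // terraform-state/default
	fmt.Println(determineKey("staging")) // terraform-state/staging
}
```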

@bmcustodio (Contributor, Author):

Done @ 3c21b9c. Thank you for your guidance. 🙂

const (
	lockAcquireTimeout = 2 * time.Second
	lockInfoSuffix     = ".lockinfo"
	lockSuffix         = ".lock"
@jbardin (Member):

lockSuffix was strictly a consul implementation detail, so that the terraform lock works just like the consul cli locks. This may be different for etcd, or maybe it doesn't matter at all.

@bmcustodio (Contributor, Author) commented Sep 8, 2017:

Thanks 🙂 Removed in 038f5eb.

}

func (c *RemoteClient) lock() (string, error) {
	session, err := etcdv3sync.NewSession(c.Client)
@jbardin (Member):

I'm not sure what etcd's behavior would be here, but terraform should never recursively lock, so you can check for an existing session and return an error if there is already a lock outstanding.
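One way to implement that guard, sketched without the real etcd client: the field names mirror the diff, but the surrounding types are simplified stand-ins, not the PR's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// session stands in for an *etcdv3sync.Session; only its presence matters here.
type session struct{}

type RemoteClient struct {
	etcdSession *session
}

// lock refuses to lock twice: Terraform should never lock recursively,
// so an outstanding session is treated as an error.
func (c *RemoteClient) lock() (string, error) {
	if c.etcdSession != nil {
		return "", errors.New("state already locked")
	}
	c.etcdSession = &session{}
	return "lock-id", nil
}

func main() {
	c := &RemoteClient{}
	if _, err := c.lock(); err != nil {
		fmt.Println("unexpected:", err)
	}
	if _, err := c.lock(); err != nil {
		fmt.Println("second lock rejected:", err)
	}
}
```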

@bmcustodio (Contributor, Author) commented Sep 8, 2017:

Done 👍 bb4dec6

	}

	c.etcdMutex = mutex
	c.etcdSession = session
@jbardin (Member):

Again I'm not sure about etcd, but consul sessions favored liveness, and were easier to lose than desired and susceptible to network timeouts. I'm not sure if we need to do the same here, but the consul backend watches the session status and reconnects when necessary.

@bmcustodio (Contributor, Author) commented Sep 8, 2017:

I believe that's not the case with etcd, judging by a number of factors, including these:

https://github.com/coreos/etcd/blob/master/clientv3/concurrency/session.go#L26
https://github.com/coreos/etcd/blob/master/clientv3/concurrency/session.go#L64

and some tests I ran locally using etcdctl. For example, try running

$ ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:12379 lock my-terraform-lock -- watch date

This will sit in a loop printing the date every two seconds. If you start a second command against a different etcd member of the same cluster, you will (obviously) be left waiting to acquire the lock:

$ ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:22379 lock my-terraform-lock -- watch date

Now, if you take down 127.0.0.1:12379 (the first member), the second command will still be left waiting to acquire the lock. If you bring 127.0.0.1:12379 back up, the first process resumes printing the date. Only when the first process is stopped does the second one take over the lock and start printing the date. Here's the code for the lock CLI command, which of course uses the same API as this implementation:

https://github.com/coreos/etcd/blob/master/etcdctl/ctlv3/command/lock_command.go

In the end I'd say we don't need to keep our own goroutine. WDYT @jbardin?

@jbardin (Member):

I agree. Let's favor the simpler solution for now, under the assumption that the etcd client package takes care of all the details.

if err := c.etcdSession.Close(); err != nil {
	errs = multierror.Append(errs, err)
}

@jbardin (Member):

I think we may want to remove the lockInfo too.

@bmcustodio (Contributor, Author):

Yes, of course 🙂 b8f4f6d

@jbardin (Member) commented Aug 3, 2017

Hi @brunomcustodio,

This looks very good after a preliminary pass through the code. Did you have any specific questions about the locking semantics?

@bmcustodio (Contributor, Author) commented Sep 9, 2017

@jbardin I believe this to be at a good point for another review. I've addressed your suggestions and in the meantime added TLS support (for secure client connections) and documentation.

I've added a comment to .travis.yml about testing, though I don't really know if that's the best place. I think the best way for you to run the etcdv3 tests on your machine is to spin up a local etcd 3.x instance and run the test suite using

TF_ETCDV3_ENDPOINTS="localhost:2379" TF_ETCDV3_TEST=1 make test

I see the consul tests are disabled in Travis. Do we want the etcdv3 ones to run? If so we will need to somehow spin up a local etcd instance or host an external one. PTAL and let me know what your thoughts on all of this are (and, once again, thank you very much for your help). 🙂

@bmcustodio (Contributor, Author) commented:

@jbardin have you had time to look into this? Do you expect to merge it in the short term?

@pires commented Sep 20, 2017

I've run this manually and it seemed to work smoothly. Great job, people!

@bmcustodio bmcustodio closed this Sep 26, 2017
@bmcustodio bmcustodio deleted the etcdv3-backend branch September 26, 2017 08:36
@bmcustodio bmcustodio restored the etcdv3-backend branch September 26, 2017 08:37
@bmcustodio bmcustodio reopened this Sep 26, 2017
@bmcustodio (Contributor, Author) commented Sep 26, 2017

(Sorry for the closing/reopening; it was caused by a slight mistake on my part.)

@jbardin (Member) commented Oct 2, 2017

Hi @brunomcustodio,

Thanks for hanging in there while we've been so busy!

This is looking great! Since backend tests still need to be run manually, we usually paste the test output into the PR comments, to record the acceptance tests as passing.
Since you have etcd already set up and running, could you run the full set of tests for me? There's still some extra log output in these tests, so something like TF_ACC=1 go test -v 2>/dev/null usually gives a concise list of the tests that were run.

@bmcustodio (Contributor, Author) commented:

@jbardin once again thank you very much for taking the time to review this. Here's the concise version:

=== RUN   TestBackend_impl
--- PASS: TestBackend_impl (0.00s)
=== RUN   TestBackend
--- PASS: TestBackend (2.16s)
	backend_test.go:39: Cleaned up 0 keys.
	backend_test.go:69: TestBackend: testing state locking for *etcd.Backend
	backend_test.go:39: Cleaned up 2 keys.
=== RUN   TestBackend_lockDisabled
--- PASS: TestBackend_lockDisabled (0.06s)
	backend_test.go:39: Cleaned up 0 keys.
	backend_test.go:92: TestBackend: testing state locking for *etcd.Backend
	backend_test.go:92: TestBackend: *etcd.Backend: empty string returned for lock, assuming disabled
	backend_test.go:39: Cleaned up 3 keys.
=== RUN   TestRemoteClient_impl
--- PASS: TestRemoteClient_impl (0.00s)
=== RUN   TestRemoteClient
--- PASS: TestRemoteClient (0.04s)
	backend_test.go:39: Cleaned up 0 keys.
	backend_test.go:39: Cleaned up 0 keys.
=== RUN   TestEtcdv3_stateLock
--- PASS: TestEtcdv3_stateLock (2.09s)
	backend_test.go:39: Cleaned up 0 keys.
	backend_test.go:39: Cleaned up 1 keys.
=== RUN   TestEtcdv3_destroyLock
--- PASS: TestEtcdv3_destroyLock (0.05s)
	backend_test.go:39: Cleaned up 0 keys.
	backend_test.go:39: Cleaned up 1 keys.
PASS
ok  	github.com/hashicorp/terraform/backend/remote-state/etcdv3	4.422s

I pasted the full version (a second run) in this Gist in case you want to take a look.

@jbardin (Member) commented Oct 3, 2017

Thanks for all the work @brunomcustodio!

Let's drop this into master so people can start trying it out!

@jbardin jbardin merged commit 91442b7 into hashicorp:master Oct 3, 2017
@jbardin jbardin changed the title [WIP] etcd v3 backend with lock support. etcd v3 backend with lock support. Oct 3, 2017
@bmcustodio (Contributor, Author) commented:

Most welcome @jbardin 🙂 Thanks for taking the time!

@ghost commented Apr 7, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 7, 2020