Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always deregister from Consul #136

Closed
far-blue opened this issue Jul 26, 2016 · 23 comments
Closed

Always deregister from Consul #136

far-blue opened this issue Jul 26, 2016 · 23 comments
Labels
Milestone

Comments

@far-blue
Copy link

I've hunted through the wiki documentation but can't find anywhere referring to how to shutdown fabio in a graceful manner that deregisters the service and associated health check from Consul. I know with other apps you can send a different sig (e.g. sigint) for a graceful shutdown. Does Fabio support such behaviour and, if not, what's the recommended approach for graceful shutdown that also cleans up Consul records?

@magiconair
Copy link
Contributor

SIGQUIT or SIGTERM should do this. Let me check whether this is a regression. How are you shutting it down?

@far-blue
Copy link
Author

far-blue commented Jul 26, 2016

I'm just running the docker container (in the foreground) via docker-compose and when I want to make changes I'm using ctl-c to shut it down and just running docker-compose up again.

@magiconair
Copy link
Contributor

Looks like a shutdown race. I'll try to make it more robust and provide a patch. Can you work with that and build a fabio binary yourself or do you need a full Docker container to test this?

@far-blue
Copy link
Author

What's the race? Is it that docker-compose isn't giving enough time for the shutdown? If that's the case I can tell docker-compose to wait longer before sending the kill signal.

If it's something internal then yes, I can prob. build my own docker image from the source or, if not, I can wait for a new image release. As long as I know the problem is being sorted I can cope with cleaning out the Consul data manually during testing :)

@magiconair
Copy link
Contributor

Actually, I don't think there is a race. startListeners blocks until it gets a shutdown signal which is triggered by signal.Notify(sigchan, os.Interrupt, os.Kill, syscall.SIGTERM). That closes the quit channel which returns startListeners and then triggers registry.Default.Deregister() as the last call. And that one waits for completion, AFAICT.

You should have the following log output:

2016/07/26 16:50:53 [INFO] Updated config to
2016/07/26 16:50:53 [INFO] consul: Registered fabio with id "fabio-LM-AMS-00963965-9998"
2016/07/26 16:50:53 [INFO] consul: Registered fabio with address "10.249.126.25"
2016/07/26 16:50:53 [INFO] consul: Registered fabio with tags ""
2016/07/26 16:50:53 [INFO] consul: Registered fabio with health check to "http://[10.249.126.25]:9998/health"
2016/07/26 16:50:53 [INFO] consul: Health changed to #3360
2016/07/26 16:50:54 [INFO] consul: Health changed to #3361
2016/07/26 16:51:00 [INFO] Graceful shutdown over 0
2016/07/26 16:51:00 [INFO] Down
2016/07/26 16:51:00 [INFO] consul: Deregistering fabio
2016/07/26 16:51:00 [INFO] consul: Health changed to #3363

@far-blue
Copy link
Author

Ah, right, I think I see the problem. My mistake. It must be that it only leaves the data in Consul if it dies for some reason while processing the config file or it fails to talk to Vault. I think I'd miss-counted the number of left-over healthchecks vs the number of times I'd started and stopped the container.

@far-blue
Copy link
Author

It also doesn't help that the 'Down' line and all after it are not visible via docker-compose during the shutdown - although if you run docker-compose logs after the container has terminated you can then see the log output :)

Thanks for looking into things and sorry it was my error!

@magiconair
Copy link
Contributor

I think by moving the deregistration into a defer block should take care of a stale registration if fabio fatals during startup but after registration. I'll have a look.

@far-blue
Copy link
Author

ah, that would be great :) I guess leaving the health check in place to turn critical when/if Fabio fails is a good thing as long as Fabio has started up correctly and has been running for a few seconds.

As I run consul-alert I find out pretty quickly if health checks are left dangling!

@magiconair
Copy link
Contributor

magiconair commented Jul 26, 2016

Yeah, that's the edge case. If fabio FATALs after initBackend it does not clean up after itself. defer won't help since FATAL calls os.Exit(1) I'll see how to improve that.

@magiconair magiconair changed the title Deregistering from Consul on 'polite' shutdown Always deregister from Consul Jul 26, 2016
@magiconair magiconair added bug and removed looking labels Jul 26, 2016
magiconair added a commit that referenced this issue Jul 26, 2016
fabio was not deregistering from consul if there was a
critical failure (FATAL) after the initial consul registration.
This patch removes the calls to log.Fatal() and replaces them
with exit.Fatal() which waits until all registered exit listeners
have been called.
@magiconair
Copy link
Contributor

I've pushed a patch which seems to address the issue. log.Fatal has been replaced with exit.Fatal which waits until all registered exit handlers have been called and then exits. The line numbers of the last log lines should be off and I need to think about the approach in general a bit more but it would be great if you could test it.

magiconair added a commit that referenced this issue Jul 26, 2016
fabio was not deregistering from consul if there was a
critical failure (FATAL) after the initial consul registration.
This patch removes the calls to log.Fatal() and replaces them
with exit.Fatal() which waits until all registered exit listeners
have been called.
@far-blue
Copy link
Author

far-blue commented Jul 26, 2016

I've tried building but I'm not a go developer and the build is failing with a weird error I don't understand:

src/github.com/eBay/fabio/admin/server.go:17: cannot use cfg (type *"github.com/eBay/fabio/vendor/github.com/eBay/fabio/config".Config) as type *"github.com/eBay/fabio/vendor/github.com/eBay/fabio/vendor/github.com/eBay/fabio/config".Config in assignment

I suspect this is just because something is set up wrong for building but it might be better for someone else to test ;)

For reference, if you want to repeat the exit I was getting that was leaving the service registration in Consul I think you can replicate it by simply pointing a vault cs config at an invalid url.

@far-blue
Copy link
Author

e.g. set an invalid VAULT_ADDR and VAULT_TOKEN and setup a cs in the config

proxy.cs = cs=foo;type=vault;cert=wibble
proxy.addr = :443;cs=foo

@magiconair
Copy link
Contributor

Try this:

mkdir ~/gopath
export GOPATH=~/gopath
go get -u github.com/eBay/fabio
cd ~/gopath/src/github.com/eBay/fabio
git checkout issue-136-always-deregister-from-consul
go build (or make)
./fabio

If you are on OSX and need to build a linux/amd64 binary run make linux

@far-blue
Copy link
Author

ah, great, that did it. Managed to build and package into a docker image on my home machine and push up to the work server for testing and it now appears to correctly deregister after it encounters an error. I tested it with the setup I was using with the invalid vault token:

fabio_1 | 2016/07/26 18:40:21 [INFO] Version v1.2-4-g75c5c70 starting
fabio_1 | 2016/07/26 18:40:21 [INFO] Go runtime is go1.6.3
fabio_1 | 2016/07/26 18:40:21 [INFO] Setting GOGC=800
fabio_1 | 2016/07/26 18:40:21 [INFO] Setting GOMAXPROCS=8
fabio_1 | 2016/07/26 18:40:21 [INFO] Metrics disabled
fabio_1 | 2016/07/26 18:40:21 [INFO] consul: Connecting to "bar:8500" in datacenter "foo"
fabio_1 | 2016/07/26 18:40:21 [INFO] Admin server listening on ":9998"
fabio_1 | 2016/07/26 18:40:21 [INFO] Using routing strategy "rnd"
fabio_1 | 2016/07/26 18:40:21 [INFO] Using routing matching "prefix"
fabio_1 | 2016/07/26 18:40:21 [INFO] consul: Using dynamic routes
fabio_1 | 2016/07/26 18:40:21 [INFO] consul: Using tag prefix "urlprefix-"
fabio_1 | 2016/07/26 18:40:21 [INFO] consul: Watching KV path "/fabio/config"
fabio_1 | 2016/07/26 18:40:21 [INFO] consul: Manual config changed to #8719119
fabio_1 | 2016/07/26 18:40:21 [INFO] Updated config to
fabio_1 | 2016/07/26 18:40:21 [INFO] consul: Health changed to #8719118
fabio_1 | 2016/07/26 18:40:21 [FATAL] vault: client: Error making API request.
fabio_1 |
fabio_1 | URL: POST http://ping:8200/v1/auth/token/create
fabio_1 | Code: 400. Errors:
fabio_1 |
fabio_1 | * root or sudo privileges required to create orphan token
fabio_1 | 2016/07/26 18:40:21 [INFO] Graceful shutdown over 0
fabio_1 | 2016/07/26 18:40:21 [INFO] Down
fabio_1 | 2016/07/26 18:40:21 [INFO] consul: Registered fabio with id "fabio-c49206c22615-9998"
fabio_1 | 2016/07/26 18:40:21 [INFO] consul: Registered fabio with address "192.168.5.3"
fabio_1 | 2016/07/26 18:40:21 [INFO] consul: Registered fabio with tags ""
fabio_1 | 2016/07/26 18:40:21 [INFO] consul: Registered fabio with health check to "http://[192.168.5.3]:9998/health"
fabio_1 | 2016/07/26 18:40:21 [INFO] consul: Deregistering fabio
fabio_1 | 2016/07/26 18:40:21 [INFO] Updated config to
fabio_1 | route add Test-Nginx wibble/ http://192.168.5.2:80/ tags "urlprefix-wibble/"
fabio_1 | 2016/07/26 18:40:21 [INFO] consul: Health changed to #8719122

@magiconair
Copy link
Contributor

OK, then I'll let that rest for a couple of days. Helps me to reason about the approach a bit longer. If nothing comes up then I'll merge it. Looking at #134 now.

magiconair added a commit that referenced this issue Jul 30, 2016
fabio was not deregistering from consul if there was a critical failure
(FATAL) after the initial consul registration. This patch replaces the
calls to log.Fatal()/log.Fatalf() and with exit.Fatal()/exit.Fatalf()
which wait until all registered exit listeners have been called.
@magiconair
Copy link
Contributor

I've had a look at it again and think the approach is sound. I've added a test for the exit behavior and made sure that exit.FatalF can be called concurrently. Just merged this to master.

@magiconair
Copy link
Contributor

Good way to test this is ./fabio -proxy.addr 1.2.3.4 assuming that you don't have 1.2.3.4 on that machine.

@far-blue
Copy link
Author

One question - does the health check always get removed on any termination or does your patch only mean any problems during startup are caught?

I'm pretty new to health checks so maybe my thinking is a bit wrong but to my mind if a process has been running for more than a 'grace' period of maybe a few mins and then terminates there are situations where leaving the health check in place (so it then fails) is a good thing. If you are not using any form of orchestration system and Fabio fails unexpectedly after it's been running for a while then the health check is a way to be informed of this so shouldn't be cleared up. If you are using an orchestration system then I assume it will notice if the container exits and re-instantiate it so Fabio will just update its details and health check anyhow on restart.

Or, as I said, maybe I'm thinking this wrong :)

@magiconair
Copy link
Contributor

Health check disappears with the service since you can't register a health check without a service.

@far-blue
Copy link
Author

ok, so if Fabio crashes out with an exception then the service and health check will be deregistered in all cases? Maybe it would be good to document what situations will cause the health check to fail and what situations will cause fabio to exit and deregister the health check so the scope of the health check as a warning of incorrect operation is understood :)

@magiconair
Copy link
Contributor

No, in that case fabio will not de-register itself but consul will mark fabio as down as the health check fails. Just restart fabio or clear the registration manually. If you use consul for things other than fabio (e.g. service discovery between your services) then this is not a problem. I also don't recall consul crashing on us either.

However, after having fabio in production for a year on several fairly heavily loaded public websites with tens of fabio instances and hundreds of micro-services we literally never had that happen with any of the releases (knock on wood). This doesn't mean it can't happen but this is not the kind of thing you have to deal with on a regular basis and it is fairly easy to fix if it does happen.

I've come to call fabio the piece of infrastructure we forget it exists.

@far-blue
Copy link
Author

far-blue commented Aug 1, 2016

Great, that's what I hoped would happen! I'd like to be alerted when / if Fabio was ever to die unexpectedly and just wanted to make sure the changes you made to clean up on exit after config error etc. didn't change that behaviour :)

I'm glad to hear Fabio has been so stable and I've had the same experiences as you with Consul - I've never had any unexpected behaviour from it and it's been as stable as a rock.

@magiconair magiconair added this to the 1.2.1 milestone Oct 10, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants