nsqd: ability to dynamically reconfigure nsqlookupd addresses #601

mreiferson · 2015-07-14T00:46:18Z

I'd love it if NSQ could support DNS-based discovery for nsqlookupd instances. This would be a light layer on top of the already-proven architecture that would allow ops people to abstract away some of the configuration required to get NSQ working.

Basically, rather than hardcoding nsqlookupd addresses, I'd much rather just say "use all the A records returned by this DNS name". There's also an RFC for using DNS for service discovery. This supports things like Prometheus and makes deployment a snap on Kubernetes. It basically makes it dead simple to use NSQ in a cluster environment, where you don't know what IP address or machine any of your services will be available at, but know you can reach them through a DNS lookup.

More docs on the relevant portion of Kubernetes.

Here's the desired user experience:

I set up a service (logical grouping) of pods (group of containers, think "application cookie cutter") and say I want 4 of them. Kubernetes automatically does the right thing for me, setting them up, tearing them down, restarting them, moving them, whatever.
I configure nsqd to say "do a DNS lookup on nsqlookupd.cluster, and connect to every IP address returned as an A record. Also, every minute/five minutes/ten minutes, check again, and update your connections"
I configure my consumers to say "do a DNS lookup on nsqlookupd.cluster, and connect to every IP address returned as an A record. Also, every minute/five minutes/ten minutes, check again and update your connections"
Everything just works. Containers get restarted, machines get removed, machines get added, containers get moved, and everything just points to the right thing.
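
The "look up a name, connect to every returned address" step could be sketched roughly as below. This is only an illustration, not anything nsqd ships: `resolveLookupds` and `lookupFunc` are hypothetical names, port 4160 is assumed to be the nsqlookupd TCP port, and the resolver is injected so the example runs offline. Passing `net.LookupHost` instead would perform a real lookup (it returns both A and AAAA results), and the periodic re-check would just be this call inside a `time.Ticker` loop.

```go
package main

import (
	"fmt"
	"net"
	"sort"
	"strconv"
)

// lookupFunc has the same signature as net.LookupHost, so a fake can be swapped in.
type lookupFunc func(host string) ([]string, error)

// resolveLookupds (hypothetical helper) resolves name and returns a sorted,
// de-duplicated list of "ip:port" addresses, one per returned record.
func resolveLookupds(lookup lookupFunc, name string, port int) ([]string, error) {
	ips, err := lookup(name)
	if err != nil {
		return nil, err
	}
	seen := make(map[string]bool, len(ips))
	addrs := make([]string, 0, len(ips))
	for _, ip := range ips {
		addr := net.JoinHostPort(ip, strconv.Itoa(port))
		if !seen[addr] {
			seen[addr] = true
			addrs = append(addrs, addr)
		}
	}
	// Sorting makes successive results easy to compare between polls.
	sort.Strings(addrs)
	return addrs, nil
}

func main() {
	// A fake resolver keeps the example offline; use net.LookupHost for real lookups.
	fake := func(host string) ([]string, error) {
		return []string{"10.0.0.2", "10.0.0.1", "10.0.0.2"}, nil
	}
	addrs, err := resolveLookupds(fake, "nsqlookupd.cluster", 4160) // 4160: assumed port
	if err != nil {
		panic(err)
	}
	fmt.Println(addrs) // prints [10.0.0.1:4160 10.0.0.2:4160]
}
```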

Most of this (the DNS querying part) can be achieved simply with wrappers/helpers. The big missing piece is some sort of runtime live-reloading/editing of configuration. Any of the following would work beautifully:

A way to easily wrap nsqd inside another Go binary, and use a channel/function/whatever to say "reload configuration/here are new configuration values"
A way to say "Here's a configuration file, read the nsqlookupd addresses out of it, and re-check whenever I HUP you" (I'm unsure whether one container can HUP a process in another. If not, this is less ideal, but still probably workable. It'll just take some noodling on how to make it happen. The unix socket approach would be more desirable, honestly)
A TCP/unix socket/http/whatever interface that we can use to pass new configuration values into.

Or I'm sure there are myriad better ways to accomplish this. Basically, the important bit is "here are the new nsqlookupd addresses, use them instead, but don't stop running/serving requests, and please don't catch on fire".

I'm making the following assumptions when imagining how this would be run:

A fresh nsqlookupd will be able to gather shared state from the registering consumers/nsqd instances quickly-ish (maybe a few seconds, at most?)
Assuming each nsqd/consumer connects to at least two nsqlookupd instances, one going away doesn't actually affect anything except your redundancy level.

paddycarver · 2015-07-11T02:27:02Z

Also, this comes after a lengthy Twitter conversation with @mreiferson: https://twitter.com/paddyforan/status/619679434992447488

My initial investigations made this look like a lot of work and an involved feature. Matt says it's easier than I believed, which I will take his word for, because he probably has a better grasp on the codebase than I do.

mreiferson · 2015-07-11T20:44:39Z

A way to easily wrap nsqd inside another Go binary, and use a channel/function/whatever to say "reload configuration/here are new configuration values"

nsqd is already importable but it lacks a public API to modify nsqlookupd configuration. I don't really like the idea of having to support that kind of API compatibility, so while it's a good suggestion and we should consider it, I'm 👎 on this.

A way to say "Here's a configuration file, read the nsqlookupd addresses out of it, and re-check whenever I HUP you" (I'm unsure whether one container can HUP a process in another. If not, this is less ideal, but still probably workable. It'll just take some noodling on how to make it happen. The unix socket approach would be more desirable, honestly)

nsqd also already has config file support but lacks any means of reloading it at runtime. HUP signal support feels like something it should have anyway; we would just need to be very clear on what values are sane to change while it's already running. This could be pretty confusing and complicated to implement. A first step could be to restrict this to just the nsqlookupd addresses. It doesn't solve the usability issue for you, though. I'm on the fence.

A TCP/unix socket/http/whatever interface that we can use to pass new configuration values into.

This would address the same use case as the suggestion above (ability to dynamically reconfigure a running nsqd) but would have the benefit of being conceptually simpler in terms of implementation. If we took the HTTP endpoint route, it would be a much clearer contract what values you could or could not modify. I'm curious how this would fit into k8s though, e.g. what process would actually make the request to this endpoint? It would also need to do this for all nsqd? Promising, but there are questions.

A fresh nsqlookupd will be able to gather shared state from the registering consumers/nsqd instances quickly-ish (maybe a few seconds, at most?)

This can be tuned by changing the interval between nsqd -> nsqlookupd pings/reconnections.

Assuming each nsqd/consumer connects to at least two nsqlookupd instances, one going away doesn't actually affect anything except your redundancy level.

Yep.

Lastly, none of the above actually talk about your DNS-based discovery suggestion, which is intriguing. It seems like a reasonably encapsulated change for nsqd to adhere to the DNS-based service discovery RFC for nsqlookupd interaction when endpoints are not IP addresses. This seems like the most promising path forward from a usability standpoint and would seem to avoid most (all?) of the problems outlined above.

paddycarver · 2015-07-11T20:59:44Z

This would address the same use case as the suggestion above (ability to dynamically reconfigure a running nsqd) but would have the benefit of being conceptually simpler in terms of implementation. If we took the HTTP endpoint route, it would be a much clearer contract what values you could or could not modify. I'm curious how this would fit into k8s though, e.g. what process would actually make the request to this endpoint? It would also need to do this for all nsqd? Promising, but there are questions.

This is, actually, my preferred implementation, as it is the most versatile when it comes to containers, and is the least-likely to require weird hacks to make it work. I also thought it'd be the one that would be hardest to implement, so I'm glad I'm super wrong.

K8s has the concept of "sidecar containers". Basically, k8s thinks in "pods", and a "pod" is a group of containers that all get deployed together and that k8s makes sure run together. I'd basically define an nsqd pod that consists of an nsqd container, and a "nsqlookupd-updater" container that's just a tiny little Go program that polls the k8s DNS, gets the lookupd addresses, and turns that into an nsqd update through the API we define. So one container just runs nsqd, without knowing anything about kubernetes, and the other container takes care of turning the kubernetes-specific information into updates to the nsqd config. They get deployed together, and kubernetes makes sure they're both running, and we neatly divide the responsibilities.
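
The updater side of that sidecar might look something like the sketch below. It assumes the `PUT /config/<option>` shape that is settled on later in this thread, with the JSON-encoded value as the body; `newConfigRequest` and `pushAddresses` are hypothetical names, and an `httptest` server stands in for nsqd so the example runs without a real instance.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newConfigRequest (hypothetical) builds a PUT whose body is the JSON-encoded
// address list, targeting /config/nsqlookupd_tcp_addresses on the given nsqd.
func newConfigRequest(nsqdHTTPAddr string, addrs []string) (*http.Request, error) {
	body, err := json.Marshal(addrs)
	if err != nil {
		return nil, err
	}
	url := "http://" + nsqdHTTPAddr + "/config/nsqlookupd_tcp_addresses"
	return http.NewRequest(http.MethodPut, url, bytes.NewReader(body))
}

// pushAddresses (hypothetical) sends the request and treats any non-200
// status as an error.
func pushAddresses(client *http.Client, nsqdHTTPAddr string, addrs []string) error {
	req, err := newConfigRequest(nsqdHTTPAddr, addrs)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("nsqd returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Stand-in for nsqd: it only echoes what it received.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		b, _ := io.ReadAll(r.Body)
		fmt.Printf("%s %s %s\n", r.Method, r.URL.Path, b)
	}))
	defer srv.Close()

	addrs := []string{"10.0.0.1:4160", "10.0.0.2:4160"}
	if err := pushAddresses(srv.Client(), srv.Listener.Addr().String(), addrs); err != nil {
		panic(err)
	}
}
```

In a pod, the sidecar would call `pushAddresses` against nsqd's HTTP address whenever the resolved list changes, which keeps nsqd itself unaware of Kubernetes.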

Lastly, none of the above actually talk about your DNS-based discovery suggestion, which is intriguing. It seems like a reasonably encapsulated change for nsqd to adhere to the DNS-based service discovery RFC for nsqlookupd interaction when endpoints are not IP addresses. This seems like the most promising path forward from a usability standpoint and would seem to avoid most (all?) of the problems outlined above.

My concern is that this is semi-new and there seems to be some missing consensus around how to do this. Also, there are some minor, annoying variations between implementations. E.g., right now kubernetes uses A records while other things use SRV records, though they plan on updating later. By just providing the tool that says "update nsqlookupd addresses", we don't end up in a position where people want nsqd to know how to update based on:

A records
SRV records
Consul
Etcd
Changing config files
The phase of the moon
Etc.

I feel like providing the tools is a better approach than providing the solution in this case, just because there are a lot of possible configurations for the solution, and the amount of software it takes to wrap the tools to work with these configurations is pretty trivial. At the very least, I'd say provide the tools in nsqd itself, and then provide a few implementations as separate apps, much like nsq_to_file and nsq_to_http are implemented.

mreiferson · 2015-07-11T21:25:35Z

I feel like providing the tools is a better approach than providing the solution in this case, just because there are a lot of possible configurations for the solution, and the amount of software it takes to wrap the tools to work with these configurations is pretty trivial. At the very least, I'd say provide the tools in nsqd itself, and then provide a few implementations as separate apps, much like nsq_to_file and nsq_to_http are implemented.

Yes, I agree, and it would align best with how everything has been designed and built so far. To play devil's advocate, this isn't necessarily the simplest approach from the user's perspective, though. We're talking about needing to bundle another app to sit between nsqd and mediate. This feels a little bit like way back in the early days when we decided to directly support statsd format metrics rather than some other intermediate machinery (and that seemed to have turned out alright).

Curious if any other lurkers have any thoughts (cc @jehiah).

Thanks for taking the time to chat through this @paddyforan!

paddycarver · 2015-07-11T21:40:39Z

Two things about the devil's advocate position:

This kind of thing will be most useful, I believe, to people in cluster environments, where "deploy another little thing" is not as big a deal as it is in setups where each server is a pet. I know I'd certainly rather ship another container than find out my discovery method isn't supported and either need to use a hacky solution, or not be able to take advantage of it. And pretty much every conversation I've seen about integrating an application with fleet or kubernetes has always included "...and use another container to bridge..." at one point or another. This is totally unscientific and based on impressions, though.
statsd has enough marketshare in the monitoring space that new entrants feel the need to provide a solution to make running both or migrating from statsd feasible. There's no such dominant player in service discovery yet, so there's no obvious choice for which to support. So that's why this feels different from the statsd support.

But I dunno. Even if we get to a point where one is supported, I can always just use a convoluted wrapper or a light fork to make it fit my use case (which is what I was investigating before I opened this issue).

mreiferson · 2015-07-12T17:44:03Z

Cool, let's prototype what an HTTP API might look like for this. My initial thought is that it should be more general than just nsqlookupd_tcp_addresses as there are a few other config variables that would be nice to be able to change at runtime, perhaps something like:

GET /config
    returns JSON body of current config

POST /config
    body is JSON dictionary where keys are the config fields you want to update

jehiah · 2015-07-12T18:52:19Z

I dig an HTTP config query/update API endpoint (as mentioned before, it clarifies what values are dynamic vs a HUP reloading of the config file). That seems to be the base needed to make it possible to tackle any sort of new discovery mechanism.

I like the general idea/usability of discovery via DNS. Handling multiple DNS A/AAAA records seems like a straightforward general way of providing some externally managed cluster state (i.e., it feels like something I would begin to use); I feel I need to read up on RFC 6763 and other common implementations to have a better opinion there.

paddycarver · 2015-07-14T03:58:34Z

    GET /config
         returns JSON body of current config
    POST /config
        body is JSON dictionary where keys are the config fields you want to update

My only complaint is the POST. I'd, personally, prefer PATCH. But I'm nit-picking here. I'd be ecstatic to get this, even with POST.

paddycarver · 2015-07-14T04:03:37Z

nsqd/http.go

@@ -596,3 +600,41 @@ func (s *httpServer) printStats(stats []TopicStats, health string, startTime tim
 	}
 	return buf.Bytes()
 }
+
+type allowedOpts struct {
+	NSQLookupdTCPAddresses []string `json:"nsqlookupd_tcp_addresses"`


It may be nice to support these as atomic add/remove or overwrite. The way I've done this in the past, which isn't too onerous, is

    type allowedOpts struct {
        NSQLookupdTCPAddresses       []string `json:"nsqlookupd_tcp_addresses"`
        NSQLookupdTCPAddressesAdd    []string `json:"nsqlookupd_tcp_addresses_add"`
        NSQLookupdTCPAddressesRemove []string `json:"nsqlookupd_tcp_addresses_remove"`
    }

That way I have the option of overwriting the values completely, or I can just add a new address and remove a different one.
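
For illustration, the overwrite/add/remove semantics described above could be applied along these lines. `applyAddresses` is a hypothetical helper, and the ordering (a non-nil overwrite list wins, otherwise removals before additions) is an assumption rather than anything the PR implements.

```go
package main

import "fmt"

// allowedOpts mirrors the struct proposed above.
type allowedOpts struct {
	NSQLookupdTCPAddresses       []string `json:"nsqlookupd_tcp_addresses"`
	NSQLookupdTCPAddressesAdd    []string `json:"nsqlookupd_tcp_addresses_add"`
	NSQLookupdTCPAddressesRemove []string `json:"nsqlookupd_tcp_addresses_remove"`
}

// applyAddresses (hypothetical) returns the new address list. A non-nil
// overwrite list replaces the current one; otherwise removals are applied
// first and additions are appended, skipping duplicates.
func applyAddresses(current []string, o allowedOpts) []string {
	if o.NSQLookupdTCPAddresses != nil {
		// Copy so the caller's slice is not shared with the result.
		return append([]string(nil), o.NSQLookupdTCPAddresses...)
	}
	remove := make(map[string]bool, len(o.NSQLookupdTCPAddressesRemove))
	for _, a := range o.NSQLookupdTCPAddressesRemove {
		remove[a] = true
	}
	present := make(map[string]bool, len(current))
	out := make([]string, 0, len(current)+len(o.NSQLookupdTCPAddressesAdd))
	for _, a := range current {
		if !remove[a] && !present[a] {
			present[a] = true
			out = append(out, a)
		}
	}
	for _, a := range o.NSQLookupdTCPAddressesAdd {
		if !present[a] {
			present[a] = true
			out = append(out, a)
		}
	}
	return out
}

func main() {
	current := []string{"10.0.0.1:4160", "10.0.0.2:4160"}
	next := applyAddresses(current, allowedOpts{
		NSQLookupdTCPAddressesAdd:    []string{"10.0.0.3:4160"},
		NSQLookupdTCPAddressesRemove: []string{"10.0.0.1:4160"},
	})
	fmt.Println(next) // prints [10.0.0.2:4160 10.0.0.3:4160]
}
```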

I understand why that would be valuable but the API feels wrong. The requirements really argue for specific nsqlookupd config endpoints, which I was hoping to avoid.

I mean, where I'm using this elsewhere, I just continue adding pointers to allowedOpts. The only difference is nsqlookupdtcpaddresses gets 3 properties, instead of 1, and a user specifies at most 2 in any request.

However, I'm nitpicking. This is a Useful Thing because it avoids weird race conditions and provides atomicity, which is Always Useful once you start distributing things. (Look at me, explaining distributing things to you. ) In practice, I feel like it would be an edge case where this would be an issue. If you feel it's not worth the API bloat, that's totally cool. I thought I'd bring it up while I had the chance :P

Yes, I'm happy to discuss alternative APIs because now is the time to do it. It's not that I disagree with the benefits, it's that I think we should try to come up with an even better API if we want those semantics.

What about HTTP verbs? What about individual endpoints? If being more specific helps improve the API then I'm all for it...

Implementation wise, the HTTP side of this PR is the trivial stuff, most of the noise here is cleanup and proper handling of (now dynamic) state.

Verbs:

PATCH uses the add/remove properties

PUT uses the replace properties

GET returns, as it does now

DELETE is unused, because that would be silly

I'm :-/ about the semantics because it's not really REST and is more RPC in spirit. However, the REST way to do it would be to do

POST /config/nsqlookupdtcpaddresses/ to add an address

PUT /config/nsqlookupdtcpaddresses/ to replace the entire list

DELETE /config/nsqlookupdtcpaddresses/{address} to remove an address

But that seems excessive, and as I'm thinking about it, I'm probably overcomplicating matters. Let's step back for a second:

Can we think of a single service discovery mechanism that would give us an atomic view of the world? If we know about a change, we're almost certainly going to want to update the state of the world (in DNS and etcd, at the very least).

As long as nsqd is intelligent about diffing updates (e.g., if it's connected to A & B, and you update the addresses to be A, B, and C, it just connects to C. it doesn't disconnect from A & B) then I'm having trouble thinking of a scenario where I'd actually use this.

Sorry for the knee jerk here, I may be off base about whether this is actually useful.

TL;DR: If we can be declarative instead of imperative, that's probably better.

After a discussion offline, what if we just implemented PUT /config/<option> for now (the body is the JSON encoded value)? This leaves room for some of these future decisions.

The <option> would match the equivalent key in the config file.

That would fit my use case. 👍
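
A minimal server-side sketch of `PUT /config/<option>` follows, using only the standard library. It is not the code in this PR: `config`, `setOption`, and the `INVALID_OPTION` code are hypothetical, and only the `nsqlookupd_tcp_addresses` key and the `INVALID_VALUE` error come from the discussion here.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// config is a hypothetical stand-in for nsqd's runtime options.
type config struct {
	mu                     sync.Mutex
	NSQLookupdTCPAddresses []string
}

var (
	errInvalidOption = errors.New("INVALID_OPTION") // hypothetical error code
	errInvalidValue  = errors.New("INVALID_VALUE")
)

// setOption decodes body as the JSON value for opt and stores it.
// Only options listed in the switch can be changed at runtime.
func (c *config) setOption(opt string, body []byte) error {
	switch opt {
	case "nsqlookupd_tcp_addresses":
		var addrs []string
		if err := json.Unmarshal(body, &addrs); err != nil {
			return errInvalidValue
		}
		c.mu.Lock()
		c.NSQLookupdTCPAddresses = addrs
		c.mu.Unlock()
		return nil
	default:
		return errInvalidOption
	}
}

// ServeHTTP accepts PUT /config/<option>; other methods get 405 and
// unknown options or undecodable values get 400.
func (c *config) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPut {
		http.Error(w, "METHOD_NOT_ALLOWED", http.StatusMethodNotAllowed)
		return
	}
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "INVALID_BODY", http.StatusBadRequest)
		return
	}
	opt := strings.TrimPrefix(r.URL.Path, "/config/")
	if err := c.setOption(opt, body); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
	}
}

func main() {
	c := &config{}
	req := httptest.NewRequest(http.MethodPut, "/config/nsqlookupd_tcp_addresses",
		strings.NewReader(`["10.0.0.1:4160"]`))
	rec := httptest.NewRecorder()
	c.ServeHTTP(rec, req)
	fmt.Println(rec.Code, c.NSQLookupdTCPAddresses) // prints 200 [10.0.0.1:4160]
}
```

Keeping the switch explicit is what makes the contract clear: anything not listed simply cannot be modified through the endpoint.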

mreiferson · 2015-07-17T16:43:33Z

RFR @jehiah @paddyforan

I did a ton of cleanup and the overall diff is noisy. The individual commits are useful...

jehiah · 2015-07-18T01:46:45Z

nsqd/http.go

+		opts := *s.ctx.nsqd.getOpts()
+		switch opt {
+		case "nsqlookupd_tcp_addresses":
+			var addrs []string


I feel like this would be cleaner to Unmarshal directly to the expected type inside this switch (ie: all the type checks go away)

    var addrs []string
    if err := json.Unmarshal(body, &addrs); err != nil {
        return nil, http_api.Err{400, "INVALID_VALUE"}
    }
    opts.NSQLookupdTCPAddresses = addrs

any reason not to?

actually i think we can Unmarshal directly to json.Unmarshal(body, &opts.NSQLookupdTCPAddresses) ??

hah, that is much simpler 👍

jehiah · 2015-07-18T01:50:36Z

LGTM aside from comments.

paddycarver · 2015-07-18T05:12:05Z

Looks super great to me.

paddycarver · 2015-07-18T05:22:51Z

nsqd/lookup.go

@@ -129,6 +140,13 @@ func (n *NSQD) lookupLoop() {
 					break
 				}
 			}
+		case <-n.optsNotificationChan:


I may be reading this wrong, but it looks like on every config change, the lookupds will be disconnected. So if I want to change the verbosity, that triggers a lookupd reconnection?

Would it be possible to diff the current lookupPeers and the list in opts to only add/remove the peers necessary? That would at least keep us from triggering a total reconnect each and every time any config value changes, which seems like (if this use case gets expanded upon) something that may happen more and more frequently over time.

you're right, good catch, I'll fix
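
The diffing suggested above can be sketched as follows. `diffPeers` is a hypothetical helper working on plain address strings, not the `lookupPeers` handling in nsqd/lookup.go; it only shows that an unchanged list yields no connects or disconnects.

```go
package main

import "fmt"

// diffPeers (hypothetical) compares the currently connected addresses with the
// desired list and returns which addresses to connect to and which to drop.
// current is assumed to contain no duplicates.
func diffPeers(current, desired []string) (add, remove []string) {
	cur := make(map[string]bool, len(current))
	for _, a := range current {
		cur[a] = true
	}
	want := make(map[string]bool, len(desired))
	for _, a := range desired {
		// Only add addresses that are new and not already queued.
		if !cur[a] && !want[a] {
			add = append(add, a)
		}
		want[a] = true
	}
	for _, a := range current {
		if !want[a] {
			remove = append(remove, a)
		}
	}
	return add, remove
}

func main() {
	current := []string{"10.0.0.1:4160", "10.0.0.2:4160"}

	// Same list: nothing to connect or disconnect.
	add, remove := diffPeers(current, current)
	fmt.Println(add, remove) // prints [] []

	// One peer replaced: connect to the new one, drop the old one.
	add, remove = diffPeers(current, []string{"10.0.0.2:4160", "10.0.0.3:4160"})
	fmt.Println(add, remove) // prints [10.0.0.3:4160] [10.0.0.1:4160]
}
```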

mreiferson · 2015-07-18T06:04:14Z

PTAL @jehiah @paddyforan

paddycarver · 2015-07-18T06:25:29Z

My only concern was addressed 👍

jehiah · 2015-07-18T11:21:31Z

🔨 ?

mreiferson · 2015-07-18T16:01:06Z

ready

nsqd: ability to dynamically reconfigure nsqlookupd addresses

ekristen · 2015-09-10T20:27:06Z

The only problem I see with having an API only update approach and not supporting DNS is that you still have to have an outside force reconfigure each nsqd instance vs nsqd being able to reconfigure itself for multiple lookupd instances based on DNS.

Just my two cents.

mreiferson added feature question request labels Jul 11, 2015

mreiferson changed the title ~~Support DNS-based discovery~~ nsqd: support DNS-based discovery for nsqlookupd addresses Jul 11, 2015

mreiferson changed the title ~~nsqd: support DNS-based discovery for nsqlookupd addresses~~ nsqd: ability to dynamically reconfigure nsqlookupd addresses Jul 11, 2015

mreiferson added 2 commits July 13, 2015 17:26

nsqd: atomic get/set for opts

b15d0ad

nsqd: only close lookupPeer connection when non-nil

36819b6

paddycarver reviewed Jul 14, 2015

mreiferson force-pushed the nsqd_config_endpoint_601 branch 9 times, most recently from 3775cde to 8d804c0 Compare July 17, 2015 16:41

jehiah reviewed Jul 18, 2015

paddycarver reviewed Jul 18, 2015

mreiferson added 10 commits July 18, 2015 08:59

nsqd: ability to reconfigure lookup peers

14af528

nsqd/nsqlookupd: export Options

63a94c0

nsqd: add test for reconfiguring nsqlookupd addresses

39e4e20

nsqd: pre-create empty slices

b3d404a

nsqd: add /config endpoint

a73ad9b

nsqd: add test for /config

06e054c

nsqd: move cluster test/don't rely on external scaffold

5adf12b

test: bump travis to Go 1.4.2 and use container based infra

16d8ca7

nsq*: cleanup HTTP routing (+httprouter)

7e1ff48

nsqd: remove leftover go1.0 compatible files

f387012

mreiferson force-pushed the nsqd_config_endpoint_601 branch from 795dafe to f387012 Compare July 18, 2015 15:59

jehiah added a commit that referenced this pull request Jul 18, 2015

Merge pull request #601 from mreiferson/nsqd_config_endpoint_601

9376ebc

nsqd: ability to dynamically reconfigure nsqlookupd addresses

jehiah merged commit 9376ebc into nsqio:master Jul 18, 2015

mreiferson deleted the nsqd_config_endpoint_601 branch July 18, 2015 16:58

mreiferson mentioned this pull request Aug 25, 2015

nsqd: HTTP pprof endpoints 404 #641

Merged

jehiah mentioned this pull request Sep 10, 2015

nsqd: expand DNS records for "service discovery" #646

Open

harlow mentioned this pull request Jan 15, 2016

nsqd: documentation for dynamic configuration #708

Closed

harlow mentioned this pull request Jan 23, 2016

nsqadmin: config endpoint similar to nsqd #714

Closed
