Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd: ability to dynamically reconfigure nsqlookupd addresses #601

Merged
merged 12 commits into from
Jul 18, 2015

Conversation

mreiferson
Copy link
Member

I'd love it if NSQ could support DNS-based discovery for nsqlookupd instances. This would be a light layer on top of the already-proven architecture that would allow ops people to abstract away some of the configuration required to get NSQ working.

Basically, rather than hardcoding nsqlookupd addresses, I'd much rather just say "use all the A records returned by this DNS name". There's also an RFC for using DNS for service discovery. This supports things like Prometheus and makes deployment a snap on Kubernetes. It basically makes it dead simple to use NSQ in a cluster environment, where you don't know what IP address or machine any of your services will be available at, but know you can reach them through a DNS lookup.

More docs on the relevant portion of Kubernetes.

Here's the desired user experience:

  1. I set up a service (logical grouping) of pods (group of containers, think "application cookie cutter") and say I want 4 of them. Kubernetes automatically does the right thing for me, setting them up, tearing them down, restarting them, moving them, whatever.
  2. I configure nsqd to say "do a DNS lookup on nsqlookupd.cluster, and connect to every IP address returned as an A record. Also, every minute/five minutes/ten minutes, check again, and update your connections"
  3. I configure my consumers to say "do a DNS lookup on nsqlookupd.cluster, and connect to every IP address returned as an A record. Also, every minute/five minutes/ten minutes, check again and update your connections"
  4. Everything just works. Containers get restarted, machines get removed, machines get added, containers get moved, and everything just points to the right thing.

Most of this (the DNS querying part) can be achieved simply with wrappers/helpers. The big missing piece is some sort of runtime live-reloading/editing of configuration. Any of the following would work beautifully:

  • A way to easily wrap nsqd inside another Go binary, and use a channel/function/whatever to say "reload configuration/here are new configuration values"
  • A way to say "Here's a configuration file, read the nsqlookupd addresses out of it, and re-check whenever I HUP you" (I'm unsure whether one container can HUP a process in another. If not, this is less ideal, but still probably workable. It'll just take some noodling on how to make it happen. The unix socket approach would be more desirable, honestly)
  • A TCP/unix socket/http/whatever interface that we can use to pass new configuration values into.

Or I'm sure there are myriad better ways to accomplish this. Basically, the important bit is "here are the new nsqlookupd addresses, use them instead, but don't stop running/serving requests, and please don't catch on fire".

I'm making the following assumptions when imagining how this would be run:

  • A fresh nsqlookupd will be able to gather shared state from the registering consumers/nsqd instances quickly-ish (maybe a few seconds, at most?)
  • Assuming each nsqd/consumer connects to at least two nsqlookupd instances, one going away doesn't actually affect anything except your redundancy level.

@paddycarver
Copy link
Contributor Author

Also, this comes after a lengthy Twitter conversation with @mreiferson: https://twitter.com/paddyforan/status/619679434992447488

My initial investigations made this look like a lot of work and an involved feature. Matt says it's easier than I believed, which I will take his word for, because he probably has a better grasp on the codebase than I do.

@mreiferson
Copy link
Member

A way to easily wrap nsqd inside another Go binary, and use a channel/function/whatever to say "reload configuration/here are new configuration values"

nsqd is already importable but it lacks a public API to modify nsqlookupd configuration. I don't really like the idea of having to support that kind of API compatibility, so while it's a good suggestion and we should consider it, I'm 👎 on this.

A way to say "Here's a configuration file, read the nsqlookupd addresses out of it, and re-check whenever I HUP you" (I'm unsure whether one container can HUP a process in another. If not, this is less ideal, but still probably workable. It'll just take some noodling on how to make it happen. The unix socket approach would be more desirable, honestly)

nsqd also already has config file support but lacks any means of reloading it at runtime. HUP signal support feels like something it should have anyway, we would just need to be very clear on what values are sane to change while it's already running. This could be pretty confusing and complicated to implement. A first step could be restrict this to just the nsqlookupd addresses. It doesn't solve the usability issue for you, though. I'm on the fence.

A TCP/unix socket/http/whatever interface that we can use to pass new configuration values into.

This would address the same use case as the suggestion above (ability to dynamically reconfigure a running nsqd) but would have the benefit of being conceptually simpler in terms of implementation. If we took the HTTP endpoint route, it would be a much clearer contract what values you could or could not modify. I'm curious how this would fit into k8s though, e.g. what process would actually make the request to this endpoint? It would also need to do this for all nsqd? Promising, but there are questions.

A fresh nsqlookupd will be able to gather shared state from the registering consumers/nsqd instances quickly-ish (maybe a few seconds, at most?)

This can be tuned by changing the interval between nsqd -> nsqlookupd pings/reconnections.

Assuming each nsqd/consumer connects to at least two nsqlookupd instances, one going away doesn't actually affect anything except your redundancy level.

Yep.

Lastly, none of the above actually talk about your DNS-based discovery suggestion, which is intriguing. It seems like a reasonably encapsulated change for nsqd to adhere to the DNS-based service discovery RFC for nsqlookupd interaction when endpoints are not IP addresses. This seems like the most promising path forward from a usability standpoint and would seem to avoid most (all?) of the problems outlined above.

@mreiferson mreiferson changed the title Support DNS-based discovery nsqd: support DNS-based discovery for nsqlookupd addresses Jul 11, 2015
@paddycarver
Copy link
Contributor Author

This would address the same use case as the suggestion above (ability to dynamically reconfigure a running nsqd) but would have the benefit of being conceptually simpler in terms of implementation. If we took the HTTP endpoint route, it would be a much clearer contract what values you could or could not modify. I'm curious how this would fit into k8s though, e.g. what process would actually make the request to this endpoint? It would also need to do this for all nsqd? Promising, but there are questions.

This is, actually, my preferred implementation, as it is the most versatile when it comes to containers, and is the least-likely to require weird hacks to make it work. I also thought it'd be the one that would be hardest to implement, so I'm glad I'm super wrong.

K8s has the concept of "sidecar containers". Basically, a k8s thinks in "pods", and a "pod" is a group of containers that all get deployed together, and k8s makes sure run together. I'd basically define an nsqd pod that consists of an nsqd container, and a "nsqlookupd-updater" container that's just a tiny little Go program that polls the k8s DNS, gets the lookupd addresses, and turns that into an nsqd update through the API we define. So one container just runs nsqd, without knowing anything about kubernetes, and the other container takes care of turning the kubernetes-specific information into updates to the nsqd config. They get deployed together, and kubernetes makes sure they're both running, and we neatly divide the responsibilities.

Lastly, none of the above actually talk about your DNS-based discovery suggestion, which is intriguing. It seems like a reasonably encapsulated change for nsqd to adhere to the DNS-based service discovery RFC for nsqlookupd interaction when endpoints are not IP addresses. This seems like the most promising path forward from a usability standpoint and would seem to avoid most (all?) of the problems outlined above.

My concern is that this is semi-new and there seems to be some missing consensus around how to do this. Also, there are some minor, annoying variations between implementations. E.g., right now kubernetes uses A records while other things use SRV records, though they plan on updating later. By just providing the tool that says "update nsqlookupd addresses", we don't end up in a position where people want nsqd to know how to update based on:

  • A records
  • SRV records
  • Consul
  • Etcd
  • Changing config files
  • The phase of the moon
  • Etc.

I feel like providing the tools is a better approach than providing the solution in this case, just because there are a lot of possible configurations for the solution, and the amount of software it takes to wrap the tools to work with these configurations is pretty trivial. At the very least, I'd say provide the tools in nsqd itself, and then provide a few implementations as separate apps, much like nsq_to_file and nsq_to_http are implemented.

@mreiferson
Copy link
Member

I feel like providing the tools is a better approach than providing the solution in this case, just because there are a lot of possible configurations for the solution, and the amount of software it takes to wrap the tools to work with these configurations is pretty trivial. At the very least, I'd say provide the tools in nsqd itself, and then provide a few implementations as separate apps, much like nsq_to_file and nsq_to_http are implemented.

Yes, I agree, and would align best with how everything has been designed and built so far. To play devil's advocate, this isn't necessarily the simplest approach from the user's perspective, though. We're talking about needing to bundle another app to sit between nsqd and mediate. This feels a little bit like way back in the early days when we decided to directly support statsd format metrics rather than some other intermediate machinery (and that seemed to have turned out alright).

Curious if any other lurkers have any thoughts (cc @jehiah).

Thanks for taking the time to chat through this @paddyforan!

@mreiferson mreiferson changed the title nsqd: support DNS-based discovery for nsqlookupd addresses nsqd: ability to dynamically reconfigure nsqlookupd addresses Jul 11, 2015
@paddycarver
Copy link
Contributor Author

Two things about the devil's advocate position:

  1. This kind of thing will be most useful, I believe, to people in cluster environments, where "deploy another little thing" is not as big a deal as it is in setups where each server is a pet. I know I'd certainly rather ship another container than find out my discovery method isn't supported and either need to use a hacky solution, or not be able to take advantage of it. And pretty much every conversation I've seen about integrating an application with fleet or kubernetes has always included "...and use another container to bridge..." at one point or another. This is totally unscientific and based on impressions, though.
  2. statsd has enough marketshare in the monitoring space that new entrants feel the need to provide a solution to make running both or migrating from statsd feasible. There's no such dominant player in service discovery yet, so there's no obvious choice for which to support. So that's why this feels different from the statsd support.

But I dunno. Even if we get to a point where one is supported, I can always just use a convoluted wrapper or a light fork to make it fit my use case (which is what I was investigating before I opened this issue).

@mreiferson
Copy link
Member

Cool, let's prototype what an HTTP API might look like for this. My initial thought is that it should be more general than just nsqlookupd_tcp_addresses as there are a few other config variables that would be nice to be able to change at runtime, perhaps something like:

GET /config
    returns JSON body of current config

POST /config
    body is JSON dictionary where keys are the config fields you want to update

@jehiah
Copy link
Member

jehiah commented Jul 12, 2015

I dig a HTTP config query/update API endpoint (as mentioned before it clarifies what values are dynamic vs a HUP reloading of the config file). That seems to be the base needed to make it possible to tackle any sort of new discovery mechanism.

I like the general idea/usability of discovery via DNS. Handling multiple DNS A/AAAA records seems like a straightforward general way of providing some externally managed cluster state (ie: it feels like something i would begin to use); I feel i need to read up on rfc6763 and other common implementations to have a better opinion there.

@paddycarver
Copy link
Contributor Author

    GET /config
         returns JSON body of current config
    POST /config
        body is JSON dictionary where keys are the config fields you want to update

My only complaint is the POST. I'd, personally, prefer PATCH. But I'm nit-picking here. I'd be ecstatic to get this, even with POST.

@@ -596,3 +600,41 @@ func (s *httpServer) printStats(stats []TopicStats, health string, startTime tim
}
return buf.Bytes()
}

type allowedOpts struct {
NSQLookupdTCPAddresses []string `json:"nsqlookupd_tcp_addresses"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be nice to support these as atomic add/remove or overwrite. The way I've done this in the past, which isn't too onerous, is

type allowedOpts struct {
    NSQLookupdTCPAddresses []string `json:"nsqlookupd_tcp_addresses"`
    NSQLookupdTCPAddressesAdd []string `json:"nsqlookupd_tcp_addresses_add"`
    NSQLookupdTCPAddressesRemove []string `json:"nsqlookupd_tcp_addresses_remove"`
}

That way I have the option of overwriting the values completely, or I can just add a new address and remove a different one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why that would be valuable but the API feels wrong. The requirements really argue for specific nsqlookupd config endpoints, which I was hoping to avoid.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, where I'm using this elsewhere, I just continue adding pointers to allowedOpts. The only difference is nsqlookupdtcpaddresses gets 3 properties, instead of 1, and a use specifies at most 2 in any request.

However, I'm nitpicking. This is a Useful Thing because it avoids weird race conditions and provides atomicity, which is Always Useful once you start distributing things. (Look at me, explaining distributing things to you. :trollface:) In practice, I feel like it would be an edge case where this would be an issue. If you feel it's not worth the API bloat, that's totally cool. I thought I'd bring it up while I had the chance :P

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm happy to discuss alternative APIs because now is the time to do it. It's not that I disagree with the benefits, it's that I think we should try to come up with an even better API if we want those semantics.

What about HTTP verbs? What about individual endpoints? If being more specific helps improve the API then I'm all for it...

Implementation wise, the HTTP side of this PR is the trivial stuff, most of the noise here is cleanup and proper handling of (now dynamic) state.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verbs:

  • PATCH uses the add/remove properties
  • PUT uses the replace properties
  • GET returns, as it does now
  • DELETE is unused, because that would be silly

I'm :-/ about the semantics because it's not really REST and is more RPC in spirit. However, the REST way to do it would be to do

  • POST /config/nsqlookupdtcpaddresses/ to add an address
  • PUT /config/nsqlookupdtcpaddresses/ to replace the entire list
  • DELETE /config/nsqlookupdtcpaddresses/{address} to remove an address

But that seems excessive, and as I'm thinking about it, I'm probably overcomplicating matters. Let's step back for a second:

Can we think of a single service discovery mechanism that would give us an atomic view of the world? If we know about a change, we're almost certainly going to want to update the state of the world (in DNS and etcd, at the very least).

As long as nsqd is intelligent about diffing updates (e.g., if it's connected to A & B, and you update the addresses to be A, B, and C, it just connects to C. it doesn't disconnect from A & B) then I'm having trouble thinking of a scenario where I'd actually use this.

Sorry for the knee jerk here, I may be off base about whether this is actually useful.

TL;DR: If we can be declarative instead of imperative, that's probably better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a discussion offline, what if we just implemented PUT /config/<option> for now (the body is the JSON encoded value)? This leaves room for some of these future decisions.

The <option> would match the equivalent key in the config file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would fit my use case. 👍

@mreiferson mreiferson force-pushed the nsqd_config_endpoint_601 branch 9 times, most recently from 3775cde to 8d804c0 Compare July 17, 2015 16:41
@mreiferson
Copy link
Member

RFR @jehiah @paddyforan

I did a ton of cleanup and the overall diff is noisy. The individual commits are useful...

opts := *s.ctx.nsqd.getOpts()
switch opt {
case "nsqlookupd_tcp_addresses":
var addrs []string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this would be cleaner to Unmarshal directly to the expected type inside this switch (ie: all the type checks go away)

var addrs []string
if err := json.Unmarshal(body, &addrs); err != nil {
    return nil, http_api.Err{400, "INVALID_VALUE"}
}
opts.NSQLookupdTCPAddresses = addrs

any reason not to?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually i think we can Unmarshal directly to json.Unmarshal(body, &opts.NSQLookupdTCPAddresses) ??

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hah, that is much simpler 👍

@jehiah
Copy link
Member

jehiah commented Jul 18, 2015

LGTM aside from comments.

@paddycarver
Copy link
Contributor Author

Looks super great to me.

@@ -129,6 +140,13 @@ func (n *NSQD) lookupLoop() {
break
}
}
case <-n.optsNotificationChan:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be reading this wrong, but it looks like on every config change, the lookupds will be disconnected. So if I want to change the verbosity, that triggers a lookupd reconnection?

Would it be possible to diff the current lookupPeers and the list in opts to only add/remove the peers necessary? That would at least keep us from triggering a total reconnect each and every time any config value changes, which seems like (if this use case gets expanded upon) something that may happen more and more frequently over time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right, good catch, I'll fix

@mreiferson
Copy link
Member

PTAL @jehiah @paddyforan

@paddycarver
Copy link
Contributor Author

My only concern was addressed 👍

@jehiah
Copy link
Member

jehiah commented Jul 18, 2015

🔨 ?

@mreiferson mreiferson force-pushed the nsqd_config_endpoint_601 branch from 795dafe to f387012 Compare July 18, 2015 15:59
@mreiferson
Copy link
Member

ready

jehiah added a commit that referenced this pull request Jul 18, 2015
nsqd: ability to dynamically reconfigure nsqlookupd addresses
@jehiah jehiah merged commit 9376ebc into nsqio:master Jul 18, 2015
@mreiferson mreiferson deleted the nsqd_config_endpoint_601 branch July 18, 2015 16:58
@ekristen
Copy link

The only problem I see with having an API only update approach and not supporting DNS is that you still have to have an outside force reconfigure each nsqd instance vs nsqd being able to reconfigure itself for multiple lookupd instances based on DNS.

Just my two cents.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants