nsqd: not properly creating channels registered in lookupd on seeing a new topic #826
Going to close this. If you configure nsqd with nsqlookupd, it should already do what you describe. If you are running into a bug with that configuration, please re-open with steps to reproduce. On the first message for a topic, nsqd will query all configured nsqlookupds and create the channels registered in nsqlookupd. If you pre-register topics/channels in nsqlookupd, they will get picked up by nsqd on the first message published. The |
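To make the lookup step described above concrete, here is a minimal Go sketch of decoding a channel list for a topic. It is only an illustration: the function name `parseChannels` is hypothetical, and the response shape `{"channels": [...]}` is an assumption about what the `/channels?topic=...` endpoint returns (some versions may wrap it differently). It is not nsqd's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// channelsResponse is an ASSUMED shape for the body returned by
// nsqlookupd's /channels?topic=... endpoint.
type channelsResponse struct {
	Channels []string `json:"channels"`
}

// parseChannels (hypothetical helper) decodes the channel names from a
// lookup response body.
func parseChannels(body []byte) ([]string, error) {
	var resp channelsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decoding channel list: %w", err)
	}
	return resp.Channels, nil
}

func main() {
	// Sample body, standing in for a real HTTP response.
	body := []byte(`{"channels": ["c1", "c2"]}`)
	channels, err := parseChannels(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// A new topic would create each of these before delivering messages.
	for _, name := range channels {
		fmt.Println("would create channel:", name)
	}
}
```

If the request fails or returns nothing, no channels are pre-created, which matches the symptom reported later in this thread.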
As you can see, the clients connect over some time, and when they do, they don't both end up getting the first message. I poked around the lookupd code a bit, and it didn't seem like the code in the http api would end up hitting connected nsqds, but I could have missed it. |
@jehiah it doesn't seem like I can reopen the issue myself. (I do remember github allowing that in the past, but they must have removed it) |
Thanks for the detail; can you also provide the nsqd run command you are using, nsqd logs, and the output from nsqlookupd |
oops actually output from For reference, the logic that implements this feature is here |
Might be a bug in nsqd. I got a pretty small reproducer:
#!/bin/sh
rm -f nsqd.* t1.*
nsqlookupd &
L_PID=$!
nsqd --lookupd-tcp-address=127.0.0.1:4160 &
N_PID=$!
sleep 1
curl -X POST 'localhost:4161/channel/create?topic=t1&channel=c1'
curl -X POST 'localhost:4161/channel/create?topic=t1&channel=c2'
curl -s localhost:4161/channels?topic=t1 | jq .
curl -s localhost:4151/stats
curl -X POST 'http://localhost:4151/pub?topic=t1' --data '{}'
echo
sleep 1
curl -s localhost:4151/stats
kill $L_PID $N_PID
wait
(EDIT: small updates to script so it's a bit less dirty) |
ok that dumps out all the configs we want, I think. I'm collecting now. It seems like sometimes I get the result we'd expect and sometimes I don't. |
Ok, I've got both cases logged here: https://gist.github.com/stephensearles/3358571bb5eebdc62aab566dd6e51f80. Not sure what causes the two different cases to occur. Just rerunning the script above produces one or the other. |
I think that your case where it works out is luck of timing, with both clients connecting and creating their channels in nsqd at almost the same time, before the goroutine for the topic starts draining messages into the channels. |
I tried running with the race detector, and I'm seeing this:
not sure if related |
So we acquire the topic lock on line 450 of nsqd, just before we do the lookupd scan, but we release it before indicating the channel update to the message pump. I'm wondering if just moving that unlock to below that select would do it. Nope, not quite, but I do think I'm on the right track |
I don't think the segfault (during an The fix for your setup is to give nsqlookupd The explanation is that nsqd uses http requests to nsqlookupds to get the list of channels for a topic. What it has is nsq binary protocol connections, so it takes the broadcast address and http port that it got from each nsqlookupd to form the http address. Since you didn't specify, nsqlookupd guessed its public hostname/address based on local system hostname. On your test systems (and mine), that didn't work ;) (to clarify, my test "passes" for me, if I specify nsqlookupd |
@ploxiln , not sure I understand how that would cause the message to only go to one consumer when two connected on different channels, both created in lookupd before the message was published. I do think I found a flaw in the messagePump loop where it is potentially using an array of channels that has become out of date. PR forthcoming. |
nsqd didn't look up the channels at all, because it can't figure out the http address for nsqlookupd, because it can't figure out the correct broadcast address of nsqlookupd. In your transcript, nsqd does not request the channels from nsqlookupd. If you tell nsqlookupd its correct broadcast address (ip-address or hostname) for your setup, you'll see that it will, immediately, like so:
|
If that's the case, how do the nsq_to_file instances ever find the nsqd? I think you may be right that the hostname issue is involved here, but I think that just encourages the timing to run afoul of this data race. |
Oh, I see, it's using my os hostname as the default for lookupd, which is a .local mdns address. That takes a few extra milliseconds to resolve, even though it's still just localhost. That said, that request is still in flight when the message pump is pushing messages through, and it sounds like that is unexpected behavior? |
You give the nsq_to_file the nsqlookupd-http-address so that's how they contact nsqlookupd. nsqlookupd and nsqd use a different algorithm to determine the nsqd tcp address to return to nsq_to_file. I think the .local hostname just doesn't work for these go binaries on OS X, possibly because the binary release build does not use the system libc resolver (so they only work with real dns names or ip addresses, not mdns)
|
So, my take-away from all this: maybe there should be a better error message in here somewhere:
(notice |
I guess my question put better: how did lookupd tell nsq_to_file anything about my nsqd if nsqd and lookupd couldn't communicate? |
The tcp connection nsqd made to nsqlookupd worked, because you specified that address as localhost:4160. An argument could be made that nsqd should parse that address to get the host part for the http address. But it didn't, it asked nsqlookupd over the tcp connection. And the answer didn't work for it. |
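A rough Go sketch of the two ways of forming the HTTP address discussed above. Both function names are hypothetical and this is not nsqd's code; it only contrasts joining a reported broadcast address with the HTTP port against reusing the host part of the configured TCP address.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// httpAddrFromBroadcast (hypothetical) joins the broadcast address and
// HTTP port that a peer reported about itself.
func httpAddrFromBroadcast(broadcastAddress string, httpPort int) string {
	return net.JoinHostPort(broadcastAddress, strconv.Itoa(httpPort))
}

// httpAddrFromTCPAddr (hypothetical) reuses the host part of the TCP
// address that was given on the command line.
func httpAddrFromTCPAddr(tcpAddr string, httpPort int) (string, error) {
	host, _, err := net.SplitHostPort(tcpAddr)
	if err != nil {
		return "", fmt.Errorf("parsing %q: %w", tcpAddr, err)
	}
	return net.JoinHostPort(host, strconv.Itoa(httpPort)), nil
}

func main() {
	// An empty or unresolvable broadcast address yields an unusable target.
	fmt.Printf("%q\n", httpAddrFromBroadcast("", 4161))

	// The configured TCP address is known to be reachable.
	addr, err := httpAddrFromTCPAddr("127.0.0.1:4160", 4161)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", addr)
}
```

The second approach only works when the HTTP and TCP listeners share a host, which is an assumption that may not hold in every deployment.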
I guess it's also just not obvious that nsqd uses lookupd's http interface at all. It doesn't even have a flag to specify it. |
Ok, now I'm pretty sure the mdns is not causing it to fail. I added a call to lookupd's /nodes just after it came up, and it shows the producer:
|
Pasting in the full up-to-date script:
|
Yes, nsqlookupd knows the address of nsqd (from examining the tcp connection). Please adapt your test with the trivial work-around which I mentioned many messages ago at this point:
EDIT: here's my direct quote:
|
Oh, ok, I think I figured out what I was misunderstanding. The cluster overall figures it out because the TCP connection from nsqd to lookupd is enough to wire things up, but I guess not as gracefully as I expected. I poked in at wireshark and yeah, lookupd responds to nsqd's IDENTIFY basically by parroting back the info it was given, but including none of the other good info it could be returning. |
So moving back over to my linux environment where this originally arose, I'm wondering why the BroadcastAddress ends up empty there too, or at least why that breaks this:
On that machine, the os hostname is "developer.corp.shipwire.com". That isn't a real DNS name in the strict sense, but it's defined in the /etc/hosts file on the machine. |
Ok, I think I figured out why the broadcast address isn't working in that case: the default doesn't seem to be pulled correctly. In github.com/mreiferson/go-options, the Resolve function is documented to say that the final choice is the given default struct value. Looking at the code closer to the action there, it seems like it actually pulls the default flag value instead, and a comment near there says that's what it's doing. That also explains why I see some parts of NSQ resolving and using mdns names, but this having trouble. It can resolve mdns fine, it just wasn't getting a hostname at all for this piece unless explicitly given. Of course, I could be wrong! I've been known to do that sometimes ;) I'm working on a code change to try it out. Btw, @ploxiln thank you for your help through this. I had tried what you said and it worked, but I wanted to keep chasing to understand why. Sorry if I was a bit slow to give up on my original understanding of the bug. |
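To illustrate the suspected mixup, here is a simplified Go sketch of the two fallback orders. Everything in it (the `option` type, both resolver functions, and the example hostname) is hypothetical and written for this comment; it is not the go-options implementation, and the suspicion itself may turn out to be wrong.

```go
package main

import "fmt"

// option is a hypothetical, simplified view of the places one setting
// can come from.
type option struct {
	FlagValue     string // value passed on the command line
	FlagSet       bool   // true if the flag was given explicitly
	ConfigValue   string // value from the config file
	ConfigSet     bool   // true if the config file contains the key
	FlagDefault   string // default registered with the flag definition
	StructDefault string // default stored in the options struct
}

// resolveDocumented follows the documented order:
// explicit flag, then config file, then the struct default.
func resolveDocumented(o option) string {
	switch {
	case o.FlagSet:
		return o.FlagValue
	case o.ConfigSet:
		return o.ConfigValue
	default:
		return o.StructDefault
	}
}

// resolveSuspected is the same except that the final fallback is the
// flag's default value, which is the behavior suspected above.
func resolveSuspected(o option) string {
	switch {
	case o.FlagSet:
		return o.FlagValue
	case o.ConfigSet:
		return o.ConfigValue
	default:
		return o.FlagDefault
	}
}

func main() {
	// broadcast-address: the flag default is empty, while the struct
	// default is meant to be the OS hostname ("my-hostname" is made up).
	o := option{FlagDefault: "", StructDefault: "my-hostname"}
	fmt.Printf("documented: %q\n", resolveDocumented(o))
	fmt.Printf("suspected:  %q\n", resolveSuspected(o))
}
```

With nothing set explicitly, the two orders disagree only when the flag default and the struct default differ, which would explain an empty broadcast address.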
It's intended for the default to be the hostname: but it does seem possible that there's a mixup in there, introduced while refactoring options a year or so ago... |
So as this stands, there are a few potential angles here:
Thoughts? I think (3) makes (2) unnecessary and seems more convenient. |
(I see you figured out the intended default options already, sorry :) In general, I think there are a number of "known opportunities for improvement" in this project, but it works well enough for the maintainers' deployments, and not many other people are stepping up to really dig in (myself definitely included there). I think all three of those ideas, if implemented in "good" PRs, would likely be accepted. But I'd wait a bit for @mreiferson to chime in |
I think all 3 areas are great improvements. (And thanks @ploxiln for your help troubleshooting here). 3 doesn't completely negate 2 because we want to properly support mixed version topologies.
|
Nice! I'll be happy to help out with these, but I'm going away for a couple weeks and not sure I'll be able to during that time. So, if somebody else gets there before me, cheers! Otherwise, just wanted to call that out instead of just disappearing. :) |
@stephensearles thanks for digging into this (and @ploxiln)! I'm in favor of (1) and (2) as I've commented on mreiferson/go-options#14 |
sufficiently handled in #831 |
#831 didn't seem to change the behavior for me. Some highlights from the full setup/test log:
|
@ploxiln you're expecting me to validate it too?!?! |
I'll do some debugging, it'll just be a few hours til I get around to it ;) |
I have a bunch of topics that are produced on a variety of nodes. It's not always clear to me ahead of time which nodes will be producing which topics, but it is certainly some overlapping subset. I have two consumer channels: "archive," for nsq_to_file, and "events," for part of the application. The problem is bringing up a new node: I don't know exactly which topics to create in nsqd, but I do know which topics it might produce, and of those, which channels they should go to.
As it stands, the first message on a topic will get to nsqd, and whichever consumer happens to find it first will see that message on its channel. Since nsqd doesn't yet know about the other channel, it will drop the message before the second consumer has connected.
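The behavior described above can be modeled with a small Go sketch. This is a toy model under stated assumptions, not nsqd's implementation: the `topic` type and its methods are hypothetical, and the only rule it encodes is that a message is copied to the channels that exist at the moment it is delivered.

```go
package main

import "fmt"

// topic is a toy model: a backlog of undelivered messages plus a queue
// of delivered messages per channel.
type topic struct {
	backlog  []string            // messages not yet copied to any channel
	channels map[string][]string // channel name -> messages it received
}

func newTopic() *topic {
	return &topic{channels: make(map[string][]string)}
}

// createChannel registers a channel if it does not already exist.
func (t *topic) createChannel(name string) {
	if _, ok := t.channels[name]; !ok {
		t.channels[name] = nil
	}
}

// publish appends a message to the backlog.
func (t *topic) publish(msg string) {
	t.backlog = append(t.backlog, msg)
}

// drain copies every backlog message to each channel that exists right
// now. With no channels, messages stay in the backlog.
func (t *topic) drain() {
	if len(t.channels) == 0 {
		return
	}
	for _, msg := range t.backlog {
		for name := range t.channels {
			t.channels[name] = append(t.channels[name], msg)
		}
	}
	t.backlog = nil
}

func main() {
	// The second channel appears after the first message was delivered.
	late := newTopic()
	late.publish("m1")
	late.createChannel("archive")
	late.drain()
	late.createChannel("events")
	late.drain()
	fmt.Println("late:", len(late.channels["archive"]), len(late.channels["events"]))

	// Both channels exist before the first message is delivered.
	early := newTopic()
	early.createChannel("archive")
	early.createChannel("events")
	early.publish("m1")
	early.drain()
	fmt.Println("early:", len(early.channels["archive"]), len(early.channels["events"]))
}
```

In the first case only "archive" receives the message; in the second, both channels do, which is the outcome wanted when channels are known ahead of time.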
Some thoughts on how to potentially deal with this:
Let me know if there's something I'm misunderstanding! Thanks!