Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd: not properly creating channels registered in lookupd on seeing a new topic #826

Closed
stephensearles opened this issue Dec 22, 2016 · 37 comments · Fixed by #839
Closed
Labels

Comments

@stephensearles
Copy link

I have a bunch of topics that are produced on a variety of nodes. It's not always clear to me ahead of time which nodes will be producing which topics, but it is certainly some overlapping subset. I have two consumer channels: "archive," for nsq_to_file, and "events," for part of the application. The problem is bringing up a new node: I don't know exactly which topics to create in nsqd, but I do know which topics it might produce, and of those, which channels they should go to.

As it stands, the first message on a topic will get to nsqd, and whichever consumer happens to find it first will see that message on its channel. Since nsqd doesn't yet know about the other channel, it will drop the message before the second consumer has connected.

Some thoughts on how to potentially deal with this:

  • The lookupd http apis seemed promising (though missing some documentation). If information created there were to flow in reverse back to nsqd, say, when a topic/channel is created, all the nsqd's peering with that lookupd could know about that topic-channel association. A fresh nsqd won't register itself as publishing the topic, but if the does get a message to publish on that topic, it could create the known channels immediately to ensure a consumer on those channels will definitely get that first message.
  • As an alternative solution: on a new topic, when lookupd is configured, nsqd could wait until lookupd has (somehow) confirmed all known peers have seen the new producer information and had a chance to connect on any additional channels before dropping messages on the new topic.

Let me know if there's something I'm misunderstanding! Thanks!

@stephensearles stephensearles changed the title nsqd channels pre-warmed by lookupd proposal: nsqd channels pre-warmed by lookupd Dec 22, 2016
@jehiah
Copy link
Member

jehiah commented Dec 22, 2016

Going to close this, If you configure nsqd with nsqlookupd it should already do as you describe. If you are running into a bug w/ that configuration please re-open w/ steps to reproduce.

On the first message for a topic nsqd will query all configured nsqlookupds and will create the right channels as registered in nsqlookupd. If you pre-register topic/channels in nsqlookupd, those will get picked up by nsqd on the first message published.

The nsqadmin /lookup page is also designed to facilitate this, but the nsqlookdup API is available as well. If you'd like to contribute updates to the docs for http://nsq.io/components/nsqlookupd.html that would be awesome =)

@jehiah jehiah closed this as completed Dec 22, 2016
@stephensearles
Copy link
Author

  1. Start a fresh nsqd, pointing to a fresh lookupd
  2. Run a script like:
#!/bin/bash

set -e

curl -X POST --data '{}' 'localhost:4161/channel/create?topic=t1&channel=c1'
curl -X POST --data '{}' 'localhost:4161/channel/create?topic=t1&channel=c2'

trap "trap - SIGTERM && kill -- -$$" SIGINT SIGTERM EXIT

mkdir -p /tmp/a
mkdir -p /tmp/b

nsq_to_file --lookupd-http-address localhost:4161 --topic t1 --channel c1 --output-dir /tmp/a &
nsq_to_file --lookupd-http-address localhost:4161 --topic t1 --channel c2 --output-dir /tmp/b &

tail -f /dev/null & wait

exit 0
  1. publish a message: curl -X POST 'http://localhost:4151/pub?topic=t1' --data '{}'
  2. call /stats on nsqd to see when the message is delivered
$ curl localhost:4151/stats
nsqd v0.3.8 (built w/go1.6.2)
start_time 2016-12-22T13:47:58-08:00
uptime 41.216422588s

Health: OK

   [t1             ] depth: 1     be-depth: 0     msgs: 1        e2e%:

...

[vagrant@developer ~]$ curl localhost:4151/stats
nsqd v0.3.8 (built w/go1.6.2)
start_time 2016-12-22T13:47:58-08:00
uptime 1m17.400538125s

Health: OK

   [t1             ] depth: 1     be-depth: 0     msgs: 1        e2e%:
[vagrant@developer ~]$ curl localhost:4151/stats
nsqd v0.3.8 (built w/go1.6.2)
start_time 2016-12-22T13:47:58-08:00
uptime 1m22.607213837s

Health: OK

   [t1             ] depth: 0     be-depth: 0     msgs: 1        e2e%:
      [c1                       ] depth: 0     be-depth: 0     inflt: 1    def: 0    re-q: 0     timeout: 0     msgs: 1        e2e%:
        [V2 developer:53003      ] state: 3 inflt: 1    rdy: 200  fin: 0        re-q: 0        msgs: 1        connected: 3s
[vagrant@developer ~]$ curl localhost:4151/stats
nsqd v0.3.8 (built w/go1.6.2)
start_time 2016-12-22T13:47:58-08:00
uptime 1m25.03099741s

Health: OK

   [t1             ] depth: 0     be-depth: 0     msgs: 1        e2e%:
      [c1                       ] depth: 0     be-depth: 0     inflt: 1    def: 0    re-q: 0     timeout: 0     msgs: 1        e2e%:
        [V2 developer:53003      ] state: 3 inflt: 1    rdy: 200  fin: 0        re-q: 0        msgs: 1        connected: 6s
[vagrant@developer ~]$ curl localhost:4151/stats
nsqd v0.3.8 (built w/go1.6.2)
start_time 2016-12-22T13:47:58-08:00
uptime 1m26.85534685s

Health: OK

   [t1             ] depth: 0     be-depth: 0     msgs: 1        e2e%:
      [c1                       ] depth: 0     be-depth: 0     inflt: 1    def: 0    re-q: 0     timeout: 0     msgs: 1        e2e%:
        [V2 developer:53003      ] state: 3 inflt: 1    rdy: 200  fin: 0        re-q: 0        msgs: 1        connected: 7s
      [c2                       ] depth: 0     be-depth: 0     inflt: 0    def: 0    re-q: 0     timeout: 0     msgs: 0        e2e%:
        [V2 developer:53010      ] state: 3 inflt: 0    rdy: 200  fin: 0        re-q: 0        msgs: 0        connected: 0

As you can see, the clients connect over some time, and when they do, they don't both end up getting the first message.

I poked around the lookupd code a bit, and it didn't seem like the code in the http api would end up hitting connected nsqds, but I could have missed it.

@stephensearles
Copy link
Author

@jehiah it doesn't seem like I can reopen the issue myself. (I do remember github allowing that in the past, but they must have removed it)

@stephensearles stephensearles changed the title proposal: nsqd channels pre-warmed by lookupd nsqd not properly creating channels registered in lookupd on seeing a new topic Dec 22, 2016
@jehiah jehiah reopened this Dec 22, 2016
@jehiah
Copy link
Member

jehiah commented Dec 22, 2016

Thanks for the detail; can you also provide the nsqd run command you are using, nsqd logs, and the output from nsqlookupd /topics endpoint prior to publishing the first message?

@jehiah
Copy link
Member

jehiah commented Dec 22, 2016

oops actually output from http://127.0.0.1:4161/channels?topic=t1 is what i'm interested in not /topics. That should be what nsqd queries.

For reference, the logic that implements this feature is here

@ploxiln
Copy link
Member

ploxiln commented Dec 22, 2016

Might be a bug in nsqd. I got a pretty small reproducer test.sh:

#!/bin/sh
rm -f nsqd.* t1.*

nsqlookupd &
L_PID=$!
nsqd --lookupd-tcp-address=127.0.0.1:4160 &
N_PID=$!
sleep 1

curl -X POST 'localhost:4161/channel/create?topic=t1&channel=c1'
curl -X POST 'localhost:4161/channel/create?topic=t1&channel=c2'
curl -s localhost:4161/channels?topic=t1 | jq .
curl -s localhost:4151/stats
curl -X POST 'http://localhost:4151/pub?topic=t1' --data '{}'
echo
sleep 1

curl -s localhost:4151/stats
kill $L_PID $N_PID
wait
[pierce@plo-pro tmp]$ ./test.sh
[nsqlookupd] 2016/12/22 17:21:43.330919 nsqlookupd v0.3.8 (built w/go1.6.2)
[nsqlookupd] 2016/12/22 17:21:43.331326 TCP: listening on [::]:4160
[nsqlookupd] 2016/12/22 17:21:43.331361 HTTP: listening on [::]:4161
[nsqd] 2016/12/22 17:21:43.331451 nsqd v0.3.8 (built w/go1.6.2)
[nsqd] 2016/12/22 17:21:43.331598 ID: 602
[nsqd] 2016/12/22 17:21:43.331632 NSQ: persisting topic/channel metadata to nsqd.602.dat
[nsqd] 2016/12/22 17:21:43.339603 TCP: listening on [::]:4150
[nsqd] 2016/12/22 17:21:43.339713 HTTP: listening on [::]:4151
[nsqd] 2016/12/22 17:21:43.339896 LOOKUP(127.0.0.1:4160): adding peer
[nsqd] 2016/12/22 17:21:43.339909 LOOKUP connecting to 127.0.0.1:4160
[nsqlookupd] 2016/12/22 17:21:43.340240 TCP: new client(127.0.0.1:52446)
[nsqlookupd] 2016/12/22 17:21:43.340266 CLIENT(127.0.0.1:52446): desired protocol magic '  V1'
[nsqlookupd] 2016/12/22 17:21:43.340470 CLIENT(127.0.0.1:52446): IDENTIFY Address:plo-pro.local TCP:4150 HTTP:4151 Version:0.3.8
[nsqlookupd] 2016/12/22 17:21:43.340500 DB: client(127.0.0.1:52446) REGISTER category:client key: subkey:
[nsqd] 2016/12/22 17:21:43.340782 LOOKUPD(127.0.0.1:4160): peer info {TCPPort:4160 HTTPPort:4161 Version:0.3.8 BroadcastAddress:}
[nsqlookupd] 2016/12/22 17:21:44.342738 DB: adding channel(c1) in topic(t1)
[nsqlookupd] 2016/12/22 17:21:44.342754 DB: adding topic(t1)
[nsqlookupd] 2016/12/22 17:21:44.342768 200 POST /channel/create?topic=t1&channel=c1 ([::1]:52447) 131.387µs
[nsqlookupd] 2016/12/22 17:21:44.354372 DB: adding channel(c2) in topic(t1)
[nsqlookupd] 2016/12/22 17:21:44.354393 DB: adding topic(t1)
[nsqlookupd] 2016/12/22 17:21:44.354408 200 POST /channel/create?topic=t1&channel=c2 ([::1]:52448) 69.842µs
[nsqlookupd] 2016/12/22 17:21:44.366965 200 GET /channels?topic=t1 ([::1]:52449) 38.996µs
{
  "status_code": 200,
  "status_txt": "OK",
  "data": {
    "channels": [
      "c1",
      "c2"
    ]
  }
}
[nsqd] 2016/12/22 17:29:26.756678 200 GET /stats ([::1]:52540) 321.305µs
nsqd v0.3.8 (built w/go1.6.2)
start_time 2016-12-22T17:29:25-05:00
uptime 1.012459106s

NO_TOPICS
[nsqd] 2016/12/22 17:21:44.379020 TOPIC(t1): created
[nsqd] 2016/12/22 17:21:44.379071 NSQ: persisting topic/channel metadata to nsqd.602.dat
[nsqd] 2016/12/22 17:21:44.379263 LOOKUPD(127.0.0.1:4160): topic REGISTER t1
[nsqlookupd] 2016/12/22 17:21:44.379466 DB: client(127.0.0.1:52446) REGISTER category:topic key:t1 subkey:
OK
[nsqd] 2016/12/22 17:21:45.396703 200 GET /stats ([::1]:52451) 172.198µs
nsqd v0.3.8 (built w/go1.6.2)
start_time 2016-12-22T17:21:43-05:00
uptime 2.065170256s

Health: OK

   [t1             ] depth: 1     be-depth: 0     msgs: 1        e2e%: 
[nsqd] 2016/12/22 17:21:45.398026 NSQ: persisting topic/channel metadata to nsqd.602.dat
[nsqlookupd] 2016/12/22 17:21:45.398052 HTTP: closing [::]:4161
[nsqlookupd] 2016/12/22 17:21:45.398053 TCP: closing [::]:4160
[nsqd] 2016/12/22 17:21:45.398221 TCP: closing [::]:4150
[nsqd] 2016/12/22 17:21:45.398297 HTTP: closing [::]:4151
[nsqd] 2016/12/22 17:21:45.406739 NSQ: closing topics
[nsqd] 2016/12/22 17:21:45.406765 TOPIC(t1): closing
[nsqd] 2016/12/22 17:21:45.406798 TOPIC(t1): closing ... messagePump
[nsqd] 2016/12/22 17:21:45.406809 TOPIC(t1): flushing 1 memory messages to backend
[nsqd] 2016/12/22 17:21:45.406889 DISKQUEUE(t1): writeOne() opened t1.diskqueue.000000.dat
[nsqd] 2016/12/22 17:21:45.407050 DISKQUEUE(t1): readOne() opened t1.diskqueue.000000.dat
[nsqd] 2016/12/22 17:21:45.407099 DISKQUEUE(t1): closing
[nsqd] 2016/12/22 17:21:45.407109 DISKQUEUE(t1): closing ... ioLoop
[nsqd] 2016/12/22 17:21:45.407657 QUEUESCAN: closing
[nsqd] 2016/12/22 17:21:45.407677 LOOKUP: closing
[nsqd] 2016/12/22 17:21:45.407683 ID: closing

(EDIT: small updates to script so it's a bit less dirty)
(EDIT2: added one more /stats request before the /pub)

@stephensearles
Copy link
Author

#!/bin/bash

set -e

trap "trap - SIGTERM && kill -- -$$" SIGINT SIGTERM EXIT

# clear out any previous state
rm -rf /var/lib/nsqd/*

# start lookupd and nsqd
nsqlookupd &
sleep 1

nsqd --broadcast-address api.developer.corp.shipwire.com --lookupd-tcp-address localhost:4160 --data-path=/var/lib/nsqd &
sleep 1

# create the topic/channels in lookupd
curl -s -X POST 'localhost:4161/channel/create?topic=t1&channel=c1'
curl -s -X POST 'localhost:4161/channel/create?topic=t1&channel=c2'

# prepare dirs for nsq_to_file
rm -rf /tmp/a /tmp/b
mkdir -p /tmp/a
mkdir -p /tmp/b

# verify lookupd has the topics
curl -s localhost:4151/stats
curl -s 'localhost:4161/channels?topic=t1'

nsq_to_file --lookupd-http-address localhost:4161 --topic t1 --channel c1 --output-dir /tmp/a &
nsq_to_file --lookupd-http-address localhost:4161 --topic t1 --channel c2 --output-dir /tmp/b &

curl -s -X POST 'localhost:4151/pub?topic=t1' --data '{}'


while :
do
	curl -s localhost:4151/stats
	curl -s 'localhost:4161/channels?topic=t1'
	sleep 1
done

ok that dumps out all the configs we want, I think. I'm collecting now. It seems like sometimes I get the result we'd expect and sometimes I don't.

@stephensearles
Copy link
Author

stephensearles commented Dec 22, 2016

Ok, I've got both cases logged here: https://gist.github.com/stephensearles/3358571bb5eebdc62aab566dd6e51f80. Not sure what causes the two different cases to occur. Just rerunning the script above produces one or the other.

@ploxiln
Copy link
Member

ploxiln commented Dec 22, 2016

I think that your case where it works out is luck of timing, with both clients connecting and creating their channels in nsqd at almost the same time, before the goroutine for the topic starts draining messages into the channels.

@stephensearles
Copy link
Author

I tried running with the race detector, and I'm seeing this:

fatal error: unexpected signal during runtime execution
[signal 0xb code=0x1 addr=0x34e3958ca1e5 pc=0x10fb90]

goroutine 48 [running]:
runtime.throw(0x57c3a0, 0x2a)
	/usr/local/go/src/runtime/panic.go:547 +0x90 fp=0xc820173720 sp=0xc820173708
runtime.sigpanic()
	/usr/local/go/src/runtime/sigpanic_unix.go:12 +0x5a fp=0xc820173770 sp=0xc820173720
sync.(*Pool).Get(0x72da30, 0x0, 0x0)
	/usr/local/go/src/sync/pool.go:102 +0x40 fp=0xc8201737c0 sp=0xc820173770
fmt.newPrinter(0xc820194300)
	/usr/local/go/src/fmt/print.go:133 +0x27 fp=0xc820173800 sp=0xc8201737c0
fmt.Sprintf(0x5865a0, 0x33, 0xc820173898, 0x1, 0x1, 0x0, 0x0)
	/usr/local/go/src/fmt/print.go:202 +0x2b fp=0xc820173850 sp=0xc820173800
github.com/nsqio/nsq/internal/http_api.Respond(0x85da60, 0xc82016cdd0, 0xc8, 0x4e4c58, 0x2, 0x39f2e0, 0xc8201942a0)
	/Users/stephen/go/src/github.com/nsqio/nsq/internal/http_api/api_response.go:111 +0x43d fp=0xc820173950 sp=0xc820173850
github.com/nsqio/nsq/internal/http_api.NegotiateVersion.func1(0x85da60, 0xc82016cdd0, 0xc82018ba40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/Users/stephen/go/src/github.com/nsqio/nsq/internal/http_api/api_response.go:72 +0x2d7 fp=0xc8201739f0 sp=0xc820173950
github.com/nsqio/nsq/internal/http_api.Decorate.func1(0x85da60, 0xc82016cdd0, 0xc82018ba40, 0x0, 0x0, 0x0)
	/Users/stephen/go/src/github.com/nsqio/nsq/internal/http_api/api_response.go:161 +0x5a fp=0xc820173a48 sp=0xc8201739f0
github.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc820132640, 0x85da60, 0xc82016cdd0, 0xc82018ba40)
	/Users/stephen/go/src/github.com/julienschmidt/httprouter/router.go:344 +0x195 fp=0xc820173b38 sp=0xc820173a48
github.com/nsqio/nsq/nsqlookupd.(*httpServer).ServeHTTP(0xc820130380, 0x85da60, 0xc82016cdd0, 0xc82018ba40)
	/Users/stephen/go/src/github.com/nsqio/nsq/nsqlookupd/http.go:77 +0x58 fp=0xc820173b70 sp=0xc820173b38
net/http.serverHandler.ServeHTTP(0xc820088100, 0x85da60, 0xc82016cdd0, 0xc82018ba40)
	/usr/local/go/src/net/http/server.go:2081 +0x19e fp=0xc820173bd0 sp=0xc820173b70
net/http.(*conn).serve(0xc820162f00)
	/usr/local/go/src/net/http/server.go:1472 +0xf2e fp=0xc820173f98 sp=0xc820173bd0
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc820173fa0 sp=0xc820173f98
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:2137 +0x44e

not sure if related

@stephensearles
Copy link
Author

stephensearles commented Dec 22, 2016

So we acquire the topic lock on line 450 of nsqd, just before we do the lookupd scan, but we release it before indicating the channel update to the message pump. I'm wondering if just moving that unlock to below that select would do it.

Nope, not quite, but I do think I'm on the right track

@ploxiln
Copy link
Member

ploxiln commented Dec 22, 2016

I don't think the segfault (during an Sprintf("... %s ...", err)) or the lock is related. I think I figured it out:

The fix for your setup is to give nsqlookupd --broadcast-address=127.0.0.1

The explanation is that nsqd uses http requests to nsqlookupds to get the list of channels for a topic. What it has is nsq binary protocol connections, so it takes the broadcast address and http port that it got from each nsqlookupd to form the http address. Since you didn't specify, nsqlookupd guessed its public hostname/address based on local system hostname. On your test systems (and mine), that didn't work ;)

(to clarify, my test "passes" for me, if I specify nsqlookupd --broadcast-address of either "127.0.0.1" or "localhost")

@stephensearles
Copy link
Author

stephensearles commented Dec 22, 2016

@ploxiln , not sure I understand how that would cause the message to only go to one consumer when two connected on different channels, both created in lookupd before the message was published.

I do think I found a flaw in the messagePump loop where it is potentially using an array of channels that has become out of date. PR forthcoming.

@ploxiln
Copy link
Member

ploxiln commented Dec 22, 2016

nsqd didn't look up the channels at all, because it can't figure out the http address for nsqlookupd, because it can't figure out the correct broadcast address of nsqlookupd.

In your transcript, nsqd does not request the channels from nsqlookupd. If you tell nsqlookupd its correct broadcast address (ip-address or hostname) for your setup, you'll see that it will, immediately, like so:

[nsqd] 2016/12/22 18:34:07.696150 TOPIC(t1): created
[nsqd] 2016/12/22 18:34:07.696197 CI: querying nsqlookupd http://localhost:4161/channels?topic=t1
[nsqd] 2016/12/22 18:34:07.696331 NSQ: persisting topic/channel metadata to nsqd.602.dat
[nsqd] 2016/12/22 18:34:07.696371 LOOKUPD(127.0.0.1:4160): topic REGISTER t1
[nsqlookupd] 2016/12/22 18:34:07.696547 DB: client(127.0.0.1:52898) REGISTER category:topic key:t1 subkey:
[nsqlookupd] 2016/12/22 18:34:07.698986 200 GET /channels?topic=t1 ([::1]:52904) 29.754µs
[nsqd] 2016/12/22 18:34:07.699661 TOPIC(t1): new channel(c1)
[nsqd] 2016/12/22 18:34:07.699710 TOPIC(t1): new channel(c2)

@stephensearles
Copy link
Author

stephensearles commented Dec 22, 2016

If that's the case, how do the nsq_to_file instances ever find the nsqd? I think you may be right that the hostname issue is involved here, but I think that just encourages the timing to run afoul of this data race.

@stephensearles
Copy link
Author

Oh, I see, it's using my os hostname as the default for lookupd, which is a .local mdns address. That takes a few extra milliseconds to resolve, even though it's still just localhost. That said, that request is still in flight when the message pump is pushing messages through, and it sounds like that is unexpected behavior?

@ploxiln
Copy link
Member

ploxiln commented Dec 23, 2016

You give the nsq_to_file the nsqlookupd-http-address so that's how they contact nsqlookupd.

nsqlookupd and nsqd use a different algorithm to determine the nsqd tcp address to return to nsq_to_file.

I think the .local hostname just doesn't work for these go binaries on OS X, possibly because the binary release build does not use the system libc resolver (so they only work with real dns names or ip addresses, not mdns)

$ nsq_to_file --topic=t1 --channel=c1 --nsqd-tcp-address=plo-pro.local:4150
2016/12/22 19:02:59 INF    1 [t1/c1] (plo-pro.local:4150) connecting to nsqd
2016/12/22 19:02:59 dial tcp: lookup plo-pro.local on 208.67.222.222:53: no such host

@ploxiln
Copy link
Member

ploxiln commented Dec 23, 2016

So, my take-away from all this: maybe there should be a better error message in here somewhere:

[nsqd] 2016/12/22 17:29:25.755419 LOOKUP connecting to 127.0.0.1:4160
[nsqlookupd] 2016/12/22 17:29:25.755745 TCP: new client(127.0.0.1:52536)
[nsqlookupd] 2016/12/22 17:29:25.755799 CLIENT(127.0.0.1:52536): desired protocol magic '  V1'
[nsqlookupd] 2016/12/22 17:29:25.756791 CLIENT(127.0.0.1:52536): IDENTIFY Address:plo-pro.local TCP:4150 HTTP:4151 Version:0.3.8
[nsqlookupd] 2016/12/22 17:29:25.757206 DB: client(127.0.0.1:52536) REGISTER category:client key: subkey:
[nsqd] 2016/12/22 17:29:25.758006 LOOKUPD(127.0.0.1:4160): peer info {TCPPort:4160 HTTPPort:4161 Version:0.3.8 BroadcastAddress:}

(notice BroadcastAddress:})

@stephensearles
Copy link
Author

I guess my question put better: how did lookupd tell nsq_to_file anything about my nsqd if nsqd and lookupd couldn't communicate?

@ploxiln
Copy link
Member

ploxiln commented Dec 23, 2016

The tcp connection nsqd made to nsqlookupd worked, because you specified that address as localhost:4160.

An argument could be made that nsqd should parse that address to get the host part for the http address. But it didn't, it asked nsqlookupd over the tcp connection. And the answer didn't work for it.

@stephensearles
Copy link
Author

I guess it's also just not obvious that nsqd uses lookupd's http interface at all. It doesn't even have a flag to specify it.

@stephensearles
Copy link
Author

Ok, now I'm pretty sure the mdns is not causing it to fail. I added a call to lookupd's /nodes just after it came up, and it shows the producer:

[nsqlookupd] 2016/12/22 16:32:00.609101 200 GET /nodes ([::1]:62355) 543.374µs
{"status_code":200,"status_txt":"OK","data":{"producers":[{"remote_address":"[::1]:62354","hostname":"ssmbpsw.local","broadcast_address":"ssmbpsw.local","tcp_port":4150,"http_port":4151,"version":"0.3.8","tombstones":[],"topics":[]}]}}

@stephensearles
Copy link
Author

Pasting in the full up-to-date script:

#!/bin/bash

set -e

trap "trap - SIGTERM && kill -- -$$" SIGINT SIGTERM EXIT

# clear out any previous state
rm -rf *.dat

# start lookupd and nsqd
nsqlookupd &
sleep 1

./nsqd --lookupd-tcp-address localhost:4160 &
sleep 1

# verify lookupd sees nsqd
curl -s localhost:4161/nodes

# create the topic/channels in lookupd
curl -s -X POST 'localhost:4161/channel/create?topic=t1&channel=c1'
curl -s -X POST 'localhost:4161/channel/create?topic=t1&channel=c2'


# verify lookupd has the topics
curl -s localhost:4151/stats
curl -s 'localhost:4161/channels?topic=t1'

nsq_to_file --lookupd-http-address localhost:4161 --topic t1 --channel c1 &
nsq_to_file --lookupd-http-address localhost:4161 --topic t1 --channel c2 &

curl -s -X POST 'localhost:4151/pub?topic=t1' --data '{}'


while :
do
	curl -s localhost:4151/stats
	curl -s 'localhost:4161/channels?topic=t1'
	sleep 1
done

@ploxiln
Copy link
Member

ploxiln commented Dec 23, 2016

Yes, nsqlookupd knows the address of nsqd (from examining the tcp connection).

Please adapt your test with the trivial work-around which I mentioned many messages ago at this point:

nsqlookupd --broadcast-address=localhost &

EDIT: here's my direct quote:

The fix for your setup is to give nsqlookupd --broadcast-address=127.0.0.1

@stephensearles
Copy link
Author

stephensearles commented Dec 23, 2016

Oh, ok, I think I figured out what I was misunderstanding. The cluster overall figures it out because the TCP connection from nsqd to lookupd is enough to wire things up, but I guess not as gracefully as I expected. I poked in at wireshark and yeah, lookupd responds to nsqd's IDENTIFY basically by parroting back the info it was given, but including none of the other good info it could be returning.

@stephensearles
Copy link
Author

So moving back over to my linux environment where this originally arose, I'm wondering why the BroadcastAddress ends up empty there too, or at least why that breaks this:

[nsqd] 2016/12/22 16:49:36.193780 LOOKUPD(localhost:4160): peer info {TCPPort:4160 HTTPPort:4161 Version:0.3.8 BroadcastAddress:}

On that machine, the os hostname is "developer.corp.shipwire.com". That isn't a real DNS name in the strict sense, but it's defined in the /etc/hosts file on the machine.

@stephensearles
Copy link
Author

stephensearles commented Dec 23, 2016

Ok, I think I figured out why the broadcast address isn't working in that case: the default doesn't seem to be pulled correctly. In github.com/mreiferson/go-options, the Resolve function is documented to say that the final choice is the given default struct value. Looking at the code closer to the action there, it seems like it actually pulls the default flag value instead, and a comment near there says that's what it's doing.

That also explains why I see some parts of NSQ resolving and using mdns names, but this having trouble. It can resolve mdns fine, it just wasn't getting a hostname at all for this piece unless explicitly given.

Of course, I could be wrong! I've been known to do that sometimes ;) I'm working on a code change to try it out.

Btw, @ploxiln thank you for your help through this. I had tried what you said and it worked, but I wanted to keep chasing to understand why. Sorry if I was a bit slow to give up on my original understanding of the bug.

@ploxiln
Copy link
Member

ploxiln commented Dec 23, 2016

It's intended for the default to be the hostname:
https://github.com/nsqio/nsq/blob/master/apps/nsqlookupd/nsqlookupd.go#L69-L71
https://github.com/nsqio/nsq/blob/master/nsqlookupd/options.go#L28-L31

but it does seem possible that there's a mixup in there, introduced while refactoring options a year or so ago...

@jehiah jehiah added the bug label Dec 23, 2016
@stephensearles
Copy link
Author

stephensearles commented Dec 23, 2016

So as this stands, there are a few potential angles here:

  1. The options package should probably behave as documented and return the options struct value. I have a PR open there to do that.
  2. Perhaps nsqd should yell louder if it is talking to a lookupd instance but can't get channel information out of it. (as it stands, the missing broadcast address is only one of many possible problems there)
  3. Perhaps lookupd should be able to inform nsqd about topic/channel relationships over the tcp connection they already have open

Thoughts?

I think (3) makes (2) unnecessary and seems more convenient.

@ploxiln
Copy link
Member

ploxiln commented Dec 23, 2016

(I see you figured out the intended default options already, sorry :)

In general, I think there are a number of "known opportunities for improvement" in this project, but it works well enough for the maintainers' deployments, and not many other people are stepping up to really dig in (myself definitely included there). I think all three of those ideas, if implemented in "good" PRs, would likely be accepted. But I'd wait a bit for @mreiferson to chime in

@jehiah
Copy link
Member

jehiah commented Dec 23, 2016

I think all 3 areas are great improvements. (And thanks @ploxiln for your help troubleshooting here). 3 doesn't completely negate 2 because we want to properly support mixed version topologies.

  1. we might need to think through backwards compatibility issues, but it's a clear bug.
    2a) nsqd should log/yell when it gets a missing broadcast address back from nsqlookupd, and describe implications
    2b) yell more clearly when it can't query any nsqlookupd and describe the implications of what failed. I don't think we want to reject the original msg PUT in this case; it's expected to run multiple nsqlookupd's for HA to mitigate, but there is a degraded case where normal operations would result in multiple channels being pre-created and they will instead be created lazily as consumers connect, causing message loss for some channels. I'm not sure there is a better tradeoff to make here.
  2. nsqd can/should ask over tcp when available; this needs to be feature negotiated so we don't break compatibility w/ existing lookupd's which wouldn't support that new message type. It should be straight forward to do this w/ the nsqlookupd identify response. It will probably be simplest to do that when all provided nsqlookupd's support the feature, and fall back to http query if any don't.

@stephensearles
Copy link
Author

Nice! I'll be happy to help out with these, but I'm going away for a couple weeks and not sure I'll be able to during that time. So, if somebody else gets there before me, cheers! Otherwise, just wanted to call that out instead of just disappearing. :)

@mreiferson mreiferson changed the title nsqd not properly creating channels registered in lookupd on seeing a new topic nsqd: not properly creating channels registered in lookupd on seeing a new topic Dec 23, 2016
@mreiferson
Copy link
Member

@stephensearles thanks for digging into this (and @ploxiln)! I'm in favor of (1) and (2) as nsqlookupd's future is uncertain and doing (3) seems like a poor use of time.

I've commented on mreiferson/go-options#14

@mreiferson
Copy link
Member

sufficiently handled in #831

@ploxiln
Copy link
Member

ploxiln commented Dec 28, 2016

#831 didn't seem to change the behavior for me
https://gist.github.com/ploxiln/da888917cf5b74022fa73a5f990bcd49

Some highlights from the full setup/test log:

[pierce@plo-pro nsq]$ git log --pretty=oneline -n8
866aa081d499a389e372955ed83911bd8742a939 Merge pull request #835 from mreiferson/max-messages
15cc418ce138b6f7b3a15b0eca41b5f0f21c5a28 Merge pull request #831 from mreiferson/get-channel-error
d6965e2dfe9a004c60fcbf9d0f300d3ddea86f1c Merge pull request #834 from mreiferson/dist-owner
da2e4bd4ed1ae43078fb8ac65332198896039d19 nsqd: bound MPUB messages to derived limit
6bb7f54093d5f2d0119e38d6f82a25909e1c0b9f dist: scrub file ownership for tarball
7a485a5883376b4c4178d6e267bb2234eca4171e deps: bump go-options
8e4a9ed0a2aad45100fceb9e59e1a614a9023abb nsqd: log when we can't get channels for pre-creation
135f7f34a2cc49f354eb9728a3da6f912b3cefcd Merge pull request #836 from mreiferson/deflake-diskqueue-test
 ...
[pierce@plo-pro nsq]$ ./tmpbuild.sh 
>> Getting package github.com/BurntSushi/toml
>> Getting package github.com/bitly/go-hostpool
>> Getting package github.com/nsqio/go-nsq
>> Getting package github.com/bitly/go-simplejson
>> Getting package github.com/bmizerany/perks/quantile
>> Getting package github.com/mreiferson/go-options
>> Getting package github.com/golang/snappy
>> Getting package github.com/bitly/timer_metrics
>> Getting package github.com/blang/semver
>> Getting package github.com/julienschmidt/httprouter
>> Getting package github.com/judwhite/go-svc/svc
>> Setting github.com/BurntSushi/toml to version 2dff11163ee667d51dcc066660925a92ce138deb
>> Setting github.com/bitly/go-hostpool to version 58b95b10d6ca26723a7f46017b348653b825a8d6
>> Setting github.com/bitly/go-simplejson to version 18db6e68d8fd9cbf2e8ebe4c81a78b96fd9bf05a
>> Setting github.com/nsqio/go-nsq to version 642a3f9935f12cb3b747294318d730f56f4c34b4
>> Setting github.com/bmizerany/perks/quantile to version 6cb9d9d729303ee2628580d9aec5db968da3a607
>> Setting github.com/mreiferson/go-options to version 2e28e74c79f31a5a0e807baad7f4429ea2703dcb
 ...
[pierce@plo-pro tmp]$ ./test.sh 
[nsqlookupd] 2016/12/28 18:10:41.761355 nsqlookupd v0.3.8 (built w/go1.7.4)
[nsqlookupd] 2016/12/28 18:10:41.761750 TCP: listening on [::]:4160
[nsqlookupd] 2016/12/28 18:10:41.761826 HTTP: listening on [::]:4161
2016/12/28 18:10:41 []
[nsqd] 2016/12/28 18:10:41.762372 nsqd v0.3.8 (built w/go1.7.4)
[nsqd] 2016/12/28 18:10:41.762379 ID: 602
[nsqd] 2016/12/28 18:10:41.762401 NSQ: persisting topic/channel metadata to nsqd.602.dat
[nsqd] 2016/12/28 18:10:41.771287 TCP: listening on [::]:4150
[nsqd] 2016/12/28 18:10:41.771467 HTTP: listening on [::]:4151
[nsqd] 2016/12/28 18:10:41.771605 LOOKUP(127.0.0.1:4160): adding peer
[nsqd] 2016/12/28 18:10:41.771617 LOOKUP connecting to 127.0.0.1:4160
[nsqlookupd] 2016/12/28 18:10:41.772067 TCP: new client(127.0.0.1:65014)
[nsqlookupd] 2016/12/28 18:10:41.772113 CLIENT(127.0.0.1:65014): desired protocol magic '  V1'
[nsqlookupd] 2016/12/28 18:10:41.772444 CLIENT(127.0.0.1:65014): IDENTIFY Address:plo-pro.local TCP:4150 HTTP:4151 Version:0.3.8
[nsqlookupd] 2016/12/28 18:10:41.772475 DB: client(127.0.0.1:65014) REGISTER category:client key: subkey:
[nsqd] 2016/12/28 18:10:41.772930 LOOKUPD(127.0.0.1:4160): peer info {TCPPort:4160 HTTPPort:4161 Version:0.3.8 BroadcastAddress:}
[nsqlookupd] 2016/12/28 18:10:42.770051 DB: adding channel(c1) in topic(t1)
[nsqlookupd] 2016/12/28 18:10:42.770073 DB: adding topic(t1)
[nsqlookupd] 2016/12/28 18:10:42.770097 200 POST /channel/create?topic=t1&channel=c1 ([::1]:65015) 61.093µs
 ...
[nsqd] 2016/12/28 18:10:42.818273 TOPIC(t1): created
[nsqd] 2016/12/28 18:10:42.818310 NSQ: persisting topic/channel metadata to nsqd.602.dat
[nsqd] 2016/12/28 18:10:42.818371 LOOKUPD(127.0.0.1:4160): topic REGISTER t1
[nsqlookupd] 2016/12/28 18:10:42.818536 DB: client(127.0.0.1:65014) REGISTER category:topic key:t1 subkey:
[nsqd] 2016/12/28 18:10:43.837384 200 GET /stats ([::1]:65020) 92.669µs
nsqd v0.3.8 (built w/go1.7.4)
start_time 2016-12-28T18:10:41-05:00
uptime 2.075013854s

Health: OK

   [t1             ] depth: 1     be-depth: 0     msgs: 1        e2e%: 
[nsqd] 2016/12/28 18:10:43.838611 NSQ: persisting topic/channel metadata to nsqd.602.dat

@mreiferson
Copy link
Member

@ploxiln you're expecting me to validate it too?!?!

@ploxiln
Copy link
Member

ploxiln commented Dec 28, 2016

I'll do some debugging, it'll just be a few hours til I get around to it ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants