Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd/nsqlookupd: properly handle fatal accept errors #1140

Merged
merged 13 commits into from
Mar 3, 2019

Conversation

mdh67899
Copy link
Contributor

@mdh67899 mdh67899 commented Feb 26, 2019

#1138

An alternative way to approach this, that would avoid having to protect Exit() from multiple calls would be to create a fatalErrCh channel, that gets passed to svc.Run. We could close() it here.

@mdh67899 mdh67899 force-pushed the fix-nsqd-tcp-accept-error branch 4 times, most recently from 29b1580 to 2d41590 Compare February 26, 2019 07:08
Copy link
Member

@mreiferson mreiferson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a pass at this, I think we can make this a slightly smaller set of changes and still achieve the desired result.

apps/nsqd/nsqd.go Outdated Show resolved Hide resolved
@mdh67899
Copy link
Contributor Author

/ping @ploxiln @mreiferson. PTAL. :)

@ploxiln
Copy link
Member

ploxiln commented Feb 27, 2019

it looks to me like, after all this, we ended up with something equivalent to #1138 (comment) (check if already doing exit, call NSQD.Exit(), call os.Exit(1))

@mreiferson
Copy link
Member

Indeed, thank you for pointing that out. I still feel like the application (not nsqd "library") should have the responsibility of handling "exiting" and that the library is just missing a way to signal that.

I'm going to try to modify this PR on my own to see if I can get it to a place where I'm reasonably happy with it and then we can discuss.

@ploxiln
Copy link
Member

ploxiln commented Feb 27, 2019

I do have one more idea: send SIGINT to getpid(). Not sure if it works on windows though.

@mreiferson mreiferson force-pushed the fix-nsqd-tcp-accept-error branch from 1ee8ecc to 47c0c54 Compare February 27, 2019 19:21
@mreiferson mreiferson changed the title nsqd/nsqlookupd: send signal to svc.run and use svc.stop to exit when tcp server accept error nsqd/nsqlookupd: properly handle fatal accept errors Feb 27, 2019
@mreiferson mreiferson added the bug label Feb 27, 2019
@mreiferson
Copy link
Member

pushed up my changes, if this seems reasonable I'll apply the same approach to nsqlookupd

@mreiferson
Copy link
Member

(it does not yet even compile, mostly due to logging related bullshit)

@mreiferson mreiferson dismissed their stale review February 27, 2019 19:29

took over PR

@ploxiln
Copy link
Member

ploxiln commented Feb 27, 2019

NSQD.New() returning error makes good sense.

NSQD.Main() becoming synchronous ... is going to change things for tests

@mreiferson
Copy link
Member

NSQD.Main() becoming synchronous ... is going to change things for tests

Yea, for sure, I'll clean all that up. I think the semantics of it being synchronous make more sense to me. I've come around to the idea that libraries shouldn't try to own concurrency in this way, and we've made this mistake in a few places (mostly thinking go-nsq). Also, we've never promised an API guarantee for nsqd as a library, so we don't need to consider backwards compat.

apps/nsqd/nsqd.go Outdated Show resolved Hide resolved
@ploxiln
Copy link
Member

ploxiln commented Feb 27, 2019

I do like the "return err instead of just os.Exit(1)" pattern
(as you say, so the caller can handle appropriately)

@andyxning
Copy link
Member

I do have one more idea: send SIGINT to getpid(). Not sure if it works on windows though.

This is what i think, also. This way we do not need to any api.

nsqd/nsqd.go Outdated Show resolved Hide resolved
@ploxiln
Copy link
Member

ploxiln commented Feb 28, 2019

This way looks OK to me.

@mreiferson
Copy link
Member

Deeper down the rabbit hole...

Commit d6b87a5 improves how LogLevel is parsed (moves responsibility to application side flag parsing, and uses strict types on the nsqd side Options struct). This required changes to go-options.

Commit 76f3702 should start to fix tests, which required moving some initializing to New(), which I'm interested in feedback on.

apps/nsqd/options.go Outdated Show resolved Hide resolved
internal/lg/lg.go Outdated Show resolved Hide resolved
@mreiferson mreiferson force-pushed the fix-nsqd-tcp-accept-error branch from 69be08a to 2c5e657 Compare March 1, 2019 08:07
@mreiferson
Copy link
Member

Finally have a green test run, PTAL and I'll clean up these commits.

@ploxiln
Copy link
Member

ploxiln commented Mar 2, 2019

$ ./build/nsqadmin --verbose --nsqd-http-address=http://127.0.0.1:4151
panic: interface conversion: *app.StringArray is not flag.Getter: missing method Get

goroutine 1 [running]:
github.com/nsqio/nsq/vendor/github.com/mreiferson/go-options.Resolve(0x13a69a0, 0xc0000cc580, 0xc00002e360, 0x0)
	/Users/pierce/go/src/github.com/nsqio/nsq/vendor/github.com/mreiferson/go-options/options.go:86 +0xa3e
main.(*program).Start(0xc00000c440, 0x14be420, 0x17e0bf0)
	/Users/pierce/go/src/github.com/nsqio/nsq/apps/nsqadmin/main.go:105 +0x38a
github.com/nsqio/nsq/vendor/github.com/judwhite/go-svc/svc.Run(0x14c0e40, 0xc00000c440, 0xc00000c460, 0x2, 0x2, 0xc000066f20, 0x0)
	/Users/pierce/go/src/github.com/nsqio/nsq/vendor/github.com/judwhite/go-svc/svc/svc_other.go:20 +0x74
main.main()
	/Users/pierce/go/src/github.com/nsqio/nsq/apps/nsqadmin/main.go:72 +0xbf

@mreiferson
Copy link
Member

Sonofa! Thanks, will fix.

@mreiferson mreiferson force-pushed the fix-nsqd-tcp-accept-error branch from 599e6c0 to 5b58d22 Compare March 2, 2019 06:47
@mreiferson
Copy link
Member

OK, cleaned up the commits 🙏

@ploxiln
Copy link
Member

ploxiln commented Mar 2, 2019

As far as I can tell, this is pretty good. It's not easy to trigger the failure paths ... this is as far as I got with very low ulimit and parallel http requests:

$ ./build/nsqd --tcp-address 0.0.0.0:530 --log-level=debug
[nsqd] 2019/03/02 11:27:03.212215 INFO: nsqd v1.1.1-alpha (built w/go1.11.5)
[nsqd] 2019/03/02 11:27:03.212320 INFO: ID: 136
[nsqd] 2019/03/02 11:27:03.212645 INFO: NSQ: persisting topic/channel metadata to nsqd.dat
[nsqd] 2019/03/02 11:27:03.214231 INFO: TCP: listening on [::]:530
[nsqd] 2019/03/02 11:27:03.214297 INFO: HTTP: listening on [::]:4151
[nsqd] 2019/03/02 11:27:21.991592 WARNING: http: Accept error: accept tcp [::]:4151: accept: too many open files; retrying in 5ms
[nsqd] 2019/03/02 11:27:29.912527 INFO: 200 GET /stats ([::1]:56231) 214.582µs
[nsqd] 2019/03/02 11:27:41.040258 WARNING: http: Accept error: accept tcp [::]:4151: accept: too many open files; retrying in 5ms
[nsqd] 2019/03/02 11:27:45.535751 WARNING: http: Accept error: accept tcp [::]:4151: accept: too many open files; retrying in 10ms
[nsqd] 2019/03/02 11:27:45.814051 WARNING: http: Accept error: accept tcp [::]:4151: accept: too many open files; retrying in 20ms
[nsqd] 2019/03/02 11:27:46.626209 WARNING: http: Accept error: accept tcp [::]:4151: accept: too many open files; retrying in 40ms
[nsqd] 2019/03/02 11:27:52.394460 INFO: 200 GET /stats ([::1]:56230) 204.084µs

Gopkg.lock Outdated Show resolved Hide resolved
@ploxiln
Copy link
Member

ploxiln commented Mar 2, 2019

other than that 👍

@mreiferson mreiferson force-pushed the fix-nsqd-tcp-accept-error branch from 5b58d22 to 9d332eb Compare March 3, 2019 06:23
@mreiferson mreiferson merged commit 3eb619f into nsqio:master Mar 3, 2019
if env.IsWindowsService() {
dir := filepath.Dir(os.Args[0])
return os.Chdir(dir)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mreiferson It doesn't hurt anything to have this, but the package does this bit for you now. I generally consider it impolite to os.Chdir but in this case I thought it was warranted. https://github.com/judwhite/go-svc/blame/d83e900e4a688aad49f41263bebf8793f0bf6add/svc/svc_windows.go#L59-L66

@mdh67899 mdh67899 deleted the fix-nsqd-tcp-accept-error branch March 4, 2019 13:47
@mdh67899 mdh67899 restored the fix-nsqd-tcp-accept-error branch March 4, 2019 13:47
@mdh67899 mdh67899 deleted the fix-nsqd-tcp-accept-error branch March 4, 2019 13:47
@mdh67899 mdh67899 restored the fix-nsqd-tcp-accept-error branch March 4, 2019 13:47
@mdh67899 mdh67899 deleted the fix-nsqd-tcp-accept-error branch June 19, 2019 09:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants