nsqd: /stats endpoint causes runtime problems #720

Closed
tecbot opened this issue Feb 23, 2016 · 7 comments

@tecbot

tecbot commented Feb 23, 2016

Hey,

we have a really strange issue in our nsq cluster. If we call the /stats route, it causes a runtime issue in the nsqd process that blocks other operations. The end result is that nsqd doesn't respond to new incoming jobs in a reasonable time, so our callers get stuck after a while, consumers can't connect or flicker, and our nsqadmin can't reach the servers anymore because its calls to /stats time out. A first look at the code suggests that there are a lot of locks involved. I'm curious whether other nsq users have seen something like this.

@jehiah
Member

jehiah commented Feb 23, 2016

@tecbot what version are you running? This sounds like something already resolved by #700 #701 #703 #709

@tecbot
Author

tecbot commented Feb 23, 2016

@jehiah thanks for your fast response. We are using the official 0.3.6 release so we don't have the fixes. Is there any reason why there is no new official release which includes the bug fixes? There are 47 new commits after this release.

@mreiferson
Member

we're gonna stamp a release after #718 lands

@tecbot
Author

tecbot commented Feb 23, 2016

@mreiferson thanks for this information, but you should rethink your release strategy; it's not good to have no bug-fix releases for such a critical thing. Just my two cents...

@mreiferson
Member

@tecbot definitely, but from what I can tell it's been "broken" for a really long time. I don't want to suggest that it's not an important set of fixes, but I wouldn't go so far as saying it's "critical".

@tecbot
Author

tecbot commented Feb 23, 2016

@mreiferson we didn't see this issue at first after we upgraded to 0.3.6 from 0.2.x (2 months ago). Then we created an nsq Prometheus exporter that scrapes the stats every 10s, and we completely destroyed our nsq cluster with it. So it was not fun for our platform team ;)

@mreiferson
Member

fun times! 🎲


3 participants