Help wanted: Bug & MySQL C API review #57

Closed
hedenface opened this issue Sep 8, 2019 · 7 comments

Comments

@hedenface
Contributor

hedenface commented Sep 8, 2019

Currently working through a complete rewrite of NDOUtils (check the ndo-3 branch).

One of the goals is to remove the necessity of the kernel message queue, as this is the source of many admin headaches in larger nagios systems.

I've currently hit a roadblock, and was curious if any previous or new contributors to core or ndo would be willing to take a look and get some fresh eyes on it. @knweiss comes to mind immediately.

Here's the main issue: I'm attempting to avoid a large number of individual INSERT calls to MySQL by building several bulk inserts in a loop. Originally, during the rewrite, we did individual inserts for brevity and to get things working, but initial performance testing once that was complete revealed that something needed to change immediately.

Here is ndo_write_hosts: https://github.com/NagiosEnterprises/ndoutils/blob/ndo-3/src/ndo-startup.c#L527-L810 All of the ancillary data (a host's parent hosts, contacts, contactgroups, and customvars) depends on the host already existing in the nagios_hosts table. So we loop over all hosts, build the appropriate queries, insert the data, and repeat until all hosts have been inserted. Then we loop over them again and build numerous queries for each of the related objects.
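For readers who have not opened the linked file, the batching idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from ndo-startup.c: the `struct batch` type, the `batch_*` helpers, the buffer size, and the single `display_name` column are all hypothetical, and the flush only counts statements where real code would send the query to MySQL.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define QUERY_CAP 4096

/* Hypothetical batch buffer; not the actual NDOUtils structure. */
struct batch {
    char   query[QUERY_CAP];
    size_t len;       /* bytes currently used in query */
    int    rows;      /* value tuples appended since the last flush */
    int    max_rows;  /* plays the role of ndo_max_insert_values */
    int    flushes;   /* number of statements "sent" so far */
};

/* Illustrative statement prefix; the column choice is an assumption. */
static const char *PREFIX =
    "INSERT INTO nagios_hosts (display_name) VALUES ";

/* Start a fresh statement: prefix only, no rows, length reset. */
static void batch_reset(struct batch *b)
{
    b->len  = (size_t)snprintf(b->query, sizeof b->query, "%s", PREFIX);
    b->rows = 0;
}

/* Stand-in for sending the query (e.g. via mysql_real_query). */
static void batch_flush(struct batch *b)
{
    if (b->rows == 0)
        return;
    b->flushes++;
    batch_reset(b);
}

/* Append one value tuple; flush once max_rows tuples are queued.
 * Real code must escape `name` or use bound parameters. */
static int batch_add(struct batch *b, const char *name)
{
    assert(b != NULL && name != NULL);
    size_t room = sizeof b->query - b->len;
    int n = snprintf(b->query + b->len, room, "%s('%s')",
                     b->rows ? "," : "", name);
    if (n < 0 || (size_t)n >= room) {
        b->query[b->len] = '\0';   /* undo the partial write */
        return -1;
    }
    b->len += (size_t)n;
    b->rows++;
    if (b->rows >= b->max_rows)
        batch_flush(b);
    return 0;
}
```

A caller would set `max_rows`, call `batch_reset` once, call `batch_add` for every host, and finish with one `batch_flush` for the remainder. The second pass over the related objects would follow the same pattern with different statements.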

This all works, except that once it gets to the custom variables, I get a segfault. I've narrowed this down, and what seems to be happening is that (char *) var_query_on_update is simply no longer readable. On a large system (15k+ hosts), it usually starts erroring around the 500th host, no matter how large or small ndo_max_insert_values is set (via ndo.cfg).

If anyone has time to review the code and help out - we'd certainly appreciate it.

Likewise, if anyone has experience with the MySQL C API and can point out a flaw, or something in this code that is going to blow up one day, that would also be appreciated. (Keep in mind that all of the functions in ndo-startup.c are currently being rewritten to follow the insertion pattern used by ndo_write_hosts and ndo_write_services.)

Thanks!

@hedenface
Contributor Author

started adding an isolation case: 77320ba

@hedenface
Contributor Author

I believe this issue was successfully replicated here (commit 7d946d9) https://github.com/NagiosEnterprises/ndoutils/blob/7d946d9dadfda89b927a88c26b07099595b5c51b/src/bug-test.c - and this also contains the fix.

Which ...is a bit silly on my part, but that's how these things go, I suppose

*query_len = 0;
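The linked bug-test.c is not reproduced in this thread, so the following is only one plausible reading of why that single line matters: if the running length of a query buffer is not reset when the buffer is flushed, later appends start at a stale offset and eventually run off the end. The `append`, `flush`, and `simulate` helpers and the 64-byte buffer are hypothetical. The sketch bounds-checks every append so that it counts failures where unchecked code would write out of bounds.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BUF_CAP 64

/* Hypothetical helper: append text at offset *query_len.
 * Returns -1 instead of writing past the end of the buffer. */
static int append(char *query, size_t cap, size_t *query_len,
                  const char *text)
{
    assert(query != NULL && query_len != NULL && text != NULL);
    size_t n = strlen(text);
    if (*query_len + n + 1 > cap)
        return -1;
    memcpy(query + *query_len, text, n + 1);
    *query_len += n;
    return 0;
}

/* Hypothetical flush: real code would send the query to MySQL here. */
static void flush(char *query, size_t *query_len, int reset_len)
{
    query[0] = '\0';
    if (reset_len)
        *query_len = 0;   /* the one-line fix quoted above */
}

/* Run `rounds` batches of four appends each and return how many
 * appends were refused because they would have overflowed. */
static int simulate(int reset_len, int rounds)
{
    char   query[BUF_CAP] = "";
    size_t query_len = 0;
    int    failures = 0;

    for (int r = 0; r < rounds; r++) {
        for (int i = 0; i < 4; i++) {
            if (append(query, sizeof query, &query_len, "(?,?,?),") != 0)
                failures++;
        }
        flush(query, &query_len, reset_len);
    }
    return failures;
}
```

Calling `simulate(1, 10)` completes every append, while `simulate(0, 10)` starts refusing appends in the second round because the offset keeps growing across flushes. Without the bounds check, those refused appends would be writes outside the buffer, which is the kind of damage that can corrupt an unrelated pointer.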

@nook24

nook24 commented Sep 16, 2019

I found this issue via your reddit post.
However, have you already checked out Statusengine 2? It's a fully working NDO drop-in replacement. No kernel message queues, no other random issues anymore.
https://statusengine.org/oldstable/
nook24/statusengine#41

@sawolf
Contributor

sawolf commented Sep 16, 2019

@nook24 I hadn't seen that website before, that's pretty nice!

I think for now we still need to have a solution that we "control", but the approach for your project is certainly interesting.

@nook24

nook24 commented Mar 9, 2020

Any news about ndo-3 ? :)

@sawolf
Contributor

sawolf commented Mar 9, 2020

Yea, I suppose I can give an update:

  • The reason it's still not released is that there were quite a few segfaults that were only reproduced non-deterministically. @jomann09 and I have been testing and we think these are all handled, but it's hard to be sure.
  • While going over the old profiling data from when @hedenface was still here, I noticed that we may not have been measuring speedup correctly. CPU load on 50k hosts/services dropped from a peak of 100+ to ~1.27. This is really great if we're processing as many events as we were before, but it's more likely to me that the worker processes aren't getting enough work to do. I still need to do more work on this to verify.
  • You may have noticed that the ndo-3 branch isn't on Github. I was asked to remove it last week because we weren't sure whether we would release the finished product as open-source software. At this point, we don't think there are many people using NDOUtils outside of Nagios XI, and there's not much point in maintaining a public repository if that's true.

@nook24

nook24 commented Mar 11, 2020

Is @hedenface not working on Nagios projects anymore?

> CPU load on 50k hosts/services dropped from a peak of 100+ to ~1.27.

This sounds great. But was NDOUtils producing this high load, or Nagios Core itself, or some other process?

> I was asked to remove it last week because we weren't sure whether we would release the finished product as open-source software.

That's too bad, even though I haven't used NDO since November 2014.
