Systemd refactor #1254
441cf29 to 0181a14
Why the revert of the unit
My original idea was to set up a new dbus Conn in each sub-collector. That is simpler and more predictable than having a worker pool. We only have 6 sub-collectors, so it's not many more parallel connections.
I reverted the service unit
The reason I went with the worker pools is that, with sub-collectors, collection can only be as fast as the slowest collector. We had three sub-collectors that are all fairly slow (collectUnitStatusMetrics, collectUnitStartTimeMetrics, collectUnitTasksMetrics). Running them in parallel speeds things up, but disabling one or two of them doesn't help much more. With the worker pools, the improvement is more proportional when disabling certain metrics, because each goroutine then completes in less time. However, I agree with you that this is a complicated solution, and maybe not worth the slight performance benefit.
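To make the trade-off in this comment concrete, here is a rough back-of-the-envelope model in Go. It is only a sketch: the function names, the cost figures, and the assumption that a worker pool splits work perfectly evenly are all hypothetical and are not measurements from this PR.

```go
package main

import "fmt"

// subCollectorTime models one goroutine per sub-collector: the scrape
// finishes when the slowest enabled sub-collector finishes.
// Costs are in arbitrary, made-up time units.
func subCollectorTime(costs map[string]int, enabled map[string]bool) int {
	slowest := 0
	for name, c := range costs {
		if enabled[name] && c > slowest {
			slowest = c
		}
	}
	return slowest
}

// workerPoolTime models units spread evenly over a pool of workers, each
// worker gathering every enabled metric for its share of units. Under that
// (idealized) assumption the wall time is the total enabled work divided
// by the number of workers, rounded up.
func workerPoolTime(costs map[string]int, enabled map[string]bool, workers int) int {
	total := 0
	for name, c := range costs {
		if enabled[name] {
			total += c
		}
	}
	return (total + workers - 1) / workers
}

func main() {
	// Hypothetical costs for the three slow sub-collectors named above.
	costs := map[string]int{"status": 30, "startTime": 30, "tasks": 30}
	all := map[string]bool{"status": true, "startTime": true, "tasks": true}
	noTasks := map[string]bool{"status": true, "startTime": true}

	fmt.Println("sub-collectors:", subCollectorTime(costs, all), subCollectorTime(costs, noTasks))
	fmt.Println("worker pool:   ", workerPoolTime(costs, all, 3), workerPoolTime(costs, noTasks, 3))
}
```

In this toy model, disabling the tasks metric leaves the sub-collector time unchanged (the other two are just as slow), while the worker-pool time drops in proportion to the work removed.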
Without a discussion with the systemd people directly on how/why the dbus interface is so slow, I think just opening a connection per sub-collector is the simplest way forward.
@SuperQ ok, I'll work on a new PR or force push over this one using the sub-collector design. Should I combine collectUnitStatusMetrics, collectUnitTasksMetrics, and getting the service type into a single sub-collector which can be enabled/disabled? The reason for this is that in my testing it seemed faster to make one call to get all the properties vs. four individual property calls.
We always want to get the service type, so the labels stay consistent.
766b951 to 5a2043e
@SuperQ I made the changes you suggested, please take a look, hopefully looks better now :)
Looks better. Wouldn't it be better to open a new dbus Conn in each collector? Is re-using the same Conn thread safe, and does it not slow things down?
Using multiple connections didn't seem to make a difference when I tested it. And according to the godbus docs, it's ok to share a connection between multiple goroutines (https://godoc.org/github.com/godbus/dbus#Conn).
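The pattern being discussed, several goroutines sharing one connection and being joined with a wait group, can be sketched as follows. To keep the example self-contained it does not use godbus: `propertyReader`, `fakeConn`, and `collectAll` are hypothetical names, and `fakeConn` only imitates a connection that is safe for concurrent use.

```go
package main

import (
	"fmt"
	"sync"
)

// propertyReader is a hypothetical stand-in for the single operation this
// sketch needs from a D-Bus connection. It is not the godbus API.
type propertyReader interface {
	Property(unit, name string) string
}

// fakeConn imitates a connection that is safe for concurrent use by
// guarding its state with a mutex. It also counts calls for illustration.
type fakeConn struct {
	mu    sync.Mutex
	calls int
}

func (c *fakeConn) Property(unit, name string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.calls++
	return unit + ":" + name
}

// collectAll starts one goroutine per property (one "sub-collector" each),
// all sharing the same connection, and waits for every one to finish.
func collectAll(conn propertyReader, units, props []string) map[string][]string {
	results := make(map[string][]string, len(props))
	var mu sync.Mutex // protects results
	var wg sync.WaitGroup

	for _, p := range props {
		wg.Add(1)
		go func(prop string) {
			defer wg.Done()
			vals := make([]string, 0, len(units))
			for _, u := range units {
				vals = append(vals, conn.Property(u, prop))
			}
			mu.Lock()
			results[prop] = vals
			mu.Unlock()
		}(p)
	}
	wg.Wait()
	return results
}

func main() {
	conn := &fakeConn{}
	got := collectAll(conn,
		[]string{"a.service", "b.service"},
		[]string{"ActiveState", "NRestarts"})
	fmt.Println(len(got), conn.calls)
}
```

Whether a shared connection or one connection per sub-collector is faster against a real systemd is an empirical question; this sketch only shows that the code shape is the same either way.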
I added one more flag patch on top of this:
LGTM, Thanks!
Let's add a CHANGELOG entry for this mentioning the breaking change to the default list of metrics exported.
prometheus#1229)" This reverts commit 40dce45. Signed-off-by: Paul Gier <pgier@redhat.com>
Add timing information to the systemd gathering. Signed-off-by: Ben Kochie <superq@gmail.com>
Move the gathering of systemd information to collection time. * Use a single dbus connection. * Move `getSystemState()` into `collectSystemState()` function. Signed-off-by: Ben Kochie <superq@gmail.com>
This reduces the system metric collection time by using a wait group and goroutines to allow the systemd metric calls to happen concurrently. It also makes the tasks_max and tasks_current metrics disabled by default, because these can be time consuming to gather, and re-adds the service unit "type" label, which was reverted previously to avoid a merge conflict. Signed-off-by: Paul Gier <pgier@redhat.com>
Patch from SuperQ. Signed-off-by: Paul Gier <pgier@redhat.com>
Signed-off-by: Paul Gier <pgier@redhat.com>
a2d5cd9 to f3f8875
Updated the changelog, had to rebase due to changelog conflict with another recent merge.
This reduces the system metric collection time by using a wait group and goroutines to allow the systemd metric calls to happen concurrently. It also makes the start time, restarts, tasks_max, and tasks_current metrics disabled by default, because these can be time consuming to gather. Signed-off-by: Paul Gier <pgier@redhat.com>
The purpose of this change is primarily to address issue #1201. The
addition of several new systemd metrics in v0.17.0
(PRs #1098, #968, #992, #952) caused the overall collection time
to increase by about a factor of 10. The increase is not due
to any particular change, just a consequence of making a lot
more calls over dbus.
The two main design changes here are (1) turn off many of the dbus
metrics by default and (2) execute dbus calls in parallel to speed up
collection. This also organizes metric collection closer to how the
data is organized in systemd. And it gathers metrics which are specific
to "service" unit types in bulk instead of making three separate calls
for TasksMax, TasksCurrent, and NRestarts. This bulk request also allows
for enabling more service unit data in the future without extra dbus calls,
for example memory usage information (#1036).
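The bulk-request idea can be illustrated with a small Go sketch. D-Bus itself offers both a per-property `Get` and a `GetAll` on the `org.freedesktop.DBus.Properties` interface; everything else here (`fakeUnit`, `serviceStats`, the property values, and the round-trip counter) is hypothetical and stands in for a real systemd service unit so that the example runs without a bus.

```go
package main

import "fmt"

// fakeUnit stands in for a service unit reachable over D-Bus. It counts
// round trips so the two strategies can be compared. Values are made up.
type fakeUnit struct {
	props      map[string]uint64
	roundTrips int
}

func newFakeUnit() *fakeUnit {
	return &fakeUnit{props: map[string]uint64{
		"TasksMax":     4915,
		"TasksCurrent": 12,
		"NRestarts":    0,
	}}
}

// Get imitates fetching a single property: one round trip per call.
func (u *fakeUnit) Get(name string) uint64 {
	u.roundTrips++
	return u.props[name]
}

// GetAll imitates fetching every property at once: one round trip total.
func (u *fakeUnit) GetAll() map[string]uint64 {
	u.roundTrips++
	out := make(map[string]uint64, len(u.props))
	for k, v := range u.props {
		out[k] = v
	}
	return out
}

// serviceStats holds the three service-specific values named above.
type serviceStats struct {
	TasksMax, TasksCurrent, NRestarts uint64
}

// collectIndividually makes three separate property calls.
func collectIndividually(u *fakeUnit) serviceStats {
	return serviceStats{
		TasksMax:     u.Get("TasksMax"),
		TasksCurrent: u.Get("TasksCurrent"),
		NRestarts:    u.Get("NRestarts"),
	}
}

// collectInBulk makes one call and picks the values out of the result.
// Reading one more property later would not add a round trip.
func collectInBulk(u *fakeUnit) serviceStats {
	all := u.GetAll()
	return serviceStats{
		TasksMax:     all["TasksMax"],
		TasksCurrent: all["TasksCurrent"],
		NRestarts:    all["NRestarts"],
	}
}

func main() {
	a, b := newFakeUnit(), newFakeUnit()
	collectIndividually(a)
	collectInBulk(b)
	fmt.Println("round trips:", a.roundTrips, "vs", b.roundTrips)
}
```

The sketch ignores the larger payload that a bulk reply carries, so it shows only the reduction in call count, not a guaranteed speedup.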
The parallelization seems to reduce collection time to about
50-60% of the original (with all metrics enabled), and the default
behaviour with the additional metrics disabled brings collection time
back in line with v0.16.0.
Closes: #1201