Systemd refactor #1254

pgier · 2019-02-06T22:34:28Z

The purpose of this change is primarily to address issue #1201. The
addition of several new systemd metrics in v0.17.0
(pr #1098, #968, #992, #952) caused the overall collection time
to increase by about a factor of 10. The increase is not due
to any particular change, just a consequence of making a lot
more calls over dbus.

The two main design changes here are (1) turn off many of the dbus
metrics by default and (2) execute dbus calls in parallel to speed up
collection. This also organizes metric collection closer to how the
data is organized in systemd. And it gathers metrics which are specific
to "service" unit types in bulk instead of making three separate calls
for TasksMax, TasksCurrent, and NRestarts. This bulk request also allows
for enabling more service unit data in the future without extra dbus calls,
for example memory usage information (#1036).

The parallelization seems to reduce the original collection time to about
50-60% (with all metrics enabled), and the default behaviour with the
additional metrics disabled brings collection time back in line with
v0.16.0.

Closes: #1201

SuperQ · 2019-02-07T08:26:17Z

Why the revert of the unit type label?

SuperQ · 2019-02-07T08:31:41Z

My original idea was to setup a new dbus Conn in each sub-collector. Simpler and more predictable than having worker pool. We only have 6 sub-collectors, so it's not much more parallel connections.

pgier · 2019-02-07T14:15:27Z

I reverted the service unit type label because it slows down the collection even more. With the change to gather all the service unit properties, I can add it back without a performance hit when that part is enabled. Also, I wasn't sure if it would be better in its own metric, something like _info.

The reason I went with the worker pools is because with sub-collectors, the collection time can only be as fast as the slowest collector. We had three sub-collectors which all are kind of slow (collectUnitStatusMetrics, collectUnitStartTimeMetrics, collectUnitTasksMetrics). Running them in parallel speeds things up, but disabling one or two of them doesn't help much more. With the worker pools the improvement is more proportional when disabling certain metrics because disabling certain metrics means each go routine completes in less time.

However, I agree with you that this is a complicated solution, and maybe not worth the slight performance benefit.

SuperQ · 2019-02-07T14:17:53Z

WIthout a discussion with the systemd people directly on how/why the dbus interface is so slow, I think just opening a connection per sub-collector is the simplest way forward.

pgier · 2019-02-07T14:29:22Z

@SuperQ ok, I'll work on a new PR or force push over this one using the sub-collector design. Should I combine collectUnitStatusMetrics, collectUnitTasksMetrics, and getting the service type into a single sub-collector which can be enabled/disabled? The reason for this is that in my testing it seemed faster to make one call to get all the properties vs. four individual property calls.

SuperQ · 2019-02-07T15:27:19Z

We always want to get the service type, so the labels stay consistent.

pgier · 2019-02-07T21:37:01Z

@SuperQ I made the changes you suggested, please take a look, hopefully looks better now :)
I had to revert the type label before applying your changes, and then I re-added it in the final commit.

SuperQ · 2019-02-07T21:48:06Z

Looks better. Wouldn't it be better to open a new dbus Conn in each collector? Re-using the same Conn is thread safe and doesn't slow things down?

pgier · 2019-02-07T22:43:37Z

Using multiple connections didn't seem to make a difference when I tested it. And according to the godbus docs, it's ok to share a connection between multiple go routines (https://godoc.org/github.com/godbus/dbus#Conn).

SuperQ · 2019-02-08T07:29:23Z

I added one more flag patch on top of this:

diff --git a/collector/systemd_linux.go b/collector/systemd_linux.go
index 202adc4..509c03a 100644
--- a/collector/systemd_linux.go
+++ b/collector/systemd_linux.go
@@ -34,6 +34,8 @@ var (
        unitBlacklist     = kingpin.Flag("collector.systemd.unit-blacklist", "Regexp of systemd units to blacklist. Units must both match whitelist and not match blacklist to be included.").Default(".+\\.scope").String()
        systemdPrivate    = kingpin.Flag("collector.systemd.private", "Establish a private, direct connection to systemd without dbus.").Bool()
        enableTaskMetrics = kingpin.Flag("collector.systemd.enable-task-metrics", "Enables service unit tasks metrics unit_tasks_current and unit_tasks_max").Bool()
+       enableRestartsMetrics = kingpin.Flag("collector.systemd.enable-restarts-metrics", "Enables service unit metric service_restart_total").Bool()
+       enableStartTimeMetrics = kingpin.Flag("collector.systemd.enable-start-time-metrics", "Enables service unit metric unit_start_time_seconds").Bool()
 )
 
 type systemdCollector struct {
@@ -157,13 +159,15 @@ func (c *systemdCollector) Update(ch chan<- prometheus.Metric) error {
                log.Debugf("systemd collectUnitStatusMetrics took %f", time.Since(begin).Seconds())
        }()
 
-       wg.Add(1)
-       go func() {
-               defer wg.Done()
-               begin = time.Now()
-               c.collectUnitStartTimeMetrics(conn, ch, units)
-               log.Debugf("systemd collectUnitStartTimeMetrics took %f", time.Since(begin).Seconds())
-       }()
+       if *enableStartTimeMetrics {
+               wg.Add(1)
+               go func() {
+                       defer wg.Done()
+                       begin = time.Now()
+                       c.collectUnitStartTimeMetrics(conn, ch, units)
+                       log.Debugf("systemd collectUnitStartTimeMetrics took %f", time.Since(begin).Seconds())
+               }()
+       }
 
        if *enableTaskMetrics {
                wg.Add(1)
@@ -224,7 +228,7 @@ func (c *systemdCollector) collectUnitStatusMetrics(conn *dbus.Conn, ch chan<- p
                                c.unitDesc, prometheus.GaugeValue, isActive,
                                unit.Name, stateName, serviceType)
                }
-               if strings.HasSuffix(unit.Name, ".service") {
+               if *enableRestartsMetrics && strings.HasSuffix(unit.Name, ".service") {
                        // NRestarts wasn't added until systemd 235.
                        restartsCount, err := conn.GetUnitTypeProperty(unit.Name, "Service", "NRestarts")
                        if err != nil {

SuperQ

LGTM, Thanks!

SuperQ · 2019-02-09T08:56:15Z

Let's add a CHANGELOG entry for this mentioning the breaking change to the default list of metrics exported.

prometheus#1229)" This reverts commit 40dce45. Signed-off-by: Paul Gier <pgier@redhat.com>

Add timing information to the systemd gathering. Signed-off-by: Ben Kochie <superq@gmail.com>

Move the gathering of systemd information to collection time. * Use a single dbus connection. * Move `getSystemState()` into `collectSystemState()` function. Signed-off-by: Ben Kochie <superq@gmail.com>

This reduces the system metric collection time by using a wait group and go routines to allow the systemd metric calls happen concurrently. Also, makes the tasks_max and tasks_current metrics disabled by default because these can be time consuming to gather. And re-adds the service unit "type" label. It was reverted previously to avoid a merge conflict. Signed-off-by: Paul Gier <pgier@redhat.com>

Patch from SuperQ. Signed-off-by: Paul Gier <pgier@redhat.com>

Signed-off-by: Paul Gier <pgier@redhat.com>

pgier · 2019-02-11T13:34:09Z

Updated the changelog, had to rebase due to changelog conflict with another recent merge.

This reduces the system metric collection time by using a wait group and go routines to allow the systemd metric calls happen concurrently. Also, makes the start time, restarts, tasks_max, and tasks_current metrics disabled by default because these can be time consuming to gather. Signed-off-by: Paul Gier <pgier@redhat.com>

pgier force-pushed the systemd-refactor branch 2 times, most recently from 441cf29 to 0181a14 Compare February 6, 2019 22:37

pgier mentioned this pull request Feb 6, 2019

WIP: Refactor systemd collection #1205

Closed

pgier force-pushed the systemd-refactor branch 3 times, most recently from 766b951 to 5a2043e Compare February 7, 2019 21:36

SuperQ approved these changes Feb 9, 2019

View reviewed changes

pgier and others added 6 commits February 11, 2019 07:32

Revert "collector/systemd: add new label "type" for systemd_unit_state (

358eb68

prometheus#1229)" This reverts commit 40dce45. Signed-off-by: Paul Gier <pgier@redhat.com>

Add debugging to systemd collection

ff51959

Add timing information to the systemd gathering. Signed-off-by: Ben Kochie <superq@gmail.com>

Move systemd gathering to collection time

fd31297

Move the gathering of systemd information to collection time. * Use a single dbus connection. * Move `getSystemState()` into `collectSystemState()` function. Signed-off-by: Ben Kochie <superq@gmail.com>

collector/systemd: flag to enable start time and restarts metrics

6220a46

Patch from SuperQ. Signed-off-by: Paul Gier <pgier@redhat.com>

collector/systemd: update changelog to reflect systemd changes

f3f8875

Signed-off-by: Paul Gier <pgier@redhat.com>

pgier force-pushed the systemd-refactor branch from a2d5cd9 to f3f8875 Compare February 11, 2019 13:33

SuperQ merged commit cb9e23c into prometheus:master Feb 11, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Systemd refactor #1254

Systemd refactor #1254

pgier commented Feb 6, 2019 •

edited by SuperQ

Loading

SuperQ commented Feb 7, 2019

SuperQ commented Feb 7, 2019

pgier commented Feb 7, 2019

SuperQ commented Feb 7, 2019

pgier commented Feb 7, 2019

SuperQ commented Feb 7, 2019

pgier commented Feb 7, 2019

SuperQ commented Feb 7, 2019

pgier commented Feb 7, 2019

SuperQ commented Feb 8, 2019

SuperQ left a comment

SuperQ commented Feb 9, 2019

pgier commented Feb 11, 2019

Systemd refactor #1254

Systemd refactor #1254

Conversation

pgier commented Feb 6, 2019 • edited by SuperQ Loading

SuperQ commented Feb 7, 2019

SuperQ commented Feb 7, 2019

pgier commented Feb 7, 2019

SuperQ commented Feb 7, 2019

pgier commented Feb 7, 2019

SuperQ commented Feb 7, 2019

pgier commented Feb 7, 2019

SuperQ commented Feb 7, 2019

pgier commented Feb 7, 2019

SuperQ commented Feb 8, 2019

SuperQ left a comment

Choose a reason for hiding this comment

SuperQ commented Feb 9, 2019

pgier commented Feb 11, 2019

pgier commented Feb 6, 2019 •

edited by SuperQ

Loading