Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Systemd refactor #1254

Merged
merged 6 commits into from
Feb 11, 2019
Merged

Systemd refactor #1254

merged 6 commits into from
Feb 11, 2019

Conversation

pgier
Copy link
Contributor

@pgier pgier commented Feb 6, 2019

The purpose of this change is primarily to address issue #1201. The
addition of several new systemd metrics in v0.17.0
(pr #1098, #968, #992, #952) caused the overall collection time
to increase by about a factor of 10. The increase is not due
to any particular change, just a consequence of making a lot
more calls over dbus.

The two main design changes here are (1) turn off many of the dbus
metrics by default and (2) execute dbus calls in parallel to speed up
collection. This also organizes metric collection closer to how the
data is organized in systemd. And it gathers metrics which are specific
to "service" unit types in bulk instead of making three separate calls
for TasksMax, TasksCurrent, and NRestarts. This bulk request also allows
for enabling more service unit data in the future without extra dbus calls,
for example memory usage information (#1036).

The parallelization seems to reduce the original collection time to about
50-60% (with all metrics enabled), and the default behaviour with the
additional metrics disabled brings collection time back in line with
v0.16.0.

Closes: #1201

@pgier pgier force-pushed the systemd-refactor branch 2 times, most recently from 441cf29 to 0181a14 Compare February 6, 2019 22:37
@SuperQ
Copy link
Member

SuperQ commented Feb 7, 2019

Why the revert of the unit type label?

@SuperQ
Copy link
Member

SuperQ commented Feb 7, 2019

My original idea was to setup a new dbus Conn in each sub-collector. Simpler and more predictable than having worker pool. We only have 6 sub-collectors, so it's not much more parallel connections.

@pgier
Copy link
Contributor Author

pgier commented Feb 7, 2019

I reverted the service unit type label because it slows down the collection even more. With the change to gather all the service unit properties, I can add it back without a performance hit when that part is enabled. Also, I wasn't sure if it would be better in its own metric, something like _info.

The reason I went with the worker pools is because with sub-collectors, the collection time can only be as fast as the slowest collector. We had three sub-collectors which all are kind of slow (collectUnitStatusMetrics, collectUnitStartTimeMetrics, collectUnitTasksMetrics). Running them in parallel speeds things up, but disabling one or two of them doesn't help much more. With the worker pools the improvement is more proportional when disabling certain metrics because disabling certain metrics means each go routine completes in less time.

However, I agree with you that this is a complicated solution, and maybe not worth the slight performance benefit.

@SuperQ
Copy link
Member

SuperQ commented Feb 7, 2019

WIthout a discussion with the systemd people directly on how/why the dbus interface is so slow, I think just opening a connection per sub-collector is the simplest way forward.

@pgier
Copy link
Contributor Author

pgier commented Feb 7, 2019

@SuperQ ok, I'll work on a new PR or force push over this one using the sub-collector design. Should I combine collectUnitStatusMetrics, collectUnitTasksMetrics, and getting the service type into a single sub-collector which can be enabled/disabled? The reason for this is that in my testing it seemed faster to make one call to get all the properties vs. four individual property calls.

@SuperQ
Copy link
Member

SuperQ commented Feb 7, 2019

We always want to get the service type, so the labels stay consistent.

@pgier pgier force-pushed the systemd-refactor branch 3 times, most recently from 766b951 to 5a2043e Compare February 7, 2019 21:36
@pgier
Copy link
Contributor Author

pgier commented Feb 7, 2019

@SuperQ I made the changes you suggested, please take a look, hopefully looks better now :)
I had to revert the type label before applying your changes, and then I re-added it in the final commit.

@SuperQ
Copy link
Member

SuperQ commented Feb 7, 2019

Looks better. Wouldn't it be better to open a new dbus Conn in each collector? Re-using the same Conn is thread safe and doesn't slow things down?

@pgier
Copy link
Contributor Author

pgier commented Feb 7, 2019

Using multiple connections didn't seem to make a difference when I tested it. And according to the godbus docs, it's ok to share a connection between multiple go routines (https://godoc.org/github.com/godbus/dbus#Conn).

@SuperQ
Copy link
Member

SuperQ commented Feb 8, 2019

I added one more flag patch on top of this:

diff --git a/collector/systemd_linux.go b/collector/systemd_linux.go
index 202adc4..509c03a 100644
--- a/collector/systemd_linux.go
+++ b/collector/systemd_linux.go
@@ -34,6 +34,8 @@ var (
        unitBlacklist     = kingpin.Flag("collector.systemd.unit-blacklist", "Regexp of systemd units to blacklist. Units must both match whitelist and not match blacklist to be included.").Default(".+\\.scope").String()
        systemdPrivate    = kingpin.Flag("collector.systemd.private", "Establish a private, direct connection to systemd without dbus.").Bool()
        enableTaskMetrics = kingpin.Flag("collector.systemd.enable-task-metrics", "Enables service unit tasks metrics unit_tasks_current and unit_tasks_max").Bool()
+       enableRestartsMetrics = kingpin.Flag("collector.systemd.enable-restarts-metrics", "Enables service unit metric service_restart_total").Bool()
+       enableStartTimeMetrics = kingpin.Flag("collector.systemd.enable-start-time-metrics", "Enables service unit metric unit_start_time_seconds").Bool()
 )
 
 type systemdCollector struct {
@@ -157,13 +159,15 @@ func (c *systemdCollector) Update(ch chan<- prometheus.Metric) error {
                log.Debugf("systemd collectUnitStatusMetrics took %f", time.Since(begin).Seconds())
        }()
 
-       wg.Add(1)
-       go func() {
-               defer wg.Done()
-               begin = time.Now()
-               c.collectUnitStartTimeMetrics(conn, ch, units)
-               log.Debugf("systemd collectUnitStartTimeMetrics took %f", time.Since(begin).Seconds())
-       }()
+       if *enableStartTimeMetrics {
+               wg.Add(1)
+               go func() {
+                       defer wg.Done()
+                       begin = time.Now()
+                       c.collectUnitStartTimeMetrics(conn, ch, units)
+                       log.Debugf("systemd collectUnitStartTimeMetrics took %f", time.Since(begin).Seconds())
+               }()
+       }
 
        if *enableTaskMetrics {
                wg.Add(1)
@@ -224,7 +228,7 @@ func (c *systemdCollector) collectUnitStatusMetrics(conn *dbus.Conn, ch chan<- p
                                c.unitDesc, prometheus.GaugeValue, isActive,
                                unit.Name, stateName, serviceType)
                }
-               if strings.HasSuffix(unit.Name, ".service") {
+               if *enableRestartsMetrics && strings.HasSuffix(unit.Name, ".service") {
                        // NRestarts wasn't added until systemd 235.
                        restartsCount, err := conn.GetUnitTypeProperty(unit.Name, "Service", "NRestarts")
                        if err != nil {

systemd-timing-2019-02-08

Copy link
Member

@SuperQ SuperQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thanks!

@SuperQ
Copy link
Member

SuperQ commented Feb 9, 2019

Let's add a CHANGELOG entry for this mentioning the breaking change to the default list of metrics exported.

pgier and others added 6 commits February 11, 2019 07:32
prometheus#1229)"

This reverts commit 40dce45.

Signed-off-by: Paul Gier <pgier@redhat.com>
Add timing information to the systemd gathering.

Signed-off-by: Ben Kochie <superq@gmail.com>
Move the gathering of systemd information to collection time.
* Use a single dbus connection.
* Move `getSystemState()` into `collectSystemState()` function.

Signed-off-by: Ben Kochie <superq@gmail.com>
This reduces the system metric collection time by using a wait group
and go routines to allow the systemd metric calls happen concurrently.

Also, makes the tasks_max and tasks_current metrics disabled by default
because these can be time consuming to gather.

And re-adds the service unit "type" label.  It was reverted previously
to avoid a merge conflict.

Signed-off-by: Paul Gier <pgier@redhat.com>
Patch from SuperQ.

Signed-off-by: Paul Gier <pgier@redhat.com>
Signed-off-by: Paul Gier <pgier@redhat.com>
@pgier
Copy link
Contributor Author

pgier commented Feb 11, 2019

Updated the changelog, had to rebase due to changelog conflict with another recent merge.

@SuperQ SuperQ merged commit cb9e23c into prometheus:master Feb 11, 2019
oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this pull request Apr 9, 2024
This reduces the system metric collection time by using a wait group
and go routines to allow the systemd metric calls happen concurrently.

Also, makes the start time, restarts, tasks_max, and tasks_current metrics disabled by default
because these can be time consuming to gather.

Signed-off-by: Paul Gier <pgier@redhat.com>
oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this pull request Apr 9, 2024
This reduces the system metric collection time by using a wait group
and go routines to allow the systemd metric calls happen concurrently.

Also, makes the start time, restarts, tasks_max, and tasks_current metrics disabled by default
because these can be time consuming to gather.

Signed-off-by: Paul Gier <pgier@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Systemd Collector much longer scrape Time after upgrade to 0.17.0
2 participants