
Commit

Add setting to force enable or disable collection of Sidekiq cluster-wide metrics (#20)

Use anyway_config for configuration.

Setting `collect_cluster_metrics` is on by default in Sidekiq worker processes and off everywhere else, but it can be force-enabled or force-disabled if the default behavior puts excess load on Redis and/or the monitoring system.

Signed-off-by: Valentin Kiselev <mrexox@evilmartians.com>
Co-authored-by: Andrey Novikov <envek@envek.name>
mrexox and Envek authored May 12, 2021
1 parent 91c8f4e commit fdb1d34
Showing 5 changed files with 91 additions and 30 deletions.
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,17 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

### Added

- Setting `collect_cluster_metrics` that allows force-enabling or force-disabling collection of global (whole Sidekiq installation-wide) metrics. See [#20](https://github.com/yabeda-rb/yabeda-sidekiq/pull/20). [@mrexox]

By default, all Sidekiq worker processes (servers) collect global metrics about the whole Sidekiq installation.
Client processes (everything else that is not a Sidekiq worker) don't by default.

With this config you can override that behavior:
- force-disable it if you don't want multiple Sidekiq workers to report the same numbers (which causes excess load on both Redis and the monitoring system)
- force-enable it if you want a non-Sidekiq process to collect them (such as a dedicated metrics exporter process; see the sketch below)
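
For example, a dedicated exporter process could force-enable collection before Yabeda is configured. This is a minimal sketch, not part of this change; it assumes anyway_config's `YABEDA_SIDEKIQ_*` environment-variable source and the `Yabeda.configure!` call from Yabeda 0.6+:

```ruby
# exporter.rb (hypothetical standalone metrics exporter, not a Sidekiq server process).
# The override must be set before the gem's configuration block is evaluated.
ENV["YABEDA_SIDEKIQ_COLLECT_CLUSTER_METRICS"] = "true"

require "yabeda/sidekiq"

Yabeda.configure! # cluster-wide gauges get registered even though Sidekiq.server? is false
```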

## 0.7.0 - 2020-07-15

### Changed
@@ -63,3 +74,4 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

[@dsalahutdinov]: https://github.com/dsalahutdinov "Salahutdinov Dmitry"
[@asusikov]: https://github.com/asusikov "Alexander Susikov"
[@mrexox]: https://github.com/mrexox "Valentine Kiselev"
28 changes: 24 additions & 4 deletions README.md
@@ -37,19 +37,30 @@ end

## Metrics

### Local per-process metrics

Metrics representing the state of the current Sidekiq worker process and stats of executed or executing jobs:

- Total number of executed jobs: `sidekiq_jobs_executed_total` (segmented by queue and class name)
- Number of jobs that have finished successfully: `sidekiq_jobs_success_total` (segmented by queue and class name)
- Number of jobs that have failed: `sidekiq_jobs_failed_total` (segmented by queue and class name)
- Time of job run: `sidekiq_job_runtime` (seconds per job execution, segmented by queue and class name)
- Job latency: `sidekiq_job_latency` (the difference in seconds between when a job was enqueued and when it started running)
- Maximum runtime of currently executing jobs: `sidekiq_running_job_runtime` (useful for detection of hung jobs, segmented by queue and class name)

### Global cluster-wide metrics

Metrics representing the state of the whole Sidekiq installation (queues, processes, etc.):

- Number of jobs in queues: `sidekiq_jobs_waiting_count` (segmented by queue)
- Queue latency: `sidekiq_queue_latency` (the difference in seconds since the oldest job in the queue was enqueued)
- Number of scheduled jobs: `sidekiq_jobs_scheduled_count`
- Number of jobs in retry set: `sidekiq_jobs_retry_count`
- Number of jobs in dead set (“morgue”): `sidekiq_jobs_dead_count`
- Active processes count: `sidekiq_active_processes`
- Active servers count: `sidekiq_active_workers_count`

By default, all Sidekiq worker processes (servers) collect global metrics about the whole Sidekiq installation. This can be overridden by setting the `collect_cluster_metrics` config key to `true` for non-Sidekiq processes or to `false` for Sidekiq processes (e.g. by setting the `YABEDA_SIDEKIQ_COLLECT_CLUSTER_METRICS` environment variable to `no`; see other methods in the [anyway_config] docs).

## Custom tags

@@ -74,6 +85,14 @@ class MyWorker
end
```
## Configuration

Configuration is handled by the [anyway_config] gem. With it you can load settings from environment variables (upcased and prefixed with `YABEDA_SIDEKIQ_`), YAML files, and other sources. See the [anyway_config] docs for details.

Config key | Type | Default | Description |
------------------------- | -------- | ------------------------------------------------------- | ----------- |
`collect_cluster_metrics` | boolean | Enabled in Sidekiq worker processes, disabled otherwise | Defines whether this Ruby process should collect and expose metrics representing the state of the whole Sidekiq installation (queues, processes, etc.). |
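
A minimal sketch of checking and overriding the effective value (the process scenarios are illustrative; anyway_config also reads a `config/yabeda_sidekiq.yml` file and other sources):

```ruby
require "yabeda/sidekiq"

# Without an override, the default follows the process type (Sidekiq.server?):
Yabeda::Sidekiq::Config.new.collect_cluster_metrics # => true in a Sidekiq worker, false elsewhere

# Force-disable it for extra worker processes so they don't report duplicate cluster-wide numbers:
ENV["YABEDA_SIDEKIQ_COLLECT_CLUSTER_METRICS"] = "false"
Yabeda::Sidekiq::Config.new.collect_cluster_metrics # => false
```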

# Roadmap (TODO or Help wanted)

- Implement optional segmentation of retry/schedule/dead sets
@@ -131,3 +150,4 @@ The gem is available as open source under the terms of the [MIT License](https:/
[Sidekiq]: https://github.com/mperham/sidekiq/ "Simple, efficient background processing for Ruby"
[yabeda]: https://github.com/yabeda-rb/yabeda
[yabeda-prometheus]: https://github.com/yabeda-rb/yabeda-prometheus
[anyway_config]: https://github.com/palkan/anyway_config "Configuration library for Ruby gems and applications"
62 changes: 36 additions & 26 deletions lib/yabeda/sidekiq.rb
@@ -7,6 +7,7 @@
require "yabeda/sidekiq/version"
require "yabeda/sidekiq/client_middleware"
require "yabeda/sidekiq/server_middleware"
require "yabeda/sidekiq/config"

module Yabeda
module Sidekiq
@@ -16,36 +17,47 @@ module Sidekiq
].freeze

Yabeda.configure do
config = Config.new

group :sidekiq

counter :jobs_enqueued_total, tags: %i[queue worker], comment: "A counter of the total number of jobs sidekiq enqueued."

next unless ::Sidekiq.server?

counter :jobs_executed_total, tags: %i[queue worker], comment: "A counter of the total number of jobs sidekiq executed."
counter :jobs_success_total, tags: %i[queue worker], comment: "A counter of the total number of jobs successfully processed by sidekiq."
counter :jobs_failed_total, tags: %i[queue worker], comment: "A counter of the total number of jobs failed in sidekiq."

gauge :jobs_waiting_count, tags: %i[queue], comment: "The number of jobs waiting to process in sidekiq."
gauge :active_workers_count, tags: [], comment: "The number of currently running machines with sidekiq workers."
gauge :jobs_scheduled_count, tags: [], comment: "The number of jobs scheduled for later execution."
gauge :jobs_retry_count, tags: [], comment: "The number of failed jobs waiting to be retried"
gauge :jobs_dead_count, tags: [], comment: "The number of jobs exceeded their retry count."
gauge :active_processes, tags: [], comment: "The number of active Sidekiq worker processes."
gauge :queue_latency, tags: %i[queue], comment: "The queue latency, the difference in seconds since the oldest job in the queue was enqueued"
gauge :running_job_runtime, tags: %i[queue worker], aggregation: :max, unit: :seconds,
comment: "How long currently running jobs are running (useful for detection of hung jobs)"

histogram :job_latency, comment: "The job latency, the difference in seconds between enqueued and running time",
unit: :seconds, per: :job,
tags: %i[queue worker],
buckets: LONG_RUNNING_JOB_RUNTIME_BUCKETS
histogram :job_runtime, comment: "A histogram of the job execution time.",
unit: :seconds, per: :job,
tags: %i[queue worker],
buckets: LONG_RUNNING_JOB_RUNTIME_BUCKETS
if ::Sidekiq.server?
counter :jobs_executed_total, tags: %i[queue worker], comment: "A counter of the total number of jobs sidekiq executed."
counter :jobs_success_total, tags: %i[queue worker], comment: "A counter of the total number of jobs successfully processed by sidekiq."
counter :jobs_failed_total, tags: %i[queue worker], comment: "A counter of the total number of jobs failed in sidekiq."

gauge :running_job_runtime, tags: %i[queue worker], aggregation: :max, unit: :seconds,
comment: "How long currently running jobs are running (useful for detection of hung jobs)"

histogram :job_latency, comment: "The job latency, the difference in seconds between enqueued and running time",
unit: :seconds, per: :job,
tags: %i[queue worker],
buckets: LONG_RUNNING_JOB_RUNTIME_BUCKETS
histogram :job_runtime, comment: "A histogram of the job execution time.",
unit: :seconds, per: :job,
tags: %i[queue worker],
buckets: LONG_RUNNING_JOB_RUNTIME_BUCKETS
end

# Metrics not specific for current Sidekiq process, but representing state of the whole Sidekiq installation (queues, processes, etc)
# You can opt-out from collecting these by setting YABEDA_SIDEKIQ_COLLECT_CLUSTER_METRICS to falsy value (+no+ or +false+)
if config.collect_cluster_metrics # defaults to +::Sidekiq.server?+
gauge :jobs_waiting_count, tags: %i[queue], comment: "The number of jobs waiting to process in sidekiq."
gauge :active_workers_count, tags: [], comment: "The number of currently running machines with sidekiq workers."
gauge :jobs_scheduled_count, tags: [], comment: "The number of jobs scheduled for later execution."
gauge :jobs_retry_count, tags: [], comment: "The number of failed jobs waiting to be retried"
gauge :jobs_dead_count, tags: [], comment: "The number of jobs exceeded their retry count."
gauge :active_processes, tags: [], comment: "The number of active Sidekiq worker processes."
gauge :queue_latency, tags: %i[queue], comment: "The queue latency, the difference in seconds since the oldest job in the queue was enqueued"
end

collect do
Yabeda::Sidekiq.track_max_job_runtime if ::Sidekiq.server?

next unless config.collect_cluster_metrics

stats = ::Sidekiq::Stats.new

stats.queues.each do |k, v|
@@ -61,8 +73,6 @@ module Sidekiq
sidekiq_queue_latency.set({ queue: queue.name }, queue.latency)
end

Yabeda::Sidekiq.track_max_job_runtime

# That is quite slow if your retry set is large
# I don't want to enable it by default
# retries_by_queues =
18 changes: 18 additions & 0 deletions lib/yabeda/sidekiq/config.rb
@@ -0,0 +1,18 @@
# frozen_string_literal: true

require "anyway"

module Yabeda
module Sidekiq
class Config < ::Anyway::Config
config_name :yabeda_sidekiq

# By default all sidekiq worker processes (servers) collects global metrics about whole Sidekiq installation.
# Client processes (everything else that is not Sidekiq worker) by default doesn't.
# With this config you can override this behavior:
# - force disable if you don't want multiple Sidekiq workers to report the same numbers (that causes excess load to both Redis and monitoring)
# - force enable if you want non-Sidekiq process to collect them (like dedicated metric exporter process)
attr_config collect_cluster_metrics: ::Sidekiq.server?
end
end
end
1 change: 1 addition & 0 deletions yabeda-sidekiq.gemspec
@@ -22,6 +22,7 @@ Gem::Specification.new do |spec|
spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
spec.require_paths = ["lib"]

spec.add_dependency "anyway_config", ">= 1.3", "< 3"
spec.add_dependency "sidekiq"
spec.add_dependency "yabeda", "~> 0.6"

