
Commit

Add setting to force enable or disable collection of Sidekiq cluster-wide metrics (#20)

Use anyway_config for configuration.

Setting `collect_cluster_metrics` is on by default in Sidekiq worker processes and off everywhere else, but it can be force-enabled or force-disabled if the default behavior puts excess load on Redis and/or the monitoring system.

Signed-off-by: Valentin Kiselev <mrexox@evilmartians.com>
Co-authored-by: Andrey Novikov <envek@envek.name>
mrexox and Envek authored May 12, 2021
1 parent 91c8f4e commit fdb1d34
Showing 5 changed files with 91 additions and 30 deletions.
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,17 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

### Added

- Setting `collect_cluster_metrics` that allows force-enabling or force-disabling collection of global (whole Sidekiq installation-wide) metrics. See [#20](https://github.com/yabeda-rb/yabeda-sidekiq/pull/20). [@mrexox]

By default, all Sidekiq worker processes (servers) collect global metrics about the whole Sidekiq installation.
Client processes (everything else that is not a Sidekiq worker) don't by default.

With this config you can override that behavior:
- force-disable it if you don't want multiple Sidekiq workers to report the same numbers (which causes excess load on both Redis and the monitoring system)
- force-enable it if you want a non-Sidekiq process to collect them (such as a dedicated metrics exporter process; see the sketch below)
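
For example, a dedicated exporter process could force-enable collection before Yabeda is configured. This is a minimal sketch, not part of this change; it assumes anyway_config's `YABEDA_SIDEKIQ_*` environment-variable source and the `Yabeda.configure!` call from Yabeda 0.6+:

```ruby
# exporter.rb (hypothetical standalone metrics exporter, not a Sidekiq server process).
# The override must be set before the gem's configuration block is evaluated.
ENV["YABEDA_SIDEKIQ_COLLECT_CLUSTER_METRICS"] = "true"

require "yabeda/sidekiq"

Yabeda.configure! # cluster-wide gauges get registered even though Sidekiq.server? is false
```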

## 0.7.0 - 2020-07-15

### Changed
@@ -63,3 +74,4 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

[@dsalahutdinov]: https://github.com/dsalahutdinov "Salahutdinov Dmitry"
[@asusikov]: https://github.com/asusikov "Alexander Susikov"
[@mrexox]: https://github.com/mrexox "Valentine Kiselev"
28 changes: 24 additions & 4 deletions README.md
@@ -37,19 +37,30 @@ end

## Metrics

### Local per-process metrics

Metrics representing the state of the current Sidekiq worker process and stats of executed or executing jobs:

- Total number of executed jobs: `sidekiq_jobs_executed_total` (segmented by queue and class name)
- Number of jobs that have finished successfully: `sidekiq_jobs_success_total` (segmented by queue and class name)
- Number of jobs that have failed: `sidekiq_jobs_failed_total` (segmented by queue and class name)
- Time of job run: `sidekiq_job_runtime` (seconds per job execution, segmented by queue and class name)
- Job latency: `sidekiq_job_latency` (the difference in seconds between when a job was enqueued and when it started running)
- Maximum runtime of currently executing jobs: `sidekiq_running_job_runtime` (useful for detection of hung jobs, segmented by queue and class name)

### Global cluster-wide metrics

Metrics representing the state of the whole Sidekiq installation (queues, processes, etc.):

- Number of jobs in queues: `sidekiq_jobs_waiting_count` (segmented by queue)
- Queue latency: `sidekiq_queue_latency` (the difference in seconds since the oldest job in the queue was enqueued)
- Number of scheduled jobs: `sidekiq_jobs_scheduled_count`
- Number of jobs in retry set: `sidekiq_jobs_retry_count`
- Number of jobs in dead set (“morgue”): `sidekiq_jobs_dead_count`
- Active processes count: `sidekiq_active_processes`
- Active servers count: `sidekiq_active_workers_count`

By default, all Sidekiq worker processes (servers) collect global metrics about the whole Sidekiq installation. This can be overridden by setting the `collect_cluster_metrics` config key to `true` for non-Sidekiq processes or to `false` for Sidekiq processes (e.g. by setting the `YABEDA_SIDEKIQ_COLLECT_CLUSTER_METRICS` environment variable to `no`; see other methods in the [anyway_config] docs).

## Custom tags

@@ -74,6 +85,14 @@ class MyWorker
end
```
## Configuration

Configuration is handled by the [anyway_config] gem. With it you can load settings from environment variables (upcased and prefixed with `YABEDA_SIDEKIQ_`), YAML files, and other sources. See the [anyway_config] docs for details.

Config key | Type | Default | Description |
------------------------- | -------- | ------------------------------------------------------- | ----------- |
`collect_cluster_metrics` | boolean | Enabled in Sidekiq worker processes, disabled otherwise | Defines whether this Ruby process should collect and expose metrics representing the state of the whole Sidekiq installation (queues, processes, etc.). |
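
A minimal sketch of checking and overriding the effective value (the process scenarios are illustrative; anyway_config also reads a `config/yabeda_sidekiq.yml` file and other sources):

```ruby
require "yabeda/sidekiq"

# Without an override, the default follows the process type (Sidekiq.server?):
Yabeda::Sidekiq::Config.new.collect_cluster_metrics # => true in a Sidekiq worker, false elsewhere

# Force-disable it for extra worker processes so they don't report duplicate cluster-wide numbers:
ENV["YABEDA_SIDEKIQ_COLLECT_CLUSTER_METRICS"] = "false"
Yabeda::Sidekiq::Config.new.collect_cluster_metrics # => false
```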

# Roadmap (TODO or Help wanted)

- Implement optional segmentation of retry/schedule/dead sets
@@ -131,3 +150,4 @@ The gem is available as open source under the terms of the [MIT License](https:/
[Sidekiq]: https://github.com/mperham/sidekiq/ "Simple, efficient background processing for Ruby"
[yabeda]: https://github.com/yabeda-rb/yabeda
[yabeda-prometheus]: https://github.com/yabeda-rb/yabeda-prometheus
[anyway_config]: https://github.com/palkan/anyway_config "Configuration library for Ruby gems and applications"
62 changes: 36 additions & 26 deletions lib/yabeda/sidekiq.rb
@@ -7,6 +7,7 @@
require "yabeda/sidekiq/version"
require "yabeda/sidekiq/client_middleware"
require "yabeda/sidekiq/server_middleware"
require "yabeda/sidekiq/config"

module Yabeda
module Sidekiq
@@ -16,36 +17,47 @@ module Sidekiq
].freeze

Yabeda.configure do
config = Config.new

group :sidekiq

counter :jobs_enqueued_total, tags: %i[queue worker], comment: "A counter of the total number of jobs sidekiq enqueued."

next unless ::Sidekiq.server?

counter :jobs_executed_total, tags: %i[queue worker], comment: "A counter of the total number of jobs sidekiq executed."
counter :jobs_success_total, tags: %i[queue worker], comment: "A counter of the total number of jobs successfully processed by sidekiq."
counter :jobs_failed_total, tags: %i[queue worker], comment: "A counter of the total number of jobs failed in sidekiq."

gauge :jobs_waiting_count, tags: %i[queue], comment: "The number of jobs waiting to process in sidekiq."
gauge :active_workers_count, tags: [], comment: "The number of currently running machines with sidekiq workers."
gauge :jobs_scheduled_count, tags: [], comment: "The number of jobs scheduled for later execution."
gauge :jobs_retry_count, tags: [], comment: "The number of failed jobs waiting to be retried"
gauge :jobs_dead_count, tags: [], comment: "The number of jobs exceeded their retry count."
gauge :active_processes, tags: [], comment: "The number of active Sidekiq worker processes."
gauge :queue_latency, tags: %i[queue], comment: "The queue latency, the difference in seconds since the oldest job in the queue was enqueued"
gauge :running_job_runtime, tags: %i[queue worker], aggregation: :max, unit: :seconds,
comment: "How long currently running jobs are running (useful for detection of hung jobs)"

histogram :job_latency, comment: "The job latency, the difference in seconds between enqueued and running time",
unit: :seconds, per: :job,
tags: %i[queue worker],
buckets: LONG_RUNNING_JOB_RUNTIME_BUCKETS
histogram :job_runtime, comment: "A histogram of the job execution time.",
unit: :seconds, per: :job,
tags: %i[queue worker],
buckets: LONG_RUNNING_JOB_RUNTIME_BUCKETS
if ::Sidekiq.server?
counter :jobs_executed_total, tags: %i[queue worker], comment: "A counter of the total number of jobs sidekiq executed."
counter :jobs_success_total, tags: %i[queue worker], comment: "A counter of the total number of jobs successfully processed by sidekiq."
counter :jobs_failed_total, tags: %i[queue worker], comment: "A counter of the total number of jobs failed in sidekiq."

gauge :running_job_runtime, tags: %i[queue worker], aggregation: :max, unit: :seconds,
comment: "How long currently running jobs are running (useful for detection of hung jobs)"

histogram :job_latency, comment: "The job latency, the difference in seconds between enqueued and running time",
unit: :seconds, per: :job,
tags: %i[queue worker],
buckets: LONG_RUNNING_JOB_RUNTIME_BUCKETS
histogram :job_runtime, comment: "A histogram of the job execution time.",
unit: :seconds, per: :job,
tags: %i[queue worker],
buckets: LONG_RUNNING_JOB_RUNTIME_BUCKETS
end

# Metrics not specific for current Sidekiq process, but representing state of the whole Sidekiq installation (queues, processes, etc)
# You can opt-out from collecting these by setting YABEDA_SIDEKIQ_COLLECT_CLUSTER_METRICS to falsy value (+no+ or +false+)
if config.collect_cluster_metrics # defaults to +::Sidekiq.server?+
gauge :jobs_waiting_count, tags: %i[queue], comment: "The number of jobs waiting to process in sidekiq."
gauge :active_workers_count, tags: [], comment: "The number of currently running machines with sidekiq workers."
gauge :jobs_scheduled_count, tags: [], comment: "The number of jobs scheduled for later execution."
gauge :jobs_retry_count, tags: [], comment: "The number of failed jobs waiting to be retried"
gauge :jobs_dead_count, tags: [], comment: "The number of jobs exceeded their retry count."
gauge :active_processes, tags: [], comment: "The number of active Sidekiq worker processes."
gauge :queue_latency, tags: %i[queue], comment: "The queue latency, the difference in seconds since the oldest job in the queue was enqueued"
end

collect do
Yabeda::Sidekiq.track_max_job_runtime if ::Sidekiq.server?

next unless config.collect_cluster_metrics

stats = ::Sidekiq::Stats.new

stats.queues.each do |k, v|
@@ -61,8 +73,6 @@ module Sidekiq
sidekiq_queue_latency.set({ queue: queue.name }, queue.latency)
end

Yabeda::Sidekiq.track_max_job_runtime

# That is quite slow if your retry set is large
# I don't want to enable it by default
# retries_by_queues =
18 changes: 18 additions & 0 deletions lib/yabeda/sidekiq/config.rb
@@ -0,0 +1,18 @@
# frozen_string_literal: true

require "anyway"

module Yabeda
module Sidekiq
class Config < ::Anyway::Config
config_name :yabeda_sidekiq

# By default all sidekiq worker processes (servers) collects global metrics about whole Sidekiq installation.
# Client processes (everything else that is not Sidekiq worker) by default doesn't.
# With this config you can override this behavior:
# - force disable if you don't want multiple Sidekiq workers to report the same numbers (that causes excess load to both Redis and monitoring)
# - force enable if you want non-Sidekiq process to collect them (like dedicated metric exporter process)
attr_config collect_cluster_metrics: ::Sidekiq.server?
end
end
end
1 change: 1 addition & 0 deletions yabeda-sidekiq.gemspec
@@ -22,6 +22,7 @@ Gem::Specification.new do |spec|
spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
spec.require_paths = ["lib"]

spec.add_dependency "anyway_config", ">= 1.3", "< 3"
spec.add_dependency "sidekiq"
spec.add_dependency "yabeda", "~> 0.6"

