Add Swap utilization detector #467

Open: wants to merge 3 commits into base: master
1 change: 1 addition & 0 deletions docs/severity.md
@@ -1184,6 +1184,7 @@
|System disk space utilization|X|X|-|-|-|
|System disk inodes utilization|X|X|-|-|-|
|System memory utilization|X|X|-|-|-|
|System swap utilization|X|X|-|-|-|
|System disk space running out|-|X|-|-|-|


28 changes: 28 additions & 0 deletions modules/smart-agent_system-common/README.md
@@ -8,6 +8,7 @@
- [What are the available detectors in this module?](#what-are-the-available-detectors-in-this-module)
- [How to collect required metrics?](#how-to-collect-required-metrics)
- [Monitors](#monitors)
- [Swap](#swap)
- [Metrics](#metrics)
- [Related documentation](#related-documentation)

@@ -82,6 +83,7 @@ This module creates the following SignalFx detectors which could contain one or
|System disk space utilization|X|X|-|-|-|
|System disk inodes utilization|X|X|-|-|-|
|System memory utilization|X|X|-|-|-|
|System swap utilization|X|X|-|-|-|
|System disk space running out|-|X|-|-|-|

## How to collect required metrics?
@@ -132,6 +134,30 @@ You have two choices to use load based detectors:
In both cases, the goal is to get alerts based on the __ratio__ of load by dividing the original load by the number of CPUs/cores, which is the only way to get generic and relevant alerts for load.
The choice mainly depends on whether you want to collect 2 metrics instead of 1 and whether you want the load metric to be raw or already averaged.

### Swap

To activate the swap monitor, add the following parameters to the otel-agent configuration:

* Receivers configuration

```
receivers:
  hostmetrics:
    scrapers:
      paging:
        metrics:
          system.paging.utilization:
            enabled: true
```

* Exporters configuration

```
exporters:
  signalfx:
    include_metrics:
      - metric_name: system.paging.utilization
```

### Metrics

@@ -150,6 +176,7 @@ parameter to the corresponding monitor configuration:
- '!load.midterm'
- '!memory.utilization'
- '!percent_inodes.used'
- '!system.paging.utilization'

```

@@ -167,3 +194,4 @@ parameter to the corresponding monitor configuration:
* [Smart Agent monitor memory](https://github.com/signalfx/signalfx-agent/blob/main/docs/monitors/memory.md)
* [Splunk Observability integration cpu](https://docs.splunk.com/Observability/gdi/cpu/cpu.html)
* [Splunk Observability integration load](https://docs.splunk.com/Observability/gdi/load/load.html)
* [Splunk Observability hostmetrics](https://docs.splunk.com/Observability/gdi/opentelemetry/components/host-metrics-receiver.html)
17 changes: 17 additions & 0 deletions modules/smart-agent_system-common/conf/06-swap.yaml
@@ -0,0 +1,17 @@
module: system
name: "swap utilization"
id: swap
transformation: ".min(over='5m')"
value_unit: "%"
signals:
  signal:
    metric: system.paging.utilization
    filter: "filter('state', 'used')"
rules:
  critical:
    threshold: 90
    comparator: ">"
  major:
    threshold: 80
    comparator: ">"
    dependency: critical
26 changes: 26 additions & 0 deletions modules/smart-agent_system-common/conf/readme.yaml
@@ -13,6 +13,8 @@ documentations:
url: 'https://docs.splunk.com/Observability/gdi/cpu/cpu.html'
- name: Splunk Observability integration load
url: 'https://docs.splunk.com/Observability/gdi/load/load.html'
- name: Splunk Observability hostmetrics
url: 'https://docs.splunk.com/Observability/gdi/opentelemetry/components/host-metrics-receiver.html'

source_doc: |
### Monitors
@@ -37,3 +39,27 @@ source_doc: |
In both cases, the goal is to get alerts based on the __ratio__ of load by dividing the original load by the number of CPUs/cores, which is the only way to get generic and relevant alerts for load.
The choice mainly depends on whether you want to collect 2 metrics instead of 1 and whether you want the load metric to be raw or already averaged.

### Swap

To activate the swap monitor, add the following parameters to the otel-agent configuration:

* Receivers configuration

```
receivers:
  hostmetrics:
    scrapers:
      paging:
        metrics:
          system.paging.utilization:
            enabled: true
```

* Exporters configuration

```
exporters:
  signalfx:
    include_metrics:
      - metric_name: system.paging.utilization
```
45 changes: 45 additions & 0 deletions modules/smart-agent_system-common/detectors-gen.tf
@@ -248,3 +248,48 @@ EOF
max_delay = var.memory_max_delay
}

resource "signalfx_detector" "swap" {
name = format("%s %s", local.detector_name_prefix, "System swap utilization")

authorized_writer_teams = var.authorized_writer_teams
teams = try(coalescelist(var.teams, var.authorized_writer_teams), null)
tags = compact(concat(local.common_tags, local.tags, var.extra_tags))

viz_options {
label = "signal"
value_suffix = "%"
}

program_text = <<-EOF
signal = data('system.paging.utilization', filter=filter('state', 'used') and ${module.filtering.signalflow})${var.swap_aggregation_function}${var.swap_transformation_function}.publish('signal')
detect(when(signal > ${var.swap_threshold_critical}, lasting=%{if var.swap_lasting_duration_critical == null}None%{else}'${var.swap_lasting_duration_critical}'%{endif}, at_least=${var.swap_at_least_percentage_critical})).publish('CRIT')
detect(when(signal > ${var.swap_threshold_major}, lasting=%{if var.swap_lasting_duration_major == null}None%{else}'${var.swap_lasting_duration_major}'%{endif}, at_least=${var.swap_at_least_percentage_major}) and (not when(signal > ${var.swap_threshold_critical}, lasting=%{if var.swap_lasting_duration_critical == null}None%{else}'${var.swap_lasting_duration_critical}'%{endif}, at_least=${var.swap_at_least_percentage_critical}))).publish('MAJOR')
EOF

rule {
description = "is too high > ${var.swap_threshold_critical}%"
severity = "Critical"
detect_label = "CRIT"
disabled = coalesce(var.swap_disabled_critical, var.swap_disabled, var.detectors_disabled)
notifications = try(coalescelist(lookup(var.swap_notifications, "critical", []), var.notifications.critical), null)
runbook_url = try(coalesce(var.swap_runbook_url, var.runbook_url), "")
tip = var.swap_tip
parameterized_subject = var.message_subject == "" ? local.rule_subject : var.message_subject
parameterized_body = var.message_body == "" ? local.rule_body : var.message_body
}

rule {
description = "is too high > ${var.swap_threshold_major}%"
severity = "Major"
detect_label = "MAJOR"
disabled = coalesce(var.swap_disabled_major, var.swap_disabled, var.detectors_disabled)
notifications = try(coalescelist(lookup(var.swap_notifications, "major", []), var.notifications.major), null)
runbook_url = try(coalesce(var.swap_runbook_url, var.runbook_url), "")
tip = var.swap_tip
parameterized_subject = var.message_subject == "" ? local.rule_subject : var.message_subject
parameterized_body = var.message_body == "" ? local.rule_body : var.message_body
}

max_delay = var.swap_max_delay
}

5 changes: 5 additions & 0 deletions modules/smart-agent_system-common/outputs.tf
@@ -33,3 +33,8 @@ output "memory" {
value = signalfx_detector.memory
}

output "swap" {
description = "Detector resource for swap"
value = signalfx_detector.swap
}
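
A minimal sketch of how a calling configuration might consume the new `swap` output. The module label `system_common` and the output name `swap_detector_id` are assumptions for illustration; only the `swap` output itself comes from this change.

```hcl
# Hypothetical root-module output; "system_common" is an assumed module label.
output "swap_detector_id" {
  description = "ID of the swap utilization detector created by the module"
  # The module output exposes the whole signalfx_detector resource,
  # so individual attributes such as `id` can be read from it.
  value = module.system_common.swap.id
}
```

Exposing the full resource instead of a single attribute leaves it to the caller to pick whichever detector attribute it needs.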

90 changes: 90 additions & 0 deletions modules/smart-agent_system-common/variables-gen.tf
@@ -492,3 +492,93 @@ variable "memory_at_least_percentage_major" {
type = number
default = 1
}

# swap detector

variable "swap_notifications" {
description = "Notification recipients list per severity overridden for swap detector"
type = map(list(string))
default = {}
}

variable "swap_aggregation_function" {
description = "Aggregation function and group by for swap detector (e.g. \".mean(by=['host'])\")"
type = string
default = ""
}

variable "swap_transformation_function" {
description = "Transformation function for swap detector (e.g. \".mean(over='5m')\")"
type = string
default = ".min(over='5m')"
}

variable "swap_max_delay" {
description = "Enforce max delay for swap detector (use \"0\" or \"null\" for \"Auto\")"
type = number
default = null
}

variable "swap_tip" {
description = "Suggested first course of action or any note useful for incident handling"
type = string
default = ""
}

variable "swap_runbook_url" {
description = "URL like SignalFx dashboard or wiki page which can help to troubleshoot the incident cause"
type = string
default = ""
}

variable "swap_disabled" {
description = "Disable all alerting rules for swap detector"
type = bool
default = null
}

variable "swap_disabled_critical" {
description = "Disable critical alerting rule for swap detector"
type = bool
default = null
}

variable "swap_disabled_major" {
description = "Disable major alerting rule for swap detector"
type = bool
default = null
}

variable "swap_threshold_critical" {
description = "Critical threshold for swap detector in %"
type = number
default = 90
}

variable "swap_lasting_duration_critical" {
description = "Minimum duration that conditions must be true before raising alert"
type = string
default = null
}

variable "swap_at_least_percentage_critical" {
description = "Fraction of the lasting duration during which conditions must be true before raising alert (>= 0.0 and <= 1.0)"
type = number
default = 1
}

variable "swap_threshold_major" {
description = "Major threshold for swap detector in %"
type = number
default = 80
}

variable "swap_lasting_duration_major" {
description = "Minimum duration that conditions must be true before raising alert"
type = string
default = null
}

variable "swap_at_least_percentage_major" {
description = "Fraction of the lasting duration during which conditions must be true before raising alert (>= 0.0 and <= 1.0)"
type = number
default = 1
}
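
A minimal sketch of how a caller might override the swap variables above when instantiating the module. The module label `system_common`, the relative `source` path, and the `environment` and `notifications` inputs are assumptions for illustration and may not match what the module actually requires; only the `swap_*` variables come from this diff.

```hcl
# Hypothetical caller configuration; everything except the swap_* inputs is assumed.
module "system_common" {
  # Assumed relative path to the module directory touched by this change.
  source = "./modules/smart-agent_system-common"

  # Assumed inputs expected by the rest of the module (not shown in this diff).
  environment   = "production"
  notifications = var.notifications

  # Alert earlier than the defaults (90 for critical, 80 for major).
  swap_threshold_critical = 85
  swap_threshold_major    = 70

  # Require the condition to hold for 10 minutes before raising an alert;
  # the default of null renders as lasting=None in the SignalFlow program.
  swap_lasting_duration_critical = "10m"
  swap_lasting_duration_major    = "10m"

  # Group the signal by host; the default aggregation is an empty string.
  swap_aggregation_function = ".mean(by=['host'])"

  # Silence only the major rule while keeping the critical rule active.
  swap_disabled_major = true
}
```

Because the major rule is defined as "above the major threshold and not above the critical threshold", keeping `swap_threshold_major` below `swap_threshold_critical` preserves the intended escalation between the two severities.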