feat(prometheus): expose controlplane connectivity state as a gauge #14020

Open: aryan9600 wants to merge 1 commit into master from cp-conn-prom-metric

aryan9600
Member

@aryan9600 aryan9600 commented Dec 13, 2024

Summary

Add a new Prometheus gauge metric, control_plane_connected. As with the datastore_reachable gauge, 0 means the connection is unhealthy and 1 means it is healthy. We mark the connection as unhealthy under the following circumstances:

  • Failure while establishing a websocket connection
  • Failure while sending basic information to controlplane
  • Failure while sending ping to controlplane
  • Failure while receiving a packet from the websocket connection

This lets users who run a significant number of gateways set up alerts for problems that any gateway may have while talking to the controlplane.
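
To illustrate the four failure cases above, here is a minimal sketch in plain Lua of how connection events could map to the gauge value. It is not the code in this PR: the event names, the `record_event` helper, and the choice of a successful ping as the point where the gauge goes back to 1 are all assumptions made for the example.

```lua
-- Hypothetical event names; the PR does not define these strings.
local FAILURE_EVENTS = {
  connect_failed    = true, -- failure while establishing the websocket connection
  basic_info_failed = true, -- failure while sending basic information
  ping_failed       = true, -- failure while sending a ping
  recv_failed       = true, -- failure while receiving a packet
}

-- Stand-in for the exported gauge; starts out as "not connected".
local state = { control_plane_connected = 0 }

-- Translate one connection event into the 0/1 gauge value.
local function record_event(event)
  if FAILURE_EVENTS[event] then
    state.control_plane_connected = 0
  elseif event == "ping_ok" then
    -- Assumption: a successful ping is what marks the connection healthy.
    state.control_plane_connected = 1
  else
    error("unknown event: " .. tostring(event))
  end

  return state.control_plane_connected
end
```

Any of the four failures drops the value to 0, so an alert on `control_plane_connected == 0` would cover all of them without the user having to distinguish the cause.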

Checklist

  • The Pull Request has tests
  • A changelog file has been created under changelog/unreleased/kong or skip-changelog label added on PR if changelog is unnecessary. README.md
  • There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Issue reference

Fix #[issue number]

@github-actions github-actions bot added core/clustering plugins/prometheus cherry-pick kong-ee schedule this PR for cherry-picking to kong/kong-ee labels Dec 13, 2024
@aryan9600 aryan9600 force-pushed the cp-conn-prom-metric branch 3 times, most recently from c2d6278 to 697c1e6 Compare December 23, 2024 13:32
@pull-request-size pull-request-size bot added size/L and removed size/M labels Dec 23, 2024
@aryan9600 aryan9600 marked this pull request as ready for review December 23, 2024 16:21
@aryan9600 aryan9600 force-pushed the cp-conn-prom-metric branch from 697c1e6 to de4a868 Compare January 2, 2025 11:08
@RobSerafini RobSerafini requested review from gszr and flrgh January 7, 2025 19:22
@aryan9600 aryan9600 force-pushed the cp-conn-prom-metric branch from de4a868 to ee922f2 Compare January 10, 2025 10:24
@aryan9600 aryan9600 requested a review from flrgh January 16, 2025 07:56
Comment on lines 77 to 78
local function set_control_plane_connected(reachable, ttl)
  local ok, err = ngx.shared.kong:safe_set("control_plane_connected", reachable, ttl)
Contributor

I think we should just hard-code the ttl as PING_WAIT since it's what we should use in all cases.

Suggested change
- local function set_control_plane_connected(reachable, ttl)
-   local ok, err = ngx.shared.kong:safe_set("control_plane_connected", reachable, ttl)
+ local function set_control_plane_connected(reachable)
+   local ok, err = ngx.shared.kong:safe_set("control_plane_connected", reachable, PING_WAIT)
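
For readers following the TTL discussion, here is a rough sketch of what a fixed TTL implies: if the key is not refreshed within PING_WAIT seconds, it expires and a reader can treat the connection as down. The stub dictionary only mimics the `safe_set`/`get` calls of an `ngx.shared` dict so that the example runs outside OpenResty. The PING_WAIT value of 45, the `advance` clock helper, and the rule that a missing key reads as 0 are assumptions for illustration, not taken from this PR.

```lua
local PING_WAIT = 45 -- assumed placeholder value, in seconds

-- Fake clock so that expiry can be demonstrated without sleeping.
local now = 0
local function clock() return now end
local function advance(seconds) now = now + seconds end

-- Minimal stand-in for an ngx.shared dict (safe_set and get only).
local function new_stub_dict(clock_fn)
  local store = {}
  local dict = {}

  function dict:safe_set(key, value, ttl)
    -- As in ngx.shared, a ttl of 0 or nil means "never expires".
    local expires_at = (ttl and ttl > 0) and (clock_fn() + ttl) or nil
    store[key] = { value = value, expires_at = expires_at }
    return true, nil
  end

  function dict:get(key)
    local entry = store[key]
    if not entry then
      return nil
    end
    if entry.expires_at and clock_fn() >= entry.expires_at then
      store[key] = nil
      return nil
    end
    return entry.value
  end

  return dict
end

local shm = new_stub_dict(clock)

-- Same shape as the suggested change: the TTL is always PING_WAIT.
local function set_control_plane_connected(reachable)
  local ok, err = shm:safe_set("control_plane_connected", reachable, PING_WAIT)
  if not ok then
    print("failed to set \"control_plane_connected\" key in shm: " .. tostring(err))
  end
  return ok
end

-- Assumption: a missing or expired key is reported as 0 (not connected).
local function read_gauge_value()
  local value = shm:get("control_plane_connected")
  if value == true or value == 1 then
    return 1
  end
  return 0
end
```

With a single TTL, every caller gets the same staleness guarantee, which is the point of dropping the `ttl` parameter.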

local function set_control_plane_connected(reachable, ttl)
  local ok, err = ngx.shared.kong:safe_set("control_plane_connected", reachable, ttl)
  if not ok then
    ngx_log(ngx_ERR, _log_prefix, "failed to set controlplane_reachable key in shm to ", reachable, " :", err)
Contributor

nit: update the log line to match the current name of the SHM key

Suggested change
- ngx_log(ngx_ERR, _log_prefix, "failed to set controlplane_reachable key in shm to ", reachable, " :", err)
+ ngx_log(ngx_ERR, _log_prefix, "failed to set \"control_plane_connected\" key in shm to ", reachable, " :", err)

Comment on lines 70 to 74
metrics.cp_connected = prometheus:gauge("control_plane_connected",
"Kong connected to control plane, " ..
"0 is unconnected",
nil,
prometheus.LOCAL_STORAGE)
Contributor

nit: formatting

Suggested change
- metrics.cp_connected = prometheus:gauge("control_plane_connected",
- "Kong connected to control plane, " ..
- "0 is unconnected",
- nil,
- prometheus.LOCAL_STORAGE)
+ metrics.cp_connected = prometheus:gauge("control_plane_connected",
+ "Kong connected to control plane, " ..
+ "0 is unconnected",
+ nil,
+ prometheus.LOCAL_STORAGE)

Comment on lines 601 to 603
-- it takes some time for the cp<->dp connection to get established and the
-- metric to reflect that, so set the timeout to 10 secs.
assert.with_timeout(10).eventually(function()
Contributor

Suggesting a more lenient timeout here to prevent the test from being flaky. It's not uncommon for stuff to be extra slow in CI.

Suggested change
- -- it takes some time for the cp<->dp connection to get established and the
- -- metric to reflect that, so set the timeout to 10 secs.
- assert.with_timeout(10).eventually(function()
+ -- it takes some time for the cp<->dp connection to get established and the
+ -- metric to reflect that. On failure, re-connection attempts are spaced out
+ -- in `math.random(5, 10)` second intervals, so a generous timeout is used
+ -- in case we get unlucky and have to wait multiple retry cycles
+ assert.with_timeout(30).eventually(function()
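
As a back-of-the-envelope check on the timeout, the sketch below counts how many reconnect delays fit into a given test timeout, using the `math.random(5, 10)` interval mentioned in the suggestion. The helper names are made up for this example, and the calculation ignores the time each connection attempt itself takes, so treat the numbers as an upper bound.

```lua
-- Bounds of the reconnect delay, per the math.random(5, 10) in the comment above.
local RECONNECT_MIN, RECONNECT_MAX = 5, 10

-- Retry cycles that fit even if every delay is the longest possible one.
local function guaranteed_retries(timeout)
  return math.floor(timeout / RECONNECT_MAX)
end

-- Retry cycles that fit if every delay happens to be the shortest one.
local function possible_retries(timeout)
  return math.floor(timeout / RECONNECT_MIN)
end

print(guaranteed_retries(10), possible_retries(10))
print(guaranteed_retries(30), possible_retries(30))
```

Under these assumptions a 10 second timeout only guarantees room for one retry delay, while 30 seconds leaves room for three, which is why the longer value is less likely to flake in slow CI runs.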

Add a new Prometheus gauge metric `control_plane_connected`. Similar to
`datastore_reachable` gauge, 0 means the connection is not healthy; 1
means that the connection is healthy. We mark the connection as
unhealthy under the following circumstances:

* Failure while establishing a websocket connection
* Failure while sending basic information to controlplane
* Failure while sending ping to controlplane
* Failure while receiving a packet from the websocket connection

This is helpful for users running a significant number of gateways to be
alerted about potential issues any gateway(s) may be facing while
talking to the controlplane.

Signed-off-by: Sanskar Jaiswal <sanskar.jaiswal@konghq.com>
@aryan9600 aryan9600 force-pushed the cp-conn-prom-metric branch from 1f1c1a3 to cf5b522 Compare January 23, 2025 05:53
Labels
cherry-pick kong-ee (schedule this PR for cherry-picking to kong/kong-ee), core/clustering, plugins/prometheus, size/L
4 participants