Support prometheus metrics #314

Merged: 19 commits into jupyterhub:main on May 26, 2021

Conversation

dtaniwaki (Contributor) commented May 22, 2021

Usage summary by Erik

By explicitly setting --metrics-port=<some port number> you will be able to acquire Prometheus-formatted metrics from http://<hostname, localhost, or ip>:<metrics port>/metrics.

--metrics-ip is also a flag added by this PR; it defaults to the permissive 0.0.0.0, which makes the metrics server listen on all IPv4 interfaces rather than just accepting traffic from localhost.
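For example, assuming the proxy is started with these flags (the port number 8002 below is just illustrative, not a default):

configurable-http-proxy --metrics-ip=0.0.0.0 --metrics-port=8002
curl http://localhost:8002/metrics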


Fixes #52

I implemented Prometheus support with the prom-client package and added a metrics endpoint. Please suggest more appropriate metric names if you have any.

I actually thought we should serve metrics on a separate port, since their requirements differ from those of the API port (something argued about in many systems, e.g. etcd-io/etcd#8060), but I didn't do it in this PR so that this feature can get reviewed first.
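To illustrate the general approach, here is a minimal sketch (not the actual code in this PR; the port and counter are placeholders) of serving prom-client metrics from a plain Node.js HTTP server:

// Minimal sketch, assuming prom-client v13+ (register.metrics() returns a Promise).
const http = require("http");
const client = require("prom-client");

// Collect the default process/Node.js metrics shown in the output below.
client.collectDefaultMetrics();

// Example custom counter, similar to the counters added in this PR.
const requestsWeb = new client.Counter({
  name: "requests_web",
  help: "Count of web requests",
});
// requestsWeb.inc() would be called from the proxy's web request handler.

const metricsServer = http.createServer(async (req, res) => {
  if (req.url === "/metrics") {
    res.setHeader("Content-Type", client.register.contentType);
    res.end(await client.register.metrics());
  } else {
    res.writeHead(404);
    res.end();
  }
});

// Placeholder port and bind address.
metricsServer.listen(8002, "0.0.0.0");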

Here's the checklist @minrk listed in #52.

  • Add prometheus as a dependency
  • Add metrics API endpoint
  • Replace all existing statsd metrics with prometheus
  • Add additional metrics, if any feel necessary (?)
  • Remove statsd as a dependency
  • CLI option to opt-in to (or -out of) metrics

Here's a sample response from the metrics endpoint:

http://localhost:8001/metrics
# HELP process_cpu_user_seconds_total Total user CPU time spent in seconds.
# TYPE process_cpu_user_seconds_total counter
process_cpu_user_seconds_total 0.139221

# HELP process_cpu_system_seconds_total Total system CPU time spent in seconds.
# TYPE process_cpu_system_seconds_total counter
process_cpu_system_seconds_total 0.048507

# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.187728

# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1621685578

# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 26812416

# HELP nodejs_eventloop_lag_seconds Lag of event loop in seconds.
# TYPE nodejs_eventloop_lag_seconds gauge
nodejs_eventloop_lag_seconds 0.003281464

# HELP nodejs_eventloop_lag_min_seconds The minimum recorded event loop delay.
# TYPE nodejs_eventloop_lag_min_seconds gauge
nodejs_eventloop_lag_min_seconds 0.009043968

# HELP nodejs_eventloop_lag_max_seconds The maximum recorded event loop delay.
# TYPE nodejs_eventloop_lag_max_seconds gauge
nodejs_eventloop_lag_max_seconds 0.016777215

# HELP nodejs_eventloop_lag_mean_seconds The mean of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_mean_seconds gauge
nodejs_eventloop_lag_mean_seconds 0.010724828711003628

# HELP nodejs_eventloop_lag_stddev_seconds The standard deviation of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_stddev_seconds gauge
nodejs_eventloop_lag_stddev_seconds 0.0007361098704439105

# HELP nodejs_eventloop_lag_p50_seconds The 50th percentile of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_p50_seconds gauge
nodejs_eventloop_lag_p50_seconds 0.010428415

# HELP nodejs_eventloop_lag_p90_seconds The 90th percentile of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_p90_seconds gauge
nodejs_eventloop_lag_p90_seconds 0.011919359

# HELP nodejs_eventloop_lag_p99_seconds The 99th percentile of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_p99_seconds gauge
nodejs_eventloop_lag_p99_seconds 0.012558335

# HELP nodejs_active_handles Number of active libuv handles grouped by handle type. Every handle type is C++ class name.
# TYPE nodejs_active_handles gauge
nodejs_active_handles{type="WriteStream"} 2
nodejs_active_handles{type="ReadStream"} 1
nodejs_active_handles{type="Server"} 2
nodejs_active_handles{type="Socket"} 1

# HELP nodejs_active_handles_total Total number of active handles.
# TYPE nodejs_active_handles_total gauge
nodejs_active_handles_total 6

# HELP nodejs_active_requests Number of active libuv requests grouped by request type. Every request type is C++ class name.
# TYPE nodejs_active_requests gauge

# HELP nodejs_active_requests_total Total number of active requests.
# TYPE nodejs_active_requests_total gauge
nodejs_active_requests_total 0

# HELP nodejs_heap_size_total_bytes Process heap size from Node.js in bytes.
# TYPE nodejs_heap_size_total_bytes gauge
nodejs_heap_size_total_bytes 8003584

# HELP nodejs_heap_size_used_bytes Process heap size used from Node.js in bytes.
# TYPE nodejs_heap_size_used_bytes gauge
nodejs_heap_size_used_bytes 6823904

# HELP nodejs_external_memory_bytes Node.js external memory size in bytes.
# TYPE nodejs_external_memory_bytes gauge
nodejs_external_memory_bytes 1469716

# HELP nodejs_heap_space_size_total_bytes Process heap space size total from Node.js in bytes.
# TYPE nodejs_heap_space_size_total_bytes gauge
nodejs_heap_space_size_total_bytes{space="read_only"} 118784
nodejs_heap_space_size_total_bytes{space="new"} 1048576
nodejs_heap_space_size_total_bytes{space="old"} 5124096
nodejs_heap_space_size_total_bytes{space="code"} 339968
nodejs_heap_space_size_total_bytes{space="map"} 790528
nodejs_heap_space_size_total_bytes{space="large_object"} 532480
nodejs_heap_space_size_total_bytes{space="code_large_object"} 49152
nodejs_heap_space_size_total_bytes{space="new_large_object"} 0

# HELP nodejs_heap_space_size_used_bytes Process heap space size used from Node.js in bytes.
# TYPE nodejs_heap_space_size_used_bytes gauge
nodejs_heap_space_size_used_bytes{space="read_only"} 117808
nodejs_heap_space_size_used_bytes{space="new"} 840728
nodejs_heap_space_size_used_bytes{space="old"} 4860624
nodejs_heap_space_size_used_bytes{space="code"} 104160
nodejs_heap_space_size_used_bytes{space="map"} 378288
nodejs_heap_space_size_used_bytes{space="large_object"} 524344
nodejs_heap_space_size_used_bytes{space="code_large_object"} 2784
nodejs_heap_space_size_used_bytes{space="new_large_object"} 0

# HELP nodejs_heap_space_size_available_bytes Process heap space size available from Node.js in bytes.
# TYPE nodejs_heap_space_size_available_bytes gauge
nodejs_heap_space_size_available_bytes{space="read_only"} 0
nodejs_heap_space_size_available_bytes{space="new"} 206696
nodejs_heap_space_size_available_bytes{space="old"} 183208
nodejs_heap_space_size_available_bytes{space="code"} 5408
nodejs_heap_space_size_available_bytes{space="map"} 406856
nodejs_heap_space_size_available_bytes{space="large_object"} 0
nodejs_heap_space_size_available_bytes{space="code_large_object"} 0
nodejs_heap_space_size_available_bytes{space="new_large_object"} 1047424

# HELP nodejs_version_info Node.js version info.
# TYPE nodejs_version_info gauge
nodejs_version_info{version="v14.1.0",major="14",minor="1",patch="0"} 1

# HELP nodejs_gc_duration_seconds Garbage collection duration by kind, one of major, minor, incremental or weakcb.
# TYPE nodejs_gc_duration_seconds histogram
nodejs_gc_duration_seconds_bucket{le="0.001",kind="incremental"} 2
nodejs_gc_duration_seconds_bucket{le="0.01",kind="incremental"} 2
nodejs_gc_duration_seconds_bucket{le="0.1",kind="incremental"} 2
nodejs_gc_duration_seconds_bucket{le="1",kind="incremental"} 2
nodejs_gc_duration_seconds_bucket{le="2",kind="incremental"} 2
nodejs_gc_duration_seconds_bucket{le="5",kind="incremental"} 2
nodejs_gc_duration_seconds_bucket{le="+Inf",kind="incremental"} 2
nodejs_gc_duration_seconds_sum{kind="incremental"} 0.00007256100000000001
nodejs_gc_duration_seconds_count{kind="incremental"} 2
nodejs_gc_duration_seconds_bucket{le="0.001",kind="major"} 0
nodejs_gc_duration_seconds_bucket{le="0.01",kind="major"} 2
nodejs_gc_duration_seconds_bucket{le="0.1",kind="major"} 2
nodejs_gc_duration_seconds_bucket{le="1",kind="major"} 2
nodejs_gc_duration_seconds_bucket{le="2",kind="major"} 2
nodejs_gc_duration_seconds_bucket{le="5",kind="major"} 2
nodejs_gc_duration_seconds_bucket{le="+Inf",kind="major"} 2
nodejs_gc_duration_seconds_sum{kind="major"} 0.0057768690000000004
nodejs_gc_duration_seconds_count{kind="major"} 2

# HELP api_route_get Count of API route get requests
# TYPE api_route_get counter
api_route_get 1

# HELP api_route_add Count of API route add requests
# TYPE api_route_add counter
api_route_add 0

# HELP api_route_delete Count of API route delete requests
# TYPE api_route_delete counter
api_route_delete 0

# HELP find_target_for_req Summary of find target requests
# TYPE find_target_for_req summary
find_target_for_req{quantile="0.01"} 0.000921552
find_target_for_req{quantile="0.05"} 0.000921552
find_target_for_req{quantile="0.5"} 0.000921552
find_target_for_req{quantile="0.9"} 0.000921552
find_target_for_req{quantile="0.95"} 0.000921552
find_target_for_req{quantile="0.99"} 0.000921552
find_target_for_req{quantile="0.999"} 0.000921552
find_target_for_req_sum 0.000921552
find_target_for_req_count 1

# HELP last_activity_updating Summary of last activity updating requests
# TYPE last_activity_updating summary
last_activity_updating{quantile="0.01"} 0
last_activity_updating{quantile="0.05"} 0
last_activity_updating{quantile="0.5"} 0
last_activity_updating{quantile="0.9"} 0
last_activity_updating{quantile="0.95"} 0
last_activity_updating{quantile="0.99"} 0
last_activity_updating{quantile="0.999"} 0
last_activity_updating_sum 0
last_activity_updating_count 0

# HELP requests_ws Count of websocket requests
# TYPE requests_ws counter
requests_ws 0

# HELP requests_web Count of web requests
# TYPE requests_web counter
requests_web 1

# HELP requests_proxy Count of proxy requests
# TYPE requests_proxy counter
requests_proxy{status="404"} 1

# HELP requests_api Count of API requests
# TYPE requests_api counter
requests_api{status="200"} 1
requests_api{status="404"} 1
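For reference, here is a hedged sketch of how custom metrics like the ones above could be declared with prom-client (illustrative only, not necessarily how this PR defines them):

const client = require("prom-client");

// Labeled counter, e.g. requests_proxy{status="404"}.
const requestsProxy = new client.Counter({
  name: "requests_proxy",
  help: "Count of proxy requests",
  labelNames: ["status"],
});

// Summary with the quantiles shown above, observed in seconds.
const findTargetForReq = new client.Summary({
  name: "find_target_for_req",
  help: "Summary of find target requests",
  percentiles: [0.01, 0.05, 0.5, 0.9, 0.95, 0.99, 0.999],
});

// Usage inside the proxy's request handling (hypothetical call sites):
requestsProxy.inc({ status: "404" });
const done = findTargetForReq.startTimer();
// ... look up the routing target ...
done();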

consideRatio (Member):
Wooow nice work on this!!! Really excited about this work!

dtaniwaki (Contributor, Author) commented May 22, 2021

Looks like there are a few flaky tests.

consideRatio (Member) left a review comment:

Wow, this looks like great progress to me (but I lack the experience to be confident about being the sole person reviewing this).

lib/configproxy.js (review thread, resolved)
test/api_spec.js (outdated):
@@ -13,6 +13,7 @@ describe("API Tests", function () {
var apiPort = port + 1;
var proxy;
var apiUrl = "http://127.0.0.1:" + apiPort + "/api/routes";
var metricsUrl = "http://127.0.0.1:" + apiPort + "/metrics";
Member:

Will the /metrics endpoint be exposed like any other endpoint on the same port? If not, what decides whether it listens only to localhost or accepts incoming requests from other IPs?

Do you have a suggestion for how we should handle the /metrics endpoint from a security standpoint? I assume we'd prefer not to expose it publicly, or at least to be able to control that somehow.

dtaniwaki (Contributor, Author):

From my perspective, the metrics endpoint should not be exposed to the public internet. However, putting it on the API server may cause problems if the API server is protected with an API key and/or a client TLS certificate that your Prometheus server doesn't have. On the other hand, if you don't use an API key or a client TLS certificate, running another server may be overkill. So, ideally, we should offer two options: add the metrics endpoint to the API server, or create a dedicated server for the metrics endpoint.

What do you think?

Member:

Ah, hmmm. I think I'm positive towards exposing it via the API server on /metrics, but without having that route require an API token. Or creating a dedicated server for the metrics endpoint...

Not confident about this, hmmm... I know I dislike re-using an API server access token for access to the metrics endpoint, at least.

Member:

Perhaps the most flexible option for the future is to have a dedicated server? I don't know how involved each of the options is to implement, and reducing technical complexity is also valuable.

dtaniwaki (Contributor, Author):

> Perhaps the most flexible option for the future is to have a dedicated server?

Agreed, I think it's the most flexible. We can drop the metrics endpoint from the API server entirely, because offering two ways to serve metrics is complicated and may confuse users.

Member:

@dtaniwaki and @minrk, what do you think?

  1. Should we expose the metrics on: the proxy server (typically proxying traffic), the proxy-api server (typically controlling traffic routing), and/or a dedicated metrics server (just serving metrics)?
  2. What access control, if any, should be implemented?
    1. A fixed access token set via an environment variable?
    2. A way to limit where it accepts traffic from: only localhost, some local network, or all of the internet?
  3. What should we aim for in this PR, and what should we leave for future work?

Currently I think:

  1. Only a dedicated metrics server.
  2. Regarding access control:
    1. An access token would be a bonus to have.
    2. Limiting where it accepts traffic from feels almost required.
  3. Aim for in this PR: a dedicated server that listens on a network interface that can accept traffic from a local network. Future improvements: a fixed access token set via an environment variable.

consideRatio (Member) commented May 23, 2021:

I'm to a large extent thinking about the following situations:

  • This software is running on a VM where JupyterHub and Prometheus are also running
  • This software is running in a k8s Pod in a k8s cluster, and JupyterHub as well as Prometheus are running in separate pods within the k8s cluster.

I care about the following:

  1. To not expose /metrics to everyone who can route to it
  2. To be able to expose /metrics to prometheus if it is running either on the same machine or in the same k8s cluster
  3. To protect /metrics with some password dedicated to this purpose even though we don't have TLS communication

I don't care much about the following:

  1. To protect the /metrics endpoint with some password over a TLS/HTTPS connection using a self-signed cert or provided cert

Member:

I think running it under its own configurable port may be the easiest for now. You don't have to worry about authentication; a JupyterHub admin shouldn't have to worry too much about the new port if they're following best practice and running a firewall; and it can easily be blocked on k8s by a network policy.

Member:

👍 to its own --metrics-port (and --metrics-ip) config for its own server, and disabled if unspecified (default).
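For illustration, a rough sketch of that opt-in wiring (the option names metricsPort and metricsIp are placeholders for however the CLI flags end up being parsed):

const http = require("http");
const client = require("prom-client");

// Only start a dedicated metrics server when a metrics port is configured.
function maybeStartMetricsServer(options) {
  if (!options.metricsPort) {
    return null; // metrics disabled by default
  }
  client.collectDefaultMetrics();
  const server = http.createServer(async (req, res) => {
    res.setHeader("Content-Type", client.register.contentType);
    res.end(await client.register.metrics());
  });
  // Default to 0.0.0.0 so a Prometheus server on the local network can scrape it.
  server.listen(options.metricsPort, options.metricsIp || "0.0.0.0");
  return server;
}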

manics requested a review from minrk on May 24, 2021.
dtaniwaki (Contributor, Author):

I updated the code based on your feedback. Would you review it again?

minrk changed the title from "Support prometheus" to "Support prometheus metrics" on May 26, 2021.
minrk merged commit 9196b53 into jupyterhub:main on May 26, 2021.
minrk (Member) commented May 26, 2021:

Wonderful, thank you @dtaniwaki!

Linked issue: Add prometheus based metrics (#52)
4 participants