Support prometheus metrics #314

Merged: 19 commits into jupyterhub:main on May 26, 2021

Conversation

dtaniwaki (Contributor) commented May 22, 2021

Usage summary by Erik

By explicitly setting --metrics-port=<some port number> you will be able to acquire Prometheus-formatted metrics from http://<hostname, localhost, or ip>:<metrics port>/metrics.

--metrics-ip is also a flag added by this PR; it defaults to the permissive 0.0.0.0, which makes the metrics server listen on all IPv4 interfaces rather than just accepting traffic from localhost.
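For example, assuming the proxy is started with these flags (the port number 8002 below is just illustrative, not a default):

configurable-http-proxy --metrics-ip=0.0.0.0 --metrics-port=8002
curl http://localhost:8002/metrics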


Fixes #52

I implemented Prometheus support with the prom-client package and added a metrics endpoint. Please suggest more appropriate metric names if you have any.

I actually thought we should serve metrics on a separate port, since their requirements differ from those of the API port (something argued about in many systems, e.g. etcd-io/etcd#8060), but I didn't do it in this PR so that this feature can get reviewed first.
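To illustrate the general approach, here is a minimal sketch (not the actual code in this PR; the port and counter are placeholders) of serving prom-client metrics from a plain Node.js HTTP server:

// Minimal sketch, assuming prom-client v13+ (register.metrics() returns a Promise).
const http = require("http");
const client = require("prom-client");

// Collect the default process/Node.js metrics shown in the output below.
client.collectDefaultMetrics();

// Example custom counter, similar to the counters added in this PR.
const requestsWeb = new client.Counter({
  name: "requests_web",
  help: "Count of web requests",
});
// requestsWeb.inc() would be called from the proxy's web request handler.

const metricsServer = http.createServer(async (req, res) => {
  if (req.url === "/metrics") {
    res.setHeader("Content-Type", client.register.contentType);
    res.end(await client.register.metrics());
  } else {
    res.writeHead(404);
    res.end();
  }
});

// Placeholder port and bind address.
metricsServer.listen(8002, "0.0.0.0");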

Here's the checklist @minrk listed in #52.

  • Add prometheus as a dependency
  • Add metrics API endpoint
  • Replace all existing statsd metrics with prometheus
  • Add additional metrics, if any feel necessary (?)
  • Remove statsd as a dependency
  • CLI option to opt-in to (or -out of) metrics

Here's a sample response from the metrics endpoint:

http://localhost:8001/metrics
# HELP process_cpu_user_seconds_total Total user CPU time spent in seconds.
# TYPE process_cpu_user_seconds_total counter
process_cpu_user_seconds_total 0.139221

# HELP process_cpu_system_seconds_total Total system CPU time spent in seconds.
# TYPE process_cpu_system_seconds_total counter
process_cpu_system_seconds_total 0.048507

# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.187728

# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1621685578

# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 26812416

# HELP nodejs_eventloop_lag_seconds Lag of event loop in seconds.
# TYPE nodejs_eventloop_lag_seconds gauge
nodejs_eventloop_lag_seconds 0.003281464

# HELP nodejs_eventloop_lag_min_seconds The minimum recorded event loop delay.
# TYPE nodejs_eventloop_lag_min_seconds gauge
nodejs_eventloop_lag_min_seconds 0.009043968

# HELP nodejs_eventloop_lag_max_seconds The maximum recorded event loop delay.
# TYPE nodejs_eventloop_lag_max_seconds gauge
nodejs_eventloop_lag_max_seconds 0.016777215

# HELP nodejs_eventloop_lag_mean_seconds The mean of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_mean_seconds gauge
nodejs_eventloop_lag_mean_seconds 0.010724828711003628

# HELP nodejs_eventloop_lag_stddev_seconds The standard deviation of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_stddev_seconds gauge
nodejs_eventloop_lag_stddev_seconds 0.0007361098704439105

# HELP nodejs_eventloop_lag_p50_seconds The 50th percentile of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_p50_seconds gauge
nodejs_eventloop_lag_p50_seconds 0.010428415

# HELP nodejs_eventloop_lag_p90_seconds The 90th percentile of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_p90_seconds gauge
nodejs_eventloop_lag_p90_seconds 0.011919359

# HELP nodejs_eventloop_lag_p99_seconds The 99th percentile of the recorded event loop delays.
# TYPE nodejs_eventloop_lag_p99_seconds gauge
nodejs_eventloop_lag_p99_seconds 0.012558335

# HELP nodejs_active_handles Number of active libuv handles grouped by handle type. Every handle type is C++ class name.
# TYPE nodejs_active_handles gauge
nodejs_active_handles{type="WriteStream"} 2
nodejs_active_handles{type="ReadStream"} 1
nodejs_active_handles{type="Server"} 2
nodejs_active_handles{type="Socket"} 1

# HELP nodejs_active_handles_total Total number of active handles.
# TYPE nodejs_active_handles_total gauge
nodejs_active_handles_total 6

# HELP nodejs_active_requests Number of active libuv requests grouped by request type. Every request type is C++ class name.
# TYPE nodejs_active_requests gauge

# HELP nodejs_active_requests_total Total number of active requests.
# TYPE nodejs_active_requests_total gauge
nodejs_active_requests_total 0

# HELP nodejs_heap_size_total_bytes Process heap size from Node.js in bytes.
# TYPE nodejs_heap_size_total_bytes gauge
nodejs_heap_size_total_bytes 8003584

# HELP nodejs_heap_size_used_bytes Process heap size used from Node.js in bytes.
# TYPE nodejs_heap_size_used_bytes gauge
nodejs_heap_size_used_bytes 6823904

# HELP nodejs_external_memory_bytes Node.js external memory size in bytes.
# TYPE nodejs_external_memory_bytes gauge
nodejs_external_memory_bytes 1469716

# HELP nodejs_heap_space_size_total_bytes Process heap space size total from Node.js in bytes.
# TYPE nodejs_heap_space_size_total_bytes gauge
nodejs_heap_space_size_total_bytes{space="read_only"} 118784
nodejs_heap_space_size_total_bytes{space="new"} 1048576
nodejs_heap_space_size_total_bytes{space="old"} 5124096
nodejs_heap_space_size_total_bytes{space="code"} 339968
nodejs_heap_space_size_total_bytes{space="map"} 790528
nodejs_heap_space_size_total_bytes{space="large_object"} 532480
nodejs_heap_space_size_total_bytes{space="code_large_object"} 49152
nodejs_heap_space_size_total_bytes{space="new_large_object"} 0

# HELP nodejs_heap_space_size_used_bytes Process heap space size used from Node.js in bytes.
# TYPE nodejs_heap_space_size_used_bytes gauge
nodejs_heap_space_size_used_bytes{space="read_only"} 117808
nodejs_heap_space_size_used_bytes{space="new"} 840728
nodejs_heap_space_size_used_bytes{space="old"} 4860624
nodejs_heap_space_size_used_bytes{space="code"} 104160
nodejs_heap_space_size_used_bytes{space="map"} 378288
nodejs_heap_space_size_used_bytes{space="large_object"} 524344
nodejs_heap_space_size_used_bytes{space="code_large_object"} 2784
nodejs_heap_space_size_used_bytes{space="new_large_object"} 0

# HELP nodejs_heap_space_size_available_bytes Process heap space size available from Node.js in bytes.
# TYPE nodejs_heap_space_size_available_bytes gauge
nodejs_heap_space_size_available_bytes{space="read_only"} 0
nodejs_heap_space_size_available_bytes{space="new"} 206696
nodejs_heap_space_size_available_bytes{space="old"} 183208
nodejs_heap_space_size_available_bytes{space="code"} 5408
nodejs_heap_space_size_available_bytes{space="map"} 406856
nodejs_heap_space_size_available_bytes{space="large_object"} 0
nodejs_heap_space_size_available_bytes{space="code_large_object"} 0
nodejs_heap_space_size_available_bytes{space="new_large_object"} 1047424

# HELP nodejs_version_info Node.js version info.
# TYPE nodejs_version_info gauge
nodejs_version_info{version="v14.1.0",major="14",minor="1",patch="0"} 1

# HELP nodejs_gc_duration_seconds Garbage collection duration by kind, one of major, minor, incremental or weakcb.
# TYPE nodejs_gc_duration_seconds histogram
nodejs_gc_duration_seconds_bucket{le="0.001",kind="incremental"} 2
nodejs_gc_duration_seconds_bucket{le="0.01",kind="incremental"} 2
nodejs_gc_duration_seconds_bucket{le="0.1",kind="incremental"} 2
nodejs_gc_duration_seconds_bucket{le="1",kind="incremental"} 2
nodejs_gc_duration_seconds_bucket{le="2",kind="incremental"} 2
nodejs_gc_duration_seconds_bucket{le="5",kind="incremental"} 2
nodejs_gc_duration_seconds_bucket{le="+Inf",kind="incremental"} 2
nodejs_gc_duration_seconds_sum{kind="incremental"} 0.00007256100000000001
nodejs_gc_duration_seconds_count{kind="incremental"} 2
nodejs_gc_duration_seconds_bucket{le="0.001",kind="major"} 0
nodejs_gc_duration_seconds_bucket{le="0.01",kind="major"} 2
nodejs_gc_duration_seconds_bucket{le="0.1",kind="major"} 2
nodejs_gc_duration_seconds_bucket{le="1",kind="major"} 2
nodejs_gc_duration_seconds_bucket{le="2",kind="major"} 2
nodejs_gc_duration_seconds_bucket{le="5",kind="major"} 2
nodejs_gc_duration_seconds_bucket{le="+Inf",kind="major"} 2
nodejs_gc_duration_seconds_sum{kind="major"} 0.0057768690000000004
nodejs_gc_duration_seconds_count{kind="major"} 2

# HELP api_route_get Count of API route get requests
# TYPE api_route_get counter
api_route_get 1

# HELP api_route_add Count of API route add requests
# TYPE api_route_add counter
api_route_add 0

# HELP api_route_delete Count of API route delete requests
# TYPE api_route_delete counter
api_route_delete 0

# HELP find_target_for_req Summary of find target requests
# TYPE find_target_for_req summary
find_target_for_req{quantile="0.01"} 0.000921552
find_target_for_req{quantile="0.05"} 0.000921552
find_target_for_req{quantile="0.5"} 0.000921552
find_target_for_req{quantile="0.9"} 0.000921552
find_target_for_req{quantile="0.95"} 0.000921552
find_target_for_req{quantile="0.99"} 0.000921552
find_target_for_req{quantile="0.999"} 0.000921552
find_target_for_req_sum 0.000921552
find_target_for_req_count 1

# HELP last_activity_updating Summary of last activity updating requests
# TYPE last_activity_updating summary
last_activity_updating{quantile="0.01"} 0
last_activity_updating{quantile="0.05"} 0
last_activity_updating{quantile="0.5"} 0
last_activity_updating{quantile="0.9"} 0
last_activity_updating{quantile="0.95"} 0
last_activity_updating{quantile="0.99"} 0
last_activity_updating{quantile="0.999"} 0
last_activity_updating_sum 0
last_activity_updating_count 0

# HELP requests_ws Count of websocket requests
# TYPE requests_ws counter
requests_ws 0

# HELP requests_web Count of web requests
# TYPE requests_web counter
requests_web 1

# HELP requests_proxy Count of proxy requests
# TYPE requests_proxy counter
requests_proxy{status="404"} 1

# HELP requests_api Count of API requests
# TYPE requests_api counter
requests_api{status="200"} 1
requests_api{status="404"} 1
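For reference, here is a hedged sketch of how custom metrics like the ones above could be declared with prom-client (illustrative only, not necessarily how this PR defines them):

const client = require("prom-client");

// Labeled counter, e.g. requests_proxy{status="404"}.
const requestsProxy = new client.Counter({
  name: "requests_proxy",
  help: "Count of proxy requests",
  labelNames: ["status"],
});

// Summary with the quantiles shown above, observed in seconds.
const findTargetForReq = new client.Summary({
  name: "find_target_for_req",
  help: "Summary of find target requests",
  percentiles: [0.01, 0.05, 0.5, 0.9, 0.95, 0.99, 0.999],
});

// Usage inside the proxy's request handling (hypothetical call sites):
requestsProxy.inc({ status: "404" });
const done = findTargetForReq.startTimer();
// ... look up the routing target ...
done();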

consideRatio (Member):
Wooow nice work on this!!! Really excited about this work!

dtaniwaki (Contributor, Author) commented May 22, 2021

Looks like there are a few flaky tests.

consideRatio (Member) left a review comment:

Wow, this looks like great progress to me (but I lack the experience to be confident about being the sole person reviewing this).

lib/configproxy.js (review thread, resolved)
test/api_spec.js (outdated):
@@ -13,6 +13,7 @@ describe("API Tests", function () {
var apiPort = port + 1;
var proxy;
var apiUrl = "http://127.0.0.1:" + apiPort + "/api/routes";
var metricsUrl = "http://127.0.0.1:" + apiPort + "/metrics";
Member:

Will the /metrics endpoint be exposed like any other endpoint on the same port? If not, what decides whether it listens only to localhost or accepts incoming requests from other IPs?

Do you have a suggestion for how we should handle the /metrics endpoint from a security standpoint? I assume we'd prefer not to expose it publicly, or at least to be able to control that somehow.

dtaniwaki (Contributor, Author):

From my perspective, the metrics endpoint should not be exposed to the public internet. However, putting it on the API server may cause problems if the API server is protected with an API key and/or a client TLS certificate that your Prometheus server doesn't have. On the other hand, if you don't use an API key or a client TLS certificate, running another server may be overkill. So, ideally, we should offer two options: add the metrics endpoint to the API server, or create a dedicated server for the metrics endpoint.

What do you think?

Member:

Ah, hmmm. I think I'm positive towards exposing it via the API server on /metrics, but without having that route require an API token. Or creating a dedicated server for the metrics endpoint...

Not confident about this, hmmm... I know I dislike re-using an API server access token for access to the metrics endpoint, at least.

Member:

Perhaps the most flexible option for the future is to have a dedicated server? I don't know how involved each of the options is to implement, and reducing technical complexity is also valuable.

dtaniwaki (Contributor, Author):

> Perhaps the most flexible option for the future is to have a dedicated server?

Agreed, I think it's the most flexible. We can drop the metrics endpoint from the API server entirely, because offering two ways to serve metrics is complicated and may confuse users.

Member:

@dtaniwaki and @minrk, what do you think?

  1. Should we expose the metrics on: the proxy server (typically proxying traffic), the proxy-api server (typically controlling traffic routing), and/or a dedicated metrics server (just serving metrics)?
  2. What access control, if any, should be implemented?
    1. A fixed access token set via an environment variable?
    2. A way to limit where it accepts traffic from: only localhost, some local network, or all of the internet?
  3. What should we aim for in this PR, and what should we leave for future work?

Currently I think:

  1. Only a dedicated metrics server.
  2. Regarding access control:
    1. An access token would be a bonus to have.
    2. Limiting where it accepts traffic from feels almost required.
  3. Aim for in this PR: a dedicated server that listens on a network interface that can accept traffic from a local network. Future improvements: a fixed access token set via an environment variable.

consideRatio (Member) commented May 23, 2021:

I'm to a large extent thinking about the following situations:

  • This software is running on a VM where JupyterHub and Prometheus are also running
  • This software is running in a k8s Pod in a k8s cluster, and JupyterHub as well as Prometheus are running in separate pods within the k8s cluster.

I care about the following:

  1. To not expose /metrics to everyone who can route to it
  2. To be able to expose /metrics to prometheus if it is running either on the same machine or in the same k8s cluster
  3. To protect /metrics with some password dedicated to this purpose even though we don't have TLS communication

I don't care much about the following:

  1. To protect the /metrics endpoint with some password over a TLS/HTTPS connection using a self-signed cert or provided cert

Member:

I think running it under its own configurable port may be the easiest for now. You don't have to worry about authentication; a JupyterHub admin shouldn't have to worry too much about the new port if they're following best practice and running a firewall; and it can easily be blocked on k8s by a network policy.

Member:

👍 to its own --metrics-port (and --metrics-ip) config for its own server, and disabled if unspecified (default).
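For illustration, a rough sketch of that opt-in wiring (the option names metricsPort and metricsIp are placeholders for however the CLI flags end up being parsed):

const http = require("http");
const client = require("prom-client");

// Only start a dedicated metrics server when a metrics port is configured.
function maybeStartMetricsServer(options) {
  if (!options.metricsPort) {
    return null; // metrics disabled by default
  }
  client.collectDefaultMetrics();
  const server = http.createServer(async (req, res) => {
    res.setHeader("Content-Type", client.register.contentType);
    res.end(await client.register.metrics());
  });
  // Default to 0.0.0.0 so a Prometheus server on the local network can scrape it.
  server.listen(options.metricsPort, options.metricsIp || "0.0.0.0");
  return server;
}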

manics requested a review from minrk on May 24, 2021.
dtaniwaki (Contributor, Author):

I updated the code based on your feedback. Would you review it again?

minrk changed the title from "Support prometheus" to "Support prometheus metrics" on May 26, 2021.
minrk merged commit 9196b53 into jupyterhub:main on May 26, 2021.
minrk (Member) commented May 26, 2021:

Wonderful, thank you @dtaniwaki!

Linked issue: Add prometheus based metrics (#52)
4 participants