Grafana Dashboards for use with Zero to JupyterHub on Kubernetes
Grafana dashboards displaying prometheus metrics are extremely useful in diagnosing issues on Kubernetes clusters running JupyterHub. However, everyone has to build their own dashboards - there isn't an easy way to standardize them across many clusters run by many entities.
This project provides some standard Grafana Dashboards as Code to help with this. It uses jsonnet and grafonnet to generate dashboards completely via code, which can then be deployed on any Grafana instance!
- Locally, you need to have jsonnet installed. The grafonnet library is already vendored in, using jsonnet-bundler.
- A recent version of prometheus installed on your cluster. Currently, it is assumed that your prometheus instance is installed using the prometheus helm chart, with kube-state-metrics, node-exporter and cadvisor enabled. In addition, you should scrape metrics from the hub instance as well (see the sanity-check queries after this list).
- A recent version of Grafana, with a prometheus data source already added.
- An API key with 'admin' permissions. This is per-organization, and you can make a new one by going to the configuration pane for your Grafana (the gear icon on the left bar) and selecting 'API Keys'. The admin permission is needed to query the list of data sources so we can auto-populate template variable options (such as the list of hubs).
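Once prometheus is scraping all of these targets, each one exposes at least one easily recognizable metric. The queries below are just a quick sanity check, not part of the dashboards; the metric names assume reasonably recent versions of each component.

```promql
# One representative metric per scrape target, as a sanity check that
# prometheus is collecting everything the dashboards rely on.
kube_pod_info                               # kube-state-metrics
node_cpu_seconds_total                      # node-exporter
container_memory_working_set_bytes          # cadvisor
jupyterhub_request_duration_seconds_count   # the hub itself
```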
There's a helper deploy.py script that can deploy the dashboards to any Grafana installation.
export GRAFANA_TOKEN="<API-TOKEN-FOR-YOUR-GRAFANA>"
./deploy.py <your-grafana-url>
This creates a folder called 'JupyterHub Default Dashboards' in your grafana, and adds a couple of dashboards to it.
If your Grafana deployment supports more than one datasource, then in addition to the default dashboards in the dashboards directory, you should also consider deploying the dashboards in the global-dashboards directory.
export GRAFANA_TOKEN="<API-TOKEN-FOR-YOUR-GRAFANA>"
./deploy.py <your-grafana-url> --dashboards-dir global-dashboards
The global dashboards will use the list of datasources available in your Grafana and will build dashboards across all of them.
NOTE: ANY CHANGES YOU MAKE VIA THE GRAFANA UI WILL BE OVERWRITTEN THE NEXT TIME YOU RUN deploy.py. TO MAKE CHANGES, EDIT THE JSONNET FILE AND DEPLOY AGAIN
If you are using a prometheus chart of a version later than 13.*, then additional configuration for kube-state-metrics needs to be provided, because v2.0 of the kube-state-metrics chart that comes with the latest prometheus doesn't add any labels by default. Since these dashboards assume the existence of such labels for pods or nodes, we need to explicitly configure prometheus to track them by populating the list at prometheus.kube-state-metrics.metricLabelsAllowlist:
prometheus:
  kube-state-metrics:
    metricLabelsAllowlist:
      # to select jupyterhub component pods and get the hub usernames
      - pods=[app,component,hub.jupyter.org/username]
      # allowing all labels is probably fine for nodes, since they don't churn much, unlike pods
      - nodes=[*]
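With those labels allowed, kube-state-metrics exposes them on its kube_pod_labels metric, prefixed with label_ and with special characters replaced by underscores. As a rough sketch (the exact label values depend on how your hub is deployed), a query like this then selects a hub's user pods:

```promql
# Allowlisted pod labels show up as label_* labels on kube_pod_labels.
# In a Zero to JupyterHub deployment, user pods carry app="jupyterhub" and
# component="singleuser-server"; the username is in label_hub_jupyter_org_username.
kube_pod_labels{label_app="jupyterhub", label_component="singleuser-server"}
```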
If you're using a prometheus chart older than version 14.*, then you can deploy the dashboards available prior to the upgrade, in the 1.0 tag.
The grafonnet jsonnet library is bundled here with jsonnet-bundler. Just running jb update in the git repo root dir after installing jsonnet-bundler should bring you up to speed.
Interpreting prometheus metrics and writing PromQL queries that serve a particular purpose can be difficult. Here are some guidelines to help.
"When will the OOM killer start killing processes in this container?" is the most useful
thing for us to know when measuring container memory usage. Of the many container memory
metrics, container_memory_working_set_bytes
tracks this (see this blog post
and this issue).
So prefer using that metric as the default for 'memory usage' unless specific reasons
exist for using a different metric.
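For instance, a minimal sketch of a per-pod 'memory usage' query built on this metric (label names assume a reasonably recent cadvisor):

```promql
# Memory in active use per pod, i.e. what the OOM killer looks at,
# summed across each pod's containers.
# name!="" drops duplicated series with an empty name label (see the note below).
sum(
  container_memory_working_set_bytes{name!=""}
) by (namespace, pod)
```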
The most common prometheus on kubernetes setup in the JupyterHub community seems to be the prometheus helm chart. The metrics it collects come from several sources:
- kube-state-metrics (metrics documentation) collects information about various kubernetes objects (pods, services, etc) by scraping the kubernetes API. Anything you can get via kubectl commands, you can probably get via a metric here. Very helpful as a way to query other metrics based on the kubernetes object they represent (like pod, node, etc).
- node-exporter (metrics documentation) collects information about each node - CPU usage, memory, disk space, etc. Since hostnames are usually random, you usually join these metrics with kube-state-metrics node metrics to get useful information out. If you are running a manual NFS server, it is recommended to run a node-exporter instance there as well to collect server metrics.
- cadvisor (metrics documentation) collects information about each container. Join these with pod metrics from kube-state-metrics for useful queries (see the sketch after this list).
- jupyterhub (metrics documentation) collects information directly from the JupyterHubs.
- Other components you have installed on your cluster - like prometheus, nginx-ingress, etc - will also emit their own metrics.
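To make "join these with pod metrics" concrete, here is a rough sketch, assuming the label allowlist configuration above and recent cadvisor label names, that restricts per-container CPU usage to JupyterHub user pods:

```promql
# cadvisor reports CPU usage per container; kube_pod_labels (always valued 1)
# acts as a filter-and-join, keeping only pods labeled as user servers.
sum(
  irate(container_cpu_usage_seconds_total{name!=""}[5m])
  * on (namespace, pod) group_left()
  kube_pod_labels{label_component="singleuser-server"}
) by (namespace, pod)
```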
It seems that one container's resource metrics can be reported multiple times, with an empty name label and a name=k8s_... label. Because of this, if we do sum(container_resource_metric) by (pod), we will often get twice the actual resource consumption of a given pod. Since name="" is always redundant, make sure to exclude it in any query that includes a sum across container metrics.
For example:
sum(
irate(container_cpu_usage_seconds_total{name!=""}[5m])
) by (namespace, pod)