
Better control over Prometheus metrics #4016

Open
dtelyukh opened this issue Feb 12, 2021 · 20 comments
Labels
discussion 💬 The right solution needs to be found · feature ⚙️ New feature or request

Comments

@dtelyukh
Contributor

dtelyukh commented Feb 12, 2021

On one hand, without labels, Prometheus metrics make barely more sense than measuring a nationwide average body temperature. On the other hand, too many labels can become a source of unbounded high cardinality, which causes issues when scraping with Prometheus over long periods of time and in larger deployments.

In our use case, we needed per-host metrics (for an obvious reason: different hosts often have very different performance) and also per-content-type metrics, for two reasons:

  1. Dynamically generated resources usually have a much higher TTFB, and mixing them with static files obscures actual peaks in the data, because each dynamically generated HTML document also pulls in 10 to 100 static files. Content-Type is a good heuristic for dynamic content, because HTML is usually dynamically generated, while everything else is static in most cases.

  2. Whether the HTML is dynamic or static, its performance is more critical to overall page load time: it is the root document for the page, so every millisecond of its load time is added directly to the total, which is not always the case for page resources.

We intended to create extra Grafana panels, to be able to see metric subsets that represent:

  • html documents only
  • small static files (js/css)
  • large static files (images)

Also, it might be useful to separate metrics by document cache status when Caddy's cache is enabled.

This is based on the discussion in ticket #3784 and in PR #4015.

We came up with the following proposal:

  • It's better to add extra configuration options that allow both more detailed metrics and a reduction of metrics size, rather than just hardcoding additional labels.
  • We cannot extend the existing metrics directive, because when it is defined inside a specific host it designates the virtual host that serves the metrics endpoint.
  • We chose metrics_options as the new directive name; does anyone have a better idea?
// enable per-host metrics for a single host
example.com {
  respond /foo "hello world"
  metrics_options {
    tags {
      host example.com
    }
  }
}

// enable per-host metrics + per-content-type for a single host
example2.com {
  respond /bar "hello world"
  metrics_options {
    tags {
      host example2.com
    }
    headers-to-tags {
      response.content-type
    }
  }
}

// disable metrics for a single host
example3.com:8443 {
  respond /baz "hello world"
  metrics_options {
    disable
  }
}

// disable metrics for a subpath
example3.com {
  handle_path /static {
    metrics_options {
      disable
    }
  }
}

// globally enable per-host metrics, and also per-cache-status, based on response headers
metrics_options {
  headers-to-tags {
    request.host
    response.x-cache-status
  }
}
@francislavoie added the discussion 💬 and feature ⚙️ labels on Feb 12, 2021
@francislavoie
Member

francislavoie commented Feb 12, 2021

I'm not sure that metrics_options can be per-path though (or even have matchers), because by that point it's too late. Metrics collection starts before the request is passed through the handlers, I think. But maybe I misunderstand how that would work.

Also, please remember that what's important to think about is how it's configured in JSON, before even thinking of how it'll look in the Caddyfile. The Caddyfile is just an adapter to JSON config.

One thought though is maybe an API for the Caddyfile like this:

# Enables viewing metrics, this already exists:
metrics [<matcher>]

# New directive for configuring per-path (if even possible)
metric [<matcher>] [disable] | [tag <name> <value>] {
    tag <name> <value>
	disable
}

# Examples
metric tag host {host}
metric /static* tag content-type {http.response.header.Content-Type}
metric /untracked* disable
metric /static* {
	tag host {host}
	tag content-type {http.response.header.Content-Type}
}

This would imply having a JSON handler like this:

{
	"handler": "metric",
	"tags": {
		"host": "{http.request.host}",
		"content-type": "{http.response.header.Content-Type}"
	},
	"disable": false
}

And I figure this would have to be a handler that defers execution until the middleware chain is coming back downstream if we plan to support reading from the response headers.
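
For illustration only, here is a rough Go sketch of what such a deferred handler could look like. The metricHandler type, its Tags field, and the headerRecorder wrapper are hypothetical; the only piece assumed from Caddy's API is the three-argument ServeHTTP used by middleware handlers in the caddyhttp package.

package metricsketch

import (
	"net/http"

	"github.com/caddyserver/caddy/v2/modules/caddyhttp"
)

// metricHandler is a hypothetical handler that records a request metric only
// after the rest of the middleware chain has produced a response, so that
// response headers (e.g. Content-Type) are available as label values.
type metricHandler struct {
	// Tags maps label names to placeholder strings, per the proposal above.
	Tags map[string]string
}

// headerRecorder wraps the ResponseWriter just to observe the status code on
// the way back downstream; headers are read from the underlying writer.
type headerRecorder struct {
	http.ResponseWriter
	status int
}

func (rec *headerRecorder) WriteHeader(code int) {
	rec.status = code
	rec.ResponseWriter.WriteHeader(code)
}

func (h metricHandler) ServeHTTP(w http.ResponseWriter, r *http.Request, next caddyhttp.Handler) error {
	rec := &headerRecorder{ResponseWriter: w, status: http.StatusOK}

	// Let the rest of the chain run first...
	err := next.ServeHTTP(rec, r)

	// ...then record, now that response headers have been populated.
	contentType := rec.Header().Get("Content-Type")
	_ = contentType // a real handler would increment a counter labeled with this value
	return err
}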

/cc @hairyhenderson

@dtelyukh
Contributor Author

dtelyukh commented Feb 12, 2021

And I figure this would have to be a handler that defers execution until the middleware chain is coming back downstream if we plan to support reading from the response headers.

@francislavoie, that's OK, because Caddy already sends the response code to metrics.

@hairyhenderson
Collaborator

I haven't had time to think too deeply about this yet, but in general all labels should be set on all endpoints. That is to say, if a content_type or host label is tracked, it should be present for all time series.

Another way to say that is - no per-matcher/per-path metrics options, metric configuration should apply globally. The one potential exception to this is we may want to have some way to ignore certain endpoints. But that doesn't need to be part of this design.

I don't think it's necessary to have a new directive - we should be able to use the existing metrics directive.

And I like the idea @francislavoie implied of having custom labels that simply use a placeholder (or even a static value) as the value; I could see a directive like this:

metrics /metrics {
	label host {http.request.host}
	label content_type {http.response.header.Content-Type}
	label static_label "same value for everyone!"
}

I'm not at all familiar with the placeholder code but I think this would be relatively straightforward to do. I'll see if I can find some time in the next few weeks to work on this, but I'd be very open to contributions for a design like this 😅

@francislavoie
Member

Another way to say that is - no per-matcher/per-path metrics options, metric configuration should apply globally.

If that's the case, then the existing metrics handler won't do, because in the Caddyfile, a site block is a host matcher. Metrics would need to be configured via global options, and probably be set at the apps.http.servers level. (FWIW if it's a server option, it can be made part of https://caddyserver.com/docs/caddyfile/options#server-options)

@francislavoie
Member

I spoke with @hairyhenderson and ultimately I don't think this'll be possible in any way that doesn't have drastic downsides.

The trouble is that metrics are configured at process startup and can't be safely reconfigured, because the Prometheus metrics lib aims to remain consistent for the lifetime of the process. If a reconfiguration were to happen, the entire metrics storage would need to be thrown away and reinitialized. That is not OK, because certain counters need to stick around to track other global state.
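
To make that constraint concrete, here is a small standalone sketch (the metric name is made up) showing how the Prometheus Go client refuses to register the same metric name again with a different label set, which is roughly what a reload that changes label configuration would amount to:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	// First "config load": a counter vector with a fixed label set.
	v1 := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "example_http_requests_total", // hypothetical metric name
		Help: "Example counter.",
	}, []string{"code"})
	fmt.Println("first register:", reg.Register(v1)) // <nil>

	// A "reload" that tries to add a new label (e.g. host) to the same metric
	// fails: a collector with the same fully-qualified name but different
	// label names cannot be registered again.
	v2 := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "example_http_requests_total",
		Help: "Example counter.",
	}, []string{"code", "host"})
	fmt.Println("second register:", reg.Register(v2)) // non-nil error

	// The only way out is reg.Unregister(v1) first, which throws away the
	// accumulated values, i.e. exactly the reset described above.
}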

Since Caddy is designed to have reloadable config without restarting the process, this is at odds with that design. Theoretically we could defer initialization of the metrics until after the config is initially loaded, but this would mean that any subsequent config change that attempts to change the metrics config would have to be rejected. Not ideal, and it breaks certain assumptions made in Caddy's architecture.

So to avoid needing to open the possibility for reloads to fail, any configuration for metrics would have to be provided separately, specifically at process startup. This might mean having a separate config mechanism outside of the regular configs (e.g. environment variables, or new CLI args to caddy run), but this has pretty bad UX problems.

So... what do you think @mholt? I don't see a happy path here.

@codecat

codecat commented Feb 14, 2021

The trouble is that metrics are configured at process startup, and can't be safely reconfigured because the prometheus metrics lib aims to remain consistent for the lifetime of the process.

I'm not at all familiar with the metrics lib used, but would it be possible to contribute a patch to the lib that allows re-configuration at runtime?

@francislavoie
Member

I don't think it would be easy. It seems fundamental to how it's designed. The lib in question is https://github.com/prometheus/client_golang/tree/master/prometheus.

I know that neither @hairyhenderson nor I is interested in working on that, so someone else would need to take it on if they're up for it.

@hairyhenderson
Collaborator

Just to add a bit more clarity on the issue - when exposing metrics with Prometheus, the expectation is that all metrics will remain consistent for the lifetime of the process, both in terms of value (a counter which starts at 0 at the beginning of the process should continue to increase indefinitely, as a reset back to 0 is interpreted as a process restart), and also in terms of labels (if caddy_http_request_duration_seconds_bucket{code="308",handler="static_response",method="GET",server="remaining_auto_https_redirects",le="0.005"} is exposed, it must not gain or lose a label during the lifetime of the process).

Another thing to consider is that Caddy exposes metrics on the admin API. An important metric for monitoring Caddy is caddy_admin_http_request_errors_total. When there's an error during a config reload, the caddy_admin_http_request_errors_total{handler="load",method="POST",path="/load"} series will increase, and it would be reasonable for a Prometheus alert to watch that series and fire if the error count rises above a desired threshold. Resetting this counter risks losing any visibility into those errors.

Bottom line: we can't move forward with any customization of the metrics series (i.e. labels) until we can find a suitable config mechanism that doesn't require resetting at config reload time. I hope @mholt has some ideas on that.

@mholt
Member

mholt commented Feb 15, 2021

@hairyhenderson

we can't move forward with any customization of the metrics series (i.e. labels) until we can find a suitable config mechanism that doesn't require resetting at config reload time.

That shouldn't be a problem. Caddy has something called a UsagePool that is designed for the sole purpose of maintaining global state across config reloads. You can dump your stats into that. This is how Caddy keeps track of the health status of proxy backends between config reloads, for example.
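
For reference, a rough sketch of the UsagePool idea, under the assumption that a module keeps a Prometheus registry in a package-level pool so the same collectors survive reloads; the sharedRegistry type, the pool key, and provisionMetrics are all made-up names:

package metricsketch

import (
	"github.com/caddyserver/caddy/v2"
	"github.com/prometheus/client_golang/prometheus"
)

// Package-level pool: values stored here outlive individual config loads.
var registryPool = caddy.NewUsagePool()

// sharedRegistry wraps a Prometheus registry so it satisfies caddy.Destructor.
type sharedRegistry struct {
	reg *prometheus.Registry
}

// Destruct runs when the last reference is deleted from the pool.
func (sharedRegistry) Destruct() error { return nil }

// provisionMetrics would run on every config load; the registry (and any
// counters registered in it) is only created the first time. A real module
// would also call registryPool.Delete in its Cleanup to balance the refcount.
func provisionMetrics() (*prometheus.Registry, error) {
	val, _, err := registryPool.LoadOrNew("metrics_registry", func() (caddy.Destructor, error) {
		return sharedRegistry{reg: prometheus.NewRegistry()}, nil
	})
	if err != nil {
		return nil, err
	}
	return val.(sharedRegistry).reg, nil
}

This keeps the values alive across reloads; as the next reply notes, the harder part is that the label sets themselves cannot change after the first load.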

@francislavoie
Member

@mholt I think you misunderstood; metrics are already global. The problem is that the metrics cannot ever change after the first load.

@mholt
Member

mholt commented Feb 15, 2021

@francislavoie What does their scope have to do with whether they change or not?

@francislavoie
Member

francislavoie commented Feb 15, 2021

Because the only way to change their scope involves resetting the metrics storage. As we wrote above, that's a deal-breaker: we can't reset the counters during config reloads, because many of them need to stick around to track global state.

Like I wrote above, the way prometheus metrics work is fundamentally incompatible with config reloads.

@mholt
Member

mholt commented Feb 15, 2021

That sounds like a pretty severe limitation in Prometheus, which is surprising. I don't know how I could be of help here, as this is a bug in Prometheus (which I'm not familiar with), not Caddy.

@francislavoie
Member

It's not a "bug" though, it's an assumption that underlies the metrics tracking system.

The question is this:

So to avoid needing to open the possibility for reloads to fail, any configuration for metrics would have to be provided separately, specifically at process startup. This might mean having a separate config mechanism outside of the regular configs (e.g. environment variables, or new CLI args to caddy run), but this has pretty bad UX problems.

and

Bottom line: we can't move forward with any customization of the metrics series (i.e. labels) until we can find a suitable config mechanism that doesn't require resetting at config reload time. I hope @mholt has some ideas on that.

@mholt
Member

mholt commented Feb 15, 2021

It's a broken assumption then. Servers change their configs, it's as simple as that. Prometheus should know that by now.

@mr-karan

We could probably use https://github.com/VictoriaMetrics/metrics, which supports creating metrics on the fly. I believe what you want is creating a NewSet.

It's also a nicer alternative to the official client lib, with fewer dependencies.
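
For context, a minimal sketch of that Set-based approach (metric and label names here are purely illustrative): each config load could build a fresh Set with whatever labels the new config asks for and serve it from the metrics endpoint.

package main

import (
	"log"
	"net/http"

	"github.com/VictoriaMetrics/metrics"
)

func main() {
	// A Set is an independent collection of metrics; a new one could be built
	// on each config load and swapped in behind the /metrics handler.
	s := metrics.NewSet()
	requests := s.NewCounter(`requests_total{host="example.com",content_type="text/html"}`)
	requests.Inc()

	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		// Write only this set's metrics in Prometheus text format.
		s.WritePrometheus(w)
	})
	log.Fatal(http.ListenAndServe(":9090", nil))
}

Whether swapping sets like that is compatible with the consistency expectations described above is exactly the concern raised in the next comment.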

cc @hairyhenderson @francislavoie

@hairyhenderson
Collaborator

@mr-karan I'm not sure that's exactly what we need. What's lacking is not the ability to create metrics "on the fly" (which the Prometheus client can obviously also do), but the ability to modify metrics after they have been created.

If that's possible with VictoriaMetrics, then that indicates a bug in that client, not a feature. Metrics need to be guaranteed to stay consistent through the life of the process.

@mr-karan

mr-karan commented Feb 16, 2021

but instead the ability to modify metrics after being created

To demonstrate with an example:

metric_label_a{val="ok"} 1

You want this to be modified to something else after the reload, like:

metric_label_a{new_val="hello"} 1

Is my understanding correct?

@manojm321

Leaving this here in case it helps: if you use Telegraf to scrape metrics, you can easily rename the tag values: manojm321/ansible-synology@83ca673

@danlsgiga
Contributor

danlsgiga commented Nov 22, 2022

This issue doesn't seem to be getting traction. It would be nice to have the labels capability.
In my case, I configure Caddy via the Caddyfile, and metrics are emitted using the listener server name (srv0, srv1, etc.) as the server label. Since I can't change the listener server name via the Caddyfile, only via JSON, it's hard for me to know which metrics relate to which listeners.
