
Better control over Prometheus metrics #4016

Open
dtelyukh opened this issue Feb 12, 2021 · 20 comments
Labels
discussion 💬 The right solution needs to be found · feature ⚙️ New feature or request

Comments

@dtelyukh
Contributor

dtelyukh commented Feb 12, 2021

On one hand, without labels, Prometheus metrics make barely more sense than measuring a nationwide average body temperature. On the other hand, too many labels can become a source of unbounded high cardinality, which causes issues when scraping with Prometheus over long periods of time and in larger deployments.

In our use case, we needed per-host metrics (for an obvious reason: different hosts often have very different performance) and also per-content-type metrics, for two reasons:

  1. Dynamically generated resources usually have a much higher TTFB, and mixing them with static files obscures actual peaks in the data, because each dynamically generated HTML document also pulls in 10 to 100 static files. Content-Type is a good heuristic for dynamic content, because HTML is usually dynamically generated, while everything else is static in most cases.

  2. Whether the HTML is dynamic or static, its performance is more critical to overall page load time: it is the root document for the page, so every millisecond of its load time is added directly to the total, which is not always the case for page resources.

We intended to create extra Grafana panels, to be able to see metric subsets that represent:

  • html documents only
  • small static files (js/css)
  • large static files (images)

Also, it might be useful to separate metrics by document cache status when Caddy's cache is enabled.

This is based on the discussion in ticket #3784 and in PR #4015.

We came up with the following proposal:

  • It's better to add extra configuration options that allow both more detailed metrics and a reduction of metrics size, rather than just hardcoding additional labels.
  • We cannot extend the existing metrics directive, because when it is defined inside a specific host it designates the virtual host that serves the metrics endpoint.
  • We chose metrics_options as the new directive name; does anyone have a better idea?
// enable per-host metrics for a single host
example.com {
  respond /foo "hello world"
  metrics_options {
    tags {
      host example.com
    }
  }
}

// enable per-host metrics + per-content-type for a single host
example2.com {
  respond /bar "hello world"
  metrics_options {
    tags {
      host example2.com
    }
    headers-to-tags {
      response.content-type
    }
  }
}

// disable metrics for a single host
example3.com:8443 {
  respond /baz "hello world"
  metrics_options {
    disable
  }
}

// disable metrics for a subpath
example3.com {
  handle_path /static {
    metrics_options {
      disable
    }
  }
}

// globally enable per-host metrics, and also per-cache-status, based on response headers
metrics_options {
  headers-to-tags {
    request.host
    response.x-cache-status
  }
}
@francislavoie added the discussion 💬 and feature ⚙️ labels on Feb 12, 2021
@francislavoie
Member

francislavoie commented Feb 12, 2021

I'm not sure that metrics_options can be per-path though (or even have matchers), because by that point it's too late. Metrics collection starts before the request is passed through the handlers, I think. But maybe I misunderstand how that would work.

Also, please remember that what's important to think about is how it's configured in JSON, before even thinking of how it'll look in the Caddyfile. The Caddyfile is just an adapter to JSON config.

One thought though is maybe an API for the Caddyfile like this:

# Enables viewing metrics, this already exists:
metrics [<matcher>]

# New directive for configuring per-path (if even possible)
metric [<matcher>] [disable] | [tag <name> <value>] {
    tag <name> <value>
	disable
}

# Examples
metric tag host {host}
metric /static* tag content-type {http.response.header.Content-Type}
metric /untracked* disable
metric /static* {
	tag host {host}
	tag content-type {http.response.header.Content-Type}
}

This would imply having a JSON handler like this:

{
	"handler": "metric",
	"tags": {
		"host": "{http.request.host}",
		"content-type": "{http.response.header.Content-Type}"
	},
	"disable": false
}

And I figure this would have to be a handler that defers execution until the middleware chain is coming back downstream if we plan to support reading from the response headers.
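
For illustration only, here is a rough Go sketch of what such a deferred handler could look like. The metricHandler type, its Tags field, and the headerRecorder wrapper are hypothetical; the only piece assumed from Caddy's API is the three-argument ServeHTTP used by middleware handlers in the caddyhttp package.

package metricsketch

import (
	"net/http"

	"github.com/caddyserver/caddy/v2/modules/caddyhttp"
)

// metricHandler is a hypothetical handler that records a request metric only
// after the rest of the middleware chain has produced a response, so that
// response headers (e.g. Content-Type) are available as label values.
type metricHandler struct {
	// Tags maps label names to placeholder strings, per the proposal above.
	Tags map[string]string
}

// headerRecorder wraps the ResponseWriter just to observe the status code on
// the way back downstream; headers are read from the underlying writer.
type headerRecorder struct {
	http.ResponseWriter
	status int
}

func (rec *headerRecorder) WriteHeader(code int) {
	rec.status = code
	rec.ResponseWriter.WriteHeader(code)
}

func (h metricHandler) ServeHTTP(w http.ResponseWriter, r *http.Request, next caddyhttp.Handler) error {
	rec := &headerRecorder{ResponseWriter: w, status: http.StatusOK}

	// Let the rest of the chain run first...
	err := next.ServeHTTP(rec, r)

	// ...then record, now that response headers have been populated.
	contentType := rec.Header().Get("Content-Type")
	_ = contentType // a real handler would increment a counter labeled with this value
	return err
}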

/cc @hairyhenderson

@dtelyukh
Contributor Author

dtelyukh commented Feb 12, 2021

And I figure this would have to be a handler that defers execution until the middleware chain is coming back downstream if we plan to support reading from the response headers.

@francislavoie, that's OK, because Caddy already sends the response code to metrics.

@hairyhenderson
Collaborator

I haven't had time to think too deeply about this yet, but in general all labels should be set on all endpoints. That is to say, if a content_type or host label is tracked, it should be present for all time series.

Another way to say that is - no per-matcher/per-path metrics options, metric configuration should apply globally. The one potential exception to this is we may want to have some way to ignore certain endpoints. But that doesn't need to be part of this design.

I don't think it's necessary to have a new directive - we should be able to use the existing metrics directive.

And I like the idea @francislavoie implied of having custom labels that simply use a placeholder (or even a static value) as the value; I could see a directive like this:

metrics /metrics {
	label host {http.request.host}
	label content_type {http.response.header.Content-Type}
	label static_label "same value for everyone!"
}

I'm not at all familiar with the placeholder code but I think this would be relatively straightforward to do. I'll see if I can find some time in the next few weeks to work on this, but I'd be very open to contributions for a design like this 😅

@francislavoie
Member

Another way to say that is - no per-matcher/per-path metrics options, metric configuration should apply globally.

If that's the case, then the existing metrics handler won't do, because in the Caddyfile, a site block is a host matcher. Metrics would need to be configured via global options, and probably be set at the apps.http.servers level. (FWIW if it's a server option, it can be made part of https://caddyserver.com/docs/caddyfile/options#server-options)

@francislavoie
Member

I spoke with @hairyhenderson and ultimately I don't think this'll be possible in any way that doesn't have drastic downsides.

The trouble is that metrics are configured at process startup and can't be safely reconfigured, because the Prometheus metrics lib aims to remain consistent for the lifetime of the process. If a reconfiguration were to happen, the entire metrics storage would need to be thrown away and reinitialized. That is not OK, because certain counters need to stick around to track other global state.
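
To make that constraint concrete, here is a small standalone sketch (the metric name is made up) showing how the Prometheus Go client refuses to register the same metric name again with a different label set, which is roughly what a reload that changes label configuration would amount to:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	// First "config load": a counter vector with a fixed label set.
	v1 := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "example_http_requests_total", // hypothetical metric name
		Help: "Example counter.",
	}, []string{"code"})
	fmt.Println("first register:", reg.Register(v1)) // <nil>

	// A "reload" that tries to add a new label (e.g. host) to the same metric
	// fails: a collector with the same fully-qualified name but different
	// label names cannot be registered again.
	v2 := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "example_http_requests_total",
		Help: "Example counter.",
	}, []string{"code", "host"})
	fmt.Println("second register:", reg.Register(v2)) // non-nil error

	// The only way out is reg.Unregister(v1) first, which throws away the
	// accumulated values, i.e. exactly the reset described above.
}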

Since Caddy is designed to have reloadable config without restarting the process, this is at odds with that design. Theoretically we could defer initialization of the metrics until after the config is initially loaded, but this would mean that any subsequent config change that attempts to change the metrics config would have to be rejected. Not ideal, and it breaks certain assumptions made in Caddy's architecture.

So to avoid needing to open the possibility for reloads to fail, any configuration for metrics would have to be provided separately, specifically at process startup. This might mean having a separate config mechanism outside of the regular configs (e.g. environment variables, or new CLI args to caddy run), but this has pretty bad UX problems.

So... what do you think @mholt? I don't see a happy path here.

@codecat

codecat commented Feb 14, 2021

The trouble is that metrics are configured at process startup, and can't be safely reconfigured because the prometheus metrics lib aims to remain consistent for the lifetime of the process.

I'm not at all familiar with the metrics lib used, but would it be possible to contribute a patch to the lib that allows re-configuration at runtime?

@francislavoie
Member

I don't think it would be easy. It seems fundamental to how it's designed. The lib in question is https://github.com/prometheus/client_golang/tree/master/prometheus.

I know that neither @hairyhenderson nor I is interested in working on that, so someone else would need to take it on if they're up for it.

@hairyhenderson
Collaborator

Just to add a bit more clarity on the issue - when exposing metrics with Prometheus, the expectation is that all metrics will remain consistent for the lifetime of the process, both in terms of value (a counter which starts at 0 at the beginning of the process should continue to increase indefinitely, as a reset back to 0 is interpreted as a process restart), and also in terms of labels (if caddy_http_request_duration_seconds_bucket{code="308",handler="static_response",method="GET",server="remaining_auto_https_redirects",le="0.005"} is exposed, it must not gain or lose a label during the lifetime of the process).

Another thing to consider is that Caddy exposes metrics on the admin API. An important metric for monitoring Caddy is caddy_admin_http_request_errors_total. When there's an error during a config reload, the caddy_admin_http_request_errors_total{handler="load",method="POST",path="/load"} series will increase, and it would be reasonable for a Prometheus alert to watch that series and fire if the error count rises above a desired threshold. Resetting this counter risks losing any visibility into those errors.

Bottom line: we can't move forward with any customization of the metrics series (i.e. labels) until we can find a suitable config mechanism that doesn't require resetting at config reload time. I hope @mholt has some ideas on that.

@mholt
Member

mholt commented Feb 15, 2021

@hairyhenderson

we can't move forward with any customization of the metrics series (i.e. labels) until we can find a suitable config mechanism that doesn't require resetting at config reload time.

That shouldn't be a problem. Caddy has something called a UsagePool that is designed for the sole purpose of maintaining global state across config reloads. You can dump your stats into that. This is how Caddy keeps track of the health status of proxy backends between config reloads, for example.
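
For reference, a rough sketch of the UsagePool idea, under the assumption that a module keeps a Prometheus registry in a package-level pool so the same collectors survive reloads; the sharedRegistry type, the pool key, and provisionMetrics are all made-up names:

package metricsketch

import (
	"github.com/caddyserver/caddy/v2"
	"github.com/prometheus/client_golang/prometheus"
)

// Package-level pool: values stored here outlive individual config loads.
var registryPool = caddy.NewUsagePool()

// sharedRegistry wraps a Prometheus registry so it satisfies caddy.Destructor.
type sharedRegistry struct {
	reg *prometheus.Registry
}

// Destruct runs when the last reference is deleted from the pool.
func (sharedRegistry) Destruct() error { return nil }

// provisionMetrics would run on every config load; the registry (and any
// counters registered in it) is only created the first time. A real module
// would also call registryPool.Delete in its Cleanup to balance the refcount.
func provisionMetrics() (*prometheus.Registry, error) {
	val, _, err := registryPool.LoadOrNew("metrics_registry", func() (caddy.Destructor, error) {
		return sharedRegistry{reg: prometheus.NewRegistry()}, nil
	})
	if err != nil {
		return nil, err
	}
	return val.(sharedRegistry).reg, nil
}

This keeps the values alive across reloads; as the next reply notes, the harder part is that the label sets themselves cannot change after the first load.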

@francislavoie
Member

@mholt I think you misunderstood; metrics are already global. The problem is that the metrics cannot ever change after the first load.

@mholt
Member

mholt commented Feb 15, 2021

@francislavoie What does their scope have to do with whether they change or not?

@francislavoie
Member

francislavoie commented Feb 15, 2021

Because the only way to change their scope involves resetting the metrics storage. As we wrote above, that's a deal-breaker: we can't reset the counters during config reloads, because many of them need to stick around to track global state.

Like I wrote above, the way prometheus metrics work is fundamentally incompatible with config reloads.

@mholt
Member

mholt commented Feb 15, 2021

That sounds like a pretty severe limitation in Prometheus, which is surprising. I don't know how I could be of help here, as this is a bug in Prometheus (which I'm not familiar with), not Caddy.

@francislavoie
Member

It's not a "bug" though, it's an assumption that underlies the metrics tracking system.

The question is this:

So to avoid needing to open the possibility for reloads to fail, any configuration for metrics would have to be provided separately, specifically at process startup. This might mean having a separate config mechanism outside of the regular configs (e.g. environment variables, or new CLI args to caddy run), but this has pretty bad UX problems.

and

Bottom line: we can't move forward with any customization of the metrics series (i.e. labels) until we can find a suitable config mechanism that doesn't require resetting at config reload time. I hope @mholt has some ideas on that.

@mholt
Member

mholt commented Feb 15, 2021

It's a broken assumption then. Servers change their configs, it's as simple as that. Prometheus should know that by now.

@mr-karan

We could probably use https://github.com/VictoriaMetrics/metrics, which supports creating metrics on the fly. I believe what you want is creating a NewSet.

It's also a nicer alternative to the official client lib, with fewer dependencies.
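
For context, a minimal sketch of that Set-based approach (metric and label names here are purely illustrative): each config load could build a fresh Set with whatever labels the new config asks for and serve it from the metrics endpoint.

package main

import (
	"log"
	"net/http"

	"github.com/VictoriaMetrics/metrics"
)

func main() {
	// A Set is an independent collection of metrics; a new one could be built
	// on each config load and swapped in behind the /metrics handler.
	s := metrics.NewSet()
	requests := s.NewCounter(`requests_total{host="example.com",content_type="text/html"}`)
	requests.Inc()

	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		// Write only this set's metrics in Prometheus text format.
		s.WritePrometheus(w)
	})
	log.Fatal(http.ListenAndServe(":9090", nil))
}

Whether swapping sets like that is compatible with the consistency expectations described above is exactly the concern raised in the next comment.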

cc @hairyhenderson @francislavoie

@hairyhenderson
Collaborator

@mr-karan I'm not sure that's exactly what we need. What's lacking is not the ability to create metrics "on the fly" (which the Prometheus client can obviously also do), but the ability to modify metrics after they have been created.

If that's possible with VictoriaMetrics, then that indicates a bug in that client, not a feature. Metrics need to be guaranteed to stay consistent through the life of the process.

@mr-karan

mr-karan commented Feb 16, 2021

but instead the ability to modify metrics after being created

To demonstrate with an example:

metric_label_a{val="ok"} 1

You want this to be modified to something else after the reload, like:

metric_label_a{new_val="hello"} 1

Is my understanding correct?

@manojm321

Leaving this here in case it helps: if you use Telegraf to scrape metrics, you can easily rename the tag values: manojm321/ansible-synology@83ca673

@danlsgiga
Contributor

danlsgiga commented Nov 22, 2022

This issue doesn't seem to be getting traction. It would be nice to have the labels capability.
In my case, I configure Caddy via the Caddyfile, and metrics are emitted using the listener server name (srv0, srv1, etc.) as the server label. Since I can't change the listener server name via the Caddyfile, only via JSON, it's hard for me to know which metrics relate to which listeners.
