
TTL for pushed metrics? #117

Closed
Rotwang opened this issue May 9, 2017 · 12 comments

Comments

@Rotwang

Rotwang commented May 9, 2017

Hi, it appears that the pushgateway doesn't support any form of TTL for pushed metrics. Yes, I've seen this link: https://prometheus.io/docs/practices/pushing/. However, a cache should be invalidated under the right circumstances, and I think introducing a TTL (e.g. via a "meta" label like 'push_gateway_ttl_seconds') could help remove stale cached metrics.
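
A rough sketch of how such a push might look from Go with client_golang — the `push_gateway_ttl_seconds` grouping label here is purely hypothetical and is not understood by any Pushgateway release; it only illustrates the proposal (gateway address, job name, and metric are placeholders):

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	completionTime := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "batch_job_last_completion_timestamp_seconds",
		Help: "Unix timestamp of the last completed batch job.",
	})
	completionTime.SetToCurrentTime()

	// Hypothetical TTL hint: the Pushgateway does not interpret this label;
	// it is only shown to illustrate the proposal in this issue.
	if err := push.New("http://pushgateway:9091", "my_batch_job").
		Grouping("push_gateway_ttl_seconds", "3600").
		Collector(completionTime).
		Push(); err != nil {
		log.Fatal(err)
	}
}
```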

@brian-brazil
Contributor

Dupe of #19

For the rare times you need to delete a group, you can do so by hand.
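
For reference, deleting a group "by hand" means sending an HTTP DELETE to the group's URL (e.g. /metrics/job/<job>/<label>/<value>). A minimal sketch using client_golang's Pusher, with the gateway address, job name, and grouping labels as placeholders:

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// Deletes every metric in the group identified by the job name and the
	// given grouping labels; equivalent to an HTTP DELETE against
	// /metrics/job/my_batch_job/instance/some-instance.
	if err := push.New("http://pushgateway:9091", "my_batch_job").
		Grouping("instance", "some-instance").
		Delete(); err != nil {
		log.Fatal(err)
	}
}
```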

@Rotwang
Author

Rotwang commented May 9, 2017

I'm wondering why this is an anti-pattern. Say I have a batch job 'A' that I run periodically; a month later this job is (a) no longer required, or maybe (b) it was converted to a daemon with its own exporter. In case (b) I now have a duplicate metric available (daemon exporter and push gateway). In case (a) I have a stale metric (information which is no longer valid).
I don't want to micromanage my metrics either and DELETE them from the push gateway. It would be much easier to just set a TTL on the metric I'm sending.

@beorn7
Member

beorn7 commented May 9, 2017

#19 documents the conclusion back then. If you want to bring forward new evidence that justifies re-opening the discussion, please do so on the prometheus-developers mailing list.

@joshk0

joshk0 commented Jun 10, 2017

Sorry to keep dredging this up, but I would like the dev team to explain the best practice for this situation:

  • We have an upload area where files are delivered.
  • Every time a file is noticed, a Kubernetes Job is spun up to ingest the file.
  • When it completes, the Job pushes file-level metrics (number of records successfully processed) to the pushgateway. The endpoint that is pushed to is http://pushgateway/job/ingest/instance/HOSTNAME_OF_POD .
  • We want to track the incoming number of records (each incoming file has a varying number of records) per hour over all Jobs using Prometheus.
  • ~Hundreds of files arrive per hour.

This works well but the queries get slower as the number of unique hostnames increases. This is because every unique instance name is permanently remembered by the pushgateway. The way the pushgateway is designed, it seems like we have these choices:

  • Use a less unique instance name than HOSTNAME_OF_POD. But then if n files for the instance name are processed in the same scraping period (the files are often quite small), then all but 1 of the metrics for that scraping period would be lost and we would under-report.
  • Keep sending the hostname as the instance value, but delete metrics for completed ingestion processes after a given amount of time (see the sketch after this comment). How long would this need to be to prevent our graphs from being incomplete? It sounds like it should be fine as long as it's a few multiples of the scraping period.

I think this is a common use case, and that the official documentation should describe what the best practice for this sort of 'ephemeral producer' use case is.
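
For concreteness, the push side of the setup described above might look roughly like this in Go with client_golang (the metric name, gateway address, and use of the pod hostname as the instance grouping are assumptions based on the description); the second option would additionally need something like a reaper or cron job issuing a group DELETE after the retention window, as in the earlier deletion sketch:

```go
package main

import (
	"log"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

// pushRecordCount pushes the per-file record count under the pod's hostname
// as the instance grouping label, matching the
// /job/ingest/instance/HOSTNAME_OF_POD endpoint described above.
func pushRecordCount(recordsProcessed float64) error {
	hostname, err := os.Hostname()
	if err != nil {
		return err
	}

	records := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "ingest_records_processed",
		Help: "Number of records successfully processed for the ingested file.",
	})
	records.Set(recordsProcessed)

	return push.New("http://pushgateway:9091", "ingest").
		Grouping("instance", hostname).
		Collector(records).
		Push()
}

func main() {
	if err := pushRecordCount(12345); err != nil {
		log.Fatal(err)
	}
}
```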

@beorn7
Member

beorn7 commented Jun 12, 2017

I'm sure the Prometheus community is happy to discuss your use case. But an already closed GitHub issue is not the right place. Could you post to the prometheus-users mailing list where the discussion is accessible for everybody so that more people can benefit from it?

@eloo

eloo commented Jul 10, 2017

would like to see a TTL feature too

@yumpy

yumpy commented Nov 22, 2017

I'd also like to see a TTL feature. Having to manually remove stale groups is painful and for me it's not a 'rare time'.

@eloo

eloo commented Nov 23, 2017

@yumpy maybe you want to look at this fork
https://github.com/pkcakeout/pushgateway

@0Ams

0Ams commented Nov 24, 2017

Would like to see a TTL feature in a stable version too.

@Q-Lee

Q-Lee commented Dec 6, 2017

I would also like this feature. Prometheus is frequently deployed in container infrastructures, where all jobs are ephemeral. This is doubly the case with the push gateway, which is designed for ephemeral jobs.

Furthermore, mailing lists are where discussions go to die. Email chains fork, they're often only visible to a small group, and they're frequently lost. Github persists context and conversation across years, as this issue shows.

Garbage collecting push metrics is nearly impossible in the prometheus model, because it's hard to know when a metric is no longer relevant. Fortunately, most of us don't need perfect: we need good enough. And stale metric deletion is good enough for most use cases.

@Spritekin

I have the same problem. I have hundreds of jobs every day and I need to monitor their status; a job finishes after a few hours but its metric is still there. New jobs arrive continuously, so the pushgateway keeps accumulating them.
So I really need a way to expire entries.

Meanwhile, given those are ephemeral values, I guess I can delete all entries before adding new ones.
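
A minimal sketch of that approach with client_golang, assuming a fixed grouping per run (job name, gateway address, and metric are placeholders). Note that Push() issues an HTTP PUT, which already replaces all metrics previously pushed under the same group, so the explicit Delete() mainly matters when the grouping labels change between runs:

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	pusher := push.New("http://pushgateway:9091", "daily_batch").
		Grouping("instance", "run-42")

	// Drop whatever a previous run left behind for this group.
	if err := pusher.Delete(); err != nil {
		log.Println("delete failed:", err)
	}

	status := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "batch_job_success",
		Help: "1 if the batch job succeeded, 0 otherwise.",
	})
	status.Set(1)

	// PUT the new value for the group.
	if err := pusher.Collector(status).Push(); err != nil {
		log.Fatal(err)
	}
}
```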

@beorn7
Copy link
Member

beorn7 commented Feb 28, 2018

Could you please take #117 (comment) into account? Really, folks, you are using the wrong forum to express your concerns.

@prometheus prometheus locked and limited conversation to collaborators Feb 28, 2018