Allow timeout for metrics #19

Closed
matthiasr opened this issue Feb 3, 2015 · 15 comments · Fixed by #208

Comments

@matthiasr

In some scenarios, a client will stop pushing metrics because it has gone away. Currently every node needs to be deleted explicitly or the last value will stick around forever. It would be good to be able to configure a timeout after which a metric is considered stale and removed. I think it would be best if the client could specify this.

beorn7 self-assigned this Feb 3, 2015
@brian-brazil
Contributor

You may be interested in the textfile module of the node_exporter. It lets you export information via files on the local filesystem, and since it runs on the node, the metrics go away when the node does.
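
For illustration, a minimal sketch of that textfile approach using the Python client library; the metric name, value, and output path are hypothetical, and the path must match the directory the textfile collector is configured to read:

```python
from prometheus_client import CollectorRegistry, Gauge, write_to_textfile

# Hypothetical per-node metric written by a local job; the node_exporter's
# textfile collector picks up *.prom files from its configured directory.
registry = CollectorRegistry()
g = Gauge('chef_client_last_run_timestamp_seconds',
          'Unix timestamp of the last chef-client run',
          registry=registry)
g.set_to_current_time()

# The path is an assumption; use the textfile collector's configured directory.
write_to_textfile('/var/lib/node_exporter/textfile/chef_client.prom', registry)
```

When the node goes away, its node_exporter target disappears with it, and so do these series.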

@juliusv
Member

juliusv commented Feb 3, 2015

@matthiasr Actually this is a great point by @brian-brazil. We should simply move chef-client exporting to the node exporter, since it can be considered a per-node metric. Then the time series will go away automatically if the host is gone.

@brian-brazil
Contributor

A use case for this has appeared; it may be a way to allow clients who really, really want to push to do so, while offering some GC.

@matthiasr
Author

@juliusv Agreed, that side-steps the issue in our case. But I think it's still needed, e.g. for cron jobs: an hourly cron job may report to the pushgateway, but after >1h that metric is no longer valid.

@brian-brazil
Contributor

@matthiasr Hourly cron jobs are service-level monitoring of batch jobs, which is the primary use case for the Pushgateway; you'd export that without any expiry, timestamps, or other advanced features like that.

@matthiasr
Author

Not necessarily … I'm not monitoring the job itself, but rather, e.g., some complex value calculated by a Hadoop job.

But even when monitoring, say, the runtime of my cron job, how would I tell whether it just always takes the same time or whether it never ran again? I'd rather have no metric if it didn't run than the metric from the last time it ran. At least in some cases, which is why I think it should be optional.

@juliusv
Member

juliusv commented Feb 4, 2015

@matthiasr To expand on what Brian said, an hourly cronjob would push its last completion timestamp to the pushgateway. That way you can monitor (via time() - last_successful_run_timestamp_seconds) whether your batch job hasn't run for too long. The metric would still be ingested by Prometheus upon every scrape from the pushgateway and get a server-side current timestamp attached.
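
As a concrete sketch of that pattern using the Python client library; the gateway address and job name are placeholders:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# Record the completion time of this batch run and push it to the Pushgateway.
registry = CollectorRegistry()
g = Gauge('last_successful_run_timestamp_seconds',
          'Unix timestamp of the last successful batch job run',
          registry=registry)
g.set_to_current_time()

# Gateway address and job name are assumptions for the example.
push_to_gateway('pushgateway.example.org:9091', job='my_batch_job', registry=registry)
```

An alert expression such as `time() - last_successful_run_timestamp_seconds > 2 * 3600` (threshold chosen arbitrarily here) then fires when the hourly job has not completed for too long.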


lemoer added a commit to lemoer/pushgateway that referenced this issue Jun 1, 2016
In some situations it is very useful to be able to submit values to
the pushgateway that disappear after a certain time (if they are not
refreshed).

The lifetime is specified by adding a "Lifetime" field to the HTTP
headers. The value is a string in the format accepted by Go's
"ParseDuration" function.

Implements prometheus#19
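
For illustration only, a sketch of what a push using the proposed (never merged) "Lifetime" header might look like, assuming the standard Pushgateway push endpoint; the address, job, instance, and metric are placeholders:

```python
import requests

# Text exposition format body; metric name and value are made up for the example.
payload = '# TYPE backup_duration_seconds gauge\nbackup_duration_seconds 42.0\n'

# Standard Pushgateway push endpoint; the "Lifetime" header exists only in the
# proposed (unmerged) patch above and takes a Go duration string such as "5m".
resp = requests.put(
    'http://pushgateway.example.org:9091/metrics/job/backup/instance/host01',
    data=payload,
    headers={'Lifetime': '5m'},
    timeout=5,
)
resp.raise_for_status()
```
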
@beorn7
Member

beorn7 commented Jun 13, 2016

After some discussions, the conclusion is that we don't want this feature for now (in the spirit of https://twitter.com/solomonstre/status/715277134978113536). In most cases, this feature is requested in order to implement anti-patterns in the monitoring setup. There might still be a small number of legitimate use cases, but in view of the huge potential for abusing the feature, and the semantic intricacies that would be hard to get right in implementing it, we declare it a bad trade-off.

@fvigotti

fvigotti commented Oct 1, 2018

I would like a TTL too; in the end I've created a very bare while loop in bash to accomplish that.
I use the pushgateway for some short-lived scripts which I want to monitor and gather stats for, e.g.:

1) I have a bash script that triggers backup jobs using a mixture of cron and inotify. These are short-lived bash jobs attached to the Kubernetes StatefulSets they serve. I export metrics to Prometheus, but when the StatefulSet gets recreated (e.g. on eviction from a node, an update, ...) I still have all the old job/instance groups in my pushgateway, and they stay there forever. I now delete them automatically after 10 minutes; without that script I have a mess that is hard to filter in my Grafana graphs and alerts (maybe it would be easier in Prometheus alerts, which have a more expressive language for alerting).

Anyway, I don't see why you are so strongly opinionated against a TTL. It's a feature that is not hard to implement, and a lot of people want it. I understand you can say that everyone is using the pushgateway in the wrong manner, but maybe that's not true; a lot of people have different problems to solve.

Now, do you have a good alternative for my use case? At the end of it I have a lot of duplicated metrics like this:
notifyborgbackup_throttle{instance="10.40.80.17",job="bk_grafana"} 1.538400733e+09
with an expired "instance" IP address that no longer exists, but Prometheus still scrapes those metrics from the pushgateway and creates a lot of duplicated data for dead pods in my Prometheus database.

Another example:
2) I have some bash scripts that monitor the latency of queries to legacy systems that are hard to monitor any other way. I don't want to open a socket in bash to let Prometheus pull the data, so I use the pushgateway.

Let me know.
Thank you,
Francesco

P.S. In case the script is useful to someone (I searched but found nothing before creating mine):
https://gist.github.com/fvigotti/cf5938d2ea037422555550e649b6a2c7
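
The manual cleanup described above boils down to deleting stale groups via the Pushgateway's delete endpoint. A minimal sketch of that step (the address is a placeholder; the job/instance values are taken from the example metric above; a real script would first decide which groups are stale, e.g. by inspecting the Pushgateway's push_time_seconds metric):

```python
import requests

PUSHGATEWAY = 'http://pushgateway.example.org:9091'  # placeholder address

# Delete every metric of the group identified by these grouping labels
# (values taken from the example metric above).
resp = requests.delete(
    f'{PUSHGATEWAY}/metrics/job/bk_grafana/instance/10.40.80.17',
    timeout=5,
)
resp.raise_for_status()
```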

@juliusv
Member

juliusv commented Oct 1, 2018

@fvigotti Since the statefulset itself is fundamentally long-running and discoverable via Kubernetes SD (which gives you all the discovery metadata benefits), it seems like this is a similar case to using the Node Exporter's textfile module for metrics tied to a specific host (just that here it's a statefulset's pod and not a host). So I'd expect the recommended thing to do would be to have a sidecar in each pod that serves metrics (either a specialized exporter or the Node Exporter with only the textfile collector enabled) instead of pushing the metrics to a PGW whose lifecycle is not tied to the pod's. This will also enable clean cuts of metrics: even with a TTL you will either lose metrics too early or have a lot of overlap between dead and alive instances.

@beorn7
Member

beorn7 commented Oct 1, 2018

It makes more sense to have a discussion like this on the various Prometheus mailing lists rather than in a GitHub issue (in particular a closed one). A straightforward feature request might still be a fit for a (new) GitHub issue, but where it is already apparent that it is more complicated than "Good idea, PRs welcome", I'd strongly recommend starting a thread on the prometheus-developers mailing list. If you are seeking advice on how to implement your use case with the given tooling, the prometheus-users mailing list is a good place. On both mailing lists, more people are available to potentially chime in, and the whole community can benefit.

@fvigotti

fvigotti commented Oct 2, 2018

@juliusv Yes, the StatefulSets themselves (e.g. MySQL, Jenkins, etc.) export their metrics using standard patterns, being long-running services. But take the job in preDestroy, which triggers a snapshot backup plus some checks and then pushes metrics about the backup/destroy process, while the StatefulSet, although long-running, is about to shut down, so I can't wait for the next Prometheus scrape interval; or a sidecar pod with a bash script that performs backups and different/custom health checks. Those are better exported using the PGW with a simple curl, without having to integrate node exporter textfiles or web services into every sidecar.

I'm saying that not because I'm looking for advice on how I set up metrics (even if advice is always welcome :) ), but to show you how I use the PGW and why I'm also interested in a TTL. It seems to me that the design you have in mind for the PGW covers a very limited use case and that you don't want to extend it so as not to create possible anti-patterns; that's also why I'm telling you my use case, so you can decide whether mine is an anti-pattern or not.
I use tens of pieces of software and I'm not subscribed to all their mailing lists. I found a discussion about a TTL and I've contributed; if I find some time I'll participate on the mailing list, or if you want you can reference this thread. I've already found my inelegant solution (published in the gist) but wanted to give my two cents.

@beorn7
Member

beorn7 commented Oct 2, 2018

I'll lock this issue now. That's not to stifle the discussion but, on the contrary, to not let it rot in a closed issue in a repo that not every developer is tracking. Whoever is interested in convincing the Prometheus developer community to revert the decision of not implementing a TTL/timeout for pushed metrics, please open a thread on the prometheus-developers mailing list. (TODO for @beorn7 : Once such a thread exists, link it here and in the README.md.)

@fvigotti I understand that you are not keen on subscribing to a mailing list for every piece of software you use. However, the Prometheus developers are not keen, either, to track all the (open and closed) issues of all repos in the Prometheus org (there are 38 of them!). As the Prometheus developers are doing all the work you are benefiting from, I think it is fair to ask that you play by their rules for telling them about your request.

prometheus locked and limited conversation to collaborators Oct 2, 2018