
monitor boskos cleanup timing #13

Open · ixdy opened this issue May 29, 2020 · 18 comments
Labels
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • sig/testing: Categorizes an issue or PR as relevant to SIG Testing.

Comments

ixdy (Contributor) commented May 29, 2020

Originally filed as kubernetes/test-infra#14715 by @BenTheElder

What would you like to be added: export and graph metrics for boskos cleanup timing

Why is this needed: so we can determine whether cleanup time is increasing and whether we need to scale up the janitor or fix boskos (xref #14697)

Possibly this should also move to the new monitoring stack? cc @cjwagner @detiber

/area boskos
/assign @krzyzacy
cc @fejta @mm4tt
/kind feature

k8s-ci-robot added the kind/feature label on May 29, 2020
k8s-ci-robot (Contributor):

@ixdy: The label(s) area/boskos cannot be applied, because the repository doesn't have them

In response to this: (the issue description quoted above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

krzyzacy (Contributor):

/unassign
/help-wanted

detiber (Member) commented Jun 19, 2020

/help

k8s-ci-robot added the help wanted label on Jun 19, 2020
fejta-bot:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Sep 17, 2020
fejta-bot:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Oct 17, 2020
fejta-bot:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot (Contributor):

@fejta-bot: Closing this issue.

In response to this: (the close message quoted above)


ixdy (Contributor, Author) commented Nov 16, 2020

/reopen
/remove-lifecycle stale

k8s-ci-robot (Contributor):

@ixdy: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle stale


k8s-ci-robot reopened this on Nov 16, 2020
detiber (Member) commented Nov 17, 2020

/lifecycle frozen

k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/rotten label on Nov 17, 2020
cpanato (Member) commented Jan 19, 2021

Is this about adding a Prometheus metric for this operation? @detiber

detiber (Member) commented Jan 19, 2021

@cpanato I believe that to be the case, yes. That said, I haven't dug into how the existing metrics are exposed for boskos. The dashboards sit at monitoring.prow.k8s.io, though.

cpanato (Member) commented Jan 20, 2021

Hello @ixdy, should the metric in question be added in this part, https://github.com/kubernetes-sigs/boskos/tree/master/cmd/cleaner, or in another part of the code?
Maybe the first question is: is this still needed?

ixdy (Contributor, Author) commented Feb 16, 2021

Sorry for the delay in response. To clarify, this would be metrics added to the janitor(s), not the (unfortunately named) cleaner component.

The basic gist is just adding some Prometheus metrics to the janitors, yes, but the primary challenge is that in some deployments (such as k8s.io prow) Boskos + the janitors run in a completely separate build cluster from the prow monitoring stack, which makes collecting these metrics more challenging, since they aren't directly accessible.

In the case of k8s.io prow, to collect metrics from the core boskos service, we expose the boskos metrics port on an external IP and then explicitly collect from that address. Since the janitors run as a separate container, we'd need to either expose additional IPs for each janitor (non-ideal) or set up some sort of collector for all of the boskos metrics (core and janitor) and then expose that to the prow monitoring stack. Alternately, we could collect/push these metrics to the monitoring stack. [Note: I'm probably using the wrong Prometheus terminology here.]

Figuring all of this out is the harder aspect of this issue. If this sounds interesting to you, please take it on!
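
For the instrumentation half, a minimal sketch of what janitor-side metrics could look like with Prometheus's client_golang; the metric name, port, and cleanOne placeholder here are illustrative assumptions, not Boskos's actual code:

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// cleanupDuration tracks how long one cleanup pass takes.
// The name is a placeholder, not an existing Boskos metric.
var cleanupDuration = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "janitor_cleanup_duration_seconds",
	Help:    "Time taken to clean up one dirty resource.",
	Buckets: prometheus.ExponentialBuckets(1, 2, 12), // 1s up to ~34min
})

func cleanOne() {
	// Placeholder for the real work: acquire a dirty resource from
	// Boskos, clean it up, and release it back as free.
	time.Sleep(time.Second)
}

func main() {
	// Expose /metrics so a Prometheus with access to the build
	// cluster can scrape this long-running janitor.
	http.Handle("/metrics", promhttp.Handler())
	go http.ListenAndServe(":9090", nil)

	for {
		start := time.Now()
		cleanOne()
		cleanupDuration.Observe(time.Since(start).Seconds())
	}
}
```

Whatever Prometheus ends up with network access to the build cluster (an external IP, a federated instance, or a push path) could then scrape :9090/metrics.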

cpanato (Member) commented Feb 23, 2021

@ixdy thanks and my turn to say sorry for the delay 😄

There are two different things we need to do: the first is to add the metric in the janitor, and the second is the infrastructure part.

For the second I have a couple of questions:

  • Is the janitor a cron process, or is it always up and running?
    If it is a cron, we will need to use the Prometheus Pushgateway to send the metrics there, and then the monitoring cluster can scrape from there (see the sketch after this list).

  • Is the cluster that runs Boskos and the janitor the same?
    If so, to avoid having multiple LBs to expose, we can deploy Prometheus in this cluster to collect the metrics and expose that to be scraped by the main monitoring system, so we just have one LB entry point (Prometheus federation).
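
For the cron case, a minimal sketch of the Pushgateway idea; the Pushgateway address, job name, and metric name below are assumptions for illustration:

```go
package main

import (
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// Hypothetical metric for a one-shot janitor run.
	duration := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "janitor_last_run_duration_seconds",
		Help: "Duration of the last janitor run.",
	})

	start := time.Now()
	// ... the actual cleanup work would run here ...
	duration.Set(time.Since(start).Seconds())

	// Push once at the end of the run; the monitoring Prometheus then
	// scrapes the Pushgateway rather than the short-lived pod.
	if err := push.New("http://pushgateway.monitoring:9091", "boskos_janitor").
		Collector(duration).
		Grouping("janitor", "gcp").
		Push(); err != nil {
		log.Printf("pushing metrics failed: %v", err)
	}
}
```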


I will work on the first part to add the metrics while we discuss the second, if that sounds good to you.

thanks!

cpanato (Member) commented Feb 23, 2021

/assign
/remove-help

k8s-ci-robot removed the help wanted label on Feb 23, 2021
ixdy (Contributor, Author) commented Feb 24, 2021

  1. Is the janitor a cron process, or is it always up and running?

It depends. There are 3 (or 4) different janitor endpoints right now:

  • a. cmd/aws-janitor: a one-shot command that cleans up an AWS account, optionally specifying a region.
  • b. cmd/aws-janitor-boskos: a long-lived process which queries Boskos (using its API) for AWS regions in the dirty state, cleans up each region using the same library as (a), and then returns the region to the free state in Boskos.
  • c. cmd/janitor/gcp_janitor.py: a one-shot Python script which cleans up the provided GCP project(s). It should probably be rewritten in Go eventually.
  • d. cmd/janitor: a resource-agnostic janitor that queries Boskos (using its API) for resources of a specified type in the dirty state, passes them to a specified janitor command to clean up, and returns them to Boskos in the free state (assuming the janitor command exited successfully). It defaults to calling the gcp_janitor.py script, but can potentially call any other one-shot janitor (e.g. the AWS janitor from (a)).

The one-shot janitors could be run as CronJobs, with or without Boskos (e.g. to manage AWS environments, GCP projects, etc. that are not managed by Boskos). The Boskos-specific janitors tend to run as long-running pods.

(So one follow-up question you might have: which janitor? The ones most relevant to this issue are probably cmd/aws-janitor-boskos and cmd/janitor, though hopefully you can generalize things enough to reduce the amount of duplicated code; one way to share that is sketched below.)
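
A possible shape for that shared code, as a sketch only; the package, metric, and label names here are hypothetical, not existing Boskos code:

```go
package janitormetrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// cleanupDuration is defined once so cmd/aws-janitor-boskos and
// cmd/janitor could share it instead of duplicating metric code.
var cleanupDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "boskos_janitor_cleanup_duration_seconds", // hypothetical name
	Help:    "Time taken to clean one dirty resource.",
	Buckets: prometheus.ExponentialBuckets(1, 2, 12),
}, []string{"resource_type", "status"})

// Timed wraps any janitor's per-resource cleanup function and records
// its duration, labeled by resource type and outcome.
func Timed(resourceType string, clean func() error) error {
	start := time.Now()
	err := clean()
	status := "success"
	if err != nil {
		status = "failure"
	}
	cleanupDuration.WithLabelValues(resourceType, status).Observe(time.Since(start).Seconds())
	return err
}
```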

  2. Is the cluster that runs Boskos and the janitor the same?

In general, yes, the janitors run in the same cluster as Boskos. This is because the necessary credentials/service accounts needed to interact with AWS accounts/GCP projects likely already exist in those clusters, as they are used by the test jobs.

cpanato (Member) commented Feb 24, 2021

thanks for the clarification @ixdy

k8s-ci-robot added a commit that referenced this issue Mar 3, 2021
aws-janitor-boskos: add clean time and process time metrics
k8s-ci-robot added a commit that referenced this issue Mar 12, 2021
aws-janitor: add job duration metric
spiffxp added the sig/testing label on Aug 17, 2021