Delete jobs that have failed for at least the last 60 days in a row #2528

Closed
fejta opened this issue Apr 19, 2017 · 13 comments
Assignees
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@fejta
Contributor

fejta commented Apr 19, 2017

http://velodrome.k8s.io/dashboard/db/bigquery-metrics

Delete any job which:

  • Ran this week.
  • Failed every run for the last 60 days.

Example jobs:

 "ci-kubernetes-e2e-gci-gce-examples": {
    "failing_days": 172
  },
  "ci-kubernetes-e2e-gce-examples": {
    "failing_days": 172
  },
  "ci-kubernetes-e2e-gce-latest-upgrade-cluster": {
    "failing_days": 165
  },
  "ci-kubernetes-e2e-gci-gke-pre-release": {
    "failing_days": 163
  },
  "ci-kubernetes-e2e-gke-pre-release": {
    "failing_days": 159
  },
  "ci-kubernetes-e2e-kops-aws-slow": {
    "failing_days": 146
  },
  "ci-kubernetes-e2e-kops-aws-serial": {
    "failing_days": 146
  },
  "ci-kubernetes-e2e-gke-stackdriver": {
    "failing_days": 94
  },
  "ci-kubernetes-e2e-ubuntu-gke-serial": {
    "failing_days": 80
  },
  "ci-kubernetes-e2e-ubuntu-gke-1-6-serial": {
    "failing_days": 80
  },
  "ci-kubernetes-e2e-ubuntu-gke-1-6-flaky": {
    "failing_days": 78
  },
  "ci-kubernetes-node-docker-benchmark": {
    "failing_days": 74
  },
  "ci-kubernetes-node-docker": {
    "failing_days": 74
  },
  "ci-kubernetes-node-kubelet-flaky": {
    "failing_days": 68
  },
  "ci-kubernetes-pull-gce-federation-deploy-canary": {
    "failing_days": 66
  },
  "ci-kubernetes-e2e-gce-gci-qa-serial-master": {
    "failing_days": 62
  },
  "pr:pull-kubernetes-e2e-kubeadm-gce": {
    "failing_days": 62
  },
  "ci-kubernetes-soak-gke-gci-test": {
    "failing_days": 61
  },
  "ci-kubernetes-e2e-gce-etcd3-release-1-5": {
    "failing_days": 60
  },
  "ci-kubernetes-soak-gke-test": {
    "failing_days": 60
  },
  "ci-kubernetes-e2e-kops-aws-canary": {
    "failing_days": 60
  },

Previous cleanup work: #2453

Current status: http://storage.googleapis.com/k8s-metrics/failures-latest.json
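
For illustration, the 60-day rule above could be sketched roughly as follows. This is a minimal sketch, not the actual test-infra tooling: it assumes the failure data has the shape shown in the example jobs (job name mapped to an object with a `failing_days` field), and the function name `deletion_candidates` and the job `hypothetical-healthy-job` are made up for the example. The "ran this week" condition is not checked, because the example data carries no run timestamps.

```python
import json

# Threshold taken from the issue title: failing for at least 60 days in a row.
FAILING_DAYS_THRESHOLD = 60


def deletion_candidates(failures, threshold=FAILING_DAYS_THRESHOLD):
    """Return (job, failing_days) pairs at or above the threshold, worst first.

    Assumption: `failures` looks like {job_name: {"failing_days": N}, ...},
    matching the example jobs listed in the issue.
    """
    candidates = [
        (job, info["failing_days"])
        for job, info in failures.items()
        if info.get("failing_days", 0) >= threshold
    ]
    # Sort by failing days (descending), then by job name for stable output.
    return sorted(candidates, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Two entries copied from the issue plus one hypothetical healthy job.
    sample = json.loads("""
    {
      "ci-kubernetes-e2e-gci-gce-examples": {"failing_days": 172},
      "ci-kubernetes-e2e-kops-aws-canary": {"failing_days": 60},
      "hypothetical-healthy-job": {"failing_days": 3}
    }
    """)
    for job, days in deletion_candidates(sample):
        print(f"{job}: failing for {days} days")
```

The comparison is `>=` so that a job at exactly 60 days qualifies, which matches "at least the last 60 days" in the title.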

@fejta fejta self-assigned this Apr 19, 2017
@rmmh
Contributor

rmmh commented Apr 19, 2017

Some tests are ONLY run on flaky suites. Are we going to just stop running them?

I guess that's a general problem. We should graph the test matrix: show all the tests we have defined and which jobs they have run on in the last week. That way we can find tests that never run and flag them for revival or deletion.
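
One possible shape for that test matrix, as a rough sketch only: it assumes we can obtain a list of defined tests and a list of (job, test) pairs for runs in the last week. How to collect either input is left open, and all names here (`build_test_matrix`, `never_run`, `only_on`) are hypothetical.

```python
def build_test_matrix(defined_tests, runs):
    """Map every defined test to the set of jobs that ran it.

    Assumption: `runs` is an iterable of (job_name, test_name) pairs that
    has already been limited to the last week.
    """
    matrix = {test: set() for test in defined_tests}
    for job, test in runs:
        # Tests seen in runs but missing from defined_tests are still recorded.
        matrix.setdefault(test, set()).add(job)
    return matrix


def never_run(matrix):
    """Tests that no job ran: candidates for revival or deletion."""
    return sorted(test for test, jobs in matrix.items() if not jobs)


def only_on(matrix, jobs_to_delete):
    """Tests that would stop running entirely if these jobs were deleted."""
    doomed = set(jobs_to_delete)
    return sorted(
        test for test, jobs in matrix.items() if jobs and jobs <= doomed
    )


if __name__ == "__main__":
    # Hypothetical test and job names, for illustration only.
    defined = ["TestA", "TestB", "TestC"]
    runs = [("job-stable", "TestA"), ("job-flaky", "TestA"), ("job-flaky", "TestB")]
    matrix = build_test_matrix(defined, runs)
    print("never run:", never_run(matrix))
    print("lost if job-flaky is deleted:", only_on(matrix, ["job-flaky"]))
```

The `only_on` helper is aimed at the concern about tests that run only on flaky suites: it lists the tests that would lose all coverage before any job is actually removed.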

@fejta
Contributor Author

fejta commented Apr 19, 2017

There are a bunch of flaky suites.

@fejta
Contributor Author

fejta commented Apr 27, 2017

I want to do the following:

  • Run the minimum set of testing necessary to give us confidence in our releases.
  • Run the maximum set of testing we have the ability to maintain.

Right now we seem to be running more tests than we have the ability to maintain. Therefore I am deleting the tests that seem to provide the least marginal value (judged by the fact that they never pass).

@fejta
Contributor Author

fejta commented May 1, 2017

Will delete these tests in a couple weeks unless someone signs up to fix them: https://github.com/kubernetes/test-infra/blob/master/experiment/bigquery/failures-latest.json

@pipejakob
Contributor

Sign me up for pr:pull-kubernetes-e2e-kubeadm-gce. I have a WIP PR (#2509) to fix it.

@fejta
Contributor Author

fejta commented May 1, 2017

Cool! And that one is a month away from the 60d mark anyway :)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with a /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 2, 2018
@BenTheElder
Member

Heh, @fejta I think maybe we don't want the stale job issue to go stale :-)

@BenTheElder
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 2, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 28, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale
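
The timeline in these bot messages (stale after 90d, rotten after a further 30d, closed after a further 30d) can be modeled roughly as below. This is a simplified sketch and not how the bot is implemented: it assumes the state depends only on total days since the last activity, whereas the real bot works through labels and commands such as /remove-lifecycle. The names `lifecycle_state` and the three constants are hypothetical.

```python
from datetime import date, timedelta

# Thresholds taken from the bot messages in this thread.
STALE_AFTER = timedelta(days=90)
ROTTEN_AFTER = STALE_AFTER + timedelta(days=30)
CLOSE_AFTER = ROTTEN_AFTER + timedelta(days=30)


def lifecycle_state(last_activity, today):
    """Classify an issue by how long it has been inactive.

    Assumption: inactivity is measured as one continuous span; label
    changes and bot comments are not treated as activity.
    """
    idle = today - last_activity
    if idle >= CLOSE_AFTER:
        return "closed"
    if idle >= ROTTEN_AFTER:
        return "rotten"
    if idle >= STALE_AFTER:
        return "stale"
    return "fresh"


if __name__ == "__main__":
    start = date(2018, 1, 1)  # arbitrary example date
    for days in (10, 90, 120, 150):
        print(days, lifecycle_state(start, start + timedelta(days=days)))
```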

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 28, 2018
@BenTheElder
Member

/remove-lifecycle stale

/cc @mithrav @spiffxp @AishSundar
We should codify something like this and enact it to clean up jobs that have been failing for ridiculously long.

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

6 participants