Stricter filtering of check run completion events #2520

Merged: 2 commits, Apr 27, 2023

Nuru (Contributor) commented Apr 19, 2023

I observed that none of the canceled jobs in my runner pool were triggering scale-down events. This PR fixes that.

The problem was caused by #2119.

#2119 ignores certain webhook events in order to fix #2118. However, #2119 overdoes it and filters out valid job cancellation events. This PR uses stricter filtering and adds visibility for future troubleshooting.

Example cancellation event

This is the redacted top portion of a valid cancellation event my runner pool received and ignored.

{
  "action": "completed",
  "workflow_job": {
    "id": 12848997134,
    "run_id": 4738060033,
    "workflow_name": "slack-notifier",
    "head_branch": "auto-update/slack-notifier-0.5.1",
    "run_url": "https://api.github.com/repos/nuru/<redacted>/actions/runs/4738060033",
    "run_attempt": 1,
    "node_id": "CR_kwDOB8Xtbc8AAAAC_dwjDg",
    "head_sha": "55bada8f3d0d3e12a510a1bf34d0c3e169b65f89",
    "url": "https://api.github.com/repos/nuru/<redacted>/actions/jobs/12848997134",
    "html_url": "https://github.com/nuru/<redacted>/actions/runs/4738060033/jobs/8411515430",
    "status": "completed",
    "conclusion": "cancelled",
    "created_at": "2023-04-19T00:03:12Z",
    "started_at": "2023-04-19T00:03:42Z",
    "completed_at": "2023-04-19T00:03:42Z",
    "name": "build (arm64)",
    "steps": [

    ],
    "check_run_url": "https://api.github.com/repos/nuru/<redacted>/check-runs/12848997134",
    "labels": [
      "self-hosted",
      "arm64"
    ],
    "runner_id": 0,
    "runner_name": "",
    "runner_group_id": 0,
    "runner_group_name": ""
  },
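
For readers who want to see why an event like the one above gets dropped, here is a minimal Go sketch. It is not the code from this PR or from #2119. It assumes the earlier filter ignored any completed event with no runner assigned (`runner_id` of 0), which the discussion suggests but this thread does not spell out. The names `workflowJobEvent`, `handledByLooseFilter`, `handledByStrictFilter`, and `samplePayload` are hypothetical, and the stricter rule shown is only one plausible way to let queued-then-cancelled jobs through.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// workflowJobEvent mirrors only the payload fields this sketch needs.
type workflowJobEvent struct {
	Action      string `json:"action"`
	WorkflowJob struct {
		Status     string   `json:"status"`
		Conclusion string   `json:"conclusion"`
		Labels     []string `json:"labels"`
		RunnerID   int64    `json:"runner_id"`
		RunnerName string   `json:"runner_name"`
	} `json:"workflow_job"`
}

// samplePayload is a trimmed copy of the cancellation event shown above.
const samplePayload = `{
  "action": "completed",
  "workflow_job": {
    "status": "completed",
    "conclusion": "cancelled",
    "labels": ["self-hosted", "arm64"],
    "runner_id": 0,
    "runner_name": ""
  }
}`

// parseEvent decodes a webhook body into the trimmed event type.
func parseEvent(payload []byte) (workflowJobEvent, error) {
	var e workflowJobEvent
	err := json.Unmarshal(payload, &e)
	return e, err
}

// handledByLooseFilter is an ASSUMED reconstruction of the overly broad rule:
// a completed event counts toward scale-down only if a runner was assigned.
func handledByLooseFilter(e workflowJobEvent) bool {
	return e.Action == "completed" && e.WorkflowJob.RunnerID != 0
}

// handledByStrictFilter is a HYPOTHETICAL stricter rule: events without a
// runner are still handled when the job was cancelled, because the earlier
// "queued" event may already have triggered a scale-up.
func handledByStrictFilter(e workflowJobEvent) bool {
	if e.Action != "completed" {
		return false
	}
	if e.WorkflowJob.RunnerID != 0 {
		return true
	}
	return e.WorkflowJob.Conclusion == "cancelled"
}

func main() {
	e, err := parseEvent([]byte(samplePayload))
	if err != nil {
		panic(err)
	}
	// Log both decisions so a dropped event is visible when troubleshooting.
	fmt.Printf("loose=%v strict=%v\n", handledByLooseFilter(e), handledByStrictFilter(e))
	// prints "loose=false strict=true"
}
```

Under these assumptions the loose rule discards the event and the pool never scales back down, while the stricter rule handles it. Printing the decision is a stand-in for the extra visibility mentioned in the description.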

@Nuru mentioned this pull request Apr 19, 2023
@Link- added the labels community (Community contribution) and needs triage (Requires review from the maintainers) Apr 21, 2023
mumoshu (Collaborator) commented Apr 24, 2023

Hey @Nuru! This looks good, although I'm unable to reproduce your issue.
How did you create your runners and how did you cancel the job?
For me, every workflow job event with status=completed and conclusion=cancelled has a non-zero runner_id and a non-empty runner_name. Those are for manually cancelled jobs. If you cancel jobs in other ways, perhaps the event lacks runner_id and so on, as you've seen?

UPDATE: I was able to obtain workflow job events with runner_id=0 when I manually cancelled workflow jobs immediately after they got queued. Is that how you canceled your jobs?

Nuru (Contributor, Author) commented Apr 24, 2023

@mumoshu asked:

How did you create your runners and how did you cancel the job?

Runners were part of a Runner deployment, created by HorizontalRunnerAutoscaler in response to a workflow_job: queued event.

Jobs were matrix jobs canceled automatically by "fail-fast" when another job in the matrix failed.

Most likely the failed jobs failed before the canceled jobs were started, i.e. the jobs were canceled while waiting for runners to become available (it can take 90 seconds for our runner pool to scale up). This is because some of the matrix jobs run in pools where we have idle runners waiting, so they start immediately, while others run on pools where we do not.

@Nuru Nuru requested a review from mumoshu April 26, 2023 06:40
Nuru and others added 2 commits April 27, 2023 13:06
mumoshu (Collaborator) left a comment

LGTM. Thanks a lot for your contribution @Nuru!
