Runner Observability #1116

ghost · 2021-05-25T11:12:36Z

Prerequisites

You use GitHub Enterprise Server
Naturally, you use Self-Hosted Runners
You embrace DevOps, meaning you give the teams free rein over the runners' activity

Nature of problem
Assume you have (like us) over 100 developers and dozens or hundreds of workflows, all sharing the same self-hosted runner(s).
You have no oversight over who hijacks the runners. Hijacking means hogging any form of resource:

Runtime
Upload volume
Log volume

Describe the enhancement
The cleanest enhancement would be a form of extension hooks. Upon job start, a hook in some form gets called; within this hook you could then define your own actions. Maybe something in the style of swizzling, where the native hook does nothing (or writes a console log) and you can swizzle the component to add your own action.
Upon completion, another hook gets called, with which you can then complete your observability.

Code Snippet

Some pseudo code. Given that the runner is .NET code, it would not look like this; I just come from the TS world.

// Called when a job starts.
function onInit(flowId: UUID, runner: UUID, context: GitHubContext) {
  const infos = composeInfos(flowId, runner, context);
  prometheus.pushgateway.push(infos);
}

// Called when a job completes.
function onComplete(flowId: UUID, runner: UUID, context: GitHubContext, duration: number, uploadedBytes: number, loggedLines: number) {
  const infos = composeInfos(flowId, runner, context, duration, uploadedBytes, loggedLines);
  prometheus.pushgateway.push(infos);
}
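To make the pseudo code above a bit more concrete, here is a minimal sketch of what the two hooks could emit, assuming the hook receives the job's identifiers and totals. The `JobInfo` and `JobResult` types, the function names, and the metric names are all hypothetical; none of them come from the runner. The output follows the Prometheus text exposition format.

```typescript
// Hypothetical shapes for illustration; the runner does not expose these types.
interface JobInfo {
  flowId: string;
  runner: string;
  repository: string;
}

interface JobResult extends JobInfo {
  durationSeconds: number;
  uploadedBytes: number;
  loggedLines: number;
}

// Escape a label value as required by the Prometheus text exposition format.
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render the label set shared by every metric of one job.
function labels(info: JobInfo): string {
  return (
    `flow_id="${escapeLabel(info.flowId)}",` +
    `runner="${escapeLabel(info.runner)}",` +
    `repository="${escapeLabel(info.repository)}"`
  );
}

// Metric text for the "job started" hook (metric name is an assumption).
function onJobStarted(info: JobInfo): string {
  return `runner_job_started{${labels(info)}} 1\n`;
}

// Metric text for the "job completed" hook, covering the three resources
// named in the issue: runtime, upload volume, and log volume.
function onJobCompleted(result: JobResult): string {
  const l = labels(result);
  return [
    `runner_job_duration_seconds{${l}} ${result.durationSeconds}`,
    `runner_job_uploaded_bytes{${l}} ${result.uploadedBytes}`,
    `runner_job_logged_lines{${l}} ${result.loggedLines}`,
  ].join("\n") + "\n";
}
```

The returned text could then be sent in the body of an HTTP POST to a Pushgateway endpoint of the form `/metrics/job/<job_name>`. The network call is left out here so the sketch stays self-contained.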

Additional information
It might be that this concept already exists, but then it's just not documented or not findable.

Also, I once saw a /timing API, but I cannot find it anymore; it seems to have been removed.

Clearly, when enterprises start to adopt Actions, the demand for observability will rise. Are we alone? 🛸


nedrebo · 2021-05-26T01:29:14Z

We are looking for similar features, but we would prefer not to code it ourselves. I think this could be solved nicely by GA-provided dashboards (read-only, accessible by all devs) that provide statistics for runners, workflows, label bottlenecks, load balancing, and so on.

Right now we work around this in two ways (wip):

Run Netdata cloud on all agents and have alerts there for HW/OS level issues.
Insert instrumentation at the workflow, job, and stage levels using our workflow generator. This data is inserted into Elasticsearch, and then we build dashboards and triggers on top of that.

jbergstroem · 2021-05-26T02:33:54Z

Another metric that would make sense is, for instance, queue length. In GitLab land there are excellent ways of getting observability out of the runner via Prometheus exporters. I wish the GitHub runner took a similar approach.
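As a rough illustration of a queue-length metric, the sketch below counts queued runs in a payload shaped like the REST API's "list workflow runs" response (`workflow_runs[].status`). It is an approximation under stated assumptions: it counts runs rather than individual jobs, the metric name `github_actions_queued_runs` is made up, and fetching the payload is left to the caller.

```typescript
// Minimal subset of the "list workflow runs" response used by this sketch.
interface WorkflowRun {
  id: number;
  status: string; // e.g. "queued", "in_progress", "completed"
}

interface WorkflowRunsPage {
  total_count: number;
  workflow_runs: WorkflowRun[];
}

// Count the runs on this page that are still waiting for a runner.
function queueLength(page: WorkflowRunsPage): number {
  return page.workflow_runs.filter((run) => run.status === "queued").length;
}

// Render the count as a Prometheus gauge line (metric name is an assumption).
function queueLengthMetric(page: WorkflowRunsPage): string {
  return `github_actions_queued_runs ${queueLength(page)}\n`;
}
```

A scraper would call `queueLengthMetric` on each poll and expose the result. For large installations the response is paginated, so a real exporter would have to walk all pages.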

toast-gear · 2021-07-23T15:22:54Z

I found https://github.com/Spendesk/github-actions-exporter and thought I'd post it on this issue, as I think people will find it useful. I haven't tested it personally, but it implements Prometheus exporters for data you can get from the API, covering some of the stuff you would want to track and providing some much-needed observability (I wish these statistics were just baked into the github.com UI, though!). One of the big limitations of this approach is that there is no observability at the step level. For example, if builds are taking longer, is that because there is a problem, or because we aren't hitting the cache as often?

thboop · 2022-03-14T15:58:16Z

We recently published an ADR for Job Started / Job Completed hooks for self hosted runners, feel free to provide your feedback.

In particular, we would love to hear what else (if anything) you would need to support your use case, and whether the interface makes sense for you.

thboop · 2022-03-30T15:50:53Z

We've shipped a beta of this functionality in 2.289.1. Please try it out and provide any feedback you have on the ADR!
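For readers who want to try the hooks for the use case in this issue, here is a minimal sketch of the record a hook script might build. It assumes the script is registered through the `ACTIONS_RUNNER_HOOK_JOB_STARTED` and `ACTIONS_RUNNER_HOOK_JOB_COMPLETED` environment variables on the runner and that it sees the usual `GITHUB_*` and `RUNNER_*` job variables; check the ADR and runner documentation for the exact contract. The `JobEvent` shape and `buildJobEvent` are hypothetical names.

```typescript
type Phase = "started" | "completed";
type Env = Record<string, string | undefined>;

// Hypothetical event record; the field names are chosen for illustration.
interface JobEvent {
  phase: Phase;
  timestamp: string;
  runId: string;
  job: string;
  repository: string;
  workflow: string;
  runnerName: string;
}

// Build an event from the environment the hook script is assumed to receive.
// Missing variables fall back to "unknown" instead of failing the job.
function buildJobEvent(phase: Phase, env: Env, now: Date = new Date()): JobEvent {
  const get = (key: string): string => env[key] ?? "unknown";
  return {
    phase,
    timestamp: now.toISOString(),
    runId: get("GITHUB_RUN_ID"),
    job: get("GITHUB_JOB"),
    repository: get("GITHUB_REPOSITORY"),
    workflow: get("GITHUB_WORKFLOW"),
    runnerName: get("RUNNER_NAME"),
  };
}
```

In a Node.js script one would call `buildJobEvent("started", process.env)` and ship the JSON to whatever backend is in use. Pairing the started and completed events by `runId` and `job` gives the job runtime.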

ghost added the enhancement New feature or request label May 25, 2021

TingluoHuang added the Actions Feature Feature requires both runner, pipelines service and launch changes label Jun 2, 2021

toast-gear mentioned this issue Jul 3, 2021

README.md: add example for monitoring the github runner actions/actions-runner-controller#671

Closed

thboop closed this as completed Mar 30, 2022
