Runner Observability #1116

Closed
ghost opened this issue May 25, 2021 · 5 comments
Labels: Actions Feature (Feature requires both runner, pipelines service and launch changes), enhancement (New feature or request)

Comments

ghost commented May 25, 2021

Prerequisites

  • You use GitHub Enterprise Server
  • Naturally, you use Self-Hosted Runners
  • You embrace DevOps, meaning you give the teams free rein over what runs on the runners

Nature of problem
Assume you have (like us) over 100 developers and dozens or hundreds of workflows, all sharing the same self-hosted runner(s).
You have no oversight of who hijacks the runners. Hijacking means hogging any form of resource:

  • Runtime
  • Upload volume
  • Log volume

Describe the enhancement
The cleanest enhancement would be a form of extension hooks. When a job starts, a hook gets called, and within this hook you can define your own actions. Maybe something in the style of swizzling, where the native hook does nothing (or just writes a console log) and you can swizzle the component to add your own action.
When the job completes, another hook gets called, which lets you complete your observability data.

Code Snippet

Some pseudocode. Since the runner is .NET code, the real thing would not look like this; I just come from the TS world.

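// Called when a job starts.
function onInit(flowId: UUID, runner: UUID, context: GitHubContext) {
  const infos = composeInfos({ flowId, runner, context });
  prometheus.pushgateway.push(infos);
}

// Called when a job completes, with the resource usage to report.
function onComplete(flowId: UUID, runner: UUID, context: GitHubContext, duration: number, uploadedBytes: number, loggedLines: number) {
  const infos = composeInfos({ flowId, runner, context, duration, uploadedBytes, loggedLines });
  prometheus.pushgateway.push(infos);
}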
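
To make the push step a bit more concrete, here is a minimal sketch of how the completion values (duration, uploaded bytes, logged lines) could be rendered in the Prometheus text exposition format that a Pushgateway accepts. The `JobStats` shape, the function names, and the metric names are all made up for illustration; nothing here is a runner API.

```typescript
// Hypothetical shape of what a completion hook might report; the field names
// mirror the pseudocode arguments and are not part of any runner API.
interface JobStats {
  durationSeconds: number;
  uploadedBytes: number;
  loggedLines: number;
}

// Escape a label value for the Prometheus text exposition format:
// backslash, double quote, and newline must be escaped.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render one sample line per metric, each carrying the same labels.
// The metric names are invented for this example.
function toExposition(labels: Array<[string, string]>, stats: JobStats): string {
  const labelText = labels
    .map(([name, value]) => `${name}="${escapeLabelValue(value)}"`)
    .join(",");
  const samples: Array<[string, number]> = [
    ["runner_job_duration_seconds", stats.durationSeconds],
    ["runner_job_uploaded_bytes", stats.uploadedBytes],
    ["runner_job_logged_lines", stats.loggedLines],
  ];
  return samples.map(([metric, value]) => `${metric}{${labelText}} ${value}`).join("\n") + "\n";
}
```

The resulting string could then be sent as the body of an HTTP POST to a Pushgateway path such as `/metrics/job/<job_name>`; the HTTP call is left out here to keep the sketch free of dependencies.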

Additional information
It might be that this concept already exists, but then it's just not documented or not findable.

Also, I once saw a /timing API, but I cannot find it anymore; it seems to have been removed.

Clearly, as enterprises start to adopt Actions, the demand for observability will rise. Are we alone? 🛸

@ghost ghost added the enhancement New feature or request label May 25, 2021

nedrebo commented May 26, 2021

We are looking for similar features, but we would prefer not to code them ourselves. I think this could be solved nicely by GA-provided dashboards (read-only, accessible by all devs) that show statistics for runners, workflows, label bottlenecks, load balancing, and so on.

Right now we work around this in two ways (work in progress):

  1. Run Netdata Cloud on all agents and have alerts there for HW/OS-level issues.
  2. Insert instrumentation at the workflow, job, and stage levels using our workflow generator. This data is sent to Elasticsearch, and we then build dashboards and triggers on top of it.
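
As a rough illustration of the second workaround, the sketch below turns a batch of timing events into the newline-delimited body that Elasticsearch's bulk API expects (an action line followed by a document line per event). The `TimingEvent` fields and the function name are assumptions for this example, not the actual schema described in the comment.

```typescript
// Hypothetical event shape; the real instrumentation may record other fields.
interface TimingEvent {
  workflow: string;
  job: string;
  stage: string;
  durationMs: number;
  timestamp: string; // ISO 8601
}

// Build an NDJSON bulk body: one {"index": ...} action line, then the document.
// The bulk API requires the body to end with a newline.
function toBulkBody(indexName: string, events: TimingEvent[]): string {
  const lines: string[] = [];
  for (const event of events) {
    lines.push(JSON.stringify({ index: { _index: indexName } }));
    lines.push(JSON.stringify(event));
  }
  return lines.length === 0 ? "" : lines.join("\n") + "\n";
}
```

The body would be POSTed to the cluster's `_bulk` endpoint with the `application/x-ndjson` content type; the index name is whatever the dashboards are built on.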

@jbergstroem

Another metric that would make sense is, for instance, queue length. In GitLab land there are excellent ways of getting observability out of the runner via Prometheus exporters. I wish the GitHub runner took a similar approach.
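
A queue-length metric of this kind could be approximated from outside the runner. The sketch below assumes you already have a list of workflow runs with a `status` and a `created_at` timestamp (the REST API's workflow run objects carry both); the type and function names are hypothetical.

```typescript
// Minimal subset of a workflow run; only the fields this sketch needs.
interface RunSummary {
  status: string;     // e.g. "queued", "in_progress", "completed"
  created_at: string; // ISO 8601 timestamp
}

interface QueueStats {
  queued: number;            // number of runs still waiting for a runner
  oldestWaitSeconds: number; // how long the oldest queued run has waited
}

// Count queued runs and find the longest current wait, relative to `now`.
function queueStats(runs: RunSummary[], now: Date): QueueStats {
  let queued = 0;
  let oldestWaitSeconds = 0;
  for (const run of runs) {
    if (run.status !== "queued") continue;
    queued += 1;
    const waited = (now.getTime() - Date.parse(run.created_at)) / 1000;
    if (waited > oldestWaitSeconds) oldestWaitSeconds = waited;
  }
  return { queued, oldestWaitSeconds };
}
```

Polling like this only approximates what a built-in exporter would give you, since it measures from run creation rather than from the moment a job was assigned to a runner label.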

@TingluoHuang TingluoHuang added the Actions Feature Feature requires both runner, pipelines service and launch changes label Jun 2, 2021

toast-gear commented Jul 23, 2021

I found https://github.com/Spendesk/github-actions-exporter and thought I'd post it on this issue, as I think people will find it useful. I haven't tested it personally, but it implements Prometheus exporters for data you can get from the API, covering some of the things you would want to track and providing some much-needed observability (I wish these statistics were just baked into the github.com UI, though!). One of the big limitations of this approach is that there is no observability at the step level. For example, if builds are taking longer, is that because there is a problem, or because we aren't hitting the cache as often?
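
Where per-step timestamps can be obtained (as far as I know, the REST API's workflow job objects include a `steps` array with `started_at` and `completed_at`), a step-level view could be sketched as below. This is not something the linked exporter is claimed to do; the type and function names are assumptions for illustration.

```typescript
// Minimal subset of a job step; timestamps may be null for steps that
// were skipped or have not finished.
interface Step {
  name: string;
  started_at: string | null;
  completed_at: string | null;
}

interface StepDuration {
  name: string;
  seconds: number;
}

// Return the `limit` slowest steps, longest first, ignoring untimed steps.
function slowestSteps(steps: Step[], limit: number): StepDuration[] {
  const timed: StepDuration[] = [];
  for (const step of steps) {
    if (step.started_at === null || step.completed_at === null) continue;
    const seconds = (Date.parse(step.completed_at) - Date.parse(step.started_at)) / 1000;
    timed.push({ name: step.name, seconds });
  }
  return timed.sort((a, b) => b.seconds - a.seconds).slice(0, limit);
}
```

Comparing the duration of a cache-restore step across runs is one way to tell a cache miss apart from a genuinely slower build.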

Collaborator

thboop commented Mar 14, 2022

We recently published an ADR for Job Started / Job Completed hooks for self-hosted runners. Feel free to provide your feedback.

In particular, we would love to hear what else (if anything) you would need to support your use case, and whether the interface makes sense for you.

Collaborator

thboop commented Mar 30, 2022

We've shipped a beta of this functionality in 2.289.1. Please try it out and provide any feedback you have on the ADR!
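
For readers wanting to try the hooks for the use case in this issue, here is a minimal sketch of the data-gathering part. It assumes the hook script is registered through the runner's `ACTIONS_RUNNER_HOOK_JOB_STARTED` / `ACTIONS_RUNNER_HOOK_JOB_COMPLETED` environment variables and that the script can see the default `GITHUB_*` and `RUNNER_NAME` variables; the `JobLabels` type and `jobLabelsFromEnv` function are hypothetical names.

```typescript
// Environment passed in explicitly so the function stays pure and testable.
type Env = { [name: string]: string | undefined };

// Hypothetical set of labels to attach to whatever metrics the hook reports.
interface JobLabels {
  repository: string;
  workflow: string;
  job: string;
  runId: string;
  runner: string;
}

// Read identifying information from the default environment variables,
// falling back to "unknown" when a variable is missing or empty.
function jobLabelsFromEnv(env: Env): JobLabels {
  const read = (name: string): string => env[name] || "unknown";
  return {
    repository: read("GITHUB_REPOSITORY"),
    workflow: read("GITHUB_WORKFLOW"),
    job: read("GITHUB_JOB"),
    runId: read("GITHUB_RUN_ID"),
    runner: read("RUNNER_NAME"),
  };
}
```

Under Node this would be called as `jobLabelsFromEnv(process.env)`, presumably from a small wrapper script that the hook variable points at.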

@thboop thboop closed this as completed Mar 30, 2022