Add ability to opt out of agent auto-update #246

Closed
myoung34 opened this issue Dec 20, 2019 · 55 comments

@myoung34

Describe the enhancement

Running the agent within Docker with auto-update enabled causes the container to throttle and never succeed, unless PID 1 is managed explicitly.

It would be nice to have something in config.sh or run.sh, possibly environment variables, to opt out of auto-update.

Code Snippet

n/a

Additional information

n/a

myoung34 added the enhancement (New feature or request) label on Dec 20, 2019
@bryanmacfarlane
Member

We're also looking at producing an image as part of the runner build, so if you are doing single-use container builds you never hit a required update: you just pick up the container with the runner version the service demands.

Question - are you doing single use containers or does the runner / container lifetime span jobs and builds in your scenario?

@myoung34
Author

@bryanmacfarlane
Member

Instead of the entrypoint running run.sh (interactive), it should probably run runsvc.sh, which handles the update return codes. That, coupled with an image released with every runner version, would probably eliminate the issue.

https://github.com/actions/runner/blob/master/src/Misc/layoutbin/runsvc.sh
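
For illustration, a container entrypoint along those lines might look like the sketch below. The install path and the RUNNER_URL / RUNNER_TOKEN variables are placeholders for however you build and parameterize your image, not something the runner defines:

```sh
#!/bin/bash
# Minimal entrypoint sketch: configure once, then exec the service wrapper
# so self-update exit codes are handled instead of killing the container.
set -e
cd /actions-runner   # assumed install location inside the image

# Register the runner non-interactively on first start.
./config.sh --unattended --url "${RUNNER_URL}" --token "${RUNNER_TOKEN}"

# runsvc.sh (linked above) wraps the runner and relaunches it after an
# update, whereas run.sh simply exits on the update return code.
exec ./runsvc.sh
```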

@bryanmacfarlane
Member

^^ I'm working on some docker containers to be part of the build - I'll look over your files

@myoung34
Author

Just make sure the Dockerfiles work for arm/arm64, please 😃

@bryanmacfarlane
Member

#239

@SvanBoxel

I have a customer requesting this. At the moment the self-hosted runner self-updates, making it potentially unpredictable for production workloads.

@ioquatix

ioquatix commented Apr 1, 2020

Please don't include docker by default. I already run this inside a container.

@ioquatix

ioquatix commented Apr 1, 2020

Here are my suggestions:

  • We don't want self-updating features; it's deployed as a versioned package. If an update is needed, we will release a new package.
  • Please separate the files according to Unix conventions: a read-only area for binaries/assets and a directory for runtime state, e.g. the user's home directory, or allow it to be specified on the command line.

@bryanmacfarlane
Member

bryanmacfarlane commented Apr 1, 2020

It's the service that demands and pulls the runner forward. There are new capabilities and features on a cloud service that's rolling forward which need changes in the runner. Even if an update is demanded only when you use a new feature, it still means a runner can get pulled forward.

So, when it gets pulled forward, if the container is running ./runsvc.sh, it will handle the update code and get pulled forward. If you're using a container built with the latest runner, you will minimize how often updates happen.

Also note that if we build some containers and release them alongside the runner, that doesn't preclude you from building your own containers as you're doing now. It would be a pure convenience for others.

@ioquatix

ioquatix commented Apr 1, 2020

I'd rather it stopped working than auto updated.

@bryanmacfarlane
Member

bryanmacfarlane commented Apr 1, 2020

That's possible as well. Call ./run.sh; it will exit and not handle the update code. However, please be aware that all your workflows will stop running until you build another container or manually configure all your self-hosted runners again, and you will only be able to do that with the version the service demands, which you didn't want to auto-update to. So I'm not sure what stopping buys you.

Note that the runner is a simple step runner. The yaml and the orchestration run service side (that's why things like matrixing sets with outputs work) and delegate steps to this app. We're also not talking about the hosted images, dev tools, etc., which usually affect the work being done. The tools in the container or on the self-hosted runner will stay the same.

@bryanmacfarlane
Member

☝️ Also note that we are painfully aware of how this affects container scenarios and building your own. We're thinking about that so I don't want to discount the problem here. Just trying to explain how it works and what the current state is. Hope that helps.

@ioquatix

ioquatix commented Apr 1, 2020

I understand where you are coming from.

With CI, you are running a lot of untrusted code so maybe my argument doesn't hold much water.

That being said, you can isolate untrusted code using KVM, run it as an unprivileged user, and trash the results.

However, the way the system is currently designed and distributed makes this difficult.

One would normally expect a package to have a clean split between "executables/assets" and "run-time state". For whatever reason, this package doesn't have that. Not even the configuration is separated out.

Ideally, you'd install something like github-actions into /usr/bin and start it by running github-actions --work-dir=/tmp/github-actions or something similar. It would read a per-user and per-system configuration file, or maybe accept a command-line argument pointing to a configuration file if required.
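
Purely as an illustration of that split (all paths and flags here are hypothetical, following the naming above):

```sh
# Package-owned, read-only binaries plus per-instance runtime state:
#   /usr/bin/github-actions            executable installed by the package
#   /etc/github-actions/runner.conf    system-wide configuration
#   /tmp/github-actions/               disposable work directory / runtime state
github-actions --config /etc/github-actions/runner.conf --work-dir /tmp/github-actions
```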

Some of the consequences of this design are:

  • Impossible to multiplex - e.g. have multiple instances of github-actions running on the same system.
  • Huge package size because it includes C# and NodeJS.
  • Hard to integrate with existing package managers that expect a more "unix"-like model.
  • Hard to isolate using traditional unix approach, e.g. drop privileges, chroot, etc.

So, github-actions necessitates containers and other isolation constructs because of its design choices.

Auto-update makes the package unpredictable. Some package systems provide verification functions to check whether there are any errors on the file system or whether something has been manipulated. Some package managers provide a permissions-check system which validates file permissions, and some provide file-ownership queries and other operations. If the package files are changing, these functions stop working. Additionally, it makes it hard to remove the package if the files are not known. For example, removing this package on Arch Linux boils down to rm -rf, which is kind of rude.
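
As a concrete illustration of what breaks (the package names are placeholders for however the runner ends up being packaged):

```sh
# Package verification that self-updating defeats.
pacman -Qkk github-actions   # Arch: verify file sizes, permissions, and checksums
rpm -V github-actions        # RPM-based: report files that differ from the package
# Once the runner has rewritten its own files, both commands flag paths the
# package manager believes it owns but which no longer match what it installed.
```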

So, my advice is:

  • There should be a static part of the package which includes executables, assets, and other files.
  • A nice-to-have would be the ability to build using the system's own C# / Node.js packages - i.e. not needing to distribute them as part of the package.
  • It should be possible to run the github-actions runner as any user, with a per-instance configuration/working directory, without needing containers to achieve this.

@ioquatix

ioquatix commented Apr 1, 2020

Oh, I also wanted to add: auto-update is a massive target for black hats, so I completely agree with @SvanBoxel - hard to audit, hard to reproduce, hard to validate, etc.

@bryanmacfarlane
Member

We'll reflect on ☝️ for a bit. But note that there's a ton of separation right now: the runner knows nothing about yaml, the language, and other service-side / user features. That's all compiled/processed service side into the simplest "job" message, which is one node of the whole run graph (a set of steps). Those use contexts and are passed through fairly opaquely to the runner to minimize changes.

However, there are times when a feature is added which requires runner changes. As an example, js or container actions can have post-job scripts, and we're about to add pre. Another example would be sidecar containers for tests (expressed in yaml but executed by the runner).

So there are actually multiple moving parts: (1) the service, (2) the runner, and (3) the actions that 3rd parties are releasing. :o

So now, when an action packages a pre/post script (start an emulator, shut down, etc.), its author expects it to run, and even though the outer workflow runs service side, actions run VM side and the steps are run by the runner.

In Azure (where this code started) we actually did a really complicated thing. We allowed the runners to stay back and only get updated when the user used a new service-side feature that demanded a new runner, and we also allowed action authors to demand a minimum version of the runner. The service would compute a minimum based on the features used, enumerate the actions, calculate the minimum, and, if the runner was too old, send an update message. It still got updated, just less often and much less predictably. Task authors couldn't figure it out, didn't get it right, and it led to really hard-to-debug situations.

I think the fundamental design decision is that the yaml is not run by the runners. It's run service side with a durable, stateful orchestration which dishes out work to runners. That has a lot of benefits for things like parallel graphs of nodes that dish out simple instructions to a runner. But it does mean that the runner binary is an RPC remote extension of that orchestration, and actions are separate from that.

Completely open to refactoring the runner and to ideas. But I think the root of the problem is the multiple moving parts, with the other two moving forward and getting new features. :)

If you got this far, thanks for reading my wall of text. Hope it provides some background.

@ioquatix

ioquatix commented Apr 1, 2020

Yeah, I can understand what you are saying.

I agree it's not simple and to be honest, I see both sides of that coin.

But as someone who was trying to put together a package for users, this was non-trivial to deal with.

Right now, all the files are installed into /var/lib/github-actions and owned by the user of the same name. It's not ideal.

So somehow we need to find a middle ground that allows someone like myself to package it up in a more traditional way, but also support things like on the fly updates.

Whether you just vendor all the self-updating code into the runtime directory is up to you/the engineers behind the code, but at least it should be possible. In essence, the package should install the core code, the system integration, the user account setup, systemd units, and dependencies which may be provided by the system (Node.js, C#, etc.). Then you might start up an instance, which would self-update its own scripts into the runtime directory; but that is largely state which could be wiped clean without affecting the package being reinstalled, running the update process again in the future, and so on.

@myoung34
Author

myoung34 commented Apr 1, 2020

It doesn't have to stop working in general if you show deprecation warnings saying "a new version is available" for one version:

Running X
Get alert: "version Y is available"
Both X and Y work
Z is released
X stops working
Y starts showing "version Z is available"

This would work for most scenarios. Deprecation is hard, for sure, but everyone would have an upgrade window in this scenario.

@bryanmacfarlane
Member

bryanmacfarlane commented Apr 1, 2020

We follow the model above when we (rarely) deprecate features. When we do it, that process ☝️ spans months.

However, this is not about deprecation. This is about new features getting added to the service (which users adopt when they're released, e.g. a new yaml step construct) and the runner having to add new code (not deprecate anything) for that feature to work. This is also about the runner adding new features to support new capabilities for action authors (and for the others who use those actions and maintain runners).

@SvanBoxel

SvanBoxel commented Apr 1, 2020

An important consideration I would like us to think about is that companies in some industries are simply not compliant if they run software that hasn't been explicitly verified/validated. So having a service that runs on your own infrastructure and automatically updates could mean your business is not compliant.

An explicit upgrade process, or a way to opt out of automatic updates, would solve this compliance conflict. As an example, software used within companies that require FDA compliance needs to go through this process before a service is upgraded: https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/CFRSearch.cfm?CFRPart=820&showFR=1&subpartNode=21:8.0.1.1.12.7

@myoung34
Author

myoung34 commented Apr 1, 2020

☝️ is what I was going to follow up with.
All patches need to go through a change-control process of "some sort" in some regulatory environments.

@bryanmacfarlane
Member

bryanmacfarlane commented Apr 1, 2020

Yep. We've worked with those regulated environments in both Azure and GitHub.

In those scenarios, other things do not move forward. Everything is controlled or air-gapped, including the on-premises server (TFS at MS, the upcoming GHES at GitHub). It's not just the runner that's under scrutiny; the tasks/actions were also scrutinized and controlled.

In those scenarios, the runner (and actions) do not update, since the service is also controlled.
We also allowed for explicit but convenient updates of the runner by placing the runner in a configured location.

Even on-premises and not air-gapped, the service and actions do not continuously update, so the runners can stay back. The minimum runner version is recorded and controlled server / service side.

Which goes back to the key point: if the other parts (service and actions) move forward, then the code that executes those features has to move forward.

@bryanmacfarlane
Member

Note that I think we should still look at investing in options here. The points above are clarifying where we currently are with the current model.

@bryanmacfarlane
Member

bryanmacfarlane commented Apr 17, 2020

Also note that in the short term, if you are running the runner in a container, you should use the ./runsvc.sh entrypoint and not ./run.sh, which is meant for interactive running. Calling ./runsvc.sh will at least handle the update and not exit.

In the medium term, we also want to support .container in the yaml when running in K8s, so at least the container your job runs in (the steps) can stay static and be decoupled from the runner. That also means you don't have to build the runner into your containers. That's how it works with VM self-hosted: the runner on the host execs steps in the container you bring.

Finally, we should consider refactoring, as @ioquatix suggested, into the base runner (listener) and the steps portion, which is derived from the service-side yaml orchestration.

@ioquatix

@bryanmacfarlane one thing I considered several months ago was writing a compatible daemon in Ruby just for fun. Do you have any published specs for the GitHub Actions runner? I started reading through the C# source code but... trying to derive a protocol/specification by reading through C# is not really my thing.

@matthewhembree

I'm a bit concerned about the runner updating to a pre-release version. When I build my container to get around the auto-update autokill, I query https://api.github.com/repos/actions/runner/releases/latest. Currently, this returns a version that needs to update itself.
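
A sketch of that kind of release-pinning query, assuming the standard GitHub releases API and jq; the tarball name follows the published release assets, so adjust the platform suffix for your runners:

```sh
# Resolve the latest runner release tag and fetch the matching Linux x64 tarball.
VERSION=$(curl -fsSL https://api.github.com/repos/actions/runner/releases/latest \
          | jq -r '.tag_name' | sed 's/^v//')
curl -fsSL -o "actions-runner-linux-x64-${VERSION}.tar.gz" \
  "https://github.com/actions/runner/releases/download/v${VERSION}/actions-runner-linux-x64-${VERSION}.tar.gz"
```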

@tetchel
Contributor

tetchel commented Apr 21, 2021

I also did a double-take at the "pre-release" tag on the last 3 versions: v2.277.0, v2.277.1, and v2.278.0. It's not clear to me why those specifically are pre-releases.

As far as I can tell, only the Linux arm/arm64 builds are pre-release, but those were added prior to the listed versions.

@matthewhembree

Promises 👀
Promises 🔍
Promises 🙄

@pietroalbini

Rust's builders also got stuck yesterday when a pre-release was pushed as an update to our runners. We can't use auto-updates for security reasons.

@nehagargSeequent

Is there any update on this issue? I am using Azure Kubernetes to host the runners, and every time a build is triggered the runners first go offline to auto-update, which is super annoying.

@bryanmacfarlane
Member

@pietroalbini I think you may be confusing the runner application (this repo) and the hosted images. Even if you could pin this application, the OS images and all the toolsets slide underneath it, so I don't think this issue will address your issue and others' concerns.

@pietroalbini

@bryanmacfarlane the reason Rust can't use auto-updates is not related to the hosted images; it's a byproduct of running a custom runner fork as a workaround for another runner limitation.

The main problem Rust has with the self-hosted runner is #494 (the runner executing builds from PRs, causing security issues), and to work around it we're using a fork of the runner that rejects any build that's not a push event. In isolation this works fine and avoids the security issue, but running that fork means we can never use the auto-update feature: any auto-update would replace it with the upstream version of the runner, which doesn't have the security protections we added to our fork.

In our fork we also added a change to prevent auto-updates, but since the GitHub backend only sends jobs to a runner if it's up to date, preventing auto-updates that way effectively disables the runner whenever a new runner release is made. We have to watch for new releases and rebase our fork before the update rolls out to our organization, and when we don't get around to it fast enough our queue stops, which is annoying.

@patryk-kozak-branch

@bryanmacfarlane @hross
I feel that for v2.280.1 and v2.280.2 there was at least a couple of days' grace period after release, as runners with an older version could still connect and work; I didn't need to upgrade them immediately.
However, the version released today, v2.280.3, broke runners and CI for the entire company almost immediately (~5 hr after release). Is there some kind of default time / scheduled plan for releases before they get applied and force runners to be upgraded?

Do you have plans to create some kind of communication channel for upgrades? With a grace period to upgrade? Or to support the last X versions for compatibility?
Sorry for all the different questions, but knowing the answers to them is critical to real-world use of GitHub CI and its usefulness 😉

@TingluoHuang
Member

@patryk-kozak-branch Sorry about what you experienced. Do you have any runner diag logs still available for me to check? I would like to understand why the 2.280.3 upgrade broke everything for you. 🙇

@jenschelkopf

This comment has been minimized.

@myoung34
Author

@jenschelkopf is that private? I'm unable to see that issue.

@jenschelkopf

Ah, yes @myoung34. That issue is part of a repository where we track customer feedback internally. I got a little link happy :) I'll hide the comment so other people don't try it as well.

@ChristopherHX
Contributor

ChristopherHX commented Sep 11, 2021

A small but effective workaround from my own runner implementation, written in Go, which obviously doesn't want to be replaced by the official runner:
to opt out of auto-update, the connecting runner just needs to report a version >= the latest release during session creation.

In our fork we also added a change to prevent auto-updates, but since the GitHub backend only sends jobs to a runner if it's up to date, preventing auto-updates that way effectively disables the runner whenever new runner release is made.

As you can compile the runner in a fork or locally, why not change the runner version to 3.0.0 (file https://github.com/actions/runner/blob/main/src/runnerversion)? If you run the dev script / workflow now, your runner won't get updates for some months or even years. (No idea when GitHub runners will reach 3.x.x; then use 4.0.0, and so on.)
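
A rough sketch of that (the dev script invocation follows the repo's contributing docs; exact arguments and output paths may differ):

```sh
# Build a fork whose version string sorts above every official release so the
# service never demands an update. 3.0.0 is arbitrary, as suggested above.
git clone https://github.com/actions/runner.git
cd runner
echo "3.0.0" > src/runnerversion
cd src
./dev.sh layout   # produces a runnable runner layout (e.g. under ../_layout)
```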

This method bears the risk that GitHub requires a new runner feature, like a new message encryption method (FIPS was the last one added), and your old runner version doesn't have that feature and fails to connect, because the Actions service assumes the feature is present.

@edudip-thomasp

edudip-thomasp commented Sep 24, 2021

So, is there anything new on this topic? Auto-updating when running the runner in Kubernetes constantly fails. An option to opt out would be nice.

fgalind1 added a commit to fgalind1/runner that referenced this issue Oct 28, 2021
When using infrastructure as code, containers, and recipes, pinning
versions for reproducibility is a good practice. Usually docker
container images contain static tools and binaries, and the docker image
is tagged with a specific version.

Constructing a docker image of a runner with a specific runner version
that then self-updates doesn't seem natural; instead the docker image
should use whatever binary it was built/tagged with.

Additionally, this concept doesn't play well when using ephemeral
runners and Kubernetes. First of all, we need to pay the price of
downloading/self-updating every single ephemeral pod for every single
job, which causes delays in execution. Secondly, this doesn't work well
and containers may get stuck.

Related issues that will be solved with this:
- actions#1396
- actions#246
- actions#485
- actions#422
- actions#442
@na-jakobs

I would like to see this as well 🥇

@thboop
Collaborator

thboop commented Feb 2, 2022

This has been implemented in runner 2.287.1 and is live on github.com. It will come to GHES/GHAE in subsequent releases! 🎉
Please see this changelog for more details.
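
For anyone landing here later, a minimal sketch of the opt-out at configuration time; the --disableupdate flag name is based on my reading of that release, so confirm it against the linked changelog and ./config.sh --help for your version:

```sh
# Register a self-hosted runner with self-updates disabled (runner 2.287.1+).
# The URL and token below are placeholders. A runner with updates disabled
# must be upgraded manually before the service drops support for its version.
./config.sh --url https://github.com/myorg/myrepo \
            --token <REGISTRATION_TOKEN> \
            --disableupdate
./run.sh
```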
