
Add support for ephemeral agent (single-use). #2176

Draft · wants to merge 6 commits into main
Conversation

@hpidcock (Contributor) commented Aug 8, 2023

This adds support for ephemeral/single-use agents. Agents given the --ephemeral flag will run at most one pipeline. After running a single pipeline, the agent will taint itself (which for now just disables the agent by marking it as no_schedule).
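As a rough sketch, such a flag could be declared like this, assuming the agent's flag set uses urfave/cli (the usage text and env var name here are guesses, not necessarily the PR's actual wording):

```go
import "github.com/urfave/cli/v2"

// ephemeralFlag is a sketch; apart from the flag name, the details here
// are hypothetical.
var ephemeralFlag = &cli.BoolFlag{
	EnvVars: []string{"WOODPECKER_AGENT_EPHEMERAL"},
	Name:    "ephemeral",
	Usage:   "run at most one pipeline, then taint this agent (no_schedule)",
}
```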

The purpose of this feature is to allow potentially privileged pipelines to run on an agent, isolated from other pipelines, and then, after the pipeline has run, to effectively throw the agent away. These agents are intended to be provisioned by a separate system, external to Woodpecker. Such a system would dynamically provision agents (based on the number of waiting pipelines and their labels), run those agents with --ephemeral, wait for them to be tainted, then destroy them.
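To make the intended external system concrete, here is a hedged sketch of its reconcile step in Go. AgentAPI and Cloud are hypothetical interfaces standing in for the Woodpecker server API and a cloud provider; only the no_schedule semantics come from this PR:

```go
package provisioner

import "context"

// Agent mirrors the one field of the agent model that matters here:
// NoSchedule is what an --ephemeral agent sets when it taints itself.
type Agent struct {
	ID         int64
	NoSchedule bool
}

// AgentAPI and Cloud are hypothetical stand-ins for the Woodpecker server
// API and the cloud provider hosting the agents.
type AgentAPI interface {
	ListAgents(ctx context.Context) ([]Agent, error)
	DeleteAgent(ctx context.Context, id int64) error
}

type Cloud interface {
	DestroyInstance(ctx context.Context, agentID int64) error
}

// Reconcile tears down every agent that has tainted itself. Scaling up
// (creating agents from waiting pipelines and their labels) would be the
// other half of the loop and is omitted here.
func Reconcile(ctx context.Context, api AgentAPI, cloud Cloud) error {
	agents, err := api.ListAgents(ctx)
	if err != nil {
		return err
	}
	for _, a := range agents {
		if !a.NoSchedule {
			continue // still waiting for, or running, its single pipeline
		}
		if err := cloud.DestroyInstance(ctx, a.ID); err != nil {
			return err
		}
		if err := api.DeleteAgent(ctx, a.ID); err != nil {
			return err
		}
	}
	return nil
}
```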

This also includes a fix for no_schedule causing disabled agents to hammer the Next RPC method, which could be a problem in a large cluster with many disabled agents (effectively a self-inflicted DDoS on the Woodpecker server).

@6543 added the agent and feature (add new functionality) labels Aug 8, 2023
@6543 added this to the 1.1.0 milestone Aug 8, 2023
Moving the agent tainting to inside the runner so that the agent is tainted right after it has been assigned a workflow. This ensures the agent is tainted just before the system is affected by the workflow.

When an agent is disabled, the RPC client will not receive an error, and the call will be retried via the outer loop. This allows the client retry logic to step in and retry the RPC call after a delay.
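A minimal sketch of the retry behaviour those two commits describe (Work, WorkClient, and the delay are stand-ins, not the actual agent types): a disabled agent now gets an empty Next response instead of an error, so the outer loop backs off and polls again rather than hammering the server.

```go
package agent

import (
	"context"
	"time"
)

// Work and WorkClient are hypothetical stand-ins for the real agent types.
type Work struct{ ID string }

type WorkClient interface {
	Next(ctx context.Context) (*Work, error)
}

// poll is the outer loop: a nil work item with a nil error means "nothing
// for you right now" (e.g. this agent is marked no_schedule), so the agent
// waits before asking again instead of spinning.
func poll(ctx context.Context, client WorkClient, run func(context.Context, *Work) error) error {
	const retryDelay = 10 * time.Second // hypothetical backoff interval
	for {
		work, err := client.Next(ctx)
		if err != nil {
			return err // transport errors bubble up to the caller's retry logic
		}
		if work == nil {
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(retryDelay):
			}
			continue
		}
		if err := run(ctx, work); err != nil {
			return err
		}
	}
}
```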
@codecov-commenter commented Aug 12, 2023

Codecov Report

❗ No coverage uploaded for pull request base (main@cbb1c46).
Patch has no changes to coverable lines.


Additional details and impacted files
```
@@           Coverage Diff           @@
##             main    #2176   +/-   ##
=======================================
  Coverage        ?   40.72%
=======================================
  Files           ?      182
  Lines           ?    10899
  Branches        ?        0
=======================================
  Hits            ?     4439
  Misses          ?     6121
  Partials        ?      339
```

☔ View full report in Codecov by Sentry.

Conflicts:
- cmd/agent/agent.go
@hpidcock marked this pull request as ready for review August 12, 2023 04:09
@woodpecker-bot (Contributor) commented Aug 12, 2023

Deployment of preview was successful: https://woodpecker-ci-woodpecker-pr-2176.surge.sh

Review threads (resolved): agent/runner.go (outdated), cmd/agent/agent.go, cmd/agent/agent.go (outdated), server/grpc/server.go (outdated)
@hpidcock requested a review from qwerty287 August 12, 2023 11:49
Comment on agent/runner.go, lines +72 to +77:

```go
// if ephemeral, taint the agent before running any workload.
if r.ephemeral {
	err = r.client.TaintAgent(runnerCtx)
	if err != nil {
		return fmt.Errorf("tainting agent: %w", err)
	}
}
```
Contributor commented:

Isn't it possible to move this to cmd/agent/agent.go (l. 232) too? Or does this have some side effect?
Mainly so that the Runner struct would not need the ephemeral field anymore.

Contributor Author replied:

It is possible to move it out, but the ordering is important here.

The agent needs to be tainted after receiving the workflow, just before it runs it.

  • If it is tainted before, then it will never receive one.
  • If it is tainted after a workflow, and that workflow causes the agent to restart in some way, it will never be tainted.

I'm open to alternative ways we can achieve this. It could be handled server side instead, but I was trying as much as possible to limit the scope of the changes.
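For illustration, a schematic of the ordering argued for here; only TaintAgent, the ephemeral field, and the error wrapping come from the diff above, while the rest (Next, execute, the surrounding Runner method) is invented for the sketch:

```go
// Schematic flow of the runner for an ephemeral agent.
func (r *Runner) runOnce(runnerCtx context.Context) error {
	// 1. Block until the server assigns a workflow to this agent.
	work, err := r.client.Next(runnerCtx)
	if err != nil || work == nil {
		return err
	}

	// 2. Taint immediately: tainting earlier means the agent never receives
	//    a workflow; tainting later lets a malicious workflow prevent it.
	if r.ephemeral {
		if err := r.client.TaintAgent(runnerCtx); err != nil {
			return fmt.Errorf("tainting agent: %w", err)
		}
	}

	// 3. Only now run the (possibly privileged) workload.
	return r.execute(runnerCtx, work)
}
```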

Contributor replied:

> If it is tainted after a workflow, and that workflow causes the agent to restart in some way, it will never be tainted.

How can a workflow restart the agent? Since the number of parallel workflows is 1, it shouldn't take new ones, so I don't see an issue with tainting it after the workflow. It's possible that I overlooked something though.

Contributor Author replied:

This is only possible with a malicious workflow. If a workflow uses privileged steps or runs on a local-backend agent, it could potentially take over the agent before it is "tainted", and then pull whatever workflows it wants until it obtains some information or performs some other nefarious action.

My goal with this work is to enable me to run privileged steps on my agents by making them ephemeral. This is the first step in doing that.

@anbraten (Member) commented:

Did you have a look at the Woodpecker autoscaler? It's an external tool which also watches agents and disables them based on a specific condition, so I guess a similar tool could solve your use case without needing to adjust the core.

@hpidcock (Contributor Author) commented:

> Did you have a look at the Woodpecker autoscaler? It's an external tool which also watches agents and disables them based on a specific condition, so I guess a similar tool could solve your use case without needing to adjust the core.

I have looked at this tool; it is what I want in terms of the creation/destruction of the agents (I'll need to add AWS+OpenStack support for my needs). But this PR's use case is specifically that I cannot trust an agent after it has started a workflow, and I need it to be able to limit itself from running any more workflows.

@pat-s modified the milestones: 2.0.0, 2.x.x Oct 13, 2023
@zc-devs mentioned this pull request Jan 2, 2024
@anbraten removed this from the 3.x.x milestone Jan 30, 2024
@wez (Contributor) commented Apr 19, 2024

Just came here to say: I'm looking for this single-use agent feature, and here's the context for my use case:

Specifically what I'm looking for is:

  1. Run a single workflow, privileged
  2. Reset the machine containing the agent to a known-good snapshot
  3. Repeat

I intend to run this in a local Proxmox setup so that I can apply appropriate isolation (VLAN + firewall to prevent access to other networks).

I'm open to other approaches that would enable me to run privileged workflows on my local hardware, with the same kind of isolation mentioned above, and without any state on the agent leaking between workflow runs.
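For step 2 of that list, a minimal sketch of the reset on a Proxmox host, assuming the agent runs in a VM that can be rolled back with qm rollback (the vmid and snapshot name are placeholders):

```go
package main

import (
	"context"
	"log"
	"os"
	"os/exec"
)

// resetAgentVM rolls the agent's VM back to a known-good snapshot.
// Depending on whether the snapshot captured RAM state, a follow-up
// `qm start <vmid>` may be needed to boot the VM again.
func resetAgentVM(ctx context.Context, vmid, snapshot string) error {
	cmd := exec.CommandContext(ctx, "qm", "rollback", vmid, snapshot)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	// "9001" and "clean" are placeholder values for this sketch.
	if err := resetAgentVM(context.Background(), "9001", "clean"); err != nil {
		log.Fatal(err)
	}
}
```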

@zc-devs (Contributor) commented Apr 19, 2024

> pipelines to run on an agent

> I cannot trust an agent after it has started a workflow, and I need it to be able to limit itself from running any more workflows

Seems you are talking about the local backend.

> I'm open to other approaches...

It might be:

  1. a libvirt backend;
  2. a kubevirt backend;
  3. or you can try to run Kata Containers with Kubernetes now (maybe it would work with the docker backend also);
  4. or using Mirantis' virtlet CRI.

@theanurin commented:
I'm interested in the feature too.

My use case is the same as the one described by @wez.

I'm migrating GitLab -> Woodpecker.
My setup is described here: https://aljax.us/how-to-setup-gitlab-runners-in-kvm-qemu-virtual-machines/
This feature may cover GitLab's cleanup_exec.

I'm happy to help with the PR!
I'll start by testing this implementation in my env...

@qwerty287 added this to the 2.6.0 milestone Jun 5, 2024
@anbraten modified the milestones: 2.6.0, 2.7.0 Jun 10, 2024
@6543 (Member) commented Jul 13, 2024

Well, please resolve conflicts :)

@6543 removed this from the 2.7.0 milestone Jul 13, 2024
@6543 added this to the 2.8.0 milestone Jul 13, 2024
@6543 (Member) commented Jul 13, 2024

Though waiting for #3895 might be a good idea ...

EDIT: once my pull request is merged, I'll finish this one here ... :)

@6543 self-assigned this Jul 13, 2024
@6543 removed this from the 2.8.0 milestone Jul 22, 2024
@qwerty287 added this to the 3.0.0 milestone Jul 24, 2024
@6543 self-requested a review August 17, 2024 11:12
@pat-s modified the milestones: 3.0.0, 3.x.x Nov 24, 2024
@pat-s marked this pull request as draft January 5, 2025 21:18
Labels: agent, feature (add new functionality)
Projects: none yet
10 participants