Add support for ephemeral agent (single-use). #2176
base: main
Conversation
Moving the agent tainting to inside the runner so that the agent is tainted right after it has been assigned a workflow. This ensures the agent is tainted just before the system is affected by the workflow.
When an agent is disabled, the rpc client will now receive an error and the call will be retried via the outer loop. This allows the client retry logic to step in and retry the rpc call after a delay.
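For context, here is a minimal sketch of that retry behaviour, assuming a stand-in fetchNext function and a fixed illustrative delay rather than the actual woodpecker-ci client API and backoff:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetchNext stands in for the gRPC Next call; a disabled (no_schedule)
// agent now gets an error back instead of silently receiving nothing.
func fetchNext(ctx context.Context) (string, error) {
	return "", errors.New("agent is disabled (no_schedule)")
}

func main() {
	ctx := context.Background()
	retryDelay := 2 * time.Second // illustrative only; the real client chooses its own backoff

	for attempt := 1; attempt <= 3; attempt++ {
		workflow, err := fetchNext(ctx)
		if err != nil {
			// the outer loop backs off instead of hammering the server
			fmt.Printf("attempt %d: %v, retrying in %s\n", attempt, err, retryDelay)
			time.Sleep(retryDelay)
			continue
		}
		fmt.Println("got workflow:", workflow)
		break
	}
}

With the error surfaced, disabled agents back off instead of hammering the Next rpc method.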
Codecov Report

@@           Coverage Diff            @@
##             main    #2176   +/-   ##
========================================
  Coverage        ?   40.72%
========================================
  Files           ?      182
  Lines           ?    10899
  Branches        ?        0
========================================
  Hits            ?     4439
  Misses          ?     6121
  Partials        ?      339
========================================

☔ View full report in Codecov by Sentry.
Conflicts:
- cmd/agent/agent.go
Deployment of preview was successful: https://woodpecker-ci-woodpecker-pr-2176.surge.sh
// if ephemeral, taint the agent before running any workload.
if r.ephemeral {
	err = r.client.TaintAgent(runnerCtx)
	if err != nil {
		return fmt.Errorf("tainting agent: %w", err)
	}
}
Isn't it possible to move this to cmd/agent/agent.go (l. 232) too? Or does this have some side effect?
Mainly so that the Runner struct would not need the ephemeral field anymore.
It is possible to move it out, but the ordering is important here.
The agent needs to be tainted after receiving the workflow, just before it runs it.
- If it is tainted before, then it will never receive one.
- If it is tainted after a workflow, and that workflow causes the agent to restart in some way, it will never be tainted.
I'm open to alternative ways we can achieve this. It could be handled server-side instead, but I was trying to limit the scope of the changes as much as possible.
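To make the ordering concrete, here is a minimal sketch; the runner and client types and their method signatures are illustrative stand-ins, not the actual woodpecker-ci code:

package main

import (
	"context"
	"fmt"
)

// Illustrative stand-ins for the real runner/client types; the actual
// woodpecker-ci signatures differ.
type client struct{}

func (c *client) Next(ctx context.Context) (string, error) { return "build-42", nil }
func (c *client) TaintAgent(ctx context.Context) error     { return nil }

type runner struct {
	client    *client
	ephemeral bool
}

// run shows the ordering being discussed: the taint happens after a
// workflow has been assigned but before it is executed.
func (r *runner) run(ctx context.Context) error {
	workflow, err := r.client.Next(ctx) // 1. agent is still schedulable here
	if err != nil {
		return err
	}

	if r.ephemeral {
		// 2. disable the agent before the workflow can touch the host,
		//    so a malicious workflow cannot keep pulling new work
		if err := r.client.TaintAgent(ctx); err != nil {
			return fmt.Errorf("tainting agent: %w", err)
		}
	}

	// 3. only now run the (potentially privileged) workflow
	fmt.Println("running", workflow)
	return nil
}

func main() {
	r := &runner{client: &client{}, ephemeral: true}
	if err := r.run(context.Background()); err != nil {
		fmt.Println("error:", err)
	}
}

If step 2 were skipped or moved after step 3, a compromised workflow could keep pulling new workflows, which is exactly the window this ordering closes.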
If it is tainted after a workflow, and that workflow causes the agent to restart in some way, it will never be tainted.
How can a workflow restart the agent? Since the number of parallel workflows is 1, it shouldn't take new ones, so I don't see an issue with tainting it after the workflow. It's possible that I overlooked something, though.
This is only probable with a malicious workflow. If a workflow used privileged steps or was running on a local-backend agent, it could potentially take over the agent before it is "tainted" and pull whatever workflows it wants, until it gets some information or runs some other nefarious action.
My goal with this work is to enable me to run privileged steps on my agents by making them ephemeral. This is the first step in doing that.
Did you have a look at the woodpecker autoscaler? It's an external tool which also watches agents and disables them based on a specific condition. So I guess a similar tool could solve your use case without needing to adjust the core.
I have looked at this tool; it is what I want in terms of the creation/destruction of agents (I'll need to add AWS+OpenStack support for my needs). But this PR's use case is specifically that I cannot trust an agent after it has started a workflow, and I need it to be able to limit itself from running any more workflows.
Just came here to say: I'm looking for this single-use agent feature, and here's the context for my use case. Specifically, what I'm looking for is:
I intend to run this in a local proxmox so that I can apply appropriate isolation (vlan + firewall to prevent access to other networks). I'm open to other approaches that would enable me to run privileged workflows on my local hardware, with the same kind of isolation mentioned above, and without any state on the agent leaking between workflow runs.
Seems, you are talking about
It might be
I'm interested in this feature too. My use case is the same as the one described by @wez. I'm migrating GitLab -> Woodpecker. I'm happy to help with the PR!
Well, please resolve conflicts :)
Though waiting for #3895 might be a good idea ... EDIT: once my pull request is merged I'll finish this one here ... :)
This adds support for ephemeral/single-use agents. Agents given the --ephemeral flag will run at most one pipeline. After running a single pipeline, the agent will taint itself (which for now just disables the agent by marking it as no_schedule).

The purpose of this feature is to allow potentially privileged pipelines to run on an agent, isolated from other pipelines, and then, after the pipeline has run, to effectively throw away the agent. It is intended that these agents are provisioned by a separate system, external to woodpecker. An external system would dynamically provision agents (based on the number of waiting pipelines and their labels), run those agents with --ephemeral, wait for them to be tainted, then destroy them.

This also includes a fix for no_schedule causing disabled agents to hammer the Next rpc method, which could be a problem in a large cluster with many disabled agents (effectively a self-inflicted ddos on the woodpecker server).
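A rough sketch of that external provisioning cycle follows; the provisioner interface, the agentInfo fields and the polling approach are assumptions for illustration, not an actual Woodpecker or cloud API:

package main

import (
	"fmt"
	"time"
)

// agentInfo and provisioner are hypothetical; a real implementation would
// talk to the Woodpecker server API and a cloud provider (AWS, OpenStack,
// Proxmox, ...).
type agentInfo struct {
	ID         int64
	NoSchedule bool // set once the ephemeral agent has tainted itself
}

type provisioner interface {
	CreateAgentVM() (int64, error) // boot a VM running the agent with --ephemeral
	LookupAgent(id int64) (agentInfo, error)
	DestroyAgentVM(id int64) error
}

// runOnce provisions one single-use agent, waits for it to taint itself
// after its single pipeline, then destroys it.
func runOnce(p provisioner, pollEvery time.Duration) error {
	id, err := p.CreateAgentVM()
	if err != nil {
		return err
	}
	for {
		a, err := p.LookupAgent(id)
		if err != nil {
			return err
		}
		if a.NoSchedule {
			// the agent ran its one pipeline and disabled itself
			return p.DestroyAgentVM(id)
		}
		time.Sleep(pollEvery)
	}
}

// fakeProvisioner lets the sketch run end to end without real infrastructure.
type fakeProvisioner struct{ polls int }

func (f *fakeProvisioner) CreateAgentVM() (int64, error) { return 1, nil }
func (f *fakeProvisioner) DestroyAgentVM(id int64) error {
	fmt.Println("destroyed agent", id)
	return nil
}
func (f *fakeProvisioner) LookupAgent(id int64) (agentInfo, error) {
	f.polls++
	// pretend the agent tainted itself after its single pipeline on the second poll
	return agentInfo{ID: id, NoSchedule: f.polls > 1}, nil
}

func main() {
	if err := runOnce(&fakeProvisioner{}, 100*time.Millisecond); err != nil {
		fmt.Println("error:", err)
	}
}

The key point is that the external system only needs to watch for the agent flipping to no_schedule; everything else stays outside the woodpecker core.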