-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARC not working with ResourceQuotas. Fails to schedule pod instead of queuing. #3629
Comments
Hello! Thank you for filing an issue. The maintainers will triage your issue shortly. In the meantime, please take a look at the troubleshooting guide for bug reports. If this is a feature request, please review our contribution guidelines. |
I created a separate issue for k8s mode and container hooks: #3630. There the jobs fail if there's not enough quota at the moment available. |
Following change seems to "fix" at least the simple case with two jobs but this is probably not the way to go and I would not recommend it: diff --git a/controllers/actions.github.com/ephemeralrunner_controller.go b/controllers/actions.github.com/ephemeralrunner_controller.go
index 36ea114..be82878 100644
--- a/controllers/actions.github.com/ephemeralrunner_controller.go
+++ b/controllers/actions.github.com/ephemeralrunner_controller.go
@@ -21,6 +21,7 @@ import (
"errors"
"fmt"
"net/http"
+ "strings"
"time"
"github.com/actions/actions-runner-controller/apis/actions.github.com/v1alpha1"
@@ -216,6 +217,21 @@ func (r *EphemeralRunnerReconciler) Reconcile(ctx context.Context, req ctrl.Requ
case err == nil:
return result, nil
case kerrors.IsInvalid(err) || kerrors.IsForbidden(err):
+ if strings.Contains(err.Error(), "exceeded quota") {
+ log.Info("Failed to create a pod due to quota exceeded. Let's try again later")
+ log.Error(err, "Error: ")
+ err := r.Patch(ctx, ephemeralRunner, client.RawPatch(types.MergePatchType, []byte(`{"metadata":{"finalizers":[]}}`)))
+ if err != nil {
+ log.Error(err, "Error: ")
+ return ctrl.Result{}, err
+ }
+ err = r.Delete(ctx, ephemeralRunner)
+ if err != nil {
+ log.Error(err, "Error: ")
+ return ctrl.Result{}, err
+ }
+ return ctrl.Result{}, nil
+ }
log.Error(err, "Failed to create a pod due to unrecoverable failure")
errMessage := fmt.Sprintf("Failed to create the pod: %v", err)
if err := r.markAsFailed(ctx, ephemeralRunner, errMessage, ReasonInvalidPodFailure, log); err != nil {
|
Thank you @ropelli for bringing this up. We are facing exactly the same issue - the EphemeralRunner ends up in Failed state, and RunnerSet is not spawning another runner, if the number or runners, also those in Failed state, equals running workflows. In my opinion it would be good to see all Failed EphemeralRunners (it gives us context what happened) at least for some time, but have Controller able to recover, and spawn new ones, when one of them fails to start. Awaiting ARC team response :) |
Checks
Controller Version
0.9.3
Deployment Method
Helm
Checks
To Reproduce
You need to request more cpu,memory,storage etc. than the resource quota. For example:
You can also do this with a single job that goes over the resource quota. But above is more likely scenario.
Describe the bug
In the example provided, one job will run, the other will get stuck waiting for a runner as the ephemeralrunner ends up in Failed state.
In general, jobs get stuck waiting for a runner that never appears until another job is scheduled for same runner scale set.
Describe the expected behavior
In the example provided, one job should run at a time and queue properly and complete one after the other. Leading to a successful build.
In general, when quota is temporarily exceeded, we should try again after a while preferably through a queue implementation.
Additional Context
Controller Logs
Runner Pod Logs
The text was updated successfully, but these errors were encountered: