ScaledJob Scaling Issue with secondary matching batch #3782
Comments
Just a thought: should my queue length also be reporting running jobs? Is it that simple :) Ok, it's my code. I'll keep this issue open to provide the resolution. Only true with parent style.
I've fixed the issue, but I'm going to test more before I re-open the commit. Basically, if you queue up a lot of jobs and the cluster doesn't spin up the agents and register them with AzDO before the next KEDA cycle, it queues another n jobs. I've added an active-job register so it won't re-add unique jobs.
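The "active-job register" idea above can be sketched roughly as follows. This is a hypothetical toy model (not the actual PR code): it remembers job IDs that already have a pod spawned and excludes them from the reported queue length until AzDO assigns them to an agent.

```python
# Hypothetical sketch of the stateful fix described above, not the actual PR:
# remember job IDs we've already spawned a pod for, and exclude them from
# the queue length until AzDO assigns them to an agent.

class ActiveJobRegister:
    def __init__(self):
        self._spawned = set()  # job IDs that already have a pod pending

    def effective_queue_len(self, queued_job_ids, assigned_job_ids):
        # Forget jobs AzDO has now assigned to an agent; they left the queue.
        self._spawned -= set(assigned_job_ids)
        # Count only queued jobs we haven't already spawned a pod for.
        new_jobs = [j for j in queued_job_ids if j not in self._spawned]
        self._spawned.update(new_jobs)
        return len(new_jobs)

reg = ActiveJobRegister()
print(reg.effective_queue_len([1, 2, 3], []))  # first KEDA cycle -> 3
print(reg.effective_queue_len([1, 2, 3], []))  # next cycle, same queue -> 0
```

This is exactly the kind of in-memory state the next comment objects to: the register is lost whenever the scaler is recreated.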
Hmm,
So the issue occurs in more extreme cases. For instance, if you queue 1 job, the container spins up and registers with AzDO, then AzDO assigns the job to the agent, and KEDA won't match another agent as it is already matched. This fix ensures that you don't get a pod spun up for a job that is already awaiting a pod, regardless of how fast your cluster can spin up new jobs.
The problem is that this approach requires having state, and scalers shouldn't have state because they can be recreated at any time, you could have more than 1 instance, etc.
Create an AzDO job with strategy parallel 20; that's what I was using for testing. I'll try to reproduce with the other AzDO scaler types. It certainly happens with parent style, as that uses "matching agent". I'll look for a stateless method too.
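For reference, a pipeline job can be fanned out with Azure DevOps' `parallel` strategy roughly like this (a minimal sketch; the pool name and workload are placeholders):

```yaml
jobs:
  - job: load_test
    strategy:
      parallel: 20                  # queues 20 copies of this job at once
    pool: my-keda-agent-pool        # placeholder: self-hosted pool scaled by KEDA
    steps:
      - script: sleep 300           # placeholder workload to keep agents busy
```

Running this queues 20 jobs in one shot, which is what exposes the batching behaviour described in this issue.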
Set the PR to draft; you were right @JorTurFer, after a long run the state gets upset.
I have to apologize; my life has been complicated these last weeks and I haven't checked this. It's still on my TODO list before the release @Eldarrin
No worries, it worked out well. I've been doing a long-running simulation and it gets itself out of kilter. The worst part is it's fixed with a reboot, but each test takes about 7-10 days to simulate the real world lol
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. |
I still need to check this
@Eldarrin |
@JorTurFer I'm not using anything custom; the samples in my repo are basically what I use. https://github.com/Eldarrin/keda-azdo-example
I ask because we faced a similar behavior when one squad set:

```yaml
scalingStrategy:
  strategy: "custom"
  customScalingQueueLengthDeduction: 1
  customScalingRunningJobPercentage: "0.5"
```

Basically, they copied and pasted the ScaledJob spec from the docs, and once they set the default behavior everything worked properly again xD
Made it do it :) I can't share the pipeline, as I had to use an AzDO organization that supports parallel jobs. Run it about 3 times and you'll be left with agents still running. I don't think it cares about demands, parent, or basic style.
Report
This is a bit esoteric, but when using a ScaledJob: if I scale 20 (n) jobs as a first pass, then scale another 20 (n-x) jobs as a second pass (after the first n are initiated and running), the second batch does not scale. Eventually it will catch up and run the jobs, but only once the previous pods have terminated. A first n greater than the second n-x shows it best.
This is only true when using demands-based parent-style scalers.
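For context, a "demands-based parent-style" trigger looks roughly like this in the azure-pipelines scaler (a minimal sketch; the pool, template, and demand names are placeholders, and `parent` and `demands` are mutually alternative matching modes):

```yaml
triggers:
  - type: azure-pipelines
    metadata:
      poolName: my-keda-agent-pool           # placeholder pool
      organizationURLFromEnv: AZP_URL
      personalAccessTokenFromEnv: AZP_TOKEN
      parent: keda-agent-template            # "parent style": match jobs via a template agent
      # demands: maven                       # "demands style": match on agent capabilities
```

With `parent` set, the scaler only counts jobs that would match the template agent, which is the code path where the batching issue shows up.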
Current main branch tested
Expected Behavior
When a job is demanded it gets scaled regardless
Actual Behavior
The secondary batch of jobs does not scale until the previous batch is at least partially finished.
Steps to Reproduce the Problem
Logs from KEDA operator
I cannot provide logs at this time; if you need them, I can make them available.
KEDA Version
2.8.1
Kubernetes Version
1.23
Platform
Microsoft Azure
Scaler Details
azure-pipelines
Anything else?
I believe it relates to parent-keda-templates; will fix.
I think the counter is counting pods still running from the previous batch against the jobs queued by the new batch, so it decreases the pending counter when it shouldn't.