
[🐛 Bug]: Each test-execution starts multiple jobs #1904

Closed
maxnitze opened this issue Jul 30, 2023 · 21 comments · Fixed by #2400
Labels
I-autoscaling-k8s Issue relates to autoscaling in Kubernetes, or the scaler in KEDA R-awaiting-answer

Comments

@maxnitze

maxnitze commented Jul 30, 2023

What happened?

When I start Selenium tests using the Grid, two jobs are always started.

One is started immediately. Once it is up and running, a second is scheduled. The second job is then used for the test. After the test is done, only the second one finishes; the other keeps running (doing nothing). Today I stopped one that had been in the running state the whole weekend.

The second job is only started as soon as the first is ready. I noticed this when the first was scheduled on a node that did not have the image yet: it took about 2:30 min to pull it, and only after that was done was the second job scheduled. At first I thought this might have something to do with a timeout, because the image pull took too long, but it also happens when the image is available and the first job is ready within seconds.

Command used to start Selenium Grid with Docker

I installed the Grid from the Helm chart using an existing KEDA installation.

selenium-grid:
  ingress:
    enabled: true
    [ ... ]

  hub:
    extraEnvironmentVariables:
      - name: TZ
        value: Europe/Berlin
    resources:
      limits:
        memory: 2Gi
      requests:
        cpu: 50m
        memory: 2Gi

  autoscaling:
    enableWithExistingKEDA: true
    scalingType: job

  chromeNode:
    enabled: true
    maxReplicaCount: 16
    extraEnvironmentVariables:
      - name: TZ
        value: Europe/Berlin
  firefoxNode:
    enabled: true
    maxReplicaCount: 8
    extraEnvironmentVariables:
      - name: TZ
        value: Europe/Berlin
  edgeNode:
    enabled: false

My Kubernetes cluster runs version 1.23.

Relevant log output

I only included the KEDA log in the form, as I could not see any interesting output in the Grid logs.

2023-07-30T12:56:19Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Effective number of max jobs": 1}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Created jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 1}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Effective number of max jobs": 1}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Created jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:39Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 2}
2023-07-30T12:56:39Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:39Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:39Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:49Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:49Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:49Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 2}
2023-07-30T12:56:49Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}

Operating System

Kubernetes 1.23 on Flatcar Linux

Docker Selenium version (tag)

4.10.0-20230607

@github-actions

@maxnitze, thank you for creating this issue. We will troubleshoot it as soon as we can.



@diemol
Member

diemol commented Jul 31, 2023

Can you share the test script you are using to see this behavior?

@maxnitze
Author

Hey @diemol ,

I asked the KEDA project as well, and it seems the issue is with the scalingStrategy. When I set it to default, it works.

See here: kedacore/keda#4833

Is there any specific reason the default is set to accurate in this chart? In the issue, @JurTurFer mentioned:

I don't think that you will have any trouble with the change. TBH, IDK why they set accurate. We suggest using accurate only in the case of knowing that the job is completed just at the end and not in the meantime. The docs explain how they work (a bit below), but the main difference is how both strategies take the current jobs into account.

kedacore/keda#4833 (comment)
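
For anyone wanting to try the same workaround: the strategy can be overridden via the chart values, roughly like this (a sketch; the scaledJobOptions.scalingStrategy path is what the chart's autoscaling values expose, but verify it against your chart version):

autoscaling:
  enableWithExistingKEDA: true
  scalingType: job
  scaledJobOptions:
    scalingStrategy:
      # override the chart default ("accurate") with KEDA's default strategy
      strategy: default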

@maxnitze
Author

maxnitze commented Jul 31, 2023

Can you share the test script you are using to see this behavior?

To answer your question: I have Geb tests for some of my applications. To connect to the Grid I use the RemoteWebDriver from org.seleniumhq.selenium:selenium-remote-driver:3.141.59.

@diemol
Member

diemol commented Aug 1, 2023

(Quoting @maxnitze's comment above asking why the chart's default strategy is set to accurate.)

@msvticket do you know?

@maxnitze
Author

maxnitze commented Aug 1, 2023

For reference: It was set to accurate right from the beginning: f0bbfe0

I could not find any discussion about the strategy in the PR.

@amardeep2006
Contributor

amardeep2006 commented Aug 7, 2023

I am seeing similar behavior with scalingType: deployment. The Kubernetes version is 1.23 for me as well.
Observations:
  • The wdio framework gets a 504 gateway timeout error.
  • A session is started on a node, but the browser does nothing.
  • A few sessions are shown as pending in the queue as well.

I will try the following and share results:

  1. Increase the timeout on the ingress.
  2. Increase the default connect timeout in the wdio framework.

@maxnitze
Author

maxnitze commented Aug 8, 2023

We are currently experiencing a problem with the default strategy as well: it expects sessions to stay in the queue while they are being worked on. The calculation for the scaled jobs basically checks whether more jobs are running than there are entries in the queue, and if that's the case, no new job is scheduled.

Maybe that is what using the accurate strategy was meant to fix? We are currently checking if and how we can implement a custom strategy instead.
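
For anyone exploring the same route: KEDA's ScaledJob spec also accepts a custom strategy with two tuning fields. A minimal sketch (the field names come from the KEDA ScaledJob API; the values below are purely illustrative):

autoscaling:
  scaledJobOptions:
    scalingStrategy:
      strategy: custom
      # subtract this many entries from the queue length before scaling (illustrative)
      customScalingQueueLengthDeduction: 1
      # count this fraction of running jobs against the queue (illustrative)
      customScalingRunningJobPercentage: "0.5"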

@amardeep2006
Contributor

(Quoting my earlier comment above about scalingType: deployment and the ingress/wdio timeouts.)

An update with scalingType: deployment: we have seen improvements after increasing the timeouts in the ingress. The pending sessions are not there anymore.
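
For reference, the ingress change was along these lines (a sketch assuming the chart's ingress.annotations passthrough and an NGINX ingress controller; tune the values to your longest test):

ingress:
  annotations:
    # standard ingress-nginx timeout annotations (seconds)
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"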

@msvticket
Contributor

(Quoting @maxnitze above: "When I set it to default it works.")

Your mileage may vary, apparently. For me it worked much better with accurate; the scale-up was way too slow with default. I suppose it depends on your priorities: if you want a fast scaling response, choose accurate; if you want to be sure you don't end up with too many pods, choose default.

@maxnitze
Author

maxnitze commented Aug 9, 2023

That might be another issue (we did not have problems with slow startup, though).

A bigger problem is the calculation of the scaling itself. I dug deeper into the KEDA code and found that the default strategy assumes that "locked messages" (i.e., the ones that are already in progress) stay in the queue, which is not the case in Selenium Grid. This leads to the issue that new sessions are only started once the queue length exceeds the number of currently running jobs.

This issue is exactly what the accurate strategy solves:

If the scaler returns queueLength (number of items in the queue) that does not include the number of locked messages, this strategy is recommended.

see https://keda.sh/docs/2.11/concepts/scaling-jobs/
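
To make the difference concrete with the formulas from those docs (a worked example assuming one new session waiting in the queue, one job already running, no pending jobs, one session per job, and maxReplicaCount not reached, so maxScale = queueLength = 1):

default:  new jobs = maxScale - runningJobCount = 1 - 1 = 0   (the queued session starves)
accurate: new jobs = maxScale - pendingJobCount = 1 - 0 = 1   (a node job is created)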

@maxnitze
Author

maxnitze commented Aug 9, 2023

I suppose it depends on your priorities: if you want a fast scaling response, choose accurate; if you want to be sure you don't end up with too many pods, choose default.

The issue was not only that we started too many pods, but rather that additional jobs were started which never finished. I had this in a test setup with only a single session, though. I'm not sure whether another session might later be taken over by the additional job. Do you have any experience there?

@msvticket
Contributor

"The issue was not only that we started too many pods, but rather that additional jobs were started which never finished."

Which is the same thing.

"I'm not sure whether another session might later be taken over by the additional job. Do you have any experience there?"

Yes, it would.

@msvticket
Contributor

(Quoting @maxnitze's explanation above of the default strategy's "locked messages" assumption and why the accurate strategy solves it.)

Exactly. Which is why I chose accurate as the default strategy in the chart.

@amardeep2006
Contributor

I have been experimenting with both scaling types (job/deployment) and am seeing multiple jobs getting triggered. On one occasion it started 16 jobs for just two test cases.
For now I am sticking with deployment and will wait to hear more from others on this behavior.
I tried KEDA 2.12.0 as well.

@cr-liorholtzman

Any update on this one? We also started having this issue after upgrading KEDA from 2.11.1 to 2.12.0.

@maxnitze
Author

Fortunately (or unfortunately for you), we don't have the problem anymore. It was happening when we had a test setup that only one application used at a time. When we scaled this up, the problem just went away. We are running hundreds of jobs daily now, with no issues with "extra spawned jobs" so far.

Sorry that I cannot be of more help :/

@maxnitze
Author

See also my comment here: kedacore/keda#4833 (comment)

@VietND96 added the I-autoscaling-k8s (Issue relates to autoscaling in Kubernetes, or the scaler in KEDA) label on Mar 21, 2024
@VietND96
Member

VietND96 commented Sep 20, 2024

You can follow https://github.com/SeleniumHQ/docker-selenium/tree/trunk/.keda
Replace the KEDA component image tag and try it out to see how it works.
I recommend using the default strategy.
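
If you override the image through the chart values, it would look roughly like this (a sketch: keda as the subchart alias and image.keda as the upstream KEDA chart's value path are assumptions to verify against your chart version, and the tag is just an example):

keda:
  image:
    keda:
      repository: selenium/keda
      # example tag only; pick the one listed under the .keda directory
      tag: "2.15.1-selenium-grid-20240922"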

@quiqueg

quiqueg commented Oct 16, 2024

@VietND96 Using the Selenium Grid Helm chart v0.36.1 (and therefore the keda-operator image "selenium/keda:2.15.1-selenium-grid-20240922"), with autoscaling.scalingType left at the default "job", I am seeing the following behavior:

  • When setting autoscaling.scaledJobOptions.scalingStrategy.strategy to default and starting 10 sessions in parallel:
    • The grid quickly scales up to 5 nodes, with 5 sessions still in the queue. It does not seem to scale beyond (# of total sessions / 2).
    • It then waits for one of the active sessions to finish and then moves one of the queued sessions to active: now we have 5 nodes and 4 sessions still in the queue.
    • When the next session finishes, it moves another queued session to active: now 5 nodes and 3 in the queue.
    • It continues this until the queue drops to 0, still with 5 nodes.
    • The nodes are scaled back down as the last 5 sessions finish.
  • When setting autoscaling.scaledJobOptions.scalingStrategy.strategy to accurate and starting 10 sessions in parallel:
    • The grid quickly scales up to 10 nodes, with 0 still in the queue.
    • The nodes are scaled back down as each session finishes.

Because I expect my sessions to take several minutes each (CI tests), and we run into test timeouts for about half of the sessions when using default (the ones that got left in the queue while the first half were active), accurate is working best for us.

@VietND96
Member

@quiqueg, thanks for your feedback; I am checking on that. I also set up a test with the default strategy and could see that all queued requests are served. The scaler also has a few unit tests with given scenarios and expected "want" numbers.
[screenshot: scaler unit test scenarios and expected values]
Will check further on this difference.
