
[🐛 Bug]: Each test-execution starts multiple jobs #1904

Closed
maxnitze opened this issue Jul 30, 2023 · 21 comments · Fixed by #2400
Labels
I-autoscaling-k8s Issue relates to autoscaling in Kubernetes, or the scaler in KEDA R-awaiting-answer

Comments

@maxnitze

maxnitze commented Jul 30, 2023

What happened?

When I start Selenium tests using the Grid, two jobs are always started.

One is started immediately. Once it is up and running, a second is scheduled. The second job is then used for the test. After the test is done, only the second one finishes; the other keeps running (doing nothing). Today I stopped one that had been in the running state the whole weekend.

The second job is only started as soon as the first is ready. I noticed this when the first was scheduled on a node that did not have the image yet: it took about 2:30 min to pull it, and only after that was done was the second job scheduled. At first I thought this might have something to do with a timeout, because the image pull took too long, but it also happens when the image is available and the first job is ready within seconds.

Command used to start Selenium Grid with Docker

I installed the Grid from the Helm chart using an existing KEDA installation.

selenium-grid:
  ingress:
    enabled: true
    [ ... ]

  hub:
    extraEnvironmentVariables:
      - name: TZ
        value: Europe/Berlin
    resources:
      limits:
        memory: 2Gi
      requests:
        cpu: 50m
        memory: 2Gi

  autoscaling:
    enableWithExistingKEDA: true
    scalingType: job

  chromeNode:
    enabled: true
    maxReplicaCount: 16
    extraEnvironmentVariables:
      - name: TZ
        value: Europe/Berlin
  firefoxNode:
    enabled: true
    maxReplicaCount: 8
    extraEnvironmentVariables:
      - name: TZ
        value: Europe/Berlin
  edgeNode:
    enabled: false

My Kubernetes cluster runs version 1.23.

Relevant log output

I only included the KEDA log in the form, as I could not see any interesting output in the Grid logs.

2023-07-30T12:56:19Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Effective number of max jobs": 1}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Created jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 1}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Effective number of max jobs": 1}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Created jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:39Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 2}
2023-07-30T12:56:39Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:39Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:39Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:49Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:49Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:49Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 2}
2023-07-30T12:56:49Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}

Operating System

Kubernetes 1.23 on Flatcar Linux

Docker Selenium version (tag)

4.10.0-20230607

@github-actions

@maxnitze, thank you for creating this issue. We will troubleshoot it as soon as we can.



@diemol
Member

diemol commented Jul 31, 2023

Can you share the test script you are using to see this behavior?

@maxnitze
Author

Hey @diemol ,

I asked the KEDA project as well, and it seems the issue is with the scalingStrategy. When I set it to default, it works.

See here: kedacore/keda#4833

Is there any specific reason the default is set to accurate in this chart? In the issue, @JurTurFer mentioned:

I don't think that you will have any trouble with the change. TBH, IDK why they set accurate. We suggest using accurate only in the case of knowing that the job is completed just at the end and not in the meantime. The docs explain how they work (a bit below), but the main difference is how both strategies take the current jobs into account.

kedacore/keda#4833 (comment)
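
For anyone wanting to try the same workaround: the strategy can be overridden via the chart values, roughly like this (a sketch; the scaledJobOptions.scalingStrategy path is what the chart's autoscaling values expose, but verify it against your chart version):

autoscaling:
  enableWithExistingKEDA: true
  scalingType: job
  scaledJobOptions:
    scalingStrategy:
      # override the chart default ("accurate") with KEDA's default strategy
      strategy: default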

@maxnitze
Author

maxnitze commented Jul 31, 2023

Can you share the test script you are using to see this behavior?

To answer your question: I have Geb tests for some of my applications. To connect to the Grid I use the RemoteWebDriver from org.seleniumhq.selenium:selenium-remote-driver:3.141.59.

@diemol
Member

diemol commented Aug 1, 2023

(Quoting @maxnitze's comment above asking why the chart's default strategy is set to accurate.)

@msvticket do you know?

@maxnitze
Author

maxnitze commented Aug 1, 2023

For reference: It was set to accurate right from the beginning: f0bbfe0

I could not find any discussion about the strategy in the PR.

@amardeep2006
Contributor

amardeep2006 commented Aug 7, 2023

I am seeing similar behavior with scalingType: deployment. The Kubernetes version is 1.23 for me as well.
Observations:
  • The wdio framework gets a 504 gateway timeout error.
  • A session is started on a node, but the browser does nothing.
  • A few sessions are shown as pending in the queue as well.

I will try the following and share results:

  1. Increase the timeout on the ingress.
  2. Increase the default connect timeout in the wdio framework.

@maxnitze
Author

maxnitze commented Aug 8, 2023

We are currently experiencing a problem with the default strategy as well: it expects sessions to stay in the queue while they are being worked on. The calculation for the scaled jobs basically checks whether more jobs are running than there are entries in the queue, and if that's the case, no new job is scheduled.

Maybe that is what using the accurate strategy was meant to fix? We are currently checking if and how we can implement a custom strategy instead.
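
For anyone exploring the same route: KEDA's ScaledJob spec also accepts a custom strategy with two tuning fields. A minimal sketch (the field names come from the KEDA ScaledJob API; the values below are purely illustrative):

autoscaling:
  scaledJobOptions:
    scalingStrategy:
      strategy: custom
      # subtract this many entries from the queue length before scaling (illustrative)
      customScalingQueueLengthDeduction: 1
      # count this fraction of running jobs against the queue (illustrative)
      customScalingRunningJobPercentage: "0.5"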

@amardeep2006
Contributor

(Quoting my earlier comment above about scalingType: deployment and the ingress/wdio timeouts.)

An update with scalingType: deployment: we have seen improvements after increasing the timeouts in the ingress. The pending sessions are not there anymore.
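
For reference, the ingress change was along these lines (a sketch assuming the chart's ingress.annotations passthrough and an NGINX ingress controller; tune the values to your longest test):

ingress:
  annotations:
    # standard ingress-nginx timeout annotations (seconds)
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"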

@msvticket
Contributor

(Quoting @maxnitze above: "When I set it to default it works.")

Your mileage may vary, apparently. For me it worked much better with accurate; the scale-up was way too slow with default. I suppose it depends on your priorities: if you want a fast scaling response, choose accurate; if you want to be sure you don't end up with too many pods, choose default.

@maxnitze
Author

maxnitze commented Aug 9, 2023

That might be another issue (we did not have problems with slow startup, though).

A bigger problem is the calculation of the scaling itself. I dug deeper into the KEDA code and found that the default strategy assumes that "locked messages" (i.e., the ones that are already in progress) stay in the queue, which is not the case in Selenium Grid. This leads to the issue that new sessions are only started once the queue length exceeds the number of currently running jobs.

This issue is exactly what the accurate strategy solves:

If the scaler returns queueLength (number of items in the queue) that does not include the number of locked messages, this strategy is recommended.

see https://keda.sh/docs/2.11/concepts/scaling-jobs/
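
To make the difference concrete with the formulas from those docs (a worked example assuming one new session waiting in the queue, one job already running, no pending jobs, one session per job, and maxReplicaCount not reached, so maxScale = queueLength = 1):

default:  new jobs = maxScale - runningJobCount = 1 - 1 = 0   (the queued session starves)
accurate: new jobs = maxScale - pendingJobCount = 1 - 0 = 1   (a node job is created)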

@maxnitze
Author

maxnitze commented Aug 9, 2023

I suppose it depends on your priorities: if you want a fast scaling response, choose accurate; if you want to be sure you don't end up with too many pods, choose default.

The issue was not only that we started too many pods, but rather that additional jobs were started which never finished. I had this in a test setup with only a single session, though. I'm not sure whether another session might later be taken over by the additional job. Do you have any experience there?

@msvticket
Contributor

"The issue was not only that we started too many pods, but rather that additional jobs were started which never finished."

Which is the same thing.

"I'm not sure whether another session might later be taken over by the additional job. Do you have any experience there?"

Yes, it would.

@msvticket
Contributor

(Quoting @maxnitze's explanation above of the default strategy's "locked messages" assumption and why the accurate strategy solves it.)

Exactly. Which is why I chose accurate as the default strategy in the chart.

@amardeep2006
Contributor

I have been experimenting with both scaling types (job/deployment) and am seeing multiple jobs getting triggered. On one occasion it started 16 jobs for just two test cases.
For now I am sticking with deployment and will wait to hear more from others on this behavior.
I tried KEDA 2.12.0 as well.

@cr-liorholtzman

Any update on this one? We also started having this issue after upgrading KEDA from 2.11.1 to 2.12.0.

@maxnitze
Author

Fortunately (or unfortunately for you), we don't have the problem anymore. It was happening when we had a test setup that only one application used at a time. When we scaled this up, the problem just went away. We are running hundreds of jobs daily now, with no issues with "extra spawned jobs" so far.

Sorry that I cannot be of more help :/

@maxnitze
Author

See also my comment here: kedacore/keda#4833 (comment)

@VietND96 added the I-autoscaling-k8s (Issue relates to autoscaling in Kubernetes, or the scaler in KEDA) label on Mar 21, 2024
@VietND96
Member

VietND96 commented Sep 20, 2024

You can follow https://github.com/SeleniumHQ/docker-selenium/tree/trunk/.keda
Replace the KEDA component image tag and try it out to see how it works.
I recommend using the default strategy.
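
If you override the image through the chart values, it would look roughly like this (a sketch: keda as the subchart alias and image.keda as the upstream KEDA chart's value path are assumptions to verify against your chart version, and the tag is just an example):

keda:
  image:
    keda:
      repository: selenium/keda
      # example tag only; pick the one listed under the .keda directory
      tag: "2.15.1-selenium-grid-20240922"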

@quiqueg

quiqueg commented Oct 16, 2024

@VietND96 Using the Selenium Grid Helm chart v0.36.1 (and therefore the keda-operator image "selenium/keda:2.15.1-selenium-grid-20240922"), with autoscaling.scalingType left at the default "job", I am seeing the following behavior:

  • When setting autoscaling.scaledJobOptions.scalingStrategy.strategy to default and starting 10 sessions in parallel:
    • The grid quickly scales up to 5 nodes, with 5 sessions still in the queue. It does not seem to scale beyond (# of total sessions / 2).
    • It then waits for one of the active sessions to finish and then moves one of the queued sessions to active: now we have 5 nodes and 4 sessions still in the queue.
    • When the next session finishes, it moves another queued session to active: now 5 nodes and 3 in the queue.
    • It continues this until the queue drops to 0, still with 5 nodes.
    • The nodes are scaled back down as the last 5 sessions finish.
  • When setting autoscaling.scaledJobOptions.scalingStrategy.strategy to accurate and starting 10 sessions in parallel:
    • The grid quickly scales up to 10 nodes, with 0 still in the queue.
    • The nodes are scaled back down as each session finishes.

Because I expect my sessions to take several minutes each (CI tests), and we run into test timeouts for about half of the sessions when using default (the ones that got left in the queue while the first half were active), accurate is working best for us.

@VietND96
Member

@quiqueg, thanks for your feedback; I am checking on that. I also set up a test with the default strategy and could see that all queued requests are served. The scaler also has a few unit tests with given scenarios and expected "want" numbers.
[screenshot: scaler unit test scenarios and expected values]
Will check further on this difference.
