[🐛 Bug]: After upgrading to Selenium version 4.16.1 and Edge 120, some of the edge nodes are being placed in a queue. #2113

Closed
chandupranayp opened this issue Jan 24, 2024 · 15 comments
Labels
I-autoscaling-k8s Issue relates to autoscaling in Kubernetes, or the scaler in KEDA R-awaiting-answer

@chandupranayp

What happened?

After upgrading to Selenium version 4.16.1 and Edge 120, we have encountered an issue where some of the Edge nodes are being placed in a queue. Previously, we were using version 4.13.0 and Edge 117, and did not experience this problem. It seems that this issue is specific to Edge, as Chrome is functioning properly.

For example, when we trigger 5 Edge and 5 Chrome scripts, only 4 Edge nodes and all 5 Chrome nodes will open. One Edge node will be placed in the queue, despite setting the maxReplicaCount to 50.

Command used to start Selenium Grid with Docker (or Kubernetes)

Below are the yml files:

values.yml
global:
  seleniumGrid:
    imageRegistry: crazcdaks.azurecr.io
    imageTag: 4.16.1-20231219
    nodesImageTag: 4.16.1-20231219
    imagePullSecret: ""

basicAuth:
  enabled: false

isolateComponents: false

ingress:
  enabled: true
  className: ""
  annotations: {}
  hostname: selenium-grid.local
  tls: []

busConfigMap:
  name: selenium-event-bus-config
  annotations: {}

components:

  router:
    imageName: selenium/router

    imagePullPolicy: IfNotPresent
    imagePullSecret: ""

    annotations: {}
    port: 4444
    livenessProbe:
      enabled: true
      path: /readyz
      initialDelaySeconds: 10
      failureThreshold: 10
      timeoutSeconds: 10
      periodSeconds: 10
      successThreshold: 1
    readinessProbe:
      enabled: true
      path: /readyz
      initialDelaySeconds: 12
      failureThreshold: 10
      timeoutSeconds: 10
      periodSeconds: 10
      successThreshold: 1
    resources: {}
    serviceType: ClusterIP
    loadBalancerIP: ""
    serviceAnnotations: {}
    tolerations: []
    nodeSelector: {}
    priorityClassName: ""

  distributor:
    imageName: selenium/distributor

    imagePullPolicy: IfNotPresent
    imagePullSecret: ""

    annotations: {}
    port: 5553
    resources: {}
    serviceType: ClusterIP
    serviceAnnotations: {}
    tolerations: []
    nodeSelector: {}
    priorityClassName: ""

  eventBus:
    imageName: selenium/event-bus

    imagePullPolicy: IfNotPresent
    imagePullSecret: ""

    annotations: {}
    port: 5557
    publishPort: 4442
    subscribePort: 4443
    resources: {}
    serviceType: ClusterIP
    serviceAnnotations: {}
    tolerations: []
    nodeSelector: {}
    priorityClassName: ""

  sessionMap:
    imageName: selenium/sessions

    imagePullPolicy: IfNotPresent
    imagePullSecret: ""

    annotations: {}
    port: 5556
    resources: {}
    serviceType: ClusterIP
    serviceAnnotations: {}
    tolerations: []
    nodeSelector: {}
    priorityClassName: ""

  sessionQueue:
    imageName: selenium/session-queue

    imagePullPolicy: IfNotPresent
    imagePullSecret: ""

    annotations: {}
    port: 5559
    resources: {}
    serviceType: ClusterIP
    serviceAnnotations: {}
    tolerations: []
    nodeSelector: {}
    priorityClassName: ""

  extraEnvironmentVariables:

  extraEnvFrom:

hub:
  imageName: selenium/hub
  imagePullPolicy: IfNotPresent
  imagePullSecret: ""

  annotations: {}
  labels: {}
  publishPort: 4442
  subscribePort: 4443
  port: 4444
  livenessProbe:
    enabled: true
    path: /readyz
    initialDelaySeconds: 10
    failureThreshold: 10
    timeoutSeconds: 10
    periodSeconds: 10
    successThreshold: 1
  readinessProbe:
    enabled: true
    path: /readyz
    initialDelaySeconds: 12
    failureThreshold: 10
    timeoutSeconds: 10
    periodSeconds: 10
    successThreshold: 1
  extraEnvironmentVariables:
    - name: SE_SESSION_REQUEST_TIMEOUT
      value: "300"
    - name: SE_NODE_SESSION_TIMEOUT
      value: "600"
  extraEnvFrom:
  resources: {}
  serviceType: ClusterIP
  loadBalancerIP: ""
  serviceAnnotations: {}
  tolerations: []
  nodeSelector: {}
  priorityClassName: ""

chromeNode:
  enabled: true

  deploymentEnabled: true

  replicas: 0
  imageName: selenium/node-chrome
  imagePullPolicy: IfNotPresent
  imagePullSecret: ""

  ports:
    - 5555
  seleniumPort: 5900
  seleniumServicePort: 6900
  annotations: {}
  labels: {}
  resources:
    requests:
      memory: "1Gi"
      cpu: "0.25"
    limits:
      memory: "2Gi"
      cpu: "1"
  tolerations: []
  nodeSelector: {}
  hostAliases:
  extraEnvironmentVariables:
    - name: SE_SCREEN_WIDTH
      value: "1920"
    - name: SE_SCREEN_HEIGHT
      value: "1080"
    - name: SE_SESSION_REQUEST_TIMEOUT
      value: "300"
    - name: SE_NODE_SESSION_TIMEOUT
      value: "600"
  extraEnvFrom:
  service:
    enabled: true
    type: ClusterIP
    annotations: {}
  dshmVolumeSizeLimit: 2Gi
  priorityClassName: ""

  startupProbe: {}
  terminationGracePeriodSeconds: 3600
  lifecycle:
    preStop:
      exec:
        command: ["/bin/sh", "-c", "curl --request POST 'localhost:5555/se/grid/node/drain' --header 'X-REGISTRATION-SECRET;'; tail --pid=$(pgrep -f '[n]ode --bind-host false --config /opt/selenium/config.toml') -f /dev/null; sleep 30s"]

  extraVolumeMounts: []

  extraVolumes: []

firefoxNode:
  enabled: true

  deploymentEnabled: true

  replicas: 0
  imageName: selenium/node-firefox
  imagePullPolicy: IfNotPresent
  imagePullSecret: ""

  ports:
    - 5555
  seleniumPort: 5900
  seleniumServicePort: 6900
  annotations: {}
  labels: {}
  tolerations: []
  nodeSelector: {}
  resources:
    requests:
      memory: "1Gi"
      cpu: "0.25"
    limits:
      memory: "2Gi"
      cpu: "1"
  hostAliases:
  extraEnvironmentVariables:
    - name: SE_SCREEN_WIDTH
      value: "1920"
    - name: SE_SCREEN_HEIGHT
      value: "1080"
    - name: SE_SESSION_REQUEST_TIMEOUT
      value: "300"
    - name: SE_NODE_SESSION_TIMEOUT
      value: "600"
  extraEnvFrom:
  service:
    enabled: true
    type: ClusterIP
    annotations: {}
  dshmVolumeSizeLimit: 2Gi
  priorityClassName: ""

  startupProbe: {}
  terminationGracePeriodSeconds: 3600
  lifecycle:
    preStop:
      exec:
        command: ["/bin/sh", "-c", "curl --request POST 'localhost:5555/se/grid/node/drain' --header 'X-REGISTRATION-SECRET;'; tail --pid=$(pgrep -f '[n]ode --bind-host false --config /opt/selenium/config.toml') -f /dev/null; sleep 30s"]

  extraVolumeMounts: []

  extraVolumes: []

edgeNode:
  enabled: true

  deploymentEnabled: true

  replicas: 0
  imageName: selenium/node-edge
  imagePullPolicy: IfNotPresent
  imagePullSecret: ""

  ports:
    - 5555
  seleniumPort: 5900
  seleniumServicePort: 6900
  annotations: {}
  labels: {}
  tolerations: []
  nodeSelector: {}
  resources:
    requests:
      memory: "1Gi"
      cpu: "0.25"
    limits:
      memory: "2Gi"
      cpu: "1"
  hostAliases:
  extraEnvironmentVariables:
    - name: SE_SCREEN_WIDTH
      value: "1920"
    - name: SE_SCREEN_HEIGHT
      value: "1080"
    - name: SE_SESSION_REQUEST_TIMEOUT
      value: "300"
    - name: SE_NODE_SESSION_TIMEOUT
      value: "600"
  extraEnvFrom:
  service:
    enabled: true
    type: ClusterIP
    annotations:
      hello: world
  dshmVolumeSizeLimit: 2Gi
  priorityClassName: ""

  startupProbe: {}
  terminationGracePeriodSeconds: 3600
  lifecycle:
    preStop:
      exec:
        command: ["/bin/sh", "-c", "curl --request POST 'localhost:5555/se/grid/node/drain' --header 'X-REGISTRATION-SECRET;'; tail --pid=$(pgrep -f '[n]ode --bind-host false --config /opt/selenium/config.toml') -f /dev/null; sleep 30s"]

  extraVolumeMounts: []

  extraVolumes: []

customLabels: {}
Keda-seleniumtriggers.yml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: selenium-grid-chrome-scaledobject
  namespace: selenium-dev
  labels:
    deploymentName: selenium-chrome-node
spec:
  maxReplicaCount: 50
  scaleTargetRef:
    name: selenium-chrome-node
  triggers:
    - type: selenium-grid
      metadata:
        url: 'https://selenium.***.in.***.dev/graphql'
        browserName: 'chrome'
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: selenium-grid-firefox-scaledobject
  namespace: selenium-dev
  labels:
    deploymentName: selenium-firefox-node
spec:
  maxReplicaCount: 5
  scaleTargetRef:
    name: selenium-firefox-node
  triggers:
    - type: selenium-grid
      metadata:
        url: 'https://selenium.***.in.***.dev/graphql'
        browserName: 'firefox'
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: selenium-grid-edge-scaledobject
  namespace: selenium-dev
  labels:
    deploymentName: selenium-edge-node
spec:
  maxReplicaCount: 50
  scaleTargetRef:
    name: selenium-edge-node
  triggers:
    - type: selenium-grid
      metadata:
        url: 'https://selenium.***.in.***.dev/graphql'
        browserName: 'MicrosoftEdge'
        sessionBrowserName: 'msedge'

Relevant log output

To reproduce the issue, run the test below 5 times in parallel. Only 4 Edge nodes become active, and 1 remains in the queue.


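        // Assumes `docker_execution` (string) and `driver` (IWebDriver) are fields declared elsewhere in the test class.
        // MSTest test methods must be public instance methods, so `static` is dropped here.
        [TestMethod]
        public void Browser_Initialization()
        {
            try
            {
                if (docker_execution.Equals("Edge"))
                {
                    // Request an Edge session from the Grid, allowing up to 5 minutes per command.
                    EdgeOptions options = new EdgeOptions();
                    driver = new RemoteWebDriver(new Uri("https://selenium.***.in.***.dev/wd/hub"), options.ToCapabilities(), TimeSpan.FromMinutes(5));
                    driver.Manage().Window.Maximize();
                }
            }
            catch (Exception ex)
            {
                // Log and rethrow so a queued or timed-out session is visible in the test result.
                Console.WriteLine(ex);
                throw;
            }
        }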

Operating System

Kubernetes version: 1.26.6

Docker Selenium version (image tag)

Selenium version 4.16.1 and Edge 120

Selenium Grid chart version (chart version)

No response


@chandupranayp, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

@VietND96
Member

Hi @chandupranayp, is autoscaling running on top of your own existing KEDA installation? If so, which KEDA version are you using?

@chandupranayp
Author

Hi @VietND96, yes, autoscaling runs on top of our existing KEDA installation, and the KEDA version we are using is 2.9.3.

@VietND96
Member

In that case, I suggest upgrading KEDA to a recent version (currently 2.13.0) to test and confirm.
If you take a look at the KEDA changelog (https://github.com/kedacore/keda/blob/main/CHANGELOG.md), there are a few fixes for the Selenium Grid Scaler between 2.9.3 and 2.13.0:

  • Selenium Grid Scaler: ScaledObject with a trigger whose metadata browserVersion is latest is always being triggered regardless of the browserVersion requested by the user
  • Selenium Grid Scaler: Add platformName to selenium-grid scaler metadata structure
  • Selenium Grid Scaler: Fix scaling based on latest browser version
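
For reference, the changelog entries above concern the `browserVersion` and `platformName` fields of the `selenium-grid` trigger metadata. Below is a minimal sketch of the Edge trigger from this issue with those fields written out explicitly, assuming a KEDA release that supports both. The `browserVersion` and `platformName` values are illustrative assumptions, not settings confirmed to resolve this issue.

```yaml
# Sketch only: the Edge trigger from Keda-seleniumtriggers.yml with optional fields spelled out.
triggers:
  - type: selenium-grid
    metadata:
      url: 'https://selenium.***.in.***.dev/graphql'
      browserName: 'MicrosoftEdge'
      sessionBrowserName: 'msedge'
      # Assumed values for illustration; match them to the capabilities your tests request.
      browserVersion: 'latest'
      platformName: 'linux'
```

If the tests request a specific browser version or platform, the scaler only counts queued requests whose capabilities match these fields, so a mismatch can leave a request waiting in the queue.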

@chandupranayp
Author

@VietND96, Thanks for your quick feedback.

I need to wait a couple more weeks before updating KEDA because of some other dependencies, so I can only test and confirm after that. However, after reviewing the changelog you provided, I didn't find any fixes related to the Edge issue I am currently facing. Do you have any suggestions for other possible causes that I can try to fix and test before proceeding with the KEDA upgrade?

@VietND96
Member

VietND96 commented Jan 25, 2024

Ah yes, as you mentioned, this started after upgrading to 4.16.1. For that version, chart 0.26.3 changed the default value of autoscaling.scalingStrategy.strategy from accurate to default.
If you are using scalingType: job and facing this issue, can you try changing it back to accurate?
Note: in the latest chart, 0.27.0, this default has already been changed back to accurate.
If you are using scalingType: deployment, the strategy is not relevant.
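
As a rough illustration of the suggestion above, a values override for the `job` scaling type might look like the sketch below. The key path follows the name used in this comment. The exact nesting under `autoscaling` is an assumption and may differ between chart versions (for example, it may sit under a `scaledJobOptions` block), so check the `values.yaml` of the chart version in use before applying it.

```yaml
# Sketch only: key path as named in the comment above, not verified against a specific chart version.
autoscaling:
  scalingType: job
  scalingStrategy:
    # Revert from the chart 0.26.3 default ("default") to the earlier behavior.
    strategy: accurate
```

This setting only affects KEDA ScaledJob resources. With `scalingType: deployment`, as used with the ScaledObject manifests in this issue, it has no effect.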

@chandupranayp
Author

@VietND96, we are using scalingType: deployment. I upgraded Selenium from 4.16.1 to 4.17, but the issue persists. In a few days, we will update KEDA and retest this issue. Meanwhile, please let me know if you can recommend any other fixes. I greatly appreciate your time and feedback.

@chandupranayp
Author

chandupranayp commented Feb 28, 2024

@VietND96, we have now upgraded our infrastructure to the versions below. However, even after the upgrade, the issue remains: some of our Edge nodes are still going into the queue. Can you please assist with this issue?

Kubernetes version: 1.27.7
KEDA: 2.12.1
Selenium grid: 4.18.1
Edge: 122
Chrome: 122

@chandupranayp
Author

@VietND96 Can you please assist with this? Please let me know if you need any further information from my end.

@VietND96
Member

VietND96 commented Mar 2, 2024

Hi @chandupranayp, I will get back to you on this when I have any clue. Besides this issue, some other instabilities related to autoscaling are also under investigation.

@chandupranayp
Author

Hello @VietND96, thank you so much for the acknowledgment.

@VietND96 added the I-autoscaling-k8s label (Issue relates to autoscaling in Kubernetes, or the scaler in KEDA) Mar 21, 2024
@chandupranayp
Author

Hello @VietND96, is there any update on my issue, please?

@VietND96
Member

VietND96 commented Jul 21, 2024

@chandupranayp, the exact root cause has yet to be identified. However, 2 fixes are available from the Grid server.
SeleniumHQ/selenium#14272 (delivered in 4.23)
SeleniumHQ/selenium#14282 (will be delivered in 4.23)
We will continue to keep track of this issue.

@VietND96 added this to the 4.24.0 milestone Jul 21, 2024
@VietND96
Member

FYI, image tag 4.23.0-20240727 and chart version 0.33.0 contain the fixes mentioned above. Kindly verify and provide feedback if it is the right fix for this issue.
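
To try the release mentioned above, the image tags in the `values.yml` from this issue would be bumped roughly as follows. This is a sketch that assumes the `global.seleniumGrid` key layout shown earlier in this issue still applies in chart 0.33.0, and that the 4.23.0-20240727 images have been mirrored to the private registry configured under `imageRegistry`.

```yaml
# Sketch only: bump the Grid and node image tags to the release containing the fixes.
global:
  seleniumGrid:
    imageRegistry: crazcdaks.azurecr.io
    imageTag: 4.23.0-20240727
    nodesImageTag: 4.23.0-20240727
```

The chart itself would also need to be upgraded to version 0.33.0, for example by passing `--version 0.33.0` to `helm upgrade`.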

@VietND96 modified the milestones: 4.24.0, 4.23.0 Jul 27, 2024

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions bot locked and limited conversation to collaborators Aug 27, 2024