Autoscaling selenium grid on kubernetes #1714
Conversation
@diemol @jamesmortensen As discussed earlier, I have raised two separate PRs for autoscaling and video recording.
Autoscale selenium browser nodes running in Kubernetes based on the requests pending in the session queue using KEDA. Toggle autoscaling on/off using the 'autoscalingEnabled' option in the Helm chart.
Force-pushed from 7e4f215 to 96eca2e
Hi @prashanth-volvocars thank you for splitting them up. We'll take a look as soon as we can.
Thanks for this PR, @prashanth-volvocars!
I left some comments.
charts/selenium-grid/README.md
Outdated
@@ -29,6 +29,11 @@ helm install selenium-grid docker-selenium/selenium-grid --version <version>
helm install selenium-grid --set ingress.hostname=selenium-grid.k8s.local docker-selenium/chart/selenium-grid/.
```

## Enable Selenium Grid Autoscaling
Selenium Grid has the ability to autoscale browser nodes up/down based on the requests pending in session queue. You can enable it setting 'autoscalingEnabled' to `true`. You need to install KEDA by following the [instructions](https://keda.sh/docs/2.8/deploy/#helm) in order for autoscaling to work.
Suggested change:
```suggestion
Selenium Grid has the ability to autoscale browser nodes up/down based on the pending requests in the
session queue.
You can enable it by setting 'autoscalingEnabled' to `true`. You need to install KEDA by following the
[instructions](https://keda.sh/docs/2.8/deploy/#helm) in order for autoscaling to work.
```
## Enable Selenium Grid Autoscaling
Selenium Grid has the ability to autoscale browser nodes up/down based on the requests pending in session queue. You can enable it setting 'autoscalingEnabled' to `true`. You need to install KEDA by following the [instructions](https://keda.sh/docs/2.8/deploy/#helm) in order for autoscaling to work.

The hpa.url value is configured to work for grid installed in `default` namespace. If you are installing the grid in some other namespace make sure to update the value of hpa.url accordingly.
Suggested change:
```suggestion
The `hpa.url` value is configured to work for Grid when installed in the `default` namespace. If you are installing
the Grid in some other namespace make sure to update the value of `hpa.url` accordingly.
```
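To make the README flow concrete, here is a sketch of the commands involved. These require a running cluster; the KEDA commands follow the linked KEDA docs, and the chart/repo names are the ones used elsewhere in this README:

```shell
# Install KEDA first (commands per https://keda.sh/docs/2.8/deploy/#helm)
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

# Then install the Grid with autoscaling enabled
helm install selenium-grid docker-selenium/selenium-grid \
  --set autoscalingEnabled=true
```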
charts/selenium-grid/values.yaml
Outdated
@@ -363,7 +364,7 @@ chromeNode:
   # Custom annotations for service
   annotations: {}
   # Size limit for DSH volume mounted in container (if not set, default is "1Gi")
- dshmVolumeSizeLimit: 1Gi
+ dshmVolumeSizeLimit: 2Gi
Why? This would affect people who are already using the chart, their clusters have been provisioned with the existing values.
We had some issues while running it along with video recording, which vanished as soon as we increased the limit. I think it shouldn't be part of this PR, though. Will revert this.
charts/selenium-grid/values.yaml
Outdated
@@ -375,17 +376,17 @@ chromeNode:
   # failureThreshold: 120
   # periodSeconds: 5
   # Time to wait for pod termination
- terminationGracePeriodSeconds: 30
+ terminationGracePeriodSeconds: 3600
Why?
If the autoscaler chooses to kill a pod that's currently running a test, we would have at most one hour for the running test to complete rather than 30 seconds.
If the test doesn't complete within the configured grace period it would result in a failure, so giving it the maximum allowed time helps reduce these failures.
Wouldn't this affect the setup of users who already have this configuration? Isn't there an alternative to this hard-coded value? One of the reasons we added the draining after X sessions feature was to enable this use case, so the pod could live until the session is done.
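For context, the draining-after-X-sessions feature mentioned here is exposed through the node container environment (treat the variable name and values layout below as assumptions, not the chart's actual keys):

```yaml
# Sketch only: ask each node to drain itself after one session so the
# container exits once the test finishes.
chromeNode:
  extraEnvironmentVariables:
    - name: SE_DRAIN_AFTER_SESSION_COUNT  # assumed docker-selenium variable name
      value: "1"
```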
To expand on what @diemol writes:
Wouldn't it be more straight forward to use ScaledJob combined with draining after one session?
I wrote about it here:
@diemol Killing after x sessions wouldn't help. Consider this scenario where you create two sessions, one that runs for 60 mins and another for 1 min. As soon as the first is done, the HPA will try to bring down one of the pods irrespective of the configured x sessions. We have no control over which pod is chosen by the HPA. This is not the current flow, but it will happen when KEDA scales down.
Also, I don't understand the implication that would be caused by increasing it even for current users. No one would want their tests to be interrupted. Also, this value has no effect unless we have the preStop lifecycle hook enabled. Even if that is enabled, it's only going to delay the pod termination until the test session is completed, which I think is the desired behaviour.
@prashanth-volvocars draining after X sessions will make the container exit and ideally the pod should stop, am I missing something?
I'd say that for most users ScaledJobs would be the best fit so this should be the default in the helm chart. Creating a pod is normally not expensive in Kubernetes; Fargate is a special case. But sure, you can skip that and I guess it's me or @diemol that will implement it.
@diemol yes, we no longer need X sessions as the pod can handle sessions as long as there are pending sessions in the queue.
@msvticket The default should be what works for every environment, not for a specific few. Jobs work for a few, not all, but pods work for all. If you think a scaled job suits your needs best, it's available as an option that you can very well enable, but making it the default will entirely break it for specific environments. Hope you understand.
What option? I don't see any mention of it in the chart.
```yaml
lifecycle:
  preStop:
    exec:
      command:
        - bash
        - -c
        - |
          curl -X POST 127.0.0.1:5555/se/grid/node/drain --header 'X-REGISTRATION-SECRET;' && \
          while curl 127.0.0.1:5555/status; do sleep 1; done;
```
Shouldn't this `preStop` be enabled only if `autoscalingEnabled` is `true`?
Makes sense. Will add the check
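A minimal sketch of that check, assuming standard Helm templating (the exact values path is illustrative, not the final template):

```yaml
# Hypothetical deployment template fragment: render the drain hook
# only when autoscaling is turned on.
{{- if .Values.autoscalingEnabled }}
lifecycle:
  preStop:
    exec:
      command:
        - bash
        - -c
        - |
          curl -X POST 127.0.0.1:5555/se/grid/node/drain --header 'X-REGISTRATION-SECRET;' && \
          while curl 127.0.0.1:5555/status; do sleep 1; done;
{{- end }}
```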
```yaml
autoscalingEnabled: false
maxReplicaCount: 8
hpa:
  url: http://selenium-hub.default:4444/graphql # Replace with your http graphql url here
```
Regarding `selenium-hub.default`: can this be retrieved from the Hub or Router service or pod name?
I had the same thought too, but created the PR in the excitement of sharing it😁. Will add this too.
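One way this could look, assuming the hub Service name can be derived from the release and namespace (the naming here is illustrative, not the final template):

```yaml
# Hypothetical template fragment: build the GraphQL URL from the
# release name and namespace instead of hard-coding selenium-hub.default.
hpa:
  url: 'http://{{ .Release.Name }}-selenium-hub.{{ .Release.Namespace }}:4444/graphql'
```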
charts/selenium-grid/values.yaml
Outdated
@@ -475,7 +485,7 @@ firefoxNode:
   # Custom annotations for service
   annotations: {}
   # Size limit for DSH volume mounted in container (if not set, default is "1Gi")
- dshmVolumeSizeLimit: 1Gi
+ dshmVolumeSizeLimit: 2Gi
Same comment as above.
```yaml
url: http://selenium-hub.default:4444/graphql # Replace with your http graphql url here
browserName: chrome
# browserVersion: '91.0' # Optional. Only required when supporting multiple versions of browser in your Selenium Grid.
unsafeSsl: 'true' # Optional
```
What does `unsafeSsl` mean?
@diemol https://keda.sh/docs/2.8/scalers/selenium-grid-scaler/
"Skip certificate validation when connecting over HTTPS."
charts/selenium-grid/values.yaml
Outdated
@@ -487,17 +497,17 @@ firefoxNode:
   # failureThreshold: 120
   # periodSeconds: 5
   # Time to wait for pod termination
- terminationGracePeriodSeconds: 30
+ terminationGracePeriodSeconds: 3600
Same comment as above.
```yaml
lifecycle:
  preStop:
    exec:
      command:
        - bash
        - -c
        - |
          curl -X POST 127.0.0.1:5555/se/grid/node/drain --header 'X-REGISTRATION-SECRET;' && \
          while curl 127.0.0.1:5555/status; do sleep 1; done;
```
Same comment as above.
```yaml
maxReplicaCount: 8
hpa:
  url: http://selenium-hub.default:4444/graphql # Replace with your http graphql url here
  browserName: MicrosoftEdge
```
If you want Edge to work, you will need to have "sessionBrowserName" set to "msedge" as well.
Docs: https://keda.sh/docs/2.8/scalers/selenium-grid-scaler/
Issue regarding this: kedacore/keda#2709
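Per the linked KEDA selenium-grid scaler docs, the Edge trigger needs both names; a sketch of the metadata (layout illustrative, the two browser values are the documented ones):

```yaml
# KEDA scaler metadata for Edge: browserName matches the requested
# capability, sessionBrowserName matches what the Grid reports for sessions.
hpa:
  url: http://selenium-hub.default:4444/graphql
  browserName: MicrosoftEdge
  sessionBrowserName: msedge
```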
@prashanth-volvocars Any news regarding this PR?
@prashanth-volvocars @diemol Any updates on this?
@diemol I have added/addressed all the concerns raised. Please take a look.
I will have a look after releasing 4.10.0, so in June.
Description
Autoscale selenium browser nodes running in Kubernetes based on the requests pending in the session queue using KEDA. Toggle autoscaling on/off using the 'autoscalingEnabled' option in the Helm chart.
Motivation and Context
Autoscaling Selenium Grid was a problem that had been pending to be solved for a long time, so I took it up when there was a requirement at my current workplace. KEDA seemed to be the best candidate for the job, and I wrote a new scaler for Selenium Grid a year ago. I would like to have this enabled by default in our charts so everyone could use it.
Types of changes
Checklist