This repository has been archived by the owner on Sep 21, 2021. It is now read-only.

Performance issues #116

Closed
ghost opened this issue May 18, 2017 · 33 comments

Comments

@ghost

ghost commented May 18, 2017

Hi all,

While using Zalenium, I've had some performance issues with the Docker containers. Containers are very slow to start and launch, to the point where all the tests start timing out. Even when I use a fixed number of containers, some of them time out and are shut down, and waiting for the others to start is very slow. For reference, I'm usually running about 10 to 20 tests in parallel from a test run of over 150 tests; the machine slows down considerably, and all the tests past the first one usually fail.
Is there any way to speed up this process? Either a feature that is planned or available, or a hack.

Thank you in advance.

@diemol
Contributor

diemol commented May 18, 2017

Hi @joao-valente,

Some performance issues might arise in cases of high concurrency or many tests running in parallel. Our normal scenarios don't run more than 4 tests in parallel, so that's why we have not seen anything relevant yet.

We have seen some things that we can improve:

  • Scale horizontally (we plan to work on it during the summer); see "kubernetes support" #103.
  • We found an improvement in container creation; in some cases it was failing, and then some tests were failing randomly (we want to release this improvement next week).
  • We are also thinking about reusing containers for more than one test, to avoid the continuous creation. This is just an idea and we need to see if it makes sense. See "Restart node after run" #135.
  • We create the new containers in a very conservative way, one by one, because we have seen that the Grid has problems when many nodes come to register at the same time. We'll try to improve that so the container creation is more fluid. See "Too many containers are created" #143.
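
The trade-off in the last point, creating containers one by one versus letting many register at once, can be illustrated with a bounded worker pool. This is only a sketch and not Zalenium's implementation: `start_all` and `fake_start` are hypothetical names, and `fake_start` merely sleeps instead of talking to Docker or the Grid.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def start_all(start_container, count, max_concurrent):
    """Start `count` containers, at most `max_concurrent` at a time."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as executor:
        # map() preserves input order in the returned results.
        return list(executor.map(start_container, range(count)))


# Hypothetical stand-in that records peak concurrency instead of calling Docker.
lock = threading.Lock()
running = 0
peak = 0


def fake_start(index):
    global running, peak
    with lock:
        running += 1
        peak = max(peak, running)
    time.sleep(0.01)  # pretend the node is booting and registering
    with lock:
        running -= 1
    return f"node-{index}"


nodes = start_all(fake_start, count=10, max_concurrent=3)
print(len(nodes), "nodes started, peak concurrency", peak)
```

Setting `max_concurrent=1` corresponds to the conservative one-by-one behaviour described above; raising it trades startup time against the number of nodes registering simultaneously.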

Maybe you can help us with some more information, perhaps a timeline of how things happen. From the beginning where everything is running fine, and then what happens afterwards to make the performance go down. Also, what hardware specs are you using? How many containers are you starting at the beginning?

With more info from your side we could come up with more ideas.

@woza2000

Hi @diemol

Reusing containers does make sense.

In my case, I need static nodes whose internal IP I can get, so that I can allocate one user account per IP. When nodes are dynamic, I have to use an extra database to manage these accounts in order to make sure each test gets a unique account. Because I can't create thousands of test accounts, I have to lock and unlock them in the DB during every test run.
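
The lock/unlock scheme described here can be sketched as a small in-process pool. This is an illustration only, assuming a fixed list of test accounts and tests running as threads inside one process; the `AccountPool` class and the account names are hypothetical, and tests spread over several machines would still need a shared store such as the database mentioned above.

```python
import queue
from contextlib import contextmanager


class AccountPool:
    """Hands out each test account to at most one test at a time."""

    def __init__(self, accounts):
        self._free = queue.Queue()
        for account in accounts:
            self._free.put(account)

    @contextmanager
    def lease(self, timeout=30):
        # Blocks until an account is free; raises queue.Empty on timeout.
        account = self._free.get(timeout=timeout)
        try:
            yield account
        finally:
            # Always unlock the account, even if the test failed.
            self._free.put(account)


pool = AccountPool(["user1", "user2"])  # hypothetical account names
with pool.lease() as first:
    with pool.lease() as second:
        print("two tests hold different accounts:", first != second)
```

Using a context manager means the "unlock" step cannot be forgotten when a test throws, which is the usual failure mode of manual lock/unlock bookkeeping.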

Your improvement plan is highly appreciated.

@SrinivasanTarget

@diemol Executing simultaneously on 10 ~ 20 containers doesn't yield stable results. Tests hang sometimes, and I see interrupt and null pointer exceptions in the logs. I can share the logs tomorrow if required.

@diemol
Contributor

diemol commented May 25, 2017

@SrinivasanTarget could you please also share your HW specifications? Logs are helpful as well.
So far, we have seen that Zalenium performs well depending on the available RAM and processor power.

@SrinivasanTarget

SrinivasanTarget commented May 25, 2017

@diemol Yup, I was running in a 16 GB VM running Ubuntu 16.x. I don't have the logs at hand now; I will surely share them tomorrow. I was trying to execute around 200 tests with 20 containers spun up. The same execution on elgalu/docker-selenium was fine.

@diemol
Contributor

diemol commented May 25, 2017

Thanks @SrinivasanTarget, logs will be useful. Perhaps you can also share with us:

  • How you start Zalenium
  • how many threads you configure in your tests
  • how you start elgalu/docker-selenium when the execution goes well

All this info will be very helpful for us :)

@SrinivasanTarget

SrinivasanTarget commented May 26, 2017

@diemol Please find the zalenium logs here: https://gist.github.com/SrinivasanTarget/a88aa39274717d31af46d01056408175

How you start Zalenium

docker run --rm -ti --name zalenium -p 4444:4444 -p 5555:5555 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /tmp/videos:/home/seluser/videos \
  dosel/zalenium start --maxDockerSeleniumContainers 20

Results were the same even when executed via docker-compose.

how many threads you configure in your tests

data-provider-thread-count="15", but the results are the same even when the count is reduced to 4 or 8.

how you start elgalu/docker-selenium when the execution goes well

Yeah, it is through docker-compose: https://github.com/elgalu/docker-selenium/blob/master/docker-compose.yml

@SrinivasanTarget

@diemol Do you have any updates on this? Do you need any other information?

@diemol
Contributor

diemol commented May 29, 2017

Hi @SrinivasanTarget,

We need more time to check it. We ran 16 parallel tests on a Linux machine with 16GB and it worked OK; the same number of threads on a Mac with 16GB didn't work so well.

We'll check if something can be improved or if it is just a matter of HW.

@saikrishna321

@diemol we also have the same issue, when we bring up more than 15 containers

@tacf

tacf commented May 30, 2017

Hi, regarding performance issues, we've found that with the vanilla containers (from Selenium) RAM was not an issue (14 GB is more than sufficient for 20 containers). To stabilize test runs we needed to upgrade from 2 to 4 cores (using Azure cloud), and we even moved to a 4-core machine on an improved processor family (30% more processing capacity). We're working over a swarm network, and scaling from nothing to 60 containers on 3 machines takes less than a minute, including node registration (I would risk saying about 30 seconds for all nodes to register).

Another side note: when testing heavy load on the same setup, it is easy to pass the point where you overload the machine with containers and tests start failing because the grid doesn't respond in time. The same setup we use now to run 60 browsers will run 200 without complaint, but the test results will be flaky. Another point to notice is the overhead of requesting browser instances from the Selenium hub: scaling from, for instance, 15 to 20 browsers may not really be worth it when running 100-200 tests in the same test run (assuming parallelism), because getting from 0 to X browsers up and running takes too long. We gained 2 minutes out of 15 going from 15 to 20. Running 60 parallel browsers for the same run brought it down to only 9 minutes.
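
To put those run times in perspective, they can be compared with ideal linear scaling. The sketch below assumes that "2 minutes out of 15" means 15 minutes with 15 browsers and 13 minutes with 20, and that the 9-minute figure is for 60 browsers; the `scaling_report` helper is hypothetical.

```python
def scaling_report(base_browsers, base_minutes, browsers, minutes):
    """Compare an observed run time with ideal linear scaling."""
    # Run time if doubling the browsers halved the duration.
    ideal_minutes = base_minutes * base_browsers / browsers
    speedup = base_minutes / minutes
    ideal_speedup = browsers / base_browsers
    # Fraction of the ideal speedup that was actually achieved.
    efficiency = speedup / ideal_speedup
    return ideal_minutes, speedup, efficiency


# Assumed baseline: 15 browsers take 15 minutes.
for browsers, minutes in [(20, 13), (60, 9)]:
    ideal, speedup, eff = scaling_report(15, 15, browsers, minutes)
    print(f"{browsers} browsers: ideal {ideal:.2f} min, observed {minutes} min, "
          f"speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

Under these assumptions the 20-browser run reaches roughly 87% of ideal scaling and the 60-browser run roughly 42%, which is consistent with the point that browser request overhead eats into the gains as parallelism grows.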

My point is that you can surely work out the performance issues, as they must have something to do with your own customizations. I would love to see these features in a more stable form. Nonetheless, you should note these facts that I worked out, in order to differentiate between the performance issues that you can address and the nature of the Selenium grid itself.

These are my 2 cents, hope it will be helpful.

@katryo

katryo commented May 30, 2017

Hi, I also ran into the same (probably) problem.

  1. Start Zalenium with docker run --rm -ti --name zalenium -p 4446:4444 -p 5555:5555 -v /var/run/docker.sock:/var/run/docker.sock -v /tmp/videos:/home/seluser/videos dosel/zalenium start --timeZone "Asia/Tokyo" --videoRecordingEnabled true
  2. Run six tests in parallel using Zalenium
  3. Containers were created, browsers were opened, but they suddenly stopped working when I was watching the browsers with HOST:4446/grid/admin/live.
  4. Several containers became unhealthy while others are healthy.
  5. After I stopped the tests, browser sessions and containers remained.

I reproduced this problem on Mac's Docker too.

I have not experienced this kind of problem with the official Selenium Docker image ( https://github.com/SeleniumHQ/docker-selenium ), even when running eight tests in parallel.

In addition, I ran into some other problems, such as the elgalu/selenium container not starting selenium-node-chrome because of a java.lang.RuntimeException: java.net.BindException: Address already in use error, but I cannot reproduce that problem.

OS: CentOS 7.2.1511
CPU: Intel Core Processor (Haswell, no TSX), 2.4GHz, 10 cores
Memory: 10GB

docker's log: https://gist.github.com/katryo/d2c588554d1ace8583ccaa3e755bfb98
other data: https://gist.github.com/katryo/9180919444544db12d3bd1677ca8f6eb

I hope this helps you.

@diemol
Contributor

diemol commented May 31, 2017

Thank you @SrinivasanTarget, @tacf, @katryo, @saikrishna321 for all the info and detailed logs.

Right now we are spending time reading the logs you submitted and also running Zalenium in debug mode to spot where the main bottleneck occurs when many tests are executed at the same time.

What we plan to achieve is:

  • After understanding the logs better, make some changes to improve the behaviour when running several tests in parallel.
  • Come up with some "good usage" guidelines, which should give a hint on how many tests you could run in parallel given some hardware specifications. Also some setup tips, like starting the containers beforehand and things like that.

I am not sure how long this will take, but we are investing time in this because we think that if we are able to fix these performance issues (and add the Kubernetes feature), Zalenium could become very successful.

@SrinivasanTarget

@diemol Thanks for your response :) Thanks for this wonderful project 👍

I would like to share a few observations from my end here.

I hope you are aware of https://github.com/aerokube/selenoid. I have been trying all the available Docker Selenium solutions on the market. Based on my attempts, I was able to execute up to ~200 tests in 13-15 containers using the Zalenium, docker-selenium, and elgalu's docker-selenium images on a 16 GB Ubuntu machine. I executed the same ~200 scripts on a 16 GB Ubuntu machine with 30 containers (CPU usage was 85-90%) using Selenoid successfully, and I got stable results from Selenoid on each execution. Though I love the idea of on-demand containers in Zalenium, I see that Selenoid spins up somewhat fewer containers and seems to reuse containers to an extent. I think it would be great if Zalenium also reused containers instead of killing, relaunching, and registering nodes for each test. I accept that Kubernetes, Docker Swarm, or powerful AWS instances might be a long-term solution.

we think that if we are able to fix this performance issues

Looking forward to it :)

(and adding the Kubernetes feature)

Are we planning to support Docker Swarm as well? Both Kubernetes and Docker Swarm support self-healing now.

@elgalu
Member

elgalu commented May 31, 2017

@SrinivasanTarget thanks for this info!! It is really helpful.

I was trying all available docker selenium solutions in market

Do you think you could send us a PR to add an "Alternatives" section to the README.md listing all available working alternatives (with the links)? I think this will be very useful to us and to our users. Ideally we would differentiate each project by use case, so people reading it can decide what fits them better and don't have to try them all.

It should be short and concise; if it's too long, a blog post might be a better place.

@SrinivasanTarget

@elgalu Sure, there are still a couple of solutions left for me to try. I will raise a PR after those attempts.

@manoj9788

@SrinivasanTarget That's a good piece of research on stability.

@diemol
Contributor

diemol commented Jun 1, 2017

Thanks for the comments @SrinivasanTarget
I was aware of Selenoid, and we were trying it yesterday and it looks awesome! How come they are not better known?

Continuing with the topic: right now we are breaking Zalenium apart into pieces to detect where it gets slow when adding many tests in parallel. While doing this, we found a few things yesterday that may lead to improvements. We'll work on changing the network mode and also the way containers are created, until we reach the point where the only limit is the grid itself.

We'll keep you posted.

@elgalu
Member

elgalu commented Jun 1, 2017

@SrinivasanTarget you may also want to check https://github.com/seleniumkit/gridrouter as someone pointed out in another issue

@SrinivasanTarget

Yes it is in my list @elgalu :)

@elgalu
Member

elgalu commented Jun 1, 2017

As Diego mentioned, we tested Selenoid yesterday with great results! I was able to run, without VNC enabled, 50 tests in parallel within 1 minute on my laptop! (8 cores, 16GB)
Great job! @aandryashin @vania-pooh !!!

Diego is looking into Zalenium performance issues as we speak:)

@vania-pooh

@SrinivasanTarget regarding GridRouter - please try the newer implementation: http://github.com/aerokube/ggr It is also written in Go and has been tested enough in production.

@vania-pooh

vania-pooh commented Jun 1, 2017

Just to put all eggs in one basket :) here are some recently posted articles about ggr and Selenoid:

@SrinivasanTarget

@vania-pooh I did read all the histories today. Interesting and a long journey. Great Stuff 👍

@manoj9788

@vania-pooh Do you want to submit a paper on this for the upcoming Selenium Conference in Berlin ?

@vania-pooh

@manoj9788: already submitted a talk about scalable Selenium.

@manoj9788

Oh! yeah! I see that. Thanks.

@vania-pooh

Btw, regarding Selenium server performance, I found several places in the code that could be optimized:

  1. Jetty 9 is a monster, with too much functionality for Selenium's purposes. I would replace it with something lightweight, e.g. Undertow. It supports both the Servlet API and JAX-RS.
  2. Even if we keep Jetty, I would use its built-in proxying capabilities instead of doing this manually with the Apache HTTP client. Take a look at how it's done in the original Java-based GridRouter: https://github.com/seleniumkit/gridrouter/blob/master/proxy/src/main/java/ru/qatools/gridrouter/ProxyServlet.java
  3. I think there are some problems in the Apache client settings. To reproduce the slowdown, just connect approximately 20 nodes to the hub and request all available browsers. If you then try to open the Grid console, you will notice that it opens slowly. My hypothesis is that something locks in the Apache HTTP client connection pool. However, I checked, and the pool size (2000) is enough, so this needs further investigation.
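
The reproduction in point 3 could be quantified by timing the Grid console before and after the browsers are requested. This is a rough sketch, not something from the thread: `fetch_console` and `measure` are hypothetical helpers, and the URL assumes a hub listening on localhost:4444 with the console at `/grid/console`.

```python
import statistics
import time
import urllib.request


def fetch_console(url="http://localhost:4444/grid/console", timeout=30):
    """Fetch the Grid console page once (the URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def measure(fetch, samples=5):
    """Return (median, worst) wall-clock seconds over `samples` calls to fetch()."""
    durations = []
    for _ in range(samples):
        started = time.perf_counter()
        fetch()
        durations.append(time.perf_counter() - started)
    return statistics.median(durations), max(durations)
```

Calling `measure(fetch_console)` once on an idle hub and once while about 20 nodes are busy would give two pairs of numbers to compare; a large gap would support the locking hypothesis without proving where the lock is.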

@diemol
Contributor

diemol commented Jun 2, 2017

Thanks for the comments @vania-pooh, and hopefully we meet at SeleniumConf!

I mostly agree with the three points you mention. The thing is that we are using the grid as it is; we are not compiling our own grid (yet, though I don't rule out doing it in the future). I'll look into them, so maybe we find a way to improve the grid.

We already found ways to improve Zalenium's performance by tuning some of the parameters passed to the grid and also changing the way we create the containers on the fly. We are still testing those changes, but it looks promising.

It won't be as fast as Selenoid :), but at least it runs several threads in parallel in a stable way and in a decent time. More details to come soon.

@diemol
Contributor

diemol commented Jun 8, 2017

Hi all,

We just released version 3.3.1i, where we have improved a few things. Taking the list of improvements that I mentioned in a previous comment, I can give you an update:

We have worked to improve Zalenium and also created a basic document with our findings.

In addition, the pending tasks are tracked in separate issues.

Please check the document and try the new version we have released. Thank you very much for all the input you gave us.

For now, I would like to close this issue since there are too many things in it. In case of finding new bugs or performance problems, please create a new issue and we will work on it. We invite you to contribute to the linked document with your own performance data, so more people can benefit from it.

@felippenardi

@diemol Can you add the tag for 3.3.1l?

@diemol
Contributor

diemol commented Jun 20, 2017

Hi @felippenardi,

This was released with tag 3.3.1i, but more improvements were made in subsequent releases; the current release is 3.3.1k.

3.3.1l is still under development.

@felippenardi

Oh got you! Thanks :)
