[CI] DocsClientYamlTestSuiteIT timed out after 30 minutes #49753
Comments
Pinging @elastic/es-docs (:Docs)
This has been happening increasingly often over the past couple of days because the scope of the suite has simply been growing slowly over time. I think we may need to bump this timeout in the interim until we sort out a better solution, like splitting the suite up.
So something has clearly changed to make this test suite suddenly run so much slower. The reason I don't believe we are seeing this failure in intake and pull request checks is that those are split up into multiple smaller parallel builds, so there is much less contention on the worker than for the full periodic builds. There's no single test that's running long. It simply looks as though all test cases are just a bit slower, adding up to a pretty substantial runtime hit. Another interesting data point is that this is only happening in

@ywelsch @DaveCTurner Any thoughts? Any chance that change might have an adverse effect on performance on a system with a lot of other CPU and IO contention (lots of other concurrent tests)? @danielmitterdorfer Have you seen any recent changes in performance benchmarks that might indicate some kind of regression has been introduced?
For me this looks like a duplicate of #49579. However, since 2020-01-13 the number of failures has increased, which supports the investigations above.
I suspect #50907 and #50928. Although these changes save potentially hundreds of

Do these tests run using a ramdisk? If not, can we move them to use a ramdisk? That would make
Pinging @elastic/es-core-infra (:Core/Infra/Build)
While investigations are running that will hopefully result in a proper fix, I will increase the timeout as a temporary solution, because too many builds are failing at the moment.
increase timeout of DocsClientYamlTestSuiteIT to 35 minutes, temporary solution for issue #49753
I increased the timeout to 35 minutes. This is a temporary solution; please revert it once we have a proper fix.
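For reference, a minimal sketch of what such a suite-level timeout bump might look like, assuming the suite is driven by the randomized-testing framework's @TimeoutSuite annotation the way Elasticsearch's YAML REST suites are; the class body here is illustrative only, not the actual change:

```java
// Sketch only: raise the per-suite timeout to 35 minutes as a stopgap.
// ESClientYamlSuiteTestCase and ClientYamlTestCandidate are assumed to come from
// the Elasticsearch test framework; only the annotation is the point here.
import com.carrotsearch.randomizedtesting.annotations.TimeoutSuite;
import org.apache.lucene.util.TimeUnits;
import org.elasticsearch.test.rest.yaml.ClientYamlTestCandidate;
import org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase;

@TimeoutSuite(millis = 35 * TimeUnits.MINUTE) // temporary bump while the slowdown is investigated
public class DocsClientYamlTestSuiteIT extends ESClientYamlSuiteTestCase {

    public DocsClientYamlTestSuiteIT(ClientYamlTestCandidate testCandidate) {
        super(testCandidate);
    }

    // The @ParametersFactory method that enumerates the YAML tests is omitted here.
}
```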
It doesn't look like 35 minutes is enough. The suite is still timing out. I'll push the timeout up further.
The docs test suite is still timing out on CI at 35 minutes, so pushing it to 40 minutes while we determine the cause of the slowdown. Relates: #49753
The docs test suite is still timing out on CI at 35 minutes, so pushing it to 40 minutes while we determine the cause of the slowdown. Relates: elastic#49753 Backport of: elastic#51200
The master docs YAML tests timed out today. I couldn't find evidence of any particular test killing the cluster or the test itself failing, so this looks to be another instance of general slowdown.
@mark-vieira can you establish a baseline performance for

On my machine, for example, it takes 32 minutes running on an SSD, but on a RAM disk it takes 16 minutes, which brings it well below the timeout.
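As a side note on how one might sanity-check a given worker, here is an illustrative, hypothetical snippet (not from this thread) that reports whether a directory such as the build workspace is backed by an in-memory filesystem; the "tmpfs"/"ramfs" type strings are a Linux-specific assumption of the example:

```java
// Hypothetical helper: report the file store type backing a directory, as a
// heuristic for "is this workspace on a RAM disk?". Other OSes report different
// type strings, so treat a negative result as inconclusive.
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

public final class RamDiskCheck {
    public static void main(String[] args) throws IOException {
        Path workspace = Path.of(args.length > 0 ? args[0] : ".");
        FileStore store = Files.getFileStore(workspace);
        String type = store.type();
        boolean likelyRamDisk = type.equalsIgnoreCase("tmpfs") || type.equalsIgnoreCase("ramfs");
        System.out.printf("%s is on a %s file store (RAM disk: %s)%n",
            workspace.toAbsolutePath(), type, likelyRamDisk);
    }
}
```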
Another maybe-useful data point: I just saw three timeouts come in back-to-back, and they were all from the same host. Perhaps there is something specific to certain host/machine/OS combinations that is exacerbating the issue? I know the OS/JREs have already been analyzed, but I wasn't sure about hosts, so I thought I'd mention it.
We use ephemeral workers, so we never run multiple builds on a single agent. Perhaps it's the same type of host. Our compatibility matrix tests show this timing out across all operating systems, so I don't think specific host types are the issue. Also, as I mentioned before, this regression is not in
Keep in mind this is running only a single test suite, so resource contention is very low. This is the same reason we generally don't see timeouts in pull-request or intake builds: those jobs are split up into smaller pieces with fewer concurrently executing tests. The graph below is the average execution time for this suite over the past 30 days for intake/PR builds (these use a ramdisk). You can see that while we are still well below the timeout threshold, there is a clear bump in execution time around Jan 12-13. The additional load of our periodic tests (which run the entire test suite in a single build) is enough to push this over the edge.
Here are the graphs for

The clear disparity between branches makes me want to rule out anything infrastructure-related, given that all of this stuff runs on the same infra. That said, there might have been some change in
OK, I take back the conclusion in my previous comment. The additional syscall logging seems to be specific to certain OSes, not branches, and we have extended test execution times even on systems where this logging isn't present, so this doesn't seem to be the cause of our problem. Damn, back to the drawing board.
I've found the source of the slowness here and will open a PR shortly. The gist is that with ILM/SLM we do a lot of unnecessary setup/teardown work on each test. Compounded with the slightly slower cluster state storage mechanism, this causes the tests to run much slower. On the core/infra side, we could also look at speeding up plugin installs. It currently takes 30-40 seconds at the beginning of
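To make the idea concrete, here is a rough, hypothetical sketch of the kind of change that avoids the per-test overhead; it is not the code from the actual PR. The wipeIlmPolicies/wipeSlmPolicies helpers and the created*Policies bookkeeping are invented names for illustration, while Request and RestClient are the low-level Elasticsearch REST client classes:

```java
// Hypothetical sketch: only delete ILM/SLM policies in per-test cleanup when the
// test actually created some, instead of issuing wipe calls unconditionally after
// every YAML test. The policy-name sets are assumed to be tracked by the harness.
import java.io.IOException;
import java.util.Set;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

final class PolicyCleanup {

    static void wipeIlmPolicies(RestClient adminClient, Set<String> createdIlmPolicies) throws IOException {
        for (String policy : createdIlmPolicies) { // no round-trips at all when empty
            adminClient.performRequest(new Request("DELETE", "/_ilm/policy/" + policy));
        }
    }

    static void wipeSlmPolicies(RestClient adminClient, Set<String> createdSlmPolicies) throws IOException {
        for (String policy : createdSlmPolicies) {
            adminClient.performRequest(new Request("DELETE", "/_slm/policy/" + policy));
        }
    }
}
```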
Edit: PR #51430 added
While we could and should do that (now that we can install them all in one plugin command), the timeout here is in the test suite, while we do plugin installation externally from Gradle.
I don't believe the solution proposed in #51418 is going to help here. As I've stated before, we already run most of our CI builds in a ramdisk workspace, and those builds continue to time out at 40 minutes, whereas we were running under 20 minutes before.
The docs tests have recently been running much slower than before (see #49753). The gist here is that with ILM/SLM we do a lot of unnecessary setup/teardown work on each test. Compounded with the slightly slower cluster state storage mechanism, this causes the tests to run much slower.

In particular, on RAM disk, docs:check is taking:
ES 7.4: 6:55 minutes
ES master: 16:09 minutes
ES with this commit: 6:52 minutes

On SSD, docs:check is taking:
ES 7.4: ??? minutes
ES master: 32:20 minutes
ES with this commit: 11:21 minutes
This looks to have done the trick. Thanks @ywelsch!
We got another timeout on elastic+elasticsearch+master+multijob-darwin-compatibility: https://gradle-enterprise.elastic.co/s/fczylflkgrebm Looking at the output, the tests are running at an excruciatingly slow pace (3-4 seconds per test). Perhaps the darwin tests are not using a RAM disk, or are doing too many things in parallel?
Thanks for reporting this @ywelsch. This looks to be an issue with that macOS Jenkins worker. We've seen a number of build failures specific to this machine, for which I have an open infra issue (https://github.com/elastic/infra/issues/17621). You can see this test is passing just fine and running in a reasonable time on other mac workers. Since this is an infra problem, and I've reached out regarding it, I'm going to close this issue again.
Build failure https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=corretto11,nodes=general-purpose/362/console
Build scan available at https://gradle-enterprise.elastic.co/s/lp2vd4vqmj25o