Draft: Diagnosing Windows integration test flakes #11904
Conversation
Doesn't look like this caught the listener timeout, but at a cursory look at https://dev.azure.com/cncf/4684fb3d-0389-4e0b-8251-221942316e06/_apis/build/builds/44131/logs/59:

TcpProxyIntegrationTestParams/TcpProxyIntegrationTest.TestNoCloseOnHealthFailure/IPv6_OriginalConnPool

On inspection I don't think it's fixed by Antonio's fix, because we only delay close for HTTP; TCP gets flush-write. Again, just intuition, but it still smells like an event loop bug to me.

Either way, if we get a pass in this build I'd be inclined to check it in. Windows is failing so often that it'd be great to get debug info on everyone's PRs. I'm going to throw this @lizan's way for final approval.
I'd be happy to check this in with -l trace without runs-per-test and see what we can learn from CI failures.
I see the following 2 failures in the log, which look like timeouts waiting for listeners:

2020-07-07T15:42:27.0511627Z [ RUN ] TcpProxyIntegrationTestParams/TcpProxyIntegrationTest.TestCloseOnHealthFailure/IPv4_OriginalConnPool
2020-07-07T15:48:38.9565959Z [ RUN ] TcpProxyIntegrationTestParams/TcpProxyIntegrationTest.TestNoCloseOnHealthFailure/IPv6_OriginalConnPool

The large number of logs like:

[source/common/network/connection_impl.cc:607] [C101] write ready

is likely due to LEVEL trigger for FDs. If the events come from select in LEVEL mode, the event loop does check for fds at each iteration, so this is not a true infinite loop. A potential explanation may include starvation of some fds if there are too many fds returning Write events and there is a limit on the number of FDs that can be returned by select. I'm not familiar with the select API; I know that the epoll polling mechanism does have a limitation on the number of fds with events returned by each call to epoll_wait.
Actually, see if merging master helps. I think that the change in #11833 may actually fix this issue, if the timeout is caused by the bug that could cause delayed close timers to never fire.

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

We will continue diagnosing these today and onwards.

/nostalebot

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

From the latest run with trace logging: https://dev.azure.com/cncf/4684fb3d-0389-4e0b-8251-221942316e06/_apis/build/builds/45657/logs/52

Rebased for simplicity's sake as this is still in draft; looking at master again with the previous month of progress. This is a more aggressive flavor of #12343, where we expect to continue to see failures/flakes.

Set to reviewable just to get CI to run.
We're actually seeing that the integration test setup is not actually waiting for a real 10s, causing integration tests to time out waiting for listeners. When we applied d38afb1, we were able to see tests pass regularly in RBE. We applied that commit to #12343 and will hopefully see integration tests no longer flake on Windows. (The change needs to be refactored before a real commit; this is just a POC.)
@sunjayBhatia this is the same problem as #12480. I was just discussing this with @jmarantz. This is a problem on all tests that use SimTime. I was going to do a similar fix. Do you want me to do it, or do you want to clean up the patch?
Can we simply fix waitFor() / advanceTimeWait() / advanceTimeAsync() to always use wall clock for waiting, and resolve the sleep-once-then-fall-through bugs? Otherwise it looks like some 30 problematic waitFor()s to clean up.
@mattklein123 The state of this PR is your submitted work combined with enabling every integration test on Windows. We expect some (a few? many?) to still fail, but with the simulated time corrections, a good number should be passing now. We'll refresh again against your additional corrections, or you can cherry-pick this commit back to your patch, modulo any still-broken tests, which can be left with the fails_on_windows tag.
The patch enables 89 integration-related tests. With the progress so far, only 31-some are now failing. We can tickle this a few times to see if the failing set changes from run to run.
1st pass today of af6e1fd in #12527 against Windows:

//test/extensions/filters/network/mysql_proxy:mysql_integration_test TIMEOUT
Second pass, prior to merging Matt's fix and the master merge, traded one failure. Rekicked with the latest fixes of the day.
Last update of Windows failures following merging master and other fixes: two fewer failures, but it's likely that they remain flaky.
See what percentage are addressed by #12527

Co-authored-by: William A Rowe Jr <wrowe@vmware.com>
Co-authored-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Final pass, prior to rebasing to master based on the final simulated time refactoring: one new timeout, and a number of previous failures may be resolved...

+//test/extensions/stats_sinks/metrics_service:metrics_service_integration_test TIMEOUT

Will take a pass through the list to re-tag remaining/consistent failures as 'fails_on_windows', and any inconsistent failures as still 'flaky_on_windows'.
Quick FYI: what's merged into master was mostly not sim-time infrastructure; it was integration test infrastructure. The only material thing that changed in sim-time was that SimulatedTimeSystem::waitFor() no longer advances the simulated time: it does real-time blocks without advancing sim-time. This turns out to work much better for the integration tests. There is still a significant pending change for sim-time, #12614, but I think the main effect of this is to make it possible to delete timers that are currently in the process of calling their callbacks. At one point we suspected that might be happening in tests, but it doesn't seem like that's the case anymore, following the cleanup of the integration tests. #12614 is blocked on some of its own integration test failures, but I think resolution of that may have to wait for @antoniovicente.
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Will promote retry_on_windows -> flaky_on_windows if they succeed over some 10 more rounds, and will finally tag the remaining retry_on_windows as consistently fails_on_windows.

Co-authored-by: William A Rowe Jr <wrowe@vmware.com>
Co-authored-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
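For context, the tagging scheme mentioned above is applied as Bazel tags on the affected test targets. A hypothetical BUILD fragment might look like this (target names and attributes are illustrative, not taken from the actual patch):

```starlark
# Hypothetical BUILD fragment illustrating the Windows tagging scheme
# discussed above; envoy_cc_test is Envoy's test macro.

envoy_cc_test(
    name = "tcp_proxy_integration_test",
    srcs = ["tcp_proxy_integration_test.cc"],
    # Inconsistent failures are quarantined as flaky on Windows:
    tags = ["flaky_on_windows"],
)

envoy_cc_test(
    name = "mysql_integration_test",
    srcs = ["mysql_integration_test.cc"],
    # Consistent failures are excluded from the Windows CI run entirely:
    tags = ["fails_on_windows"],
)
```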
…-test-flakes Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
…-test-flakes Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
FYI @mattklein123, this is a draft PR in the first place. Has anyone tracked down why draft PRs refuse to run through the entire CI cycle? That was the only reason we toggled this to an open PR. It seems we might be better off closing this PR and beginning smaller new ones over more focused experimental changes, since I'm unaware of any means to untag the very many code owners impacted by this particular experimental change set. We can simply kick off the next smaller-scope PRs once #12695 is merged to master, hopefully with less ownership-related spam.
Closing in favor of #12695, a fresh start.
DO NOT MERGE (until/unless we have some actionable changes)
This is just for diagnosing integration test flakes in CI that do not occur locally.
List of flaky tests (non-exhaustive):
//test/extensions/filters/http/router:auto_sni_integration_test
//test/integration:tcp_proxy_integration_test
Commit Message:
Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]