
Draft: Diagnosing Windows integration test flakes #11904

Closed · wants to merge 9 commits

Conversation

@sunjayBhatia (Member) commented Jul 6, 2020:

DO NOT MERGE (until/unless we have some actionable changes)

This is just for diagnosing integration test flakes in CI that do not occur locally.

List of flaky tests (non-exhaustive):

  • //test/extensions/filters/http/router:auto_sni_integration_test
  • //test/integration:tcp_proxy_integration_test



@alyssawilk (Contributor) commented:

Doesn't look like this caught the listener timeout, but from a cursory look at https://dev.azure.com/cncf/4684fb3d-0389-4e0b-8251-221942316e06/_apis/build/builds/44131/logs/59 this looks like it might be a real bug. It looks like we're getting into an infinite write loop.
I just merged the fix @antoniovicente wrote, which fixes issues on delayed close. Not sure if this is the same bug, but I'd suggest a master merge to see whether we can repro.

@alyssawilk (Contributor) commented:

TcpProxyIntegrationTestParams/TcpProxyIntegrationTest.TestNoCloseOnHealthFailure/IPv6_OriginalConnPool

On inspection I don't think it's fixed by Antonio's fix, because we only delay close for HTTP; TCP gets flush-write. Again, just intuition, but it still smells like an event loop bug to me.

@alyssawilk (Contributor) commented:

Either way, if we get a pass in this build I'd be inclined to check it in. Windows is failing so often that it'd be great to get debug info on everyone's PRs.

I'm going to throw this @lizan's way for final approval.

@alyssawilk (Contributor) left a comment:

I'd be happy to check this in with -l trace, without runs-per-test, and see what we can learn from CI failures.

@antoniovicente (Contributor) commented:

> Doesn't look like this caught the listener timeout, but from a cursory look at https://dev.azure.com/cncf/4684fb3d-0389-4e0b-8251-221942316e06/_apis/build/builds/44131/logs/59 this looks like it might be a real bug. It looks like we're getting into an infinite write loop.
> I just merged the fix @antoniovicente wrote, which fixes issues on delayed close. Not sure if this is the same bug, but I'd suggest a master merge to see whether we can repro.

I see the following two failures in the log, which look like timeouts waiting for listeners. The large number of log lines like

[source/common/network/connection_impl.cc:607] [C101] write ready

is likely due to the LEVEL trigger mode for FDs. If the events come from select in LEVEL mode, the event loop does check the FDs at each iteration, so this is not a true infinite loop. A potential explanation is starvation of some FDs: if too many FDs are returning write events and there is a limit on the number of FDs select can report, some FDs may be serviced rarely or never. I'm not familiar with the select API, but I know the epoll polling mechanism does have a limit on the number of FDs with events returned by each call to epoll_wait.

2020-07-07T15:42:27.0511627Z [ RUN ] TcpProxyIntegrationTestParams/TcpProxyIntegrationTest.TestCloseOnHealthFailure/IPv4_OriginalConnPool
2020-07-07T15:46:57.0156092Z [2020-07-07 15:37:41.728][3712][critical][assert] [test/integration/integration.cc:489] assert failure: 0. Details: Timed out waiting for listeners.

2020-07-07T15:48:38.9565959Z [ RUN ] TcpProxyIntegrationTestParams/TcpProxyIntegrationTest.TestNoCloseOnHealthFailure/IPv6_OriginalConnPool
2020-07-07T15:52:50.6531359Z [2020-07-07 15:37:40.290][1944][critical][assert] [test/integration/integration.cc:489] assert failure: 0. Details: Timed out waiting for listeners.
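To illustrate the level-trigger point, here is a minimal, self-contained sketch (plain POSIX select, not Envoy code; have_pending_data() is a hypothetical helper): a connected socket with free buffer space is writable on every iteration, so a loop that unconditionally registers write interest keeps waking up and logging "write ready", even though each iteration does pass back through the poller.

```cpp
// Illustrative sketch only (POSIX select; not Envoy code).
#include <sys/select.h>
#include <cstdio>

bool have_pending_data(); // hypothetical: do we actually have bytes queued?

void event_loop(int fd) { // `fd` is a connected, non-blocking socket
  for (;;) {
    fd_set write_fds;
    FD_ZERO(&write_fds);
    // Bug pattern: always watching for writability. In LEVEL mode a healthy
    // socket with free buffer space is always writable, so select() returns
    // immediately on every iteration.
    FD_SET(fd, &write_fds);
    if (select(fd + 1, nullptr, &write_fds, nullptr, nullptr) > 0 &&
        FD_ISSET(fd, &write_fds)) {
      std::printf("[C101] write ready\n"); // floods the log; not a true infinite loop
      // Fix: only FD_SET(fd, &write_fds) while have_pending_data() is true,
      // and drop write interest once the output buffer drains.
    }
  }
}
```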

@antoniovicente (Contributor) commented:

Actually, see if merging master helps. I think the change in #11833 may fix this issue, if the timeout is caused by the bug that could cause delayed close timers to never fire.

stale bot commented Jul 18, 2020:

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

stale bot added the "stale" label (stalebot believes this issue/PR has not been touched recently) Jul 18, 2020
@sunjayBhatia (Member Author) commented:

We will continue diagnosing these today and onwards.

stale bot removed the "stale" label (stalebot believes this issue/PR has not been touched recently) Jul 21, 2020
@sunjayBhatia (Member Author) commented:

/nostalebot

stale bot commented Jul 29, 2020:

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

stale bot added the "stale" label (stalebot believes this issue/PR has not been touched recently) Jul 29, 2020
stale bot removed the "stale" label (stalebot believes this issue/PR has not been touched recently) Jul 29, 2020
@wrowe (Contributor) commented Aug 4, 2020:

Rebased for simplicity's sake since this is still in draft; looking at master again with the previous month of progress. This is a more aggressive flavor of #12343, where we expect to continue seeing failures/flakes.

sunjayBhatia marked this pull request as ready for review August 5, 2020 15:33
@sunjayBhatia (Member Author) commented:

Set to reviewable just to get CI to run.

sunjayBhatia changed the title from "Windows: Diagnosing integration test flakes" to "Draft: Diagnosing Windows integration test flakes" Aug 5, 2020
@sunjayBhatia (Member Author) commented Aug 6, 2020:

We're seeing that the integration test setup is not actually waiting a real 10 seconds, causing integration tests to time out waiting for listeners.

When we applied the following commit, d38afb1, we were able to see tests pass regularly in RBE.

We applied that commit to #12343 and will hopefully see integration tests no longer flake on Windows (the change needs to be refactored before a real commit; this is just a POC).

@mattklein123 (Member) commented:

@sunjayBhatia this is the same problem as #12480. I was just discussing this with @jmarantz. This is a problem for all tests that use SimTime. I was going to do a similar fix. Do you want me to do it, or do you want to clean up the patch?

@wrowe (Contributor) commented Aug 6, 2020:

> I was going to do a similar fix. Do you want me to do it, or do you want to clean up the patch?

Can we simply fix waitFor() / advanceTimeWait() / advanceTimeAsync() to always use the wall clock for waiting, and resolve the sleep-once-then-fall-through bugs? Otherwise it looks like some 30 problematic waitFor() calls to clean up.
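As a sketch of what "always use wall clock" could look like (hypothetical names and signature, not the actual Envoy helpers): compute the deadline from the real steady clock and re-check the condition in a loop, so a single early wakeup can never fall through as a spurious timeout.

```cpp
// Hedged sketch, not the Envoy implementation.
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>

// Wait until `condition` holds or `timeout` of *real* time elapses.
// Looping against a steady_clock deadline avoids the "sleep once and
// fall through" bug: spurious or early wakeups simply re-enter the loop.
bool waitForReal(std::mutex& mu, std::condition_variable& cv,
                 const std::function<bool()>& condition,
                 std::chrono::milliseconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  std::unique_lock<std::mutex> lock(mu);
  while (!condition()) {
    if (cv.wait_until(lock, deadline) == std::cv_status::timeout) {
      return condition(); // genuine real-time timeout: one final check
    }
  }
  return true;
}
```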

@wrowe (Contributor) commented Aug 11, 2020:

@mattklein123 The state of this PR is your submitted work combined with enabling every integration test on Windows. We expect some (a few? many?) to still fail, but with the simulated time corrections a good number should be passing now. We'll refresh again against your additional corrections, or you can cherry-pick this commit back to your patch, modulo any still-broken tests, which can be left with the fails_on_windows tag.

@wrowe (Contributor) commented Aug 11, 2020:

The patch enables 89 integration-related tests. With the progress so far, only 31 or so are now failing. We can tickle this a few times to see whether the failing set changes from run to run.

@wrowe (Contributor) commented Aug 11, 2020:

First pass today of af6e1fd in #12527 against Windows:

//test/extensions/filters/network/mysql_proxy:mysql_integration_test TIMEOUT
//test/integration:eds_integration_test TIMEOUT
//test/integration:http_subset_lb_integration_test TIMEOUT
//test/integration:load_stats_integration_test TIMEOUT
//test/test_common:simulated_time_system_test TIMEOUT
//test/extensions/clusters/redis:redis_cluster_integration_test FAILED
//test/extensions/filters/http/cache:cache_filter_integration_test FAILED
//test/extensions/filters/http/ext_authz:ext_authz_integration_test FAILED
//test/extensions/filters/http/ratelimit:ratelimit_integration_test FAILED
//test/extensions/filters/network/sni_dynamic_forward_proxy:proxy_filter_integration_test FAILED
//test/extensions/filters/network/thrift_proxy:integration_test FAILED
//test/extensions/filters/network/thrift_proxy:translation_integration_test FAILED
//test/extensions/filters/udp/dns_filter:dns_filter_integration_test FAILED
//test/extensions/filters/udp/udp_proxy:udp_proxy_integration_test FAILED
//test/integration:api_listener_integration_test FAILED
//test/integration:api_version_integration_test FAILED
//test/integration:filter_manager_integration_test FAILED
//test/integration:http2_upstream_integration_test FAILED
//test/integration:http_timeout_integration_test FAILED
//test/integration:integration_admin_test FAILED
//test/integration:listener_filter_integration_test FAILED
//test/integration:proxy_proto_integration_test FAILED
//test/integration:socket_interface_integration_test FAILED
//test/integration:tcp_tunneling_integration_test FAILED
//test/integration:uds_integration_test FAILED
//test/integration:vhds_integration_test FAILED
//test/integration:websocket_integration_test FAILED
//test/integration:xds_integration_test FAILED
//test/integration:hds_integration_test FAILED
//test/integration:idle_timeout_integration_test FAILED
//test/integration:http2_integration_test TIMEOUT

@wrowe (Contributor) commented Aug 11, 2020:

Second pass, prior to merging Matt's fix and the master merge, traded one failure:
-//test/integration:http2_integration_test TIMEOUT
+//test/integration:overload_integration_test FAILED

Rekicked with the latest fixes of the day.

@wrowe (Contributor) commented Aug 12, 2020:

Last update of Windows failures following the master merge and other fixes:
-//test/integration:http_timeout_integration_test FAILED
-//test/integration:overload_integration_test FAILED

So, two fewer failures, but it's likely that they remain flaky.

See what percentage are addressed by #12527

Co-authored-by: William A Rowe Jr <wrowe@vmware.com>
Co-authored-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
@wrowe (Contributor) commented Aug 14, 2020:

Final pass, prior to rebasing to master on the final simulated time refactoring: one new timeout, and a number of previous failures may be resolved...

+//test/extensions/stats_sinks/metrics_service:metrics_service_integration_test TIMEOUT
-//test/test_common:simulated_time_system_test TIMEOUT
-//test/extensions/filters/http/cache:cache_filter_integration_test FAILED
-//test/extensions/filters/http/ratelimit:ratelimit_integration_test FAILED
-//test/extensions/filters/network/sni_dynamic_forward_proxy:proxy_filter_integration_test FAILED
-//test/integration:filter_manager_integration_test FAILED
-//test/integration:listener_filter_integration_test FAILED
-//test/integration:tcp_tunneling_integration_test FAILED
-//test/integration:xds_integration_test FAILED

Will take a pass through the list to re-tag remaining/consistent failures as 'fails_on_windows', and any inconsistent failures as still 'flaky_on_windows'.

@jmarantz (Contributor) commented:

Quick FYI: what's merged into master was mostly not sim-time infrastructure; it was integration test infrastructure.

The only material change in sim-time was that SimulatedTimeSystem::waitFor() no longer advances the simulated time: it does real-time blocks without advancing sim-time. This turns out to work much better for the integration tests.
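A toy model of that behavior change, with hypothetical names (not the actual SimulatedTimeSystem code):

```cpp
// Toy model of the waitFor() behavior change; hypothetical names.
#include <chrono>
#include <condition_variable>
#include <mutex>

class ToySimTime {
public:
  // Old behavior: jump simulated time to the deadline, so callers
  // "waiting 10s" return almost immediately in real time -- listeners
  // never get genuine wall-clock time to come up.
  void waitForOld(std::chrono::milliseconds d) {
    std::lock_guard<std::mutex> lock(mu_);
    sim_now_ += d; // advances sim-time; ~0 real seconds elapse
  }

  // New behavior: block for real time; simulated time is untouched.
  void waitForNew(std::chrono::milliseconds d) {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait_for(lock, d, [] { return false; }); // real-time block
  }

private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::chrono::steady_clock::time_point sim_now_{};
};
```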

There is still a significant pending change for sim-time, #12614, but I think its main effect is to make it possible to delete timers that are currently in the process of calling their callbacks. At one point we suspected that might be happening in tests, but it doesn't seem like it anymore, following the cleanup of the integration tests.

#12614 is blocked on some of its own integration test failures, but I think resolution of that may have to wait for @antoniovicente .

sunjayBhatia and others added 8 commits August 14, 2020 14:00
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Will promote retry_on_windows -> flaky_on_windows if they succeed
over some 10 more rounds, and will finally tag remaining retry_on_windows
as consistently fails_on_windows

Co-authored-by: William A Rowe Jr <wrowe@vmware.com>
Co-authored-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>

Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
…-test-flakes

Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
…-test-flakes

Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
@wrowe (Contributor) commented Aug 18, 2020:

FYI @mattklein123, this is a draft PR in the first place. Has anyone tracked down why draft PRs refused to run through the entire CI cycle? That was the only reason we toggled this to an open PR.

It seems we might be better off closing this PR and beginning smaller new ones covering more focused experimental changes, since I'm unaware of any means to untag the very many code owners impacted by this particular experimental change set. We can simply kick off the next smaller-scope PRs once #12695 is merged to master, hopefully with less ownership-related spam.

@sunjayBhatia (Member Author) commented:

Closing in favor of #12695, a fresh start.

sunjayBhatia deleted the windows-integration-test-flakes branch August 18, 2020 14:45