
Draft: Diagnosing Windows integration test flakes #11904

Closed · wants to merge 9 commits

Conversation

@sunjayBhatia (Member) commented Jul 6, 2020:

DO NOT MERGE (until/unless we have some actionable changes)

This is just for diagnosing integration test flakes in CI that do not occur locally.

List of flaky tests (non-exhaustive):

  • //test/extensions/filters/http/router:auto_sni_integration_test
  • //test/integration:tcp_proxy_integration_test



@alyssawilk (Contributor) commented:

Doesn't look like this caught the listener timeout, but from a cursory look at https://dev.azure.com/cncf/4684fb3d-0389-4e0b-8251-221942316e06/_apis/build/builds/44131/logs/59 this looks like it might be a real bug. It looks like we're getting into an infinite write loop.
I just merged the fix @antoniovicente wrote, which fixes issues on delayed close. Not sure if this is the same bug, but I'd suggest a master merge to see whether we can repro.

@alyssawilk (Contributor) commented:

TcpProxyIntegrationTestParams/TcpProxyIntegrationTest.TestNoCloseOnHealthFailure/IPv6_OriginalConnPool

On inspection I don't think it's fixed by Antonio's fix, because we only delay close for HTTP; TCP gets flush-write. Again, just intuition, but it still smells like an event loop bug to me.

@alyssawilk (Contributor) commented:

Either way, if we get a pass in this build I'd be inclined to check it in. Windows is failing so often that it'd be great to get debug info on everyone's PRs.

I'm going to throw this @lizan's way for final approval.

@alyssawilk (Contributor) left a comment:

I'd be happy to check this in with -l trace, without runs-per-test, and see what we can learn from CI failures.

@antoniovicente (Contributor) commented:

> Doesn't look like this caught the listener timeout, but from a cursory look at https://dev.azure.com/cncf/4684fb3d-0389-4e0b-8251-221942316e06/_apis/build/builds/44131/logs/59 this looks like it might be a real bug. It looks like we're getting into an infinite write loop.
> I just merged the fix @antoniovicente wrote, which fixes issues on delayed close. Not sure if this is the same bug, but I'd suggest a master merge to see whether we can repro.

I see the following two failures in the log, which look like timeouts waiting for listeners. The large number of log lines like

[source/common/network/connection_impl.cc:607] [C101] write ready

is likely due to the LEVEL trigger mode for FDs. If the events come from select in LEVEL mode, the event loop does check the FDs at each iteration, so this is not a true infinite loop. A potential explanation is starvation of some FDs: if too many FDs are returning write events and there is a limit on the number of FDs select can report, some FDs may be serviced rarely or never. I'm not familiar with the select API, but I know the epoll polling mechanism does have a limit on the number of FDs with events returned by each call to epoll_wait.

2020-07-07T15:42:27.0511627Z [ RUN ] TcpProxyIntegrationTestParams/TcpProxyIntegrationTest.TestCloseOnHealthFailure/IPv4_OriginalConnPool
2020-07-07T15:46:57.0156092Z [2020-07-07 15:37:41.728][3712][critical][assert] [test/integration/integration.cc:489] assert failure: 0. Details: Timed out waiting for listeners.

2020-07-07T15:48:38.9565959Z [ RUN ] TcpProxyIntegrationTestParams/TcpProxyIntegrationTest.TestNoCloseOnHealthFailure/IPv6_OriginalConnPool
2020-07-07T15:52:50.6531359Z [2020-07-07 15:37:40.290][1944][critical][assert] [test/integration/integration.cc:489] assert failure: 0. Details: Timed out waiting for listeners.
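To illustrate the level-trigger point, here is a minimal, self-contained sketch (plain POSIX select, not Envoy code; have_pending_data() is a hypothetical helper): a connected socket with free buffer space is writable on every iteration, so a loop that unconditionally registers write interest keeps waking up and logging "write ready", even though each iteration does pass back through the poller.

```cpp
// Illustrative sketch only (POSIX select; not Envoy code).
#include <sys/select.h>
#include <cstdio>

bool have_pending_data(); // hypothetical: do we actually have bytes queued?

void event_loop(int fd) { // `fd` is a connected, non-blocking socket
  for (;;) {
    fd_set write_fds;
    FD_ZERO(&write_fds);
    // Bug pattern: always watching for writability. In LEVEL mode a healthy
    // socket with free buffer space is always writable, so select() returns
    // immediately on every iteration.
    FD_SET(fd, &write_fds);
    if (select(fd + 1, nullptr, &write_fds, nullptr, nullptr) > 0 &&
        FD_ISSET(fd, &write_fds)) {
      std::printf("[C101] write ready\n"); // floods the log; not a true infinite loop
      // Fix: only FD_SET(fd, &write_fds) while have_pending_data() is true,
      // and drop write interest once the output buffer drains.
    }
  }
}
```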

@antoniovicente (Contributor) commented:

Actually, see if merging master helps. I think the change in #11833 may fix this issue, if the timeout is caused by the bug that could cause delayed close timers to never fire.

stale bot commented Jul 18, 2020:

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

stale bot added the "stale" label (stalebot believes this issue/PR has not been touched recently) Jul 18, 2020
@sunjayBhatia (Member Author) commented:

We will continue diagnosing these today and onwards.

stale bot removed the "stale" label (stalebot believes this issue/PR has not been touched recently) Jul 21, 2020
@sunjayBhatia (Member Author) commented:

/nostalebot

stale bot commented Jul 29, 2020:

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

stale bot added the "stale" label (stalebot believes this issue/PR has not been touched recently) Jul 29, 2020
stale bot removed the "stale" label (stalebot believes this issue/PR has not been touched recently) Jul 29, 2020
@wrowe (Contributor) commented Aug 4, 2020:

Rebased for simplicity's sake since this is still in draft; looking at master again with the previous month of progress. This is a more aggressive flavor of #12343, where we expect to continue seeing failures/flakes.

sunjayBhatia marked this pull request as ready for review August 5, 2020 15:33
@sunjayBhatia (Member Author) commented:

Set to reviewable just to get CI to run.

sunjayBhatia changed the title from "Windows: Diagnosing integration test flakes" to "Draft: Diagnosing Windows integration test flakes" Aug 5, 2020
@sunjayBhatia (Member Author) commented Aug 6, 2020:

We're seeing that the integration test setup is not actually waiting a real 10 seconds, causing integration tests to time out waiting for listeners.

When we applied the following commit, d38afb1, we were able to see tests pass regularly in RBE.

We applied that commit to #12343 and will hopefully see integration tests no longer flake on Windows (the change needs to be refactored before a real commit; this is just a POC).

@mattklein123 (Member) commented:

@sunjayBhatia this is the same problem as #12480. I was just discussing this with @jmarantz. This is a problem for all tests that use SimTime. I was going to do a similar fix. Do you want me to do it, or do you want to clean up the patch?

@wrowe (Contributor) commented Aug 6, 2020:

> I was going to do a similar fix. Do you want me to do it, or do you want to clean up the patch?

Can we simply fix waitFor() / advanceTimeWait() / advanceTimeAsync() to always use the wall clock for waiting, and resolve the sleep-once-then-fall-through bugs? Otherwise it looks like some 30 problematic waitFor() calls to clean up.
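As a sketch of what "always use wall clock" could look like (hypothetical names and signature, not the actual Envoy helpers): compute the deadline from the real steady clock and re-check the condition in a loop, so a single early wakeup can never fall through as a spurious timeout.

```cpp
// Hedged sketch, not the Envoy implementation.
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>

// Wait until `condition` holds or `timeout` of *real* time elapses.
// Looping against a steady_clock deadline avoids the "sleep once and
// fall through" bug: spurious or early wakeups simply re-enter the loop.
bool waitForReal(std::mutex& mu, std::condition_variable& cv,
                 const std::function<bool()>& condition,
                 std::chrono::milliseconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  std::unique_lock<std::mutex> lock(mu);
  while (!condition()) {
    if (cv.wait_until(lock, deadline) == std::cv_status::timeout) {
      return condition(); // genuine real-time timeout: one final check
    }
  }
  return true;
}
```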

@wrowe (Contributor) commented Aug 11, 2020:

@mattklein123 The state of this PR is your submitted work combined with enabling every integration test on Windows. We expect some (a few? many?) to still fail, but with the simulated time corrections a good number should be passing now. We'll refresh again against your additional corrections, or you can cherry-pick this commit back to your patch, modulo any still-broken tests, which can be left with the fails_on_windows tag.

@wrowe (Contributor) commented Aug 11, 2020:

The patch enables 89 integration-related tests. With the progress so far, only 31 or so are now failing. We can tickle this a few times to see whether the failing set changes from run to run.

@wrowe (Contributor) commented Aug 11, 2020:

First pass today of af6e1fd in #12527 against Windows:

//test/extensions/filters/network/mysql_proxy:mysql_integration_test TIMEOUT
//test/integration:eds_integration_test TIMEOUT
//test/integration:http_subset_lb_integration_test TIMEOUT
//test/integration:load_stats_integration_test TIMEOUT
//test/test_common:simulated_time_system_test TIMEOUT
//test/extensions/clusters/redis:redis_cluster_integration_test FAILED
//test/extensions/filters/http/cache:cache_filter_integration_test FAILED
//test/extensions/filters/http/ext_authz:ext_authz_integration_test FAILED
//test/extensions/filters/http/ratelimit:ratelimit_integration_test FAILED
//test/extensions/filters/network/sni_dynamic_forward_proxy:proxy_filter_integration_test FAILED
//test/extensions/filters/network/thrift_proxy:integration_test FAILED
//test/extensions/filters/network/thrift_proxy:translation_integration_test FAILED
//test/extensions/filters/udp/dns_filter:dns_filter_integration_test FAILED
//test/extensions/filters/udp/udp_proxy:udp_proxy_integration_test FAILED
//test/integration:api_listener_integration_test FAILED
//test/integration:api_version_integration_test FAILED
//test/integration:filter_manager_integration_test FAILED
//test/integration:http2_upstream_integration_test FAILED
//test/integration:http_timeout_integration_test FAILED
//test/integration:integration_admin_test FAILED
//test/integration:listener_filter_integration_test FAILED
//test/integration:proxy_proto_integration_test FAILED
//test/integration:socket_interface_integration_test FAILED
//test/integration:tcp_tunneling_integration_test FAILED
//test/integration:uds_integration_test FAILED
//test/integration:vhds_integration_test FAILED
//test/integration:websocket_integration_test FAILED
//test/integration:xds_integration_test FAILED
//test/integration:hds_integration_test FAILED
//test/integration:idle_timeout_integration_test FAILED
//test/integration:http2_integration_test TIMEOUT

@wrowe (Contributor) commented Aug 11, 2020:

Second pass, prior to merging Matt's fix and the master merge, traded one failure:
-//test/integration:http2_integration_test TIMEOUT
+//test/integration:overload_integration_test FAILED

Rekicked with the latest fixes of the day.

@wrowe (Contributor) commented Aug 12, 2020:

Last update of Windows failures following the master merge and other fixes:
-//test/integration:http_timeout_integration_test FAILED
-//test/integration:overload_integration_test FAILED

So, two fewer failures, but it's likely that they remain flaky.

See what percentage are addressed by #12527

Co-authored-by: William A Rowe Jr <wrowe@vmware.com>
Co-authored-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
@wrowe (Contributor) commented Aug 14, 2020:

Final pass, prior to rebasing to master on the final simulated time refactoring: one new timeout, and a number of previous failures may be resolved...

+//test/extensions/stats_sinks/metrics_service:metrics_service_integration_test TIMEOUT
-//test/test_common:simulated_time_system_test TIMEOUT
-//test/extensions/filters/http/cache:cache_filter_integration_test FAILED
-//test/extensions/filters/http/ratelimit:ratelimit_integration_test FAILED
-//test/extensions/filters/network/sni_dynamic_forward_proxy:proxy_filter_integration_test FAILED
-//test/integration:filter_manager_integration_test FAILED
-//test/integration:listener_filter_integration_test FAILED
-//test/integration:tcp_tunneling_integration_test FAILED
-//test/integration:xds_integration_test FAILED

Will take a pass through the list to re-tag remaining/consistent failures as 'fails_on_windows', and any inconsistent failures as still 'flaky_on_windows'.

@jmarantz (Contributor) commented:

Quick FYI: what's merged into master was mostly not sim-time infrastructure; it was integration test infrastructure.

The only material change in sim-time was that SimulatedTimeSystem::waitFor() no longer advances the simulated time: it does real-time blocks without advancing sim-time. This turns out to work much better for the integration tests.
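A toy model of that behavior change, with hypothetical names (not the actual SimulatedTimeSystem code):

```cpp
// Toy model of the waitFor() behavior change; hypothetical names.
#include <chrono>
#include <condition_variable>
#include <mutex>

class ToySimTime {
public:
  // Old behavior: jump simulated time to the deadline, so callers
  // "waiting 10s" return almost immediately in real time -- listeners
  // never get genuine wall-clock time to come up.
  void waitForOld(std::chrono::milliseconds d) {
    std::lock_guard<std::mutex> lock(mu_);
    sim_now_ += d; // advances sim-time; ~0 real seconds elapse
  }

  // New behavior: block for real time; simulated time is untouched.
  void waitForNew(std::chrono::milliseconds d) {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait_for(lock, d, [] { return false; }); // real-time block
  }

private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::chrono::steady_clock::time_point sim_now_{};
};
```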

There is still a significant pending change for sim-time, #12614, but I think its main effect is to make it possible to delete timers that are currently in the process of calling their callbacks. At one point we suspected that might be happening in tests, but it doesn't seem like it anymore, following the cleanup of the integration tests.

#12614 is blocked on some of its own integration test failures, but I think resolution of that may have to wait for @antoniovicente .

sunjayBhatia and others added 8 commits August 14, 2020 14:00
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Will promote retry_on_windows -> flaky_on_windows if they succeed
over some 10 more rounds, and will finally tag remaining retry_on_windows
as consistently fails_on_windows

Co-authored-by: William A Rowe Jr <wrowe@vmware.com>
Co-authored-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>

Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
…-test-flakes

Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
…-test-flakes

Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
@wrowe (Contributor) commented Aug 18, 2020:

FYI @mattklein123, this is a draft PR in the first place. Has anyone tracked down why draft PRs refused to run through the entire CI cycle? That was the only reason we toggled this to an open PR.

It seems we might be better off closing this PR and beginning smaller new ones covering more focused experimental changes, since I'm unaware of any means to untag the very many code owners impacted by this particular experimental change set. We can simply kick off the next smaller-scope PRs once #12695 is merged to master, hopefully with less ownership-related spam.

@sunjayBhatia (Member Author) commented:

Closing in favor of #12695, a fresh start.

sunjayBhatia deleted the windows-integration-test-flakes branch August 18, 2020 14:45