Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTTP health checker: handle GOAWAY from HTTP2 upstreams #13599

Merged
merged 51 commits into from
Jan 12, 2021
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
51 commits
Select commit Hold shift + click to select a range
ce00ea3
HTTP health checker: handle GOAWAY from HTTP2 upstreams
mpuncel Oct 14, 2020
9f83533
fix format of version history to be compliant with check_format.py
mpuncel Oct 15, 2020
736d0ca
fix spelling mistake in comment
mpuncel Oct 15, 2020
f3e9da2
fix incorrect testcomment
mpuncel Oct 16, 2020
b76c46e
Merge branch 'master' into mpuncel/http2-hc-goaway
mpuncel Oct 17, 2020
01fa365
fix alphabetical order in version.rst
mpuncel Oct 19, 2020
cd06c28
Merge branch 'master' into mpuncel/http2-hc-goaway
mpuncel Oct 21, 2020
f899f0a
Add runtime disable option to go back to old GOAWAY behavior
mpuncel Oct 21, 2020
8f21af4
follow runtime naming convention, make new flag default to true, remo…
mpuncel Oct 22, 2020
c23a9a5
document runtime guard in release notes
mpuncel Oct 22, 2020
447d07c
group common mock expectations into helper methods
mpuncel Oct 22, 2020
04458ab
rename received_no_error_goaway_ to reuse_connection_, and consolidat…
mpuncel Oct 22, 2020
0d57547
Merge branch 'master' into mpuncel/http2-hc-goaway
mpuncel Oct 29, 2020
15d4dcd
use runtimeFeatureEnabled() and don't close() twice after reset stream
mpuncel Oct 30, 2020
00503eb
add some integration tests covering GOAWAY
mpuncel Oct 30, 2020
bbd6e08
fix incorrect comments
mpuncel Oct 30, 2020
bb0cc29
Merge branch 'master' into mpuncel/http2-hc-goaway
mpuncel Oct 30, 2020
2b00126
address PR feedback
mpuncel Nov 10, 2020
2352719
attempt, no goaway in fake upstream
mpuncel Nov 11, 2020
ed7a0f1
Merge branch 'master' into HEAD
mpuncel Nov 18, 2020
38cc1bf
various fixes
mpuncel Nov 18, 2020
0bd20e6
make integration test send actual GOAWAY frames
mpuncel Nov 18, 2020
38c4fb8
undo change to workflow
mpuncel Nov 18, 2020
b4496a9
fix github workflow
mpuncel Nov 18, 2020
6dd4400
remove commented line
mpuncel Nov 18, 2020
41a330b
address PR feedback
mpuncel Nov 18, 2020
bbb5b8e
wip amend
mpuncel Dec 1, 2020
8cd1560
Merge branch 'master' into mpuncel/http2-hc-goaway
mpuncel Dec 2, 2020
ec669f2
move protocolErrorForTest to the non-legacy http2 codec
mpuncel Dec 2, 2020
455034a
fix compilation error in health_check_fuzz_test_utils
mpuncel Dec 2, 2020
e75b30a
fix compilation errors in health checker unit test
mpuncel Dec 2, 2020
df8f734
add protocolErrorForTest to the legacy http2 impl
mpuncel Dec 2, 2020
56efbce
undo changes to nghttp2 user callback
mpuncel Dec 8, 2020
2b0d930
Merge branch 'master' into mpuncel/http2-hc-goaway
mpuncel Dec 8, 2020
9e339ee
address PR feedback
mpuncel Dec 9, 2020
0345dba
add unit test for protocolErrorForTest() on the non-legacy codec impl
mpuncel Dec 9, 2020
06652ed
fix protocolErrorForTest() unit test when legacy codec impl is used
mpuncel Dec 10, 2020
322e564
remove some unintended newlines
mpuncel Dec 10, 2020
deb3893
add TODO comment to clean up fake upstream when legacy h2 codec is re…
mpuncel Dec 10, 2020
3ff4d9f
Merge branch 'master' into mpuncel/http2-hc-goaway
mpuncel Dec 10, 2020
bcb7346
add goaway_error stat too health checker, wait on it in integration t…
mpuncel Dec 11, 2020
f74ed68
fix wording of goaway_error doc
mpuncel Dec 11, 2020
1c287c1
commit stat increment line
mpuncel Dec 11, 2020
49e689d
remove unused lambda capture variable
mpuncel Dec 11, 2020
a7cfa59
PR feedback: remove new stat, use real sleep in integration test
mpuncel Dec 16, 2020
d15bbf3
Merge branch 'master' into mpuncel/http2-hc-goaway
mpuncel Dec 16, 2020
31c0641
drop changes accommodating legacy h2 codec, it was removed
mpuncel Dec 16, 2020
5e8b981
Merge branch 'master' into mpuncel/http2-hc-goaway
mpuncel Dec 22, 2020
6a3ad8c
switch to real time
mpuncel Dec 22, 2020
eacd1f4
Merge branch 'master' into mpuncel/http2-hc-goaway
mpuncel Jan 8, 2021
64cc217
Merge branch 'master' into mpuncel/http2-hc-goaway
mpuncel Jan 12, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -119,7 +119,6 @@ If health check is configured, the cluster has an additional statistics tree roo
network_failure, Counter, Number of health check failures due to network error
verify_cluster, Counter, Number of health checks that attempted cluster name verification
healthy, Gauge, Number of healthy members
goaway_error, Counter, Number of health check failures due to a HTTP/2 GOAWAY frame with a code other than NO_ERROR from the upstream (subset of network_failure)

.. _config_cluster_manager_cluster_stats_outlier_detection:

Expand Down
1 change: 0 additions & 1 deletion source/common/upstream/health_checker_base_impl.h
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,6 @@ namespace Upstream {
#define ALL_HEALTH_CHECKER_STATS(COUNTER, GAUGE) \
COUNTER(attempt) \
COUNTER(failure) \
COUNTER(goaway_error) \
COUNTER(network_failure) \
COUNTER(passive_failure) \
COUNTER(success) \
Expand Down
1 change: 0 additions & 1 deletion source/common/upstream/health_checker_impl.cc
Original file line number Diff line number Diff line change
Expand Up @@ -323,7 +323,6 @@ void HttpHealthCheckerImpl::HttpActiveHealthCheckSession::onGoAway(
handleFailure(envoy::data::core::v3::NETWORK);
}

parent_.stats_.goaway_error_.inc();
if (client_) {
expect_reset_ = true;
mpuncel marked this conversation as resolved.
Show resolved Hide resolved
client_->close();
Expand Down
15 changes: 13 additions & 2 deletions test/integration/health_check_integration_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -319,9 +319,20 @@ TEST_P(HttpHealthCheckIntegrationTest, SingleEndpointGoAwayError) {
test_server_->waitForCounterGe("cluster.cluster_1.health_check.failure", 1);
EXPECT_EQ(0, test_server_->counter("cluster.cluster_1.health_check.success")->value());
EXPECT_EQ(1, test_server_->counter("cluster.cluster_1.health_check.failure")->value());
test_server_->waitForCounterGe("cluster.cluster_1.health_check.goaway_error", 1);

// Advance time to cause another health check.
// Now we want to advance time to cause another health check. Ideally the
// waitForDisconnect() step above would guarantee that the health checker has
// seen the GOAWAY, closed the health check connection, and finally enabled
// the health check interval timer so we can advance simulated time to cause
// another check. However, due to special handling for protocol errors, there
// is a race where both the downstream (health checker) and upstream (fake
// upstream) close the connection, so we may exit waitForDisconnect() before
// the health checker has actually seen the GOAWAY and enabled the timer. To
// address this, we use a small real sleep before advancing thee simulated
// time system to give the health checker time to catch up. If we advance time
// before the health check timer is enabled, the next health check will never
// happen.
timeSystem().realSleepDoNotUseWithoutScrutiny(std::chrono::milliseconds(100));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to keep going back and forth on this, but maybe we should talk on Slack. It's not OK to use a real sleep here since it will flake. Please either stop using simulated time entirely, or just do what we talked about before and wait on the stat. It's not clear to me that simulated time is the best use for this case? Since you are just waiting for an HC failure to occur and then another reconnect?

/wait

timeSystem().advanceTimeWait(std::chrono::milliseconds(500));

ASSERT_TRUE(clusters_[cluster_idx].host_upstream_->waitForHttpConnection(
Expand Down