-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
quic: add QUIC downstream connection close error stats. #16584
Conversation
Signed-off-by: Renjie Tang <renjietang@google.com>
/assign @alyssawilk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm.
At first pass my thought was that all our existing http3 stats are rooted in the HCM, and it'd be nice to be consistent here.
I'm not sure if that's the right thing here, because of the L4/L7 thing, but given that for TCP we don't have the stats in the connection, but we have the tcp network filter or the HCM do the accounting, I think a utility we can reuse when we add other protocols plus wiring it up in the HCM might be the right thing. Tagging @jmarantz for a second opinion as stats are more his area.
If we go that route, I think we could stash the close details in connection level stream info as one of those generic extension points, and pull it out in the HCM if it's present. We've already got a few h3 stats in the HCM which we don't bother compiling out, so as long as we added lazily I think it wouldn't be a problem.
I saw that HttpConnectionManagerConfig owns http3_codec_stats_ and passes it down to QuicHttpServerConnectionImpl. |
clang-tidy error needs attention:
|
Signed-off-by: Renjie Tang <renjietang@google.com>
Signed-off-by: Renjie Tang <renjietang@google.com>
done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This generally looks great! I just have a few comments about making the stats contexts parallel those for HTTP and GRPC for increased consistency and reduced number of args at various interface-points.
source/common/quic/quic_stats.cc
Outdated
connection_close}); | ||
} | ||
|
||
Stats::StatName QuicStats::ConnectionCloseStatName(quic::QuicErrorCode error_code) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
connectionCloseStatName per Envoy style guide
source/server/server.cc
Outdated
@@ -105,6 +105,9 @@ InstanceImpl::InstanceImpl( | |||
|
|||
restarter_.initialize(*dispatcher_, *this); | |||
drain_manager_ = component_factory.createDrainManager(*this); | |||
#ifdef ENVOY_ENABLE_QUIC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there also a runtime flag that should be checked? Or it's ifdef-only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ifdef alone should suffice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a configuration option for Quic? Or you have to make the decision about whether to support it at compile-time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The QUIC support is decided at compile time and is controlled by --//bazel:http3=True/False
source/server/listener_impl.h
Outdated
@@ -424,6 +426,8 @@ class ListenerImpl final : public Network::ListenerConfig, | |||
// callback during the destroy of ListenerImpl. | |||
Init::WatcherImpl local_init_watcher_; | |||
|
|||
Quic::QuicStats* quic_stats_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Envoy style here is to use OptRef<Quic::QuicStats>
here rather than a possibly null pointer.
See include/envoy/common/optref.h
|
||
ActiveQuicListener::ActiveQuicListener( | ||
uint32_t worker_index, uint32_t concurrency, Event::Dispatcher& dispatcher, | ||
Network::UdpConnectionHandler& parent, Network::SocketSharedPtr listen_socket, | ||
Network::ListenerConfig& listener_config, const quic::QuicConfig& quic_config, | ||
Network::Socket::OptionsSharedPtr options, bool kernel_worker_routing, | ||
const envoy::config::core::v3::RuntimeFeatureFlag& enabled) | ||
const envoy::config::core::v3::RuntimeFeatureFlag& enabled, QuicStats* quic_stats) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest bundling QuicStats and quic::QuicConfig into a class Quic::Context, which can parallel Http::Context in most ways.
That will cut down by one on the arg-count in a few interfaces in this PR, and give you other good places to put stuff that may need to be shared across workers.
Moreover, Quic::Context
(or maybe OptRef<Quic::Context>
) could then be returned from Server::FactoryContext
, also parallel to Http::Context and Grpc::Context, cutting down on how much plumbing through needs to happen for those individually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally tried to make QuicStats behave similarly to Http::Context but found it to be quite complicated.
To have the server return QuicStats, we need a pure interface class so that Server::Instance can have a getter interface Quic::QuicStats quicStats().
However, the only public method in QuicStats, chargeQuicConnectionCloseStats(), is dependent on QuicErrorCode, which is defined in third_party library Quiche.
In my understanding, Envoy must to be able to build without quiche. And we probably shouldn't introduce external dependency in the include/ directory either.
Having this dependency makes it hard to follow existing pattern when implementing Quic stats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On this point, to match the pattern in the other Contexts is to forward-declare QuicStats without defining an interface for it.
So you'd have
namespace quic { class QuicConfig; }
namespace Envoy {
namespace Quic {
class QuicStats;
class Context {
public:
virtual QuicStats& stats() PURE:
virtual quic::QuicConfig& config() PURE;
};
...
So you don't need to expose the error-code type or other types at the interface.
@@ -17,13 +17,13 @@ EnvoyQuicDispatcher::EnvoyQuicDispatcher( | |||
uint8_t expected_server_connection_id_length, Network::ConnectionHandler& connection_handler, | |||
Network::ListenerConfig& listener_config, Server::ListenerStats& listener_stats, | |||
Server::PerHandlerListenerStats& per_worker_stats, Event::Dispatcher& dispatcher, | |||
Network::Socket& listen_socket) | |||
Network::Socket& listen_socket, QuicStats* quic_stats) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per above, pass through QuicContext rather than QuicStats and QuicConfig separately.
@@ -46,6 +46,10 @@ void EnvoyQuicDispatcher::OnConnectionClosed(quic::QuicConnectionId connection_i | |||
listener_stats_.downstream_cx_active_.dec(); | |||
per_worker_stats_.downstream_cx_active_.dec(); | |||
connection_handler_.decNumConnections(); | |||
if (quic_stats_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should just make QuicStats unconditional at this level. I don't see the problem with instantiating QuicStats in the places below where you are currently passing nullptr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review Joshua! Glad I got most of the QuicStats class right.
I left some in-place comments on the ownership and plumbing-through of QuicStats.
In general, I think it might not be the best choice to have the Server::InstanceImpl own QuicStats and pass it down all the way.
Alyssa mentioned in comments that a lot of existing QUIC stats live in HttpConnectionManagerConfig. I tried placing QuicStats there and found things to be a lot cleaner. The demo PR can be found at #16625. Let me know what you think on that.
|
||
ActiveQuicListener::ActiveQuicListener( | ||
uint32_t worker_index, uint32_t concurrency, Event::Dispatcher& dispatcher, | ||
Network::UdpConnectionHandler& parent, Network::SocketSharedPtr listen_socket, | ||
Network::ListenerConfig& listener_config, const quic::QuicConfig& quic_config, | ||
Network::Socket::OptionsSharedPtr options, bool kernel_worker_routing, | ||
const envoy::config::core::v3::RuntimeFeatureFlag& enabled) | ||
const envoy::config::core::v3::RuntimeFeatureFlag& enabled, QuicStats* quic_stats) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally tried to make QuicStats behave similarly to Http::Context but found it to be quite complicated.
To have the server return QuicStats, we need a pure interface class so that Server::Instance can have a getter interface Quic::QuicStats quicStats().
However, the only public method in QuicStats, chargeQuicConnectionCloseStats(), is dependent on QuicErrorCode, which is defined in third_party library Quiche.
In my understanding, Envoy must to be able to build without quiche. And we probably shouldn't introduce external dependency in the include/ directory either.
Having this dependency makes it hard to follow existing pattern when implementing Quic stats.
source/server/server.cc
Outdated
@@ -105,6 +105,9 @@ InstanceImpl::InstanceImpl( | |||
|
|||
restarter_.initialize(*dispatcher_, *this); | |||
drain_manager_ = component_factory.createDrainManager(*this); | |||
#ifdef ENVOY_ENABLE_QUIC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ifdef alone should suffice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll check out the other PR a bit later.
|
||
ActiveQuicListener::ActiveQuicListener( | ||
uint32_t worker_index, uint32_t concurrency, Event::Dispatcher& dispatcher, | ||
Network::UdpConnectionHandler& parent, Network::SocketSharedPtr listen_socket, | ||
Network::ListenerConfig& listener_config, const quic::QuicConfig& quic_config, | ||
Network::Socket::OptionsSharedPtr options, bool kernel_worker_routing, | ||
const envoy::config::core::v3::RuntimeFeatureFlag& enabled) | ||
const envoy::config::core::v3::RuntimeFeatureFlag& enabled, QuicStats* quic_stats) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On this point, to match the pattern in the other Contexts is to forward-declare QuicStats without defining an interface for it.
So you'd have
namespace quic { class QuicConfig; }
namespace Envoy {
namespace Quic {
class QuicStats;
class Context {
public:
virtual QuicStats& stats() PURE:
virtual quic::QuicConfig& config() PURE;
};
...
So you don't need to expose the error-code type or other types at the interface.
Can you explain the reasoning? |
I was mainly concerned that having QuicContext accessible in Server::Instance will require getter implementation in subclasses that don't really need it, for example Server::ValidationInstance. If having QuicStats in HCM works just as well, I'd prefer to go that route so that the code is cleaner. |
ListenerManagerImpl. Signed-off-by: Renjie Tang <renjietang@google.com>
Signed-off-by: Renjie Tang <renjietang@google.com>
Signed-off-by: Renjie Tang <renjietang@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is looking much better. I think we can just get rid of the one OptRef.
@@ -435,6 +435,10 @@ bool ListenerManagerImpl::addOrUpdateListenerInternal( | |||
workers_started_, hash, server_.options().concurrency()); | |||
} | |||
|
|||
#ifdef ENVOY_ENABLE_QUIC | |||
new_listener->setQuicStatNames(quic_stat_names_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above I think ListenerImpl can just grab the QuicStatNames&
object from the manager.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or...if that involves changes to the base class you'd prefer not to do, you can just pass in the stats to the constructor rather than setting it after construction.
That just means you need to ifdef the arg-list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fortunately ListenerImpl takes ListenerManagerImpl in its constructors. So I added getter in ListenerManagerImpl and let ListenerImpl grab it during constructions.
Signed-off-by: Renjie Tang <renjietang@google.com>
/retest |
Retrying Azure Pipelines: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@envoyproxy/senior-maintainers
thanks. for pushing this all the way through. nice job!
/assign-from @envoyproxy/senior-maintainers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome - these are going to be super useful!
Signed-off-by: Renjie Tang <renjietang@google.com>
Signed-off-by: Renjie Tang <renjietang@google.com>
Signed-off-by: Renjie Tang <renjietang@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One naming nit, and I think this is good to go from my point of view.
Adding Matt for a non-google pass since stats are forever :-P
:header: Name, Type, Description | ||
:widths: 1, 1, 2 | ||
|
||
[direction].[source].quic_connection_close_error_code_[error_code], Counter, A collection of counters that are lazily initialized to record each quic connection close error code that's present. direction could be *upstream* or *downstream*. source could be *self* or *peer*. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're doccing up listener stats, won't direction always be downstream?
And actually I hate to catch this now, but probably rather than self/peer it should be tx/rx for consistency with other Envoy stats. I find self and peer more intuitive (maybe 'cause that's what I'm used to) but I think most of the other stats (like resets tx_reset/rx_reset) are documented that way with tx for things sent to the peer and rx being codes received from the peer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. The listener stats should be all downstreams.
I haven't figured out the details on upstream stats. So I will update the doc again when I add them.
Done changing names to tx/rx.
Signed-off-by: Renjie Tang <renjietang@google.com>
Signed-off-by: Renjie Tang <renjietang@google.com>
Check CI? /wait |
Signed-off-by: Renjie Tang <renjietang@google.com>
@mattklein123 The flaky tests are now fixed. PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks LGTM with some small comments.
/wait
// TODO(renjietang): Currently these stats are only available in downstream. Wire it up to upstream | ||
// QUIC also. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a quick read through it's not clear to me that this won't create an unbounded number of stats controlled by downstream (for exampled if the name somehow has the code number in it and the client controls the codes). This is probably not the case, but can you add more comments about why this is safe for use with uncontrolled peers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point!
I initially had an assertion in QuicStatNames::connectionCloseStatName() to make sure we don't create unlimited amount of stats.
But after some investigation, I found that if the connection close is initiated from peers, the error_code is parsed from the wire and no enum range checking is done. So an assertion here is too strong and malicious clients might be able to attack Envoy by sending out-of-range error codes.
Instead, I now added a check in QuicStatNames::chargeQuicConnectionCloseStats() to ignore out-of-range error_codes and log a warning.
I will follow up to dig deeper into the quiche code and see if bad error code handling can be done earlier.
Signed-off-by: Renjie Tang <renjietang@google.com>
@@ -28,15 +28,18 @@ void QuicStatNames::chargeQuicConnectionCloseStats(Stats::Scope& scope, | |||
bool is_upstream) { | |||
ASSERT(&symbol_table_ == &scope.symbolTable()); | |||
|
|||
if (error_code >= quic::QUIC_LAST_ERROR) { | |||
ENVOY_LOG(warn, fmt::format("Error code {} is out of range of QuicErrorCodes.", error_code)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test case for this? Also, I think warning here can log spam. I would recommend making this debug level and potentially also adding a stat for unknown error code or something like that. Thank you.
/wait
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
I utilized quic::QUIC_LAST_ERROR_CODE as a stat for all out of range codes.
Signed-off-by: Renjie Tang <renjietang@google.com>
Signed-off-by: Renjie Tang <renjietang@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
…6584) Signed-off-by: Renjie Tang <renjietang@google.com>
Signed-off-by: Renjie Tang renjietang@google.com
Commit Message: Introduce QuicStatNames and add Quic connection close stats for downstream connections.
Risk Level: Low
Testing: unit tests
Docs Changes: n/a
Release Notes: n/a.
Platform Specific Features: n/a.