98331: ui: remove polling from fingerprints pages, allow new stats requests while one is in flight r=maryliag,j82w a=xinhaoz

See individual commits. The changes here serve as a base for the performance improvements we'll be making. Since they're not directly related to those changes (e.g. adding limit/sort and splitting the API calls), I've put them in their own PR here to make reviewing easier.

Loom verifying the behaviour and showing new requests being dispatched while others are pending. Pages shown:
- stmt fingerprints
- txn fingerprints
- txn fingerprint details
- stmt fingerprint details

https://www.loom.com/share/3448117bfecf404c8d698f4ad1240e8c
CC: https://www.loom.com/share/bb30b51ebe5144528ab0c6fabdbfb2f1

98428: flowinfra: fix a couple of rare bugs with flow cleanup r=yuzefovich a=yuzefovich

**sql: minor cleanup around flow's wait group**

This commit makes minor adjustments so that `WaitGroup.Done` is called (visibly) closer to where `Add` is. This helps when checking whether the wait group is cleaned up properly.

Release note: None

**flowinfra: fix a couple of rare bugs with flow cleanup**

This commit fixes a couple of rare bugs around flow cleanup that were encountered when running the `tpch_concurrency` roachtest at a concurrency of 1000 or so (i.e. for the bugs to reproduce we needed extremely overloaded nodes).

The setup for the bugs is as follows:
- we have some number of inbound streams on the node. The inbound streams in this context are the server side of the `FlowStream` RPCs (i.e.
the inbox side)
- each inbound stream needs a separate goroutine (created by the gRPC framework), and these goroutines are tracked against the wait group of the flow
- the main goroutine of the flow registers the flow with the `FlowRegistry` and gives the inbound streams a timeout (10 seconds by default) to arrive
- we have synchronization in place to block the inbound streams until the flow is registered
- once an inbound stream finds its flow in the registry, it sends a handshake RPC to the producer (i.e. to the outbox).

The first bug is about not decrementing the wait group of the flow in some cases. In particular, if the inbound stream timeout occurs _while_ the inbound stream is performing the "handshake" RPC _and_ that RPC results in an error, then the wait group wouldn't be decremented. This was the case because:
1. the "timeout" goroutine would observe that the stream is "connected", so it would treat the timeout cancellation as a no-op;
2. the inbound stream goroutine would get an error on the handshake, so it would bubble the error up but would never call `InboundStreamInfo.onFinish`.

This bug is now fixed by only marking the stream as "connected" once the handshake succeeds. The bug was introduced about a year ago in 62ea0c6 when we refactored the locking around the flow registry.

The second bug is related. Previously, if `ConnectInboundStream` resulted in an error, this could leave one of the inbox goroutines blocked forever in `Inbox.Init`. In particular, the inbox waits until one of the following occurs:
- the stream arrives and successfully performs the handshake
- the inbound stream timeout occurs
- the flow context is canceled
- the inbox context is canceled.

In the scenario from the first bug, the first point wasn't true because the handshake resulted in an error.
The second point wasn't true because the stream was marked as "connected", so it was skipped when `flowEntry.streamTimer` fired (meaning that `InboundStreamHandler.Timeout` wasn't called). The contexts weren't explicitly canceled either (the incorrect assumption was that the flow context would be canceled by the outboxes on the node, but we might have plans where the gateway flow doesn't have any outboxes). This second bug is now fixed by explicitly canceling the flow whenever an inbound stream connection results in an error.

An additional nice-to-have improvement was made to the outboxes: they now cancel the flow context on their nodes whenever the `FlowStream` RPC fails - the query is doomed, so we might as well cancel the flow sooner.

I decided not to include a release note here since I believe the prerequisite for these bugs is that the inbound streams time out exactly while performing the handshake RPC. The 10-second default value for the timeout makes this extremely unlikely to happen, and the only way we ran into it was by severely overloading the nodes (and the user would have bigger problems at that point than these deadlocks).

Fixes: #94113.

Release note: None

**roachtest: increase bounds for tpch_concurrency**

This commit adjusts the bounds for the `tpch_concurrency` roachtest. Given that we now set GOMEMLIMIT by default, we can sustain much higher concurrency without falling over. I ran the test manually with the `[128, 1024)` range and got 972.938 on average (I had 20 runs, of which 4 timed out). To prevent the timeouts, this commit uses a small interval (less than 64 in length) so that we make 6 iterations in the binary search. As a result, the new search range is `[970, 1030)`.

Additionally, it was observed that Q15 (which performs two schema changes) can take a non-trivial amount of time, so it is skipped in this test. This commit also makes a minor improvement to fail the test if all iterations resulted in a node crash.
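As a side note on the arithmetic, the 6-iteration bound follows from the range width alone: a binary search over an interval narrower than 64 converges within ceil(log2(width)) probes, and ceil(log2(60)) = 6. A minimal Go sketch (hypothetical code, not the actual roachtest harness; `passes` stands in for running the TPCH workload at a given concurrency):

```go
package main

import "fmt"

// findMaxConcurrency sketches the search the test performs: probe the
// midpoint of [lo, hi) and shrink the range toward the highest
// concurrency that still passes.
func findMaxConcurrency(lo, hi int, passes func(int) bool) (best, iterations int) {
	for lo < hi {
		iterations++
		mid := (lo + hi) / 2
		if passes(mid) {
			best = mid
			lo = mid + 1
		} else {
			hi = mid
		}
	}
	return best, iterations
}

func main() {
	// With the new [970, 1030) range (width 60 < 64), the search
	// terminates within 6 iterations regardless of where the true
	// maximum lies. Here we pretend concurrency up to 1000 passes.
	best, iters := findMaxConcurrency(970, 1030, func(c int) bool { return c <= 1000 })
	fmt.Println(best, iters) // prints "1000 6"
}
```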
Release note: None

Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>