kvserver: fix incorrect handling of ExportRequests that are pre-empted by the CPU limiter #97886
Labels
A-disaster-recovery
branch-release-23.1
Used to mark GA and release blockers, technical advisories, and bugs for 23.1
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
release-blocker
Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
T-disaster-recovery
In #96691 we enabled the elastic CPU limiter for all users of ExportRequests. Since then we have thought of two bugs that were an oversight:
As surfaced by roachtest: c2c/tpcc/warehouses=500/duration=10/cutover=5 failed - source fingerprint scan fails #97693, if the client does not set a TargetBytes on the ExportRequest then DistSender may split the request if it targets >1 range. These requests are then sent in parallel and the responses are combined before returning to the client. The logic that combines the response headers does not expect both the LHS and RHS to have a resume span. With elastic CPU limiting enabled this is very much possible and results in the ExportRequest erroring out https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvpb/api.go#L323-L329.
In a mixed version cluster ExportRequest clients running on older nodes do not know to handle resume spans. Thus, if a span gets pre-empted by the elastic CPU limiter then the client will not resend the remaining span to be exported.
To address 2) we plan to add a bool to the
AdmissionHeader
that indicates whether or not the client is capable of handling paginated ExportRequests. This field will default to false if the client is on older nodes. If the field is true, we will return from kvserver to the client on exceeding CPULimit. If false, we will fall back to retrying the ExportRequest on the server side, as was the case before #96691.To address 1), we will teach DistSender to not parallelize ExportRequests that are sent with the newly added
AdmissionHeader
bool set to true. This is a short-term fix but we expect the loss of parallelization to only affectcrdb_internal.fingerprint
since the other call sites are all on the descriptors table which is usually on a single range. A longer-term fix will be to teach DistSender to combine the responses of parallel requests into a slice of resume spans and then have the client handle multiple, non-overlapping resume spans.cc: @stevendanna
Jira issue: CRDB-24946
Epic CRDB-21944
The text was updated successfully, but these errors were encountered: