-
Notifications
You must be signed in to change notification settings - Fork 467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adapter: introspection subscribes (coordinator-managed) #27795
adapter: introspection subscribes (coordinator-managed) #27795
Conversation
1c4b77e
to
07ab2a0
Compare
MitigationsCompleting required mitigations increases Resilience Coverage.
Risk Summary:The risk associated with this pull request is high, with a score of 83. It is important to note that historically, pull requests with similar characteristics to the predictors used here—namely, average line count in files and executable lines within files—are 132% more likely to introduce bugs compared to the repository's baseline. Additionally, while the repository's observed bug trend is decreasing, the predicted bug trend is on the rise. There are also 8 modified files in this pull request that have recently had a high number of bug fixes, indicating potential risk areas. Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity. Bug Hotspots:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general this LGTM, but we need to do more thinking about how to handle session-less execution of sequence_staged. To me, doing that in a considered way should be a prerequisite for this PR because we haven't thought through the edge cases of that, and I'm concerned about unexpected errors popping up due to using a dummy session and the coordinator panicing (as you discovered and commented about in this PR).
One initial brainstorm idea: sequence_staged
gets its first argument changed to like
enum StagedContext {
Execute(ExecuteContext),
System(ClientTransmitter<ExecuteResponse>),
}
then the stage()
fns of each impl can match on that and complain if they require an Execute
variant (i.e., require a Session), or merely a ClientTransmitter
which both variants have. Probably:
impl StagedContext {
fn session(&self) -> Option<&Session> {
match self {
StagedContext::Execute(ctx) => Some(ctx.session()),
StagedContext::System(_) => None,
}
}
fn tx(&self) -> &ClientTransmitter<ExecuteResponse> {
match self {
StagedContext::Execute(ctx) => ctx.tx(),
StagedContext::System(tx) => tx,
}
}
}
with _mut
variations too? Will almost certainly need other stuff, but it's one idea that would make this much more typesafe and avoid any assumption about a session existing.
replica_id, | ||
spec, | ||
}; | ||
self.introspection_subscribes.insert(id, subscribe); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should assert it returns None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, we just minted the GlobalId
at the beginning of the method, so this assert would be equivalent to one that checks that allocate_transient_id
produces unique IDs. I'm not against adding it if you think it's valuable, but it might be more confusing than helpful.
|
||
fn drop_introspection_subscribe(&mut self, id: GlobalId) { | ||
let Some(subscribe) = self.introspection_subscribes.remove(&id) else { | ||
soft_panic_or_log!("attempt to remove unknown introspection subscribe (id={id})"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it safe to log in production instead of panic here? Panicing is bad, but so is continuing to run with a corrupted internal state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have become very careful of panicking in envd. In the past it happened a couple times that we got into a crashloop due to some bug in the code, and then could resolve this only be pushing a patch that changed the panicking assert to a soft one. Even if we don't crashloop, a panic now and then might make the environment unstable, causing a maybe harmless bug in the code to be elevated to a high-urgency one. There are of course still asserts that need to be hard because they protect correctness.
In this case I don't see how correctness could be impacted. Most likely we get here by trying to drop an introspection subscribe twice, in which case ignoring the second drop is probably what we want. Less likely, we have somehow mixed up subscribe IDs and forget to drop the subscribe we really wanted to drop, in which case some coordinator memory will be wasted and some entries will not be cleaned up from an introspection relation until the next restart. This seems preferable to panicking for me.
07ab2a0
to
b22445e
Compare
b22445e
to
f01cf37
Compare
This commit introduces the overall infrastructure for introspection subscribes, implemented by a new `coord::introspection` module that extends the coordinator with methods to install introspection subscribes and process updates for them. This commit also introduces a first introspection subscribe that surfaces dataflow error counts, as the first test case for this infrastructure.
f01cf37
to
8e73854
Compare
8e73854
to
359c84a
Compare
359c84a
to
7359ff4
Compare
TFTRs! |
This PR introduces the overall infrastructure for introspection subscribes, implemented by a new
coord::introspection
module that extends the coordinator with methods to install introspection subscribes and process updates for them.The results of introspection subscribes are still discarded here. Follow-up PRs will add the final pieces that writes them to storage.
The relevant design doc is #27548, except that the design proposes to let the controller install the subscribes, to cut down on implementation effort. An implementation of the controller-managed approach is provided in #27709. This PR is an attempt to implement the coordinator-managed approach after all, which has two major benefits over the controller-managed one:
Motivation
Part of https://github.com/MaterializeInc/database-issues/issues/7898
Tips for reviewer
This seems to work, but I don't understand the coordinator code well enough to say whether it's a reasonable implementation.
I initially tried to re-use the existing
SUBSCRIBE
sequencing code, but couldn't get it to work. This code implicitly assumes the existence of a client connection and a transaction, and I wasn't able to work around that. So instead I introduced dedicated sequencing stages for the introspection subscribes, which take inspiration from theSUBSCRIBE
ones but are kept simpler.Checklist
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way), then it is tagged with aT-proto
label.