adapter: introspection subscribes (coordinator-managed) #27795

teskje · 2024-06-21T14:52:09Z

This PR introduces the overall infrastructure for introspection subscribes, implemented by a new coord::introspection module that extends the coordinator with methods to install introspection subscribes and process updates for them.

The results of introspection subscribes are still discarded here. Follow-up PRs will add the final pieces that write them to storage.

The relevant design doc is #27548, except that the design proposes to let the controller install the subscribes, to cut down on implementation effort. An implementation of the controller-managed approach is provided in #27709. This PR is an attempt to implement the coordinator-managed approach after all, which has two major benefits over the controller-managed one:

- Introspection subscribe queries can be written in SQL rather than in LIR, making the addition of new subscribes significantly easier.
- Introspection subscribes can read from all builtin sources, not just per-replica introspection sources.

Motivation

This PR adds a known-desirable feature.

Part of https://github.com/MaterializeInc/database-issues/issues/7898

Tips for reviewer

This seems to work, but I don't understand the coordinator code well enough to say whether it's a reasonable implementation.

I initially tried to re-use the existing SUBSCRIBE sequencing code, but couldn't get it to work. This code implicitly assumes the existence of a client connection and a transaction, and I wasn't able to work around that. So instead I introduced dedicated sequencing stages for the introspection subscribes, which take inspiration from the SUBSCRIBE ones but are kept simpler.

Checklist

This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
This PR includes the following user-facing behavior changes:
- N/A

src/adapter/src/coord/introspection.rs

shepherdlybot · 2024-06-24T08:07:40Z

Mitigations

Completing required mitigations increases Resilience Coverage.

(Required) Code Review 🔍 Detected
(Required) Feature Flag
(Required) Integration Test 🔍 Detected
(Required) Observability 🔍 Detected
(Required) QA Review 🔍 Detected
(Required) Run Nightly Tests
Unit Test

Risk Summary:

The risk associated with this pull request is high, with a score of 83. It is important to note that historically, pull requests with similar characteristics to the predictors used here—namely, average line count in files and executable lines within files—are 132% more likely to introduce bugs compared to the repository's baseline. Additionally, while the repository's observed bug trend is decreasing, the predicted bug trend is on the rise. There are also 8 modified files in this pull request that have recently had a high number of bug fixes, indicating potential risk areas.

Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.

Bug Hotspots:
File	Percentile
../controller/instance.rs	99
../inner/create_index.rs	93
../inner/peek.rs	97
../src/coord.rs	98
../sequencer/inner.rs	100
../src/lib.rs	98
../inner/create_view.rs	94
../inner/create_materialized_view.rs	95

maddyblue

In general this LGTM, but we need to do more thinking about how to handle session-less execution of sequence_staged. To me, doing that in a considered way should be a prerequisite for this PR because we haven't thought through the edge cases of that, and I'm concerned about unexpected errors popping up due to using a dummy session and the coordinator panicking (as you discovered and commented about in this PR).

One initial brainstorm idea: sequence_staged gets its first argument changed to something like

enum StagedContext {
    Execute(ExecuteContext),
    System(ClientTransmitter<ExecuteResponse>),
}

then the stage() fns of each impl can match on that and complain if they require the Execute variant (i.e., require a Session), or proceed if they merely need a ClientTransmitter, which both variants have. Probably:

impl StagedContext {
    fn session(&self) -> Option<&Session> {
        match self {
            StagedContext::Execute(ctx) => Some(ctx.session()),
            StagedContext::System(_) => None,
        }
    }

    fn tx(&self) -> &ClientTransmitter<ExecuteResponse> {
        match self {
            StagedContext::Execute(ctx) => ctx.tx(),
            StagedContext::System(tx) => tx,
        }
    }
}

with _mut variations too? Will almost certainly need other stuff, but it's one idea that would make this much more typesafe and avoid any assumption about a session existing.
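To make the idea concrete, here is a minimal, self-contained sketch of how a stage could use such a context. Everything in it is an assumption for illustration: `Session`, `ExecuteContext`, and `ClientTransmitter` are simplified stand-ins (the transmitter is not generic here), and `stage_requiring_session` and `stage_session_less` are hypothetical names, not functions from the adapter crate.

```rust
// Stand-in types for illustration only; the real ones live in the adapter crate.
struct Session {
    user: String,
}
struct ClientTransmitter;
struct ExecuteContext {
    session: Session,
    tx: ClientTransmitter,
}

impl ExecuteContext {
    fn session(&self) -> &Session {
        &self.session
    }
    fn tx(&self) -> &ClientTransmitter {
        &self.tx
    }
}

enum StagedContext {
    Execute(ExecuteContext),
    System(ClientTransmitter),
}

impl StagedContext {
    // Only the Execute variant carries a session.
    fn session(&self) -> Option<&Session> {
        match self {
            StagedContext::Execute(ctx) => Some(ctx.session()),
            StagedContext::System(_) => None,
        }
    }

    // Both variants can hand out a transmitter.
    fn tx(&self) -> &ClientTransmitter {
        match self {
            StagedContext::Execute(ctx) => ctx.tx(),
            StagedContext::System(tx) => tx,
        }
    }
}

// Hypothetical stage that needs a session: it returns an error instead of
// relying on a dummy session when run from a System context.
fn stage_requiring_session(ctx: &StagedContext) -> Result<String, String> {
    let session = ctx
        .session()
        .ok_or_else(|| "stage requires a session".to_string())?;
    Ok(format!("running as {}", session.user))
}

// Hypothetical stage that only needs the transmitter, so it works for
// both client-driven and system-driven execution.
fn stage_session_less(ctx: &StagedContext) -> Result<&'static str, String> {
    let _tx = ctx.tx();
    Ok("installed")
}
```

With this shape, a stage that needs a session fails explicitly under `StagedContext::System`, and the compiler forces every stage to handle the `Option` rather than assume a session exists.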

src/adapter/src/coord/message_handler.rs

maddyblue · 2024-06-24T18:38:33Z

src/adapter/src/coord/introspection.rs

+            replica_id,
+            spec,
+        };
+        self.introspection_subscribes.insert(id, subscribe);


This should assert it returns None.

teskje · 2024-06-25

Hm, we just minted the GlobalId at the beginning of the method, so this assert would be equivalent to one that checks that allocate_transient_id produces unique IDs. I'm not against adding it if you think it's valuable, but it might be more confusing than helpful.
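For reference, the assert under discussion would look roughly like the sketch below. It is a simplified stand-in, not the code from this PR: `Coordinator` here holds only a counter and a map, `GlobalId` is a plain `u64`, the subscribe is represented by a `String`, and `install_introspection_subscribe` is a hypothetical method name.

```rust
use std::collections::BTreeMap;

// Stand-in for the real GlobalId type.
type GlobalId = u64;

// Simplified stand-in for the coordinator state relevant to this thread.
#[derive(Default)]
struct Coordinator {
    next_transient_id: u64,
    introspection_subscribes: BTreeMap<GlobalId, String>,
}

impl Coordinator {
    // Mints a fresh ID; each call returns a value not returned before.
    fn allocate_transient_id(&mut self) -> GlobalId {
        let id = self.next_transient_id;
        self.next_transient_id += 1;
        id
    }

    // Hypothetical install method: mint an ID, then register the subscribe.
    fn install_introspection_subscribe(&mut self, spec: &str) -> GlobalId {
        let id = self.allocate_transient_id();
        // `BTreeMap::insert` returns the previous value for the key, if any.
        // Because `id` was just minted, this can only fail if ID allocation
        // hands out duplicates.
        let prev = self.introspection_subscribes.insert(id, spec.to_string());
        assert!(prev.is_none(), "duplicate introspection subscribe (id={})", id);
        id
    }
}
```

In this sketch the assert guards the same invariant as ID uniqueness, which is the point made in the reply above.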

src/adapter/src/coord/introspection.rs

maddyblue · 2024-06-24T18:51:14Z

src/adapter/src/coord/introspection.rs

+
+    fn drop_introspection_subscribe(&mut self, id: GlobalId) {
+        let Some(subscribe) = self.introspection_subscribes.remove(&id) else {
+            soft_panic_or_log!("attempt to remove unknown introspection subscribe (id={id})");


Why is it safe to log in production instead of panic here? Panicking is bad, but so is continuing to run with a corrupted internal state.

teskje · 2024-06-25

I have become very careful of panicking in envd. In the past it happened a couple of times that we got into a crashloop due to some bug in the code, and could then resolve this only by pushing a patch that changed the panicking assert to a soft one. Even if we don't crashloop, a panic now and then might make the environment unstable, causing a maybe harmless bug in the code to be elevated to a high-urgency one. There are of course still asserts that need to be hard because they protect correctness.

In this case I don't see how correctness could be impacted. Most likely we get here by trying to drop an introspection subscribe twice, in which case ignoring the second drop is probably what we want. Less likely, we have somehow mixed up subscribe IDs and fail to drop the subscribe we really wanted to drop, in which case some coordinator memory will be wasted and some entries will not be cleaned up from an introspection relation until the next restart. To me, this seems preferable to panicking.
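The trade-off can be sketched as follows. This is an illustrative stand-in, not the real `soft_panic_or_log!` macro: the `SOFT_ASSERTIONS` flag, the `soft_panic_or_log` function, and the `Subscribes` type are all assumptions made for the example. The sketch assumes soft assertions panic where they are enabled (for example in tests) and only log where they are disabled (for example in production).

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicBool, Ordering};

// Assumed runtime switch: on in test environments, off in production.
static SOFT_ASSERTIONS: AtomicBool = AtomicBool::new(false);

// Stand-in for a soft assertion: panic when enabled, otherwise log and continue.
fn soft_panic_or_log(msg: &str) {
    if SOFT_ASSERTIONS.load(Ordering::Relaxed) {
        panic!("{}", msg);
    } else {
        eprintln!("error: {}", msg);
    }
}

// Simplified stand-in for the coordinator's map of introspection subscribes.
#[derive(Default)]
struct Subscribes {
    active: BTreeMap<u64, String>,
}

impl Subscribes {
    // Returns true if a subscribe was dropped, false if the ID was unknown.
    fn drop_subscribe(&mut self, id: u64) -> bool {
        let Some(_subscribe) = self.active.remove(&id) else {
            // A double drop leaves the map unchanged, so continuing is safe.
            soft_panic_or_log(&format!(
                "attempt to remove unknown introspection subscribe (id={})",
                id
            ));
            return false;
        };
        true
    }
}
```

With soft assertions disabled, dropping the same ID twice logs an error on the second call and leaves the remaining state untouched; with them enabled, the same bug surfaces as a panic during testing.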

test/testdrive/mzcompose.py

This commit introduces the overall infrastructure for introspection subscribes, implemented by a new `coord::introspection` module that extends the coordinator with methods to install introspection subscribes and process updates for them. This commit also introduces a first introspection subscribe that surfaces dataflow error counts, as the first test case for this infrastructure.

Change request has been addressed

teskje · 2024-06-26T08:16:21Z

TFTRs!

teskje force-pushed the introspection-subscribes-coord branch 5 times, most recently from 1c4b77e to 07ab2a0 Compare June 23, 2024 18:59

teskje commented Jun 24, 2024

teskje commented Jun 24, 2024

teskje marked this pull request as ready for review June 24, 2024 08:07

teskje requested review from a team as code owners June 24, 2024 08:07

teskje requested a review from ParkMyCar June 24, 2024 08:07

teskje mentioned this pull request Jun 24, 2024

compute: introspection subscribes (controller-managed) #27709

Closed

5 tasks

maddyblue reviewed Jun 24, 2024

maddyblue force-pushed the introspection-subscribes-coord branch from 07ab2a0 to b22445e Compare June 25, 2024 06:22

teskje force-pushed the introspection-subscribes-coord branch from b22445e to f01cf37 Compare June 25, 2024 10:44

def- previously requested changes Jun 25, 2024

teskje added 2 commits June 25, 2024 13:45

compute: feature flag to gate introspection subscribes

f765113

teskje force-pushed the introspection-subscribes-coord branch from f01cf37 to 8e73854 Compare June 25, 2024 11:50

test: adjust tests for new introspection subscribe

dce10d6

teskje force-pushed the introspection-subscribes-coord branch from 8e73854 to 359c84a Compare June 25, 2024 11:52

teskje requested a review from def- June 25, 2024 11:52

maddyblue added 2 commits June 25, 2024 14:00

adapter: better response for introspection subscribe sequencing

8dd8c60

adapter: refactor sequence staged to support session-less execution

7359ff4

teskje force-pushed the introspection-subscribes-coord branch from 359c84a to 7359ff4 Compare June 25, 2024 12:01

maddyblue approved these changes Jun 25, 2024

teskje merged commit 9a1b5a8 into MaterializeInc:main Jun 26, 2024
190 of 194 checks passed

teskje deleted the introspection-subscribes-coord branch June 26, 2024 08:16
