
spanconfig: introduce spanconfig.Reporter #90016

Merged
3 commits merged into cockroachdb:master on Nov 25, 2022

Conversation

@irfansharif (Contributor) commented Oct 14, 2022

This is the KV-side API for multi-tenant replication reports (#89987).

 // Reporter generates a conformance report over the given spans, i.e. whether
 // the backing ranges conform to the span configs that apply to them.
 //
 // NB: The standard implementation does not use a point-in-time snapshot of span
 // config state, but could be made to do so if needed. See commentary on the
 // spanconfigreporter.Reporter type for more details ("... we might not have a
 // point-in-time snapshot ...").
 type Reporter interface {
 	SpanConfigConformance(
 		ctx context.Context, spans []roachpb.Span,
 	) (roachpb.SpanConfigConformanceReport, error)
 }
// SpanConfigConformanceReport reports ranges that (i) don't conform to span
// configs that apply over them, and (ii) are unavailable. Also included in this
// report are the IDs of unavailable nodes (possibly contributing to
// under-replication or range-unavailability).
message SpanConfigConformanceReport {
  repeated ConformanceReportedRange under_replicated = 1 [(gogoproto.nullable) = false];
  repeated ConformanceReportedRange over_replicated = 2 [(gogoproto.nullable) = false];
  repeated ConformanceReportedRange violating_constraints = 3 [(gogoproto.nullable) = false];
  repeated ConformanceReportedRange unavailable = 4 [(gogoproto.nullable) = false];
  repeated int32 unavailable_node_ids = 5 [(gogoproto.customname) = "UnavailableNodeIDs"];
}

message ConformanceReportedRange {
  RangeDescriptor range_descriptor = 1 [(gogoproto.nullable) = false];
  SpanConfig config = 2 [(gogoproto.nullable) = false];
}
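
For orientation, here's a minimal sketch of how a caller might consume this API. The helper name, the fmt-based output, and the single-span input are illustrative assumptions, not part of this PR; the field names come from the proto above.

 // Sketch only: assumes the usual context, fmt, roachpb, and spanconfig imports.
 func printConformance(ctx context.Context, reporter spanconfig.Reporter, sp roachpb.Span) error {
 	report, err := reporter.SpanConfigConformance(ctx, []roachpb.Span{sp})
 	if err != nil {
 		return err
 	}
 	// Each reported range pairs the range descriptor with the span config
 	// that applies to it.
 	for _, r := range report.UnderReplicated {
 		fmt.Printf("r%d is under-replicated (config: %s)\n", r.RangeDescriptor.RangeID, &r.Config)
 	}
 	for _, id := range report.UnavailableNodeIDs {
 		fmt.Printf("n%d is unavailable\n", id)
 	}
 	return nil
 }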

@cockroach-teamcity (Member)

This change is Reviewable

@irfansharif irfansharif force-pushed the 221013.kv-repl-reports branch 6 times, most recently from 3e72dfa to 410e25b Compare October 21, 2022 18:07
@irfansharif irfansharif marked this pull request as ready for review October 21, 2022 18:08
@irfansharif irfansharif requested review from a team as code owners October 21, 2022 18:08
@irfansharif irfansharif requested a review from a team October 21, 2022 18:08
@irfansharif irfansharif requested a review from a team as a code owner October 21, 2022 18:08
@irfansharif irfansharif requested review from herkolategan and smg260 and removed request for a team October 21, 2022 18:08
@irfansharif (Contributor, Author) commented Oct 21, 2022

@arulajmani, this is still incomplete and slightly buggy, but it should be ready for a look. The incomplete parts are marked explicitly (XXX:) and I'll address them as part of this PR.

@irfansharif irfansharif requested review from arulajmani and removed request for a team, herkolategan and smg260 October 21, 2022 18:09
@irfansharif irfansharif force-pushed the 221013.kv-repl-reports branch 5 times, most recently from 1766498 to 3c1de07 Compare October 23, 2022 20:35
@arulajmani (Collaborator) left a comment

Good stuff! Should we add authorization for the new RPCs in auth_tenant.go as part of this patch?

Separately, I remember there was some talk about pulling all the information to compute conformance into the tenant and constructing the report there as opposed to having KV perform the computation like this patch does -- could you remind me why we settled on this approach?

Reviewed 40 of 42 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)


pkg/kv/kvserver/store.go line 1076 at r1 (raw file):

	// infrastructure or not.
	//
	// TODO(irfansharif): We can remove this.

Let's file an issue for this?


pkg/kv/kvserver/reports/replication_stats_report.go line 407 at r1 (raw file):

		// on non-voting replicas. This code will also soon be removed in favor
		// of something that works with multi-tenancy (#89987).
	}, replicationFactor, 0)

nit: /neededNonVoters/


pkg/roachpb/metadata_replicas.go line 415 at r1 (raw file):

// over/under-replication of voters. If the caller is only interested in
// availability of voting replicas, 0 can be passed in. neededNonVoters is the
// counterpart for non-voting replicas.

Should we use -1 as the sentinel value for non-voters? (and maybe make the change for voters as well, for parity).

Unlike voters, it's completely reasonable to expect 0 non-voters. Should we be detecting cases where that's the expectation but non-voters exist as OverReplicatedNonVoters?


pkg/spanconfig/spanconfig.go line 356 at r1 (raw file):

// the backing ranges conform to the span configs that apply to them.
type Reporter interface {
	SpanConfigConformance(

Could you add some more words about the spans field? Specifically, I was thinking something that talks about how conformance isn't reported at a "snapshot" for all spans passed into this function and why that's okay.


pkg/spanconfig/spanconfigreporter/reporter.go line 102 at r1 (raw file):

	// XXX: Write an end-to-end test using actual SQL and zone configs. Set configs
	// on a table, disable replication, see conformance. Enable repl, change
	// configs, etc. Do this for tenants as well.

Would this have caught the auth_tenant.go omission?


pkg/spanconfig/spanconfigreporter/reporter.go line 107 at r1 (raw file):

	// XXX: Can we improve the SpanConfigConformanceReport proto type? Perhaps
	// include some {meta,}data about the span config being violated as well? Or
	// include the span config directly and provide helper libraries to compute

I'm not sure how this API will be used yet, but including the span config seems useful in understanding why things are in violation, especially given the various steps involved in going from a zone config to the range's placement.


pkg/spanconfig/spanconfigreporter/reporter.go line 109 at r1 (raw file):

	// include the span config directly and provide helper libraries to compute
	// human-readable "why is this in violation" text.
	// - Only include range ID + replica descriptors + keys?

Is this simply to avoid shipping the entire range descriptor back or is there something else?


pkg/spanconfig/spanconfigreporter/reporter.go line 111 at r1 (raw file):

	// - Only include range ID + replica descriptors + keys?
	// - Type to represent exactly which constraint is being violated?
	// - Segment over/under replicated by what replica type (voter/non-voter)

At the least, I think we should do this, especially given we already have this information and we're intentionally coalescing things. Previously if a range appeared as both under and over-replicated we knew exactly why that was the case -- that's no longer true now that this reporting stuff also knows about non-voters and I think the UX here is a bit confusing. You could also argue users care more about voter under/over-replication compared to non-voters, so the distinction is useful to make. What do you think?


pkg/spanconfig/spanconfigreporter/reporter.go line 155 at r1 (raw file):

			for i, c := range overall.Constraints {
				if c.NumReplicas == 0 {
					c.NumReplicas = conf.NumReplicas

nit: Could you add a comment to make this logic a bit more obvious please.

nit2: Do you think it would be clearer if we created a new variable instead of assigning NumReplicas?
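
(For readers following along: a rough, hypothetical illustration of the fallback being discussed -- not the actual reporter code. A conjunction of constraints with NumReplicas == 0 is the unnamed constraint that applies to every replica, so conformance is measured against the config's full replication factor.)

 	for _, c := range overall.Constraints {
 		if c.NumReplicas == 0 {
 			// Unnamed constraint: it applies to all replicas, so measure
 			// conformance against the config's overall replication factor.
 			c.NumReplicas = conf.NumReplicas
 		}
 		// ... count replicas satisfying c.Constraints and compare against
 		// c.NumReplicas ...
 	}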


pkg/kv/kvclient/kvcoord/send_test.go line 107 at r1 (raw file):

	context.Context, *roachpb.SpanConfigConformanceRequest,
) (*roachpb.SpanConfigConformanceResponse, error) {
	panic("implement me")

Should this be s/implement me/unimplemented/g or were you needing to do this as part of this patch?


pkg/kv/kvserver/allocator/allocatorimpl/allocator_test.go line 3970 at r1 (raw file):

			}
		}
		analyzed := constraint.AnalyzeConstraints(a.StorePool, existingRepls, 0, tc.constraints)

nit: /* numReplicas */

@irfansharif irfansharif requested a review from a team November 7, 2022 21:59
@irfansharif irfansharif requested a review from a team as a code owner November 7, 2022 21:59
@irfansharif (Contributor, Author) left a comment

Should we add authorization for the new RPCs in auth_tenant.go as part of this patch?

Done.

I remember there was some talk about pulling all the information to compute conformance into the tenant and constructing the report there as opposed to having KV perform the computation like this patch does -- could you remind me why we settled on this approach?

I wasn't sure how much pre-computing this periodically and stashing the result somewhere was actually buying us. In a future PR/benchmark I want to run this conformance report against a million or so ranges and see how it fares -- I suspect it'll be good enough not to need pre-computing. BTW, we can still pre-compute data and store it: we'd kick off a job that invokes this lower-level API and stashes the result somewhere in SQL/ObsService. The API here is still just the lower-level thing and doesn't attempt persistence; I felt that was simpler.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)


pkg/kv/kvserver/store.go line 1076 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Let's file an issue for this?

Added to #81009.


pkg/kv/kvserver/reports/replication_stats_report.go line 407 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: /neededNonVoters/

Done.


pkg/roachpb/metadata_replicas.go line 415 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Should we use -1 as the sentinel value for non-voters? (and maybe make the change for voters as well, for parity).

Unlike voters, it's completely reasonable to expect 0 non-voters. Should we be detecting cases where that's the expectation but non-voters exist as OverReplicatedNonVoters?

Good point, I'm going to use -1 as the sentinel value for non-voters. I'll keep the voters version as is to avoid accidentally introducing new bugs (there's a bit of implicitness in how neededVoters == 0 is handled below that I don't want to untangle).
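
(A rough sketch of the sentinel semantics being settled on here; the variable names are illustrative, not the actual metadata_replicas.go code.)

 	// neededNonVoters < 0: the caller isn't interested in non-voter replication.
 	// neededNonVoters == 0: the caller expects zero non-voters, so any that
 	// exist count as over-replication.
 	if neededNonVoters >= 0 {
 		switch {
 		case liveNonVoters < neededNonVoters:
 			underReplicatedNonVoters = true
 		case liveNonVoters > neededNonVoters:
 			overReplicatedNonVoters = true
 		}
 	}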


pkg/spanconfig/spanconfig.go line 356 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Could you add some more words about the spans field? Specifically, I was thinking something that talks about how conformance isn't reported at a "snapshot" for all spans passed into this function and why that's okay.

I'll point to this comment block for whoever gets to use this interface to power the "front-end" of multi-tenant repl reports, so pardon the verbosity. The data dependencies in the implementation are:

  • a view of range descriptors (done so transactionally and probably will continue to be so -- so we have a snapshot);
  • a view over what span configs apply to what keyspans, backed by the kvsubscriber machinery you may have heard of. Here it's more interesting in that different nodes might be observing data as of different timestamps, but that doesn't quite break the "snapshot" of the data. It's true that we aren't grabbing a read lock over the kvsubscriber while reading off each span config, but that's something we can do if it becomes a problem;
  • the store resolver resolving store IDs to store descriptors. It's true that we're not making consistency claims over this data -- it's possible that we're racing against a store being added while that info is not available through gossip; the fallback behaviour is well-defined. You might see such inconsistencies if you're issuing this RPC against different nodes with different views over the gossip data, but I don't think it's worth leaking such details into the interface itself. Each RPC is still working over a "snapshot" in time, at the servicing node.
  • gossip-backed node liveness; same as point above.

pkg/spanconfig/spanconfigreporter/reporter.go line 102 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Would this have caught the auth_tenant.go omission?

I've added the auth_tenant.go checks, and yes, it would've caught the omission. I'm not adding these tests now since it's not hooked into SQL/an endpoint -- that'll come in some follow up PR.


pkg/spanconfig/spanconfigreporter/reporter.go line 107 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

I'm not sure how this API will be used yet, but including the span config seems useful in understanding why things are in violation, especially given the various steps involved in going from a zone config to the range's placement.

Agreed, done.


pkg/spanconfig/spanconfigreporter/reporter.go line 109 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Is this simply to avoid shipping the entire range descriptor back or is there something else?

Nothing else, these were just (sloppy) notes for myself. Shipping the entire descriptor back seems fine -- we seem to need most of it.


pkg/spanconfig/spanconfigreporter/reporter.go line 111 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

At the least, I think we should do this, especially given we already have this information and we're intentionally coalescing things. Previously if a range appeared as both under and over-replicated we knew exactly why that was the case -- that's no longer true now that this reporting stuff also knows about non-voters and I think the UX here is a bit confusing. You could also argue users care more about voter under/over-replication compared to non-voters, so the distinction is useful to make. What do you think?

I think the distinction is worth making. I'm going to do something different -- in the report proto itself I'll include the list of unavailable node IDs so callers can


pkg/spanconfig/spanconfigreporter/reporter.go line 155 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: Could you add a comment to make this logic a bit more obvious please.

nit2: Do you think it would be clearer if we created a new variable instead of assigning NumReplicas?

Added a comment. I'll keep the assignment as is -- better to have a single place to learn about NumReplicas instead of two.


pkg/kv/kvclient/kvcoord/send_test.go line 107 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Should this be s/implement me/unimplemented/g or were you needing to do this as part of this patch?

Done.

@arulajmani (Collaborator) left a comment

This looks pretty close, just a few minor comments.

I wasn't sure how much pre-computing this periodically and stashing the result somewhere was actually buying us. In a future PR/benchmark I want to run this conformance report against a million or so ranges and see how it fares -- I suspect it'll be good enough not to need pre-computing. BTW, we can still pre-compute data and store it: we'd kick off a job that invokes this lower-level API and stashes the result somewhere in SQL/ObsService. The API here is still just the lower-level thing and doesn't attempt persistence; I felt that was simpler.

I think we're talking about different things. When I said "compute conformance in the tenant" I meant instead of doing this under-replicated, over-replicated, constraint conformance etc. in KV, we could instead ship back all the tenant's range descriptors, span configs, and the view of node liveness and compute conformance inside the SQL pod. I remember this was discussed in one of our eng pods, though I don't remember if we settled one way or another. Do you remember why we ended up choosing this approach?

To be clear, I'm not asking you to switch it -- just wondering if you considered the alternative.

Reviewed 58 of 58 files at r2, 42 of 42 files at r3, 14 of 14 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)


pkg/rpc/auth_tenant.go line 389 at r4 (raw file):

	case *roachpb.SpanConfigTarget_Span:
		return validateSpan(tenID, *spanConfigTarget.GetSpan())
	case *roachpb.SpanConfigTarget_SystemSpanConfigTarget:

What's a system span config target? :trollface:


pkg/spanconfig/spanconfig.go line 356 at r1 (raw file):

backed by the kvsubscriber machinery you may have heard of.

Lol.

gossip-backed node liveness; same as point above

I might be missing something, but I left a question in-line for you.

It's true that we aren't grabbing a read lock over the kvsubscriber while reading off each span config, but that's something we can do if it becomes a problem;

Could we please add this as a comment in code, instead of trying to reference this thread? It doesn't have to be as verbose as your explanation above, but do include this part -- I think it's important.

To be clear, I'm not asking you to change any of this behavior or saying it's incorrect -- just add some words for it :)


pkg/spanconfig/spanconfigreporter/reporter.go line 102 at r1 (raw file):

I'm not adding these tests now since it's not hooked into SQL/an endpoint -- that'll come in some follow up PR.

Makes sense. Before we merge, though, let's add the authentication-specific checks to auth_test.go?


pkg/spanconfig/spanconfigreporter/reporter.go line 111 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

I think the distinction is worth making. I'm going to do something different -- in the report proto itself I'll include the list of unavailable node IDs so callers can

Sounds good, I'm all for it.


pkg/spanconfig/spanconfigreporter/reporter.go line 115 at r4 (raw file):

					status := desc.Replicas().ReplicationStatus(
						func(rDesc roachpb.ReplicaDescriptor) bool {
							isLive, err := r.dep.Liveness.IsLive(rDesc.NodeID)

Are we guaranteed that IsLive returns a constant status for a particular node for every call? Is it possible for a node to flap in and out of liveness while we're iterating through range descriptors?


pkg/spanconfig/spanconfigreporter/datadriven_test.go line 189 at r4 (raw file):

							buf.WriteString(fmt.Sprintf("%s:\n", tag))
						}
						buf.WriteString(fmt.Sprintf("  %s applying %s\n", printRangeDesc(r.RangeDescriptor),

💯

We'll depend on just this directly in a future commit, without wanting to
pull in the much larger liveness package. Instead of introducing a
standalone package for it, we'll re-use the thin livenesspb package.

Release note: None
This is the KV-side API for multi-tenant replication reports (cockroachdb#89987)

Release note: None
@irfansharif (Contributor, Author) left a comment

Ah. Hm, I'm not sure what the benefit of doing it that way would be. There is one downside -- we'd be shipping a lot of data over the wire when all we're interested in is the result. Perhaps one benefit is that the CPU time spent processing this request is tied directly to the tenant pod, which is appropriately billed, but I'm not sure that's a strong reason (+ we could always do proper attribution server-side).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)


pkg/spanconfig/spanconfig.go line 356 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

backed by the kvsubscriber machinery you may have heard of.

Lol.

gossip-backed node liveness; same as point above

I might be missing something, but I left a question in-line for you.

It's true that we aren't grabbing a read lock over the kvsubscriber while reading off each span config, but that's something we can do if it becomes a problem;

Could we please add this as a comment in code, instead of trying to reference this thread? It doesn't have to be as verbose as your explanation above, but do include this part -- I think it's important.

To be clear, I'm not asking you to change any of this behavior or saying it's incorrect -- just add some words for it :)

Done. Added the following:

		// NB: The data dependencies in the implementation are:
		//
		// i.   point-in-time view over gossip-backed node liveness;
		// ii.  point-in-time view of range descriptors (done transactionally);
		// iii. the store resolver resolving store IDs to store descriptors;
		// iv.  view over what span configs apply to what keyspans;
		//
		// TODO(irfansharif): For (iii) and (iv) we might not have a
		// point-in-time snapshot of the data.
		// - For (iii) it's possible that as we iterate through the set of range
		//   descriptors, a few of which refer to some store S, we're racing
		//   against that newly-added store's info not yet being available
		//   through gossip. This is exceedingly unlikely, but if we see it
		//   happen, we can expose some snapshot of the StoreResolver state like
		//   we have for liveness.
		// - For (iv) too we're not grabbing a read lock over the backing
		//   spanconfig.KVSubscriber while reading off each span config, so it's
		//   possible we generate the report for two range descriptors with span
		//   configs from different points in time. If this too becomes a
		//   problem, we can explicitly generate a snapshot like we do for
		//   liveness.

pkg/spanconfig/spanconfigreporter/reporter.go line 102 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

I'm not adding these tests now since it's not hooked into SQL/an endpoint -- that'll come in some follow up PR.

Makes sense. Before we merge, though, let's add the authentication-specific checks to auth_test.go?

Done.


pkg/spanconfig/spanconfigreporter/reporter.go line 111 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Sounds good, I'm all for it.

My message got clipped off earlier, but I meant to say "unavailable node IDs so callers can use library functions (not written) to determine exactly why the specific range is unavailable/non-conformant". Done.


pkg/spanconfig/spanconfigreporter/reporter.go line 115 at r4 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Are we guaranteed that IsLive returns a constant status for a particular node for every call? Is it possible for a node to flap in and out of liveness while we're iterating through range descriptors?

Great catch! Indeed it can, and that's not something we want. Changed interfaces around slightly to explicitly capture a snapshot of liveness state so such flip-flopping is not possible.
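
(A minimal sketch of the snapshot approach described above, with hypothetical names and error handling elided: liveness is captured once before iterating, so the closure passed to ReplicationStatus consults a stable view rather than live gossip state.)

 	// Capture liveness once, up front.
 	isLiveMap := make(map[roachpb.NodeID]bool)
 	for _, nid := range nodeIDs {
 		live, _ := liveness.IsLive(nid) // error handling elided in this sketch
 		isLiveMap[nid] = live
 	}
 	for _, desc := range rangeDescs {
 		_ = desc.Replicas().ReplicationStatus(
 			func(rDesc roachpb.ReplicaDescriptor) bool {
 				return isLiveMap[rDesc.NodeID] // stable for the whole iteration
 			}, neededVoters, neededNonVoters)
 	}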

@arulajmani (Collaborator) left a comment

Ah. Hm, I'm not sure what the benefit of doing it that way would be.

One thought was that the APIs required to "build the report" would be more generally useful. For example, shipping all range descriptors back to the tenant can be used for both building conformance and enabling SHOW RANGES. However, after thinking about this a bit more, I doubt we want to let tenants have a view of the liveness of the underlying cluster. So what you have here seems good to me.

Anyway, thanks for the iterations. :lgtm: modulo comments.

Reviewed 72 of 72 files at r5, 58 of 58 files at r6, 31 of 31 files at r7, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)


pkg/spanconfig/spanconfig.go line 356 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Done. Added the following:

		// NB: The data dependencies in the implementation are:
		//
		// i.   point-in-time view over gossip-backed node liveness;
		// ii.  point-in-time view of range descriptors (done transactionally);
		// iii. the store resolver resolving store IDs to store descriptors;
		// iv.  view over what span configs apply to what keyspans;
		//
		// TODO(irfansharif): For (iii) and (iv) we might not have a
		// point-in-time snapshot of the data.
		// - For (iii) it's possible that as we iterate through the set of range
		//   descriptors, a few of which refer to some store S, we're racing
		//   against that newly-added store's info not yet being available
		//   through gossip. This is exceedingly unlikely, but if we see it
		//   happen, we can expose some snapshot of the StoreResolver state like
		//   we have for liveness.
		// - For (iv) too we're not grabbing a read lock over the backing
		//   spanconfig.KVSubscriber while reading off each span config, so it's
		//   possible we generate the report for two range descriptors with span
		//   configs from different points in time. If this too becomes a
		//   problem, we can explicitly generate a snapshot like we do for
		//   liveness.

Thanks! It's worth spelling out the guarantees we make (and the ones we don't) on the interface itself -- specifically, it's worth calling out we're not using a point-in-time view of the span config state.


pkg/spanconfig/spanconfigreporter/reporter.go line 111 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

My message got clipped off earlier, but I meant to say "unavailable node IDs so callers can use library functions (not written) to determine exactly why the specific range is unavailable/non-conformant". Done.

Yeah, I figured the gist.


pkg/spanconfig/spanconfigreporter/reporter.go line 115 at r4 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Great catch! Indeed it can, and that's not something we want. Changed interfaces around slightly to explicitly capture a snapshot of liveness state so such flip-flopping is not possible.

Nice!


pkg/spanconfig/spanconfigreporter/reporter.go line 121 at r7 (raw file):

) (roachpb.SpanConfigConformanceReport, error) {
	report := roachpb.SpanConfigConformanceReport{}
	unavailableNodes := make(map[roachpb.NodeID]struct{})

Do we still need this now that we have the isLiveMap? Could we instead use that when constructing the response?

@irfansharif (Contributor, Author) left a comment

TFTR!

bors r+

Dismissed @arulajmani from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani)


pkg/spanconfig/spanconfig.go line 356 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Thanks! It's worth spelling out the guarantees we make (and the ones we don't) on the interface itself -- specifically, it's worth calling out we're not using a point-in-time view of the span config state.

Done.


pkg/spanconfig/spanconfigreporter/reporter.go line 121 at r7 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Do we still need this now that we have the isLiveMap? Could we instead use that when constructing the response?

Not strictly, but I'll keep things as is. Right now the list of unavailable node IDs contains only nodes that definitely hold replicas of the key spans requested, rather than the full set of all unavailable nodes.
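
(In other words -- a hypothetical sketch, not the actual code -- the set is populated only from replicas of the ranges backing the requested spans, so nodes that are down but irrelevant to those spans don't appear in the report.)

 	unavailableNodes := make(map[roachpb.NodeID]struct{})
 	for _, desc := range rangeDescs { // only ranges backing the requested spans
 		for _, rDesc := range desc.Replicas().Descriptors() {
 			if !isLiveMap[rDesc.NodeID] {
 				unavailableNodes[rDesc.NodeID] = struct{}{}
 			}
 		}
 	}
 	for nid := range unavailableNodes {
 		report.UnavailableNodeIDs = append(report.UnavailableNodeIDs, int32(nid))
 	}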

@craig (craig bot, Contributor) commented Nov 25, 2022

Build succeeded.

@craig craig bot merged commit 1a6e9f8 into cockroachdb:master Nov 25, 2022
@irfansharif irfansharif deleted the 221013.kv-repl-reports branch November 25, 2022 18:00
@ajwerner (Contributor)

@irfansharif (Contributor, Author)

:(

zachlite added a commit to zachlite/cockroach that referenced this pull request Feb 21, 2023
Since cockroachdb#90016, we've had the ability to generate multi-tenant
replication reports via `spanconfig.Reporter.SpanConfigConformance()`.

This commit leverages cockroachdb#90016 to provide multi-tenant friendly
observability of nodes whose replicas are in violation of their
span configs, and are therefore considered critical nodes.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-10630
Release note (ops change): Adds a new `CriticalNodes` rpc accessible
via HTTP. The response includes a list of node descriptors that
are considered critical, and the corresponding SpanConfigConformanceReport
which includes details of non-conforming ranges contributing to the
criticality.
zachlite added a commit to zachlite/cockroach that referenced this pull request Feb 22, 2023
Since cockroachdb#90016, we've had the ability to generate multi-tenant
replication reports via `spanconfig.Reporter.SpanConfigConformance()`.

This commit leverages cockroachdb#90016 to provide multi-tenant friendly
observability of nodes whose replicas are unavailable or underreplicated,
and are therefore considered critical nodes.

A new `CriticalNodes` rpc is available via HTTP. The response
includes a list of node descriptors that are considered critical, and the
corresponding SpanConfigConformanceReport which includes details of
non-conforming ranges contributing to the criticality.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-10630
Release note: None
craig bot pushed a commit that referenced this pull request Mar 2, 2023
97412: server: introduce CriticalNodes rpc r=zachlite a=zachlite

Since #90016, we've had the ability to generate multi-tenant
replication reports via `spanconfig.Reporter.SpanConfigConformance()`.

This commit leverages #90016 to provide multi-tenant friendly
observability of nodes whose replicas are unavailable or underreplicated,
and are therefore considered critical nodes.

A new `CriticalNodes` rpc is available via HTTP. The response
includes a list of node descriptors that are considered critical, and the
corresponding SpanConfigConformanceReport which includes details of
non-conforming ranges contributing to the criticality.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-10630
Release note: None

Co-authored-by: zachlite <zachlite@gmail.com>