server, tenant: Hot ranges structured logging for serverless clusters #89511

koorosh · 2022-10-06T18:06:24Z

This patch introduces periodic logging of hot ranges info (ie QPS and
relevant DB objects info). It is done for both dedicated and
serverless cluster types, and by default called every 4h.
Stats collection can be enabled/disabled, or changed interval
through cluster settings.

Resolves: #85415

Release note: None

cockroach-teamcity · 2022-10-06T18:06:33Z

This change is

blathers-crl · 2023-01-27T19:57:08Z

Thank you for updating your pull request.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

dhartunian

Reviewed 3 of 11 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @koorosh)

pkg/server/tenant.go line 737 at r1 (raw file):

	s.sqlServer.isReady.Set(true)

	structlogging.StartHotRangesLoggingScheduler(ctx, s.stopper, s.sqlServer.tenantConnect, *s.sqlServer.internalExecutor)

should this be inside the SQLServer?

pkg/server/structlogging/hot_ranges_log.go line 49 at r1 (raw file):

func (s *hotRangesDetailsLoggingScheduler) start(ctx context.Context, stopper *stop.Stopper) {
	_ = stopper.RunAsyncTask(ctx, "hot-ranges-struct-logging", func(ctx context.Context) {
		ticker := time.NewTicker(5 * time.Second)

Commit says it's running once per hour but this says 5 seconds.

Do we want this configurable via a cluster setting? Or provide the ability to disable? I think we already have a setting that determines whether telemetry is enabled. Perhaps if that's not on, we don't start the async task at all.

pkg/server/structlogging/hot_ranges_log_test.go line 6 at r1 (raw file):

func TestStartScheduler(t *testing.T) {
	t.Skip()

I assume this needs a test :)

koorosh

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)

pkg/server/tenant.go line 737 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

should this be inside the SQLServer?

My assumption was that we need to structured logs for tenant servers only and didn't find a way to distinguish it inside of SqlServer.
Currently SqlServer exposes preStart and startServeSQL functions that doesn't look correct to init logging.
Do you think it'll be more appropriate to add postStart func that accepts args to identify if it is tenant or not and then init logging if necessary? ie

func postStart (..., tenant bool) error {
    if tenant {
      structlogging.StartHotRangesLoggingScheduler(...
    }
}

pkg/server/structlogging/hot_ranges_log.go line 49 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Commit says it's running once per hour but this says 5 seconds.

Do we want this configurable via a cluster setting? Or provide the ability to disable? I think we already have a setting that determines whether telemetry is enabled. Perhaps if that's not on, we don't start the async task at all.

good point it is still under discussion and should be updated

pkg/server/structlogging/hot_ranges_log_test.go line 6 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

I assume this needs a test :)

Will do ))

koorosh

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)

pkg/server/structlogging/hot_ranges_log_test.go line 6 at r1 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

Will do ))

Done.

dhartunian

Reviewed 7 of 12 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @koorosh)

pkg/server/tenant.go line 737 at r1 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

My assumption was that we need to structured logs for tenant servers only and didn't find a way to distinguish it inside of SqlServer.
Currently SqlServer exposes preStart and startServeSQL functions that doesn't look correct to init logging.
Do you think it'll be more appropriate to add postStart func that accepts args to identify if it is tenant or not and then init logging if necessary? ie
func postStart (..., tenant bool) error {
    if tenant {
      structlogging.StartHotRangesLoggingScheduler(...
    }
}

👍 no need to create a separate phase. we can leave it here, thx for the notes.

pkg/server/structlogging/hot_ranges_log.go line 11 at r2 (raw file):

// licenses/APL.txt.

package structlogging

nit: can we use a different name? this seems too generic. maybe just kvtelemetry? or can we put this in an existing telemetry package somewhere?

pkg/server/structlogging/hot_ranges_log.go line 46 at r2 (raw file):

)

var TelemetryHotRangesStatsLoggingDelay = settings.RegisterDurationSetting(

do we need to respond to these cluster settings changing?

pkg/server/structlogging/hot_ranges_log.go line 56 at r2 (raw file):

// hotRangesDetailsLoggingScheduler is responsible for logging index usage stats
// on a scheduled interval.
type hotRangesDetailsLoggingScheduler struct {

nit: can we rename to hotRangesLoggingScheduler to match the function? I don't think we need "Details".

pkg/server/structlogging/hot_ranges_log.go line 121 at r2 (raw file):

}

func logEventsWithDelay(

Can you add a comment about why we want to rate limit this? It's not immediately obvious to me. the events are already generated.

koorosh

Dismissed @dhartunian from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)

pkg/server/structlogging/hot_ranges_log.go line 49 at r1 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

good point it is still under discussion and should be updated

FYI: Updated to once per day

pkg/server/structlogging/hot_ranges_log.go line 11 at r2 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: can we use a different name? this seems too generic. maybe just kvtelemetry? or can we put this in an existing telemetry package somewhere?

As I see, another packages define such logic in separate file as part of package where it is used (ie pkg/ccl/changefeedccl/telemetry.go).
In our case, I could lift it up on one level up in server package, and rename hot_ranges_log.go to telemetry.go. WDYT?

pkg/server/structlogging/hot_ranges_log.go line 46 at r2 (raw file):

Previously, dhartunian (David Hartunian) wrote…

do we need to respond to these cluster settings changing?

I think it is not necessary. This cluster setting is fetched every time on new cycle of logging stats, so updated value will be used:

logEventsWithDelay(ctx, events, stopper, TelemetryHotRangesStatsLoggingDelay.Get(&s.st.SV))

pkg/server/structlogging/hot_ranges_log.go line 56 at r2 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: can we rename to hotRangesLoggingScheduler to match the function? I don't think we need "Details".

Will do

pkg/server/structlogging/hot_ranges_log.go line 121 at r2 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Can you add a comment about why we want to rate limit this? It's not immediately obvious to me. the events are already generated.

Telemetry logging has limitation to send 10 log lines (with limited length) per second on each node and sending multiple entries (each can have arbitrary length) at the same time. This func allows defer sending logs with provided delay to make sure it doesn't exceed limits or blocks others to send logs.
I borrowed this func from pkg/sql/scheduledlogging/captured_index_usage_stats.go so going to move it to logutil package to have single version of this helper func.

knz

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @koorosh)

pkg/server/structlogging/hot_ranges_log.go line 121 at r2 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

Telemetry logging has limitation to send 10 log lines (with limited length) per second on each node and sending multiple entries (each can have arbitrary length) at the same time. This func allows defer sending logs with provided delay to make sure it doesn't exceed limits or blocks others to send logs.
I borrowed this func from pkg/sql/scheduledlogging/captured_index_usage_stats.go so going to move it to logutil package to have single version of this helper func.

Thanks. My recommendation however would be to pre-populate the event.Timestamp field (instead of relying on log.StructuredEvent doing it), so that the timestamp represents the time at which the HotRanges request was performed and not the time at which the event was logged.

koorosh

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)

pkg/server/structlogging/hot_ranges_log.go line 121 at r2 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Thanks. My recommendation however would be to pre-populate the event.Timestamp field (instead of relying on log.StructuredEvent doing it), so that the timestamp represents the time at which the HotRanges request was performed and not the time at which the event was logged.

Thank you! Done.

koorosh · 2023-03-08T15:08:22Z

@thtruo , I set following default values for cluster settings that manage how logs are collected:

server.telemetry.hot_ranges_stats.interval = 24h (collect stats once per day)
server.telemetry.hot_ranges_stats.enabled is True by default (allows to disable/enable hot ranges stats collection)

also it's guarded by diagnostics.reporting.enabled setting which as I understand is general approach to enable reporting.

Let me know if it looks ok or should be changed.

kvoli

I had a few questions about this that maybe @koorosh or @thtruo could help me with:

The PR and commit message mention "...for serverless clusters" and "...for multi-tenant clusters". Would it be possible to add this generally? Or do we already get this under some unified architecture stuff where every cluster is at least single tenant?

The work here is great, It would be nice have more frequent logging for debugging purposes (think debug.zip). Do you have any thoughts on #98108? It feels as though there is overlap and that issue is more ad-hoc. The period required for that issue is much lower and doesn't relate to telemetry. There seems like potential from this PR to perform more frequent (non-telemetry) logging for a subset of ranges into a log file similar to the health log.

Why is the log period 24h? I assume for telemetry you wish to capture seasonality. A common form of seasonality I've seen on user clusters takes place <24h e.g. people buying items at night and very little in the morning. It would be a shame for the telemetry data a cluster emits to be specific to the start time of the SQL tenant cluster.

I suggest jittering or randomizing when logging occurs, with a budget for how frequent logging can be in a period.
I suggest lowering the period from 24h to 4h or less.

My questions and suggestions shouldn't block this PR.

Reviewed 9 of 12 files at r2, 6 of 6 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @knz, and @koorosh)

koorosh

Thanks for really great questions to discuss!

commit message mention "...for serverless clusters" and "...for multi-tenant clusters"

I haven't considered to generalize it for UA and agree it is the right way to do but would still prefer to file separate issue for this.

Regarding interval durations, there is more question to Tommy and how this stats will be used at the end. I imagine that it shouldn't be called often because:
a) for seasonal increase of requests - it is more appropriate to think about the ways we can export hot ranges stats to prometheus at least so user can monitor this data and react on such spikes, because we don't even know if this spice of requests expected or not.
b) telemetry might make use of infrequent stats collection to observe if there's some gradual regression (ie increased QPS) or to correlate hot ranges stats to schema change stats for instance.

@thtruo what's your thoughts on this?

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @knz, and @kvoli)

thtruo · 2023-03-17T15:25:52Z

Re time intervals - I’m okay collecting more often in a tighter time interval. I’m not sure how randomizing/jittering will look, but I’d be okay starting with a fixed interval that’s more frequent. I was originally open to collecting every 8hrs, but based on Austen’s comment I’m also okay with a 4hr interval (as long as we don’t hit the logging pipeline’s constraints) - so let’s do that too.

@koorosh Will this telemetry fire for Dedicated clusters? The above comments suggest this telemetry will be applied to Serverless/multi-tenant clusters. Can you confirm? I ideally want to see this telemetry for all cloud clusters (including Dedicated and Serverless) - the original issue #85415 does not specify only Serverless. Would capturing Dedicated telemetry come in a separate PR?

blathers-crl · 2023-03-20T13:13:02Z

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

We notice you have more than one commit in your PR. We try break logical changes into separate commits, but commits such as "fix typo" or "address review commits" should be squashed into one commit and pushed with --force
When CI has completed, please ensure no errors have appeared.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

koorosh · 2023-03-27T14:19:50Z

Updated PR to collect telemetry for all cluster types and changed default interval to 4 hours.

kvoli

This LGTM however I'm not that familiar with best practices in this area of the code. I'll defer to @dhartunian or @knz for stamping.

I would suggest updating the PR title to mention the functionality is for all clusters.

Reviewed 6 of 6 files at r4, 5 of 5 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @koorosh)

-- commits line 2 at r5:
Are you planning on squashing the two commits together?

-- commits line 4 at r5:
nit: reword to This patch introduces periodic logging of hot ranges info...

koorosh · 2023-03-28T10:51:52Z

Thank you Austen, updated commit message and squashed commits as well!

kvoli

I'm going to stamp this, as it LGTM and but perhaps reach out to the other reviewers.

Are you going to backport this to 23.1?

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @knz, and @koorosh)

koorosh · 2023-03-29T19:59:55Z

Yes, it has to be backported, almost forgot about this!

I'll wait for another pair of eyes on this, TFTR!

koorosh · 2023-03-29T20:07:53Z

@knz , @dhartunian appreciate your feedback on this!

dhartunian

Just have some minor comments on code organization and stuff and then it should be good to merge.

Reviewed 1 of 12 files at r2, 3 of 6 files at r3, 5 of 6 files at r4, 3 of 5 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @koorosh)

pkg/server/server_sql.go line 1677 at r6 (raw file):

	pgL net.Listener,
	socketFileCfg *string,
	sServer serverpb.TenantStatusServer,

nit: call this statusServer for clarity

pkg/server/tenant.go line 737 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

👍 no need to create a separate phase. we can leave it here, thx for the notes.

I think I preferred the placement of the scheduler here instead of startServeSQL just to keep the SQL startup as minimal as possible. It's fine if the code is duplicated in server.go.

For instance, you end up passing in sqlServer.internalExecutor to startServeSQL, which is a bit confusing.

pkg/server/structlogging/hot_ranges_log.go line 46 at r2 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

I think it is not necessary. This cluster setting is fetched every time on new cycle of logging stats, so updated value will be used:
logEventsWithDelay(ctx, events, stopper, TelemetryHotRangesStatsLoggingDelay.Get(&s.st.SV))

Great. Thanks!

pkg/server/structlogging/hot_ranges_log.go line 91 at r6 (raw file):

			select {
			case <-stopper.ShouldQuiesce():
				return

nit: also respond to ctx cancellation

koorosh

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @knz, and @kvoli)

-- commits line 2 at r5:

Previously, kvoli (Austen) wrote…

Are you planning on squashing the two commits together?

Done.

-- commits line 4 at r5:

Previously, kvoli (Austen) wrote…

nit: reword to This patch introduces periodic logging of hot ranges info...

Done.

pkg/server/tenant.go line 737 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

I think I preferred the placement of the scheduler here instead of startServeSQL just to keep the SQL startup as minimal as possible. It's fine if the code is duplicated in server.go.

For instance, you end up passing in sqlServer.internalExecutor to startServeSQL, which is a bit confusing.

Done.

pkg/server/structlogging/hot_ranges_log.go line 91 at r6 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: also respond to ctx cancellation

Done.

pkg/server/server_sql.go line 1677 at r6 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: call this statusServer for clarity

Done.

dhartunian

thanks @koorosh !

Reviewed 2 of 8 files at r7.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz, @koorosh, and @kvoli)

pkg/server/structlogging/hot_ranges_log.go line 11 at r2 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

As I see, another packages define such logic in separate file as part of package where it is used (ie pkg/ccl/changefeedccl/telemetry.go).
In our case, I could lift it up on one level up in server package, and rename hot_ranges_log.go to telemetry.go. WDYT?

Let's leave this as-is.

This patch introduces periodic logging of hot ranges info (ie QPS and relevant DB objects info). It is done for both dedicated and serverless cluster types, and by default called every 4h. Stats collection can be enabled/disabled, or changed interval through cluster settings. Release note: None

koorosh · 2023-04-20T16:41:27Z

bors r+

craig · 2023-04-20T17:31:18Z

Build succeeded:

Bazel Essential CI (Cockroach)

blathers-crl · 2023-04-20T17:31:35Z

Encountered an error creating backports. Some common things that can go wrong:

The backport branch might have already existed.
There was a merge conflict.
The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.

error setting reviewers, but backport branch blathers/backport-release-23.1-89511 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/101927/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.1.x failed. See errors above.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

koorosh requested a review from a team October 6, 2022 18:06

koorosh requested review from a team as code owners October 6, 2022 18:06

koorosh force-pushed the hot-ranges-telemetry branch from faf66a9 to 397bbac Compare January 27, 2023 19:57

blathers-crl bot added the O-community Originated from the community label Jan 27, 2023

koorosh changed the title ~~server, tenant: WIP. hot ranges structured logging for serverless clusters~~ server, tenant: Hot ranges structured logging for serverless clusters Jan 27, 2023

dhartunian reviewed Feb 6, 2023

View reviewed changes

koorosh commented Feb 6, 2023

View reviewed changes

koorosh force-pushed the hot-ranges-telemetry branch 3 times, most recently from da7d954 to 50cf035 Compare March 6, 2023 23:08

koorosh commented Mar 7, 2023

View reviewed changes

koorosh requested a review from dhartunian March 7, 2023 10:00

koorosh force-pushed the hot-ranges-telemetry branch from 50cf035 to 404f4ef Compare March 7, 2023 10:44

dhartunian reviewed Mar 7, 2023

View reviewed changes

koorosh commented Mar 7, 2023

View reviewed changes

knz reviewed Mar 7, 2023

View reviewed changes

koorosh force-pushed the hot-ranges-telemetry branch from 404f4ef to ac26d88 Compare March 8, 2023 11:08

koorosh requested a review from a team March 8, 2023 11:08

koorosh commented Mar 8, 2023

View reviewed changes

kvoli reviewed Mar 9, 2023

View reviewed changes

koorosh commented Mar 12, 2023

View reviewed changes

kvoli mentioned this pull request Mar 13, 2023

kvserver/load: add logging for top k replicas by load #98108

Closed

koorosh force-pushed the hot-ranges-telemetry branch from ac26d88 to c3b0ebf Compare March 20, 2023 11:40

koorosh force-pushed the hot-ranges-telemetry branch from 03ac01e to ade87ae Compare March 27, 2023 09:00

koorosh requested review from kvoli and dhartunian March 27, 2023 14:20

kvoli reviewed Mar 27, 2023

View reviewed changes

koorosh requested a review from knz March 28, 2023 10:43

koorosh force-pushed the hot-ranges-telemetry branch from ade87ae to 45cf6b5 Compare March 28, 2023 10:50

kvoli approved these changes Mar 29, 2023

View reviewed changes

koorosh added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Mar 29, 2023

dhartunian reviewed Apr 20, 2023

View reviewed changes

koorosh force-pushed the hot-ranges-telemetry branch from 45cf6b5 to 6344ac6 Compare April 20, 2023 09:33

koorosh requested a review from a team April 20, 2023 09:33

koorosh force-pushed the hot-ranges-telemetry branch from 6344ac6 to b288e88 Compare April 20, 2023 10:59

koorosh commented Apr 20, 2023

View reviewed changes

dhartunian approved these changes Apr 20, 2023

View reviewed changes

koorosh force-pushed the hot-ranges-telemetry branch from b288e88 to 440aa71 Compare April 20, 2023 14:28

craig bot merged commit ac03a55 into cockroachdb:master Apr 20, 2023

blathers-crl bot mentioned this pull request Apr 20, 2023

release-23.1: server, tenant: Hot ranges structured logging for serverless clusters #101927

Closed

koorosh mentioned this pull request Apr 24, 2023

release-23.1: server, tenant: Hot ranges structured logging #102123

Merged

kevin-v-ngo mentioned this pull request Mar 27, 2024

Capture diagnostics and observability information for jobs in our telemetry pipeline #121233

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server, tenant: Hot ranges structured logging for serverless clusters #89511

server, tenant: Hot ranges structured logging for serverless clusters #89511

koorosh commented Oct 6, 2022 •

edited

Loading

cockroach-teamcity commented Oct 6, 2022

blathers-crl bot commented Jan 27, 2023

dhartunian left a comment

koorosh left a comment

koorosh left a comment

dhartunian left a comment

koorosh left a comment

knz left a comment

koorosh left a comment

koorosh commented Mar 8, 2023

kvoli left a comment •

edited

Loading

koorosh left a comment

thtruo commented Mar 17, 2023

blathers-crl bot commented Mar 20, 2023

koorosh commented Mar 27, 2023

kvoli left a comment

koorosh commented Mar 28, 2023

kvoli left a comment

koorosh commented Mar 29, 2023

koorosh commented Mar 29, 2023

dhartunian left a comment

koorosh left a comment

dhartunian left a comment

koorosh commented Apr 20, 2023

craig bot commented Apr 20, 2023

blathers-crl bot commented Apr 20, 2023

server, tenant: Hot ranges structured logging for serverless clusters #89511

server, tenant: Hot ranges structured logging for serverless clusters #89511

Conversation

koorosh commented Oct 6, 2022 • edited Loading

cockroach-teamcity commented Oct 6, 2022

blathers-crl bot commented Jan 27, 2023

dhartunian left a comment

Choose a reason for hiding this comment

koorosh left a comment

Choose a reason for hiding this comment

koorosh left a comment

Choose a reason for hiding this comment

dhartunian left a comment

Choose a reason for hiding this comment

koorosh left a comment

Choose a reason for hiding this comment

knz left a comment

Choose a reason for hiding this comment

koorosh left a comment

Choose a reason for hiding this comment

koorosh commented Mar 8, 2023

kvoli left a comment • edited Loading

Choose a reason for hiding this comment

koorosh left a comment

Choose a reason for hiding this comment

thtruo commented Mar 17, 2023

blathers-crl bot commented Mar 20, 2023

koorosh commented Mar 27, 2023

kvoli left a comment

Choose a reason for hiding this comment

koorosh commented Mar 28, 2023

kvoli left a comment

Choose a reason for hiding this comment

koorosh commented Mar 29, 2023

koorosh commented Mar 29, 2023

dhartunian left a comment

Choose a reason for hiding this comment

koorosh left a comment

Choose a reason for hiding this comment

dhartunian left a comment

Choose a reason for hiding this comment

koorosh commented Apr 20, 2023

craig bot commented Apr 20, 2023

blathers-crl bot commented Apr 20, 2023

koorosh commented Oct 6, 2022 •

edited

Loading

kvoli left a comment •

edited

Loading