Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server, tenant: Hot ranges structured logging for serverless clusters #89511

Merged
merged 1 commit into from
Apr 20, 2023

Conversation

koorosh
Copy link
Collaborator

@koorosh koorosh commented Oct 6, 2022

This patch introduces periodic logging of hot ranges info (ie QPS and
relevant DB objects info). It is done for both dedicated and
serverless cluster types, and by default called every 4h.
Stats collection can be enabled/disabled, or changed interval
through cluster settings.

Resolves: #85415

Release note: None

@koorosh koorosh requested a review from a team October 6, 2022 18:06
@koorosh koorosh requested review from a team as code owners October 6, 2022 18:06
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@blathers-crl
Copy link

blathers-crl bot commented Jan 27, 2023

Thank you for updating your pull request.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the O-community Originated from the community label Jan 27, 2023
@koorosh koorosh changed the title server, tenant: WIP. hot ranges structured logging for serverless clusters server, tenant: Hot ranges structured logging for serverless clusters Jan 27, 2023
Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 11 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @koorosh)


pkg/server/tenant.go line 737 at r1 (raw file):

	s.sqlServer.isReady.Set(true)

	structlogging.StartHotRangesLoggingScheduler(ctx, s.stopper, s.sqlServer.tenantConnect, *s.sqlServer.internalExecutor)

should this be inside the SQLServer?


pkg/server/structlogging/hot_ranges_log.go line 49 at r1 (raw file):

func (s *hotRangesDetailsLoggingScheduler) start(ctx context.Context, stopper *stop.Stopper) {
	_ = stopper.RunAsyncTask(ctx, "hot-ranges-struct-logging", func(ctx context.Context) {
		ticker := time.NewTicker(5 * time.Second)

Commit says it's running once per hour but this says 5 seconds.

Do we want this configurable via a cluster setting? Or provide the ability to disable? I think we already have a setting that determines whether telemetry is enabled. Perhaps if that's not on, we don't start the async task at all.


pkg/server/structlogging/hot_ranges_log_test.go line 6 at r1 (raw file):

func TestStartScheduler(t *testing.T) {
	t.Skip()

I assume this needs a test :)

Copy link
Collaborator Author

@koorosh koorosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)


pkg/server/tenant.go line 737 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

should this be inside the SQLServer?

My assumption was that we need to structured logs for tenant servers only and didn't find a way to distinguish it inside of SqlServer.
Currently SqlServer exposes preStart and startServeSQL functions that doesn't look correct to init logging.
Do you think it'll be more appropriate to add postStart func that accepts args to identify if it is tenant or not and then init logging if necessary? ie

func postStart (..., tenant bool) error {
    if tenant {
      structlogging.StartHotRangesLoggingScheduler(...
    }
}

pkg/server/structlogging/hot_ranges_log.go line 49 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Commit says it's running once per hour but this says 5 seconds.

Do we want this configurable via a cluster setting? Or provide the ability to disable? I think we already have a setting that determines whether telemetry is enabled. Perhaps if that's not on, we don't start the async task at all.

good point it is still under discussion and should be updated


pkg/server/structlogging/hot_ranges_log_test.go line 6 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

I assume this needs a test :)

Will do ))

@koorosh koorosh force-pushed the hot-ranges-telemetry branch 3 times, most recently from da7d954 to 50cf035 Compare March 6, 2023 23:08
Copy link
Collaborator Author

@koorosh koorosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)


pkg/server/structlogging/hot_ranges_log_test.go line 6 at r1 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

Will do ))

Done.

Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 12 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @koorosh)


pkg/server/tenant.go line 737 at r1 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

My assumption was that we need to structured logs for tenant servers only and didn't find a way to distinguish it inside of SqlServer.
Currently SqlServer exposes preStart and startServeSQL functions that doesn't look correct to init logging.
Do you think it'll be more appropriate to add postStart func that accepts args to identify if it is tenant or not and then init logging if necessary? ie

func postStart (..., tenant bool) error {
    if tenant {
      structlogging.StartHotRangesLoggingScheduler(...
    }
}

👍 no need to create a separate phase. we can leave it here, thx for the notes.


pkg/server/structlogging/hot_ranges_log.go line 11 at r2 (raw file):

// licenses/APL.txt.

package structlogging

nit: can we use a different name? this seems too generic. maybe just kvtelemetry? or can we put this in an existing telemetry package somewhere?


pkg/server/structlogging/hot_ranges_log.go line 46 at r2 (raw file):

)

var TelemetryHotRangesStatsLoggingDelay = settings.RegisterDurationSetting(

do we need to respond to these cluster settings changing?


pkg/server/structlogging/hot_ranges_log.go line 56 at r2 (raw file):

// hotRangesDetailsLoggingScheduler is responsible for logging index usage stats
// on a scheduled interval.
type hotRangesDetailsLoggingScheduler struct {

nit: can we rename to hotRangesLoggingScheduler to match the function? I don't think we need "Details".


pkg/server/structlogging/hot_ranges_log.go line 121 at r2 (raw file):

}

func logEventsWithDelay(

Can you add a comment about why we want to rate limit this? It's not immediately obvious to me. the events are already generated.

Copy link
Collaborator Author

@koorosh koorosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dismissed @dhartunian from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)


pkg/server/structlogging/hot_ranges_log.go line 49 at r1 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

good point it is still under discussion and should be updated

FYI: Updated to once per day


pkg/server/structlogging/hot_ranges_log.go line 11 at r2 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: can we use a different name? this seems too generic. maybe just kvtelemetry? or can we put this in an existing telemetry package somewhere?

As I see, another packages define such logic in separate file as part of package where it is used (ie pkg/ccl/changefeedccl/telemetry.go).
In our case, I could lift it up on one level up in server package, and rename hot_ranges_log.go to telemetry.go. WDYT?


pkg/server/structlogging/hot_ranges_log.go line 46 at r2 (raw file):

Previously, dhartunian (David Hartunian) wrote…

do we need to respond to these cluster settings changing?

I think it is not necessary. This cluster setting is fetched every time on new cycle of logging stats, so updated value will be used:

logEventsWithDelay(ctx, events, stopper, TelemetryHotRangesStatsLoggingDelay.Get(&s.st.SV))

pkg/server/structlogging/hot_ranges_log.go line 56 at r2 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: can we rename to hotRangesLoggingScheduler to match the function? I don't think we need "Details".

Will do


pkg/server/structlogging/hot_ranges_log.go line 121 at r2 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Can you add a comment about why we want to rate limit this? It's not immediately obvious to me. the events are already generated.

Telemetry logging has limitation to send 10 log lines (with limited length) per second on each node and sending multiple entries (each can have arbitrary length) at the same time. This func allows defer sending logs with provided delay to make sure it doesn't exceed limits or blocks others to send logs.
I borrowed this func from pkg/sql/scheduledlogging/captured_index_usage_stats.go so going to move it to logutil package to have single version of this helper func.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @koorosh)


pkg/server/structlogging/hot_ranges_log.go line 121 at r2 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

Telemetry logging has limitation to send 10 log lines (with limited length) per second on each node and sending multiple entries (each can have arbitrary length) at the same time. This func allows defer sending logs with provided delay to make sure it doesn't exceed limits or blocks others to send logs.
I borrowed this func from pkg/sql/scheduledlogging/captured_index_usage_stats.go so going to move it to logutil package to have single version of this helper func.

Thanks. My recommendation however would be to pre-populate the event.Timestamp field (instead of relying on log.StructuredEvent doing it), so that the timestamp represents the time at which the HotRanges request was performed and not the time at which the event was logged.

@koorosh koorosh requested a review from a team March 8, 2023 11:08
Copy link
Collaborator Author

@koorosh koorosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)


pkg/server/structlogging/hot_ranges_log.go line 121 at r2 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Thanks. My recommendation however would be to pre-populate the event.Timestamp field (instead of relying on log.StructuredEvent doing it), so that the timestamp represents the time at which the HotRanges request was performed and not the time at which the event was logged.

Thank you! Done.

@koorosh
Copy link
Collaborator Author

koorosh commented Mar 8, 2023

@thtruo , I set following default values for cluster settings that manage how logs are collected:

  • server.telemetry.hot_ranges_stats.interval = 24h (collect stats once per day)
  • server.telemetry.hot_ranges_stats.enabled is True by default (allows to disable/enable hot ranges stats collection)

also it's guarded by diagnostics.reporting.enabled setting which as I understand is general approach to enable reporting.

Let me know if it looks ok or should be changed.

Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a few questions about this that maybe @koorosh or @thtruo could help me with:

The PR and commit message mention "...for serverless clusters" and "...for multi-tenant clusters". Would it be possible to add this generally? Or do we already get this under some unified architecture stuff where every cluster is at least single tenant?

The work here is great, It would be nice have more frequent logging for debugging purposes (think debug.zip). Do you have any thoughts on #98108? It feels as though there is overlap and that issue is more ad-hoc. The period required for that issue is much lower and doesn't relate to telemetry. There seems like potential from this PR to perform more frequent (non-telemetry) logging for a subset of ranges into a log file similar to the health log.

Why is the log period 24h? I assume for telemetry you wish to capture seasonality. A common form of seasonality I've seen on user clusters takes place <24h e.g. people buying items at night and very little in the morning. It would be a shame for the telemetry data a cluster emits to be specific to the start time of the SQL tenant cluster.

  • I suggest jittering or randomizing when logging occurs, with a budget for how frequent logging can be in a period.
  • I suggest lowering the period from 24h to 4h or less.

My questions and suggestions shouldn't block this PR.

Reviewed 9 of 12 files at r2, 6 of 6 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @knz, and @koorosh)

Copy link
Collaborator Author

@koorosh koorosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for really great questions to discuss!

commit message mention "...for serverless clusters" and "...for multi-tenant clusters"

I haven't considered to generalize it for UA and agree it is the right way to do but would still prefer to file separate issue for this.

Regarding interval durations, there is more question to Tommy and how this stats will be used at the end. I imagine that it shouldn't be called often because:
a) for seasonal increase of requests - it is more appropriate to think about the ways we can export hot ranges stats to prometheus at least so user can monitor this data and react on such spikes, because we don't even know if this spice of requests expected or not.
b) telemetry might make use of infrequent stats collection to observe if there's some gradual regression (ie increased QPS) or to correlate hot ranges stats to schema change stats for instance.

@thtruo what's your thoughts on this?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @knz, and @kvoli)

@thtruo
Copy link
Contributor

thtruo commented Mar 17, 2023

Re time intervals - I’m okay collecting more often in a tighter time interval. I’m not sure how randomizing/jittering will look, but I’d be okay starting with a fixed interval that’s more frequent. I was originally open to collecting every 8hrs, but based on Austen’s comment I’m also okay with a 4hr interval (as long as we don’t hit the logging pipeline’s constraints) - so let’s do that too.

@koorosh Will this telemetry fire for Dedicated clusters? The above comments suggest this telemetry will be applied to Serverless/multi-tenant clusters. Can you confirm? I ideally want to see this telemetry for all cloud clusters (including Dedicated and Serverless) - the original issue #85415 does not specify only Serverless. Would capturing Dedicated telemetry come in a separate PR?

@blathers-crl
Copy link

blathers-crl bot commented Mar 20, 2023

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • We notice you have more than one commit in your PR. We try break logical changes into separate commits, but commits such as "fix typo" or "address review commits" should be squashed into one commit and pushed with --force
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@koorosh
Copy link
Collaborator Author

koorosh commented Mar 27, 2023

Updated PR to collect telemetry for all cluster types and changed default interval to 4 hours.

Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM however I'm not that familiar with best practices in this area of the code. I'll defer to @dhartunian or @knz for stamping.

I would suggest updating the PR title to mention the functionality is for all clusters.

Reviewed 6 of 6 files at r4, 5 of 5 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @koorosh)


-- commits line 2 at r5:
Are you planning on squashing the two commits together?


-- commits line 4 at r5:
nit: reword to This patch introduces periodic logging of hot ranges info...

@koorosh
Copy link
Collaborator Author

koorosh commented Mar 28, 2023

Thank you Austen, updated commit message and squashed commits as well!

Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to stamp this, as it LGTM and but perhaps reach out to the other reviewers.

Are you going to backport this to 23.1?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @knz, and @koorosh)

@koorosh
Copy link
Collaborator Author

koorosh commented Mar 29, 2023

Yes, it has to be backported, almost forgot about this!

I'll wait for another pair of eyes on this, TFTR!

@koorosh koorosh added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Mar 29, 2023
@koorosh
Copy link
Collaborator Author

koorosh commented Mar 29, 2023

@knz , @dhartunian appreciate your feedback on this!

Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just have some minor comments on code organization and stuff and then it should be good to merge.

Reviewed 1 of 12 files at r2, 3 of 6 files at r3, 5 of 6 files at r4, 3 of 5 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @koorosh)


pkg/server/server_sql.go line 1677 at r6 (raw file):

	pgL net.Listener,
	socketFileCfg *string,
	sServer serverpb.TenantStatusServer,

nit: call this statusServer for clarity


pkg/server/tenant.go line 737 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

👍 no need to create a separate phase. we can leave it here, thx for the notes.

I think I preferred the placement of the scheduler here instead of startServeSQL just to keep the SQL startup as minimal as possible. It's fine if the code is duplicated in server.go.

For instance, you end up passing in sqlServer.internalExecutor to startServeSQL, which is a bit confusing.


pkg/server/structlogging/hot_ranges_log.go line 46 at r2 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

I think it is not necessary. This cluster setting is fetched every time on new cycle of logging stats, so updated value will be used:

logEventsWithDelay(ctx, events, stopper, TelemetryHotRangesStatsLoggingDelay.Get(&s.st.SV))

Great. Thanks!


pkg/server/structlogging/hot_ranges_log.go line 91 at r6 (raw file):

			select {
			case <-stopper.ShouldQuiesce():
				return

nit: also respond to ctx cancellation

Copy link
Collaborator Author

@koorosh koorosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @knz, and @kvoli)


-- commits line 2 at r5:

Previously, kvoli (Austen) wrote…

Are you planning on squashing the two commits together?

Done.


-- commits line 4 at r5:

Previously, kvoli (Austen) wrote…

nit: reword to This patch introduces periodic logging of hot ranges info...

Done.


pkg/server/tenant.go line 737 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

I think I preferred the placement of the scheduler here instead of startServeSQL just to keep the SQL startup as minimal as possible. It's fine if the code is duplicated in server.go.

For instance, you end up passing in sqlServer.internalExecutor to startServeSQL, which is a bit confusing.

Done.


pkg/server/structlogging/hot_ranges_log.go line 91 at r6 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: also respond to ctx cancellation

Done.


pkg/server/server_sql.go line 1677 at r6 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: call this statusServer for clarity

Done.

Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: thanks @koorosh !

Reviewed 2 of 8 files at r7.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz, @koorosh, and @kvoli)


pkg/server/structlogging/hot_ranges_log.go line 11 at r2 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

As I see, another packages define such logic in separate file as part of package where it is used (ie pkg/ccl/changefeedccl/telemetry.go).
In our case, I could lift it up on one level up in server package, and rename hot_ranges_log.go to telemetry.go. WDYT?

Let's leave this as-is.

This patch introduces periodic logging of hot ranges info (ie QPS and
relevant DB objects info). It is done for both dedicated and
serverless cluster types, and by default called every 4h.
Stats collection can be enabled/disabled, or changed interval
through cluster settings.

Release note: None
@koorosh
Copy link
Collaborator Author

koorosh commented Apr 20, 2023

bors r+

@craig
Copy link
Contributor

craig bot commented Apr 20, 2023

Build succeeded:

@blathers-crl
Copy link

blathers-crl bot commented Apr 20, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error setting reviewers, but backport branch blathers/backport-release-23.1-89511 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/101927/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1 O-community Originated from the community
Projects
None yet
Development

Successfully merging this pull request may close these issues.

telemetry: track hot ranges from a cluster
6 participants