-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server, tenant: Hot ranges structured logging for serverless clusters #89511
Conversation
faf66a9
to
397bbac
Compare
Thank you for updating your pull request. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 11 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @koorosh)
pkg/server/tenant.go
line 737 at r1 (raw file):
s.sqlServer.isReady.Set(true) structlogging.StartHotRangesLoggingScheduler(ctx, s.stopper, s.sqlServer.tenantConnect, *s.sqlServer.internalExecutor)
should this be inside the SQLServer?
pkg/server/structlogging/hot_ranges_log.go
line 49 at r1 (raw file):
func (s *hotRangesDetailsLoggingScheduler) start(ctx context.Context, stopper *stop.Stopper) { _ = stopper.RunAsyncTask(ctx, "hot-ranges-struct-logging", func(ctx context.Context) { ticker := time.NewTicker(5 * time.Second)
Commit says it's running once per hour but this says 5 seconds.
Do we want this configurable via a cluster setting? Or provide the ability to disable? I think we already have a setting that determines whether telemetry is enabled. Perhaps if that's not on, we don't start the async task at all.
pkg/server/structlogging/hot_ranges_log_test.go
line 6 at r1 (raw file):
func TestStartScheduler(t *testing.T) { t.Skip()
I assume this needs a test :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)
pkg/server/tenant.go
line 737 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
should this be inside the SQLServer?
My assumption was that we need to structured logs for tenant servers only and didn't find a way to distinguish it inside of SqlServer.
Currently SqlServer
exposes preStart
and startServeSQL
functions that doesn't look correct to init logging.
Do you think it'll be more appropriate to add postStart
func that accepts args to identify if it is tenant or not and then init logging if necessary? ie
func postStart (..., tenant bool) error {
if tenant {
structlogging.StartHotRangesLoggingScheduler(...
}
}
pkg/server/structlogging/hot_ranges_log.go
line 49 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
Commit says it's running once per hour but this says 5 seconds.
Do we want this configurable via a cluster setting? Or provide the ability to disable? I think we already have a setting that determines whether telemetry is enabled. Perhaps if that's not on, we don't start the async task at all.
good point it is still under discussion and should be updated
pkg/server/structlogging/hot_ranges_log_test.go
line 6 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
I assume this needs a test :)
Will do ))
da7d954
to
50cf035
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)
pkg/server/structlogging/hot_ranges_log_test.go
line 6 at r1 (raw file):
Previously, koorosh (Andrii Vorobiov) wrote…
Will do ))
Done.
50cf035
to
404f4ef
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 12 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @koorosh)
pkg/server/tenant.go
line 737 at r1 (raw file):
Previously, koorosh (Andrii Vorobiov) wrote…
My assumption was that we need to structured logs for tenant servers only and didn't find a way to distinguish it inside of SqlServer.
CurrentlySqlServer
exposespreStart
andstartServeSQL
functions that doesn't look correct to init logging.
Do you think it'll be more appropriate to addpostStart
func that accepts args to identify if it is tenant or not and then init logging if necessary? iefunc postStart (..., tenant bool) error { if tenant { structlogging.StartHotRangesLoggingScheduler(... } }
👍 no need to create a separate phase. we can leave it here, thx for the notes.
pkg/server/structlogging/hot_ranges_log.go
line 11 at r2 (raw file):
// licenses/APL.txt. package structlogging
nit: can we use a different name? this seems too generic. maybe just kvtelemetry
? or can we put this in an existing telemetry package somewhere?
pkg/server/structlogging/hot_ranges_log.go
line 46 at r2 (raw file):
) var TelemetryHotRangesStatsLoggingDelay = settings.RegisterDurationSetting(
do we need to respond to these cluster settings changing?
pkg/server/structlogging/hot_ranges_log.go
line 56 at r2 (raw file):
// hotRangesDetailsLoggingScheduler is responsible for logging index usage stats // on a scheduled interval. type hotRangesDetailsLoggingScheduler struct {
nit: can we rename to hotRangesLoggingScheduler
to match the function? I don't think we need "Details".
pkg/server/structlogging/hot_ranges_log.go
line 121 at r2 (raw file):
} func logEventsWithDelay(
Can you add a comment about why we want to rate limit this? It's not immediately obvious to me. the events are already generated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dismissed @dhartunian from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)
pkg/server/structlogging/hot_ranges_log.go
line 49 at r1 (raw file):
Previously, koorosh (Andrii Vorobiov) wrote…
good point it is still under discussion and should be updated
FYI: Updated to once per day
pkg/server/structlogging/hot_ranges_log.go
line 11 at r2 (raw file):
Previously, dhartunian (David Hartunian) wrote…
nit: can we use a different name? this seems too generic. maybe just
kvtelemetry
? or can we put this in an existing telemetry package somewhere?
As I see, another packages define such logic in separate file as part of package where it is used (ie pkg/ccl/changefeedccl/telemetry.go
).
In our case, I could lift it up on one level up in server
package, and rename hot_ranges_log.go
to telemetry.go
. WDYT?
pkg/server/structlogging/hot_ranges_log.go
line 46 at r2 (raw file):
Previously, dhartunian (David Hartunian) wrote…
do we need to respond to these cluster settings changing?
I think it is not necessary. This cluster setting is fetched every time on new cycle of logging stats, so updated value will be used:
logEventsWithDelay(ctx, events, stopper, TelemetryHotRangesStatsLoggingDelay.Get(&s.st.SV))
pkg/server/structlogging/hot_ranges_log.go
line 56 at r2 (raw file):
Previously, dhartunian (David Hartunian) wrote…
nit: can we rename to
hotRangesLoggingScheduler
to match the function? I don't think we need "Details".
Will do
pkg/server/structlogging/hot_ranges_log.go
line 121 at r2 (raw file):
Previously, dhartunian (David Hartunian) wrote…
Can you add a comment about why we want to rate limit this? It's not immediately obvious to me. the events are already generated.
Telemetry logging has limitation to send 10 log lines (with limited length) per second on each node and sending multiple entries (each can have arbitrary length) at the same time. This func allows defer sending logs with provided delay to make sure it doesn't exceed limits or blocks others to send logs.
I borrowed this func from pkg/sql/scheduledlogging/captured_index_usage_stats.go
so going to move it to logutil
package to have single version of this helper func.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @koorosh)
pkg/server/structlogging/hot_ranges_log.go
line 121 at r2 (raw file):
Previously, koorosh (Andrii Vorobiov) wrote…
Telemetry logging has limitation to send 10 log lines (with limited length) per second on each node and sending multiple entries (each can have arbitrary length) at the same time. This func allows defer sending logs with provided delay to make sure it doesn't exceed limits or blocks others to send logs.
I borrowed this func frompkg/sql/scheduledlogging/captured_index_usage_stats.go
so going to move it tologutil
package to have single version of this helper func.
Thanks. My recommendation however would be to pre-populate the event.Timestamp field (instead of relying on log.StructuredEvent doing it), so that the timestamp represents the time at which the HotRanges request was performed and not the time at which the event was logged.
404f4ef
to
ac26d88
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)
pkg/server/structlogging/hot_ranges_log.go
line 121 at r2 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Thanks. My recommendation however would be to pre-populate the event.Timestamp field (instead of relying on log.StructuredEvent doing it), so that the timestamp represents the time at which the HotRanges request was performed and not the time at which the event was logged.
Thank you! Done.
@thtruo , I set following default values for cluster settings that manage how logs are collected:
also it's guarded by Let me know if it looks ok or should be changed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a few questions about this that maybe @koorosh or @thtruo could help me with:
The PR and commit message mention "...for serverless clusters" and "...for multi-tenant clusters". Would it be possible to add this generally? Or do we already get this under some unified architecture stuff where every cluster is at least single tenant?
The work here is great, It would be nice have more frequent logging for debugging purposes (think debug.zip). Do you have any thoughts on #98108? It feels as though there is overlap and that issue is more ad-hoc. The period required for that issue is much lower and doesn't relate to telemetry. There seems like potential from this PR to perform more frequent (non-telemetry) logging for a subset of ranges into a log file similar to the health log.
Why is the log period 24h? I assume for telemetry you wish to capture seasonality. A common form of seasonality I've seen on user clusters takes place <24h e.g. people buying items at night and very little in the morning. It would be a shame for the telemetry data a cluster emits to be specific to the start time of the SQL tenant cluster.
- I suggest jittering or randomizing when logging occurs, with a budget for how frequent logging can be in a period.
- I suggest lowering the period from 24h to 4h or less.
My questions and suggestions shouldn't block this PR.
Reviewed 9 of 12 files at r2, 6 of 6 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @knz, and @koorosh)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for really great questions to discuss!
commit message mention "...for serverless clusters" and "...for multi-tenant clusters"
I haven't considered to generalize it for UA and agree it is the right way to do but would still prefer to file separate issue for this.
Regarding interval durations, there is more question to Tommy and how this stats will be used at the end. I imagine that it shouldn't be called often because:
a) for seasonal increase of requests - it is more appropriate to think about the ways we can export hot ranges stats to prometheus at least so user can monitor this data and react on such spikes, because we don't even know if this spice of requests expected or not.
b) telemetry might make use of infrequent stats collection to observe if there's some gradual regression (ie increased QPS) or to correlate hot ranges stats to schema change stats for instance.
@thtruo what's your thoughts on this?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @knz, and @kvoli)
Re time intervals - I’m okay collecting more often in a tighter time interval. I’m not sure how randomizing/jittering will look, but I’d be okay starting with a fixed interval that’s more frequent. I was originally open to collecting every 8hrs, but based on Austen’s comment I’m also okay with a 4hr interval (as long as we don’t hit the logging pipeline’s constraints) - so let’s do that too. @koorosh Will this telemetry fire for Dedicated clusters? The above comments suggest this telemetry will be applied to Serverless/multi-tenant clusters. Can you confirm? I ideally want to see this telemetry for all cloud clusters (including Dedicated and Serverless) - the original issue #85415 does not specify only Serverless. Would capturing Dedicated telemetry come in a separate PR? |
ac26d88
to
c3b0ebf
Compare
Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
03ac01e
to
ade87ae
Compare
Updated PR to collect telemetry for all cluster types and changed default interval to 4 hours. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM however I'm not that familiar with best practices in this area of the code. I'll defer to @dhartunian or @knz for stamping.
I would suggest updating the PR title to mention the functionality is for all clusters.
Reviewed 6 of 6 files at r4, 5 of 5 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @koorosh)
-- commits
line 2 at r5:
Are you planning on squashing the two commits together?
-- commits
line 4 at r5:
nit: reword to This patch introduces periodic logging of hot ranges info...
ade87ae
to
45cf6b5
Compare
Thank you Austen, updated commit message and squashed commits as well! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to stamp this, as it LGTM and but perhaps reach out to the other reviewers.
Are you going to backport this to 23.1?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @knz, and @koorosh)
Yes, it has to be backported, almost forgot about this! I'll wait for another pair of eyes on this, TFTR! |
@knz , @dhartunian appreciate your feedback on this! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just have some minor comments on code organization and stuff and then it should be good to merge.
Reviewed 1 of 12 files at r2, 3 of 6 files at r3, 5 of 6 files at r4, 3 of 5 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @koorosh)
pkg/server/server_sql.go
line 1677 at r6 (raw file):
pgL net.Listener, socketFileCfg *string, sServer serverpb.TenantStatusServer,
nit: call this statusServer
for clarity
pkg/server/tenant.go
line 737 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
👍 no need to create a separate phase. we can leave it here, thx for the notes.
I think I preferred the placement of the scheduler here instead of startServeSQL
just to keep the SQL startup as minimal as possible. It's fine if the code is duplicated in server.go
.
For instance, you end up passing in sqlServer.internalExecutor
to startServeSQL
, which is a bit confusing.
pkg/server/structlogging/hot_ranges_log.go
line 46 at r2 (raw file):
Previously, koorosh (Andrii Vorobiov) wrote…
I think it is not necessary. This cluster setting is fetched every time on new cycle of logging stats, so updated value will be used:
logEventsWithDelay(ctx, events, stopper, TelemetryHotRangesStatsLoggingDelay.Get(&s.st.SV))
Great. Thanks!
pkg/server/structlogging/hot_ranges_log.go
line 91 at r6 (raw file):
select { case <-stopper.ShouldQuiesce(): return
nit: also respond to ctx
cancellation
45cf6b5
to
6344ac6
Compare
6344ac6
to
b288e88
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @knz, and @kvoli)
Previously, kvoli (Austen) wrote…
Are you planning on squashing the two commits together?
Done.
Previously, kvoli (Austen) wrote…
nit: reword to
This patch introduces periodic logging of hot ranges info...
Done.
pkg/server/tenant.go
line 737 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
I think I preferred the placement of the scheduler here instead of
startServeSQL
just to keep the SQL startup as minimal as possible. It's fine if the code is duplicated inserver.go
.For instance, you end up passing in
sqlServer.internalExecutor
tostartServeSQL
, which is a bit confusing.
Done.
pkg/server/structlogging/hot_ranges_log.go
line 91 at r6 (raw file):
Previously, dhartunian (David Hartunian) wrote…
nit: also respond to
ctx
cancellation
Done.
pkg/server/server_sql.go
line 1677 at r6 (raw file):
Previously, dhartunian (David Hartunian) wrote…
nit: call this
statusServer
for clarity
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @koorosh !
Reviewed 2 of 8 files at r7.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz, @koorosh, and @kvoli)
pkg/server/structlogging/hot_ranges_log.go
line 11 at r2 (raw file):
Previously, koorosh (Andrii Vorobiov) wrote…
As I see, another packages define such logic in separate file as part of package where it is used (ie
pkg/ccl/changefeedccl/telemetry.go
).
In our case, I could lift it up on one level up inserver
package, and renamehot_ranges_log.go
totelemetry.go
. WDYT?
Let's leave this as-is.
This patch introduces periodic logging of hot ranges info (ie QPS and relevant DB objects info). It is done for both dedicated and serverless cluster types, and by default called every 4h. Stats collection can be enabled/disabled, or changed interval through cluster settings. Release note: None
b288e88
to
440aa71
Compare
bors r+ |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error setting reviewers, but backport branch blathers/backport-release-23.1-89511 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/101927/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. [] Backport to branch 23.1.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
This patch introduces periodic logging of hot ranges info (ie QPS and
relevant DB objects info). It is done for both dedicated and
serverless cluster types, and by default called every 4h.
Stats collection can be enabled/disabled, or changed interval
through cluster settings.
Resolves: #85415
Release note: None