-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CORE-7338 license: add enterprise_license_expiry_sec
metric
#23367
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems straightforward. couple of structural suggestions, at your discretion
src/v/features/feature_table.cc
Outdated
if (ss::this_shard_id() != 0) { | ||
return; | ||
} | ||
|
||
if (!config::shard_local_cfg().disable_metrics()) { | ||
setup_metrics(_metrics); | ||
} | ||
|
||
if (!config::shard_local_cfg().disable_public_metrics()) { | ||
setup_metrics(_public_metrics); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: I would lean toward sticking this logic in setup_metrics
and calling it directly from feature_table::start
or something.
not sure it's 100% needed in this case, but hitting the metrics stuff in the constructor always makes me a bit nervous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a fair point. There's no feature_table::start
method at the moment, and similarly there are no start
methods on any controller-log-backed _table.cc
's. I think that's because they are ss::sharded<>
and they are expected to be constructed and started at the same time with ss::sharded::start()
. So what I've done now is I've added a setup_metrics
method that gets called inside the constructor of feature_table
right after the probe is created to make it explicit that we start the probe immediately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expected to be constructed and started at the same time with
ss::sharded::start()
Not necessarily, you'll see this pattern throughout application.cc
https://github.com/redpanda-data/redpanda/blob/dev/src/v/redpanda/application.cc#L2475-L2478
But yeah, it doesn't seem like this probe should need an async lifetime since the lifetime of the sharded<feature_table>
tracks with the whole program...the test failures[1] are real and maybe a result of feature_table being con/destructed manually across various tests and fixtures.
[1] build - these do go away if I set _probe = nullptr
in feature_table::feature_table(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are not _table
's though 😉 but yeah fair point, there are certainly examples of decoupled sharded service construction and start. I'm wondering what seastar-people would think of that.
Thanks for the pointer on the build, the cause turned out to be multiple feature_table
's being constructed at the same time.
c29dc0c
to
1d76d0a
Compare
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/54738#01920b30-447c-414f-a20b-3d774ec8f978 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/54843#019210b8-8f03-47b2-b7d8-57a720868301 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/54843#019210b8-8f05-4210-a2a8-92316692f089 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/54843#019210ba-d69c-482f-8d6f-43239679bcd7 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/54971#0192204f-23d4-4490-94a2-39776385d208 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55015#019223a0-4ead-440f-9fa1-6c344a245886 |
Threw @BenPope on the review because he's helped me with metrics footguns before...not sure exactly what the problem is here, but my knee-jerk guess is lifetime issues. |
src/v/features/feature_table.cc
Outdated
"enterprise_license_expiry_sec", | ||
[this]() { return calculate_expiry_metric(_parent.get_license()); }, | ||
sm::description( | ||
"Number of seconds remaining until the Enterprise License expires")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Number of seconds remaining until the Enterprise License expires")) | |
"Number of seconds remaining until the Enterprise license expires")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I've updated this now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a suggestion (we won't need to capitalize "license" in the docs), lgtm otherwise
src/v/security/license.cc
Outdated
const auto now = std::chrono::duration_cast<std::chrono::seconds>( | ||
std::chrono::system_clock::now().time_since_epoch()); | ||
clock::now().time_since_epoch()); | ||
return now > expiry; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be tempted to reimplement this:
const auto now = std::chrono::duration_cast<std::chrono::seconds>( | |
std::chrono::system_clock::now().time_since_epoch()); | |
clock::now().time_since_epoch()); | |
return now > expiry; | |
return clock::now() > expiration(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@@ -655,6 +683,7 @@ class feature_table { | |||
|
|||
ss::gate _gate; | |||
ss::abort_source _as; | |||
std::unique_ptr<probe> _probe; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Holding the probes by pointer is useful when the owning class:
- is movable; the metrics often hold a reference to a data member
- has an asynchronous lifetime; to deregister the metrics on
ss::future<> stop()
Even if neither of these were true, holding it by pointer is still useful as a compiler firewall; if ~feature_table() noexcept() = default;
is defined in the implementation file, then probe can be forward declared, and metrics can be an implementation dep.
In order to keep your tests, you'll have to move feature_table::probe::calculate_expiry_metric
to a more visible place, perhapsfeature_table::calculate_expiry_metric
.
Please destroy the probe in stop()
.
The probe is constructed in feature_table
constructor and holds a reference to it, so it's worth either making feature_table
explicitly non-movable, or give it an async lifetime and construct the probe in ss::future<> start()
.
std::unique_ptr<probe> _probe; | |
class probe; | |
std::unique_ptr<probe> _probe; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed explanation, very useful. I've changed this now to have the probe be forward declared and its implementation hidden inside the implementation file.
src/v/features/feature_table.cc
Outdated
@@ -697,6 +706,49 @@ void feature_table::assert_compatible_version(bool override) { | |||
} | |||
} | |||
|
|||
feature_table::probe::probe(const feature_table& parent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feature_table
can be passed through setup_metrics
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left it to be passed in the constructor because then I can hold it as a reference instead of a pointer inside the probe which means that I don't have to deal with a nullptr
. tldr: it's simpler to pass it in the constructor and setup_metrics
gets called right after the constructor so there's not much of a difference here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thinking to capture it directly into the lambda:
[&ft]() {
return calculate_expiry_metric(ft.get_license());
},
But it's a minor point.
if (!license) { | ||
return -1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
discussion:
If there is no license, it may be possible to not set up this metric (or even the probe), but then setting up the probe or metric must be attached to set_license
/ revoke_license
/ feature_table_snapshot::apply
.
For the latter, it may be worth:
diff --git a/src/v/features/feature_table_snapshot.cc b/src/v/features/feature_table_snapshot.cc
index 6e1cfb406e..b5477a88c5 100644
--- a/src/v/features/feature_table_snapshot.cc
+++ b/src/v/features/feature_table_snapshot.cc
@@ -42,7 +42,11 @@ void feature_table_snapshot::apply(feature_table& ft) const {
applied_offset);
ft.set_active_version(version);
- ft._license = license;
+ if (license) {
+ ft.set_license(*license);
+ } else {
+ ft.revoke_license();
+ }
But I'm also OK with returning -1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a good point to discuss but I've kept this as is for now to keep things simpler. I kinda like having the -1 explicitly present to signal that the license it unset (as opposed to just the metric missing because of network issues, Prometheus scraper being down, etc.). But I don't feel strongly either way which makes me prefer simplicity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be possible to not set up this metric (or even the probe)
I kinda like that. But yeah, doesn't seem too critical
1d76d0a
to
f2fc540
Compare
|
@@ -232,9 +284,15 @@ feature_table::feature_table() { | |||
} | |||
} | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
declaring the destructor made this true, but I wonder if it's worth being explicit:
static_assert(
!std::is_move_constructible_v<feature_table>
&& !std::is_move_assignable_v<feature_table>
&& !std::is_copy_constructible_v<feature_table>
&& !std::is_copy_assignable_v<feature_table>,
"The probe captures a reference to this");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I like it, I've adopted this now
|
||
internal_val = get_metrics_value(MetricsEndpoint.METRICS) | ||
public_val = get_metrics_value(MetricsEndpoint.PUBLIC_METRICS) | ||
assert internal_val == public_val, f"Mismatch: {internal_val} != {public_val}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's possible to get unlucky and have the scrapes cross over a 1 second boundary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should be fine though because it's called from a wait_until
with retry_on_exc=True
so in that case we would just retry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
src/v/features/feature_table.cc
Outdated
|
||
auto rem = license->expiration() - now; | ||
auto rem_capped = std::max(rem.zero(), rem); | ||
return std::chrono::duration_cast<std::chrono::seconds>(rem_capped).count(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: rem_capped / 1s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
f2fc540
to
f630e93
Compare
Force-push: adopt remaining minor improvement suggestions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
Expose the expiration as a time_point.
f630e93
to
89ccd73
Compare
These two fixture tests create multiple `feature_table`s at the same time. This commit disables emitting metrics on these tests to prevent double metric registration errors in the follow up commit of this PR.
Adds a metric to allow easier monitoring of when the enterprise license is going to expiry. The dependency on the `security` module existed even previously in the bazel build but it was missing from the cmake build.
Adds a bazel build for the features module. The include for base/vlog.h was unused, so it was removed instead of adding a dependency.
Adds an integration test for the license expiry metric.
89ccd73
to
baece2c
Compare
{ | ||
sm::make_gauge( | ||
"enterprise_license_expiry_sec", | ||
[&ft = _parent]() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My idea was so that you didn't need the _parent
member.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, now I get what you meant there, that makes more sense. I will leave this as is now since the behaviour is pretty much the same.
/backport v24.2.x |
/backport v24.1.x |
/backport v23.3.x |
Failed to create a backport PR to v24.1.x branch. I tried:
|
Failed to create a backport PR to v24.2.x branch. I tried:
|
Failed to create a backport PR to v23.3.x branch. I tried:
|
This introduces a new metric for monitoring the expiry time of the enterprise license.
cluster_features_enterprise_license_expiry_sec
Fixes https://redpandadata.atlassian.net/browse/CORE-7338
Backports Required
Release Notes
Features