Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CORE-7338 license: add enterprise_license_expiry_sec metric #23367

Merged
merged 5 commits into from
Sep 24, 2024

Conversation

pgellert
Copy link
Contributor

This introduces a new metric for monitoring the expiry time of the enterprise license.

cluster_features_enterprise_license_expiry_sec

  • Description: “Number of seconds remaining until the Enterprise License expires”
  • Endpoints: both public and internal metrics
  • Labels: none

Fixes https://redpandadata.atlassian.net/browse/CORE-7338

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

Features

  • A new metric (cluster_features_enterprise_license_expiry_sec) is added for easier monitoring of the enterprise license's expiry time.

@pgellert pgellert requested review from a team September 18, 2024 16:31
@pgellert pgellert self-assigned this Sep 18, 2024
@pgellert pgellert requested review from andijcr and oleiman and removed request for a team September 18, 2024 16:31
Copy link
Member

@oleiman oleiman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems straightforward. couple of structural suggestions, at your discretion

src/v/security/license.cc Outdated Show resolved Hide resolved
Comment on lines 708 to 723
if (ss::this_shard_id() != 0) {
return;
}

if (!config::shard_local_cfg().disable_metrics()) {
setup_metrics(_metrics);
}

if (!config::shard_local_cfg().disable_public_metrics()) {
setup_metrics(_public_metrics);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: I would lean toward sticking this logic in setup_metrics and calling it directly from feature_table::start or something.

not sure it's 100% needed in this case, but hitting the metrics stuff in the constructor always makes me a bit nervous.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a fair point. There's no feature_table::start method at the moment, and similarly there are no start methods on any controller-log-backed _table.cc's. I think that's because they are ss::sharded<> and they are expected to be constructed and started at the same time with ss::sharded::start(). So what I've done now is I've added a setup_metrics method that gets called inside the constructor of feature_table right after the probe is created to make it explicit that we start the probe immediately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expected to be constructed and started at the same time with ss::sharded::start()

Not necessarily, you'll see this pattern throughout application.cc

https://github.com/redpanda-data/redpanda/blob/dev/src/v/redpanda/application.cc#L2475-L2478

But yeah, it doesn't seem like this probe should need an async lifetime since the lifetime of the sharded<feature_table> tracks with the whole program...the test failures[1] are real and maybe a result of feature_table being con/destructed manually across various tests and fixtures.

[1] build - these do go away if I set _probe = nullptr in feature_table::feature_table(...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are not _table's though 😉 but yeah fair point, there are certainly examples of decoupled sharded service construction and start. I'm wondering what seastar-people would think of that.

Thanks for the pointer on the build, the cause turned out to be multiple feature_table's being constructed at the same time.

@pgellert pgellert force-pushed the elf/expiry-metric branch 2 times, most recently from c29dc0c to 1d76d0a Compare September 19, 2024 12:58
@pgellert
Copy link
Contributor Author

Force-push:

  • First: rebase to dev
  • Second: Fixed the linking error in the CI build (missing dependency on security), addressed code review feedback

@pgellert pgellert requested a review from oleiman September 19, 2024 13:00
src/v/features/BUILD Outdated Show resolved Hide resolved
@oleiman oleiman requested a review from BenPope September 19, 2024 19:52
@oleiman
Copy link
Member

oleiman commented Sep 19, 2024

Threw @BenPope on the review because he's helped me with metrics footguns before...not sure exactly what the problem is here, but my knee-jerk guess is lifetime issues.

"enterprise_license_expiry_sec",
[this]() { return calculate_expiry_metric(_parent.get_license()); },
sm::description(
"Number of seconds remaining until the Enterprise License expires"))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Number of seconds remaining until the Enterprise License expires"))
"Number of seconds remaining until the Enterprise license expires"))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've updated this now

Copy link

@kbatuigas kbatuigas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a suggestion (we won't need to capitalize "license" in the docs), lgtm otherwise

Comment on lines 186 to 188
const auto now = std::chrono::duration_cast<std::chrono::seconds>(
std::chrono::system_clock::now().time_since_epoch());
clock::now().time_since_epoch());
return now > expiry;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be tempted to reimplement this:

Suggested change
const auto now = std::chrono::duration_cast<std::chrono::seconds>(
std::chrono::system_clock::now().time_since_epoch());
clock::now().time_since_epoch());
return now > expiry;
return clock::now() > expiration();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@@ -655,6 +683,7 @@ class feature_table {

ss::gate _gate;
ss::abort_source _as;
std::unique_ptr<probe> _probe;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Holding the probes by pointer is useful when the owning class:

  1. is movable; the metrics often hold a reference to a data member
  2. has an asynchronous lifetime; to deregister the metrics on ss::future<> stop()

Even if neither of these were true, holding it by pointer is still useful as a compiler firewall; if ~feature_table() noexcept() = default; is defined in the implementation file, then probe can be forward declared, and metrics can be an implementation dep.

In order to keep your tests, you'll have to move feature_table::probe::calculate_expiry_metric to a more visible place, perhapsfeature_table::calculate_expiry_metric.

Please destroy the probe in stop().

The probe is constructed in feature_table constructor and holds a reference to it, so it's worth either making feature_table explicitly non-movable, or give it an async lifetime and construct the probe in ss::future<> start().

Suggested change
std::unique_ptr<probe> _probe;
class probe;
std::unique_ptr<probe> _probe;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation, very useful. I've changed this now to have the probe be forward declared and its implementation hidden inside the implementation file.

@@ -697,6 +706,49 @@ void feature_table::assert_compatible_version(bool override) {
}
}

feature_table::probe::probe(const feature_table& parent)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feature_table can be passed through setup_metrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left it to be passed in the constructor because then I can hold it as a reference instead of a pointer inside the probe which means that I don't have to deal with a nullptr. tldr: it's simpler to pass it in the constructor and setup_metrics gets called right after the constructor so there's not much of a difference here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thinking to capture it directly into the lambda:

             [&ft]() {
                  return calculate_expiry_metric(ft.get_license());
              },

But it's a minor point.

Comment on lines +729 to +771
if (!license) {
return -1;
}
Copy link
Member

@BenPope BenPope Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussion:

If there is no license, it may be possible to not set up this metric (or even the probe), but then setting up the probe or metric must be attached to set_license / revoke_license / feature_table_snapshot::apply.

For the latter, it may be worth:

diff --git a/src/v/features/feature_table_snapshot.cc b/src/v/features/feature_table_snapshot.cc
index 6e1cfb406e..b5477a88c5 100644
--- a/src/v/features/feature_table_snapshot.cc
+++ b/src/v/features/feature_table_snapshot.cc
@@ -42,7 +42,11 @@ void feature_table_snapshot::apply(feature_table& ft) const {
       applied_offset);
 
     ft.set_active_version(version);
-    ft._license = license;
+    if (license) {
+        ft.set_license(*license);
+    } else {
+        ft.revoke_license();
+    }

But I'm also OK with returning -1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good point to discuss but I've kept this as is for now to keep things simpler. I kinda like having the -1 explicitly present to signal that the license it unset (as opposed to just the metric missing because of network issues, Prometheus scraper being down, etc.). But I don't feel strongly either way which makes me prefer simplicity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be possible to not set up this metric (or even the probe)

I kinda like that. But yeah, doesn't seem too critical

@pgellert
Copy link
Contributor Author

Force-push:

  • code review feedback (mostly hiding the probe in the implementation file plus some minor things)
  • added a bazel build for features/test while I was there
  • fixed the failing test -- some storage fixture tests create multiple feature_table instances at the same time (specifically disk_log_builder creates one under the hood) which causes metric double-registration issues, so to solve this I followed the existing pattern and disabled metrics

@@ -232,9 +284,15 @@ feature_table::feature_table() {
}
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

declaring the destructor made this true, but I wonder if it's worth being explicit:

    static_assert(
      !std::is_move_constructible_v<feature_table>
        && !std::is_move_assignable_v<feature_table>
        && !std::is_copy_constructible_v<feature_table>
        && !std::is_copy_assignable_v<feature_table>,
      "The probe captures a reference to this");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I like it, I've adopted this now


internal_val = get_metrics_value(MetricsEndpoint.METRICS)
public_val = get_metrics_value(MetricsEndpoint.PUBLIC_METRICS)
assert internal_val == public_val, f"Mismatch: {internal_val} != {public_val}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's possible to get unlucky and have the scrapes cross over a 1 second boundary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be fine though because it's called from a wait_until with retry_on_exc=True so in that case we would just retry.

BenPope
BenPope previously approved these changes Sep 23, 2024
oleiman
oleiman previously approved these changes Sep 23, 2024
Copy link
Member

@oleiman oleiman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

src/v/features/feature_table.h Show resolved Hide resolved

auto rem = license->expiration() - now;
auto rem_capped = std::max(rem.zero(), rem);
return std::chrono::duration_cast<std::chrono::seconds>(rem_capped).count();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: rem_capped / 1s

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@pgellert pgellert dismissed stale reviews from oleiman and BenPope via f630e93 September 23, 2024 17:59
@pgellert
Copy link
Contributor Author

Force-push: adopt remaining minor improvement suggestions

@pgellert pgellert requested review from BenPope and oleiman September 23, 2024 18:01
oleiman
oleiman previously approved these changes Sep 23, 2024
Copy link
Member

@oleiman oleiman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Expose the expiration as a time_point.
These two fixture tests create multiple `feature_table`s at the same
time. This commit disables emitting metrics on these tests to prevent
double metric registration errors in the follow up commit of this PR.
Adds a metric to allow easier monitoring of when the enterprise license
is going to expiry.

The dependency on the `security` module existed even previously in the
bazel build but it was missing from the cmake build.
Adds a bazel build for the features module.

The include for base/vlog.h was unused, so it was removed instead of
adding a dependency.
Adds an integration test for the license expiry metric.
@pgellert
Copy link
Contributor Author

@pgellert pgellert requested a review from oleiman September 24, 2024 09:31
{
sm::make_gauge(
"enterprise_license_expiry_sec",
[&ft = _parent]() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My idea was so that you didn't need the _parent member.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, now I get what you meant there, that makes more sense. I will leave this as is now since the behaviour is pretty much the same.

@pgellert pgellert merged commit 3ff5d86 into redpanda-data:dev Sep 24, 2024
18 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v24.2.x

@vbotbuildovich
Copy link
Collaborator

/backport v24.1.x

@vbotbuildovich
Copy link
Collaborator

/backport v23.3.x

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v24.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-23367-v24.1.x-169 remotes/upstream/v24.1.x
git cherry-pick -x 78de1c702ef22c1becb876913674b9c5fa7de31b 1c8f66b467b0b5e84de6e6c424fbe65e825f5e98 edb5f48493aaec25ebcb10f35eb26b3ad5ae1a10 c44d9d8f53222dc4c50063cafc8ff020dc4e586d baece2ccd6c9986a85c1af108dbb5b8387e85f1f

Workflow run logs.

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v24.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-23367-v24.2.x-227 remotes/upstream/v24.2.x
git cherry-pick -x 78de1c702ef22c1becb876913674b9c5fa7de31b 1c8f66b467b0b5e84de6e6c424fbe65e825f5e98 edb5f48493aaec25ebcb10f35eb26b3ad5ae1a10 c44d9d8f53222dc4c50063cafc8ff020dc4e586d baece2ccd6c9986a85c1af108dbb5b8387e85f1f

Workflow run logs.

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-23367-v23.3.x-530 remotes/upstream/v23.3.x
git cherry-pick -x 78de1c702ef22c1becb876913674b9c5fa7de31b 1c8f66b467b0b5e84de6e6c424fbe65e825f5e98 edb5f48493aaec25ebcb10f35eb26b3ad5ae1a10 c44d9d8f53222dc4c50063cafc8ff020dc4e586d baece2ccd6c9986a85c1af108dbb5b8387e85f1f

Workflow run logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants