Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

operator: Use cluster monitoring alertmanager by default on openshift clusters #7272

Merged
merged 13 commits into from
Oct 4, 2022

Conversation

aminesnow
Copy link
Contributor

@aminesnow aminesnow commented Sep 27, 2022

What this PR does / why we need it:
This PR enables the use of cluster monitoring alertmanager by default on Openshift clusters when applicable. The configuration can however be overridden by the user config.

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-0.1%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0.6%

@aminesnow aminesnow marked this pull request as ready for review September 28, 2022 16:15
@aminesnow aminesnow requested a review from a team as a code owner September 28, 2022 16:15
Mohamed-Amine Bouqsimi added 2 commits September 28, 2022 18:15
@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Copy link
Collaborator

@periklis periklis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general it looks good to me. Only a couple of code separation improvements suggested to keep openshift things in a single package.

Comment on lines 253 to 265
if stack.Spec.Tenants != nil && opts.Stack.Tenants.Mode == lokiv1.OpenshiftLogging {
var svc corev1.Service
key := client.ObjectKey{Name: manifests.MonitoringSVCOperated, Namespace: manifests.MonitoringNS}

err = k.Get(ctx, key, &svc)
if err != nil && !apierrors.IsNotFound(err) {
return kverrors.Wrap(err, "failed to lookup alertmanager service", "name", key)
}

if err == nil {
opts.Ruler.OCPAlertManagerEnabled = true
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to push this code block in a separate package internal to handlers package. This handler is already too big and needs some uncluttering, let's start with new additions like this one.

Comment on lines 63 to 87
if opt.Stack.Tenants != nil && opt.Ruler.OCPAlertManagerEnabled && opt.Stack.Tenants.Mode == lokiv1.OpenshiftLogging {
rulerEnabled = true

if opt.Ruler.Spec == nil {
opt.Ruler.Spec = &lokiv1beta1.RulerConfigSpec{
AlertManagerSpec: &lokiv1beta1.AlertManagerSpec{},
}
}

if opt.Ruler.Spec.AlertManagerSpec == nil || len(opt.Ruler.Spec.AlertManagerSpec.Endpoints) == 0 {
ams := &lokiv1beta1.AlertManagerSpec{
Endpoints: []string{"https://_web._tcp.alertmanager-operated.openshift-monitoring.svc"},
EnableV2: true,
DiscoverySpec: &lokiv1beta1.AlertManagerDiscoverySpec{
EnableSRV: true,
RefreshInterval: "1m",
},
}

if err := mergo.Merge(opt.Ruler.Spec.AlertManagerSpec, ams); err != nil {
return config.Options{}, kverrors.Wrap(err, "failed merging RulerSpec options")
}
}

}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general pattern to apply openshift partial patches for automation looks like this:

  1. Create a configure method in the openshift package
  2. Apply the configure after ConfigOptions() returns and handle any error

See ConfigureDeploymentForTenantMode.

Comment on lines 47 to 51
monitoringSVCMain = "alertmanager-main"
// MonitoringNS is the namespace containing cluster monitoring objects such as alertmanager.
MonitoringNS = "openshift-monitoring"
// MonitoringSVCOperated is the name of the alertmanager service used for alerts.
MonitoringSVCOperated = "alertmanager-operated"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these belongs to the openshift package.

wantOptions: nil,
},
{
desc: "openshift-logging mode",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a fourth mode to test nowadays openshift-network

operator/internal/manifests/config_test.go Show resolved Hide resolved
@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
- querier/queryrange	-0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Comment on lines 24 to 28
if opt.OpenShiftOptions.BuildOpts.OCPAlertManagerEnabled {
if err = ConfigureOptionsForMode(&cfg, opt.Stack.Tenants.Mode); err != nil {
return nil, "", err
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you flip the outer if block inside openshift.ConfigureOptions(cfg). This way we limit exposing OpenShiftOptions outside of the openshift package.

Comment on lines 124 to 139
// OCPAlertManagerEnabled returns true if the Openshift AlertManager is present in the cluster.
func OCPAlertManagerEnabled(ctx context.Context, opts manifests.Options, k k8s.Client) (bool, error) {
if opts.Stack.Tenants != nil && opts.Stack.Tenants.Mode == lokiv1.OpenshiftLogging {
var svc corev1.Service
key := client.ObjectKey{Name: manifests.MonitoringSVCOperated, Namespace: manifests.MonitoringNS}

err := k.Get(ctx, key, &svc)
if err != nil && !apierrors.IsNotFound(err) {
return false, kverrors.Wrap(err, "failed to lookup alertmanager service", "name", key)
}

return err == nil, nil
}

return false, nil
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we create a separate openshift package and place the code in a file alertmanager.go, naming the function as openshift.alertManagerSvcExists

Comment on lines 245 to 254
func ConfigureOptionsForMode(cfg *config.Options, mode lokiv1.ModeType) error {
switch mode {
case lokiv1.Static, lokiv1.Dynamic:
return nil // nothing to configure
case lokiv1.OpenshiftLogging, lokiv1.OpenshiftNetwork:
return openshift.ConfigureOptions(cfg)
}

return nil
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this to the gateway_tenants.go file where all the other configure function live.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why gateway_tenants? it has nothing to do with the gateway..

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the filename is wrong nowadays it should be more tenants.go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I see, I will rename it then to avoid confusion

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename in a separate PR, because this is a crucial file that will break many PRs right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy that!

GatewaySvcTargetPort string
RulerName string
Labels map[string]string
OCPAlertManagerEnabled bool
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can skip the OCP prefix here, as the field is accessible via openshift.BuiltOptions.AlertManagerEnabled

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Copy link
Collaborator

@periklis periklis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

Comment on lines 18 to 30
if opts.Stack.Tenants != nil && opts.Stack.Tenants.Mode == lokiv1.OpenshiftLogging {
var svc corev1.Service
key := client.ObjectKey{Name: openshift.MonitoringSVCOperated, Namespace: openshift.MonitoringNS}

err := k.Get(ctx, key, &svc)
if err != nil && !apierrors.IsNotFound(err) {
return false, kverrors.Wrap(err, "failed to lookup alertmanager service", "name", key)
}

return err == nil, nil
}

return false, nil
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if opts.Stack.Tenants != nil && opts.Stack.Tenants.Mode == lokiv1.OpenshiftLogging {
var svc corev1.Service
key := client.ObjectKey{Name: openshift.MonitoringSVCOperated, Namespace: openshift.MonitoringNS}
err := k.Get(ctx, key, &svc)
if err != nil && !apierrors.IsNotFound(err) {
return false, kverrors.Wrap(err, "failed to lookup alertmanager service", "name", key)
}
return err == nil, nil
}
return false, nil
if opts.Stack.Tenants == nil || opts.Stack.Tenants.Mode != lokiv1.OpenshiftLogging {
return false, nil
}
var svc corev1.Service
key := client.ObjectKey{Name: openshift.MonitoringSVCOperated, Namespace: openshift.MonitoringNS}
err := k.Get(ctx, key, &svc)
if err != nil && !apierrors.IsNotFound(err) {
return false, kverrors.Wrap(err, "failed to lookup alertmanager service", "name", key)
}
return err == nil, nil

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Copy link
Collaborator

@periklis periklis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor bits and we are set to go.

Comment on lines 254 to 260
enabled, err := openshift.AlertManagerSVCExists(ctx, opts, k)
if err != nil {
ll.Error(err, "failed to check OCP AlertManager")
return err
}

opts.OpenShiftOptions.BuildOpts.AlertManagerEnabled = enabled
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this up to the the if-branch on Line 173 to make this call optional only if rules are enabled.

@@ -45,7 +55,7 @@ func LokiConfigMap(opt Options) (*corev1.ConfigMap, string, error) {
}

// ConfigOptions converts Options to config.Options
func ConfigOptions(opt Options) config.Options {
func ConfigOptions(opt Options) (config.Options, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing is throwing an error here anymore. Let's remove the error return type.

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@periklis periklis merged commit 4103b32 into grafana:main Oct 4, 2022
lxwzy pushed a commit to lxwzy/loki that referenced this pull request Nov 7, 2022
changhyuni pushed a commit to changhyuni/loki that referenced this pull request Nov 8, 2022
Abuelodelanada pushed a commit to canonical/loki that referenced this pull request Dec 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants