Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stn/correctly mark api silences #1733

Merged
merged 8 commits into from
Feb 18, 2019
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
37 changes: 35 additions & 2 deletions api/v2/api.go
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,8 @@ type API struct {
peer *cluster.Peer
silences *silence.Silences
alerts provider.Alerts
muter types.Muter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If muter is only about inhibition, let's also call it that way. In addition, this could also just be a function. What do you think?

Suggested change
muter types.Muter
isInhibited func(model.LabelSet) bool

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already type MuteFunc. A function f of the suggested signature can be converted into a Muter by MuteFunc(f). I'd say accepting the interface here (which could be satisfied by the function you suggest) is preferred.

But I'm with you WRT naming the member variable with something that suggests this is only about inhibitions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Muter interface's description of its method, and its single implementater Inhibitor, is a bit confusing. Our implementation relies on a side effect of the particular impl from Inhibitor, and isn't actually checking if the alert is inhibited -- instead, if it is, internal to the method, the alert's status is updated.

This is a long way of saying I'm thinking of naming the struct member something along the lines of setInhibitedStatus, and separately (an issue or something) discuss how we should either drop or update this interface.

Because we're relying on a particular implementation detail, I think the "correct" thing to do is NOT use the interface, but the actual concrete implementation.

Comments?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite a bit has changed now. It looks good to me as it is now, but I leave the final call to @mxinden as he raised this point in the first place.

marker types.Marker
getAlertStatus getAlertStatusFn
uptime time.Time

Expand All @@ -72,9 +74,17 @@ type API struct {
type getAlertStatusFn func(prometheus_model.Fingerprint) types.AlertStatus

// NewAPI returns a new Alertmanager API v2
func NewAPI(alerts provider.Alerts, sf getAlertStatusFn, silences *silence.Silences, peer *cluster.Peer, l log.Logger) (*API, error) {
func NewAPI(
alerts provider.Alerts,
marker types.Marker,
sf getAlertStatusFn,
silences *silence.Silences,
peer *cluster.Peer,
l log.Logger,
) (*API, error) {
api := API{
alerts: alerts,
marker: marker,
getAlertStatus: sf,
peer: peer,
silences: silences,
Expand Down Expand Up @@ -115,13 +125,14 @@ func NewAPI(alerts provider.Alerts, sf getAlertStatusFn, silences *silence.Silen
}

// Update sets the configuration string to a new value.
func (api *API) Update(cfg *config.Config, resolveTimeout time.Duration) error {
func (api *API) Update(cfg *config.Config, resolveTimeout time.Duration, muter types.Muter) error {
api.mtx.Lock()
defer api.mtx.Unlock()

api.resolveTimeout = resolveTimeout
api.alertmanagerConfig = cfg
api.route = dispatch.NewRoute(cfg.Route, nil)
api.muter = muter
return nil
}

Expand Down Expand Up @@ -238,6 +249,28 @@ func (api *API) getAlertsHandler(params alert_ops.GetAlertsParams) middleware.Re
continue
}

// TODO: Add a timestamp now()+X, after which the Mutes() will
// be refreshed?
// Will check if the alert is inhibited, and update its status.
api.muter.Mutes(a.Labels)
// Will check if the alert is silenced, and update its status.
sils, err := api.silences.Query(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this logic determines whether a given alert is silenced or not. If so, that domain logic should rather live in github.com/prometheus/alertmanager/silence. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. this is now mostly (but not exactly) replicated from https://github.com/prometheus/alertmanager/blob/master/notify/notify.go#L407-L425
Perhaps this could be unified in some way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The silences package exposes this Query api so as to be externally consumed. It seems to me that this is being used correctly. If we want to change/simplify the silence package's Query api, such as adding a QueryActiveAlerts method or something, maybe that should be a separate PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with solving this later, but perhaps add a TODO then in the setAlertStatus function to somehow dedupe the code with https://github.com/prometheus/alertmanager/blob/master/notify/notify.go#L407-L425 eventually?

@mxinden is currently working on un-munging the many pieces of logic munged together in main.go, and this puts this quite significant (and partially duplicated) piece of logic somewhere in main.go. I can believe that doing the un-munging in a separate and coordinated effort might even make more sense.

silence.QState(types.SilenceStateActive),
silence.QMatches(a.Labels),
)
if err != nil {
level.Error(api.logger).Log("msg", "Querying silences failed", "err", err)
}

if len(sils) > 0 {
ids := make([]string, len(sils))
for i, s := range sils {
ids[i] = s.Id
}
api.marker.SetSilenced(a.Labels.Fingerprint(), ids...)
}

// Get alert's current status after seeing if it is suppressed.
status := api.getAlertStatus(a.Fingerprint())

if !*params.Active && status.State == types.AlertStateActive {
Expand Down
22 changes: 12 additions & 10 deletions cmd/alertmanager/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -302,6 +302,7 @@ func run() int {

apiV2, err := apiv2.NewAPI(
alerts,
marker,
marker.Status,
silences,
peer,
Expand Down Expand Up @@ -350,16 +351,6 @@ func run() int {

hash = md5HashAsMetricValue(plainCfg)

err = apiV1.Update(conf, time.Duration(conf.Global.ResolveTimeout))
if err != nil {
return err
}

err = apiV2.Update(conf, time.Duration(conf.Global.ResolveTimeout))
if err != nil {
return err
}

tmpl, err = template.FromGlobs(conf.Templates...)
if err != nil {
return err
Expand All @@ -381,6 +372,17 @@ func run() int {
peer,
logger,
)

err = apiV1.Update(conf, time.Duration(conf.Global.ResolveTimeout))
if err != nil {
return err
}

err = apiV2.Update(conf, time.Duration(conf.Global.ResolveTimeout), inhibitor)
if err != nil {
return err
}

disp = dispatch.NewDispatcher(alerts, dispatch.NewRoute(conf.Route, nil), pipeline, marker, timeoutFunc, logger)

go disp.Run()
Expand Down
9 changes: 9 additions & 0 deletions test/with_api_v2/acceptance.go
Original file line number Diff line number Diff line change
Expand Up @@ -289,6 +289,11 @@ func (amc *AlertmanagerCluster) Start() error {
return nil
}

// Members returns the underlying slice of cluster members.
func (amc *AlertmanagerCluster) Members() []*Alertmanager {
return amc.ams
}

// Start the alertmanager and wait until it is ready to receive.
func (am *Alertmanager) Start(additionalArg []string) error {
am.t.Helper()
Expand Down Expand Up @@ -555,6 +560,10 @@ func (am *Alertmanager) GenericAPIV2Call(at float64, f func()) {
am.t.Do(at, f)
}

func (am *Alertmanager) Client() *apiclient.Alertmanager {
return am.clientV2
}

func (am *Alertmanager) getURL(path string) string {
return fmt.Sprintf("http://%s%s%s", am.apiAddr, am.opts.RoutePrefix, path)
}
150 changes: 150 additions & 0 deletions test/with_api_v2/acceptance/api_test.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,150 @@
// Copyright 2018 Prometheus Team
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package test

import (
"fmt"
"testing"
"time"

"github.com/go-openapi/strfmt"
a "github.com/prometheus/alertmanager/test/with_api_v2"
"github.com/prometheus/alertmanager/test/with_api_v2/api_v2_client/client/alert"
"github.com/prometheus/alertmanager/test/with_api_v2/api_v2_client/client/silence"
"github.com/prometheus/alertmanager/test/with_api_v2/api_v2_client/models"
"github.com/prometheus/common/model"
"github.com/stretchr/testify/require"
)

// TestAlertGetReturnsCurrentStatus checks that querying the API returns the
// current status of each alert, i.e. if it is silenced or inhibited.
func TestAlertGetReturnsCurrentAlertStatus(t *testing.T) {
t.Parallel()

conf := `
route:
receiver: "default"
group_by: []
group_wait: 1s
group_interval: 10m
repeat_interval: 1h

inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname']

receivers:
- name: "default"
webhook_configs:
- url: 'http://%s'
`

at := a.NewAcceptanceTest(t, &a.AcceptanceOpts{
Tolerance: 1 * time.Second,
})
co := at.Collector("webhook")
wh := a.NewWebhook(co)

amc := at.AlertmanagerCluster(fmt.Sprintf(conf, wh.Address()), 1)
require.NoError(t, amc.Start())
defer amc.Terminate()

am := amc.Members()[0]

labelName := "alertname"
labelValue := "test1"

now := time.Now()
startsAt := strfmt.DateTime(now)
endsAt := strfmt.DateTime(now.Add(5 * time.Minute))

labels := models.LabelSet(map[string]string{labelName: labelValue, "severity": "warning"})
fp := model.LabelSet{model.LabelName(labelName): model.LabelValue(labelValue), "severity": "warning"}.Fingerprint()
pa := &models.PostableAlert{
StartsAt: startsAt,
EndsAt: endsAt,
Alert: models.Alert{Labels: labels},
}
alertParams := alert.NewPostAlertsParams()
alertParams.Alerts = models.PostableAlerts{pa}
_, err := am.Client().Alert.PostAlerts(alertParams)
require.NoError(t, err)

resp, err := am.Client().Alert.GetAlerts(nil)
require.NoError(t, err)
// No silence has been created or inhibiting alert sent, alert should
// be active.
for _, alert := range resp.Payload {
require.Equal(t, models.AlertStatusStateActive, *alert.Status.State)
}

// Wait for group_wait, so that we are in the group_interval period,
// when the pipeline won't update the alert's status.
time.Sleep(2 * time.Second)

// Create silence and verify that the alert is immediately marked
// silenced via the API.
silenceParams := silence.NewPostSilencesParams()

cm := "a"
isRegex := false
ps := &models.PostableSilence{
Silence: models.Silence{
StartsAt: &startsAt,
EndsAt: &endsAt,
Comment: &cm,
CreatedBy: &cm,
Matchers: models.Matchers{
&models.Matcher{Name: &labelName, Value: &labelValue, IsRegex: &isRegex},
},
},
}
silenceParams.Silence = ps
silenceResp, err := am.Client().Silence.PostSilences(silenceParams)
require.NoError(t, err)
silenceID := silenceResp.Payload.SilenceID

resp, err = am.Client().Alert.GetAlerts(nil)
require.NoError(t, err)
for _, alert := range resp.Payload {
require.Equal(t, models.AlertStatusStateSuppressed, *alert.Status.State)
require.Equal(t, fp.String(), *alert.Fingerprint)
require.Equal(t, 1, len(alert.Status.SilencedBy))
require.Equal(t, silenceID, alert.Status.SilencedBy[0])
}

// Create inhibiting alert and verify that original alert is
// immediately marked as inhibited.
labels["severity"] = "critical"
_, err = am.Client().Alert.PostAlerts(alertParams)
require.NoError(t, err)

inhibitingFP := model.LabelSet{model.LabelName(labelName): model.LabelValue(labelValue), "severity": "critical"}.Fingerprint()

resp, err = am.Client().Alert.GetAlerts(nil)
require.NoError(t, err)
for _, alert := range resp.Payload {
require.Equal(t, 1, len(alert.Status.SilencedBy))
require.Equal(t, silenceID, alert.Status.SilencedBy[0])
if fp.String() == *alert.Fingerprint {
require.Equal(t, models.AlertStatusStateSuppressed, *alert.Status.State)
require.Equal(t, fp.String(), *alert.Fingerprint)
require.Equal(t, 1, len(alert.Status.InhibitedBy))
require.Equal(t, inhibitingFP.String(), alert.Status.InhibitedBy[0])
}
}
}