Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stn/correctly mark api silences #1733

Merged
merged 8 commits into from
Feb 18, 2019
Merged

Conversation

stuartnelson3
Copy link
Contributor

fixes #979

to note:

  • @beorn7 is working on Limit number of concurrent GET API requests #1723 to prevent multiple users requesting the same information in parallel
  • open to ideas on how to streamline the test writing
  • i'm interested in adding a "staleness" time on the alerts so that a status doesn't have to be updated every time, but only if the alert hasn't been checked in e.g. 15sec, but this could be added in a separate PR

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
@stuartnelson3 stuartnelson3 force-pushed the stn/correctly-mark-api-silences branch from ee85580 to 5786f49 Compare February 1, 2019 13:29
@beorn7
Copy link
Member

beorn7 commented Feb 2, 2019

On my review queue.

Note about the plans to make this more efficient: My idea would be to not just add some kind of freshness (as this would still allow out-of-date results, i.e. a naive user will still be confused if they create a silence, immediately after that reload the alerts page, and still see the silenced alerts as not-silenced). My idea is around establishing a versioning of the silence state, to be increased each time a new silence is created or an old silence is expired. Then record for each alert the highest version against which its silenced status was checked. Then, we only need to re-check if the current silence state version is higher than the recorded version for the alert. This will also help during the normal sending of notifications (where we currently have to check against all silences each time a notification is due for sending). A similar technique can be applied for inhibitions.

api/v2/api.go Outdated
@@ -53,6 +53,8 @@ type API struct {
peer *cluster.Peer
silences *silence.Silences
alerts provider.Alerts
muter types.Muter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If muter is only about inhibition, let's also call it that way. In addition, this could also just be a function. What do you think?

Suggested change
muter types.Muter
isInhibited func(model.LabelSet) bool

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already type MuteFunc. A function f of the suggested signature can be converted into a Muter by MuteFunc(f). I'd say accepting the interface here (which could be satisfied by the function you suggest) is preferred.

But I'm with you WRT naming the member variable with something that suggests this is only about inhibitions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Muter interface's description of its method, and its single implementater Inhibitor, is a bit confusing. Our implementation relies on a side effect of the particular impl from Inhibitor, and isn't actually checking if the alert is inhibited -- instead, if it is, internal to the method, the alert's status is updated.

This is a long way of saying I'm thinking of naming the struct member something along the lines of setInhibitedStatus, and separately (an issue or something) discuss how we should either drop or update this interface.

Because we're relying on a particular implementation detail, I think the "correct" thing to do is NOT use the interface, but the actual concrete implementation.

Comments?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite a bit has changed now. It looks good to me as it is now, but I leave the final call to @mxinden as he raised this point in the first place.

api/v2/api.go Outdated
// Will check if the alert is inhibited, and update its status.
api.muter.Mutes(a.Labels)
// Will check if the alert is silenced, and update its status.
sils, err := api.silences.Query(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this logic determines whether a given alert is silenced or not. If so, that domain logic should rather live in github.com/prometheus/alertmanager/silence. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. this is now mostly (but not exactly) replicated from https://github.com/prometheus/alertmanager/blob/master/notify/notify.go#L407-L425
Perhaps this could be unified in some way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The silences package exposes this Query api so as to be externally consumed. It seems to me that this is being used correctly. If we want to change/simplify the silence package's Query api, such as adding a QueryActiveAlerts method or something, maybe that should be a separate PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with solving this later, but perhaps add a TODO then in the setAlertStatus function to somehow dedupe the code with https://github.com/prometheus/alertmanager/blob/master/notify/notify.go#L407-L425 eventually?

@mxinden is currently working on un-munging the many pieces of logic munged together in main.go, and this puts this quite significant (and partially duplicated) piece of logic somewhere in main.go. I can believe that doing the un-munging in a separate and coordinated effort might even make more sense.

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks like a good start. Let's tackle the performance optimizations later once this has proven to do the right thing.

api/v2/api.go Outdated
@@ -53,6 +53,8 @@ type API struct {
peer *cluster.Peer
silences *silence.Silences
alerts provider.Alerts
muter types.Muter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already type MuteFunc. A function f of the suggested signature can be converted into a Muter by MuteFunc(f). I'd say accepting the interface here (which could be satisfied by the function you suggest) is preferred.

But I'm with you WRT naming the member variable with something that suggests this is only about inhibitions.

api/v2/api.go Outdated
// Will check if the alert is inhibited, and update its status.
api.muter.Mutes(a.Labels)
// Will check if the alert is silenced, and update its status.
sils, err := api.silences.Query(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. this is now mostly (but not exactly) replicated from https://github.com/prometheus/alertmanager/blob/master/notify/notify.go#L407-L425
Perhaps this could be unified in some way?

@beorn7
Copy link
Member

beorn7 commented Feb 5, 2019

I guess @stuartnelson3 is not planning to work on this during skiing. But let's see… :o)
I hope I can pick this up during the week. @mxinden would be great if you could continue to review.

This encapsulates the actual logic being done
previously into a single closure that gets passed
to the api when loaded/reloaded.

Additionally, the naming is now a bit clearer. The
use of the Muter interface is a bit confusing as
it was previously being used for its side effect
of marking an alert as inhibited or not, and the
return value was being ignored.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
@stuartnelson3 stuartnelson3 force-pushed the stn/correctly-mark-api-silences branch from 00967ec to e4e404f Compare February 11, 2019 12:15
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
@stuartnelson3 stuartnelson3 force-pushed the stn/correctly-mark-api-silences branch 2 times, most recently from 85d86fa to a26e464 Compare February 11, 2019 13:16
@beorn7
Copy link
Member

beorn7 commented Feb 11, 2019

Thanks, @stuartnelson3 , for doing this in your one day in between vacations.

@mxinden what do you think? I'd go for merging this (more or less) as is, and then get it straightened in the planned refactoring.

@beorn7
Copy link
Member

beorn7 commented Feb 15, 2019

I merged master into this to now check if things work as intended with the recent changes to master.

Signed-off-by: beorn7 <beorn@soundcloud.com>
@beorn7
Copy link
Member

beorn7 commented Feb 15, 2019

I added a commit for comment nits, but also found an issue. Will comment in the code.

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a question after all.

if err != nil {
return fmt.Errorf("failed to query silences: %v", err)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the code in notify.go, don't we have to unset the silence marker of no silences match the alert? I.e.
n.marker.SetSilenced(labels.Fingerprint()). (And if that's true, it would prove the point that we should not replicate the code but have it shared somewhere.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true, good catch. Updating the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. (Would prefer to actually de-replicate the code, but we can do that later.)

This was forgotten in the previous code. If an
alert is not silenced, we need to make sure the
status is updated as such when the API returns the
alert.

Signed-off-by: stuart nelson <stuartnelson3@gmail.com>
@stuartnelson3 stuartnelson3 merged commit 51eebbe into master Feb 18, 2019
@stuartnelson3 stuartnelson3 deleted the stn/correctly-mark-api-silences branch February 18, 2019 16:07
Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review @stuartnelson3, last week has been pretty busy at Red Hat.

I got one code-placement comment. Thanks for investing into the thorough end-to-end test!

@@ -476,3 +478,28 @@ func md5HashAsMetricValue(data []byte) float64 {
copy(bytes, smallSum)
return float64(binary.LittleEndian.Uint64(bytes))
}

func setAlertStatus(inhibitor *inhibit.Inhibitor, marker types.Marker, silences *silence.Silences) func(model.LabelSet) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be inside main.go? I feel like it would be better placed close to types.Marker, given that marker.Status represents the getAlertStatus call. What do you think? Might be included in the TODO below.

// TODO(beorn7): The following code is almost exactly replicated in notify/notify.go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason I put this here is because it seemed like a bundling of imperative things specific to this use case, and then passing them on down to the api. I sort of view marker.Status being getAlertStatus as a coincidence, and that we created the getAlertStatus type as a way to abstract away which specific impl we're using.

Since this particular function is tying together three different packages (silences, inhibit, and types), it felt more natural to me to happen in main. My personal (and debatable, of course) philosophy is that it's easier to define interfaces (easier to unit test), and then assemble all of them together at the top level (for a full acceptance test).

This was original inspired by a destroy all software screencast, but I've probably diverged from the author's original intent. Happy to talk more, but maybe not in a PR :)

The short answer is: I'm happy to move it if you feel strongly about it. Binding several different packages together didn't leave me with a clear signal where to put it, so main felt fine for now. When we figure out how to deduplicate this code this should be easy to remove/replace.

@stuartnelson3
Copy link
Contributor Author

Thanks for investing into the thorough end-to-end test!

Definitely! I'm hoping that we can start adding more now that there's sort of a "model" for how it works, and turn some of the verbose testing into helper methods as we see more patterns. The actual testing is fairly simple, it's just the struct creation + api call generates a lot of visual noise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Querying alert state (silence/inhibition) lags behind (by up to group_interval)
4 participants