
Bosun sending notifications for closed and inactive alerts #1690

Closed
angadsingh opened this issue Apr 1, 2016 · 20 comments

@angadsingh

We have a very simple rule file with 3 notifications (HTTP POST to PagerDuty and Slack, plus email) and a bunch of alert rules which trigger them. We are facing a weird issue wherein the following happens:

  • alert triggers, sends notifications
  • a human acks the alert
  • human solves problem, alert becomes inactive
  • human closes the alert
  • notification still keeps triggering (the alert is nowhere to be seen in the Bosun UI/API) - forever!

To explain it through logs, this is quite literally what we're seeing:

2016/04/01 07:56:37 info: check.go:513: check alert masked.masked.write.rate.too.low start
2016/04/01 07:26:38 info: check.go:537: check alert masked.masked.write.rate.too.low done (1.378029647s): 0 crits, 0 warns, 0 unevaluated, 0 unknown
2016/04/01 07:26:38 info: alertRunner.go:55: runHistory on masked.masked.write.rate.too.low took 54.852815ms
2016/04/01 07:26:39 info: search.go:205: Backing up last data to redis
2016/04/01 07:28:20 info: notify.go:57: [bosun] critical: component xyz write rate too low: 0.00 records/minute in {adaptor=masked-masked-masked,colo=xyz,stream=writeAttributeToKafka}
2016/04/01 07:28:20 info: notify.go:57: [bosun] critical: component xyz write rate too low: 0.00 records/minute in {adaptor=masked-masked-masked,colo=xyz,stream=writeActivityToKafka}
2016/04/01 07:28:20 info: notify.go:57: [bosun] critical: component xyz write rate too low: 0.00 records/minute in {adaptor=masked-masked-masked,colo=xyz,stream=writeAttributeToKafka}
2016/04/01 07:28:20 info: notify.go:57: [bosun] critical: component xyz write rate too low: 0.00 records/minute in {adaptor=masked-masked-masked,colo=xyz,stream=writeActivityToKafka}
2016/04/01 07:28:20 info: notify.go:57: [bosun] critical: component xyz write rate too low: 0.00 records/minute in {adaptor=masked-masked-masked,colo=xyz,stream=writeAttributeToKafka}
2016/04/01 07:28:20 info: notify.go:57: [bosun] critical: component xyz write rate too low: 0.00 records/minute in {adaptor=masked-masked-masked,colo=xyz,stream=writeActivityToKafka}
2016/04/01 07:28:20 info: notify.go:115: relayed alert masked.masked.write.rate.too.low{adaptor=masked-masked-masked,colo=xyz,stream=writeAttributeToKafka} to [alert-group@xyz.com] sucessfully. Subject: 148 bytes. Body: 3500 bytes.
2016/04/01 07:28:20 info: notify.go:115: relayed alert masked.masked.write.rate.too.low{adaptor=masked-masked-masked,colo=xyz,stream=writeActivityToKafka} to [alert-group@xyz.com] sucessfully. Subject: 147 bytes. Body: 3497 bytes.
2016/04/01 07:28:20 info: notify.go:80: post notification successful for alert masked.masked.write.rate.too.low{adaptor=masked-masked-masked,colo=xyz,stream=writeAttributeToKafka}. Response code 200.
2016/04/01 07:28:20 info: notify.go:80: post notification successful for alert masked.masked.write.rate.too.low{adaptor=masked-masked-masked,colo=xyz,stream=writeActivityToKafka}. Response code 200.
2016/04/01 07:28:20 info: notify.go:80: post notification successful for alert masked.masked.write.rate.too.low{adaptor=masked-masked-masked,colo=xyz,stream=writeAttributeToKafka}. Response code 200.
2016/04/01 07:28:20 info: notify.go:80: post notification successful for alert masked.masked.write.rate.too.low{adaptor=masked-masked-masked,colo=xyz,stream=writeActivityToKafka}. Response code 200.

@angadsingh reopened this Apr 1, 2016
@kylebrandt (Member)

Can you share the configuration of the notification declarations? (No need for real URLs or emails.) I just want to see if there are chains, etc.

@angadsingh (Author)

Thanks for the quick reply, @kylebrandt.

Here's the complete config:

##################### basic settings #######################

stateFile = /opt/bosun/bosun.state
redisHost = localhost:6379
redisDb = 0
graphiteHost = http://graphitehost

checkFrequency=5m

#disable dashboard grouping
minGroupSize=10000

hostname=http://hostname

emailFrom = from@xyz.com
smtpHost = localhost:25

##################### Notifications #######################

notification foo-alerts-email {
    email = alerts-group@xyz.com
    next = foo-alerts-email
    timeout = 2h
    runOnActions = false
    print = true
}

notification foo-service1-email {
    email = alert.this.engineer@xyz.com,alert.that.engineer@xyz.com
    next = foo-alerts-email
    timeout = 2h
    runOnActions = false
    print = true
}

notification foo-service2-email {
    email = alert.this.engineer@xyz.com,alert.that.engineer@xyz.com
    next = foo-alerts-email
    timeout = 2h
    runOnActions = false
    print = true
}

notification foo-service3-email {
    email = alert.this.engineer@xyz.com,alert.that.engineer@xyz.com
    next = foo-alerts-email
    timeout = 2h
    runOnActions = false
    print = true
}

notification foo-service4-email {
    email = alert.this.engineer@xyz.com,alert.that.engineer@xyz.com
    next = foo-alerts-email
    timeout = 2h
    runOnActions = false
    print = true
}

notification foo-service5-email {
    email = alert.this.engineer@xyz.com,alert.that.engineer@xyz.com
    next = foo-alerts-email
    timeout = 2h
    runOnActions = false
    print = true
}

notification foo-infra-email {
    email = alert.this.engineer@xyz.com,alert.that.engineer@xyz.com
    next = foo-alerts-email
    timeout = 2h
    runOnActions = false
    print = true
}

notification bpb-slack {
    timeout = 1h
    runOnActions = false
    print = true

    post = https://hooks.slack.com/services/ASDFASF/ASDFASETC/ASDFASDFASasfaasasf

    body = {"text": {{.|json}}}
    next = bpb-slack
}

notification foo-alerts-slack {
    timeout = 2h
    runOnActions = false
    print = true

    post = https://hooks.slack.com/services/ASDFASF/ASDFASETC/ASDFASDFASasfaasasf

    body = {"text": {{.|json}}}
    next = foo-alerts-slack
}

notification foo-alerts-warning-slack {
    timeout = 5h
    runOnActions = false
    print = true

    post = https://hooks.slack.com/services/ASDFASF/ASDFASETC/ASDFASDFASasfaasasf

    body = {"text": {{.|json}}}
    next = foo-alerts-warning-slack
}

notification foo-alerts-test-slack {
    timeout = 5h
    runOnActions = false
    print = true

    post = https://hooks.slack.com/services/ASDFASF/ASDFASETC/ASDFASDFASasfaasasf

    body = {"text": {{.|json}}}
    next = foo-alerts-test-slack
}

# TODO:
# try https://github.com/bosun-monitor/bosun/pull/1101
# use template body in pagerduty as well

notification foo-general-pagerduty {
  post = https://events.pagerduty.com/generic/2010-04-15/create_event.json
  contentType = application/json
  next = foo-general-pagerduty
  timeout = 30m
  runOnActions = false
  print = true
  body = `{
      "service_key": "sfgasdfasdafsdfa45124fasdfdas",
      "incident_key": {{.|json}},
      "event_type": "trigger",
      "description": {{.|json}},
      "client": "Bosun",
      "client_url": "http://bosun.server.xyz.com:8070/"
    }`
}

notification foo-service-pagerduty {
  post = https://events.pagerduty.com/generic/2010-04-15/create_event.json
  contentType = application/json
  next = foo-service-pagerduty
  timeout = 30m
  runOnActions = false
  print = true
  body = `{
      "service_key": "sfgasdfasdafsdfa45124fasdfdas",
      "incident_key": {{.|json}},
      "event_type": "trigger",
      "description": {{.|json}},
      "client": "Bosun",
      "client_url": "http://bosun.server.xyz.com:8070/"
    }`
}

notification foo-service3-pagerduty {
  post = https://events.pagerduty.com/generic/2010-04-15/create_event.json
  contentType = application/json
  next = foo-service3-pagerduty
  timeout = 30m
  runOnActions = false
  print = true
  body = `{
      "service_key": "sfgasdfasdafsdfa45124fasdfdas",
      "incident_key": {{.|json}},
      "event_type": "trigger",
      "description": {{.|json}},
      "client": "Bosun",
      "client_url": "http://bosun.server.xyz.com:8070/"
    }`
}

notification foo-ucm-pagerduty {
  post = https://events.pagerduty.com/generic/2010-04-15/create_event.json
  contentType = application/json
  next = foo-ucm-pagerduty
  timeout = 30m
  runOnActions = false
  print = true
  body = `{
      "service_key": "sfgasdfasdafsdfa45124fasdfdas",
      "incident_key": {{.|json}},
      "event_type": "trigger",
      "description": {{.|json}},
      "client": "Bosun",
      "client_url": "http://bosun.server.xyz.com:8070/"
    }`
}

##################### Templates #######################

template generic {
    body = `{{template "header" .}}
    {{template "def" .}}

    {{template "tags" .}}

    {{template "computation" .}}

    {{template "graph" .}}`
    subject = [bosun] {{.Last.Status}}: {{replace .Alert.Name "." " " -1}}: {{.Eval .Alert.Vars.q | printf "%.2f"}} {{if .Alert.Vars.unit_string}}{{.Alert.Vars.unit_string}}{{end}} in {{.Group}}
}

template def {
    body = `<p><strong>Alert definition:</strong>
    <table>
        <tr>
            <td>Name:</td>
            <td>{{replace .Alert.Name "." " " -1}}</td></tr>
        <tr>
            <td>Warn:</td>
            <td>{{.Alert.Warn}} {{if .Alert.Vars.unit_string}}{{.Alert.Vars.unit_string}}{{end}}</td></tr>
        <tr>
            <td>Crit:</td>
            <td>{{.Alert.Crit}} {{if .Alert.Vars.unit_string}}{{.Alert.Vars.unit_string}}{{end}}</td></tr>
    </table>`
}

template tags {
    body = `<p><strong>Tags</strong>

    <table>
        {{range $k, $v := .Group}}
            {{if eq $k "host"}}
                <tr><td>{{$k}}</td><td><a href="{{$.HostView $v}}">{{$v}}</a></td></tr>
            {{else}}
                <tr><td>{{$k}}</td><td>{{$v}}</td></tr>
            {{end}}
        {{end}}
    </table>`
}

template computation {
    body = `
    <p><strong>Computation</strong>

    <table>
        {{range .Computations}}
            <tr><td><a href="{{$.Expr .Text}}">{{.Text}}</a></td><td>{{.Value}}</td></tr>
        {{end}}
    </table>`
}

template graph {
    body = `
    Graph:
    {{if .Alert.Vars.graph}}
        {{ .Graph .Alert.Vars.graph }}
    {{end}}
    `
}

template header {
    body = `<p><a href="{{.Ack}}">Acknowledge alert</a>
    <p><a href="{{.Rule}}">View the Rule + Template in the Bosun's Rule Page</a>
    {{if .Alert.Vars.notes}}
    <p>Notes: {{.Alert.Vars.notes}}
    {{end}}
    {{if .Alert.Vars.dashboard}}
    <p><a href="{{.Alert.Vars.dashboard}}&var-colo={{.Group.colo}}&var-server={{.Group.host}}">Grafana link for {{.Alert.Vars.query}}</a>
    {{end}}
    `
}

##################### Alerts #######################

macro alertdefaults {
    ignoreUnknown = true
    template = generic

    warnNotification = foo-alerts-email,foo-alerts-warning-slack
    critNotification = foo-alerts-email,foo-alerts-slack
}

### XYZ COMPONENT ALERTS ###

alert foo.adaptor.write.rate.too.low {
    macro = alertdefaults

    $notes = Checks write rate of all adaptors and alerts on detecting that the rate is too low
    $unit_string = records/minute

    $query = '''aliasByNode(transformNull(sumSeriesWithWildcards(grep(prod.*.foo-adaptor-*.app.*.*.__execute-count.processMessage-*, ".*ToKafka.*"),4,7)),1,2,4)'''
    $metrictags = "colo.adaptor.stream"
    $metric = graphite($query, "15m", "10m", $metrictags)
    $graph = graphite($query, "1h", "", $metrictags)
    $dashboard = http://grafana.server.xyz.com:3000/dashboard/db/foo-uber-dashboard?
    $q = avg($metric)

    $num_adaptor_streams = len(t($q,""))
    $num_adaptor_streams_down = len(t($q == 0, ""))

    warn = $q < 100
    crit = $q <= 10

    critNotification = foo-service3-pagerduty
}

@kylebrandt (Member)

Which notification(s) are the ones that keep triggering after acknowledgement?

@angadsingh (Author)

So for the "foo.adaptor.write.rate.too.low" alert, all of its critical notifications keep triggering, each at its own timeout frequency:

foo-service3-pagerduty (every 30 minutes)
foo-alerts-email (every 2 hours)
foo-alerts-slack (every 2 hours)

The original incident went critical (2 criticals) more than a day ago; it was acked, fixed, and closed, yet it keeps re-triggering notifications.

@kylebrandt (Member)

Oh, and Bosun version?

@angadsingh (Author)

0.5.0-rc2

@angadsingh (Author)

Here's the complete log from the beginning of the alert instance, including the repeat triggers (real log, sed'ed and grepped to mask things):
issue-1690.txt

@lukaspustina (Contributor)

I can confirm. I'm observing the same problem on one of our Bosun instances, which we upgraded to 0.5.0-rc2 yesterday. E-mails are sent for a non-existing alarm. The OpenTSDB version is 2.2.

@kylebrandt If I can assist in tracking this down by supplying logs or configurations, please let me know.

@angadsingh (Author)

For now, we are resetting Bosun's state (i.e. clearing the Redis DB and the state file) every 60 minutes via a cron job.

@captncraig (Contributor)

I'm having a hard time reproducing this.

Using a trimmed-down config based on what you provided:

ledisDir = ../ledis_data2
checkFrequency = 30s
httpListen = :8070

template f {
    subject = "aaaa"
    body = "bbbb"
}

notification foo-alerts-email {
    next = foo-alerts-email
    timeout = 15s
    print = true
}

notification foo-alerts-slack {
    timeout = 25s
    print = true
    next = foo-alerts-slack
}

notification foo-service3-pagerduty {
  timeout = 20s
  print = true
  next = foo-service3-pagerduty
}

alert foo {
    crit = 1
    template = f
    critNotification = foo-alerts-email,foo-alerts-slack
    critNotification = foo-service3-pagerduty
}

The notifications will print continuously at the given intervals until I ack it, and then they all stop. I can verify directly in redis that they get purged from the pending notifications bucket as soon as I ack them.

I check that with ZRANGE pendingNotifications 0 -1 WITHSCORES in redis-cli. It should have an entry for each notification while the alert is "active", and should be an empty set after you ack it.

@vitorboschi (Contributor)

@captncraig I think I have some alerts currently in this state. They're ack'ed, but they still keep sending mails. Is there anything I can check to help you track down the problem?

@lukaspustina (Contributor)

After upgrading several of our Bosun installations, including migrating the old state file to ledis, we always observed the same behavior: for a few days we see repeated notifications for already acked or even closed alarms. The behavior fades away after a few days.

@captncraig (Contributor)

@vitorboschi, I pushed a small patch to a "stopNotifications" branch. Would you mind running that and seeing if it clears the problem up?

It essentially detects this scenario at the time notifications are sent and skips the send if the alert is closed or acked at that point. Not an optimal fix, but I hope it will clear up the issue.
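
For anyone skimming the thread, here is a minimal sketch in Go of the kind of send-time guard described above. It is not the actual stopNotifications patch; the Incident, Status, and shouldNotify names are hypothetical and only illustrate the idea of skipping delivery when an incident is already acked or closed at the moment of sending.

package main

import "fmt"

type Status int

const (
    StatusOpen Status = iota
    StatusAcked
    StatusClosed
)

type Incident struct {
    AlertKey string
    Status   Status
}

// shouldNotify reports whether a queued notification should still fire:
// anything already acked or closed at send time gets skipped.
func shouldNotify(i Incident) bool {
    return i.Status != StatusAcked && i.Status != StatusClosed
}

func main() {
    queued := []Incident{
        {AlertKey: "foo.adaptor.write.rate.too.low{colo=xyz}", Status: StatusClosed},
        {AlertKey: "foo.adaptor.write.rate.too.low{colo=abc}", Status: StatusOpen},
    }
    for _, i := range queued {
        if !shouldNotify(i) {
            fmt.Printf("skipping %s (acked or closed)\n", i.AlertKey)
            continue
        }
        fmt.Printf("sending notification for %s\n", i.AlertKey)
    }
}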

@vitorboschi (Contributor)

I will try it and report back with my results.

@erwindwight

I saw this when I was pointing all my Bosun hosts to use localhost for redisHost. So I made some changes: I re-configured Redis to have a master/slave setup (Redis Sentinel) and then pointed redisHost at a VIP instead of localhost. So far, so good.

@angadsingh (Author)

@erwindwight We were not even using Redis when we hit this issue; we were just using the default ledis DB.

@angadsingh (Author)

@captncraig I'm trying out a build from the stopNotifications branch, but isn't that a post-facto check you've put in? The underlying problem might be creating more such (possibly concurrency-related) issues.

@lukaspustina (Contributor)

@captncraig The patch works and prevents resending notifications for old alarms. Do you have any idea why this happens in the first place?

@vitorboschi (Contributor)

I've been using the patch for some time and it looks like everything is fine. I didn't spot any regressions either.

@captncraig (Contributor)

Found something promising. Looks like this occurs when you close an alert without acking it first. The old notification chain does not get cleared. Patch incoming.
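
To make the root cause concrete, here is a hedged illustration in Go of the difference between clearing the pending-notification chain only on Ack and clearing it on every action: if only Ack clears the chain, closing an unacked alert leaves its queued notifications firing forever. The ActionType values and pendingChains map are hypothetical, not Bosun's actual data-access API.

package main

import "fmt"

type ActionType int

const (
    ActionAcknowledge ActionType = iota
    ActionClose
    ActionForget
)

// pendingChains maps an alert key to its queued notification chain.
var pendingChains = map[string][]string{
    "foo.adaptor.write.rate.too.low": {"foo-service3-pagerduty", "foo-alerts-email", "foo-alerts-slack"},
}

// applyAction clears the pending chain on every user action, not just Ack,
// so "close without ack" no longer leaves old notifications firing.
func applyAction(alertKey string, action ActionType) {
    delete(pendingChains, alertKey)
    // ...then apply the ack/close/forget state transition itself.
}

func main() {
    applyAction("foo.adaptor.write.rate.too.low", ActionClose)
    fmt.Println("chains still pending:", len(pendingChains)) // prints 0
}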

captncraig pushed a commit that referenced this issue May 6, 2016
Clearing notifications on all actions, not just ack.

Fixes #1690