
Alert Instance gets resolved and immediately recreated when multiple Kibana Instances are running #69623

Closed
simioa opened this issue Jun 19, 2020 · 5 comments · Fixed by #71335
Labels: bug, Feature:Alerting, Feature:Metrics UI, Team:Infra Monitoring UI - DEPRECATED (use Team:obs-ux-infra_services), triage_needed

Comments

simioa commented Jun 19, 2020

Kibana version: 7.8.0

Elasticsearch version: 7.8.0

Browser version: Chrome 83.0.4103.106

Original install method (e.g. download page, yum, from source, etc.): Docker

Describe the bug:
I created an Alert that always fires with the new Alerting Framework.
I noticed that the Alert Instance gets resolved and immediately recreated with a different Instance ID on every scheduled run.
This only happens when more than one Kibana Instance is running.
With two Kibana Instances, the Instance ID flaps back and forth between the same two IDs; I did not test this with more than two Instances.

Steps to reproduce:

  1. Create a cluster consisting of three Elasticsearch instances, two Kibana instances, and one Metricbeat instance, but only start one of the Kibana instances for now.

I used the following docker-compose file:
version: '2.2'
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.8.0
    container_name: es01
    environment:
      - node.name=es01
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es02,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data01:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    networks:
      - elastic

  es02:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.8.0
    container_name: es02
    environment:
      - node.name=es02
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es01,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data02:/usr/share/elasticsearch/data
    ports:
      - 9201:9200   # map host 9201 to the container's 9200, where Elasticsearch listens
    networks:
      - elastic

  es03:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.8.0
    container_name: es03
    environment:
      - node.name=es03
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es01,es02
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data03:/usr/share/elasticsearch/data
    ports:
      - 9202:9200   # map host 9202 to the container's 9200, where Elasticsearch listens
    networks:
      - elastic

  kib01:
    image: docker.elastic.co/kibana/kibana:7.8.0
    container_name: kib01
    ports:
      - 5601:5601
    environment:
      ELASTICSEARCH_URL: http://es01:9200
      ELASTICSEARCH_HOSTS: http://es01:9200
      XPACK_ENCRYPTEDSAVEDOBJECTS_ENCRYPTIONKEY: "3Qqi3n5Z251egfai9DiVcNnOqlCD3YBV"
    networks:
      - elastic

  kib02:
    image: docker.elastic.co/kibana/kibana:7.8.0
    container_name: kib02
    ports:
      - 5602:5601
    environment:
      ELASTICSEARCH_URL: http://es01:9200
      ELASTICSEARCH_HOSTS: http://es01:9200
      XPACK_ENCRYPTEDSAVEDOBJECTS_ENCRYPTIONKEY: "3Qqi3n5Z251egfai9DiVcNnOqlCD3YBV"
    networks:
      - elastic

  metricbeat:
    image: docker.elastic.co/beats/metricbeat:7.8.0
    container_name: metricbeat
    volumes:
      - "$PWD/metricbeat.docker.yml:/usr/share/metricbeat/metricbeat.yml:ro"
    environment:
      ELASTICSEARCH_HOSTS: "es01:9200"
    networks:
      - elastic

volumes:
  data01:
    driver: local
  data02:
    driver: local
  data03:
    driver: local

networks:
  elastic:
    driver: bridge

and the following metricbeat.docker.yml:

metricbeat.config:
  modules:
    path: ${path.config}/modules.d/*.yml
    # Reload module configs as they change:
    reload.enabled: false

metricbeat.modules:
- module: system
  metricsets:
    - "cpu"
  period: 10s
  enabled: true

output.elasticsearch:
  hosts: '${ELASTICSEARCH_HOSTS:elasticsearch:9200}'
  username: '${ELASTICSEARCH_USERNAME:}'
  password: '${ELASTICSEARCH_PASSWORD:}'
  2. Wait for Metricbeat to index some documents.
  3. Execute the following script to create the metricbeat-* index pattern, the connector, and the Alert that always fires:
#!/bin/bash

#Create Index Pattern
curl 'http://localhost:5601/api/saved_objects/index-pattern' \
  -H 'kbn-version: 7.8.0' \
  -H 'Content-Type: application/json' \
  --data-binary '{"attributes":{"title":"metricbeat-*","timeFieldName":"@timestamp","fields":"[]"}}' \
  --compressed

#Create Connector
action_id=$(curl 'http://localhost:5601/api/actions/action' \
  -H 'kbn-version: 7.8.0' \
  -H 'Content-Type: application/json' \
  --data-binary '{"actionTypeId":".index","config":{"index":"test"},"secrets":{},"name":"test"}' \
  --compressed \
  | jq -r '.id')

#Create Alert
curl 'http://localhost:5601/api/alert' \
  -H 'kbn-version: 7.8.0' \
  -H 'Content-Type: application/json' \
  --data-binary "{\"params\":{\"criteria\":[{\"aggType\":\"avg\",\"comparator\":\">\",\"threshold\":[0.3],\"timeSize\":1,\"timeUnit\":\"m\",\"metric\":\"system.cpu.idle.norm.pct\"}],\"sourceId\":\"default\"},\"consumer\":\"alerting\",\"alertTypeId\":\"metrics.alert.threshold\",\"schedule\":{\"interval\":\"30s\"},\"actions\":[{\"id\":\"$action_id\",\"actionTypeId\":\".index\",\"group\":\"metrics.threshold.fired\",\"params\":{\"documents\":[{\"alert\":true}]}}],\"tags\":[],\"name\":\"test\",\"throttle\":\"50m\"}" \
  --compressed

#Create .kibana-event-log-7.8.0-* Index Pattern
curl 'http://localhost:5601/api/saved_objects/index-pattern' \
  -H 'kbn-version: 7.8.0' \
  -H 'Content-Type: application/json' \
  --data-binary '{"attributes":{"title":".kibana-event-log-7.8.0-*","timeFieldName":"@timestamp","fields":"[]"}}' \
  --compressed

  4. Wait a few minutes; the Alert should work just fine right now.
  5. Start the second Kibana instance.
  6. As soon as the second Kibana instance is working, the Alert Instance starts getting resolved and recreated.
     This also has an impact on alert throttling: because the Alert Instance gets deleted and recreated, the throttle is also reset, which leads to a re-notification on every run.

Expected behavior:
The Alert Instance should not get resolved and immediately recreated when a second Kibana instance is running.

Provide logs and/or server output (if relevant):
Logs from the .kibana-event-log-* indices:

Jun 19, 2020 @ 19:40:34.949 eventLog starting
Jun 19, 2020 @ 19:41:37.454 alert executed: metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test'
Jun 19, 2020 @ 19:41:37.455 metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test' created new instance: '81300b10-ea7a-4b42-9ae6-098ef0aa98c4-'
Jun 19, 2020 @ 19:41:38.114 alert: metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test' instanceId: '81300b10-ea7a-4b42-9ae6-098ef0aa98c4-' scheduled actionGroup: 'metrics.threshold.fired' action: .index:6dee19a5-9fd1-40a8-9765-db33ae077615
Jun 19, 2020 @ 19:41:40.652 action executed: .index:6dee19a5-9fd1-40a8-9765-db33ae077615: test
Jun 19, 2020 @ 19:42:10.375 alert executed: metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test'
Jun 19, 2020 @ 19:42:43.376 alert executed: metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test'
Jun 19, 2020 @ 19:43:16.297 alert executed: metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test'
Jun 19, 2020 @ 19:43:49.299 alert executed: metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test'
Jun 19, 2020 @ 19:44:22.276 alert executed: metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test'
Jun 19, 2020 @ 19:44:55.297 alert executed: metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test'
Jun 19, 2020 @ 19:45:28.253 alert executed: metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test'
Jun 19, 2020 @ 19:46:01.324 alert executed: metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test'
Jun 19, 2020 @ 19:46:34.309 alert executed: metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test'
Jun 19, 2020 @ 19:46:56.367 eventLog starting <-- Second Kibana Instance started here!
Jun 19, 2020 @ 19:47:05.442 alert executed: metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test'
Jun 19, 2020 @ 19:47:05.443 metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test' created new instance: '1cfcb860-7d9d-4bc3-98b7-a6c3ff20cb68-'
Jun 19, 2020 @ 19:47:05.444 metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test' resolved instance: '81300b10-ea7a-4b42-9ae6-098ef0aa98c4-'
Jun 19, 2020 @ 19:47:06.361 alert: metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test' instanceId: '1cfcb860-7d9d-4bc3-98b7-a6c3ff20cb68-' scheduled actionGroup: 'metrics.threshold.fired' action: .index:6dee19a5-9fd1-40a8-9765-db33ae077615
Jun 19, 2020 @ 19:47:08.392 action executed: .index:6dee19a5-9fd1-40a8-9765-db33ae077615: test
Jun 19, 2020 @ 19:47:37.196 alert executed: metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test'
Jun 19, 2020 @ 19:47:37.197 metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test' created new instance: '81300b10-ea7a-4b42-9ae6-098ef0aa98c4-'
Jun 19, 2020 @ 19:47:37.197 metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test' resolved instance: '1cfcb860-7d9d-4bc3-98b7-a6c3ff20cb68-'
Jun 19, 2020 @ 19:47:38.002 alert: metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test' instanceId: '81300b10-ea7a-4b42-9ae6-098ef0aa98c4-' scheduled actionGroup: 'metrics.threshold.fired' action: .index:6dee19a5-9fd1-40a8-9765-db33ae077615
Jun 19, 2020 @ 19:47:40.210 action executed: .index:6dee19a5-9fd1-40a8-9765-db33ae077615: test
Jun 19, 2020 @ 19:48:08.346 alert executed: metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test'
Jun 19, 2020 @ 19:48:08.347 metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test' created new instance: '1cfcb860-7d9d-4bc3-98b7-a6c3ff20cb68-'
Jun 19, 2020 @ 19:48:08.348 metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test' resolved instance: '81300b10-ea7a-4b42-9ae6-098ef0aa98c4-'
Jun 19, 2020 @ 19:48:09.028 alert: metrics.alert.threshold:3c39b610-3621-4dd6-9ad5-a27315db0213: 'test' instanceId: '1cfcb860-7d9d-4bc3-98b7-a6c3ff20cb68-' scheduled actionGroup: 'metrics.threshold.fired' action: .index:6dee19a5-9fd1-40a8-9765-db33ae077615
Jun 19, 2020 @ 19:48:10.230 action executed: .index:6dee19a5-9fd1-40a8-9765-db33ae077615: test

@simioa simioa closed this as completed Jun 19, 2020
@simioa simioa reopened this Jun 19, 2020
@timroes timroes added bug Fixes for quality problems that affect the customer experience Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jun 22, 2020
@elasticmachine (Contributor) commented:

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote mikecote added Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services and removed Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jun 23, 2020
@elasticmachine (Contributor) commented:

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@weltenwort weltenwort added the Feature:Metrics UI Metrics UI feature label Jun 25, 2020

simioa commented Jul 9, 2020

I ran some tests, but I don't know whether this is the root cause or whether my understanding of the procedure is correct.
In https://github.com/elastic/kibana/blob/7.8/x-pack/plugins/infra/server/lib/alerting/metric_threshold/metric_threshold_executor.ts#L291
alertId is used to build the Instance ID, and it appears to be different for every Kibana instance / task runner(?).
As a consequence, the Alert Instances created on Kibana A get unscheduled on Kibana B, with new Instances created, and vice versa.
To test this, I changed the alertID to a static string that is the same on both Kibana instances, and now the Alert Instance no longer gets resolved and immediately recreated.
Throttling also seems to work now.


Zacqary commented Jul 9, 2020

@simioa We still need to create a unique alertID string for each instance, so that multiple alerts that are watching the same group don't overwrite one another. But it looks like using a uuid to do this is problematic, since it gets regenerated every time Kibana restarts. We'll need to figure out an alternative so that alertIDs stay the same.

@Zacqary Zacqary self-assigned this Jul 9, 2020

Zacqary commented Jul 9, 2020

Actually as it turns out alertInstanceFactory already handles uniquifying alert instances based on which particular alert is generating them. The uuid code was a holdover from our prototype that we wrote before that functionality was in there. Removing it entirely should fix this issue.
