
Task manager can get in an endless loop when attempting to claim non-saved object tasks #51222

Closed
mikecote opened this issue Nov 20, 2019 · 11 comments · Fixed by #76891
Labels: bug, Feature:Task Manager, Team:ResponseOps

Comments

@mikecote
Contributor

It seems possible that, when users upgrade to 7.4, an old version of Kibana re-inserts tasks into the .kibana_task_manager index without the task: id prefix. When this happens, task manager can get into an infinite loop: it attempts to claim those tasks but issues the update call against the id that does contain the prefix (task:).

For example, a task with an id of oss_telemetry-vis_telemetry would cause task manager to attempt claiming a task with an id of task:oss_telemetry-vis_telemetry. This causes a 409 to be returned, since the version returned by the read doesn't match the version of the document being updated.

In 7.4, this could result in roughly 200 update requests per second attempting to claim the task, each coming back with a 409. In 7.5 the issue isn't as severe, but every 30 seconds a log entry like this shows up: [error][task_manager] Failed to mark Task vis_telemetry "oss_telemetry-vis_telemetry" as running: Task has been claimed by another Kibana service.

When searching for / claiming tasks, we should make sure they are actually saved objects, similar to how the migrator only migrates documents that are saved objects: https://github.com/elastic/kibana/blob/master/src/core/server/saved_objects/migrations/core/migrate_raw_docs.ts#L41.
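
A quick way to confirm the scenario from Dev Tools (a hedged sketch; the alias/index name and document ids below are the ones mentioned above and may differ per deployment):

# the stray document re-inserted without the saved-object prefix
GET /.kibana_task_manager/_doc/oss_telemetry-vis_telemetry

# the prefixed document that task manager actually updates; its version never
# matches the one returned by the claim search, hence the repeated 409s
GET /.kibana_task_manager/_doc/task:oss_telemetry-vis_telemetry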

Could this be similar to the 7.1 issue #47607?

@mikecote mikecote added bug Fixes for quality problems that affect the customer experience Feature:Task Manager Team:Stack Services labels Nov 20, 2019
@elasticmachine
Contributor

Pinging @elastic/kibana-stack-services (Team:Stack Services)

@bmcconaghy bmcconaghy added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed Team:Stack Services labels Dec 12, 2019
@ash9146

ash9146 commented Feb 18, 2020

Same error on Kibana 7.6, after a 7.4 > 7.6 upgrade.

Is there a fundamental solution? How can this be fixed?
Other than the log entries, everything seems to be working normally.


@mikecote
Contributor Author

@ash9146 to solve the issue, the objects without the task: prefix in the .kibana_task_manager index need to be removed.
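
A minimal way to spot them (a sketch assuming Dev Tools access; the .kibana_task_manager alias may need to be replaced with the concrete versioned index, and size adjusted to the number of tasks) is to list only the document ids and look for any that don't start with task::

GET /.kibana_task_manager/_search?size=1000&filter_path=hits.hits._id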

@ash9146

ash9146 commented Feb 19, 2020

@mikecote Thanks, I solved it:

  1. List the documents whose ids are missing the task: prefix
    GET /.kibana_task_manager_26/_search

  2. Remove the offending documents
    DELETE /.kibana_task_manager_26/_doc/oss_telemetry-vis_telemetry
    DELETE /.kibana_task_manager_26/_doc/Maps-maps_telemetry

  3. Verify the logs
    Good!

@wixaw

wixaw commented May 29, 2020

Hello, this did not resolve the problem for me.
I deleted everything with DELETE /.kibana_task_manager_*, but it gets recreated every time:
{"type":"log","@timestamp":"2020-05-29T14:21:18Z","tags":["error","plugins","taskManager","taskManager"],"pid":26748,"message":"Failed to mark Task vis_telemetry \"oss_telemetry-vis_telemetry\" as running: Task has been claimed by another Kibana service"}

Elastic Stack version: 7.7.0

@gmmorris
Contributor

gmmorris commented Jun 1, 2020

Hi @wixaw
Sorry you've encountered this issue, let's see if we can figure out what's wrong.
Generally speaking, this error means that you have two (or more) Kibana instances that are having trouble resolving an owner for the task and completing it. Usually, wiping the .kibana_task_manager_* indices is enough, unless something recreates the task incorrectly.

Could you expand a bit on your setup?
It would help to know how many instances of Kibana you're running: a single instance or multiple?
Is this error causing Kibana to crash in any way, or is it simply appearing repeatedly in your logs?
Does the error reappear spontaneously or after a restart of Kibana?

Are there any other errors in the logs that might give a hint to what's going on?

@wixaw

wixaw commented Jun 2, 2020

Hello @gmmorris
I only have one Kibana instance. Maybe a second one was started at some point, but it is stopped now.

This only appears in the logs. I wanted to try out the free alerting system using the Kibana logs, and if we have to handle events from those logs, having these false alerts is rather a problem.

This alert appears approximately every 15 seconds without ever stopping.

Thank you

@wixaw

wixaw commented Sep 1, 2020

Hello @gmmorris
I have the same issue on 7.9.0.
Thanks

@gmmorris
Contributor

gmmorris commented Sep 2, 2020

This only appears in the logs. I wanted to try out the free alerting system using the Kibana logs, and if we have to handle events from those logs, having these false alerts is rather a problem.

This alert appears approximately every 15 seconds without ever stopping.

Sorry, I'm a little confused: are we talking about an alert or an error in the logs?
The Alerting system is unaffected by the problem you're describing, so we might be talking at cross purposes here.

In fact, if you're seeing this in 7.9, then I suspect it's unrelated to the original issue mentioned here... 🤔

@wixaw

wixaw commented Sep 2, 2020

With the new free alerting system in Kibana, I want to parse the logs in Python.
The goal is to send alert emails: the Kibana email action requires the paid version, and we don't have the budget for it.
Except that kibana.log contains all these false positives that I can't remove:

{"type":"log","@timestamp":"2020-09-02T08:34:18Z","tags":["error","plugins","actions","actions"],"pid":27353,"message":"Server log: cpucloud - * is in a state of ALERT;;Reason:;system.cpu.total.pct is greater than a threshold of 2 (current value is 2.954);"}
..........
{"type":"log","@timestamp":"2020-09-02T08:59:45Z","tags":["error","plugins","taskManager","taskManager"],"pid":27353,"message":"Failed to mark Task vis_telemetry "oss_telemetry-vis_telemetry" as running: Task has been claimed by another Kibana service"}
{"type":"log","@timestamp":"2020-09-02T09:00:00Z","tags":["error","plugins","taskManager","taskManager"],"pid":27353,"message":"Failed to mark Task vis_telemetry "oss_telemetry-vis_telemetry" as running: Task has been claimed by another Kibana service"}
{"type":"log","@timestamp":"2020-09-02T09:00:15Z","tags":["error","plugins","taskManager","taskManager"],"pid":27353,"message":"Failed to mark Task vis_telemetry "oss_telemetry-vis_telemetry" as running: Task has been claimed by another Kibana service"}
{"type":"log","@timestamp":"2020-09-02T09:00:30Z","tags":["error","plugins","taskManager","taskManager"],"pid":27353,"message":"Failed to mark Task vis_telemetry "oss_telemetry-vis_telemetry" as running: Task has been claimed by another Kibana service"}
{"type":"log","@timestamp":"2020-09-02T09:00:45Z","tags":["error","plugins","taskManager","taskManager"],"pid":27353,"message":"Failed to mark Task vis_telemetry "oss_telemetry-vis_telemetry" as running: Task has been claimed by another Kibana service"}
{"type":"log","@timestamp":"2020-09-02T09:01:03Z","tags":["error","plugins","taskManager","taskManager"],"pid":27353,"message":"Failed to mark Task vis_telemetry "oss_telemetry-vis_telemetry" as running: Task has been claimed by another Kibana service"}
{"type":"log","@timestamp":"2020-09-02T09:01:15Z","tags":["error","plugins","taskManager","taskManager"],"pid":27353,"message":"Failed to mark Task vis_telemetry "oss_telemetry-vis_telemetry" as running: Task has been claimed by another Kibana service"}
{"type":"log","@timestamp":"2020-09-02T09:01:33Z","tags":["error","plugins","taskManager","taskManager"],"pid":27353,"message":"Failed to mark Task vis_telemetry "oss_telemetry-vis_telemetry" as running: Task has been claimed by another Kibana service"}

I have these tasks:

GET /.kibana_task_manager_*/_search
{
  "took" : 95,
  "timed_out" : false,
  "_shards" : {
    "total" : 18,
    "successful" : 18,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 496,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : ".kibana_task_manager_14",
        "_type" : "_doc",
        "_id" : "task:8273bef0-a1a9-11ea-832f-f78f416214ca",
        "_score" : 1.0,
        "_source" : {
          "migrationVersion" : {
            "task" : "7.6.0"
          },
          "task" : {
            "taskType" : "alerting:metrics.alert.threshold",
            "retryAt" : null,
            "runAt" : "2020-06-03T13:27:37.464Z",
            "scope" : [
              "alerting"
            ],
            "startedAt" : null,
            "state" : """{"alertInstances":{},"previousStartedAt":"2020-06-03T13:26:37.464Z"}""",
            "params" : """{"alertId":"1af160f1-6524-416d-a319-9fa2c49839d0","spaceId":"web"}""",
            "ownerId" : null,
            "scheduledAt" : "2020-05-29T12:39:59.839Z",
            "attempts" : 0,
            "status" : "idle"
          },
          "references" : [ ],
          "updated_at" : "2020-06-03T13:26:37.504Z",
          "type" : "task"
        }
      },
      {
        "_index" : ".kibana_task_manager_14",
        "_type" : "_doc",
        "_id" : "task:a0176290-a1ae-11ea-832f-f78f416214ca",
        "_score" : 1.0,
        "_source" : {
          "migrationVersion" : {
            "task" : "7.6.0"
          },
          "task" : {
            "taskType" : "alerting:metrics.alert.threshold",
            "retryAt" : null,
            "runAt" : "2020-06-03T13:27:37.464Z",
            "scope" : [
              "alerting"
            ],
            "startedAt" : null,
            "state" : """{"alertInstances":{},"previousStartedAt":"2020-06-03T13:26:37.464Z"}""",
            "params" : """{"alertId":"8d82398b-6d3d-4e70-821c-a8b8ad366252","spaceId":"default"}""",
            "ownerId" : null,
            "scheduledAt" : "2020-05-29T13:16:37.049Z",
            "attempts" : 0,
            "status" : "idle"
          },
          "references" : [ ],
          "updated_at" : "2020-06-03T13:26:37.519Z",
          "type" : "task"
        }
      },
      {
        "_index" : ".kibana_task_manager_14",
        "_type" : "_doc",
        "_id" : "task:38d41f61-9e66-11ea-b47a-65a89684944d",
        "_score" : 1.0,
        "_source" : {
          "migrationVersion" : {
            "task" : "7.6.0"
          },
          "task" : {
            "taskType" : "alerting:siem.signals",
            "retryAt" : null,
            "runAt" : "2020-06-03T13:31:13.519Z",
            "scope" : [
              "alerting"
            ],
            "startedAt" : null,
            "state" : """{"alertInstances":{},"previousStartedAt":"2020-06-03T13:26:13.519Z"}""",
            "params" : """{"alertId":"5f2c6925-4665-4e4c-a512-6bd092c9d80b","spaceId":"default"}""",
            "ownerId" : null,
            "scheduledAt" : "2020-05-25T09:00:46.550Z",
            "attempts" : 0,
            "status" : "idle"
          },
          "references" : [ ],
          "updated_at" : "2020-06-03T13:26:15.598Z",
          "type" : "task"
        }
      },
      {
        "_index" : ".kibana_task_manager_14",
        "_type" : "_doc",
        "_id" : "task:4992ed40-9e66-11ea-b47a-65a89684944d",
        "_score" : 1.0,
        "_source" : {
          "migrationVersion" : {
            "task" : "7.6.0"
          },
          "task" : {
            "taskType" : "alerting:siem.signals",
            "retryAt" : null,
            "runAt" : "2020-06-03T13:31:13.519Z",
            "scope" : [
              "alerting"
            ],
            "startedAt" : null,
            "state" : """{"alertInstances":{},"previousStartedAt":"2020-06-03T13:26:13.519Z"}""",
            "params" : """{"alertId":"3e349034-2818-4ee9-b3ad-45bf06436d58","spaceId":"default"}""",
            "ownerId" : null,
            "scheduledAt" : "2020-05-25T09:01:14.643Z",
            "attempts" : 0,
            "status" : "idle"
          },
          "references" : [ ],
          "updated_at" : "2020-06-03T13:26:15.599Z",
          "type" : "task"
        }
      },
      {
        "_index" : ".kibana_task_manager_14",
        "_type" : "_doc",
        "_id" : "task:38d41f60-9e66-11ea-b47a-65a89684944d",
        "_score" : 1.0,
        "_source" : {
          "migrationVersion" : {
            "task" : "7.6.0"
          },
          "task" : {
            "taskType" : "alerting:siem.signals",
            "retryAt" : null,
            "runAt" : "2020-06-03T13:31:13.520Z",
            "scope" : [
              "alerting"
            ],
            "startedAt" : null,
            "state" : """{"alertInstances":{},"previousStartedAt":"2020-06-03T13:26:13.520Z"}""",
            "params" : """{"alertId":"70c5050d-adef-4bde-a3ed-10d2378d6199","spaceId":"default"}""",
            "ownerId" : null,
            "scheduledAt" : "2020-05-25T09:00:46.550Z",
            "attempts" : 0,
            "status" : "idle"
          },
          "references" : [ ],
          "updated_at" : "2020-06-03T13:26:15.600Z",
          "type" : "task"
        }
      },
      {
        "_index" : ".kibana_task_manager_14",
        "_type" : "_doc",
        "_id" : "task:apm-telemetry-task",
        "_score" : 1.0,
        "_source" : {
          "migrationVersion" : {
            "task" : "7.6.0"
          },
          "task" : {
            "schedule" : {
              "interval" : "720m"
            },
            "taskType" : "apm-telemetry-task",
            "retryAt" : null,
            "runAt" : "2020-06-03T13:46:16.458Z",
            "scope" : [
              "apm"
            ],
            "startedAt" : null,
            "state" : "{}",
            "params" : "{}",
            "ownerId" : null,
            "scheduledAt" : "2020-05-20T13:45:21.508Z",
            "attempts" : 0,
            "status" : "idle"
          },
          "references" : [ ],
          "updated_at" : "2020-06-03T01:46:20.682Z",
          "type" : "task"
        }
      },
      {
        "_index" : ".kibana_task_manager_14",
        "_type" : "_doc",
        "_id" : "task:269aca00-9e67-11ea-b47a-65a89684944d",
        "_score" : 1.0,
        "_source" : {
          "migrationVersion" : {
            "task" : "7.6.0"
          },
          "task" : {
            "taskType" : "alerting:siem.signals",
            "retryAt" : null,
            "runAt" : "2020-06-03T13:27:40.493Z",
            "scope" : [
              "alerting"
            ],
            "startedAt" : null,
            "state" : """{"alertInstances":{},"previousStartedAt":"2020-06-03T13:12:40.493Z"}""",
            "params" : """{"alertId":"6c4a29dd-2181-46a0-9a6c-2b00816bb8d6","spaceId":"default"}""",
            "ownerId" : null,
            "scheduledAt" : "2020-05-25T09:07:25.472Z",
            "attempts" : 0,
            "status" : "idle"
          },
          "references" : [ ],
          "updated_at" : "2020-06-03T13:12:42.157Z",
          "type" : "task"
        }
      },
      {
        "_index" : ".kibana_task_manager_14",
        "_type" : "_doc",
        "_id" : "task:49938980-9e66-11ea-b47a-65a89684944d",
        "_score" : 1.0,
        "_source" : {
          "migrationVersion" : {
            "task" : "7.6.0"
          },
          "task" : {
            "taskType" : "alerting:siem.signals",
            "retryAt" : null,
            "runAt" : "2020-06-03T13:28:04.476Z",
            "scope" : [
              "alerting"
            ],
            "startedAt" : null,
            "state" : """{"alertInstances":{},"previousStartedAt":"2020-06-03T13:23:04.476Z"}""",
            "params" : """{"alertId":"8d36bc81-49f1-4bc1-89b4-3d5964c11941","spaceId":"default"}""",
            "ownerId" : null,
            "scheduledAt" : "2020-05-25T09:01:14.648Z",
            "attempts" : 0,
            "status" : "idle"
          },
          "references" : [ ],
          "updated_at" : "2020-06-03T13:23:06.162Z",
          "type" : "task"
        }
      },
      {
        "_index" : ".kibana_task_manager_14",
        "_type" : "_doc",
        "_id" : "task:4993b091-9e66-11ea-b47a-65a89684944d",
        "_score" : 1.0,
        "_source" : {
          "migrationVersion" : {
            "task" : "7.6.0"
          },
          "task" : {
            "taskType" : "alerting:siem.signals",
            "retryAt" : null,
            "runAt" : "2020-06-03T13:28:04.476Z",
            "scope" : [
              "alerting"
            ],
            "startedAt" : null,
            "state" : """{"alertInstances":{},"previousStartedAt":"2020-06-03T13:23:04.476Z"}""",
            "params" : """{"alertId":"2545d87b-54e2-4149-9fdf-7a2fc40c4150","spaceId":"default"}""",
            "ownerId" : null,
            "scheduledAt" : "2020-05-25T09:01:14.649Z",
            "attempts" : 0,
            "status" : "idle"
          },
          "references" : [ ],
          "updated_at" : "2020-06-03T13:23:06.163Z",
          "type" : "task"
        }
      },
      {
        "_index" : ".kibana_task_manager_14",
        "_type" : "_doc",
        "_id" : "task:49933b60-9e66-11ea-b47a-65a89684944d",
        "_score" : 1.0,
        "_source" : {
          "migrationVersion" : {
            "task" : "7.6.0"
          },
          "task" : {
            "taskType" : "alerting:siem.signals",
            "retryAt" : null,
            "runAt" : "2020-06-03T13:28:04.476Z",
            "scope" : [
              "alerting"
            ],
            "startedAt" : null,
            "state" : """{"alertInstances":{},"previousStartedAt":"2020-06-03T13:23:04.476Z"}""",
            "params" : """{"alertId":"48d5c624-3595-47d1-ae74-6cbc113f49d2","spaceId":"default"}""",
            "ownerId" : null,
            "scheduledAt" : "2020-05-25T09:01:14.645Z",
            "attempts" : 0,
            "status" : "idle"
          },
          "references" : [ ],
          "updated_at" : "2020-06-03T13:23:06.164Z",
          "type" : "task"
        }
      }
    ]
  }
}

@gmmorris
Contributor

gmmorris commented Sep 7, 2020

Hi @wixaw ,
I now understand the context of your issue. I don't know why you're seeing these errors (they don't seem related to this specific issue), but I think we can help you get back on track with using Alerting in the meantime.

I'd recommend against using the Kibana log in this manner. As you've found, the log can get quite noisy, and that is going to make it harder for you to act on the alerted instances: it contains a lot more than just the results of alerting, it contains everything that gets logged, including errors (which is what seems to be making things difficult for you right now :) ). To be clear, these aren't false positives, they are simply unrelated to your alerts.
We generally recommend using the log action to debug an alert rather than to actually act on it.

Instead of using the log action, have you considered using the index action? It allows you to create a document in an ES index which you can then use as your source of truth, and the index you choose will only contain the data from your actions.
You can then query that index and use the data in it as you see fit.
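
For example (a hypothetical sketch: the index name and the fields written depend entirely on how you configure the index action's document), if the action writes to an index called alerts-output with an @timestamp field, you could query the recent alert documents directly instead of parsing kibana.log:

GET /alerts-output/_search
{
  "query": { "range": { "@timestamp": { "gte": "now-15m" } } },
  "sort": [ { "@timestamp": "desc" } ]
}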
