
[ResponseOps]: error creating event log index template at startup #134098

Closed
pmuellr opened this issue Jun 9, 2022 · 4 comments · Fixed by #136363
Labels: bug, Feature:EventLog, Team:ResponseOps

Comments

@pmuellr
Member

pmuellr commented Jun 9, 2022

Kibana version: 8.3.0

Describe the bug:

Noticed this is happening intermittently during our kbn-alert-load daily runs:

error initializing elasticsearch resources: error creating index template: illegal_argument_exception: [illegal_argument_exception] Reason: index template [.kibana-event-log-8.3.0-snapshot-template] already exists initialization failed, events will not be indexed

Not great. Appears to be a race condition here:

```ts
async createIndexTemplateIfNotExists(): Promise<void> {
  const exists = await this.esContext.esAdapter.doesIndexTemplateExist(
    this.esContext.esNames.indexTemplate
  );
  // RACE: another Kibana instance can create the template between the
  // check above and the create below.
  if (!exists) {
    const templateBody = getIndexTemplate(this.esContext.esNames);
    await this.esContext.esAdapter.createIndexTemplate(
      this.esContext.esNames.indexTemplate,
      templateBody
    );
  }
}
```
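To make the window concrete, here's a minimal, self-contained sketch of the interleaving (all names here are stand-ins, not the real event log code): two concurrent callers both see `exists === false`, then both try to create, and the loser gets the "already exists" error above.

```ts
// Fake adapter whose calls yield to the event loop, so two concurrent
// callers can interleave the way two starting Kibana instances would.
const templates = new Set<string>();
const tick = () => new Promise((resolve) => setImmediate(resolve));

const fakeAdapter = {
  async doesIndexTemplateExist(name: string): Promise<boolean> {
    await tick();
    return templates.has(name);
  },
  async createIndexTemplate(name: string): Promise<void> {
    await tick();
    if (templates.has(name)) {
      throw new Error(`index template [${name}] already exists`);
    }
    templates.add(name);
  },
};

async function createIfNotExists(name: string): Promise<void> {
  if (!(await fakeAdapter.doesIndexTemplateExist(name))) {
    // Both "instances" reach this point; the second create throws.
    await fakeAdapter.createIndexTemplate(name);
  }
}

// One of these two rejects with "already exists".
await Promise.all([
  createIfNotExists('event-log-template'),
  createIfNotExists('event-log-template'),
]);
```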

I think we'll want to check the other resource creation bits as well, and perhaps when we fix this we can also fix #127029, which is somewhat related - some kind of timing issue we should be able to work around by refactoring the event log initialization (include the mappings when we create the index, not just in the template).
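As a rough illustration of that last idea (an assumption about the shape of the fix, not the actual change), the mappings from the template body could also be passed directly when creating the concrete index; `esNames.initialIndex` and `esNames.alias` are hypothetical names here.

```ts
// Hedged sketch: create the initial event log index with its mappings
// inline, so correct mappings don't depend on the index template having
// won its own creation race first.
const templateBody = getIndexTemplate(esNames);
await esClient.indices.create({
  index: esNames.initialIndex, // hypothetical initial concrete index name
  mappings: templateBody.template.mappings, // composable templates nest mappings under `template`
  aliases: { [esNames.alias]: { is_write_index: true } },
});
```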

@pmuellr added the bug, Team:ResponseOps, and Feature:EventLog labels Jun 9, 2022
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@pmuellr
Member Author

pmuellr commented Jun 9, 2022

Looks like it's happening with the alias creation as well; I haven't seen it for the index creation, though.

Happened to notice that some of this code changed in [Elasticsearch client: no longer default to using `meta: true` #124488](#124488), which shipped in 8.2.0, so that seems like a good place to start looking ...

@pmuellr
Member Author

pmuellr commented Jun 9, 2022

Looking at the implementation, it turns out that if we can't create the template, we check (again) whether it exists, since the error message we get back when it already exists isn't something we can easily match on.

So this is really not good: the second check should have returned true-ish, which would have prevented the error from being thrown, but it didn't.

```ts
public async doesIndexTemplateExist(name: string): Promise<boolean> {
  try {
    const esClient = await this.elasticsearchClientPromise;
    // Check both the legacy and the composable (index) template APIs.
    const legacyResult = await esClient.indices.existsTemplate({ name });
    const indexTemplateResult = await esClient.indices.existsIndexTemplate({ name });
    return (legacyResult as boolean) || (indexTemplateResult as boolean);
  } catch (err) {
    throw new Error(`error checking existence of index template: ${err.message}`);
  }
}

public async createIndexTemplate(name: string, template: Record<string, unknown>): Promise<void> {
  try {
    const esClient = await this.elasticsearchClientPromise;
    await esClient.indices.putIndexTemplate({
      name,
      body: template,
      create: true,
    });
  } catch (err) {
    // The error doesn't have a type attribute we can check to guarantee it's due
    // to the template already existing (only the long message), so we check
    // ourselves to see if the template now exists. This scenario would happen if
    // you start up multiple Kibana instances at the same time.
    const existsNow = await this.doesIndexTemplateExist(name);
    if (!existsNow) {
      const error = new Error(`error creating index template: ${err.message}`);
      Object.assign(error, { wrapped: err });
      throw error;
    }
  }
}
```

@pmuellr pmuellr moved this from Awaiting Triage to Todo in AppEx: ResponseOps - Execution & Connectors Jun 16, 2022
@pmuellr pmuellr self-assigned this Jun 16, 2022
@pmuellr pmuellr moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors Jun 16, 2022
@pmuellr
Member Author

pmuellr commented Jun 16, 2022

Found a different problem, but in the same place; seems like we can address it when we fix this issue.

I'm seeing a few of these in some logs:

error initializing elasticsearch resources: error creating index template: illegal_argument_exception: [illegal_argument_exception] Reason: index template [.kibana-event-log-8.2.2-template] has index patterns [.kibana-event-log-8.2.2-*] matching patterns from existing templates [ZZZ] with patterns (ZZZ => [*documents*]) that have the same priority [0], multiple index templates may not match during index creation, please use a different priority

The old "user mixing templates into 'system' indices". Unavoidable, for cases where a user uses such a broad template pattern.

Seems like we can at least bump the priority of our own template. We'll have to see what other indices used by the stack set theirs to, I think either 100 or 1000.

Not sure if we can do better. Moving to a datastream would probably fix this (but not sure), and that may be a big task, but worthy of consideration, I think. I'm not sure if we can be more precise in our pattern, but that * could basically be replaced by a regexp \d{6} as I think those are just the ILM-generated suffixes.
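For reference, a hedged sketch of what bumping the priority might look like; the value 100 is an assumption pending that survey of the other stack templates.

```ts
// Composable index templates take a top-level `priority`; the highest wins
// when multiple templates' index_patterns match the same new index.
await esClient.indices.putIndexTemplate({
  name: '.kibana-event-log-8.2.2-template',
  body: {
    index_patterns: ['.kibana-event-log-8.2.2-*'],
    priority: 100, // assumed value; stack templates reportedly use 100 or 1000
    template: {
      // settings/mappings as produced by getIndexTemplate(...)
    },
  },
});
```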

pmuellr added a commit to pmuellr/kibana that referenced this issue Jul 14, 2022
resolves elastic#134098

Adds retry logic to the initialization of elasticsearch
resources when Kibana starts up.  Recently, it seems
this has become a more noticeable error - race
conditions occur where two Kibana instances initializing
a new stack version race to create the event log resources.

We believe we'll see the end of these issues with some
retries, chunked around the 4 resource-y sections of
the initialization code.

We're using [p-retry][] (which uses [retry][]), to do an
exponential backoff starting at 2s, then 4s, 8s, 16s,
with 4 retries (so 5 actual attempted calls).  Some
randomness is added, since there's a race on.

[p-retry]: https://github.com/sindresorhus/p-retry#p-retry
[retry]: https://github.com/tim-kos/node-retry#retry
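A sketch of that retry shape with p-retry; the wrapped `initializeTemplateResources` and `logger` are stand-ins for the real initialization steps, not the actual PR code.

```ts
import pRetry from 'p-retry';

// 4 retries = 5 attempts total; exponential backoff of ~2s, 4s, 8s, 16s,
// with jitter so racing Kibana instances don't retry in lockstep.
await pRetry(() => initializeTemplateResources(), {
  retries: 4,
  minTimeout: 2000,
  factor: 2,
  randomize: true,
  onFailedAttempt: (error) => {
    logger.warn(
      `event log initialization attempt ${error.attemptNumber} failed: ${error.message}`
    );
  },
});
```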
@pmuellr pmuellr moved this from In Progress to In Review in AppEx: ResponseOps - Execution & Connectors Jul 14, 2022
Repository owner moved this from In Review to Done in AppEx: ResponseOps - Execution & Connectors Jul 19, 2022
ymao1 pushed a commit that referenced this issue Jul 19, 2022
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Jul 19, 2022
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Jul 19, 2022
kibanamachine added a commit that referenced this issue Jul 19, 2022
kibanamachine added a commit that referenced this issue Jul 19, 2022