
[ResponseOps]: error creating event log index template at startup #134098

Closed
pmuellr opened this issue Jun 9, 2022 · 4 comments · Fixed by #136363
Labels: bug, Feature:EventLog, Team:ResponseOps

Comments

@pmuellr
Member

pmuellr commented Jun 9, 2022

Kibana version: 8.3.0

Describe the bug:

Noticed this is happening intermittently during our kbn-alert-load daily runs:

error initializing elasticsearch resources: error creating index template: illegal_argument_exception: [illegal_argument_exception] Reason: index template [.kibana-event-log-8.3.0-snapshot-template] already exists initialization failed, events will not be indexed

Not great. Appears to be a race condition here:

```ts
async createIndexTemplateIfNotExists(): Promise<void> {
  const exists = await this.esContext.esAdapter.doesIndexTemplateExist(
    this.esContext.esNames.indexTemplate
  );
  // RACE: another Kibana instance can create the template between the
  // check above and the create below.
  if (!exists) {
    const templateBody = getIndexTemplate(this.esContext.esNames);
    await this.esContext.esAdapter.createIndexTemplate(
      this.esContext.esNames.indexTemplate,
      templateBody
    );
  }
}
```
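To make the window concrete, here's a minimal, self-contained sketch of the interleaving (all names here are stand-ins, not the real event log code): two concurrent callers both see `exists === false`, then both try to create, and the loser gets the "already exists" error above.

```ts
// Fake adapter whose calls yield to the event loop, so two concurrent
// callers can interleave the way two starting Kibana instances would.
const templates = new Set<string>();
const tick = () => new Promise((resolve) => setImmediate(resolve));

const fakeAdapter = {
  async doesIndexTemplateExist(name: string): Promise<boolean> {
    await tick();
    return templates.has(name);
  },
  async createIndexTemplate(name: string): Promise<void> {
    await tick();
    if (templates.has(name)) {
      throw new Error(`index template [${name}] already exists`);
    }
    templates.add(name);
  },
};

async function createIfNotExists(name: string): Promise<void> {
  if (!(await fakeAdapter.doesIndexTemplateExist(name))) {
    // Both "instances" reach this point; the second create throws.
    await fakeAdapter.createIndexTemplate(name);
  }
}

// One of these two rejects with "already exists".
await Promise.all([
  createIfNotExists('event-log-template'),
  createIfNotExists('event-log-template'),
]);
```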

I think we'll want to check the other resource creation bits as well, and perhaps when we fix this we can also fix #127029, which is somewhat related - some kind of timing issue we should be able to work around by refactoring the event log initialization (include the mappings when we create the index, not just in the template).
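As a rough illustration of that last idea (an assumption about the shape of the fix, not the actual change), the mappings from the template body could also be passed directly when creating the concrete index; `esNames.initialIndex` and `esNames.alias` are hypothetical names here.

```ts
// Hedged sketch: create the initial event log index with its mappings
// inline, so correct mappings don't depend on the index template having
// won its own creation race first.
const templateBody = getIndexTemplate(esNames);
await esClient.indices.create({
  index: esNames.initialIndex, // hypothetical initial concrete index name
  mappings: templateBody.template.mappings, // composable templates nest mappings under `template`
  aliases: { [esNames.alias]: { is_write_index: true } },
});
```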

@pmuellr added the bug, Team:ResponseOps, and Feature:EventLog labels Jun 9, 2022
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@pmuellr
Member Author

pmuellr commented Jun 9, 2022

Looks like it's happening with the alias creation as well; I haven't seen it for the index creation, though.

Happened to notice that some of this code changed in [Elasticsearch client: no longer default to using `meta: true` #124488](#124488), which shipped in 8.2.0, so that seems like a good place to start looking ...

@pmuellr
Member Author

pmuellr commented Jun 9, 2022

Looking at the implementation, it turns out that if we can't create the template, we check (again) whether it exists, since the error message we get back when it already exists isn't something we can easily match on.

So this is really not good: the second check should have returned true-ish, which would have prevented the error from being thrown, but it didn't.

```ts
public async doesIndexTemplateExist(name: string): Promise<boolean> {
  try {
    const esClient = await this.elasticsearchClientPromise;
    // Check both the legacy and the composable (index) template APIs.
    const legacyResult = await esClient.indices.existsTemplate({ name });
    const indexTemplateResult = await esClient.indices.existsIndexTemplate({ name });
    return (legacyResult as boolean) || (indexTemplateResult as boolean);
  } catch (err) {
    throw new Error(`error checking existence of index template: ${err.message}`);
  }
}

public async createIndexTemplate(name: string, template: Record<string, unknown>): Promise<void> {
  try {
    const esClient = await this.elasticsearchClientPromise;
    await esClient.indices.putIndexTemplate({
      name,
      body: template,
      create: true,
    });
  } catch (err) {
    // The error doesn't have a type attribute we can check to guarantee it's due
    // to the template already existing (only the long message), so we check
    // ourselves to see if the template now exists. This scenario would happen if
    // you start up multiple Kibana instances at the same time.
    const existsNow = await this.doesIndexTemplateExist(name);
    if (!existsNow) {
      const error = new Error(`error creating index template: ${err.message}`);
      Object.assign(error, { wrapped: err });
      throw error;
    }
  }
}
```

@pmuellr pmuellr moved this from Awaiting Triage to Todo in AppEx: ResponseOps - Execution & Connectors Jun 16, 2022
@pmuellr pmuellr self-assigned this Jun 16, 2022
@pmuellr pmuellr moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors Jun 16, 2022
@pmuellr
Member Author

pmuellr commented Jun 16, 2022

Found a different problem, but in the same place; seems like we can address it when we fix this issue.

I'm seeing a few of these in some logs:

error initializing elasticsearch resources: error creating index template: illegal_argument_exception: [illegal_argument_exception] Reason: index template [.kibana-event-log-8.2.2-template] has index patterns [.kibana-event-log-8.2.2-*] matching patterns from existing templates [ZZZ] with patterns (ZZZ => [*documents*]) that have the same priority [0], multiple index templates may not match during index creation, please use a different priority

The old "user mixing templates into 'system' indices". Unavoidable, for cases where a user uses such a broad template pattern.

Seems like we can at least bump the priority of our own template. We'll have to see what other indices used by the stack set theirs to, I think either 100 or 1000.

Not sure if we can do better. Moving to a datastream would probably fix this (but not sure), and that may be a big task, but worthy of consideration, I think. I'm not sure if we can be more precise in our pattern, but that * could basically be replaced by a regexp \d{6} as I think those are just the ILM-generated suffixes.
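For reference, a hedged sketch of what bumping the priority might look like; the value 100 is an assumption pending that survey of the other stack templates.

```ts
// Composable index templates take a top-level `priority`; the highest wins
// when multiple templates' index_patterns match the same new index.
await esClient.indices.putIndexTemplate({
  name: '.kibana-event-log-8.2.2-template',
  body: {
    index_patterns: ['.kibana-event-log-8.2.2-*'],
    priority: 100, // assumed value; stack templates reportedly use 100 or 1000
    template: {
      // settings/mappings as produced by getIndexTemplate(...)
    },
  },
});
```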

pmuellr added a commit to pmuellr/kibana that referenced this issue Jul 14, 2022
resolves elastic#134098

Adds retry logic to the initialization of elasticsearch
resources when Kibana starts up.  Recently, it seems
this has become a more noticeable error - race
conditions occur where two Kibana instances initializing
a new stack version race to create the event log resources.

We believe we'll see the end of these issues with some
retries, chunked around the 4 resource-y sections of
the initialization code.

We're using [p-retry][] (which uses [retry][]), to do an
exponential backoff starting at 2s, then 4s, 8s, 16s,
with 4 retries (so 5 actual attempted calls).  Some
randomness is added, since there's a race on.

[p-retry]: https://github.com/sindresorhus/p-retry#p-retry
[retry]: https://github.com/tim-kos/node-retry#retry
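A sketch of that retry shape with p-retry; the wrapped `initializeTemplateResources` and `logger` are stand-ins for the real initialization steps, not the actual PR code.

```ts
import pRetry from 'p-retry';

// 4 retries = 5 attempts total; exponential backoff of ~2s, 4s, 8s, 16s,
// with jitter so racing Kibana instances don't retry in lockstep.
await pRetry(() => initializeTemplateResources(), {
  retries: 4,
  minTimeout: 2000,
  factor: 2,
  randomize: true,
  onFailedAttempt: (error) => {
    logger.warn(
      `event log initialization attempt ${error.attemptNumber} failed: ${error.message}`
    );
  },
});
```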
@pmuellr pmuellr moved this from In Progress to In Review in AppEx: ResponseOps - Execution & Connectors Jul 14, 2022
Repository owner moved this from In Review to Done in AppEx: ResponseOps - Execution & Connectors Jul 19, 2022
ymao1 pushed a commit that referenced this issue Jul 19, 2022
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Jul 19, 2022
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Jul 19, 2022
kibanamachine added a commit that referenced this issue Jul 19, 2022
kibanamachine added a commit that referenced this issue Jul 19, 2022