[ResponseOps] event log mapping from template not applied #127029

Status: Open
pmuellr opened this issue Mar 7, 2022 · 6 comments
Labels: bug, Feature:EventLog, Team:ResponseOps

Comments

pmuellr (Member) commented Mar 7, 2022

Kibana version: 7.16.3

Just saw a case where the event log index for 7.16.3, .kibana-event-log-7.16.3-000001, seems to have "auto" (dynamic) mappings set up instead of the mappings provided in the index template.

Looking at the mappings.json file in the ES diagnostics package, you can see:

  ".kibana-event-log-7.16.3-000001": {
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "ecs": {
          "properties": {
            "version": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        },

But for a previous event log index, it looks like this:

  ".kibana-event-log-7.15.0-000004": {
    "mappings": {
      "dynamic": "false",
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "ecs": {
          "properties": {
            "version": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },

I'm assuming the user didn't do this to themselves somehow by doing some ES-level surgery on the indices, which I think implies we have a race condition somewhere: an event doc gets written before the index resources are set up, so it ends up with the auto-mapping.

A quick look at the index template itself seemed to indicate it was fine, and the alias is set up ok. If the alias had been broken as well, writes would try to auto-create an index named .kibana-event-log-7.16.3 instead of going through the alias, which causes other problems.
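
One way to double-check that the template really does match that concrete index name is the simulate-index-template API. A minimal sketch, assuming an 8.x-style @elastic/elasticsearch client and an illustrative helper name (not taken from the diagnostic bundle):

import { Client } from '@elastic/elasticsearch';

// Hypothetical diagnostic helper: ask ES which composed template would apply
// if an index with this concrete name were created, and what mappings the
// index would get from it.
async function simulateTemplateForIndex(client: Client, indexName: string) {
  // POST /_index_template/_simulate_index/<indexName>
  const simulated = await client.indices.simulateIndexTemplate({ name: indexName });
  const mappings: any = simulated.template?.mappings ?? {};
  console.log('dynamic:', mappings.dynamic);
  console.log('ecs.version:', mappings.properties?.ecs?.properties?.version);
  return simulated;
}

// usage (the cluster URL is an assumption):
// const client = new Client({ node: 'http://localhost:9200' });
// await simulateTemplateForIndex(client, '.kibana-event-log-7.16.3-000001');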

pmuellr added the bug, Team:ResponseOps, and Feature:EventLog labels on Mar 7, 2022
elasticmachine (Contributor) commented

Pinging @elastic/response-ops (Team:ResponseOps)

pmuellr (Member, Author) commented Mar 8, 2022

Looked at the first event log documents for that index, wondering if some specific kind of event was managing to avoid the "ready" checks for the index resources (template, etc). Nope. The first two documents were eventLog/starting events, as expected; the event log generates a starting message at Kibana startup time. Note there are two because there are two Kibana instances in this deployment.

First two event log docs:

[
  {
    "_index": ".kibana-event-log-7.16.3-000001",
    "_type": "_doc",
    "_score": null,
    "_source": {
      "@timestamp": "2022-01-31T10:27:40.965Z",
      "event": {
        "provider": "eventLog",
        "action": "starting"
      },
      "message": "eventLog starting",
      "ecs": {
        "version": "1.8.0"
      },
      "kibana": {
        "server_uuid": "7c1898ba-6cb1-444d-92b1-b45c7fff2668",
        "version": "7.16.3"
      }
    }
  },
  {
    "_index": ".kibana-event-log-7.16.3-000001",
    "_type": "_doc",
    "_score": null,
    "_source": {
      "@timestamp": "2022-01-31T10:27:41.002Z",
      "event": {
        "provider": "eventLog",
        "action": "starting"
      },
      "message": "eventLog starting",
      "ecs": {
        "version": "1.8.0"
      },
      "kibana": {
        "server_uuid": "410c0f34-0314-4c70-94d1-abdfce00c4ab",
        "version": "7.16.3"
      }
    }
  }
]

About 6 seconds after those documents, alerting/execute-start and alerting/execute events started getting written, which is also to be expected.

So, no smoking gun yet.
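
For reference, pulling the earliest documents out of the index is just an ascending sort on @timestamp; a minimal sketch, assuming an 8.x-style @elastic/elasticsearch client:

import { Client } from '@elastic/elasticsearch';

// Return the oldest `size` documents in the given index, oldest first.
async function firstDocs(client: Client, index: string, size = 2) {
  const result = await client.search({
    index,
    size,
    sort: [{ '@timestamp': 'asc' }],
    query: { match_all: {} },
  });
  return result.hits.hits.map((hit) => hit._source);
}

// usage: await firstDocs(client, '.kibana-event-log-7.16.3-000001');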

pmuellr (Member, Author) commented Mar 8, 2022

Quick tour of the processing going on here.

The Elasticsearch resources are created here:

async function initializeEsResources(esContext: EsContext) {
  const steps = new EsInitializationSteps(esContext);
  await steps.setExistingAssetsToHidden();
  await steps.createIlmPolicyIfNotExists();
  await steps.createIndexTemplateIfNotExists();
  await steps.createInitialIndexIfNotExists();
}
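
For context, the template step is a check-then-create against the index template APIs. A simplified sketch of its shape (not the actual Kibana source; the names, abbreviated mappings, and 8.x-style client usage are assumptions):

import { Client } from '@elastic/elasticsearch';

// Simplified sketch: create the event log index template only if it's missing.
// The real code goes through the ClusterClientAdapter and uses the full event
// log mappings; this only shows the check-then-create shape.
async function createIndexTemplateIfNotExistsSketch(client: Client, templateName: string) {
  const exists = await client.indices.existsIndexTemplate({ name: templateName });
  if (exists) return;

  await client.indices.putIndexTemplate({
    name: templateName,
    index_patterns: [`${templateName}-*`],
    template: {
      settings: {
        index: {
          lifecycle: { name: 'kibana-event-log-policy', rollover_alias: templateName },
          hidden: true,
        },
      },
      mappings: {
        dynamic: false,
        properties: {
          '@timestamp': { type: 'date' },
          ecs: { properties: { version: { type: 'keyword', ignore_above: 1024 } } },
        },
      },
    },
  });
}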

The mappings are in the template, so the template should exist before we try to create the index. When we create the index, we use the "initial name", which has a suffix of -000001. We also set the alias when the index is created:

async createInitialIndexIfNotExists(): Promise<void> {
  const exists = await this.esContext.esAdapter.doesAliasExist(this.esContext.esNames.alias);
  if (!exists) {
    await this.esContext.esAdapter.createIndex(this.esContext.esNames.initialIndex, {
      aliases: {
        [this.esContext.esNames.alias]: {
          is_write_index: true,
          is_hidden: true,
        },
      },
    });
  }
}
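
Since two Kibana instances can run this concurrently, the createIndex call presumably has to tolerate losing the race. A minimal sketch of that defensive pattern (not the actual ClusterClientAdapter code; the error-shape check assumes the @elastic/elasticsearch ResponseError layout):

import { Client } from '@elastic/elasticsearch';

// Create the initial index plus its write alias; if another Kibana instance
// created it between our existence check and this call, swallow the
// resource_already_exists_exception and treat it as success.
async function createIndexIgnoringExisting(client: Client, index: string, alias: string) {
  try {
    await client.indices.create({
      index,
      aliases: { [alias]: { is_write_index: true, is_hidden: true } },
    });
  } catch (err: any) {
    const errorType = err?.meta?.body?.error?.type;
    if (errorType !== 'resource_already_exists_exception') throw err;
  }
}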

One problem we've seen in the past: if for some reason the initial index doesn't actually get created, the event log still writes to the alias; since the alias doesn't exist, ES auto-creates an index with the alias's name, which is obviously a mess.

But that didn't happen here. The alias did get created, and the event log indexed the documents in the right place, but for some reason the mappings weren't applied, even though they should have been set up when the template was created, before this.

Also keep in mind there are two Kibanas, starting at the same time, so there's a potential for a race condition. But we've not seen any race conditions with this code AFAIK.

I'm wondering if there's some underlying ES-level race condition, where one of the Kibanas calls doesIndexTemplateExist(), gets true back, but the template isn't really quite ready somehow. Is there some way a long refresh interval (or similar) could have an effect like this?

My only thought on trying to lock this down more is to add an additional delay after the initialization code, to have it wait "a bit more".
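
Rather than a blind delay, another option would be to verify that the concrete index actually picked up dynamic: false before signalling ready, retrying a few times. A hedged sketch (the attempt count and delay are made-up numbers; 8.x-style client assumed):

import { Client } from '@elastic/elasticsearch';

// Poll the concrete index until its mappings look like they came from the
// template (i.e. dynamic is false), or give up after a few attempts.
async function waitForTemplateMappings(
  client: Client,
  index: string,
  attempts = 5,
  delayMs = 1000
): Promise<boolean> {
  for (let attempt = 0; attempt < attempts; attempt++) {
    const response = await client.indices.getMapping({ index });
    const mappings: any = Object.values(response)[0]?.mappings ?? {};
    // GET mapping reports dynamic as the string "false" when set via the template
    if (mappings.dynamic === 'false' || mappings.dynamic === false) return true;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}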

The other possibility is that the "ready" signal on the write end of the event log is being set too early, or perhaps not being checked for some reason.

The code that checks/waits for initialization to complete before writing is here:

async indexDocuments(docs: TDoc[]): Promise<void> {
  // If es initialization failed, don't try to index.
  // Also, don't log here, we log the failure case in plugin startup
  // instead, otherwise we'd be spamming the log (if done here)
  if (!(await this.wait())) {
    this.logger.debug(`Initialization failed, not indexing ${docs.length} documents`);
    return;
  }

The this.wait() function comes in from the constructor, so you'll have to track references to see where it's set, but it should eventually land here:

constructor(params: EsContextCtorParams) {
  this.logger = params.logger;
  this.esNames = getEsNames(params.indexNameRoot, params.kibanaVersion);
  this.readySignal = createReadySignal();
  this.initialized = false;
  this.esAdapter = new ClusterClientAdapter({
    logger: params.logger,
    elasticsearchClientPromise: params.elasticsearchClientPromise,
    wait: () => this.readySignal.wait(),
  });
}

initialize() {
  // only run the initialization method once
  if (this.initialized) return;
  this.initialized = true;
  this.logger.debug('initializing EsContext');
  setImmediate(async () => {
    try {
      const success = await this._initialize();
      this.logger.debug(`readySignal.signal(${success})`);
      this.readySignal.signal(success);
    } catch (err) {
      this.logger.debug('readySignal.signal(false)');
      this.readySignal.signal(false);
    }
  });
}
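
For reference, createReadySignal is essentially a one-shot, multi-consumer promise: wait() blocks until signal() is called, then resolves with the success flag for every caller, including ones that call wait() later. A minimal sketch of the pattern (not the actual Kibana implementation):

// Minimal one-shot signal: signal(value) resolves the shared promise once;
// every wait() call, before or after signalling, gets the same value.
function createReadySignalSketch<T = boolean>() {
  let resolver!: (value: T) => void;
  const promise = new Promise<T>((resolve) => {
    resolver = resolve;
  });
  return {
    wait: (): Promise<T> => promise,
    signal: (value: T): void => resolver(value),
  };
}

// usage: initialize() calls signal(true/false); indexDocuments() awaits wait()
// const readySignal = createReadySignalSketch<boolean>();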

pmuellr (Member, Author) commented Mar 8, 2022

Did a bit more poking around the ES diagnostic bundle. Looking at settings.json for the 7.16.3 event log index (bad) and the 7.16.2 event log index (good), you can see these differences:

7.16.3 event log settings
  {
    "settings" : {
      "index" : {





        "creation_date_string" : "2022-01-31T10:27:49.225Z",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        
        "number_of_shards" : "1",
        "auto_expand_replicas" : "false",
        "blocks" : {
          "read_only_allow_delete" : "false"
        },
        "provided_name" : ".kibana-event-log-7.16.3-000001",
        "creation_date" : "1643624869225",
        "number_of_replicas" : "0",
        "uuid" : "jMdaqkJvSm6Ci9xbb36VBA",
        "version" : {
          "created_string" : "7.16.3",
          "created" : "7160399"
        }
      }
    }
  }
7.16.2 event log settings
  {
    "settings" : {
      "index" : {
        "lifecycle" : {
          "name" : "kibana-event-log-policy",
          "rollover_alias" : ".kibana-event-log-7.16.2",
          "indexing_complete" : "true"
        },
        "creation_date_string" : "2022-01-12T00:02:48.113Z",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "hidden" : "true",
        "number_of_shards" : "1",
        "auto_expand_replicas" : "0-1",



        "provided_name" : ".kibana-event-log-7.16.2-000001",
        "creation_date" : "1641945768113",
        "number_of_replicas" : "0",
        "uuid" : "z2rJtrXWSzafIVkGUsLC_g",
        "version" : {
          "created_string" : "7.16.2",
          "created" : "7160299"
        }
      }
    }
  }

The first difference is that 7.16.3 is missing the lifecycle property; it should be:

        "lifecycle" : {
          "name" : "kibana-event-log-policy",
          "rollover_alias" : ".kibana-event-log-7.16.3",
          "indexing_complete" : "true"
        },

And the 7.16.3 settings have an extra blocks property that 7.16.2 doesn't have:

        "blocks" : {
          "read_only_allow_delete" : "false"
        },

Seems like that blocks property is probably not significant: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-blocks.html#index-block-settings

The lack of the lifecycle settings is significant, though. Since those come from the template, it's just more proof that the index got created without the template being applied to it.
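
A quick way to check whether other event log indices got created without the template is to scan their settings for a missing index.lifecycle.name. A diagnostic sketch (8.x-style client assumed; the index pattern is the event log's):

import { Client } from '@elastic/elasticsearch';

// List event log indices whose settings have no ILM policy attached, i.e.
// indices that were most likely created without the template being applied.
async function findEventLogIndicesMissingIlm(client: Client) {
  const settings = await client.indices.getSettings({
    index: '.kibana-event-log-*',
    expand_wildcards: 'all', // the event log indices are hidden
  });
  return Object.entries(settings)
    .filter(([, state]) => !(state.settings as any)?.index?.lifecycle?.name)
    .map(([indexName]) => indexName);
}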

pmuellr (Member, Author) commented Mar 8, 2022

Did some searching to see if this is some kind of known thing, but the only relevant issue I could find is 4 years old, from 6.8: elastic/elasticsearch#30486

pmuellr (Member, Author) commented Apr 11, 2022

We also opened #128293 as a potential "solution" for cases where either there is some ES race condition or a human manually modified some event log resources. We closed it in favour of this issue, but this issue didn't actually contain any notes about that "solution", so here they are.

That "solution" is to create the initial index with the mappings specified explicitly, instead of depending on the mappings being in the template. We'd still keep the mappings in the template; we'd just also create the first index with those same mappings.
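
The shape of that change would be to pass the mappings (and probably the ILM settings) directly in the create-index call, rather than relying only on the template. A rough sketch of the idea, not the actual change from #128293 (mapping excerpt abbreviated; 8.x-style client assumed):

import { Client } from '@elastic/elasticsearch';

// Create the initial event log index with the mappings inlined, so that even
// if the template somehow isn't applied, the index still ends up with
// dynamic:false and the intended field types.
async function createInitialIndexWithMappings(client: Client, index: string, alias: string) {
  await client.indices.create({
    index,
    aliases: { [alias]: { is_write_index: true, is_hidden: true } },
    mappings: {
      dynamic: false,
      properties: {
        '@timestamp': { type: 'date' },
        ecs: { properties: { version: { type: 'keyword', ignore_above: 1024 } } },
        // ... remaining event log fields, same as in the index template
      },
    },
  });
}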

This only "solves" the problem for the initial index creation. If somehow the mappings got changed or removed from the template, further indices (created as part of the ILM rollover process, not by us) would still pick up their mappings from the template, and could still be problematic.
