[ResponseOps] event log mapping from template not applied #127029

Status: Open
pmuellr opened this issue Mar 7, 2022 · 6 comments
Labels: bug, Feature:EventLog, Team:ResponseOps

Comments

pmuellr (Member) commented Mar 7, 2022

Kibana version: 7.16.3

Just saw a case where the event log index for 7.16.3, .kibana-event-log-7.16.3-000001, seems to have "auto" (dynamic) mappings set up instead of the mappings provided in the index template.

Looking at the mappings.json file in the ES diagnostics package, you can see:

  ".kibana-event-log-7.16.3-000001": {
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "ecs": {
          "properties": {
            "version": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        },

But for a previous event log index, it looks like this:

  ".kibana-event-log-7.15.0-000004": {
    "mappings": {
      "dynamic": "false",
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "ecs": {
          "properties": {
            "version": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        },

I'm assuming the user didn't do this to themselves somehow by doing some ES-level surgery on the indices, which I think implies we have a race condition somewhere: an event doc gets written before the index resources are set up, so it ends up with the auto-mapping.

A quick look at the index template itself seemed to indicate it was fine, and the alias is set up ok. If the alias had been broken as well, writes would try to auto-create an index named .kibana-event-log-7.16.3 instead of going through the alias, which causes other problems.
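
One way to double-check that the template really does match that concrete index name is the simulate-index-template API. A minimal sketch, assuming an 8.x-style @elastic/elasticsearch client and an illustrative helper name (not taken from the diagnostic bundle):

import { Client } from '@elastic/elasticsearch';

// Hypothetical diagnostic helper: ask ES which composed template would apply
// if an index with this concrete name were created, and what mappings the
// index would get from it.
async function simulateTemplateForIndex(client: Client, indexName: string) {
  // POST /_index_template/_simulate_index/<indexName>
  const simulated = await client.indices.simulateIndexTemplate({ name: indexName });
  const mappings: any = simulated.template?.mappings ?? {};
  console.log('dynamic:', mappings.dynamic);
  console.log('ecs.version:', mappings.properties?.ecs?.properties?.version);
  return simulated;
}

// usage (the cluster URL is an assumption):
// const client = new Client({ node: 'http://localhost:9200' });
// await simulateTemplateForIndex(client, '.kibana-event-log-7.16.3-000001');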

pmuellr added the bug, Team:ResponseOps, and Feature:EventLog labels on Mar 7, 2022
elasticmachine (Contributor) commented

Pinging @elastic/response-ops (Team:ResponseOps)

pmuellr (Member, Author) commented Mar 8, 2022

Looked at the first event log documents for that index, wondering if some specific kind of event was managing to avoid the "ready" checks for the index resources (template, etc). Nope. The first two documents were eventLog/starting events, as expected; the event log generates a starting message at Kibana startup time. Note there are two because there are two Kibana instances in this deployment.

First two event log docs:

[
  {
    "_index": ".kibana-event-log-7.16.3-000001",
    "_type": "_doc",
    "_score": null,
    "_source": {
      "@timestamp": "2022-01-31T10:27:40.965Z",
      "event": {
        "provider": "eventLog",
        "action": "starting"
      },
      "message": "eventLog starting",
      "ecs": {
        "version": "1.8.0"
      },
      "kibana": {
        "server_uuid": "7c1898ba-6cb1-444d-92b1-b45c7fff2668",
        "version": "7.16.3"
      }
    }
  },
  {
    "_index": ".kibana-event-log-7.16.3-000001",
    "_type": "_doc",
    "_score": null,
    "_source": {
      "@timestamp": "2022-01-31T10:27:41.002Z",
      "event": {
        "provider": "eventLog",
        "action": "starting"
      },
      "message": "eventLog starting",
      "ecs": {
        "version": "1.8.0"
      },
      "kibana": {
        "server_uuid": "410c0f34-0314-4c70-94d1-abdfce00c4ab",
        "version": "7.16.3"
      }
    }
  }
]

About 6 seconds after those documents, alerting/execute-start and alerting/execute events started getting written, which is also to be expected.

So, no smoking gun yet.
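
For reference, pulling the earliest documents out of the index is just an ascending sort on @timestamp; a minimal sketch, assuming an 8.x-style @elastic/elasticsearch client:

import { Client } from '@elastic/elasticsearch';

// Return the oldest `size` documents in the given index, oldest first.
async function firstDocs(client: Client, index: string, size = 2) {
  const result = await client.search({
    index,
    size,
    sort: [{ '@timestamp': 'asc' }],
    query: { match_all: {} },
  });
  return result.hits.hits.map((hit) => hit._source);
}

// usage: await firstDocs(client, '.kibana-event-log-7.16.3-000001');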

pmuellr (Member, Author) commented Mar 8, 2022

Quick tour of the processing going on here.

The Elasticsearch resources are created here:

async function initializeEsResources(esContext: EsContext) {
  const steps = new EsInitializationSteps(esContext);
  await steps.setExistingAssetsToHidden();
  await steps.createIlmPolicyIfNotExists();
  await steps.createIndexTemplateIfNotExists();
  await steps.createInitialIndexIfNotExists();
}
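
For context, the template step is a check-then-create against the index template APIs. A simplified sketch of its shape (not the actual Kibana source; the names, abbreviated mappings, and 8.x-style client usage are assumptions):

import { Client } from '@elastic/elasticsearch';

// Simplified sketch: create the event log index template only if it's missing.
// The real code goes through the ClusterClientAdapter and uses the full event
// log mappings; this only shows the check-then-create shape.
async function createIndexTemplateIfNotExistsSketch(client: Client, templateName: string) {
  const exists = await client.indices.existsIndexTemplate({ name: templateName });
  if (exists) return;

  await client.indices.putIndexTemplate({
    name: templateName,
    index_patterns: [`${templateName}-*`],
    template: {
      settings: {
        index: {
          lifecycle: { name: 'kibana-event-log-policy', rollover_alias: templateName },
          hidden: true,
        },
      },
      mappings: {
        dynamic: false,
        properties: {
          '@timestamp': { type: 'date' },
          ecs: { properties: { version: { type: 'keyword', ignore_above: 1024 } } },
        },
      },
    },
  });
}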

The mappings are in the template, so the template should exist before we try to create the index. When we create the index, we use the "initial name", which has a suffix of -000001. We also set the alias when the index is created:

async createInitialIndexIfNotExists(): Promise<void> {
  const exists = await this.esContext.esAdapter.doesAliasExist(this.esContext.esNames.alias);
  if (!exists) {
    await this.esContext.esAdapter.createIndex(this.esContext.esNames.initialIndex, {
      aliases: {
        [this.esContext.esNames.alias]: {
          is_write_index: true,
          is_hidden: true,
        },
      },
    });
  }
}
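
Since two Kibana instances can run this concurrently, the createIndex call presumably has to tolerate losing the race. A minimal sketch of that defensive pattern (not the actual ClusterClientAdapter code; the error-shape check assumes the @elastic/elasticsearch ResponseError layout):

import { Client } from '@elastic/elasticsearch';

// Create the initial index plus its write alias; if another Kibana instance
// created it between our existence check and this call, swallow the
// resource_already_exists_exception and treat it as success.
async function createIndexIgnoringExisting(client: Client, index: string, alias: string) {
  try {
    await client.indices.create({
      index,
      aliases: { [alias]: { is_write_index: true, is_hidden: true } },
    });
  } catch (err: any) {
    const errorType = err?.meta?.body?.error?.type;
    if (errorType !== 'resource_already_exists_exception') throw err;
  }
}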

One problem we've seen in the past: if for some reason the initial index doesn't actually get created, the event log still writes to the alias; since the alias doesn't exist, ES auto-creates an index with the alias's name, which is obviously a mess.

But that didn't happen here. The alias did get created, and the event log indexed the documents in the right place, but for some reason the mappings weren't applied, even though they should have been set up when the template was created, before this.

Also keep in mind there are two Kibanas, starting at the same time, so there's a potential for a race condition. But we've not seen any race conditions with this code AFAIK.

I'm wondering if there's some underlying ES-level race condition, where one of the Kibanas calls doesIndexTemplateExist(), gets true back, but the template isn't really quite ready somehow. Is there some way a long refresh interval (or similar) could have an effect like this?

My only thought on trying to lock this down more is to add an additional delay after the initialization code, to have it wait "a bit more".
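
Rather than a blind delay, another option would be to verify that the concrete index actually picked up dynamic: false before signalling ready, retrying a few times. A hedged sketch (the attempt count and delay are made-up numbers; 8.x-style client assumed):

import { Client } from '@elastic/elasticsearch';

// Poll the concrete index until its mappings look like they came from the
// template (i.e. dynamic is false), or give up after a few attempts.
async function waitForTemplateMappings(
  client: Client,
  index: string,
  attempts = 5,
  delayMs = 1000
): Promise<boolean> {
  for (let attempt = 0; attempt < attempts; attempt++) {
    const response = await client.indices.getMapping({ index });
    const mappings: any = Object.values(response)[0]?.mappings ?? {};
    // GET mapping reports dynamic as the string "false" when set via the template
    if (mappings.dynamic === 'false' || mappings.dynamic === false) return true;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}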

The other possibility is that the "ready" signal on the write end of the event log is being set too early, or perhaps not being checked for some reason.

The code that checks/waits for initialization to complete before writing is here:

async indexDocuments(docs: TDoc[]): Promise<void> {
  // If es initialization failed, don't try to index.
  // Also, don't log here, we log the failure case in plugin startup
  // instead, otherwise we'd be spamming the log (if done here)
  if (!(await this.wait())) {
    this.logger.debug(`Initialization failed, not indexing ${docs.length} documents`);
    return;
  }

The this.wait() function comes in from the constructor, so you'll have to track references to see where it's set, but it should eventually land here:

constructor(params: EsContextCtorParams) {
  this.logger = params.logger;
  this.esNames = getEsNames(params.indexNameRoot, params.kibanaVersion);
  this.readySignal = createReadySignal();
  this.initialized = false;
  this.esAdapter = new ClusterClientAdapter({
    logger: params.logger,
    elasticsearchClientPromise: params.elasticsearchClientPromise,
    wait: () => this.readySignal.wait(),
  });
}

initialize() {
  // only run the initialization method once
  if (this.initialized) return;
  this.initialized = true;
  this.logger.debug('initializing EsContext');
  setImmediate(async () => {
    try {
      const success = await this._initialize();
      this.logger.debug(`readySignal.signal(${success})`);
      this.readySignal.signal(success);
    } catch (err) {
      this.logger.debug('readySignal.signal(false)');
      this.readySignal.signal(false);
    }
  });
}
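
For reference, createReadySignal is essentially a one-shot, multi-consumer promise: wait() blocks until signal() is called, then resolves with the success flag for every caller, including ones that call wait() later. A minimal sketch of the pattern (not the actual Kibana implementation):

// Minimal one-shot signal: signal(value) resolves the shared promise once;
// every wait() call, before or after signalling, gets the same value.
function createReadySignalSketch<T = boolean>() {
  let resolver!: (value: T) => void;
  const promise = new Promise<T>((resolve) => {
    resolver = resolve;
  });
  return {
    wait: (): Promise<T> => promise,
    signal: (value: T): void => resolver(value),
  };
}

// usage: initialize() calls signal(true/false); indexDocuments() awaits wait()
// const readySignal = createReadySignalSketch<boolean>();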

pmuellr (Member, Author) commented Mar 8, 2022

Did a bit more poking around the ES diagnostic bundle. Looking at settings.json for the 7.16.3 event log index (bad) and the 7.16.2 event log index (good), you can see these differences:

7.16.3 event log settings
  {
    "settings" : {
      "index" : {





        "creation_date_string" : "2022-01-31T10:27:49.225Z",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        
        "number_of_shards" : "1",
        "auto_expand_replicas" : "false",
        "blocks" : {
          "read_only_allow_delete" : "false"
        },
        "provided_name" : ".kibana-event-log-7.16.3-000001",
        "creation_date" : "1643624869225",
        "number_of_replicas" : "0",
        "uuid" : "jMdaqkJvSm6Ci9xbb36VBA",
        "version" : {
          "created_string" : "7.16.3",
          "created" : "7160399"
        }
      }
    }
  }
7.16.2 event log settings
  {
    "settings" : {
      "index" : {
        "lifecycle" : {
          "name" : "kibana-event-log-policy",
          "rollover_alias" : ".kibana-event-log-7.16.2",
          "indexing_complete" : "true"
        },
        "creation_date_string" : "2022-01-12T00:02:48.113Z",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "hidden" : "true",
        "number_of_shards" : "1",
        "auto_expand_replicas" : "0-1",



        "provided_name" : ".kibana-event-log-7.16.2-000001",
        "creation_date" : "1641945768113",
        "number_of_replicas" : "0",
        "uuid" : "z2rJtrXWSzafIVkGUsLC_g",
        "version" : {
          "created_string" : "7.16.2",
          "created" : "7160299"
        }
      }
    }
  }

The first difference is that 7.16.3 is missing the lifecycle property; it should be:

        "lifecycle" : {
          "name" : "kibana-event-log-policy",
          "rollover_alias" : ".kibana-event-log-7.16.3",
          "indexing_complete" : "true"
        },

And the 7.16.3 settings have an extra blocks property that 7.16.2 doesn't have:

        "blocks" : {
          "read_only_allow_delete" : "false"
        },

Seems like that blocks property is probably not significant: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-blocks.html#index-block-settings

The lack of the lifecycle settings is significant, though. Since those come from the template, it's just more proof that the index got created without the template being applied to it.
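
A quick way to check whether other event log indices got created without the template is to scan their settings for a missing index.lifecycle.name. A diagnostic sketch (8.x-style client assumed; the index pattern is the event log's):

import { Client } from '@elastic/elasticsearch';

// List event log indices whose settings have no ILM policy attached, i.e.
// indices that were most likely created without the template being applied.
async function findEventLogIndicesMissingIlm(client: Client) {
  const settings = await client.indices.getSettings({
    index: '.kibana-event-log-*',
    expand_wildcards: 'all', // the event log indices are hidden
  });
  return Object.entries(settings)
    .filter(([, state]) => !(state.settings as any)?.index?.lifecycle?.name)
    .map(([indexName]) => indexName);
}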

pmuellr (Member, Author) commented Mar 8, 2022

Did some searching to see if this is some kind of known thing, but the only relevant issue I could find is 4 years old, from 6.8: elastic/elasticsearch#30486

pmuellr (Member, Author) commented Apr 11, 2022

We also opened #128293 as a potential "solution" for cases where either there is some ES race condition or a human manually modified some event log resources. We closed it in favour of this issue, but this issue didn't actually contain any notes about that "solution", so here they are.

That "solution" is to create the initial index with the mappings specified explicitly, instead of depending on the mappings being in the template. We'd still keep the mappings in the template; we'd just also create the first index with those same mappings.
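
The shape of that change would be to pass the mappings (and probably the ILM settings) directly in the create-index call, rather than relying only on the template. A rough sketch of the idea, not the actual change from #128293 (mapping excerpt abbreviated; 8.x-style client assumed):

import { Client } from '@elastic/elasticsearch';

// Create the initial event log index with the mappings inlined, so that even
// if the template somehow isn't applied, the index still ends up with
// dynamic:false and the intended field types.
async function createInitialIndexWithMappings(client: Client, index: string, alias: string) {
  await client.indices.create({
    index,
    aliases: { [alias]: { is_write_index: true, is_hidden: true } },
    mappings: {
      dynamic: false,
      properties: {
        '@timestamp': { type: 'date' },
        ecs: { properties: { version: { type: 'keyword', ignore_above: 1024 } } },
        // ... remaining event log fields, same as in the index template
      },
    },
  });
}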

This only "solves" the problem for the initial index creation. If somehow the mappings got changed or removed from the template, further indices (created as part of the ILM rollover process, not by us) would still pick up their mappings from the template, and could still be problematic.
