[RAC][Meta] Consolidate the two indexing implementations in rule_registry plugin #101016
Pinging @elastic/security-solution (Team: SecuritySolution)
Pinging @elastic/security-detections-response (Team:Detections and Resp)
We had a chat with @marshallmain last week and here's a summary of what we discussed and how we decided to proceed with the consolidated implementation:
I opened a draft PR (#101453) for the consolidated implementation. Here's a comparison of the 3 implementations:
One more thing I forgot to mention is about component templates. In the consolidated implementation, I'm proposing the following "architecture" for templates (as you can see here and here in the code): During index bootstrapping a number of templates will be created by the Event Log mechanism. They will have a certain order of precedence and the "next" template will override properties from all the "previous" ones. Here's the list of templates from "start" (most generic, least precedence) to "finish" (most specific, most precedence):
1. Mechanism-level templates, common for all logs.
2. Log-level templates (3 component templates and 1 index template), created per each log defined via the consolidated implementation. As a developer defining a log (and thus its mappings and settings), you are able to specify these templates.
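A minimal sketch of how such a precedence chain could be wired up with composable templates, using the Elasticsearch JS client (all template names and mappings below are assumptions for illustration, not the actual draft code):

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });

// Hypothetical sketch of the precedence chain described above. Later entries
// in `composed_of` override properties from earlier ones, so the order goes
// from most generic (mechanism-level) to most specific (log-level).
async function bootstrapTemplatesForLog(logName: string) {
  // Mechanism-level component template, shared by every log (assumed name).
  await es.cluster.putComponentTemplate({
    name: 'event-log-mechanism-mappings',
    body: {
      template: { mappings: { properties: { '@timestamp': { type: 'date' } } } },
    },
  });

  // Log-level component template, created per log definition (assumed name).
  await es.cluster.putComponentTemplate({
    name: `${logName}-mappings`,
    body: {
      template: { mappings: { properties: { 'kibana.rac.alert.status': { type: 'keyword' } } } },
    },
  });

  // The index template composes the chain; the last template wins on conflicts.
  await es.indices.putIndexTemplate({
    name: `${logName}-template`,
    body: {
      index_patterns: [`${logName}-*`],
      composed_of: ['event-log-mechanism-mappings', `${logName}-mappings`],
    },
  });
}
```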
@marshallmain @madirey @peluja1012 @xcrzx @ecezalp @spong here's my 2 cents on our consolidation efforts and #101453 vs #102586. This is all just my subjective opinion, my understanding might be incomplete, and I don't have any strong opinions on anything below.

#101453

Pros:
Cons:
#102586

Pros:
Cons:
Random thoughts regarding a system of component templates and their versioning

I really liked the implementation in #102586 which puts a version in the
I also like the idea of lazy index creation. FWIW, in #101453 index bootstrapping is also done lazily. The difference is that in #102586 more parts are shared between spaces, and so 2/3 of the bootstrapping happens outside of

I'm wondering, however, what you think about standardizing templates similar to how I did it in the draft (see #101016 (comment)).

Random thoughts regarding the new agreement on namespacing strategy

Let me copy-paste this to here:
I think this agreement does not have much effect on either of the two approaches. The #101453 API does not force a developer to bind a log to a Kibana space. It provides a method for resolving a log which accepts a space id in its parameters:

```ts
export interface IEventLogResolver {
  resolve<TMap extends FieldMap>(
    definition: IEventLogDefinition<TMap>,
    spaceId: string
  ): Promise<IEventLog<Event<TMap>>>;
}
```
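To illustrate how that could look at a call site (a hypothetical usage sketch; the definition object is an assumption, not the actual draft API):

```ts
// Hypothetical usage: the log definition is space-agnostic, and the resolver
// binds it to a concrete, space-specific log instance at call time.
const log = await resolver.resolve(securityAlertsLogDefinition, 'default');
// `log` is now scoped to the "default" space; the concrete index names and
// templates behind it are derived from the definition plus the space id.
```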
For its implementation,

Final thoughts

I easily could be biased towards #101453, but I feel like it provides a better API in general (although it may not be perfect, missing certain methods, etc.). But given the time left before 7.14 FF and the overall fatigue from working on alerts-as-data/event-log/RAC stuff, I think as of now we should proceed with #102586, because "progress over perfection". We need to ship something and move forward, and with #102586 there's a much better chance of finalizing this implementation sooner.
Thanks for summarizing, Georgii! I'm copy-pasting the note I sent to Georgii on Slack here to summarize my perspective, developed over the past 2 weeks. At a very high level, there are 2 primary goals I have in mind for the eventual converged Alert Data client.
I spent 2 days or so comparing the RuleDataClient with the draft consolidated EventLogService to build a mental model of each implementation, assess the existing features, and start estimating the difficulty of adding the additional required features to each implementation. After those 2 days, I felt very confident in my mental model of the RuleDataClient, but I was still struggling to remember how the abstraction layers fit together in the EventLogService. I could follow the call stack when looking at a specific piece, but the number of calls into various classes meant I would quickly forget the details of what each class did and stored. As for features, your comment does an excellent job summing up all the key differences we were aware of at the time. When I started digging into the implementations, I realized that the namespacing strategy additionally differed between them. The EventLogService creates templates and indices at the same time and creates a separate template for each space, whereas the RuleDataClient does not create concrete indices until documents are written, and the template creation process is decoupled from the index creation. This positions the RuleDataClient well to support the arbitrary namespaces that users can define for their RAC alerts, since a single set of templates already serves all namespaces for a particular solution's alert indices. Due to the requirement for arbitrary namespaces, it looked to me like the EventLogService would have to be updated to decouple the template creation from index creation - in the process breaking the existing template/index versioning and rollover logic in the EventLogService. So overall, there appeared to be 2 paths forward:
Since (1) there are no blockers preventing the RuleDataClient from being used as the basis for Security Solution rules in RAC - any enhancements are great but not required for shipping, so we can more easily make enhancements in parallel with migrating rules, (2) the RuleDataClient is already in use by Observability, and (3) I was more confident in my understanding of the RuleDataClient implementation and thus felt that other devs new to the consolidated client would have an easier time working on it, I decided to try adding the versioning and rollover logic to the RuleDataClient.
Summarizing next steps after meeting with @banderror and @xcrzx:
Details:
This approach should adopt the declarative style from the event_log. The main difference is that it aims to be a lighter-weight implementation that encapsulates template bootstrapping details while providing some flexibility in log definitions. Rather than caching objects for repeated retrieval from the service, the RuleDataPluginService will simply create templates and return a
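Presumably what's returned is a client object. As a rough sketch of the shape this declarative style implies (all names and options here are assumptions, not the actual rule_registry API):

```ts
// Hypothetical sketch of the declarative style described above: the service
// creates templates from a declarative definition and hands back a client,
// instead of caching log objects for repeated retrieval.
const ruleDataClient = await ruleDataService.initializeIndex({
  feature: 'siem',                        // assumed option names
  componentTemplates: [securityFieldMap], // mappings defined by the solution
});

// Rule executors then use the returned client directly to write alert documents.
await ruleDataClient.getWriter({ namespace: 'default' }).bulk({ body: [] });
```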
I'd like to just add my 2 cents and try to provide a more high-level overview of our recent discussions with @marshallmain and @xcrzx. We had a chat on the following topics:
@marshallmain perfectly described all the details, I just want to again list what we're going to do in the next PR in a more high-level way:
Again, for more details please refer to #101016 (comment).
@jasonrhodes @dgieselaar @smith please review ^^^^ and give us some feedback 🙂 Let me know if we should elaborate more on anything. The plan is to start working on the next PR soon; it would be great to unblock #102586 sooner rather than later because it brings changes to the public API of
The plan makes sense to me. I'm absolutely in favor of a declarative asset/template API which takes care of the "technical fields". Automatic type derivation sounds nice, but a bit of duplication (mapping and io-ts type, for example) might be acceptable to unblock the follow-up efforts sooner, IMO.
As discussed with @banderror, here are some weak spots in the current implementation that we could address in the consolidated implementation.

1. Index name creation & update logic

Currently, index name creation and update logic is split into two pieces:

```ts
function createWriteTargetIfNeeded() {
// ...
const concreteIndexName = `${alias}-000001`;
// ...
}
function incrementIndexName(oldIndex: string) {
const baseIndexString = oldIndex.slice(0, -6);
const newIndexNumber = Number(oldIndex.slice(-6)) + 1;
if (isNaN(newIndexNumber)) {
return undefined;
}
return baseIndexString + String(newIndexNumber).padStart(6, '0');
}
```
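For reference, here is what that suffix arithmetic produces (index names are illustrative):

```ts
incrementIndexName('.alerts-security-default-000001'); // => '.alerts-security-default-000002'
incrementIndexName('.alerts-security-default');        // => undefined (Number('efault') is NaN)
```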
Also, it should be flexible and not rely on the last 6 digits, because there can be edge cases in general (#102586 (comment)).

2. Index mapping update logic

The index mapping update logic has several blind spots. I've added my concerns as comments to the `updateAliasWriteIndexMapping` function:

```ts
async function updateAliasWriteIndexMapping({ index, alias }: { index: string; alias: string }) {
const clusterClient = await this.getClusterClient();
const simulatedIndexMapping = await clusterClient.indices.simulateIndexTemplate({
name: index,
});
const simulatedMapping = get(simulatedIndexMapping, ['body', 'template', 'mappings']);
try {
await clusterClient.indices.putMapping({
index,
body: simulatedMapping,
});
return;
} catch (err) {
if (err.meta?.body?.error?.type !== 'illegal_argument_exception') {
/*
* This part is unclear. Why do we skip the rollover if we've caught 'illegal_argument_exception'?
*/
this.options.logger.error(`Failed to PUT mapping for alias ${alias}: ${err.message}`);
return;
}
const newIndexName = incrementIndexName(index);
if (newIndexName == null) {
/*
* I think we should not fail here. If the index name had no "-000001" suffix, we should append it.
*/
this.options.logger.error(`Failed to increment write index name for alias: ${alias}`);
return;
}
try {
/*
* We don't check response here. It could return "{ acknowledged: false }".
* Should we consider a rollover to be successful in that case?
*/
await clusterClient.indices.rollover({
alias,
new_index: newIndexName,
});
} catch (e) {
if (e?.meta?.body?.error?.type !== 'resource_already_exists_exception') {
/*
* This part is also unclear. We log only 'resource_already_exists_exception' but silence all other exceptions.
* It looks incorrect.
*/
this.options.logger.error(`Failed to rollover index for alias ${alias}: ${e.message}`);
}
/*
* There could be many reasons for the first attempt to update mappings to fail (network, etc.).
* I think we should retry in case of failure.
*/
}
}
}
```

3. Error handling during bootstrapping

Currently, index bootstrapping is considered successful even if some requests have failed. There is no way to programmatically retrieve the outcome of index template initialization, as we only log errors to the console. See the initialisation:

```ts
const initializeRuleDataTemplatesPromise = initializeRuleDataTemplates().catch((err) => {
this.logger!.error(err);
});
```

It could lead to a situation where index bootstrapping fails but writes are still allowed to proceed. A safer approach would be to disable all write operations if index bootstrapping failed. This should probably be done on the
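One possible shape for that (a sketch with illustrative names, not the actual rule_registry code): keep the bootstrapping promise around and make every write path await it, so a failed bootstrap rejects writes instead of being logged and forgotten:

```ts
// Sketch: don't swallow bootstrapping errors. Store the promise and gate
// all writes on it; if bootstrapping rejected, every write rejects too.
interface Writer {
  bulk(body: unknown): Promise<void>;
}

class RuleDataPluginService {
  private bootstrapped?: Promise<void>;
  private writer: Writer = { bulk: async () => {} };

  public init() {
    // No .catch() here: the rejection is preserved for later awaits.
    this.bootstrapped = this.initializeRuleDataTemplates();
  }

  public async getWriter(): Promise<Writer> {
    await this.bootstrapped; // throws if templates failed to install
    return this.writer;
  }

  private async initializeRuleDataTemplates(): Promise<void> {
    /* put component and index templates here */
  }
}
```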
4. Retry logic during bootstrapping

We should also add retry logic to the bootstrapping procedure. I talked to the Elasticsearch team, and they recommended adding retries to component and index template updates, or IMO wherever else we could get `acknowledged: false`.

Here's the log of our discussion:

Q: In Kibana (specifically, the RAC project), we started to use the new template APIs for bootstrapping our indices (PR for reference: #102586). This code is not well tested and not production ready, but it's going to be used by the Security and Observability solutions in the upcoming releases, so I'd like to make sure that we implemented it correctly. Specifically, we started to use:
We make sure to wait for the responses from these endpoints before any other code is able to write anything to the corresponding indices. So any writing happens only if these calls succeed. In those responses, we normally get 200 OK with the body `{ "acknowledged": true }`.
I wasn't able to find it in the docs or elsewhere, so I'd like to double-check two things. The first thing is about the meaning of `acknowledged` in these responses.
The second thing is about race conditions between index mgmt and indexing requests. Is it possible to get a race condition between successfully acknowledged template mgmt requests and a following (potentially immediate) attempt to write a document to an index whose name matches the name of the updated index template?
When developing locally, we got into a weird broken state where the concrete index ended up with dynamic mappings instead of mappings from the index template. Most likely, this was caused by our incorrectly implemented code plus constant server reloads during development. However, I decided to double-check whether any race conditions are possible in general.
A: That means that the change has been acknowledged by the master and accepted with a cluster state update.
You can get `acknowledged: false` if, when publishing the new cluster state, the request times out before enough nodes have acknowledged and accepted it. No errors are listed in that case, so generally a good rule would be to retry.
If the template update response has `acknowledged: true`, then retrieving those settings through something like the simulate API or creating a new index will use the just-updated template.

Q: If this question even makes sense, how many nodes is enough, though? Does `acknowledged: true` guarantee that any subsequent write request will be handled by a node that has received the most recent cluster state with the new mappings, etc.? Just want to make sure there's no way to get a race condition leading to some weird state where, let's say, a node which doesn't know about a newly created index processes a write request and creates an index with dynamic mappings instead of the mappings specified in the templates.

A: You shouldn't have to worry about this assuming the request is `acknowledged: true`, because that means the master node has acknowledged it, and all those requests (index creation, mapping updates) go through the master node.
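Based on that advice, a retry wrapper around template updates could look roughly like this (a sketch; attempt counts and delays are arbitrary):

```ts
import { Client } from '@elastic/elasticsearch';

// Sketch: retry a template update until Elasticsearch acknowledges it,
// per the ES team's advice above that `acknowledged: false` carries no
// error and is generally safe to retry.
async function putIndexTemplateWithRetry(
  es: Client,
  params: { name: string; body: Record<string, unknown> },
  maxAttempts = 3
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const { body } = await es.indices.putIndexTemplate(params);
    if (body.acknowledged) return;
    // Back off a little before retrying (arbitrary delay).
    await new Promise((resolve) => setTimeout(resolve, 1000 * attempt));
  }
  throw new Error(`Index template "${params.name}" was not acknowledged after ${maxAttempts} attempts`);
}
```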
We identified a few potential issues with index bootstrapping and described them in the two previous comments. I'm planning to address them as part of this consolidation task. I could split this into several PRs or do it in a single one; not sure at this point which way is better.
Thanks for the writeups @xcrzx and @banderror! Those sound like good improvements to make. The Q&A with answers from the ES team is especially useful; thanks Georgii for contacting them and writing up the answers!
It's inverted: we skip the rollover if we catch anything except for `illegal_argument_exception`.
This is inverted as well: we silence `resource_already_exists_exception` and log all other exceptions.
I'm a little worried about appending to the index name in the case of failures without knowing what caused the failure. Appending might fix the case where the suffix gets dropped somehow, but it could also introduce other edge cases. If the suffix gets dropped somehow during the rollover process and we fail to find the suffix on the next check and start with "-000001" again, then "-000001" could already exist. In that case it would appear that the rollover process has completed successfully when really nothing has happened. I think the most robust approach would be to go back to using the
This is a great point and something we should definitely handle. "{ acknowledged: false }" seems like a failure to me.
Adding retry logic for the mapping update process in the case of (unexpected) errors would definitely be a good improvement.
Definitely agree with this, we should not allow writes if the templates are not installed correctly.
Sorry, my bad, didn't see that little
The current implementation doesn't handle this case either. It fails when the index name doesn't contain a numerical suffix. So yeah, going back to using the
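One approach in that spirit is to stop computing the next index name in Kibana at all and let Elasticsearch derive it, by omitting `new_index` in the rollover call (a sketch, assuming an ES client `es`):

```ts
// Sketch: rolling over without `new_index` lets Elasticsearch increment the
// numeric suffix itself (...-000001 to ...-000002), so Kibana never parses it.
const { body } = await es.indices.rollover({ alias: '.alerts-security-default' });
if (!body.acknowledged) {
  // Per the discussion above, an unacknowledged rollover should be treated
  // as a failure (and possibly retried).
  throw new Error(`Rollover from ${body.old_index} to ${body.new_index} was not acknowledged`);
}
```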
Adding @marshallmain's suggestions regarding index bootstrapping:
A note regarding ILM policy, index template and namespace:
Adding suggestions after syncing with Observability folks (cc @weltenwort, btw I don't know GitHub usernames of Kerry and Panagiota, so please feel free to tag them):
Just want to socialise here that something like #101541 (a higher-level abstraction on top of index management) is out of scope of this consolidation effort.
@banderror I took the liberty of adding one more ticket to your list for 7.15 (@jasonrhodes can you confirm this is for 7.15?). This is a new bug that we spotted while working on #110788.
@mgiota Thank you, I can only support that 👍 👍
Hey everyone, I removed this ticket from the backlog of the Detection Rules area. We (@elastic/security-detections-response-rules) are not the owners anymore (however, feel free to still ping us if you have any tech questions about the ticket). Ownership of this epic and its sub-tasks now goes to the Detection Alerts area.
Transferring again to @elastic/response-ops as they now own the rule registry implementation. |
Summary
Consolidate #98353 and #98935 into a single implementation.
We ended up having 4 implementations of index management/writing/reading related/similar to the problem we're trying to solve in RAC: two in `rule_registry` (`RuleDataService` and `EventLogService`), one in `security_solution` (for the `.siem-signals` index), and one in the `event_log` plugin. We should compare them, mind their strong and weak parts, and build a consolidated implementation in `rule_registry`.

High-level plan:
- Take the implementation based on `RuleDataService` and `RuleDataClient`. Enhance it, improve its API for better and safer DX, improve its implementation, especially the part responsible for index bootstrapping.
- Remove the alternative implementation (`EventLogService`) from rule_registry.
- Remove the `.siem-signals` implementation in Security once we migrate the Detection Engine to `rule_registry`.
- Keep the `event_log` plugin; the Alerting team uses it for their monitoring needs. Also, we're planning to implement Rule Execution Log in Security based on `event_log` ([Security Solution] Extend event_log plugin with functionality required for Rule Execution Log #106347, [RAC][Security Solution][Detections] Rule Execution Log - technical implementation #101013).

Regarding robust index bootstrapping, consider this:
- … using `task_manager` to make sure the bootstrapping procedure runs only once at a time; c) using some sort of distributed lock; and d) maybe something else I'm missing. Maybe we could check how the Saved Objects service bootstraps the `.kibana` index.

When the consolidated implementation is ready, make sure to update all references to it from plugins which already use it: `security_solution`, `observability`, `apm`, `uptime`, etc.

Tasks for 7.15
- `ruleDataClient.bulk()` ✔️ @banderror

Tasks for 7.16
- `getAuthorizedAlertsIndices` includes Kibana space id into the index name #111154

Backlog
Indexing and index bootstrapping logic:
API enhancements for RuleDataService and RuleDataClient:
`RuleDataClient`

User-defined resources:
Misc:
Consider these as well: