ISM Validation Service to pre-check actions and notify users before the actions are executed.
Motivation
Index State Management (ISM) is a plugin for OpenSearch that automates recurring index lifecycle operations and manages metadata on specific indices through user-defined policies. Every ISM operation (documented here) is managed by a policy through its action states and transitions. There are generally three reasons why an action fails: an unmet action execution prerequisite, an invalid policy configuration, or a transient failure such as a timeout or circuit breaker exception. For the first two, a user can redefine the policy or make a cluster change and then retry the action to see if it passes. However, this process is not clearly defined and may take a long time to complete. Additionally, it is often clear beforehand, sometimes even at policy creation, that the defined policy action will fail.
Problem Statement
ISM actions can fail, and there is currently no good way to understand why these failures happen or to handle them. The causes of these errors need to be investigated and analyzed as a whole in order to better understand how to prevent them in the future.
In order to manage and prevent catchable errors, an error validation and prevention structure must be created that allows users to preemptively check whether the next action is in danger of failing, with an explanation of the cause of failure. With this structure, users will be able to manage errors more easily and fix action failures promptly, thus reducing the operational burden of ISM.
User Story
As a user I want to mitigate as many errors as possible with the least amount of manual work.
As a user I want to be able to apply a policy to an index and immediately check if the next action to be executed is in danger of failing without having to wait for the condition(s) to be met.
As a user I want to have the ability to check if my action is projected to fail at any point.
As a user I want sufficient information about the errors I receive, along with suggested solutions, so that I am able to troubleshoot them quickly and easily.
Tenets
User experience is of the utmost importance and must be held to high standards.
Design choices must be driven by data and research.
Performance of ISM is a major consideration and must not be affected.
Challenges
Action failures encompass around 100 unique causes, and not all errors may be preventable, so the existing errors need to be differentiated and categorized.
Validation code must not impact the overall performance of the ISM lifecycle.
Notifications must be user friendly and contain enough information for the user to understand and fix the error.
In Scope
Implement a validation structure that will check if the next action is in danger of failing and provide notification to users if an action error is projected to occur.
Refactor existing validation code to adhere to the new validation structure.
Implement validation checks and unit tests for all new validation functions.
Out of Scope
The validation structure will not try to fix the error; this will need to be done manually by the user.
Users will not be able to manually call and check for action failures.
Users will not be able to indicate specific indices they would like the validation structure to be performed on.
Error Analysis
A variety of failures may occur when executing an action, but analysis of these action failures reveals a few common errors.
In general, action failures can be categorized into four groups:
Preventable errors
This type of error may be validated ahead of time and fixed by the user after changing some type of configuration.
Preventable errors caught by an API call exception
This type of error can only be caught by making the API call; it cannot be validated ahead of time.
Preventable errors that may cause the cluster state to worsen
This type of error may worsen the cluster state if ISM continues to run.
Error messages that still need investigation
Validation Framework
High Level Design
To mitigate action failures, a validation service will be created. The framework works in two steps: (1) it validates the potential action failure before the action is performed and determines the best course of action depending on the error cause, and (2) it informs the user, through the Explain API, of the potential failure along with suggested solutions or documentation on how to fix the error. The framework will be enabled by default but can be turned off by the user in the cluster settings. In the Explain API, however, validation is not enabled by default; it can be requested by setting the query parameter validate_action=true.
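As a minimal sketch of how a client might opt in to validation through the Explain API, the helper below builds the request path with the validate_action query parameter. The function name explainPath is hypothetical and only illustrates the parameter described above; it is not part of the plugin.

```kotlin
// Hypothetical helper: builds the ISM Explain API path for an index.
// Validation is off by default in Explain and is requested with validate_action=true.
fun explainPath(index: String, validateAction: Boolean = false): String {
    val base = "/_plugins/_ism/explain/$index"
    return if (validateAction) "$base?validate_action=true" else base
}
```

Calling explainPath("my-index", validateAction = true) yields a path that asks Explain to include the validation result for the next action; omitting the flag yields the plain Explain path.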
Validating action failure
Validation logic will be implemented in a separate validation class for each action, and each implementation will be tailored to the cause of the action failure. The validation will be called from the managed index runner and checked before an action is executed.
Depending on the error cause, some APIs need to be called to catch errors rather than validating them ahead of time.
Once the validation logic finishes and the checks confirm that the action would fail, the action in question returns with an error indication, and either the validation is retried at the next job scheduler interval or the action fails permanently.
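The outcome handling described above can be sketched as a mapping from validation status to what the runner does next. The RunnerDecision enum and the decide function are hypothetical names introduced for illustration; only the three status values come from this proposal.

```kotlin
// Status values as listed in Validate.kt in this proposal.
enum class ValidationStatus { PASSED, RE_VALIDATE, FAILED }

data class ValidationResult(val message: String, val status: ValidationStatus)

// Hypothetical outcomes of one job scheduler run for a managed index.
enum class RunnerDecision { EXECUTE_ACTION, RETRY_NEXT_INTERVAL, FAIL_ACTION }

// Decide what to do with the pending action based on the validation result.
fun decide(result: ValidationResult): RunnerDecision = when (result.status) {
    ValidationStatus.PASSED -> RunnerDecision.EXECUTE_ACTION
    ValidationStatus.RE_VALIDATE -> RunnerDecision.RETRY_NEXT_INTERVAL
    ValidationStatus.FAILED -> RunnerDecision.FAIL_ACTION
}
```

Using an exhaustive when over the enum means that adding a new status later forces every caller to decide how to handle it.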
Notifying action failure
The notification step runs after the validation logic from part 1 and is requested through a flag in the Explain API. This allows users to validate upcoming actions while ISM is running and find potential action errors.
The notification will describe how to prevent the action failure: it will either include simple steps to fix or reconfigure the problem area, or point users to documentation that contains a solution.
By providing a link to the documentation, users will be able to find the actual solution.
By decoupling the solution from the codebase, the document itself may be updated instead of the codebase.
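One possible shape for such a notification is sketched below. The class ValidationNotification, its field names, and the render function are assumptions made for illustration; the proposal does not define a message format. The documentation link is optional so that the remedy text can stay short while the full solution lives in the documentation.

```kotlin
// Hypothetical notification payload; field names are illustrative only.
data class ValidationNotification(
    val index: String,
    val action: String,
    val cause: String,
    val remedy: String,
    val documentationLink: String? = null // kept outside the codebase so docs can change independently
)

// Render a user-facing message that names the cause and the suggested fix.
fun render(n: ValidationNotification): String = buildString {
    append("Validation failed for action '${n.action}' on index '${n.index}': ${n.cause}. ")
    append("Suggested fix: ${n.remedy}.")
    n.documentationLink?.let { append(" See: $it") }
}
```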
Pros:
Validation is only executed before an action is set to take place preventing unnecessary checks.
If validation fails, the action is automatically retried or failed so that less manual labor needs to be performed.
Because validation is enabled by default, users don’t need to proactively check and query for errors when running the cluster.
Cons:
Execution is limited to the job scheduler interval (currently 5 minutes), so once a user implements a fix, it will not take effect until the next interval.
There may be additional overhead and a slowdown in performance, depending on the validation implementation for each action failure.
Workflow:
Architectural Design
Design Alternatives
Validation service not enabled by default
Pros:
ISM performance would not be affected until the user decides to enable the validation framework.
Users may not always need this validation framework when running ISM.
Cons:
Users need to manually enable the validation which can lead to underutilization of the validation service.
If there is an error, fail all actions forever and allow users to manually re-validate instead of automatically re-validating and retrying the action.
Pros:
No unnecessary retries if the action failure is not yet fixed by the user.
Cons:
Some actions don’t need to be failed and re-validating and retrying the action wouldn’t harm the cluster state.
Less automation and more manual work for the users to perform.
Implementing self-healing for certain actions by fixing or reconfiguring the cluster.
Pros:
If users don’t have to deal with cluster-level issues, it would provide a better user experience and more automation.
Users may not know how to fix cluster-level problems themselves.
Cons:
It is risky for ISM to configure the cluster directly, and the responsibility for fixing the error should lie with the user, not ISM.
Implementation Details
ValidationService.kt
class ValidationService(
    val settings: Settings,
    val clusterService: ClusterService,
    val jvmService: JvmService // passed through to each validation class below
) {
    fun validate(actionName: String, indexName: String): ValidationResult {
        // Map the action name to its validation class
        val validation = when (actionName) {
            "rollover" -> ValidateRollover(settings, clusterService, jvmService).execute(indexName)
            "delete" -> ValidateDelete(settings, clusterService, jvmService).execute(indexName)
            "force_merge" -> ValidateForceMerge(settings, clusterService, jvmService).execute(indexName)
            else -> {
                // Temporary fallback until all actions are mapped
                ValidateNothing(settings, clusterService, jvmService).execute(indexName)
            }
        }
        return ValidationResult(validation.validationMessage.toString(), validation.validationStatus)
    }
}
Validate.kt
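abstract class Validate(
    val settings: Settings,
    val clusterService: ClusterService,
    val jvmService: JvmService
) {
    var validationStatus = ValidationStatus.PASSED
    // Read by ValidationService when building the ValidationResult
    var validationMessage: String? = null

    // Takes the index name, matching the call in ValidationService.validate
    abstract fun execute(indexName: String): Validate

    enum class ValidationStatus(val status: String) : Writeable {
        PASSED("passed"),
        RE_VALIDATE("re_validate"),
        FAILED("failed");

        override fun writeTo(out: StreamOutput) {
            out.writeString(status)
        }
    }
}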
Each action's validation class will extend the Validate abstract class defined in Validate.kt and implement its abstract functions.
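To illustrate what a concrete validation class might look like, here is a self-contained sketch for the missing rollover alias case. It uses a simplified stand-in for Validate with no Settings or ClusterService, so it runs without OpenSearch on the classpath. The rolloverAliases map is a hypothetical substitute for a cluster state lookup, and the choice of RE_VALIDATE for a missing alias is an assumption rather than a decision stated in this proposal.

```kotlin
enum class ValidationStatus { PASSED, RE_VALIDATE, FAILED }

// Simplified stand-in for the abstract class in Validate.kt.
abstract class Validate {
    var validationStatus = ValidationStatus.PASSED
    var validationMessage: String? = null
    abstract fun execute(indexName: String): Validate
}

// Hypothetical rollover check: `rolloverAliases` maps an index name to its
// configured rollover alias (null or blank when none is set).
class ValidateRollover(private val rolloverAliases: Map<String, String?>) : Validate() {
    override fun execute(indexName: String): Validate {
        val alias = rolloverAliases[indexName]
        if (alias.isNullOrBlank()) {
            // Assumption: the user can fix this, so validate again at the next interval.
            validationStatus = ValidationStatus.RE_VALIDATE
            validationMessage = "Missing rollover alias for index [$indexName]"
        } else {
            validationMessage = "Rollover validation passed for index [$indexName]"
        }
        return this
    }
}
```

Returning this from execute lets the caller read both validationStatus and validationMessage from one object, which matches how ValidationService builds its ValidationResult.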
Demo
Validation Framework Demo using missing rollover alias example. In the demo, the framework is called using the Explain API and the results are shown when the service is both enabled and disabled. It is also called through the Managed Index Runner and then notifies the user through Amazon Chime.
Validation.Framework.Demo.mp4
Testing
Integration testing for each action
Unit testing on validation logic when appropriate
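A unit test for validation logic could be table-driven, as in the sketch below. The validateDelete function and its rule (fail when the index does not exist) are hypothetical and stand in for real validation logic; the sketch uses the standard library's check instead of a test framework so that it stays self-contained.

```kotlin
enum class ValidationStatus { PASSED, RE_VALIDATE, FAILED }

// Hypothetical pure validation function for the delete action:
// fails when the target index is not present in the cluster.
fun validateDelete(indexName: String, existingIndices: Set<String>): ValidationStatus =
    if (indexName in existingIndices) ValidationStatus.PASSED else ValidationStatus.FAILED

// Table-driven cases: (index, indices in the cluster, expected status).
val deleteCases = listOf(
    Triple("logs-1", setOf("logs-1", "logs-2"), ValidationStatus.PASSED),
    Triple("logs-9", setOf("logs-1", "logs-2"), ValidationStatus.FAILED),
    Triple("logs-1", emptySet<String>(), ValidationStatus.FAILED)
)

// Runs every case and throws IllegalStateException on the first mismatch.
fun runDeleteValidationTests() {
    for ((index, indices, expected) in deleteCases) {
        val actual = validateDelete(index, indices)
        check(actual == expected) { "delete validation for $index: expected $expected, got $actual" }
    }
}
```

Keeping the validation rule a pure function of its inputs makes it testable without a running cluster; integration tests can then cover the wiring to the managed index runner.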
Limitations
Not all action errors are preventable or can be caught before runtime.
Appendix
Terminology
Policy - a user defined set of rules that describe how to run certain OpenSearch operations on an index and manage them through the use of states and transitions.
Action - steps that the policy sequentially executes upon entering a specific state.
Step - individual jobs broken down from the action that execute transition conditions or an action itself.
Related issue(s): #27