[DISCUSS] Introduce New CRD Chaos #290
Chaos CRD Design document
1、Background
Automated chaos experiments need to be introduced into ShardingSphere (ss) to improve its resilience and failure recovery ability.
2、Problem description
Chaos experiments should be automated to avoid repeating the same work for every experiment: setting up the experimental environment, running the injection flow, and verifying the results.
2.1 Question 1: How to inject
How can specific failure scenarios be introduced into ss?
2.2 Question 2: How to generate pressure
How can a large number of specified requests be sent to ss-proxy during a failure to simulate a real production environment?
2.3 Question 3: How to verify the result
During the experiment, how do we collect the relevant information and define the steady state, so that we can prove whether the system stays in it?
3、Technical research
Chaos Mesh and Litmus provide many kinds of chaos experiments and cover most usage scenarios. However, they only inject faults; preparing the experimental environment and verifying the influence of faults on the steady state have to be repeated for each experiment. Therefore, we need to define our own CRD to automate the experiment process for ss-proxy, and use kubebuilder to generate the skeleton code of the CRD.
4、Scheme design
4.1 Program summary
Injection:
To inject faults into ss, the commonly used solutions are PingCAP's open-source Chaos Mesh and Litmus Chaos, which provide a variety of common fault types. For building an automated ss chaos scenario flow, however, they cannot be used directly, because their configuration is complex and independent of ss.
Chaos Mesh provides APIs for all of its CRD resource definitions, which makes it possible to simplify the operation. We can abstract our own chaos scenarios and interact with Chaos Mesh to obtain experiment information. For how to implement the interaction, refer to Chaos Mesh's official Chaos Dashboard.
Generating pressure:
To configure the environment and generate pressure, we can use DistSQL to send requests to ss-proxy and inject data into the environment; this data later serves as evidence when verifying the steady state.
Verification:
To verify the steady state, we can collect the monitoring logs to observe whether CPU and network IO fluctuate relative to the steady state, and use DistSQL to verify the correctness of the requests made in the pressure phase.
4.2 Holistic design
The chaos experiment for ss-proxy has the following parts.
The request configuration declared in .spec.accountReq is converted into jobs, and the jobs send traffic requests to the experimental environment. The specific process is as follows:
4.3 Function design
It is functionally divided into three parts: fault injection, pressure generation and verification. Users can use these functions by defining CR declaration files.
4.3.1 Feature list
Convert the fault declared by the user to the fault type in Chaos Mesh and inject it into the specified experimental environment
Inject traffic into the experimental environment
Collect the CPU, network IO and other important indicators and the program output of the experimental target, compare them with the steady-state condition, and verify the correctness of the requests made in the pressure generation stage.
4.4 CRD design
4.4.1 Spec
Generating pressure
It is used to specify the tools to be used and the configuration of the pressure request
Injection fault
.spec.chaosKind
Used to specify the type of fault to inject. The common fault fields are configured in the spec. When using a fault provided by a platform, the platform type must be written in the annotations, together with any fields of that platform's fault spec that the common fields do not cover.
Common configuration field
Fault target selector
This part is declared in .spec.podChaos
It defines pod-type faults; the action field declares the type of fault that is injected into the pod
This part is declared in .spec.networkChaos
It defines network-type faults
delay.correlation
delay.jitter
correlation: Indicates the correlation between the current latency and the previous one
jitter: Indicates the range of the network latency
loss.loss
correlation: Indicates the correlation between the probability of current packet loss and the previous time's packet loss.
duplicate.duplicate
duplicate: Indicates the probability of packet duplication
corrupt.correlation
correlation: Indicates the correlation between the probability of current packet corruption and the previous time's packet corruption.
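As a rough illustration of the spec described above, the fields could be modelled with Go types along the following lines. This is only a sketch: the Go type names, the chaosKind values "podChaos" and "networkChaos", and the Validate helper are assumptions made for the example, not the generated kubebuilder code. Only the JSON keys are taken from the field names in this section.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// All Go type names here are hypothetical; the JSON keys follow the
// field names listed in the design text.
type DelayParams struct {
	Jitter      string `json:"jitter,omitempty"`
	Correlation string `json:"correlation,omitempty"`
}

type LossParams struct {
	Loss        string `json:"loss,omitempty"`
	Correlation string `json:"correlation,omitempty"`
}

type PodChaosSpec struct {
	Action string `json:"action"`
}

type NetworkChaosSpec struct {
	Action string       `json:"action"`
	Delay  *DelayParams `json:"delay,omitempty"`
	Loss   *LossParams  `json:"loss,omitempty"`
}

type ChaosSpec struct {
	ChaosKind    string            `json:"chaosKind"`
	PodChaos     *PodChaosSpec     `json:"podChaos,omitempty"`
	NetworkChaos *NetworkChaosSpec `json:"networkChaos,omitempty"`
}

// Validate checks that the block matching chaosKind is present.
// The accepted kind values are assumptions for this sketch.
func Validate(s ChaosSpec) error {
	switch s.ChaosKind {
	case "podChaos":
		if s.PodChaos == nil {
			return fmt.Errorf("chaosKind %q requires spec.podChaos", s.ChaosKind)
		}
	case "networkChaos":
		if s.NetworkChaos == nil {
			return fmt.Errorf("chaosKind %q requires spec.networkChaos", s.ChaosKind)
		}
	default:
		return fmt.Errorf("unsupported chaosKind %q", s.ChaosKind)
	}
	return nil
}

// Render serializes the spec as indented JSON for inspection.
func Render(s ChaosSpec) (string, error) {
	out, err := json.MarshalIndent(s, "", "  ")
	return string(out), err
}

func main() {
	spec := ChaosSpec{
		ChaosKind: "networkChaos",
		NetworkChaos: &NetworkChaosSpec{
			Action: "delay",
			Delay:  &DelayParams{Jitter: "10ms", Correlation: "50"},
		},
	}
	if err := Validate(spec); err != nil {
		fmt.Println("invalid spec:", err)
		return
	}
	text, _ := Render(spec)
	fmt.Println(text)
}
```

Using pointer fields for podChaos and networkChaos keeps the unused block out of the serialized object, so a CR only carries the fault type it actually declares.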
Specific configuration spec
This part needs to be declared in annotations or env
spec/value <-----> selector.value
spec/pod/action <-----> specify .action
spec/pod/gracePeriod <-----> specify .gracePeriod
spec/targetDevice <-----> .targetDevice
spec/target/mode <-----> .selector.mode
spec/target/value <-----> .value
spec/network/action <-----> specify .action
spec/network/rate <-----> .bandwidth.rate
spec/network/limit <-----> .bandwidth.limit
spec/network/buffer <-----> .bandwidth.buffer
spec/network/peakrate <-----> .bandwidth.peakrate
spec/network/minburst <-----> .bandwidth.minburst
spec/random <-------> RANDOMNESS
- Container-kill
spec/signal <------> SIGNAL
spec/chaos_interval <-----> CHAOS_INTERVAL
spec/ramp_time <-----> RAMP_TIME
spec/duration <-------> TOTAL_CHAOS_DURATION
spec/sequence <-----> SEQUENCE
spec/lib_image <-----> LIB_IMAGE
spec/lib <----> LIB
spec/force <-----> FORCE
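The second half of the list above pairs annotation keys with upper-case names, which appear to be environment variables passed to the target platform. A minimal sketch of that translation follows; the function name EnvFromAnnotations and the choice to ignore unknown keys are assumptions, while the key/name pairs are copied from the list.

```go
package main

import (
	"fmt"
	"sort"
)

// envByAnnotation maps annotation keys to environment variable names,
// copied from the mapping list in the design text.
var envByAnnotation = map[string]string{
	"spec/random":         "RANDOMNESS",
	"spec/signal":         "SIGNAL",
	"spec/chaos_interval": "CHAOS_INTERVAL",
	"spec/ramp_time":      "RAMP_TIME",
	"spec/duration":       "TOTAL_CHAOS_DURATION",
	"spec/sequence":       "SEQUENCE",
	"spec/lib_image":      "LIB_IMAGE",
	"spec/lib":            "LIB",
	"spec/force":          "FORCE",
}

// EnvFromAnnotations returns sorted "NAME=value" pairs for every known
// annotation key. Unknown keys are ignored (an assumption of this sketch).
func EnvFromAnnotations(annotations map[string]string) []string {
	env := make([]string, 0, len(annotations))
	for key, value := range annotations {
		if name, ok := envByAnnotation[key]; ok {
			env = append(env, fmt.Sprintf("%s=%s", name, value))
		}
	}
	sort.Strings(env)
	return env
}

func main() {
	annotations := map[string]string{
		"spec/duration": "60",
		"spec/signal":   "SIGKILL",
		"unrelated/key": "ignored",
	}
	for _, e := range EnvFromAnnotations(annotations) {
		fmt.Println(e)
	}
}
```

Sorting the result makes the output deterministic, which helps when the generated environment is compared between reconcile loops.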
Logs and indicators are collected relative to the point in time when the fault is injected: indicators are gathered both in the steady state and during the fault to determine whether the test passes
Verification is carried out as a controlled experiment with two groups: a steady-state group and a fault group
Ideally, the only variable between the two groups is whether a fault is present in the experimental environment
Whether the results meet expectations is judged by the steady-state fluctuation we configure and by the execution results of the pressure jobs
As shown in the above picture, the specific process is as follows:
Steady state:
Failure:
Perform one job during the steady state and one job during the fault.
After the chaos recovers, verify the execution result of the pressure job that ran during the fault and record it in the status
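The controlled comparison described above could be sketched as follows. The metric names, the relative-tolerance rule and the Violations helper are all assumptions chosen for illustration; the design only states that indicators from the steady state and from the fault period are compared against a configured fluctuation.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Violations returns the sorted names of metrics whose value during the
// fault deviates from the steady-state value by more than the relative
// tolerance. A metric missing from the fault sample counts as a violation.
func Violations(steady, fault map[string]float64, tolerance float64) []string {
	var out []string
	for name, base := range steady {
		got, ok := fault[name]
		if !ok {
			out = append(out, name)
			continue
		}
		if base == 0 {
			// Relative deviation is undefined; any change is a violation.
			if got != 0 {
				out = append(out, name)
			}
			continue
		}
		if math.Abs(got-base)/math.Abs(base) > tolerance {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical samples for the steady-state group and the fault group.
	steady := map[string]float64{"cpu": 0.40, "network_io": 100}
	fault := map[string]float64{"cpu": 0.42, "network_io": 150}

	bad := Violations(steady, fault, 0.10)
	if len(bad) == 0 {
		fmt.Println("within steady-state fluctuation")
		return
	}
	fmt.Println("outside steady-state fluctuation:", bad)
}
```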
4.4.2 Status
This field records the progress of the injection failure, which has the following five phases
This field records the progress of the injection chaos, which has the following four phases
4.4.3 Controller design
Update .status.ChaosCondition
Chaos Mesh displays the progress of the current experiment by updating the status of four condition types. They are used as the basis for changes of .status.ChaosCondition. The change logic is as follows:
Only after all the Chaos Mesh faults we are tracking have entered the AllInjected phase can we change our state from creating to AllInjected.
When a fault is paused, we should check whether the pods and containers we selected are running properly.
When all faults are Recovered, we update our status to AllRecovered.
As mentioned in the chaos-mesh document, it also serves as the evaluation basis for the updated status
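The change logic above amounts to aggregating the conditions of every tracked fault into one value. A minimal sketch, assuming a simplified FaultStatus view of each Chaos Mesh experiment (the struct, the constant names and the precedence of paused over the other cases are assumptions, not the controller's actual code):

```go
package main

import "fmt"

type Condition string

const (
	Creating     Condition = "Creating"
	AllInjected  Condition = "AllInjected"
	Paused       Condition = "Paused"
	AllRecovered Condition = "AllRecovered"
)

// FaultStatus is a hypothetical summary of one Chaos Mesh experiment's
// status conditions.
type FaultStatus struct {
	Injected  bool
	Recovered bool
	Paused    bool
}

// Aggregate derives the overall condition from all tracked faults.
// It keeps the current condition until every fault agrees.
func Aggregate(current Condition, faults []FaultStatus) Condition {
	if len(faults) == 0 {
		return current
	}
	allInjected, allRecovered, anyPaused := true, true, false
	for _, f := range faults {
		allInjected = allInjected && f.Injected
		allRecovered = allRecovered && f.Recovered
		anyPaused = anyPaused || f.Paused
	}
	switch {
	case anyPaused:
		return Paused
	case allRecovered:
		return AllRecovered
	case allInjected:
		return AllInjected
	default:
		return current
	}
}

func main() {
	faults := []FaultStatus{{Injected: true}, {Injected: false}}
	fmt.Println(Aggregate(Creating, faults)) // one fault is still pending

	faults[1].Injected = true
	fmt.Println(Aggregate(Creating, faults)) // every fault is injected
}
```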
The different stages of .status.ChaosCondition drive pressure generation and verification. Data is collected for steady-state requests before the fault is injected, and the specified requests are made to the environment to collect data after injection (in the AllInjected state).
Update phase
BeforeReq: the initial stage, in which the experiment job is created and pressure requests are sent to the environment. In this stage, logs, indicators and the steady state are collected.
AfterReq: entered after log collection and the job have completed successfully in the previous phase; fault injection is carried out in this phase.
Injected: entered when the chaosCondition is Injected and the phase is AfterReq. The pressure job and experiment job are executed, logs and indicators are collected and compared with the steady state, and the comparison results are written back into the result.
Recovered: entered when the chaosCondition is Recovered and the phase is Injected, meaning the environment has recovered from the fault. The job execution is verified and the pod log of the job is fetched to check whether the pressure job succeeded, and the result is written back to the result.
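The phase transitions listed above form a small linear state machine. The sketch below is one way to express it; the NextPhase function, its parameters and the condition strings "AllInjected" and "AllRecovered" are assumptions for illustration.

```go
package main

import "fmt"

type Phase string

const (
	BeforeReq Phase = "BeforeReq"
	AfterReq  Phase = "AfterReq"
	Injected  Phase = "Injected"
	Recovered Phase = "Recovered"
)

// NextPhase returns the phase that follows p, or p itself when the
// precondition for moving on is not met yet.
//   - steadyJobDone: the steady-state job and log collection succeeded
//   - condition: the current value of the chaos condition
func NextPhase(p Phase, steadyJobDone bool, condition string) Phase {
	switch {
	case p == BeforeReq && steadyJobDone:
		return AfterReq
	case p == AfterReq && condition == "AllInjected":
		return Injected
	case p == Injected && condition == "AllRecovered":
		return Recovered
	default:
		return p
	}
}

func main() {
	p := BeforeReq
	steps := []struct {
		jobDone   bool
		condition string
	}{
		{false, "Creating"},    // steady-state job still running
		{true, "Creating"},     // steady-state data collected
		{true, "AllInjected"},  // fault injected
		{true, "AllRecovered"}, // fault recovered
	}
	for _, s := range steps {
		p = NextPhase(p, s.jobDone, s.condition)
		fmt.Println(p)
	}
}
```

Because each transition checks both the current phase and the condition, a reconcile loop can call the function repeatedly without skipping a stage.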
Result
Two experimental results are presented
When you need to extend more API interfaces of chaos, the interfaces that need to be implemented for pod and network types are as follows:
About the get/set interface of chaos
About the update/create/New interface of chaos
4.5 Expected
4.5.1 Expected effect
Create a definition yaml file for CR
After applying, the chaos object is created successfully, and you can see the following information
5、Demo
6、References
#272