TmpPage for Latent Fault

Execution time

$\large c_i = \sum\limits_{q}^{M} \sum\limits_{k}^F n_{i,q}^{k}p_{i}w_q$

p_i is the period for task i (assume to be deadline as well)
n_{i,q}^{k} is the number of events arrives within a period p_i in component q by thread i over function k
w_{q} is the WCET in spd q

Each thread ,say i, can execute through several components. The monitor would track the event (which can be thought as the state changed due to task execution across different components, say by invocation/return/context switch, etc.) Thread i is supposed to be a periodic task (p_i, c_i). During its execution time c_i, it can walk through different components (and kernel). So c_i consists of the different execution time in different component and therefore produces the event record in each component's buffer it visits.

In each period, the max number of events that arrives in component j is given by n_{i}^j and the WCET in component j is given by w^j. So above equation defines the total execution time of thread i in n_{i}^j for all components it visits. In practice, this rate may be obtained from the measurement (e.g. scheduler has function sched_block, and we can track how many times it is invoked by a thread and how long it takes to execute. Then we can have the worst n_{i}^j and w^j). However, for the analysis we need some bound. Therefore we assume that the WCET of thread i in component j (w^{j}) is known for now.

Keep this in mind: the worst case number of events not necessary implies the worst case execution time. It is possible that a task only has much fewer event occurs, but each has the near worst case execution time.

Buffer size

$\large B_{\Delta}^{q} = \sum_{i}^N \sum_{k}^F n_{i,q}^{k}p_{i}\lceil{\frac{\Delta}{p_i}\rceil}$

$\large B_{\Delta}^{cs} = 2\left | T_{\Delta} \right|\Delta$

$\large B_{\Delta}^{int} = N_{\Delta}^{max}\Delta$

$\large B_{\Delta} = \sum_{q}^{M}B_{\Delta}^{q}+B_{\Delta}^{cs}+B_{\Delta}^{int}$

M is the total number of components
N is the total number of threads
F is the total number of functions on a component
p_i is the period for task i
n_{i,q}^{k} is the number of events arrives within a period p_i in component q by thread i over function k
\Delta is the timing window
T_\Delta is the total number of threads in time window \Delta
N_\Delta^(max) is the max number of interrupts allowed to occur in \Delta

Assume the monitor thread is running at period p_{m} and the task i has p_{i}. So the max number the thread i can get the chance to run is \lceil{\frac{p_m}{p_i}\rceil}. Since n_{i}^j defines the number of arrival event in component j in a period p_{i}, above equation defines the following:

total number of arrival events by all threads in component q over \Delta.
total number of events caused by all context in component q over \Delta ?
total number of events due to the interrupt when the current execution is in component q over \Delta
total number of events in the buffer over \Delta

Processing time

$\large c_m = B_{\Delta} \lceil \frac{p_{m}}{\Delta} \rceil w_{v}$

p_m is the monitor thread period
w_v is WCET of single verification

This basically defines the overhead of the log processing, including the latent fault verification.

priority inversion detection
unbounded number of interrupts
unbounded WCET
(e, p, ) checking (e.g.deadline)
infinite loop detection
etc

Response time analysis (RTA)

$\large R^{i+1}=c_i+\sum_{j \le i} \lceil \frac{R^i}{p_j} \rceil c_j+ \lceil \frac{R^i}{p_m} \rceil (c_m +\sum_{j \le i}(c_j-c_j^{p_{f}}))$

where

$\large c_j^{p_{f}} = \sum\limits_{q}^{M} \sum\limits_{k}^F (n_{j,q}^{k}\frac{p_{f}}{p_j}p_{j}-1)w_q$

p_f is the latent fault period (assume a single latent fault in a period p_f)
p_m is the period of monitoring thread
c_m is the WCET of the verification process
this is similar to the c_i definition
"-1" means the closet event entry point (the last healthy point before the detected event which is shown as the red down arrow)
c_j^{p_f} is the execution time up to the last healthy event point

RTA should also consider both the monitored task set and the monitor thread. B and p_{m} will be the variable to be decided by RTA and the goal is to not miss task's deadline.

Important and challenging Once the latent fault is detected, some recovery action has to be taken. However, instead of rebooting the entire service or task, it is possible to resume the execution of the task from the most recent "healthy" event entry point. We need take this into RTA along with any other recovery overhead (e.g. u-reboot and on-demand recovery if required)

Example about how recovery fits into RTA

In above figure, there are 4 tasks. One monitor thread with highest priority of (p_{m}, c_{m}) and three regular tasks 1, 2 and 3 with (p_1, c_1), (p_2, c_2) and (p_3, c_3). The black dotted line is p_{m} and the down arrows represent when the event logged for the thread i in different components (e.g. q) or due to the interrupts and context switches.
When the monitor thread starts polling all records, it will check if any latent fault occurred. Let's say a latent fault occurs every p_{f} (shown in red dotted line).
- For events, there should be a pair of entry points in the buffer to indicate the start and end of the event. For example, invoke and return, invoke and leave, switch from and switch to, etc. Let's call these evt_s and evt_e
- Find the closet evt_s time from p_{f} for different task (red down arrow in above figure)
- All records after p_{f} should not be trusted anymore since the latent fault could effect them.
- where the re-execute should start? The re-execution part of the higher priority tasks need to be considered in RTA.
- The re-execution parts are shown as the gray/blue blocks in above figure.
- Therefore, the last term in RTA equation is sum of re-execution for all higher tasks from its closet healthy point

Terminology:

p_i is the period for task i (assume to be deadline as well)
c_i is the worst case execution time for task i
p_m is the period of monitoring thread
c_m is the WCET of the verification process
p_f is the period of latent fault
B_j is the buffer size in a component j for logging
B is max{B_j} \forall j
t_{evt} is the verification overhead of a single event entry
w_^{j} is the WCET in spd j
n_{i}^{j} is the number of events arrives in a period in component j by thread i
n_{m} is the number of events can be processed within in \Delta
r_{c} is event consumption rate
r_{cs} is event arrival rate due to context switch
N is the total number of threads
M is the total number of components
F is the total number of functions over a component
\Delta is the time window in which the buffer size is investigated

Something else

currently it is single producer single consumer. Need multiple producer and single consumer
share event RB with interrupt RB, CS RB?
need take the overhead of tracking/inserting event record into above formula...., really necessary?
how long a record can stay in the buffer before processed? (buffering time)
how many slots are empty? how many are occupied?
what happen if a thread spends much time in a component but very little, if not zero in another component?
the WCET of an event in a component is based on the invocation time stamp. How to measure the execution time that has no return or leaving from a component?
the verification process is not round robin, instead it is a trace of execution in time.
can be multiple consumer threads? Each is for a buffer? Seems not necessary....
some application? pipeline image processing?
experiments: need manually introduce some latent fault

Provide feedback

Saved searches

Use saved searches to filter your results more quickly