CFE_ES_ProcessCoreException() is not interrupt-safe #76

skliper · 2019-09-30T17:50:53Z

The CFE_ES_ProcessCoreException() architecture needs to be re-examined.

This is the equivalent of an interrupt handler and likely triggered as the result of a hardware interrupt. However the implementation calls other CFE_ES functions, some of which take the global data mutex (CFE_ES_LockSharedData), do console output (CFE_ES_WriteToSyslog and other functions which ultimately do printf), or call OS_TaskDelay.

All of these operations are unsafe to do in an interrupt context on most platforms.

This may not be an issue if the end-result of the exception is a processor reset, which is probably the only safe thing to do. Although the option to restart the task does exist, the system may be too far gone after this.

skliper · 2019-09-30T17:50:55Z

Imported from trac issue 45. Created by jphickey on 2015-05-07T11:48:29, last modified: 2019-04-24T13:18:47

skliper · 2019-09-30T17:50:55Z

Trac comment by jphickey on 2015-06-15 17:44:06:

Another issue with CFE_ES_ProcessCoreException():

The API is somewhat flawed in that the first parameter is a uint32 for which the PSP uses the "raw" (unabstracted) underlying OS task identifier.

Depending on the OS, the actual OS task identifier may not be an integer and may not fit into 32 bits. (Cygwin is one example where the {{{pthread_t}}} type is actually a structure and is not a simple integer).

It would be better to move the logic to figure out which OSAL task is associated with that host task ID //before// calling {{{CFE_ES_ProcessCoreException()}}}. This way, the first parameter could be an OSAL task ID which would make the code more straightforward as well (removing the loop to call {{{OS_TaskGetInfo()}}} etc).

skliper · 2019-09-30T17:50:56Z

Trac comment by jphickey on 2018-05-14 11:19:12:

This should be considered for the next CFE release. It will likely need a coordinated change between PSP and CFE.

skliper · 2019-09-30T17:50:57Z

Trac comment by sseeger on 2018-06-23 11:07:50:

This seems to be used in vxWorks implementations. There is a CFE_PSP_ExceptionHook for rtems, but I don't see where it's registered. Possibly built-in to RTEMS source (in-house modification) or something?

After analyzing the vxWorks use case, and what I know of vxWorks, the answer to this problem seems to be to have the handler wake up a task created by PSP to handle the exception for us. vxWorks guarantees the task that caused the exception will be suspended, so this helper task can handle it. If we go this route, then we should be able to leave CFE_ES_ProcessCoreException() unmodified. Let's discuss this at the next CCB to see if there is consensus with this idea, or if there are other suggestions.

skliper · 2019-09-30T17:50:57Z

Trac comment by jhageman on 2019-03-04 09:20:25:

CCB - Needs review for path forward.

skliper · 2019-09-30T17:50:58Z

Trac comment by jhageman on 2019-04-10 13:31:36:

CCB 4/10/19

Discussed review in context of it only works on one old system, verification fails and likely more appropriate at a lower level and/or a very project/mission specific solution in many cases. Suggestion is to deprecate at the cFE level.

skliper · 2019-09-30T17:50:58Z

Trac comment by jphickey on 2019-04-10 14:25:19:

Brainstorming about this one a bit, on most systems the "exceptions" that occur lie in one of three categories:

Arithmetic Error
Physical memory bus error
Memory access violation / virtual memory error

Case (1): Arithmetic errors are almost always just bad input and (or bad code that didn't check its input) and generally have low likelihood of affecting the other tasks running in the system. For this, restarting only the specific task that caused the problem might be a suitable response, and there could be a reasonably high level of confidence that the remainder of the system will continue to work properly.

Case (2): A bus error means that the memory itself is bad, or the external memory-mapped device on the bus is faulted, or that the CPU is trying to access a place outside its valid physical memory map. All of which suggest that the system is seriously ill and probably nothing less than a full system restart (including external devices) will restore it to normal operation. The only possibility here is that if the PSP can positively identify the cause of the bus error and fix the actual root cause. But without this, the external bus could be left in an indeterminate state and all future accesses might fail too.

Case (3): This only applies to systems with an MMU, and means the software accessed something outside its virtual address space. CFE ES/OSAL does //not// currently run apps in separate memory spaces, all tasks share the same virtual address space. If you are lucky enough to catch this on the first bad access (like a NULL pointer dereference) then task-specific recovery might be possible. But more likely, this happens with e.g. a memset() with a size too big or a strcpy() of an unterminated string or something of that nature. In these cases, it could overwrite vast swaths of //other// task's data prior to actually triggering the access exception. If, in a future version of OSAL/CFE, we embrace separate MMU memory spaces per-app, then recovery would be more feasible. But with the current architecture, it probably isn't wise to even attempt it, there are just too many unknowns.

My opinion:
In a system like CFE where the reliability of the software is crucial, so the action for an exception should be the minimum required to bring the system back to a //fully known state//. As such, the default action for 2 & 3 above needs to be a system restart, unless the system integration engineers can determine a reliable recovery mechanism for some faults that entails less than a full restart. But this would be a mission and PSP-specific option, not the default.

skliper · 2019-09-30T17:50:59Z

Trac comment by jhageman on 2019-04-24 13:15:26:

Also related to discussion on #310, mission specific handling should go in PSP.

skliper · 2019-09-30T17:50:59Z

Trac comment by jhageman on 2019-04-24 13:18:47:

Latest actions for this ticket:

Update app guide (normal tasks only)
Update PSP guide/API/docs ISR handling goes here
Deprecate exception handling at cFE level

skliper · 2020-04-08T21:46:34Z

Whatever path is chosen, there is no requirement since it's not an application API. Recommend moving out of cfe_es_api.c.

ghost · 2020-04-09T21:07:08Z

If this is done along with #411, then can the new ES background task log what is necessary and do either the App Restart or PSP based reset?
It would satisfy @skliper's recommendation of removing it from cfe_es_api.c.
Exception classifications sound reasonable, but it would be good to have the ability for a systems integration engineer to customize this at the PSP level.
But it's probably true that most of the exception cases will merit a reset, not just an app restart.

skliper · 2020-04-09T21:13:44Z

I should likely write a functional requirement (that it logs the information and takes the platform defined response) or whatever it ends up doing so it gets build verified, just not a "upon request" type of requirement.

skliper added this to the 7.0.0 milestone Sep 30, 2019

skliper self-assigned this Sep 30, 2019

skliper added the bug label Sep 30, 2019

skliper mentioned this issue Sep 30, 2019

CFE_ES_ProcessCoreException and CFE_ES_ResetCFE are not ISR safe (GSFC DCR 18593) #122

Closed

skliper assigned ghost and unassigned skliper Sep 30, 2019

skliper mentioned this issue Sep 30, 2019

Calling CFE_TIME_GetReference in interrupt context can lock up system #310

Closed

jphickey mentioned this issue Nov 18, 2019

Exception and Reset Log possible race conditions #411

Closed

skliper unassigned ghost Feb 26, 2020

skliper modified the milestones: 7.0.0, 6.8.0 Feb 26, 2020

skliper assigned jphickey Mar 2, 2020

jphickey mentioned this issue Apr 14, 2020

Improve ES reset area memory allocation #610

Open

This was referenced Apr 23, 2020

Deprecate OStask_id field from OS_task_prop_t nasa/osal#437

Closed

Fix #411, rework exception handling in CFE #653

Merged

Implement exception logging/capture in PSP nasa/PSP#158

Closed

Fix #158, Implement exception storage in PSP nasa/PSP#159

Merged

skliper linked a pull request May 19, 2020 that will close this issue

Integration Candidate COMBINED 2020-04-29 and 2020-05-06 #692

Merged

skliper closed this as completed May 19, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CFE_ES_ProcessCoreException() is not interrupt-safe #76

CFE_ES_ProcessCoreException() is not interrupt-safe #76

skliper commented Sep 30, 2019

skliper commented Sep 30, 2019

skliper commented Sep 30, 2019

skliper commented Sep 30, 2019

skliper commented Sep 30, 2019

skliper commented Sep 30, 2019

skliper commented Sep 30, 2019

skliper commented Sep 30, 2019

skliper commented Sep 30, 2019

skliper commented Sep 30, 2019

skliper commented Apr 8, 2020

ghost commented Apr 9, 2020

skliper commented Apr 9, 2020

CFE_ES_ProcessCoreException() is not interrupt-safe #76

CFE_ES_ProcessCoreException() is not interrupt-safe #76

Comments

skliper commented Sep 30, 2019

skliper commented Sep 30, 2019

skliper commented Sep 30, 2019

skliper commented Sep 30, 2019

skliper commented Sep 30, 2019

skliper commented Sep 30, 2019

skliper commented Sep 30, 2019

skliper commented Sep 30, 2019

skliper commented Sep 30, 2019

skliper commented Sep 30, 2019

skliper commented Apr 8, 2020

ghost commented Apr 9, 2020

skliper commented Apr 9, 2020