Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CFE_ES_ProcessCoreException() is not interrupt-safe #76

Closed
skliper opened this issue Sep 30, 2019 · 12 comments · Fixed by #653 or #692
Closed

CFE_ES_ProcessCoreException() is not interrupt-safe #76

skliper opened this issue Sep 30, 2019 · 12 comments · Fixed by #653 or #692
Assignees
Labels
Milestone

Comments

@skliper
Copy link
Contributor

skliper commented Sep 30, 2019

The CFE_ES_ProcessCoreException() architecture needs to be re-examined.

This is the equivalent of an interrupt handler and likely triggered as the result of a hardware interrupt. However the implementation calls other CFE_ES functions, some of which take the global data mutex (CFE_ES_LockSharedData), do console output (CFE_ES_WriteToSyslog and other functions which ultimately do printf), or call OS_TaskDelay.

All of these operations are unsafe to do in an interrupt context on most platforms.

This may not be an issue if the end-result of the exception is a processor reset, which is probably the only safe thing to do. Although the option to restart the task does exist, the system may be too far gone after this.

@skliper skliper added this to the 7.0.0 milestone Sep 30, 2019
@skliper skliper self-assigned this Sep 30, 2019
@skliper skliper added the bug label Sep 30, 2019
@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Imported from trac issue 45. Created by jphickey on 2015-05-07T11:48:29, last modified: 2019-04-24T13:18:47

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2015-06-15 17:44:06:

Another issue with CFE_ES_ProcessCoreException():

The API is somewhat flawed in that the first parameter is a uint32 for which the PSP uses the "raw" (unabstracted) underlying OS task identifier.

Depending on the OS, the actual OS task identifier may not be an integer and may not fit into 32 bits. (Cygwin is one example where the {{{pthread_t}}} type is actually a structure and is not a simple integer).

It would be better to move the logic to figure out which OSAL task is associated with that host task ID //before// calling {{{CFE_ES_ProcessCoreException()}}}. This way, the first parameter could be an OSAL task ID which would make the code more straightforward as well (removing the loop to call {{{OS_TaskGetInfo()}}} etc).

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2018-05-14 11:19:12:

This should be considered for the next CFE release. It will likely need a coordinated change between PSP and CFE.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by sseeger on 2018-06-23 11:07:50:

This seems to be used in vxWorks implementations. There is a CFE_PSP_ExceptionHook for rtems, but I don't see where it's registered. Possibly built-in to RTEMS source (in-house modification) or something?

After analyzing the vxWorks use case, and what I know of vxWorks, the answer to this problem seems to be to have the handler wake up a task created by PSP to handle the exception for us. vxWorks guarantees the task that caused the exception will be suspended, so this helper task can handle it. If we go this route, then we should be able to leave CFE_ES_ProcessCoreException() unmodified. Let's discuss this at the next CCB to see if there is consensus with this idea, or if there are other suggestions.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jhageman on 2019-03-04 09:20:25:

CCB - Needs review for path forward.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jhageman on 2019-04-10 13:31:36:

CCB 4/10/19

  • Discussed review in context of it only works on one old system, verification fails and likely more appropriate at a lower level and/or a very project/mission specific solution in many cases. Suggestion is to deprecate at the cFE level.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2019-04-10 14:25:19:

Brainstorming about this one a bit, on most systems the "exceptions" that occur lie in one of three categories:

  1. Arithmetic Error
  2. Physical memory bus error
  3. Memory access violation / virtual memory error

Case (1): Arithmetic errors are almost always just bad input and (or bad code that didn't check its input) and generally have low likelihood of affecting the other tasks running in the system. For this, restarting only the specific task that caused the problem might be a suitable response, and there could be a reasonably high level of confidence that the remainder of the system will continue to work properly.

Case (2): A bus error means that the memory itself is bad, or the external memory-mapped device on the bus is faulted, or that the CPU is trying to access a place outside its valid physical memory map. All of which suggest that the system is seriously ill and probably nothing less than a full system restart (including external devices) will restore it to normal operation. The only possibility here is that if the PSP can positively identify the cause of the bus error and fix the actual root cause. But without this, the external bus could be left in an indeterminate state and all future accesses might fail too.

Case (3): This only applies to systems with an MMU, and means the software accessed something outside its virtual address space. CFE ES/OSAL does //not// currently run apps in separate memory spaces, all tasks share the same virtual address space. If you are lucky enough to catch this on the first bad access (like a NULL pointer dereference) then task-specific recovery might be possible. But more likely, this happens with e.g. a memset() with a size too big or a strcpy() of an unterminated string or something of that nature. In these cases, it could overwrite vast swaths of //other// task's data prior to actually triggering the access exception. If, in a future version of OSAL/CFE, we embrace separate MMU memory spaces per-app, then recovery would be more feasible. But with the current architecture, it probably isn't wise to even attempt it, there are just too many unknowns.

My opinion:
In a system like CFE where the reliability of the software is crucial, so the action for an exception should be the minimum required to bring the system back to a //fully known state//. As such, the default action for 2 & 3 above needs to be a system restart, unless the system integration engineers can determine a reliable recovery mechanism for some faults that entails less than a full restart. But this would be a mission and PSP-specific option, not the default.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jhageman on 2019-04-24 13:15:26:

Also related to discussion on #310, mission specific handling should go in PSP.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jhageman on 2019-04-24 13:18:47:

Latest actions for this ticket:

  • Update app guide (normal tasks only)
  • Update PSP guide/API/docs ISR handling goes here
  • Deprecate exception handling at cFE level

@skliper
Copy link
Contributor Author

skliper commented Apr 8, 2020

Whatever path is chosen, there is no requirement since it's not an application API. Recommend moving out of cfe_es_api.c.

@ghost
Copy link

ghost commented Apr 9, 2020

If this is done along with #411, then can the new ES background task log what is necessary and do either the App Restart or PSP based reset?
It would satisfy @skliper's recommendation of removing it from cfe_es_api.c.
Exception classifications sound reasonable, but it would be good to have the ability for a systems integration engineer to customize this at the PSP level.
But it's probably true that most of the exception cases will merit a reset, not just an app restart.

@skliper
Copy link
Contributor Author

skliper commented Apr 9, 2020

I should likely write a functional requirement (that it logs the information and takes the platform defined response) or whatever it ends up doing so it gets build verified, just not a "upon request" type of requirement.

@skliper skliper closed this as completed May 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants