OS_TaskDelete is not synchronous on POSIX #642

jphickey · 2020-11-03T22:39:06Z

Describe the bug
The current POSIX implementation of OS_TaskDelete() uses pthread_cancel(). This is a cancellation request but the target thread continues to run until it hits a cancellation point. As a result, when the OS_TaskDelete() function returns, it is likely that the target thread is actually still running for a short period of time.

But if the intent is to actually unload the module, as is done in the CFE "restart" and "reload" commands, it is critical to ensure that all uses/references to code within the to-be-unloaded module have actually been released. So it is important to make sure that the cancellation request has been executed and the task is actually deleted before proceeding to the OS_ModuleUnload() call.

Currently with the POSIX implementation, there is no way to guarantee this, so this becomes a race condition during restart/reload operations on this platform.

To Reproduce
Not directly reproducible in current code - this race condition is currently masked by the fact that modules are loaded with RTLD_GLOBAL, so if the runtime loader still sees the module being referenced using its internal refcount, it doesn't actually unload it when dlclose() is called. So in this case the task finishes up normally and there is no apparent problem.

But when this is fixed - such as when using RTLD_LOCAL as suggested in #641 - then this race condition becomes a problem, and as a result a CFE reload or restart command sometimes triggers a segmentation fault in the event that OS_ModuleUnload() gets called before the task has fully been deleted.

Expected behavior
OS_TaskDelete() should ensure that the task has been fully removed, not just pending removal, before it returns to the caller.

Suggestion to achieve this:

Keep threads in the attached (joinable) state by default
Use pthread_join() to wait for the cancelled task to actually exit.

System observed on:
Ubuntu 20.04

Requester Info
Joseph Hickey, Vantage Systems, Inc.

The text was updated successfully, but these errors were encountered:

astrogeco · 2020-11-04T18:00:44Z

CCB 2020-11-04

Can't fix reload problem without fixing this
We have discussed how OS_TaskDelete is dangerous.
- Safer using the appropriate flag in POSIX assuming the we do the correct cleanup
- If app takes a mutex that can't be cancelled we could have issues but we can't control this
Might need a splinter and further discussion

In the POSIX implementation, OS_TaskDelete was implemented in a deferred manner - the API call was a request, and the actual deletion occured sometime thereafter. This is a problem if the task is running code within a dynamically loaded module, and the intent is to delete the task so the module can be unloaded. In this case the app needs to be certain that the task has actually been deleted before unloading can be done safely. To do this requires use of pthread_join() on POSIX which confirms that the task has exited. However, this is a (potentially) blocking call, so to do this requires rework of the EXCLUSIVE lock mode such that the OSAL lock is _not_ held during the join operation.

Fix #642, 645, 701, 702, 703 - OSAL global table management

Fix nasa#641, string operations on GCC9

jphickey self-assigned this Nov 3, 2020

jphickey added the bug label Nov 3, 2020

astrogeco added the CCB-20201104 label Nov 4, 2020

jphickey mentioned this issue Nov 5, 2020

OS_TaskExit() on VxWorks causes task table owner to be the deleted task #645

Closed

This was referenced Dec 7, 2020

Fix #952, OSAL module flags to permit app reload nasa/cFE#1038

Merged

osal core test is not a good example of how to write tasks #670

Closed

skliper mentioned this issue Dec 16, 2020

Fix #642, synchronous task delete on POSIX #676

Closed

skliper linked a pull request Dec 16, 2020 that will close this issue

Fix #642, synchronous task delete on POSIX #676

Closed

jphickey mentioned this issue Dec 22, 2020

Unlock globals during create/delete ops #703

Closed

jphickey mentioned this issue Dec 22, 2020

Fix #642, 645, 701, 702, 703 - OSAL global table management #704

Merged

jphickey added a commit that referenced this issue Jan 11, 2021

Merge pull request #704 from jphickey/fix-642-645-702-703

ccbbcac

Fix #642, 645, 701, 702, 703 - OSAL global table management

astrogeco mentioned this issue Jan 11, 2021

osal Integration candidate: 2021-01-12 #750

Merged

astrogeco closed this as completed in #750 Jan 13, 2021

skliper added this to the 6.0.0 milestone Sep 24, 2021

jphickey pushed a commit to jphickey/osal that referenced this issue Aug 10, 2022

Merge pull request nasa#642 from jphickey/fix-641-stringops

0f363dd

Fix nasa#641, string operations on GCC9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OS_TaskDelete is not synchronous on POSIX #642

OS_TaskDelete is not synchronous on POSIX #642

jphickey commented Nov 3, 2020 •

edited

Loading

astrogeco commented Nov 4, 2020

OS_TaskDelete is not synchronous on POSIX #642

OS_TaskDelete is not synchronous on POSIX #642

Comments

jphickey commented Nov 3, 2020 • edited Loading

astrogeco commented Nov 4, 2020

jphickey commented Nov 3, 2020 •

edited

Loading