Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OS_TaskDelete is not synchronous on POSIX #642

Closed
jphickey opened this issue Nov 3, 2020 · 1 comment · Fixed by #704 or #750
Closed

OS_TaskDelete is not synchronous on POSIX #642

jphickey opened this issue Nov 3, 2020 · 1 comment · Fixed by #704 or #750
Assignees
Labels
Milestone

Comments

@jphickey
Copy link
Contributor

jphickey commented Nov 3, 2020

Describe the bug
The current POSIX implementation of OS_TaskDelete() uses pthread_cancel(). This is a cancellation request but the target thread continues to run until it hits a cancellation point. As a result, when the OS_TaskDelete() function returns, it is likely that the target thread is actually still running for a short period of time.

But if the intent is to actually unload the module, as is done in the CFE "restart" and "reload" commands, it is critical to ensure that all uses/references to code within the to-be-unloaded module have actually been released. So it is important to make sure that the cancellation request has been executed and the task is actually deleted before proceeding to the OS_ModuleUnload() call.

Currently with the POSIX implementation, there is no way to guarantee this, so this becomes a race condition during restart/reload operations on this platform.

To Reproduce
Not directly reproducible in current code - this race condition is currently masked by the fact that modules are loaded with RTLD_GLOBAL, so if the runtime loader still sees the module being referenced using its internal refcount, it doesn't actually unload it when dlclose() is called. So in this case the task finishes up normally and there is no apparent problem.

But when this is fixed - such as when using RTLD_LOCAL as suggested in #641 - then this race condition becomes a problem, and as a result a CFE reload or restart command sometimes triggers a segmentation fault in the event that OS_ModuleUnload() gets called before the task has fully been deleted.

Expected behavior
OS_TaskDelete() should ensure that the task has been fully removed, not just pending removal, before it returns to the caller.

Suggestion to achieve this:

  • Keep threads in the attached (joinable) state by default
  • Use pthread_join() to wait for the cancelled task to actually exit.

System observed on:
Ubuntu 20.04

Requester Info
Joseph Hickey, Vantage Systems, Inc.

@jphickey jphickey self-assigned this Nov 3, 2020
@jphickey jphickey added the bug label Nov 3, 2020
@astrogeco
Copy link
Contributor

CCB 2020-11-04

  • Can't fix reload problem without fixing this
  • We have discussed how OS_TaskDelete is dangerous.
    • Safer using the appropriate flag in POSIX assuming the we do the correct cleanup
    • If app takes a mutex that can't be cancelled we could have issues but we can't control this
  • Might need a splinter and further discussion

jphickey added a commit to jphickey/osal that referenced this issue Dec 9, 2020
In the POSIX implementation, OS_TaskDelete was implemented in a
deferred manner - the API call was a request, and the actual
deletion occured sometime thereafter.  This is a problem if the
task is running code within a dynamically loaded module, and the
intent is to delete the task so the module can be unloaded.  In
this case the app needs to be certain that the task has actually
been deleted before unloading can be done safely.

To do this requires use of pthread_join() on POSIX which confirms
that the task has exited.  However, this is a (potentially) blocking
call, so to do this requires rework of the EXCLUSIVE lock mode
such that the OSAL lock is _not_ held during the join operation.
jphickey added a commit to jphickey/osal that referenced this issue Dec 10, 2020
In the POSIX implementation, OS_TaskDelete was implemented in a
deferred manner - the API call was a request, and the actual
deletion occured sometime thereafter.  This is a problem if the
task is running code within a dynamically loaded module, and the
intent is to delete the task so the module can be unloaded.  In
this case the app needs to be certain that the task has actually
been deleted before unloading can be done safely.

To do this requires use of pthread_join() on POSIX which confirms
that the task has exited.  However, this is a (potentially) blocking
call, so to do this requires rework of the EXCLUSIVE lock mode
such that the OSAL lock is _not_ held during the join operation.
jphickey added a commit to jphickey/osal that referenced this issue Dec 11, 2020
In the POSIX implementation, OS_TaskDelete was implemented in a
deferred manner - the API call was a request, and the actual
deletion occured sometime thereafter.  This is a problem if the
task is running code within a dynamically loaded module, and the
intent is to delete the task so the module can be unloaded.  In
this case the app needs to be certain that the task has actually
been deleted before unloading can be done safely.

To do this requires use of pthread_join() on POSIX which confirms
that the task has exited.  However, this is a (potentially) blocking
call, so to do this requires rework of the EXCLUSIVE lock mode
such that the OSAL lock is _not_ held during the join operation.
@skliper skliper linked a pull request Dec 16, 2020 that will close this issue
jphickey added a commit to jphickey/osal that referenced this issue Dec 22, 2020
In the POSIX implementation, OS_TaskDelete was implemented in a
deferred manner - the API call was a request, and the actual
deletion occured sometime thereafter.  This is a problem if the
task is running code within a dynamically loaded module, and the
intent is to delete the task so the module can be unloaded.  In
this case the app needs to be certain that the task has actually
been deleted before unloading can be done safely.

To do this requires use of pthread_join() on POSIX which confirms
that the task has exited.  However, this is a (potentially) blocking
call, so to do this requires rework of the EXCLUSIVE lock mode
such that the OSAL lock is _not_ held during the join operation.
jphickey added a commit that referenced this issue Jan 11, 2021
Fix #642, 645, 701, 702, 703 - OSAL global table management
@skliper skliper added this to the 6.0.0 milestone Sep 24, 2021
jphickey pushed a commit to jphickey/osal that referenced this issue Aug 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment