Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stopping an APP that has a locked mutex using CFE_ES_StopAppCmd BUG #2107

Open
igorluppi opened this issue May 19, 2022 · 5 comments
Open

Stopping an APP that has a locked mutex using CFE_ES_StopAppCmd BUG #2107

igorluppi opened this issue May 19, 2022 · 5 comments
Labels

Comments

@igorluppi
Copy link

igorluppi commented May 19, 2022

Describe the bug
An APP cannot be correctly stopped by CFE_ES if the CFE_ES_StopAppCmd is sent in the same moment that the APP has a mutex locked. Although the app is terminated (with errors), when we try to restart it, it doesn't work.

To Reproduce
Steps to reproduce the behavior:

  1. Create a mutex in SAMPLE_APP using OS_MutSemCreate
  2. Take the mutex in the SAMPLE_APP RunLoop using OS_MutSemTake
  3. Perform CFE_ES CMD CFE_ES_StopAppCmd using "SAMPLE_APP" as target
  4. See error CFE_ES_CleanupObjectCallback: Call to OSAL Delete Object (ID:327691) failed. RC=-6

Expected behavior

  1. CFE_ES should be able to unlock the mutexes in a target APP in order to close all its resources.
    1.1. In case CFE_ES behavior can't do that, how can we handle apps that have mutexes and may be closed in the exactly same moment that its mutex is locked ("Taken") ?

System observed on:

Additional context
An example:
image

Reporter Info
Igor Luppi

@igorluppi igorluppi changed the title Stoping an APP thas has a locked mutex using CFE_ES_StopAppCmd BUG Stoping an APP thad has a locked mutex using CFE_ES_StopAppCmd BUG May 19, 2022
@igorluppi igorluppi changed the title Stoping an APP thad has a locked mutex using CFE_ES_StopAppCmd BUG Stoping an APP that has a locked mutex using CFE_ES_StopAppCmd BUG May 19, 2022
@igorluppi igorluppi changed the title Stoping an APP that has a locked mutex using CFE_ES_StopAppCmd BUG Stopping an APP that has a locked mutex using CFE_ES_StopAppCmd BUG May 19, 2022
@skliper
Copy link
Contributor

skliper commented May 20, 2022

Possibly related: #701, nasa/osal#470 and the fix in nasa/osal#472. Although this change was for binary semaphores, may need a similar "fix" for a OS_MutSemTake.

@igorluppi
Copy link
Author

igorluppi commented May 20, 2022

The real problem is trying to create the mutex again when the APP is restarted.
In function OS_MutSemCreate:
return_code = OS_ObjectIdAllocateNew(LOCAL_OBJID_TYPE, sem_name, &local_id, &record);
The return_code is not OS_SUCCESS, its value is OS_ERR_NAME_TAKEN, so basically as the APP did not close properly, the mutex still exists in cFE Table.
Any way can we get around that?

  • The mutex name exists in cFE table, but we don't have the id_num to delete it or recover it. Do we?

@skliper skliper added the bug label May 23, 2022
@skliper
Copy link
Contributor

skliper commented May 23, 2022

There is OS_MutSemGetIdByName... although that doesn't solve the underlying delete issue.

@skliper
Copy link
Contributor

skliper commented May 25, 2022

Really the app should clean up after exiting it's run loop that at minimum releases the mutex. See HS as an example (not of mutext cleanup, bun unregistering callbacks and closing child tasks). The CFE_ES_StopAppCmd should cause the CFE_ES_RunLoop to return false and allow the app to clean up.

Runloop: https://github.com/nasa/HS/blob/c5ef5acf65ca346270f721e22620e9321a930075/fsw/src/hs_app.c#L120
Cleanup call: https://github.com/nasa/HS/blob/c5ef5acf65ca346270f721e22620e9321a930075/fsw/src/hs_app.c#L184
Cleanup implementation: https://github.com/nasa/HS/blob/c5ef5acf65ca346270f721e22620e9321a930075/fsw/src/hs_custom.c#L138-L155

This approach is likely cleaner/safer than the possible side effects of OSAL trying to clean up and delete a locked mutex...

Basically you should never be able to return from the main app routine without unlocking the mutex. That's just bad manners.

@igorluppi
Copy link
Author

igorluppi commented May 26, 2022

  • Any way can we get around that?
    Answering my own question, using count semaphore with sem_initial_value = 1 works. It's a mutex but no error is raised on the release of the mutex if the releasing task is not its owner, as the true pthread_mutex implementation does.
    Moreover, the binary semaphore not only locks in the same situation, but also deadlock cFE service as well, avoid that. I was not able to finish the software with SIGINT (ctrl-c) after trying to stop an APP with a locked binary semaphore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants