
Eliminating recursive locks in SB could allow for using more efficient resource #948

Closed
skliper opened this issue Oct 14, 2020 · 8 comments · Fixed by #1092

skliper (Contributor) commented Oct 14, 2020

Describe the request
Recursive locks are possible in the following code (and may exist in other locations as well):

```c
CFE_EVS_SendEventWithAppID(CFE_SB_SUBSCRIPTION_REMOVED_EID, CFE_EVS_EventType_DEBUG, CFE_SB.AppId,
                           "Subscription Removed:Msg 0x%x on pipe %d,app %s",
                           (unsigned int)CFE_SB_MsgIdToValue(MsgId),
                           (int)PipeId, CFE_SB_GetAppTskName(TskId, FullName));

CFE_EVS_SendEventWithAppID(CFE_SB_UNSUB_NO_SUBS_EID, CFE_EVS_EventType_INFORMATION, CFE_SB.AppId,
                           "Unsubscribe Err:No subs for Msg 0x%x on %s,app %s",
                           (unsigned int)CFE_SB_MsgIdToValue(MsgId),
                           PipeName, CFE_SB_GetAppTskName(TskId, FullName));

CFE_EVS_SendEvent(CFE_SB_FULL_SUB_PKT_EID, CFE_EVS_EventType_DEBUG,
                  "Full Sub Pkt %d Sent,Entries=%d,Stat=0x%x\n", (int)SegNum, (int)EntryNum, (unsigned int)Stat);

CFE_EVS_SendEvent(CFE_SB_PART_SUB_PKT_EID, CFE_EVS_EventType_DEBUG,
                  "Partial Sub Pkt %d Sent,Entries=%d,Stat=0x%x", (int)SegNum, (int)EntryNum, (unsigned int)Stat);
```

Related: the locking in the SendPrevSubs command handling doesn't look like it really helps, since it has to unlock to send the message (the same issue as the commands that record route/map info to a file). The typical use case is to enable subscription reporting and then send all previous subscriptions, so it may make sense to refactor (and possibly throttle).

To Reproduce
Clear filters on the debug messages and trigger (I stopped SAMPLE_APP to cause the pipe deletion), or just subscribe and unsubscribe twice to trigger CFE_SB_UNSUB_NO_SUBS_EID.

Expected behavior
Avoiding recursive locks could allow using a more efficient resource on platforms where one is supported.

Code snips
See above.

System observed on:
From code analysis, tested on Ubuntu 18.04.

Additional context
From analysis during #928 and #947

Reporter Info
Jacob Hageman - NASA/GSFC

@skliper skliper added the bug label Oct 14, 2020
@skliper skliper added this to the 7.0.0 milestone Oct 14, 2020
skliper (Contributor, Author) commented Oct 14, 2020

Note any fix will conflict with #947, so hold off for now.

skliper (Contributor, Author) commented Oct 14, 2020

Tested on Linux: enabled all event types, then stopped SAMPLE_APP, which reported:

```
1980-012-14:16:00.50098 CFE_ES_DeleteApp: Delete Application SAMPLE_APP Initiated
EVS Port1 66/1/CFE_ES 7: Stop Application SAMPLE_APP Initiated.
1980-012-14:16:03.00037 Application SAMPLE_APP called CFE_ES_ExitApp
EVS Port1 66/1/CFE_SB 48: Subscription Removed:Msg 0x1883 on pipe 5,app CFE_ES.ES_BackgroundTask
EVS Port1 66/1/CFE_SB 48: Subscription Removed:Msg 0x1882 on pipe 5,app CFE_ES.ES_BackgroundTask
EVS Port1 66/1/CFE_SB 47: Pipe Deleted:id 5,owner SAMPLE_APP
EVS Port1 66/1/CFE_ES 8: Stop Application SAMPLE_APP Completed.
```

Sent a few no-ops and they all worked, so the double lock didn't seem to hang anything on this system.

jphickey (Contributor) commented

I was mistaken; I had thought we were using fast mutexes, but we are in fact using recursive mutexes on POSIX:

https://github.com/nasa/osal/blob/3e087e2727e29791f81a7628346426eb7e3bc44a/src/os/posix/src/os-impl-mutex.c#L95-L98

So this is probably why it works.

Using RECURSIVE is sort of like a cheat/easy way out - using a slower resource all the time because one doesn't want to expend the effort to avoid double lock. Ideally we should fix the code so it works with normal (non-recursive) mutexes.

skliper (Contributor, Author) commented Oct 15, 2020

At least from the documentation I've come across, all three OSAL implementations in the framework support nested locks. I don't see it mentioned in the API that an implementation MUST support nested locks, though. There is a test to check it:

https://github.com/nasa/osal/blob/3e087e2727e29791f81a7628346426eb7e3bc44a/src/tests/mutex-test/mutex-test.c#L246-L259

Given that, I wouldn't think we'd want to change the behavior of this API. As you mention, we could eliminate all the recursive locks and add a new API to improve performance... but at this point it's not a priority.

@skliper skliper added enhancement and removed bug labels Oct 15, 2020
@skliper skliper changed the title Possible double locks in SB Eliminating recursive locks in SB could allow for using more efficient resource Oct 15, 2020
jphickey (Contributor) commented

I agree WRT not changing the OSAL behavior at this time, but we definitely should work toward removing the requirement/dependency on recursive mutexes in CFE. It is just bad design to have tasks locking the same resources more than once.

I wrote issue nasa/osal#623 for some things we can easily do in OSAL to facilitate this. My recommendation is to start with a debug message in the event that a task is double-locking (this can be done at the shared layer).

skliper (Contributor, Author) commented Jan 19, 2021

@jphickey Did #1092 fix these?

jphickey (Contributor) commented

> @jphickey Did #1092 fix these?

Yes, it does - at least for all the cases listed in the description. The general pattern implemented now defers sending events (including debug events) until the function is finishing, after unlocking, so there is no nested/double lock. However, I did not explicitly test for nested locking as part of #1092.

skliper (Contributor, Author) commented Mar 30, 2021

Closed by #1073

@skliper skliper closed this as completed Mar 30, 2021