Fix race condition during CRMP sendMessage #6441
Conversation
@yufengwangca is this PR dependent on #6333 merging first?
Yes, let's hold it until #6333 lands.
Could you describe the sequence of operations here? I have a hard time understanding the fix:
I believe our current code assumes single-threading (one main loop that handles all processing). In that case, the placement of where we save the retain slot does not seem to matter.
Note that I agree that not trying to do the caching part in two sections of the code is a good change; I'm just not clear on how this fixes any form of race condition. Also, the variable 'encryptedMsg' should be deleted, as it has no use.
In very rare cases, the ack is received before SendMessage returns, but the retransmission table is not yet ready, so it crashes. This should not happen on real devices, but it is somehow very frequent in the GitHub tests. And yes, we should do the real send/receive work in one thread/task, and all operations on CHIP objects should be guarded by a mutex, but we need to figure out the boundary of the CHIP SDK first; this is tracked in #6251.
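A minimal, compilable sketch of the interleaving being described, using hypothetical names (RetransTable, Add, OnAck) rather than the real CHIP code; running it aborts at the assert, mirroring the CI crash:

```cpp
// Hypothetical sketch of the race; not the actual CHIP implementation.
#include <cassert>
#include <map>

struct RetransTable
{
    std::map<int, int> entries; // msgId -> cached-entry stand-in

    void Add(int msgId) { entries[msgId] = 1; }

    // The ack handler assumed the entry must already exist; when the ack
    // won the race against SendMessage's caching step, it did not.
    void OnAck(int msgId)
    {
        assert(entries.count(msgId) == 1); // fires in the bad interleaving
        entries.erase(msgId);
    }
};

int main()
{
    RetransTable rt;
    const int msgId = 42;

    // Old ordering, with the unlucky interleaving spelled out:
    // (1) app thread puts the packet on the wire (not modeled here),
    // (2) CHIP main thread processes the peer's ack,
    // (3) app thread finally caches the entry, too late.
    rt.OnAck(msgId); // (2) happens before (3): the assert aborts
    rt.Add(msgId);   // (3)
    return 0;
}
```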
I checked out this PR locally; the cirque / darwin tests no longer seem to crash.
I took it for granted that, for this to occur in the first place, send must be happening from a different thread context than the CHIP UDP comms. This implies that send must be mutex-protected. That mutex protection then also acts as a memory barrier: you cache first, lock, send, unlock. This guarantees that caching will have occurred before the unlock.

But this was just an assumption. I was not able to trace the code to figure out where this locking would occur. I assume it has to be within, or at least dependent upon, platform-specific code.

TL;DR: yes, the send operation needs to be mutex-protected if send can ever occur from a different thread context. Again, I assumed we had such protection.
Currently, sendMessage is triggered directly from the app within the app's own thread, and receiveMessage is handled within the CHIP main thread. In the long term, I think we should force a single-threaded mode (this needs some refactoring on the app side). That would avoid a lot of race issues.
The current CHIP messaging stack is largely inherited from Weave, which was designed with a single-thread model in mind; we only have one big lock, PlatformManager::LockChipStack, which locks the whole CHIP stack to avoid contention from other application threads. I agree we need mutex protection for sendMessage since it is triggered directly from the app thread right now. We need to handle that in a separate PR, and we still need to evaluate whether we should force a single-thread model or use a separate mutex.
Thanks for considering my comments. It's my personal belief that code shouldn't be written to assume a single-threaded environment, and that code underlying public interfaces should protect itself to accommodate cross-thread calls. But it's OK if the SDK isn't written this way; we just need to know how to safely call into it.

Based on the above, is it safe to simply wrap cross-thread calls into the SDK with PlatformManager::Lock/UnlockChipStack? That's an easy rule to follow (see the sketch below). However, I'm having a tough time figuring out whether our current examples do this. It appears things like chip-tool do not.

Edit: Also, things like #6286 move us in a direction where it is not tenable for time-sensitive apps to execute CHIP from their existing event loops. So it's important to know how to safely call into the stack across a thread boundary.
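A hedged sketch of what that rule could look like in practice; it assumes PlatformMgr().LockChipStack() / UnlockChipStack() are the primitives in question, and the RAII guard and helper here are invented for illustration, not SDK types:

```cpp
#include <platform/PlatformManager.h>

// Invented RAII guard (not an SDK type) around the big stack lock.
struct ScopedChipStackLock
{
    ScopedChipStackLock() { chip::DeviceLayer::PlatformMgr().LockChipStack(); }
    ~ScopedChipStackLock() { chip::DeviceLayer::PlatformMgr().UnlockChipStack(); }
};

// Any call into the SDK made from an application thread would be wrapped
// like this; the callable stands in for e.g. a SendMessage call.
template <typename Fn>
void CallIntoChipStack(Fn && fn)
{
    ScopedChipStackLock lock; // hold the lock across the whole call
    fn();
}
```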
We really need to get our threading story straight, but if this unblocks us by preventing flakiness in CI, let's do it. We just need to make sure we do the real fix of locking or serializing on a single thread appropriately as well...
Approving. My guess is that this passes a pointer directly into the retransmission table, hence the race.
However, the real bug seems to be that we are using separate threads on a stack that strongly assumes single-threading. This code is not yet safe; this is a temporary workaround rather than a fix.
I've filed issue #6841 to get this sorted out. Once we do, can we please get this documented?
Problem
Currently, the cirque tests are flaky, with random crashes when the IM and Echo tests are enabled. The crash log shows a race condition during CRMP sendMessage.
Today, CRMP has to send the message first and cache it afterwards, because the packetHeader is encoded by the transport layer during send. If the ack response is received before the sent message is cached, we try to remove a non-existent buffer from the retransTable and crash.
With #6333, packetHeader encoding and decoding moves into SecureSessionMgr, so we can cache the outgoing message first and send it afterwards, preventing this race condition (sketched below).
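A rough before/after sketch of that ordering, with hypothetical helpers (EncodePacketHeader, SendToTransport, retransTable); the real code in SecureSessionMgr and the reliable-messaging layer differs:

```cpp
#include <vector>

using Buffer = std::vector<unsigned char>;

// Hypothetical stand-ins for the real components.
struct RetransTable { void Add(const Buffer &) { /* cache for retransmission */ } };
static RetransTable retransTable;
static void EncodePacketHeader(Buffer &) { /* prepend the packetHeader */ }
static void SendToTransport(const Buffer &) { /* hand off to the transport */ }

// Before #6333: the transport encoded the packetHeader during send, so the
// buffer was only in its final form after SendToTransport. Caching had to
// come second, and the peer's ack could win the race against Add().
void SendMessageBefore6333(Buffer & buf)
{
    SendToTransport(buf);  // packetHeader encoded in here
    retransTable.Add(buf); // too late if the ack already arrived
}

// After #6333: SecureSessionMgr encodes the header up front, so the entry
// can be cached before any bytes hit the wire; by the time an ack can be
// observed, the retransTable entry already exists.
void SendMessageAfter6333(Buffer & buf)
{
    EncodePacketHeader(buf);
    retransTable.Add(buf);
    SendToTransport(buf);
}

int main()
{
    Buffer buf = { 0x01, 0x02 };
    SendMessageAfter6333(buf);
    return 0;
}
```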
Summary of Changes
Fixes #6155