
Fix race condition during CRMP sendMessage #6441

Merged · 3 commits · May 6, 2021
Conversation

@yufengwangca (Contributor) commented May 4, 2021

Problem

Currently, Cirque is flaky with random crashes when the IM and Echo tests are enabled. The crash log shows a race condition during CRMP sendMessage.

Today CRMP has to send the message first and cache it afterwards, because the packetHeader is encoded at the TransportLayer during send. If the ack response arrives before the sent message has been cached, we try to remove a non-existent buffer from the retransTable and crash.

With #6333, packetHeader encoding and decoding moved to SecureSessionMgr, so we can cache the outgoing message first and send it afterwards, preventing this race condition.
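A minimal sketch of the ordering change; the helper names (AddToRetransTable, SendToTransport, RemoveFromRetransTable) and types are hypothetical stand-ins, not the actual CRMP code:

```cpp
// Illustrative stand-ins only; the real CRMP code uses different types and names.
struct Message { int id; };
static bool AddToRetransTable(const Message &)      { return true; } // cache for retransmission
static void RemoveFromRetransTable(const Message &) {}               // drop the cached entry
static bool SendToTransport(const Message &)        { return true; } // hand off to the transport

// Before this PR: send first, cache afterwards. If the ack is processed
// between the two calls, the retransTable lookup fails and we crash.
bool SendReliablyOld(const Message & msg)
{
    if (!SendToTransport(msg)) // peer may ack as soon as this returns
        return false;
    return AddToRetransTable(msg); // too late if the ack was already handled
}

// After this PR: cache the already-encoded message first, then send, so an
// ack arriving right after the send always finds a matching table entry.
bool SendReliablyNew(const Message & msg)
{
    if (!AddToRetransTable(msg)) // entry exists before any ack can arrive
        return false;
    if (!SendToTransport(msg))
    {
        RemoveFromRetransTable(msg); // roll back on send failure
        return false;
    }
    return true;
}
```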

Summary of Changes

  1. Fix race condition during CRMP sendMessage
  2. Enable EchoTest in Cirque
  3. Enable InteractionModel Test in Cirque

Fixes #6155

@yufengwangca marked this pull request as draft May 4, 2021 05:22
@boring-cyborg bot added the scripts label May 4, 2021
github-actions bot commented May 4, 2021

Size increase report for "nrfconnect-example-build" from 17be4af

File               Section         File (bytes)  VM (bytes)
chip-lock.elf      device_handles             8           8
chip-lock.elf      rodata                   -88         -92
chip-lock.elf      text                    -280        -280
chip-shell.elf     rodata                   -64         -60
chip-shell.elf     text                    -192        -192
chip-lighting.elf  device_handles             8           8
chip-lighting.elf  rodata                   -88         -92
chip-lighting.elf  text                    -280        -280
Full report output
BLOAT REPORT

Files found only in the build output:
    report.csv

Comparing ./master_artifact/chip-lock.elf and ./pull_artifact/chip-lock.elf:

sections,vmsize,filesize
.debug_aranges,0,8
.debug_frame,0,8
device_handles,8,8
.shstrtab,0,-1
.debug_ranges,0,-48
.symtab,0,-80
rodata,-92,-88
.strtab,0,-219
.debug_line,0,-244
.debug_abbrev,0,-268
text,-280,-280
.debug_str,0,-328
.debug_loc,0,-1604
.debug_info,0,-8820

Comparing ./master_artifact/chip-shell.elf and ./pull_artifact/chip-shell.elf:

sections,vmsize,filesize
.debug_frame,0,20
.debug_aranges,0,8
.shstrtab,0,-1
.debug_ranges,0,-16
.symtab,0,-48
rodata,-60,-64
.debug_line,0,-122
.strtab,0,-159
.debug_abbrev,0,-190
text,-192,-192
.debug_str,0,-275
.debug_loc,0,-1181
.debug_info,0,-5612

Comparing ./master_artifact/chip-lighting.elf and ./pull_artifact/chip-lighting.elf:

sections,vmsize,filesize
.debug_aranges,0,8
.debug_frame,0,8
device_handles,8,8
.shstrtab,0,-1
.debug_ranges,0,-48
.symtab,0,-80
rodata,-92,-88
.debug_abbrev,0,-118
.debug_line,0,-172
.strtab,0,-219
text,-280,-280
.debug_str,0,-328
.debug_loc,0,-1608
.debug_info,0,-1718


github-actions bot commented May 4, 2021

Size increase report for "esp32-example-build" from 17be4af

File                       Section        File (bytes)  VM (bytes)
chip-all-clusters-app.elf  .flash.rodata           -96         -96
chip-all-clusters-app.elf  .flash.text            -268        -268
Full report output
BLOAT REPORT

Files found only in the build output:
    report.csv

Comparing ./master_artifact/chip-all-clusters-app.elf and ./pull_artifact/chip-all-clusters-app.elf:

sections,vmsize,filesize
.xt.prop._ZN4chip9Transport5TupleIJNS0_3UDPES2_NS0_3BLEILj1EEEEE11SendMessageERKNS0_11PeerAddressENS_6System18PacketBufferHandleE,0,160
[Unmapped],0,96
.xt.lit._ZN4chip9Transport5TupleIJNS0_3UDPES2_NS0_3BLEILj1EEEEE11SendMessageERKNS0_11PeerAddressENS_6System18PacketBufferHandleE,0,88
.xt.prop._ZN4chip9Transport19PeerConnectionStateC5Ev,0,40
.xt.prop._ZTVN4chip24LifetimePersistedCounterE,0,-1
.symtab,0,-16
.debug_ranges,0,-24
.xt.prop._ZNK4chip9Transport11PeerAddress8ToStringEPcj,0,-40
.shstrtab,0,-41
.xt.lit._ZN4chip9Transport5TupleIJNS0_3UDPES2_NS0_3BLEILj1EEEEE11SendMessageERKNS_12PacketHeaderERKNS0_11PeerAddressENS_6System18PacketBufferHandleE,0,-88
.flash.rodata,-96,-96
.xt.prop._ZN4chip9Transport5TupleIJNS0_3UDPES2_NS0_3BLEILj1EEEEE11SendMessageERKNS_12PacketHeaderERKNS0_11PeerAddressENS_6System18PacketBufferHandleE,0,-160
.flash.text,-268,-268
.debug_str,0,-348
.debug_abbrev,0,-359
.strtab,0,-363
.debug_line,0,-579
.debug_loc,0,-1213
.debug_info,0,-8728

Comparing ./master_artifact/chip-pigweed-app.elf and ./pull_artifact/chip-pigweed-app.elf:

sections,vmsize,filesize


@woody-apple marked this pull request as ready for review May 4, 2021 19:35
@pan-apple (Contributor) commented:

@yufengwangca is this PR dependent on #6333 merging first?

@yufengwangca (Contributor, Author) replied:

> @yufengwangca is this PR dependent on #6333 merging first?

Yes, let's hold it until #6333 lands.

@yufengwangca marked this pull request as draft May 4, 2021 22:02
@andy31415 (Contributor) left a comment:

Could you describe the sequence of operations here? I have a hard time understanding the fix:

I believe our current code assumes single-threading (one main loop that handles all processing). In that case, the placement of where we save the retain slot does not seem to matter.

@andy31415 (Contributor) commented:

Note that I agree that not trying to do the caching in two sections of the code is a good change. I'm just not clear how this fixes any form of race condition.

Also, 'encryptedMsg' should be deleted as a variable - it has no use.

@erjiaqing (Contributor) replied:

> Note that I agree that not trying to do the caching in two sections of the code is a good change. I'm just not clear how this fixes any form of race condition.
>
> Also, 'encryptedMsg' should be deleted as a variable - it has no use.

In very rare cases, the ack is received before SendMessage returns, so the retrans table entry is not ready yet and it crashes.

This should not happen on real devices, but it is somehow very frequent in the GitHub tests.

And yes, we should do the real send / receive work in one thread / task, and all operations on CHIP objects should be guarded by a mutex, but we need to figure out the boundary of the CHIP SDK first; this is tracked in #6251.

@erjiaqing (Contributor) left a comment:

Checked out this PR locally; the Cirque / Darwin tests no longer seem to crash.

@msandstedt (Contributor) replied:

> Note that I agree that not trying to do the caching in two sections of the code is a good change. I'm just not clear how this fixes any form of race condition.
>
> Also, 'encryptedMsg' should be deleted as a variable - it has no use.

> In very rare cases, the ack is received before SendMessage returns, so the retrans table entry is not ready yet and it crashes.
>
> This should not happen on real devices, but it is somehow very frequent in the GitHub tests.
>
> And yes, we should do the real send / receive work in one thread / task, and all operations on CHIP objects should be guarded by a mutex, but we need to figure out the boundary of the CHIP SDK first; this is tracked in #6251.

I took it for granted that for this to occur in the first place, send must be happening from a different thread context than CHIP UDP comms. This implies that send must be mutex-protected.

That mutex protection then also acts as a memory barrier: you cache first, lock, send, unlock. This guarantees that caching will have occurred before unlock.

But this was just an assumption. I was not able to trace the code to figure out where this locking would occur. I assume it has to be within, or at least dependent upon, platform-specific code.

TL;DR: yes, the send operation needs to be mutex-protected if send can ever occur from a different thread context. Again, I assumed we had such protection.
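A rough sketch of the kind of locking being discussed, with a hypothetical lock and helper names rather than the actual SDK code: caching and sending under the same lock that the receive path takes guarantees the ack handler always sees the cached entry.

```cpp
#include <mutex>

// Illustrative stand-ins only.
struct Message { int id; };
static void AddToRetransTable(const Message &)      {} // cache for retransmission
static void RemoveFromRetransTable(const Message &) {} // drop the cached entry
static void SendToTransport(const Message &)        {} // hand off to the transport

static std::mutex gStackLock; // stand-in for a single stack-wide lock

// Hypothetical send path called from the application thread.
void AppSend(const Message & msg)
{
    std::lock_guard<std::mutex> lock(gStackLock);
    AddToRetransTable(msg); // cached while holding the lock...
    SendToTransport(msg);   // ...so the entry is published before the lock is released
}

// Hypothetical receive path running on the CHIP main thread.
void OnAckReceived(const Message & ack)
{
    std::lock_guard<std::mutex> lock(gStackLock);
    // This can only run once AppSend has released the lock, so the
    // retransmission entry is guaranteed to be present and visible here.
    RemoveFromRetransTable(ack);
}
```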

@yufengwangca (Contributor, Author) replied:

> Could you describe the sequence of operations here? I have a hard time understanding the fix:
>
> I believe our current code assumes single-threading (one main loop that handles all processing). In that case, the placement of where we save the retain slot does not seem to matter.

Currently, sendMessage is triggered by the app directly from the app's own thread, while receiveMessage is handled within the CHIP main thread. In the long term, I think we should enforce a single-threaded model (this needs some refactoring on the app side); that would avoid a lot of race issues.

@yufengwangca (Contributor, Author) replied:

> Note that I agree that not trying to do the caching in two sections of the code is a good change. I'm just not clear how this fixes any form of race condition.
>
> Also, 'encryptedMsg' should be deleted as a variable - it has no use.

> In very rare cases, the ack is received before SendMessage returns, so the retrans table entry is not ready yet and it crashes.
>
> This should not happen on real devices, but it is somehow very frequent in the GitHub tests.
>
> And yes, we should do the real send / receive work in one thread / task, and all operations on CHIP objects should be guarded by a mutex, but we need to figure out the boundary of the CHIP SDK first; this is tracked in #6251.

> I took it for granted that for this to occur in the first place, send must be happening from a different thread context than CHIP UDP comms. This implies that send must be mutex-protected.
>
> That mutex protection then also acts as a memory barrier: you cache first, lock, send, unlock. This guarantees that caching will have occurred before unlock.
>
> But this was just an assumption. I was not able to trace the code to figure out where this locking would occur. I assume it has to be within, or at least dependent upon, platform-specific code.
>
> TL;DR: yes, the send operation needs to be mutex-protected if send can ever occur from a different thread context. Again, I assumed we had such protection.

The current CHIP messaging stack is largely inherited from Weave, which was designed with a single-threaded model in mind. We only have one big lock, PlatformManager::LockChipStack, which locks the whole CHIP stack to avoid contention from other application threads.

I agree we need mutex protection for sendMessage, since right now it is triggered directly from the app thread. That needs to be handled in a separate PR, and we still need to evaluate whether we should enforce a single-threaded model or use a separate mutex.
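For reference, guarding a cross-thread call with that single big lock looks roughly like this from an application thread; DoChipWork() is a placeholder for whatever SDK operation the app needs to perform, not a real API:

```cpp
#include <platform/CHIPDeviceLayer.h>

// Placeholder for any operation that touches CHIP stack objects
// (e.g. sending a message on an exchange).
void DoChipWork();

// Called from an application thread, not the CHIP event loop.
void CallIntoChipFromAppThread()
{
    chip::DeviceLayer::PlatformMgr().LockChipStack();
    DoChipWork(); // the whole CHIP stack is held locked here
    chip::DeviceLayer::PlatformMgr().UnlockChipStack();
}
```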

@yufengwangca marked this pull request as ready for review May 5, 2021 17:11
@yufengwangca requested a review from andy31415 May 5, 2021 17:11
@msandstedt (Contributor) commented May 5, 2021

> I took it for granted that for this to occur in the first place, send must be happening from a different thread context than CHIP UDP comms. This implies that send must be mutex-protected.
>
> That mutex protection then also acts as a memory barrier: you cache first, lock, send, unlock. This guarantees that caching will have occurred before unlock.
>
> But this was just an assumption. I was not able to trace the code to figure out where this locking would occur. I assume it has to be within, or at least dependent upon, platform-specific code.
>
> TL;DR: yes, the send operation needs to be mutex-protected if send can ever occur from a different thread context. Again, I assumed we had such protection.

> The current CHIP messaging stack is largely inherited from Weave, which was designed with a single-threaded model in mind. We only have one big lock, PlatformManager::LockChipStack, which locks the whole CHIP stack to avoid contention from other application threads.
>
> I agree we need mutex protection for sendMessage, since right now it is triggered directly from the app thread. That needs to be handled in a separate PR, and we still need to evaluate whether we should enforce a single-threaded model or use a separate mutex.

Thanks for considering my comments.

It's my personal belief that code shouldn't be written to assume a single-threaded environment, and that code underlying public interfaces should protect itself to accommodate cross-thread calls. But it's OK if the SDK isn't written this way; we just need to know how to safely call into it.

Based on the above, is it safe to simply wrap cross-thread calls into the SDK with PlatformManager::Lock/UnlockChipStack? That's an easy rule to follow. However, I'm having a tough time figuring out whether our current examples do this. It appears things like chip-tool do not.

Edit: Also, things like #6286 move us in a direction where it is not tenable for time-sensitive apps to execute CHIP from their existing event loops. So it's important to know how to safely call into the stack across a thread boundary.

@bzbarsky-apple (Contributor) left a comment:

We really need to get our threading story straight, but if this unblocks us by preventing flakiness in CI, let's do it. We just need to make sure we do the real fix of locking or serializing on a single thread appropriately as well...

@andy31415 (Contributor) left a comment:

Approving - my guess is that this passes a pointer directly into the retransmission table, hence the race.

However, the real bug seems to be that we are using separate threads on a stack that strongly assumes single threading. This code is not yet safe - this is a temporary workaround rather than a fix.

@mrjerryjohns (Contributor) commented:

I've filed an issue here #6841 to get this sorted out. Once we do, can we please get this documented?

Successfully merging this pull request may close these issues:

[Cirque]chip-im-initiat" received signal SIGSEGV, Segmentation fault.