
Fix race condition during CRMP sendMessage #6441

Merged · 3 commits · May 6, 2021
Conversation

@yufengwangca (Contributor) commented May 4, 2021

Problem

Currently, Cirque is flaky with random crashes when the IM and Echo tests are enabled. The crash log shows a race condition during CRMP sendMessage.

Today CRMP has to send the message first and cache it afterwards, because the packetHeader is encoded at the TransportLayer during send. If the ack response arrives before the sent message has been cached, we try to remove a non-existent buffer from the retransTable and crash.

With #6333, packetHeader encoding and decoding moved to SecureSessionMgr, so we can cache the outgoing message first and send it afterwards, preventing this race condition.
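A minimal sketch of the ordering change; the helper names (AddToRetransTable, SendToTransport, RemoveFromRetransTable) and types are hypothetical stand-ins, not the actual CRMP code:

```cpp
// Illustrative stand-ins only; the real CRMP code uses different types and names.
struct Message { int id; };
static bool AddToRetransTable(const Message &)      { return true; } // cache for retransmission
static void RemoveFromRetransTable(const Message &) {}               // drop the cached entry
static bool SendToTransport(const Message &)        { return true; } // hand off to the transport

// Before this PR: send first, cache afterwards. If the ack is processed
// between the two calls, the retransTable lookup fails and we crash.
bool SendReliablyOld(const Message & msg)
{
    if (!SendToTransport(msg)) // peer may ack as soon as this returns
        return false;
    return AddToRetransTable(msg); // too late if the ack was already handled
}

// After this PR: cache the already-encoded message first, then send, so an
// ack arriving right after the send always finds a matching table entry.
bool SendReliablyNew(const Message & msg)
{
    if (!AddToRetransTable(msg)) // entry exists before any ack can arrive
        return false;
    if (!SendToTransport(msg))
    {
        RemoveFromRetransTable(msg); // roll back on send failure
        return false;
    }
    return true;
}
```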

Summary of Changes

  1. Fix race condition during CRMP sendMessage
  2. Enable EchoTest in Cirque
  3. Enable InteractionModel Test in Cirque

Fixes #6155

@yufengwangca marked this pull request as draft May 4, 2021 05:22
@boring-cyborg bot added the scripts label May 4, 2021
github-actions bot commented May 4, 2021

Size increase report for "nrfconnect-example-build" from 17be4af

File               Section         File (bytes)  VM (bytes)
chip-lock.elf      device_handles             8           8
chip-lock.elf      rodata                   -88         -92
chip-lock.elf      text                    -280        -280
chip-shell.elf     rodata                   -64         -60
chip-shell.elf     text                    -192        -192
chip-lighting.elf  device_handles             8           8
chip-lighting.elf  rodata                   -88         -92
chip-lighting.elf  text                    -280        -280
Full report output
BLOAT REPORT

Files found only in the build output:
    report.csv

Comparing ./master_artifact/chip-lock.elf and ./pull_artifact/chip-lock.elf:

sections,vmsize,filesize
.debug_aranges,0,8
.debug_frame,0,8
device_handles,8,8
.shstrtab,0,-1
.debug_ranges,0,-48
.symtab,0,-80
rodata,-92,-88
.strtab,0,-219
.debug_line,0,-244
.debug_abbrev,0,-268
text,-280,-280
.debug_str,0,-328
.debug_loc,0,-1604
.debug_info,0,-8820

Comparing ./master_artifact/chip-shell.elf and ./pull_artifact/chip-shell.elf:

sections,vmsize,filesize
.debug_frame,0,20
.debug_aranges,0,8
.shstrtab,0,-1
.debug_ranges,0,-16
.symtab,0,-48
rodata,-60,-64
.debug_line,0,-122
.strtab,0,-159
.debug_abbrev,0,-190
text,-192,-192
.debug_str,0,-275
.debug_loc,0,-1181
.debug_info,0,-5612

Comparing ./master_artifact/chip-lighting.elf and ./pull_artifact/chip-lighting.elf:

sections,vmsize,filesize
.debug_aranges,0,8
.debug_frame,0,8
device_handles,8,8
.shstrtab,0,-1
.debug_ranges,0,-48
.symtab,0,-80
rodata,-92,-88
.debug_abbrev,0,-118
.debug_line,0,-172
.strtab,0,-219
text,-280,-280
.debug_str,0,-328
.debug_loc,0,-1608
.debug_info,0,-1718


github-actions bot commented May 4, 2021

Size increase report for "esp32-example-build" from 17be4af

File                       Section        File (bytes)  VM (bytes)
chip-all-clusters-app.elf  .flash.rodata           -96         -96
chip-all-clusters-app.elf  .flash.text            -268        -268
Full report output
BLOAT REPORT

Files found only in the build output:
    report.csv

Comparing ./master_artifact/chip-all-clusters-app.elf and ./pull_artifact/chip-all-clusters-app.elf:

sections,vmsize,filesize
.xt.prop._ZN4chip9Transport5TupleIJNS0_3UDPES2_NS0_3BLEILj1EEEEE11SendMessageERKNS0_11PeerAddressENS_6System18PacketBufferHandleE,0,160
[Unmapped],0,96
.xt.lit._ZN4chip9Transport5TupleIJNS0_3UDPES2_NS0_3BLEILj1EEEEE11SendMessageERKNS0_11PeerAddressENS_6System18PacketBufferHandleE,0,88
.xt.prop._ZN4chip9Transport19PeerConnectionStateC5Ev,0,40
.xt.prop._ZTVN4chip24LifetimePersistedCounterE,0,-1
.symtab,0,-16
.debug_ranges,0,-24
.xt.prop._ZNK4chip9Transport11PeerAddress8ToStringEPcj,0,-40
.shstrtab,0,-41
.xt.lit._ZN4chip9Transport5TupleIJNS0_3UDPES2_NS0_3BLEILj1EEEEE11SendMessageERKNS_12PacketHeaderERKNS0_11PeerAddressENS_6System18PacketBufferHandleE,0,-88
.flash.rodata,-96,-96
.xt.prop._ZN4chip9Transport5TupleIJNS0_3UDPES2_NS0_3BLEILj1EEEEE11SendMessageERKNS_12PacketHeaderERKNS0_11PeerAddressENS_6System18PacketBufferHandleE,0,-160
.flash.text,-268,-268
.debug_str,0,-348
.debug_abbrev,0,-359
.strtab,0,-363
.debug_line,0,-579
.debug_loc,0,-1213
.debug_info,0,-8728

Comparing ./master_artifact/chip-pigweed-app.elf and ./pull_artifact/chip-pigweed-app.elf:

sections,vmsize,filesize


@woody-apple marked this pull request as ready for review May 4, 2021 19:35
@pan-apple (Contributor) commented:

@yufengwangca is this PR dependent on #6333 merging first?

@yufengwangca (Contributor, Author) replied:

> @yufengwangca is this PR dependent on #6333 merging first?

Yes, let's hold it until #6333 lands.

@yufengwangca marked this pull request as draft May 4, 2021 22:02
@andy31415 (Contributor) left a comment:

Could you describe the sequence of operations here? I have a hard time understanding the fix:

I believe our current code assumes single-threading (one main loop that handles all processing). In that case, the placement of where we save the retain slot does not seem to matter.

@andy31415 (Contributor) commented:

Note that I agree that not trying to do the caching in two sections of the code is a good change. I'm just not clear how this fixes any form of race condition.

Also, 'encryptedMsg' should be deleted as a variable - it has no use.

@erjiaqing (Contributor) replied:

> Note that I agree that not trying to do the caching in two sections of the code is a good change. I'm just not clear how this fixes any form of race condition.
>
> Also, 'encryptedMsg' should be deleted as a variable - it has no use.

In very rare cases, the ack is received before SendMessage returns, so the retrans table entry is not ready yet and it crashes.

This should not happen on real devices, but it is somehow very frequent in the GitHub tests.

And yes, we should do the real send / receive work in one thread / task, and all operations on CHIP objects should be guarded by a mutex, but we need to figure out the boundary of the CHIP SDK first; this is tracked in #6251.

@erjiaqing (Contributor) left a comment:

Checked out this PR locally; the Cirque / Darwin tests no longer seem to crash.

@msandstedt (Contributor) replied:

> Note that I agree that not trying to do the caching in two sections of the code is a good change. I'm just not clear how this fixes any form of race condition.
>
> Also, 'encryptedMsg' should be deleted as a variable - it has no use.

> In very rare cases, the ack is received before SendMessage returns, so the retrans table entry is not ready yet and it crashes.
>
> This should not happen on real devices, but it is somehow very frequent in the GitHub tests.
>
> And yes, we should do the real send / receive work in one thread / task, and all operations on CHIP objects should be guarded by a mutex, but we need to figure out the boundary of the CHIP SDK first; this is tracked in #6251.

I took it for granted that for this to occur in the first place, send must be happening from a different thread context than CHIP UDP comms. This implies that send must be mutex-protected.

That mutex protection then also acts as a memory barrier: you cache first, lock, send, unlock. This guarantees that caching will have occurred before unlock.

But this was just an assumption. I was not able to trace the code to figure out where this locking would occur. I assume it has to be within, or at least dependent upon, platform-specific code.

TL;DR: yes, the send operation needs to be mutex-protected if send can ever occur from a different thread context. Again, I assumed we had such protection.
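A rough sketch of the kind of locking being discussed, with a hypothetical lock and helper names rather than the actual SDK code: caching and sending under the same lock that the receive path takes guarantees the ack handler always sees the cached entry.

```cpp
#include <mutex>

// Illustrative stand-ins only.
struct Message { int id; };
static void AddToRetransTable(const Message &)      {} // cache for retransmission
static void RemoveFromRetransTable(const Message &) {} // drop the cached entry
static void SendToTransport(const Message &)        {} // hand off to the transport

static std::mutex gStackLock; // stand-in for a single stack-wide lock

// Hypothetical send path called from the application thread.
void AppSend(const Message & msg)
{
    std::lock_guard<std::mutex> lock(gStackLock);
    AddToRetransTable(msg); // cached while holding the lock...
    SendToTransport(msg);   // ...so the entry is published before the lock is released
}

// Hypothetical receive path running on the CHIP main thread.
void OnAckReceived(const Message & ack)
{
    std::lock_guard<std::mutex> lock(gStackLock);
    // This can only run once AppSend has released the lock, so the
    // retransmission entry is guaranteed to be present and visible here.
    RemoveFromRetransTable(ack);
}
```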

@yufengwangca (Contributor, Author) replied:

> Could you describe the sequence of operations here? I have a hard time understanding the fix:
>
> I believe our current code assumes single-threading (one main loop that handles all processing). In that case, the placement of where we save the retain slot does not seem to matter.

Currently, sendMessage is triggered by the app directly from the app's own thread, while receiveMessage is handled within the CHIP main thread. In the long term, I think we should enforce a single-threaded model (this needs some refactoring on the app side); that would avoid a lot of race issues.

@yufengwangca (Contributor, Author) replied:

> Note that I agree that not trying to do the caching in two sections of the code is a good change. I'm just not clear how this fixes any form of race condition.
>
> Also, 'encryptedMsg' should be deleted as a variable - it has no use.

> In very rare cases, the ack is received before SendMessage returns, so the retrans table entry is not ready yet and it crashes.
>
> This should not happen on real devices, but it is somehow very frequent in the GitHub tests.
>
> And yes, we should do the real send / receive work in one thread / task, and all operations on CHIP objects should be guarded by a mutex, but we need to figure out the boundary of the CHIP SDK first; this is tracked in #6251.

> I took it for granted that for this to occur in the first place, send must be happening from a different thread context than CHIP UDP comms. This implies that send must be mutex-protected.
>
> That mutex protection then also acts as a memory barrier: you cache first, lock, send, unlock. This guarantees that caching will have occurred before unlock.
>
> But this was just an assumption. I was not able to trace the code to figure out where this locking would occur. I assume it has to be within, or at least dependent upon, platform-specific code.
>
> TL;DR: yes, the send operation needs to be mutex-protected if send can ever occur from a different thread context. Again, I assumed we had such protection.

The current CHIP messaging stack is largely inherited from Weave, which was designed with a single-threaded model in mind. We only have one big lock, PlatformManager::LockChipStack, which locks the whole CHIP stack to avoid contention from other application threads.

I agree we need mutex protection for sendMessage, since right now it is triggered directly from the app thread. That needs to be handled in a separate PR, and we still need to evaluate whether we should enforce a single-threaded model or use a separate mutex.
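For reference, guarding a cross-thread call with that single big lock looks roughly like this from an application thread; DoChipWork() is a placeholder for whatever SDK operation the app needs to perform, not a real API:

```cpp
#include <platform/CHIPDeviceLayer.h>

// Placeholder for any operation that touches CHIP stack objects
// (e.g. sending a message on an exchange).
void DoChipWork();

// Called from an application thread, not the CHIP event loop.
void CallIntoChipFromAppThread()
{
    chip::DeviceLayer::PlatformMgr().LockChipStack();
    DoChipWork(); // the whole CHIP stack is held locked here
    chip::DeviceLayer::PlatformMgr().UnlockChipStack();
}
```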

@yufengwangca marked this pull request as ready for review May 5, 2021 17:11
@yufengwangca requested a review from andy31415 May 5, 2021 17:11
@msandstedt (Contributor) commented May 5, 2021

> I took it for granted that for this to occur in the first place, send must be happening from a different thread context than CHIP UDP comms. This implies that send must be mutex-protected.
>
> That mutex protection then also acts as a memory barrier: you cache first, lock, send, unlock. This guarantees that caching will have occurred before unlock.
>
> But this was just an assumption. I was not able to trace the code to figure out where this locking would occur. I assume it has to be within, or at least dependent upon, platform-specific code.
>
> TL;DR: yes, the send operation needs to be mutex-protected if send can ever occur from a different thread context. Again, I assumed we had such protection.

> The current CHIP messaging stack is largely inherited from Weave, which was designed with a single-threaded model in mind. We only have one big lock, PlatformManager::LockChipStack, which locks the whole CHIP stack to avoid contention from other application threads.
>
> I agree we need mutex protection for sendMessage, since right now it is triggered directly from the app thread. That needs to be handled in a separate PR, and we still need to evaluate whether we should enforce a single-threaded model or use a separate mutex.

Thanks for considering my comments.

It's my personal belief that code shouldn't be written to assume a single-threaded environment, and that code underlying public interfaces should protect itself to accommodate cross-thread calls. But it's OK if the SDK isn't written this way; we just need to know how to safely call into it.

Based on the above, is it safe to simply wrap cross-thread calls into the SDK with PlatformManager::Lock/UnlockChipStack? That's an easy rule to follow. However, I'm having a tough time figuring out whether our current examples do this. It appears things like chip-tool do not.

Edit: Also, things like #6286 move us in a direction where it is not tenable for time-sensitive apps to execute CHIP from their existing event loops. So it's important to know how to safely call into the stack across a thread boundary.

@bzbarsky-apple (Contributor) left a comment:

We really need to get our threading story straight, but if this unblocks us by preventing flakiness in CI, let's do it. We just need to make sure we do the real fix of locking or serializing on a single thread appropriately as well...

@andy31415 (Contributor) left a comment:

Approving - my guess is that this passes a pointer directly into the retransmission table, hence the race.

However, the real bug seems to be that we are using separate threads on a stack that strongly assumes single threading. This code is not yet safe - this is a temporary workaround rather than a fix.

@mrjerryjohns (Contributor) commented:

I've filed an issue here #6841 to get this sorted out. Once we do, can we please get this documented?

Successfully merging this pull request may close these issues:

[Cirque]chip-im-initiat" received signal SIGSEGV, Segmentation fault.