Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bluetooth: Deadlock with TX of ACL data and HCI commands (command blocked by data) #25917

Closed
cvinayak opened this issue Jun 3, 2020 · 18 comments · Fixed by #52536
Closed

Bluetooth: Deadlock with TX of ACL data and HCI commands (command blocked by data) #25917

cvinayak opened this issue Jun 3, 2020 · 18 comments · Fixed by #52536
Assignees
Labels
area: Bluetooth Host area: Bluetooth bug The issue is a bug, or the PR is fixing a bug known issue Known issue priority: low Low impact/importance bug
Milestone

Comments

@cvinayak
Copy link
Contributor

cvinayak commented Jun 3, 2020

Describe the bug
Performing GATT write commands from a peripheral and a central as fast as possible from the main loop while the peer central is performing security procedure to bond, causes the following assertion fail on both the peripheral and central device

In peripheral:

*** Booting Zephyr OS build v2.3.0-rc1-368-g8fecd4b32d9d  ***
Bluetooth initialized
[00:00:00.007,507] <inf> fs_nvs: 6 Sectors of 4096 bytes
[00:00:00.007,507] <inf> fs_nvs: alloc wra: 0, ff0
[00:00:00.007,507] <inf> fs_nvs: data wra: 0, 0
[00:00:00.008,941] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[00:00:00.008,941] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.008,941] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 2.3 Build 0
[00:00:00.009,216] <inf> bt_hci_core: No ID address. App must call settings_load()
[00:00:00.031,829] <inf> bt_hci_core: Identity: c1:ab:66:13:11:2e (random)
[00:00:00.031,860] <inf> bt_hci_core: HCI: version 5.2 (0x0b) revision 0x0000, manufacturer 0x05f1
[00:00:00.031,860] <inf> bt_hci_core: LMP: version 5.2 (0x0b) subver 0xffff
Advertising successfully started
Connected: ea:15:23:d0:ac:03 (random)
MTU: 23
Exchange pending...Write: count= 0, len= 0, rate= 0 bps.
MTU exchange successful (247)
ASSERTION FAIL [err == 0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/host/hci_core.c:370
        k_sem_take failed with err -11
[00:00:27.460,693] <err> os: r0/a1:  0x00000003  r1/a2:  0x00000001  r2/a3:  0x00000001
[00:00:27.460,693] <err> os: r3/a4:  0x00024c28 r12/ip:  0x00000000 r14/lr:  0x00007a01
[00:00:27.460,693] <err> os:  xpsr:  0x61000000
[00:00:27.460,723] <err> os: Faulting instruction address (r15/pc): 0x00007a0c
[00:00:27.460,723] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:00:27.460,723] <err> os: Current thread: 0x200014f0 (unknown)
[00:00:27.807,037] <err> os: Halting system

In central_hr:

*** Booting Zephyr OS build v2.3.0-rc1-368-g8fecd4b32d9d  ***
Bluetooth initialized
Scanning successfully started
[DEVICE]: 22:52:48:f8:a5:91 (random), AD evt type 3, AD data len 31, RSSI -45
[DEVICE]: 63:47:03:75:bc:8f (random), AD evt type 0, AD data len 29, RSSI -41
[AD]: 1 data_len 1
[AD]: 3 data_len 6
Connected: 63:47:03:75:bc:8f (random)
MTU: 23
Exchange pending...[00:00:00.006,042] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)MTU exchange successful (247)

[00:00:00.006,042] <inf> bt_hci_core: HW Variant: nRF52x (0x0002)
[00:00:00.006,042] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 2.3 Build 0
[00:00:00.006,683] <inf> bt_hci_core: Identity: ea:15:23:d0:ac:03 (random)
[00:00:00.006,683] <inf> bt_hci_core: HCI: version 5.2 (0x0b) revision 0x0000, manufacturer 0x05f1
[00:00:00.006,683] <inf> bt_hci_core: LMP: version 5.2 (0x0b) subver 0xffff
[00:00:00.688,507] <err> bt_conn: Unable to allocate TX context
ASSERTION FAIL [err == 0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/host/hci_core.c:370
        k_sem_take failed with err -11
[00:00:11.095,764] <err> os: r0/a1:  0x00000003  r1/a2:  0x00000001  r2/a3:  0x00000001
[00:00:11.095,764] <err> os: r3/a4:  0x0002142a r12/ip:  0x00000000 r14/lr:  0x0000666d
[00:00:11.095,764] <err> os:  xpsr:  0x61000000
[00:00:11.095,764] <err> os: Faulting instruction address (r15/pc): 0x00006678
[00:00:11.095,794] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:00:11.095,794] <err> os: Current thread: 0x2000117c (unknown)
[00:00:11.702,239] <err> os: Halting system

To Reproduce
Use the branch: cvinayak@a228956

Steps to reproduce the behavior:
Build and flash peripheral

  1. mkdir -p build/peripheral; cd build/peripheral
  2. cmake -GNinja -DBOARD=nrf52dk_nrf52832 ../../samples/bluetooth/peripheral
  3. ninja
  4. ninja flash

Build and flash central_hr
5. mkdir -p build/central_hr; cd build/central_hr
6. cmake -GNinja -DBOARD=nrf52dk_nrf52832 ../../samples/bluetooth/central_hr
7. ninja
8. ninja flash

  1. Open two terminals, minicom -D /dev/ttyACMx
  2. Observe both peripheral and central_hr fail after connection

Expected behavior
peripheral and central_hr should connect, transfer write commands, perform SMP pairing, be encrypted and continue to transder write commands.

Impact
showstopper

Environment (please complete the following information):

  • OS: Linux
  • Toolchain: Zephyr SDK 0.11.2
  • Commit SHA or Version used: cvinayak@a228956
@cvinayak cvinayak added bug The issue is a bug, or the PR is fixing a bug area: Bluetooth labels Jun 3, 2020
@joerchan
Copy link
Contributor

joerchan commented Jun 5, 2020

Anylsis:
Symmetrical deadlock in both devices in the connection between TX and RX thread.
TX thread has filled up the all the buffers in the controller and has blocked waiting for num complete events.
Controller has filled up all it's receive buffers and the RX thread is delivering ACL packets as fast as it can.
The RX thread receives an HCI event (phy update complete) and blocks on sending an HCI command waiting for the TX thread to send the command.
The TX thread will not be able to proceed until a num complete arrives on both sides.
The RX thread will not be able to proceed until TX thread is unblocked on both sides.
Receive buffer is full on both sides so no new packets can be transmitted, meaning no new completes will happen.

@joerchan joerchan added the priority: medium Medium impact/importance bug label Jun 5, 2020
@github-actions github-actions bot added the Stale label Aug 5, 2020
@joerchan joerchan removed the Stale label Aug 5, 2020
@carlescufi carlescufi changed the title Bluetooth: Assertion Fail during write commands and security procedure Bluetooth: Deadlock with TX of ACL data and HCI commands (command blocked by data) Oct 1, 2020
@github-actions github-actions bot added the Stale label Dec 1, 2020
@cvinayak cvinayak removed the Stale label Dec 1, 2020
@github-actions github-actions bot added the Stale label Mar 16, 2021
@cvinayak cvinayak removed the Stale label Mar 16, 2021
@galak galak added this to the v2.6.0 milestone May 11, 2021
@galak
Copy link
Collaborator

galak commented May 19, 2021

is this fixed?

@joerchan
Copy link
Contributor

@galak No.

@galak galak removed this from the v2.6.0 milestone May 25, 2021
@galak galak added priority: low Low impact/importance bug known issue Known issue and removed priority: medium Medium impact/importance bug labels May 25, 2021
@xudongzheng
Copy link
Contributor

xudongzheng commented May 30, 2021

I ran into this issue and my workaround was to do the sending from a workqueue thread with a lower priority.

@joerchan
Copy link
Contributor

@xudongzheng Could you elaborate on your workaround? What exactly did you move to the workqueue? Sending of data, or sending of the command?

@xudongzheng
Copy link
Contributor

xudongzheng commented May 31, 2021

@joerchan

I was occasionally getting the "Unable to allocate TX context" error when calling bt_gatt_notify() from a work running on the system workqueue. However moving my bt_gatt_notify() call to a work running on my custom workqueue (with a lower priority) did not give the same "Unable to allocate TX context" error.

I haven't looked into the details of this specific bug but it's possible that https://github.com/cvinayak/zephyr/blob/a2289565363a210e33b8956eac46df55e19a790a/samples/bluetooth/peripheral/src/main.c#L476 or one of the other calls is running into the same issue.

@joerchan
Copy link
Contributor

joerchan commented Jun 1, 2021

@xudongzheng That behavior is documented here: https://github.com/zephyrproject-rtos/zephyr/blob/main/include/bluetooth/gatt.h#L927L930

That is however not the same as is happening in this case, you haven't encountered this issue.

@jefffhaynes
Copy link

jefffhaynes commented Jul 10, 2021

I'm seeing the same thing when I spam bt_gatt_notify. This seems like a dangerous bug since there's no way to really know if you'll exceed the threshold for overflow. Much better would be to simply return an error that could optionally be ignored for non-critical notifications. @galak can you comment on why you feel this is low priority?

@LingaoM
Copy link
Collaborator

LingaoM commented Aug 5, 2021

The RX thread receives an HCI event (phy update complete) and blocks on sending an HCI command waiting for the TX thread to send the command.

I can't understand this, can you point out the relevant code?

@LingaoM
Copy link
Collaborator

LingaoM commented Aug 6, 2021

@joerchan Why we hci_le_read_max_data_len not use bt_hci_cmd_send to avoiding race condition ?

@carlescufi
Copy link
Member

@github-actions
Copy link

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

@klapro
Copy link

klapro commented May 18, 2023

Is it still worked on ? Its still having the same issue.. Im on Zephyr 3.2.0

@jori-nordic
Copy link
Contributor

@klapro this is supposed to be fixed. Could you open a new bug report with all the relevant details (how to reproduce etc) ? Maybe it's something different.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Bluetooth Host area: Bluetooth bug The issue is a bug, or the PR is fixing a bug known issue Known issue priority: low Low impact/importance bug
Projects
None yet