Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows BLE reconnection not working after update to v4.4.6 (IDFGH-11531) #12656

Closed
3 tasks done
ztefanjo opened this issue Nov 23, 2023 · 19 comments
Closed
3 tasks done
Assignees
Labels
Resolution: Done Issue is done internally Status: Done Issue is done internally Type: Bug bugs in IDF

Comments

@ztefanjo
Copy link

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v4.4.6

Espressif SoC revision.

ESP32-S3

Operating System used.

Windows

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

CMD

Development Kit.

Custom Board

Power Supply used.

Battery

What is the expected behavior?

Using the ESP as BLE peripheral and windows as BLE central:

  • Our proprietary Windows tools is able to disconnect a Windows BLE connection and the status of the device remains "Paired" in "Bluetooth & other devices" Setting in Windows until next connection is triggered by the tool.
  • It is always possible to connect with our Windows tool.

This was working with v4.4.5.

In some cases, Windows anyway kept the connection open (as windows very much like to keep connections open), but it was still possible to reconnect with our Windows tool.

What is the actual behavior?

Some different strange behaviors, see steps to reproduce for more details:

  • After running our proprietary windows tool, the BLE connection remains Connected (the tool fails to disconnect).
  • When trying to reconnect, it fails directly in Windows without any logs on the ESP side indicating any communication at all.
    or
  • Windows is able to connect but will itself disconnect directly

Steps to reproduce.

  1. Using v4.4.5, pair and bond with the device (ESP acting peripheral) from Windows
  2. Update to v4.4.6
  3. Try to connect -> connection is refused on Windows side. On ESP side, logs tell that windows connects and disconnects again
  4. Remove pairing info in windows
  5. Try to connect again -> it works
  6. Try to close connection in windows tool -> windows will not close its connection but keep it open
  7. Try to connect again -> it works (but sometimes it doesn't, have no exact way to reproduce...)
  8. Downgrade to v4.4.5
  9. Windows directly automatically reconnects in background (not wanted)
  10. Try to connect in windows tool -> connection is refused on Windows side, no activity on windows side
    This state is also happening in v4.4.6 but I don't know yet how to reproduce.

Debug Logs.

See More Information.

More Information.

Three example logs in "more Information" that contain much proprietary logging and no ESP logs:

@ztefanjo ztefanjo added the Type: Bug bugs in IDF label Nov 23, 2023
@espressif-bot espressif-bot added the Status: Opened Issue is new label Nov 23, 2023
@github-actions github-actions bot changed the title Windows BLE reconnection not working after update to v4.4.6 Windows BLE reconnection not working after update to v4.4.6 (IDFGH-11531) Nov 23, 2023
@ztefanjo
Copy link
Author

Also attaching our sdkconfigs:
sdkconfig.defaults.txt

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 24, 2023

Based on your description, one potential reason could be that one end has the LTK while the other does not. To verify this, you need to print the bond list on the ESP side. Check if Windows is present in the bond list using the following code:

static void show_bonded_devices(void)
{
    int dev_num = esp_ble_get_bond_device_num();
    ESP_LOGI(GATTS_TABLE_TAG, "Bonded devices number: %d", dev_num);
    if (dev_num == 0) {
        return;
    }

    esp_ble_bond_dev_t *dev_list = (esp_ble_bond_dev_t *)malloc(sizeof(esp_ble_bond_dev_t) * dev_num);
    if (!dev_list) {
        ESP_LOGI(GATTS_TABLE_TAG, "malloc failed, return\n");
        return;
    }
    esp_ble_get_bond_device_list(&dev_num, dev_list);

    ESP_LOGI(GATTS_TABLE_TAG, "Bonded devices list: %d", dev_num);
    for (int i = 0; i < dev_num; i++) {
        esp_log_buffer_hex(GATTS_TABLE_TAG, (void *)dev_list[i].bd_addr, sizeof(esp_bd_addr_t));
    }

    free(dev_list);
}

Additionally, try clearing the LTK on both ends. The removal process is as follows:

  1. On Windows, delete the paired device in the settings.
  2. On ESP, clear the bond list. Specifically, you can use the "remove_all_bonded_devices" API or use the command "idf.py erase-flash".

Verify if the system works correctly after clearing the bond.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 24, 2023

1--Do you have a packet capture tool available? If so, could you please capture the data packets for analysis?

2--To facilitate analysis, kindly utilize the provided patch to display HCI data. If feasible, provide me with comprehensive logs for both normal and problematic scenarios.

0001-fix-bt-Display-hci-data-including-up-and-down (3).patch

3--Could you employ the bisect command to pinpoint the specific commit causing the issue? Here are the steps:

Initiate the bisect process:

git bisect start

Mark the current version as bad:

git bisect bad

Mark a known good version, for example:

git bisect good v2.0.0

...

I will attempt to replicate your issue on my end and eagerly await further feedback from you.

@espressif-bot espressif-bot added Status: In Progress Work is in progress and removed Status: Opened Issue is new labels Nov 24, 2023
@ztefanjo
Copy link
Author

Thanks @zhp0406 we will start look into this. We have a BLE sniffer that we hopefully can use to capture some more details.

@ztefanjo
Copy link
Author

@zhp0406 ,
1 above done, attaching log files from "Wireless Protocol Suite" that you should be able to read if you install the same tool: https://www.teledynelecroy.com/support/softwaredownload/psgdocuments.aspx?standardid=2&mseries=671

Log file 1: This shows expected behavior when paring/bonding with v4.4.5.
Log file 2: Again expected behavior when reconnecting and disconnecting towards v4.4.5.
Log file 3: After update to v4.4.6, the same procedure as in log file 2 fails to connect (ESP is not pairable here). There is a CRC error on frame 243 that could be related.
Log file 4: Make ESP pairable to try to connect - still fails
Log file 5: Removed pairing in Windows and reconnected. Works, but disconnecting does not happen, Windows continues communicating
Log file 6: Downgrade to v4.4.5. Windows connects directly in BG (because it had this erroneous state saved from v4.4.6). Our tools fails to connect directly, unclear if there is any actual communication over the air related to this. There is however a CRC error at frame 2106 that could relate to the connect attempt.

BLE sniffer logs.zip

We are also using a Wifi/BLE coexist patch that we have got from Espressif: https://glab.espressif.cn/technical_support/husqvarnagroup/esp-idf-patch/commit/3144578169766e79180d99a689eb769d6f28c768.
I will also test without this patch to see if this is related.

@ztefanjo
Copy link
Author

In log file 4, there is an error on frame 109.
In log file 5, there are also CRC errors, e.g. on frame 386.

@ztefanjo
Copy link
Author

Same problem without the Wifi/BLE coexist patch, so that is not the problem.

@ztefanjo
Copy link
Author

@zhp0406 , I have now done the bisect exercise as you recommended and got the following result:
7e6b085 is the first bad commit
commit 7e6b085
Author: chenjianhua chenjianhua@espressif.com
Date: Wed Dec 7 15:06:27 2022 +0800

bluedroid: support GATT robust caching feature

I did however need to patch gatt_attr.c when testing this version as GATTS_ROBUST_CACHING_ENABLED was not defined. I assumed it to be always "ENABLED" when compiling.

Do you still require HCI data logs (number 2) or is our BLE sniffer logs together with the bisect result enough to pin-point the problem?

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 27, 2023

Could you please disable GATTS_ROBUST_CACHING_ENABLED and see if the issue persists?
image

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 28, 2023

According to the log you provided (4_Make_pairable_fails_to_connect_anyway.cfa), the Windows side initiated the disconnection with the error code "Remote User Terminated Connection."

I need to further analyze the reason for the Windows-initiated disconnection. Based on the current description, it may be related to the "support GATT robust caching feature."
image

@ztefanjo
Copy link
Author

@zhp0406 : I have now tested with GATTS_ROBUST_CACHING_ENABLED set to be FALSE in gatt_attr.c and it gives the same error.
I also tried to set CONFIG_BT_GATTS_ROBUST_CACHING_ENABLED=y (disabled by default and on all tests while I was doing the bisect excercise) on v4.4.6 and it still gives the same error.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 28, 2023

ok,I need more time to investigate the reasons.If you have any new discoveries, please let me know.
By the way, how can I reproduce your issue on my end?

@ztefanjo
Copy link
Author

@zhp0406, hard to say how much you would need to do to be able to reproduce. Definitely use our sdkconfig. We support two incoming connections and one outgoing (meaning that we restart advertising after connection) at the same time as we do Wifi communication. We have one attribute that we use for simulating a comport on which we do all communication once the connection is set up. We set MTU and connection parameters. We use an external crystal as BT clock source.

Let me know if there are any more tests that we can do in our end.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 29, 2023

1-I am currently trying to reproduce your issue using alternative methods. For example, I used the example (esp-idf/examples/bluetooth/bluedroid/ble/gatt_security_server) and connected with Windows. I observed that Windows actively initiated the disconnection (currently unclear why Windows is disconnecting). However, the connection from an Android device to the same example is normal.
This issue occurred in both v4.4.5 and v4.4.6. I am unsure whether it is the same issue you are experiencing, but regardless, the problem should be addressed.
2-Additionally, based on your previous description, your issue seems to be related to the commit 7e6b085, which introduced robust features and some features of BLE5.3. Therefore, I am providing you with a patch that disables the characteristics of the robust feature and addresses some known issues.
0001-fix-bt-bluedroid-Debug-for-ble-service-change-charac.zip
I am not certain if this patch will resolve your issue. Could you please test it and provide feedback?
3-If I can reproduce your problem on my end, it would be very helpful in finding a solution.

@ztefanjo
Copy link
Author

Hi @zhp0406 ,
Your problem with Windows sounds similar but not identical as we have no problems with v4.4.5.

The patch seems to work :-). I am not able to reproduce the error after applying it, at least not as easy as before. Is the patch in a deployable state or do you want to do additional work on it before it is "production ready"?

@esp-zhp
Copy link
Collaborator

esp-zhp commented Nov 30, 2023

@ztefanjo

In your patch, two main actions were taken. Firstly, there was a fix for the issue where gatts did not respond correctly when reading the service change characteristic descriptor. We have already addressed this problem in the latest master version, and the corresponding commit is c73d7c0.

Secondly, the reboost feature was disabled, as it is relatively new and may lead to compatibility issues.

However, I'm unsure which specific problem caused this. I can provide you with separate patches for each fix, allowing you to test and identify the root cause.
debug_patch.zip

For the final version you'll be using, I recommend applying both patches simultaneously.

Additionally, I will synchronize the code to the release branch and the master branch in the form of configuration options.

@ztefanjo
Copy link
Author

ztefanjo commented Dec 5, 2023

Hi @zhp0406 ,
The problem still remains with the first patch (0001-fix-bt-bluedroid-Fix-bugs-in-ble-service-change-char.patch), so the second patch solves the problem.

@esp-zhp
Copy link
Collaborator

esp-zhp commented Dec 5, 2023

@ztefanjo
Thank you very much for your feedback.

@Indastri
Copy link
Collaborator

Indastri commented Dec 7, 2023

According to the user the issue is now fixed, if there will be any other problems feel free to reopen or create a new issue. Cheers.

@Indastri Indastri closed this as completed Dec 7, 2023
@tom-borcin tom-borcin added Status: Done Issue is done internally Resolution: Done Issue is done internally and removed Status: In Progress Work is in progress labels Dec 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Resolution: Done Issue is done internally Status: Done Issue is done internally Type: Bug bugs in IDF
Projects
None yet
Development

No branches or pull requests

5 participants