Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very frequent connect/disconnect issue #185

Open
petermeter69 opened this issue Jun 2, 2023 · 4 comments
Open

Very frequent connect/disconnect issue #185

petermeter69 opened this issue Jun 2, 2023 · 4 comments

Comments

@petermeter69
Copy link

Hello,

i am developing a BLE device based on an NRF52 chip to collect sensor data, and the android app to visualize them at the same time. So to speak, I am on both ends of the BLE connection.

The given sensor collects a lot of data (15 kB/s = 1.3 GB/day = ca. 40 GB/month). If I establish a connection and continuously stream one sensor's data to one Android device, this library works perfectly. Each BLE connection is stable (i.e. not interrupted) for months (!), probably longer (just not yet tested), and data flow is smooth.

Now, I want to test another scenario: one android device collects many nearby sensor's data (in the magnitude of 100). Because of BLE throughput limitation and with above mentioned data rate, only one or two devices can be allowed to be connected at a time only, so a lot of device connects and disconnects have to be performed. This works on both ends (BLE device and Android) well until ca. 200 - 300 connects/disconnects. After that, there is no error on either side, the Android BLE stack just stops working, as if it was overfloated. Even if I open an independent Android app with an independent sensor, the BLE stacks is dead. After an Android device restart, everything works again until the amount of connects/disconnects is reached.

In all these scenarios the BLE device is a peripheral and the Android side is the central, with later initiating the connection to the peripheral with autoconnect.
If I program the BLE device with a very simple program, that does nothing else but wait for a connection, and after 1 second initiate a local disconnect without any data transfer, I can reproduce this issue with just one BLE device and 200-300 connect/disconnect attempts to the very same device.
If I autoconnect to the same BLE device from NRF Connect Android app (this application is provided by the manufacturer of the chip to test BLE connections), everything works fine, no BLE stack overfloating occurs (tested up to ca. 10000 attempts).
Another observation is that with the Android BLE stack overfloated, the already opened connections continue to work (i.e. data transfer continues), just no new connections can be established.

I am ready to provide code samples (BLE device and Android) if needed.

Thanks for your help in advance!
P.

@weliem
Copy link
Owner

weliem commented Jun 2, 2023

hmmm interesting. Sounds like something isn't being cleaned up properly.

What do you see in the logs when it stops working? please share logs

@petermeter69
Copy link
Author

First, thanks for your asnwer!
On the Android side, these are the first 2 full connect disconnect cycles with BLE tag when everything works:

onConnectedPeripheral: D0:74:F5:B1:18:59
onServicesDiscovered: D0:74:F5:B1:18:59
onMtuChanged: D0:74:F5:B1:18:59 247
onPhyUpdate: D0:74:F5:B1:18:59 tx=LE_2M rx=LE_2M
onConnectionUpdated: D0:74:F5:B1:18:59 int=36 lat=0 to=500
onDisconnectedPeripheral: D0:74:F5:B1:18:59

onConnectedPeripheral: D0:74:F5:B1:18:59
onServicesDiscovered: D0:74:F5:B1:18:59
onMtuChanged: D0:74:F5:B1:18:59 247
onPhyUpdate: D0:74:F5:B1:18:59 tx=LE_2M rx=LE_2M
onNotificationStateUpdate: D0:74:F5:B1:18:59 6e400003-b5a3-f393-e0a9-e50e24dcca9e
onConnectionUpdated: D0:74:F5:B1:18:59 int=36 lat=0 to=500
onDisconnectedPeripheral: D0:74:F5:B1:18:59`

The below is the last working connect/disconnect cycle on Android side:

onConnectedPeripheral: D0:74:F5:B1:18:59 onServicesDiscovered: D0:74:F5:B1:18:59 onMtuChanged: D0:74:F5:B1:18:59 247 onPhyUpdate: D0:74:F5:B1:18:59 tx=LE_2M rx=LE_2M onConnectionUpdated: D0:74:F5:B1:18:59 int=36 lat=0 to=500 onNotificationStateUpdate: D0:74:F5:B1:18:59 6e400003-b5a3-f393-e0a9-e50e24dcca9e onDisconnectedPeripheral: D0:74:F5:B1:18:59

After that, the BLE has overfloated and the log has no other entries with BLE TAG.
On the BLE device side these are the last cycles (CC=Connection counter):

`
23:32:23.466 -> Conn: Galaxy A71, CC=189
23:32:24.528 -> Disc, reason = 0x16

23:32:28.807 -> Conn: Galaxy A71, CC=190
23:32:29.924 -> Disc, reason = 0x16

23:32:34.177 -> Conn: Galaxy A71, CC=191
23:32:35.344 -> Disc, reason = 0x16

23:32:39.601 -> Conn: Galaxy A71, CC=192
23:32:40.721 -> Disc, reason = 0x16

23:32:44.993 -> Conn: Galaxy A71, CC=193
23:32:46.120 -> Disc, reason = 0x16

23:32:46.676 -> Conn: Galaxy A71, CC=194
23:32:46.676 -> Disc, reason = 0x13
`
As you can see, a total of 194 connections were established before the BLE overfloating. Normally the disconnect reason is 0x16 = Local Host Terminated Connection (because I am explicitly calling the disconnect function on the BLE device after 1000 msecs) while the last disconnect reason was 0x13 = Remote Host Terminated Connection. Interestingly enough, the last disconnect occurs at the exact same moment as the connect.

For comparison on the BLE device side, when autoconnecting from the NRF Connect app the counter was 17217 when I manually disconnected it (=no BLE overfloating):
`
23:14:35.762 -> Conn: Galaxy A71, CC=17212
23:14:36.898 -> Disc, reason = 0x16

23:14:37.085 -> Conn: Galaxy A71, CC=17213
23:14:38.161 -> Disc, reason = 0x16

23:14:38.254 -> Conn: Galaxy A71, CC=17214
23:14:39.378 -> Disc, reason = 0x16

23:14:39.472 -> Conn: Galaxy A71, CC=17215
23:14:40.594 -> Disc, reason = 0x16

23:14:40.688 -> Conn: Galaxy A71, CC=17216
23:14:41.792 -> Disc, reason = 0x16

23:14:41.890 -> Conn: Galaxy A71, CC=17217
23:14:43.008 -> Disc, reason = 0x16
`
Any thoughts?

@weliem
Copy link
Owner

weliem commented Jun 4, 2023

Ok that doesn't say much that would lead us to a solution. You mentioned that you have an app with which you can reproduce the issue? Can you share that? Then I'll try to reproduce the issue myself.

What I see in the logs is that you do more than just connecting. You change MTU, connection settings, Phy and turn on notifications...that might give a clue. Do you also do that with the nRF app?

Another thing worth trying is to wait 5 seconds before you issue the autoConnect. I have noticed that the Android stack sometimes needs some time to process the disconnect....

@petermeter69
Copy link
Author

My android app is supposed to do much more than just connecting to one sensor with no data transfer. Initially, I had 8 sensors in an array of peripherals with full data transfer (max MTU, data receiving in notifications on a per sensor basis, etc.) On having noticed the problem, I started removing functionality from the app, until I came to the one sensor connect/disconnect problem without any data transfer. That is why parts of the initial code are still there (MTU, notifications, connection params were requested in onServicesDiscovered and in the BLE device, as well). As for issuing autoconnect with a delay, NRF Connect handles it immediately without delay, so it must work anyway.

I have further investigated into this issue on the BLE device side. I switched on counting all disconnect reasons (current and previous ones).
On autoconnecting from NRFConnect (the below is an arbitrary pick not the end of connection):

20:34:29.605 -> Conn: Galaxy A71, CC=2769
20:34:30.723 -> Disc, current reason=0x16, Prev disc reasons: 0x16:2637 0x22:132
20:34:30.723 ->
20:34:32.655 -> Conn: Galaxy A71, CC=2770
20:34:38.688 -> Disc, current reason=0x22, Prev disc reasons: 0x16:2637 0x22:133
20:34:38.688 ->
20:34:40.486 -> Conn: Galaxy A71, CC=2771
20:34:41.552 -> Disc, current reason=0x16, Prev disc reasons: 0x16:2638 0x22:133
20:34:41.553 ->
20:34:42.353 -> Conn: Galaxy A71, CC=2772
20:34:43.485 -> Disc, current reason=0x16, Prev disc reasons: 0x16:2639 0x22:133

As you can see, there are only two reasons of disconnect. 95% of all disconnects (2639/2772=0.95) are 0x16 = Local Host Terminated Connection and 5% are 0x22 = LMP_RESPONSE_TIMEOUT. The above shows only one series, but I ran many of them and the response timeout rate changes in the range 1% .. 5% depending on how congested the current environment is (WiFi/BLE other 2.4GHz stuff). I attribute the 0x22 reasons to this.

On autoconnecting from my app (this time all MTU, connection, and notification requests in onServicesDiscovered commented out). At connect #247 the usual termination occured):

23:17:06.496 -> Conn: , CC=243
23:17:07.564 -> Disc, current reason=0x16, Prev disc reasons: 0x13:36 0x16:200 0x22:7
23:17:07.564 ->
23:17:07.689 -> Conn: Galaxy A71, CC=244
23:17:08.766 -> Disc, current reason=0x16, Prev disc reasons: 0x13:36 0x16:201 0x22:7
23:17:08.766 ->
23:17:09.389 -> Conn: Galaxy A71, CC=245
23:17:10.527 -> Disc, current reason=0x16, Prev disc reasons: 0x13:36 0x16:202 0x22:7
23:17:10.527 ->
23:17:10.621 -> Conn: Galaxy A71, CC=246
23:17:11.710 -> Disc, current reason=0x16, Prev disc reasons: 0x13:36 0x16:203 0x22:7
23:17:11.710 ->
23:17:12.289 -> Conn: , CC=247
23:17:12.289 -> Disc, current reason=0x13, Prev disc reasons: 0x13:37 0x16:203 0x22:7

As you can see, another disconnect reason appeared 0x13 = Remote Host Terminated Connection, and quite a lot of it 37 out of 247 connections. 0x13 was not present in the device logs made with NRF Connect, and its origin is totally unclear to me (my Android app does not call cancelconnection for sure, I double checked). As previously pointed out, the time of disconnect always equals to the time of connect with each 0x13 reason, whereas in the 0x16 cases the difference is around 1100 ms (because 1000 ms after connecting I call disconnect on the current connection in the device).

My conclusions:
1.A large part of the disconnects on the device are initiated by remote host (which are not called by my Android code for sure). If a disconnect was initiated by remote Android code, the time of connect and disconnect on the device could not equal.
2.In the device log I always include the name of the remote host. As you can see, when I connect to the device from my app, sometimes (e.g.: CC=243, CC=247) this info is missing, but it is never the case when when connecting over NRFConnect.
3.When connections overflow, the last disconnect reason is always 0x13 (confirmed by 10 out of 10 attempts).
I suspect they are the origin of this issue, we should find why.

I can send you my app, but you would need the same firmware on the device as mine. Do you have access to any NRF52 board? If not, they can be purchased from 10 USD. I could help you get going.

Thanks for your help!

Peter

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants