BGApiConnHandler thread exceptions, no way to catch and recover #2

Open
karlp opened this issue Jan 13, 2023 · 13 comments
@karlp

karlp commented Jan 13, 2023

I've got some long lived applications using bgapi, and they (far too) regularly have the internal thread die:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
  File "/usr/lib/python3.10/site-packages/bgapi/bglib.py", line 165, in run
  File "/usr/lib/python3.10/site-packages/bgapi/serdeser.py", line 226, in parse
KeyError: 0

I presume this is related to the inherent unreliable nature of the serial channel, but that's unfixable...

Unfortunately, there's no event generated, or caught, so there doesn't seem to be any way for my application to handle this. It would be nice if it could all be safely caught and generate some system level event please...
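
One way an application might notice this without patching the library is Python's standard `threading.excepthook` (available since Python 3.8), which is called when a thread's `run()` raises an uncaught exception. The sketch below is a minimal illustration only: `flaky_worker` is a hypothetical stand-in for the receiver thread, not bgapi code, and it assumes the library lets the exception propagate out of `run()` as the traceback above suggests.

```python
import queue
import threading

# Queue the main loop can poll for failures reported by worker threads.
thread_failures = queue.Queue()

def report_thread_failure(args):
    # args is a threading.ExceptHookArgs: exc_type, exc_value, exc_traceback, thread.
    thread_name = args.thread.name if args.thread else None
    thread_failures.put((thread_name, args.exc_type, args.exc_value))

# Replace the default hook, which only prints the traceback to stderr.
threading.excepthook = report_thread_failure

def flaky_worker():
    # Hypothetical stand-in for a receiver thread whose parser hits bad data.
    raise KeyError(0)

worker = threading.Thread(target=flaky_worker, name="receiver")
worker.start()
worker.join()

name, exc_type, exc_value = thread_failures.get(timeout=1)
print(f"thread {name!r} died with {exc_type.__name__}: {exc_value!r}")
```

The hook is process-wide, so it also sees failures from any other thread in the application; filtering on the thread name or object is left to the caller.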

@silabs-beta
Contributor

Which pybgapi version do you use?
Can you please check if the BGAPI version on your target and the BGAPI version in your xapi file are the same?

@karlp
Author

karlp commented Feb 20, 2023

I've worked around this here: etactica/silabs-pybgapi-fork@4712def

I'm using BGAPI 1.2.0 (imported from PyPI into our fork, where I've added this event wrapper).

Target is a BGM220S, using bt_evt_system_boot(major=5, minor=1, patch=0, build=144, bootloader=0, hw=258, hash=3778289845)

The XAPI files are from the Gecko 4.2.1 SDK package. I've had this problem with 4.2.0 as well; I'm not sure if I was using the NCP code with 4.1.3.

@silabs-beta
Contributor

The versions seem to be correct.
Do you use a standard Silicon Labs board (which board number) or a custom one?

@karlp
Author

karlp commented Feb 20, 2023

At the moment it's a BRD4184A rev A02 (BG22 Thunderboard), but it will eventually be a custom board, most likely with an MG24.

@silabs-beta
Contributor

Thank you!
Since we didn't face the issue with the serial connection that you have described, I think it's worth a try fixing it by updating the adapter firmware of the board, if it's not the latest one yet (1v4p9b113).

@karlp
Author

karlp commented Feb 20, 2023

It's already running 1v4p9b113.

@silabs-beta
Contributor

I see.
You propose to "generate some system level event", but I don't think it's a good idea to mix events that originate from the BT stack on the target with events that originate from the host system. The pybgapi package should be independent from the XAPI file content.
I'd like to understand the root cause and provide a proper solution for that. Can you give some hints how to reproduce this issue? Or do you have logs to share?

@silabs-beta silabs-beta self-assigned this Feb 21, 2023
@karlp
Author

karlp commented Mar 1, 2023

I'm trying to set up an example using your examples repo, but no luck yet. I see this, and also "'<class 'struct.error'>:unpack requires a buffer of 130 bytes [' File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner\n', ' File "/usr/lib/python3.10/site-packages/bgapi/bglib.py", line 186, in run\n', ' File "/usr/lib/python3.10/site-packages/bgapi/serdeser.py", line 304, in parse\n']'" errors a lot more when I'm running on a lower-power CPU. On my laptop, it's much rarer.

The current workload is permanent scanning for legacy advertisements, a single periodic advertiser set, and rolling a loop of client connections to devices seen.
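
For context on why lost bytes would surface as `struct.error`, here is a small sketch using a toy length-prefixed frame. The frame layout and `parse_frame` are hypothetical and are not the BGAPI wire format; the point is only that a header announcing more payload than actually arrived makes `struct` raise.

```python
import struct

def parse_frame(buf):
    """Parse a toy frame: a 1-byte payload length followed by the payload."""
    (length,) = struct.unpack_from("<B", buf, 0)
    (payload,) = struct.unpack_from(f"<{length}s", buf, 1)
    return payload

good = bytes([4]) + b"abcd"
# The header claims 130 payload bytes, but only 100 arrived.
truncated = bytes([130]) + b"x" * 100

print(parse_frame(good))
try:
    parse_frame(truncated)
    error = None
except struct.error as exc:
    error = exc
print("truncated frame raised:", error)
```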

@karlp
Author

karlp commented Mar 7, 2023

I've created an example using the minimal code I can manage that demonstrates this: https://github.com/etactica/pybgapi-examples/tree/just-simple-crasher/example/bt_scanner

It's simply opening the NCP connection in an advertising-rich environment and scanning for everything at once. This hangs with various serdes errors (depending on exactly where it's gotten lost).

The hang is the problem: because the internal bglib thread dies while the app stays running, no standard process monitoring can catch this and restart it. It also happens very often.

@karlp
Author

karlp commented Mar 7, 2023

Also, as far as addressing the root cause goes, raising SL_NCP_EVT_BUF_SIZE to 4096 has no impact on this that I can see, and I'm not sure what else to try in that regard. I'm currently running this on the MG24 Explorer, BRD2703A rev A02, but as mentioned, I was also on the BG22 explorer board.

@silabs-beta
Contributor

Thanks for the reproducer script!
I'm running it on a Raspberry Pi 2 model B with BRD4184A board, but I couldn't reproduce the issue within 30+ minutes.
I suspect that some bytes are already lost in the serial device driver of your host system.
I see 3 ways to fix it.

  1. Fix the serial device handling in your system.
  2. Make the bglib receiver thread available to the user application, so it can be monitored.
  3. Try to make the transport layer more robust by using CPCd: https://github.com/SiliconLabs/cpc-daemon
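
If the receiver thread were exposed as in option 2, an application could monitor it with a small watchdog along these lines. This is a sketch under assumptions: `watch_thread` is a hypothetical helper, and the `receiver` thread here is a stand-in, since how pybgapi would expose its thread object is not specified in this discussion.

```python
import threading
import time

def watch_thread(thread, on_death, poll_interval=0.05):
    """Start a daemon watchdog that calls on_death() once `thread` stops running."""
    def _watch():
        while thread.is_alive():
            time.sleep(poll_interval)
        on_death()
    watchdog = threading.Thread(target=_watch, name="watchdog", daemon=True)
    watchdog.start()
    return watchdog

# Stand-in for the library's receiver thread: it exits shortly after starting.
receiver = threading.Thread(target=lambda: time.sleep(0.1), name="receiver")
receiver.start()

died = threading.Event()
# A real application might instead exit the process here so a supervisor restarts it.
watch_thread(receiver, on_death=died.set)

died.wait(timeout=2)
print("receiver alive:", receiver.is_alive(), "| watchdog fired:", died.is_set())
```

The thread must already be started before it is passed to the watchdog; an unstarted thread reports `is_alive()` as false and would trigger the callback immediately.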

@karlp
Author

karlp commented Mar 7, 2023

Option 1 would be ideal, of course, but I'm not sure what avenue to even try; I'm not used to having a 115200 serial port lose data like this. I do agree it seems like the serial port is losing data. It seems to be CPU related though: with all the debugging prints, it tops out at 100% and looks like it's switching cores as it's rescheduled. I tried extending the event buffer size in NCP, but that didn't help. What speed can I configure the J-Link on these dev boards to? I'd like to use 1 Mbit/s or more if I could.

  2. That's an option too, but it seems like putting a lot of work on the user. The reason I made it emit an "internal" event is that it was then at least handled by the normal event handler paths.
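
The "internal event" idea can be sketched roughly as follows. This is not the code from the fork linked earlier; `Receiver`, `ReceiverDied`, and the toy parser are hypothetical names used only to show a run loop that reports its own failure through the same queue the application already reads events from.

```python
import queue
import threading

class ReceiverDied:
    """Hypothetical host-side event describing why the receiver thread stopped."""
    def __init__(self, exc):
        self.exc = exc

class Receiver(threading.Thread):
    """Stand-in receiver: parses chunks and posts events to a shared queue."""
    def __init__(self, chunks, events):
        super().__init__(name="receiver", daemon=True)
        self.chunks = chunks
        self.events = events

    def parse(self, chunk):
        # Toy parser: a table lookup that raises KeyError on an unknown id.
        return {1: "scan_report", 2: "connection_opened"}[chunk]

    def run(self):
        try:
            for chunk in self.chunks:
                self.events.put(self.parse(chunk))
        except Exception as exc:
            # Report the failure through the normal event path instead of dying silently.
            self.events.put(ReceiverDied(exc))

events = queue.Queue()
Receiver([1, 2, 0, 1], events).start()

handled = []
while True:
    event = events.get(timeout=1)
    if isinstance(event, ReceiverDied):
        handled.append(f"receiver died: {event.exc!r}")
        break
    handled.append(event)
print(handled)
```

Because `ReceiverDied` is a separate host-side type rather than an entry from the XAPI definitions, the application can tell it apart from stack events with a simple type check.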

I'm not sure option 3 would actually help if I'm having problems with the serial port losing data. I did look at that, as it seems essential if I want to use Bluetooth and Matter concurrently down the road anyway, and I can try packaging cpcd for this platform to try it out.

@karlp
Author

karlp commented Mar 7, 2023

With cpcd, I still get failures, and similar errors from cpcd too:

FATAL system call in function 'server_push_data_to_endpoint' in file server_core/server/server.c at line #1741 : Invalid argument
[15:55:48:278] Server : send() failed with EAGAIN
[15:55:48:278] *** ASSERT *** : FATAL system call in function 'server_push_data_to_endpoint' in file server_core/server/server.c at line #1741 : Invalid argument

I had to modify the client side a little so that it doesn't issue a reset but instead waits 2 seconds before trying the HELLO command, but it still crashes very rapidly, so it's ~equivalent to the normal NCP case, just with more moving pieces :)
