-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] [PYTHON] BLE Scan crash #26265
Comments
@mingjiansi What's the stack to the failing assert? |
And I thought this was fixed in #26200 ? |
@bzbarsky-apple my use case might be a little different than usual. What i have done is to use the build_python.sh script to compile into whl file then directly use the whl package in code. This allows me to test light bulbs and plugs in a more controlled and automated fashion. However, in this use case there are two different failures.
|
console screen for the error is following:
|
Run python under a debugger, wait for the crash, then look at the stack. |
That said, |
this unfortunately did not fix the issue. the crash just happened in src/platform/Linux/bluez/ChipDeviceScanner.cpp instead. |
I attempted using pdb and set_trace. since the error happened in the c++ library, there was nothing from python side |
I'm not following. I suggested adding the appropriate locks to |
You'd want to use gdb attached to the python process. But again: I suspect that fixing src/platform/Linux/bluez/ChipDeviceScanner.cpp would help. It might not be the only thing that needs fixing, of course. |
Hi @mingjiansi, can you tell me the exact chip-repl command you are running. Right now when I run I get a generator. If I try doing something like:
If you have a way to reproduce, normally the way have debugged with GDB I have done |
@tehampson As I mentioned above, this error occurred when I ran a python file I wrote outside of repl. for example something like the following.
|
Hm... When I am doing something similar in the chip-repl interactive shell I am not hitting any crash. I get a timeout but that is it. @mingjiansi did you try doing something like |
@tehampson @bzbarsky-apple Here is what I have realized the in past few days. The issue basically stem from |
I tried this before, really didn't get anywhere with it.
|
When you get into GDB after the issue happens please do Also I am right now recompiling everything on a linux machine that has bluetooth so maybe I will be able to reproduce soon |
|
Sweet! We just need to add After that I think we will be 👌 |
Created a pull request #26338 Let me know if that unblocks you? I am getting a different segfault later on, but that was happening before as well, and I am not really setup to comission BLE devices so this is likely an issue on my side. For posterity my new segfault is in |
Unfortunately this does not work.
|
That is getting further. Now we have a crash in I have updated #26338 as well already |
As for this error. I have a fix for that. It is just the Python interface only passed 2 handlers, missing the scan error handler. In my branch right now, I just added a default handler to get pass that. |
@tehampson I thought |
My understanding is that #26180 introduce check to make sure that anything that schedules work (or cancels work) has the chip stack lock. We only want to acquire this lock right before calls to schedule/cancel work. Doing so in I am simply looking at precedent set by other fixes where they surround the |
Keep me posted if the 4 extra lines in #26338 unblocks you. If so I will work on getting it merged (or if you want to upload the PR I can close mine). If you hit another issue please upload the stack trace |
Doesn't seem to fix it, ran into an infinite loop of sort.
|
That looks like it ran successfully. What happens if you remove gdb and just do |
tried already. just infinite loop after the scanner completes. feels like some timer thread didn't exit. |
fwiw, this is the branch i am working on that already fixed the scan error handler missing. |
Are you on slack? Was this ever working for you before? Or are you only recently trying to use this? I will have to context switch for the day. But I do think we are making some progress here |
Also I am pretty sure we are stuck in here. If you change your
|
I have to run as well, not sure if you can find me on slack via my email or not. mingjian.si@savant.com |
tried it. same thing, something is not exit correctly. whole thing hangs. |
Hmm... it won't let me DM you 🤷 . I guess we can just go back and forth here. I do think that your issue is that you are waiting on that queue forever, and that if you handle it in your error callback you should be unstuck. I do think those error related changes can be in a follow up PR since that is working on an adjacent but different issue |
You can also add a timeout on the
Edit: Finger slipped I did not mean to close this issue, also I was not done writing my comment. I was going to add that if that timeout doesn't work for you I think we will need to use PDB to understand why |
@tehampson I have implemented the queue change that you have suggested stilled ended up with hanging same issue. By using the dumbest method available to me, I have added a bunch of debug console lines to see the path of execution I have made a discovery. From what I can tell it looks like something is wrong with the OnComplete handler. at this point. https://github.com/mingjiansi/connectedhomeip/blob/93a5e8093693fa102c662effdf17903b032dc316/src/platform/Linux/bluez/ChipDeviceScanner.cpp#L245 my observation is as follow. I guess the problem currently lies in the OnScanComplete handler, but nothing really is jumping at me. |
@mingjiansi , weird I am not reproducing the issue you are seeing. You are now running your script without gdb right? When you use PDB, and add Also I am now context switched into this issue again and should be more responsive to this issue for the next 1-2 hours |
Hm... Using GDB I think I can see what you mean by |
great. at this point, it almost seems like there is a deadlock somewhere. |
Because there is! Thread running main event loop:
Thread running delete this
Lock is acquired in these two spots:
Posting findings for now, but still seeing what we can do about this and why it is happening |
@tehampson Thank you. My god, I have no idea how you are deciphering logs like that. orz |
Deleted comment because it actually did not work |
oh good. i thought my reverse midas touch might have ruined it. |
@mingjiansi Do you mind if I just also add in your |
Please do. |
I have updated #26338 to what I have right now. There is still one remaining hang during chip-repl shutdown. But let me know if that gets you up and running for the most part |
Okay so #26338 now "works". I say that in quotes since there are still some paths that will crash, just not the path that @mingjiansi you are doing right now. I will figure this out, but for now just sharing the state of this before I log off for the evening. |
I am heading out of the door now, I will give this a try first thing tomorrow. |
Reproduction steps
Initiate DiscoverySync or DiscoveryAsync will cause entire process to crash with the following message
[1682519165.322591][1219598:1219598] CHIP:DL: Chip stack locking error at 'src/system/SystemLayerImplSelect.cpp:125'. Code is unsafe/racy
[1682519165.322610][1219598:1219598] CHIP:-: chipDie chipDie chipDie
Issue can be traced to
assertChipStackLockedByCurrentThread();
to the beginning of LayerSelect::StartTimer and LayerSelect::CancelTimerBug prevalence
Everytime
GitHub hash of the SDK that was being used
980a1a1
Platform
python
Platform Version(s)
No response
Anything else?
Issue can be traced to
assertChipStackLockedByCurrentThread();
to the beginning of LayerSelect::StartTimer and LayerSelect::CancelTimerThe text was updated successfully, but these errors were encountered: