Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] [PYTHON] BLE Scan crash #26265

Closed
mingjiansi opened this issue Apr 26, 2023 · 48 comments · Fixed by #26338
Closed

[BUG] [PYTHON] BLE Scan crash #26265

mingjiansi opened this issue Apr 26, 2023 · 48 comments · Fixed by #26338
Assignees
Labels

Comments

@mingjiansi
Copy link

Reproduction steps

Initiate DiscoverySync or DiscoveryAsync will cause entire process to crash with the following message
[1682519165.322591][1219598:1219598] CHIP:DL: Chip stack locking error at 'src/system/SystemLayerImplSelect.cpp:125'. Code is unsafe/racy
[1682519165.322610][1219598:1219598] CHIP:-: chipDie chipDie chipDie

Issue can be traced to assertChipStackLockedByCurrentThread(); to the beginning of LayerSelect::StartTimer and LayerSelect::CancelTimer

Bug prevalence

Everytime

GitHub hash of the SDK that was being used

980a1a1

Platform

python

Platform Version(s)

No response

Anything else?

Issue can be traced to assertChipStackLockedByCurrentThread(); to the beginning of LayerSelect::StartTimer and LayerSelect::CancelTimer

@bzbarsky-apple
Copy link
Contributor

@mingjiansi What's the stack to the failing assert?

@bzbarsky-apple
Copy link
Contributor

And I thought this was fixed in #26200 ?

@mingjiansi
Copy link
Author

@bzbarsky-apple my use case might be a little different than usual. What i have done is to use the build_python.sh script to compile into whl file then directly use the whl package in code. This allows me to test light bulbs and plugs in a more controlled and automated fashion. However, in this use case there are two different failures.

  1. when in python script directly import and invoke chip.ble.DiscoverySync or the async version. the stack immediately crashes. After i commented out the assertChipStackLockedByCurrentThread(); from the StartTimer and CancelTimer, failure no longer occur. At this point, I have no idea how I can get trace the stack for this issue.
  2. In the python interface, there is no ScanError callback, and from what I can tell the C++ interface is expecting that, so whenever a BLE scan error occurs, stack fails again. I have added a default handler for error that seems to do the trick.

@mingjiansi
Copy link
Author

console screen for the error is following:

[1682610262.862360][872234:872234] CHIP:CTL: Setting attestation nonce to random value
[1682610262.863067][872234:872234] CHIP:CTL: Setting CSR nonce to random value
[1682610262.863810][872234:872234] CHIP:DL: writing settings to file (/tmp/chip_counters.ini-pMqW7J)
[1682610262.863967][872234:872234] CHIP:DL: renamed tmp file to file (/tmp/chip_counters.ini)
[1682610262.863977][872234:872234] CHIP:DL: NVS set: chip-counters/reboot-count = 9 (0x9)
[1682610262.864598][872234:872234] CHIP:DL: Got Ethernet interface: enx5cff35d579d4
[1682610262.864916][872234:872234] CHIP:DL: Found the primary Ethernet interface:enx5cff35d579d4
[1682610262.864956][872234:872234] CHIP:DL: Got WiFi interface: wlp45s0
[1682610262.866640][872234:872234] CHIP:DL: Found the primary WiFi interface:wlp45s0
Initializing persistent storage from file: ./chip-ge-stacks.bin
Loading configuration from ./chip-ge-stacks.bin...
[1682610262.867335][872234:872234] CHIP:TS: Last Known Good Time: 2023-04-27T11:42:08
[1682610262.867666][872234:872234] CHIP:FP: Fabric index 0x1 was retrieved from storage. Compressed FabricId 0xFEB3387C2067A0B3, FabricId 0x000000000000000F, NodeId 0x0000000000001F40, VendorId 0x1339
[1682610262.868030][872234:872234] CHIP:ZCL: Using ZAP configuration...
[1682610262.869115][872234:872234] CHIP:DL: Avahi client registered
Loading certificate authorities from storage...
New CertificateAuthority at index 1
Loading fabric admins from storage...
New FabricAdmin: FabricId: 0x000000000000000F, VendorId = 0x1339
Allocating new controller with CaIndex: 1, FabricId: 0x000000000000000F, NodeId: 0x0000000000001F40, CatTags: []
[1682610262.907842][872234:872240] CHIP:CTL: Setting attestation nonce to random value
[1682610262.907903][872234:872240] CHIP:CTL: Setting CSR nonce to random value
[1682610262.907922][872234:872240] CHIP:SPT: Using device attestation PAA trust store path /home/ubuntu/trust-cert/.
[1682610262.908314][872234:872240] CHIP:CTL: Generating NOC
[1682610262.908752][872234:872240] CHIP:FP: Validating NOC chain
[1682610262.909252][872234:872240] CHIP:FP: NOC chain validation successful
[1682610262.909298][872234:872240] CHIP:FP: Updated fabric at index: 0x1, Node ID: 0x0000000000001F40
[1682610262.909305][872234:872240] CHIP:TS: Last Known Good Time: 2023-04-27T11:42:08
[1682610262.909309][872234:872240] CHIP:TS: New proposed Last Known Good Time: 2021-01-01T00:00:00
[1682610262.909314][872234:872240] CHIP:TS: Retaining current Last Known Good Time
[1682610262.909585][872234:872240] CHIP:FP: Metadata for Fabric 0x1 persisted to storage.
[1682610262.909748][872234:872240] CHIP:TS: Committing Last Known Good Time to storage: 2023-04-27T11:42:08
[1682610262.909911][872234:872240] CHIP:CTL: Joined the fabric at index 1. Fabric ID is 0x000000000000000F (Compressed Fabric ID: FEB3387C2067A0B3)
[1682610262.909916][872234:872240] CHIP:CTL: *** Missing DeviceAttestationVerifier configuration at DeviceCommissioner init: using global default, consider passing one in CommissionerInitParams.
[1682610262.909964][872234:872240] CHIP:SPT: Setting up group data for Fabric Index 1 with Compressed Fabric ID:
[{'fabricID': 15, 'devices': [{'nodeID': 15, 'productID': 89}, {'nodeID': 16, 'productID': 89}]}]
[1682610262.919278][872234:872239] CHIP:BLE: BLE removing known devices.
[1682610262.919795][872234:872239] CHIP:BLE: BLE initiating scan.
[1682610263.042385][872234:872234] CHIP:DL: Chip stack locking error at 'src/system/SystemLayerImplSelect.cpp:125'. Code is unsafe/racy
[1682610263.042400][872234:872234] CHIP:-: chipDie chipDie chipDie
Aborted

@bzbarsky-apple
Copy link
Contributor

At this point, I have no idea how I can get trace the stack for this issue.

Run python under a debugger, wait for the crash, then look at the stack.

@bzbarsky-apple
Copy link
Contributor

That said, src/platform/Linux/bluez/ChipDeviceScanner.cpp has StartTimer and CancelTimer calls that might be happening on the wrong threads and should probably take the stack lock. @mingjiansi can you test whether that fixes your issue, please?

@mingjiansi
Copy link
Author

That said, src/platform/Linux/bluez/ChipDeviceScanner.cpp has StartTimer and CancelTimer calls that might be happening on the wrong threads and should probably take the stack lock. @mingjiansi can you test whether that fixes your issue, please?

this unfortunately did not fix the issue. the crash just happened in src/platform/Linux/bluez/ChipDeviceScanner.cpp instead.

@mingjiansi
Copy link
Author

At this point, I have no idea how I can get trace the stack for this issue.

Run python under a debugger, wait for the crash, then look at the stack.

I attempted using pdb and set_trace. since the error happened in the c++ library, there was nothing from python side

@bzbarsky-apple
Copy link
Contributor

 this unfortunately did not fix the issue. the crash just happened in src/platform/Linux/bluez/ChipDeviceScanner.cpp instead.

I'm not following. I suggested adding the appropriate locks to src/platform/Linux/bluez/ChipDeviceScanner.cpp. Did you try doing that?

@bzbarsky-apple
Copy link
Contributor

I attempted using pdb and set_trace. since the error happened in the c++ library, there was nothing from python side

You'd want to use gdb attached to the python process. But again: I suspect that fixing src/platform/Linux/bluez/ChipDeviceScanner.cpp would help. It might not be the only thing that needs fixing, of course.

@tehampson tehampson self-assigned this May 2, 2023
@tehampson
Copy link
Contributor

Hi @mingjiansi, can you tell me the exact chip-repl command you are running. Right now when I run chip-repl I do the following 2 command:
import chip.ble
asdf = chip.ble.DiscoverSync(timeoutMs=5000)

I get a generator. If I try doing something like: for x in asdf: print(x) I get

2023-05-02 18:21:06 thampson-boreal.c.googlers.com chip.DL[1086138] ERROR Failed to get DBUS object manager for listing adapters.
2023-05-02 18:21:06 thampson-boreal.c.googlers.com chip.DL[1086138] ERROR DBus error: Error calling StartServiceByName for org.bluez: Failed to activate service 'org.bluez': timed out (service_start_timeout=25000ms)

If you have a way to reproduce, normally the way have debugged with GDB I have done gdb --args python3 .environment/pigweed-venv/bin/chip-repl. It will compain about not it cannot find symbols for python3. That is fine, just type r run, reproduce the issue and typically you can get the stack trace

@mingjiansi
Copy link
Author

@tehampson As I mentioned above, this error occurred when I ran a python file I wrote outside of repl. for example something like the following.

import argparse
import asyncio
import atexit
import builtins
import datetime
import json
import logging
import random
import time
import math

import chip.CertificateAuthority
import chip.discovery
import chip.FabricAdmin
import chip.logging
import chip.native
import chip.ble
import chip.setup_payload
import coloredlogs
from chip import discovery
from chip.ChipDeviceCtrl import ChipDeviceController
from chip.ChipStack import *


def chipStartUp(chipFilePath: str):
    chip.native.Init()

    chipStack = ChipStack(persistentStoragePath=chipFilePath,
                          bluetoothAdapter=None,
                          enableServerInteractions=False)

    certificateAuthorityManager = chip.CertificateAuthority.CertificateAuthorityManager(chipStack,
                                                                                        chipStack.GetStorageManager())

    certificateAuthorityManager.LoadAuthoritiesFromStorage()
    vendor = 4921
    fabric = 15

    if (len(certificateAuthorityManager.activeCaList) == 0):
        ca = certificateAuthorityManager.NewCertificateAuthority()
        ca.NewFabricAdmin(vendorId=vendor, fabricId=fabric)
    elif (len(certificateAuthorityManager.activeCaList[0].adminList) == 0):
        certificateAuthorityManager.activeCaList[0].NewFabricAdmin(vendorId=vendor, fabricId=fabric)

    caList = certificateAuthorityManager.activeCaList

    certPath = '/home/ubuntu/trust-cert/'

    devCtrl = caList[0].adminList[0].NewController(
        10, certPath, False
    )

    builtins.devCtrl = devCtrl
    builtins.certificateAuthorityManager = certificateAuthorityManager


if __name__ == '__main__':
    chipStartUp('./chip-ge-stacks.bin')

    bleValues = chip.ble.DiscoverSync(timeoutMs=5000)
    for v in bleValues:
        print(v)
'''        

@tehampson
Copy link
Contributor

Hm... When I am doing something similar in the chip-repl interactive shell I am not hitting any crash. I get a timeout but that is it.

@mingjiansi did you try doing something like gdb --args python3 yourscript.py followed by r in the gdb shell?

@mingjiansi
Copy link
Author

@tehampson @bzbarsky-apple Here is what I have realized the in past few days. The issue basically stem from src/controller/python/chip/ble/LinuxImpl.cpp the python code call directly creates the BLE scanner by invoking pychip_ble_start_scanning the result is that the delegate itself never locked/unlock the chip stack here. Once the timer is created inside the StartTimer, assertChipStackLockedByCurrentThread() would then cause error. When chip::DeviceLayer::PlatformMgr().LockChipStack(); and chip::DeviceLayer::PlatformMgr().UnlockChipStack(); were placed in, things starts to workish. Problem is that at this point I have yet to figure out where/how to add them.

@mingjiansi
Copy link
Author

mingjiansi commented May 2, 2023

Hm... When I am doing something similar in the chip-repl interactive shell I am not hitting any crash. I get a timeout but that is it.

@mingjiansi did you try doing something like gdb --args python3 yourscript.py followed by r in the gdb shell?

I tried this before, really didn't get anywhere with it.

mingjian@mingjian-950xdb ~/Git/GE-Matter-Controller/src $ sudo gdb --args python test.py
[sudo] password for mingjian: 
GNU gdb (Ubuntu 12.1-3ubuntu2) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...
(No debugging symbols found in python)
(gdb) r
Starting program: /usr/bin/python test.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff4c0d6c0 (LWP 21374)]
[1683055142.047764][21371:21371] CHIP:CTL: Setting attestation nonce to random value
[1683055142.048741][21371:21371] CHIP:CTL: Setting CSR nonce to random value
[New Thread 0x7fffeefb86c0 (LWP 21404)]
[1683055142.049646][21371:21371] CHIP:DL: writing settings to file (/tmp/chip_counters.ini-msNfnQ)
[1683055142.049777][21371:21371] CHIP:DL: renamed tmp file to file (/tmp/chip_counters.ini)
[1683055142.049786][21371:21371] CHIP:DL: NVS set: chip-counters/reboot-count = 3 (0x3)
[1683055142.050146][21371:21371] CHIP:DL: Got Ethernet interface: enx5cff35d579d4
[1683055142.050420][21371:21371] CHIP:DL: Found the primary Ethernet interface:enx5cff35d579d4
[1683055142.050699][21371:21371] CHIP:DL: Got WiFi interface: wlp45s0
[1683055142.052124][21371:21371] CHIP:DL: Found the primary WiFi interface:wlp45s0
Initializing persistent storage from file: ./chip-ge-stacks.bin
Loading configuration from ./chip-ge-stacks.bin...
[1683055142.052789][21371:21371] CHIP:TS: Last Known Good Time: 2023-05-02T15:13:48
[1683055142.053097][21371:21371] CHIP:FP: Fabric index 0x1 was retrieved from storage. Compressed FabricId 0xFEB3387C2067A0B3, FabricId 0x000000000000000F, NodeId 0x0000000000001F40, VendorId 0x1339
[1683055142.053213][21371:21371] CHIP:FP: Fabric index 0x2 was retrieved from storage. Compressed FabricId 0xFEB3387C2067A0B3, FabricId 0x000000000000000F, NodeId 0x000000000000000A, VendorId 0x1339
[1683055142.053557][21371:21371] CHIP:ZCL: Using ZAP configuration...
[1683055142.054461][21371:21371] CHIP:DL: Avahi client registered
[New Thread 0x7fffee7b76c0 (LWP 21405)]
Loading certificate authorities from storage...
New CertificateAuthority at index 1
Loading fabric admins from storage...
New FabricAdmin: FabricId: 0x000000000000000F, VendorId = 0x1339
Allocating new controller with CaIndex: 1, FabricId: 0x000000000000000F, NodeId: 0x000000000000000A, CatTags: []
[1683055142.098168][21371:21405] CHIP:CTL: Setting attestation nonce to random value
[1683055142.098238][21371:21405] CHIP:CTL: Setting CSR nonce to random value
[1683055142.098261][21371:21405] CHIP:SPT: Using device attestation PAA trust store path /home/ubuntu/trust-cert/.
[1683055142.098545][21371:21405] CHIP:CTL: Generating NOC
[1683055142.098884][21371:21405] CHIP:FP: Validating NOC chain
[1683055142.099382][21371:21405] CHIP:FP: NOC chain validation successful
[1683055142.099426][21371:21405] CHIP:FP: Updated fabric at index: 0x2, Node ID: 0x000000000000000A
[1683055142.099431][21371:21405] CHIP:TS: Last Known Good Time: 2023-05-02T15:13:48
[1683055142.099434][21371:21405] CHIP:TS: New proposed Last Known Good Time: 2021-01-01T00:00:00
[1683055142.099439][21371:21405] CHIP:TS: Retaining current Last Known Good Time
[1683055142.099798][21371:21405] CHIP:FP: Metadata for Fabric 0x2 persisted to storage.
[1683055142.099978][21371:21405] CHIP:TS: Committing Last Known Good Time to storage: 2023-05-02T15:13:48
[1683055142.100159][21371:21405] CHIP:CTL: Joined the fabric at index 2. Fabric ID is 0x000000000000000F (Compressed Fabric ID: FEB3387C2067A0B3)
[1683055142.100165][21371:21405] CHIP:CTL: *** Missing DeviceAttestationVerifier configuration at DeviceCommissioner init: using global default, consider passing one in CommissionerInitParams.
[1683055142.100221][21371:21405] CHIP:SPT: Setting up group data for Fabric Index 2 with Compressed Fabric ID:
[New Thread 0x7fffedeb66c0 (LWP 21406)]
[New Thread 0x7fffed6b56c0 (LWP 21407)]
[1683055142.104284][21371:21404] CHIP:BLE: BLE removing known devices.
[1683055142.104629][21371:21404] CHIP:BLE: BLE initiating scan.
[1683055142.231317][21371:21371] CHIP:DL: Chip stack locking error at 'src/system/SystemLayerImplSelect.cpp:125'. Code is unsafe/racy
[1683055142.231348][21371:21371] CHIP:-: chipDie chipDie chipDie

Thread 1 "python" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) n
Program not restarted.
(gdb) exit
A debugging session is active.

@tehampson
Copy link
Contributor

When you get into GDB after the issue happens please do bt so we can see the stack trace instead of r again. That will give us the back trace and we will see where StartTimer is called without acquiring the lock.

Also I am right now recompiling everything on a linux machine that has bluetooth so maybe I will be able to reproduce soon

@mingjiansi
Copy link
Author

When you get into GDB after the issue happens please do bt so we can see the stack trace instead of r again. That will give us the back trace and we will see where StartTimer is called without acquiring the lock.

Also I am right now recompiling everything on a linux machine that has bluetooth so maybe I will be able to reproduce soon

[1683055436.508424][23702:23707] CHIP:CTL: Setting attestation nonce to random value
[1683055436.508479][23702:23707] CHIP:CTL: Setting CSR nonce to random value
[1683055436.508491][23702:23707] CHIP:SPT: Using device attestation PAA trust store path /home/ubuntu/trust-cert/.
[1683055436.508677][23702:23707] CHIP:CTL: Generating NOC
[1683055436.508910][23702:23707] CHIP:FP: Validating NOC chain
[1683055436.509261][23702:23707] CHIP:FP: NOC chain validation successful
[1683055436.509298][23702:23707] CHIP:FP: Updated fabric at index: 0x2, Node ID: 0x000000000000000A
[1683055436.509303][23702:23707] CHIP:TS: Last Known Good Time: 2023-05-02T15:13:48
[1683055436.509306][23702:23707] CHIP:TS: New proposed Last Known Good Time: 2021-01-01T00:00:00
[1683055436.509309][23702:23707] CHIP:TS: Retaining current Last Known Good Time
[1683055436.509512][23702:23707] CHIP:FP: Metadata for Fabric 0x2 persisted to storage.
[1683055436.509642][23702:23707] CHIP:TS: Committing Last Known Good Time to storage: 2023-05-02T15:13:48
[1683055436.509771][23702:23707] CHIP:CTL: Joined the fabric at index 2. Fabric ID is 0x000000000000000F (Compressed Fabric ID: FEB3387C2067A0B3)
[1683055436.509774][23702:23707] CHIP:CTL: *** Missing DeviceAttestationVerifier configuration at DeviceCommissioner init: using global default, consider passing one in CommissionerInitParams.
[1683055436.509820][23702:23707] CHIP:SPT: Setting up group data for Fabric Index 2 with Compressed Fabric ID:
[New Thread 0x7fffedeb66c0 (LWP 23708)]
[New Thread 0x7fffed6b56c0 (LWP 23709)]
[1683055436.533600][23702:23706] CHIP:BLE: BLE removing known devices.
[1683055436.534119][23702:23706] CHIP:BLE: BLE initiating scan.
[1683055436.655038][23702:23702] CHIP:DL: Chip stack locking error at 'src/system/SystemLayerImplSelect.cpp:125'. Code is unsafe/racy
[1683055436.655064][23702:23702] CHIP:-: chipDie chipDie chipDie

Thread 1 "python" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7cb6c46 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7c9d7fc in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007fffefba980a in chipAbort() () at ../../src/lib/support/CodeUtils.h:508
#6  0x00007fffefd47615 in chipDie() () at ../../src/lib/support/CodeUtils.h:518
#7  0x00007fffefd47682 in chip::Platform::Internal::AssertChipStackLockedByCurrentThread(char const*, int)
    (file=0x7fffeffa2678 "src/system/SystemLayerImplSelect.cpp", line=125) at ../../src/platform/LockTracker.cpp:36
#8  0x00007fffefd58e05 in chip::System::LayerImplSelect::StartTimer(std::chrono::duration<unsigned int, std::ratio<1l, 1000l> >, void (*)(chip::System::Layer*, void*), void*)
    (this=0x7fffefffbc20 <chip::DeviceLayer::SystemLayerImpl()::gSystemLayerImpl>, delay=..., onComplete=0x7fffefe699d0 <chip::DeviceLayer::Internal::ChipDeviceScanner::TimerExpiredCallback(chip::System::Layer*, void*)>, appState=0x7fffd8006c00) at ../../src/system/SystemLayerImplSelect.cpp:125
#9  0x00007fffefe69909 in chip::DeviceLayer::Internal::ChipDeviceScanner::StartScan(std::chrono::duration<unsigned int, std::ratio<1l, 1000l> >)
    (this=0x7fffd8006c00, timeout=...) at ../../src/platform/Linux/bluez/ChipDeviceScanner.cpp:139
#10 0x00007fffefbcfd8f in pychip_ble_start_scanning((anonymous namespace)::PyObject*, void*, uint32_t, (anonymous namespace)::ScannerDelegateImpl::DeviceScannedCallback, (anonymous namespace)::ScannerDelegateImpl::ScanCompleteCallback, (anonymous namespace)::ScannerDelegateImpl::ScanErrorCallback)
    (context=0x7ffff42fad10, adapter=0x5555564409e0, timeoutMs=5000, scanCallback=0x7ffff7fb26d0, completeCallback=0x7ffff7fb2710, errorCallback=0x7ffff7fb2750) at ../../src/controller/python/chip/ble/LinuxImpl.cpp:133
#11 0x00007ffff625fe2e in  () at /lib/x86_64-linux-gnu/libffi.so.8
#12 0x00007ffff625c493 in  () at /lib/x86_64-linux-gnu/libffi.so.8
#13 0x00007ffff627e59d in  () at /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so
#14 0x00007ffff627dc2d in  () at /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so
#15 0x00005555556a8f5b in _PyObject_MakeTpCall ()
#16 0x00005555556a25bb in _PyEval_EvalFrameDefault ()
#17 0x00005555556b1ff9 in _PyFunction_Vectorcall ()
#18 0x000055555569c5d0 in _PyEval_EvalFrameDefault ()
#19 0x00005555556ccd71 in  ()
#20 0x000055555569c9bd in _PyEval_EvalFrameDefault ()
#21 0x0000555555698cb4 in  ()
#22 0x0000555555785266 in PyEval_EvalCode ()
#23 0x00005555557b1497 in  ()
#24 0x00005555557aa45e in  ()
#25 0x00005555557b11c6 in  ()
#26 0x00005555557b06ae in _PyRun_SimpleFileObject ()
#27 0x00005555557b03b4 in _PyRun_AnyFileObject ()
#28 0x00005555557a1a20 in Py_RunMain ()
#29 0x000055555577938b in Py_BytesMain ()
#30 0x00007ffff7c9e510 in __libc_start_call_main (main=main@entry=0x555555779350, argc=argc@entry=2, argv=argv@entry=0x7fffffffe5c8)
    at ../sysdeps/nptl/libc_start_call_main.h:58
#31 0x00007ffff7c9e5c9 in __libc_start_main_impl
     (main=0x555555779350, argc=2, argv=0x7fffffffe5c8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe5b8)
    at ../csu/libc-start.c:381
#32 0x0000555555779285 in _start ()

@tehampson
Copy link
Contributor

Sweet! We just need to add chip::DeviceLayer::PlatformMgr().LockChipStack(); and chip::DeviceLayer::PlatformMgr().UnlockChipStack(); around this call here:

https://github.com/project-chip/connectedhomeip/blob/master/src/platform/Linux/bluez/ChipDeviceScanner.cpp#L139

After that I think we will be 👌

@tehampson
Copy link
Contributor

Created a pull request #26338

Let me know if that unblocks you? I am getting a different segfault later on, but that was happening before as well, and I am not really setup to comission BLE devices so this is likely an issue on my side.

For posterity my new segfault is in (anonymous namespace)::ScannerDelegateImpl::OnScanError(CHIP_ERROR) (this=0x5555569c3490, error=...) at ../../src/controller/python/chip/ble/LinuxImpl.cpp:104

@mingjiansi
Copy link
Author

chip::DeviceLayer::PlatformMgr().LockChipStack();

Unfortunately this does not work.
I even added a few debug lines just to help to see the sequence of execution. As you can see, the unlock of the chip stack happened too early before the check started.

    chip::DeviceLayer::PlatformMgr().LockChipStack();
    ChipLogError(Ble, "============> LOCK Chip Stack <================");
    CHIP_ERROR err = chip::DeviceLayer::SystemLayer().StartTimer(timeout, TimerExpiredCallback, static_cast<void *>(this));
    chip::DeviceLayer::PlatformMgr().UnlockChipStack();
    ChipLogError(Ble, "============> UNLOCK Chip Stack <================");
GNU gdb (Ubuntu 12.1-3ubuntu2) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...
(No debugging symbols found in python)
(gdb) r
Starting program: /usr/bin/python test.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff4c0d6c0 (LWP 59069)]
[1683056450.714796][59066:59066] CHIP:CTL: Setting attestation nonce to random value
[1683056450.715366][59066:59066] CHIP:CTL: Setting CSR nonce to random value
[New Thread 0x7fffeefb86c0 (LWP 59121)]
[1683056450.716101][59066:59066] CHIP:DL: writing settings to file (/tmp/chip_counters.ini-3V7eVJ)
[1683056450.716238][59066:59066] CHIP:DL: renamed tmp file to file (/tmp/chip_counters.ini)
[1683056450.716243][59066:59066] CHIP:DL: NVS set: chip-counters/reboot-count = 10 (0xA)
[1683056450.716613][59066:59066] CHIP:DL: Got Ethernet interface: enx5cff35d579d4
[1683056450.716853][59066:59066] CHIP:DL: Found the primary Ethernet interface:enx5cff35d579d4
[1683056450.717096][59066:59066] CHIP:DL: Got WiFi interface: wlp45s0
[1683056450.718521][59066:59066] CHIP:DL: Found the primary WiFi interface:wlp45s0
Initializing persistent storage from file: ./chip-ge-stacks.bin
Loading configuration from ./chip-ge-stacks.bin...
[1683056450.719070][59066:59066] CHIP:TS: Last Known Good Time: 2023-05-02T15:38:49
[1683056450.719316][59066:59066] CHIP:FP: Fabric index 0x1 was retrieved from storage. Compressed FabricId 0xFEB3387C2067A0B3, FabricId 0x000000000000000F, NodeId 0x0000000000001F40, VendorId 0x1339
[1683056450.719406][59066:59066] CHIP:FP: Fabric index 0x2 was retrieved from storage. Compressed FabricId 0xFEB3387C2067A0B3, FabricId 0x000000000000000F, NodeId 0x000000000000000A, VendorId 0x1339
[1683056450.719671][59066:59066] CHIP:ZCL: Using ZAP configuration...
[1683056450.720406][59066:59066] CHIP:DL: Avahi client registered
[New Thread 0x7fffee7b76c0 (LWP 59122)]
Loading certificate authorities from storage...
New CertificateAuthority at index 1
Loading fabric admins from storage...
New FabricAdmin: FabricId: 0x000000000000000F, VendorId = 0x1339
Allocating new controller with CaIndex: 1, FabricId: 0x000000000000000F, NodeId: 0x000000000000000A, CatTags: []
[1683056450.743764][59066:59122] CHIP:CTL: Setting attestation nonce to random value
[1683056450.743805][59066:59122] CHIP:CTL: Setting CSR nonce to random value
[1683056450.743817][59066:59122] CHIP:SPT: Using device attestation PAA trust store path /home/ubuntu/trust-cert/.
[1683056450.743989][59066:59122] CHIP:CTL: Generating NOC
[1683056450.744223][59066:59122] CHIP:FP: Validating NOC chain
[1683056450.744570][59066:59122] CHIP:FP: NOC chain validation successful
[1683056450.744600][59066:59122] CHIP:FP: Updated fabric at index: 0x2, Node ID: 0x000000000000000A
[1683056450.744605][59066:59122] CHIP:TS: Last Known Good Time: 2023-05-02T15:38:49
[1683056450.744608][59066:59122] CHIP:TS: New proposed Last Known Good Time: 2021-01-01T00:00:00
[1683056450.744610][59066:59122] CHIP:TS: Retaining current Last Known Good Time
[1683056450.744808][59066:59122] CHIP:FP: Metadata for Fabric 0x2 persisted to storage.
[1683056450.744927][59066:59122] CHIP:TS: Committing Last Known Good Time to storage: 2023-05-02T15:38:49
[1683056450.745045][59066:59122] CHIP:CTL: Joined the fabric at index 2. Fabric ID is 0x000000000000000F (Compressed Fabric ID: FEB3387C2067A0B3)
[1683056450.745048][59066:59122] CHIP:CTL: *** Missing DeviceAttestationVerifier configuration at DeviceCommissioner init: using global default, consider passing one in CommissionerInitParams.
[1683056450.745088][59066:59122] CHIP:SPT: Setting up group data for Fabric Index 2 with Compressed Fabric ID:
[New Thread 0x7fffedeb66c0 (LWP 59123)]
[New Thread 0x7fffed6b56c0 (LWP 59124)]
[1683056450.765239][59066:59121] CHIP:BLE: BLE removing known devices.
[1683056450.765829][59066:59121] CHIP:BLE: BLE initiating scan.
[1683056450.891412][59066:59066] CHIP:BLE: ============> LOCK Chip Stack <================
[1683056450.891484][59066:59066] CHIP:BLE: ============> UNLOCK Chip Stack <================
[1683056455.902097][59066:59121] CHIP:DL: Chip stack locking error at 'src/system/SystemLayerImplSelect.cpp:172'. Code is unsafe/racy
[1683056455.902122][59066:59121] CHIP:-: chipDie chipDie chipDie

Thread 3 "gmain-matter" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffeefb86c0 (LWP 59121)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7cb6c46 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7c9d7fc in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007fffefba980a in chipAbort() () at ../../src/lib/support/CodeUtils.h:508
#6  0x00007fffefd47615 in chipDie() () at ../../src/lib/support/CodeUtils.h:518
#7  0x00007fffefd47682 in chip::Platform::Internal::AssertChipStackLockedByCurrentThread(char const*, int)
    (file=0x7fffeffa2678 "src/system/SystemLayerImplSelect.cpp", line=172) at ../../src/platform/LockTracker.cpp:36
#8  0x00007fffefd58fb4 in chip::System::LayerImplSelect::CancelTimer(void (*)(chip::System::Layer*, void*), void*)
    (this=0x7fffefffbc20 <chip::DeviceLayer::SystemLayerImpl()::gSystemLayerImpl>, onComplete=0x7fffefe69a42 <chip::DeviceLayer::Internal::ChipDeviceScanner::TimerExpiredCallback(chip::System::Layer*, void*)>, appState=0x555555fb41e0) at ../../src/system/SystemLayerImplSelect.cpp:172
#9  0x00007fffefe6955d in chip::DeviceLayer::Internal::ChipDeviceScanner::~ChipDeviceScanner() (this=0x555555fb41e0, __in_chrg=<optimized out>)
    at ../../src/platform/Linux/bluez/ChipDeviceScanner.cpp:82
#10 0x00007fffefbd024a in std::default_delete<chip::DeviceLayer::Internal::ChipDeviceScanner>::operator()(chip::DeviceLayer::Internal::ChipDeviceScanner*) const (this=0x555556246dd8, __ptr=0x555555fb41e0) at /usr/include/c++/12/bits/unique_ptr.h:95
#11 0x00007fffefbcfefa in std::unique_ptr<chip::DeviceLayer::Internal::ChipDeviceScanner, std::default_delete<chip::DeviceLayer::Internal::ChipDeviceScanner> >::~unique_ptr() (this=0x555556246dd8, __in_chrg=<optimized out>) at /usr/include/c++/12/bits/unique_ptr.h:396
#12 0x00007fffefbd08a4 in (anonymous namespace)::ScannerDelegateImpl::~ScannerDelegateImpl() (this=0x555556246dd0, __in_chrg=<optimized out>)
    at ../../src/controller/python/chip/ble/LinuxImpl.cpp:65
#13 0x00007fffefbd08d0 in (anonymous namespace)::ScannerDelegateImpl::~ScannerDelegateImpl() (this=0x555556246dd0, __in_chrg=<optimized out>)
    at ../../src/controller/python/chip/ble/LinuxImpl.cpp:65
#14 0x00007fffefbcfbf2 in (anonymous namespace)::ScannerDelegateImpl::OnScanComplete() (this=0x555556246dd0)
    at ../../src/controller/python/chip/ble/LinuxImpl.cpp:97
#15 0x00007fffefe69d97 in chip::DeviceLayer::Internal::ChipDeviceScanner::MainLoopStopScan(chip::DeviceLayer::Internal::ChipDeviceScanner*)
    (self=0x555555fb41e0) at ../../src/platform/Linux/bluez/ChipDeviceScanner.cpp:204
#16 0x00007fffefe63e08 in operator()(void*) const (__closure=0x0, userData_=0x7fffee7b6c10) at ../../src/platform/Linux/PlatformManagerImpl.cpp:298
#17 0x00007fffefe63e94 in _FUN(void*) () at ../../src/platform/Linux/PlatformManagerImpl.cpp:306
#18 0x00007fffef66b43f in g_main_context_dispatch () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#19 0x00007fffef6c03c8 in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#20 0x00007fffef66aaef in g_main_loop_run () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#21 0x00007fffefe630f7 in chip::DeviceLayer::(anonymous namespace)::GLibMainLoopThread(void*) (loop=0x5555563c9590)
    at ../../src/platform/Linux/PlatformManagerImpl.cpp:61
#22 0x00007fffef69a5c1 in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#23 0x00007ffff7d0b402 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#24 0x00007ffff7d9a590 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) r

@tehampson
Copy link
Contributor

That is getting further. Now we have a crash in src/platform/Linux/bluez/ChipDeviceScanner.cpp:82 Add DeviceLayer::PlatformMgr().LockChipStack(); and DeviceLayer::PlatformMgr().UnlockChipStack(); around that and we should be good for that one and then we can see if there is another error after that.

I have updated #26338 as well already

@mingjiansi
Copy link
Author

Created a pull request #26338

Let me know if that unblocks you? I am getting a different segfault later on, but that was happening before as well, and I am not really setup to comission BLE devices so this is likely an issue on my side.

For posterity my new segfault is in (anonymous namespace)::ScannerDelegateImpl::OnScanError(CHIP_ERROR) (this=0x5555569c3490, error=...) at ../../src/controller/python/chip/ble/LinuxImpl.cpp:104

As for this error. I have a fix for that. It is just the Python interface only passed 2 handlers, missing the scan error handler. In my branch right now, I just added a default handler to get pass that.

@mingjiansi
Copy link
Author

@tehampson I thought src/controller/python/chip/ble/LinuxImpl.cpp is where Add DeviceLayer::PlatformMgr().LockChipStack(); and DeviceLayer::PlatformMgr().UnlockChipStack();. Big issue I ran into is that there were good way for me to call the last DeviceLayer::PlatformMgr().UnlockChipStack(); Then again, I am really not the expert on this.

@tehampson
Copy link
Contributor

My understanding is that #26180 introduce check to make sure that anything that schedules work (or cancels work) has the chip stack lock. We only want to acquire this lock right before calls to schedule/cancel work. Doing so in src/controller/python/chip/ble/LinuxImpl.cpp is not what we are trying to protect against.

I am simply looking at precedent set by other fixes where they surround the StartTimer, CancelTimer, and ScheduleWork calls with the lock/unlock and not some higher level function

@tehampson
Copy link
Contributor

Keep me posted if the 4 extra lines in #26338 unblocks you. If so I will work on getting it merged (or if you want to upload the PR I can close mine). If you hit another issue please upload the stack trace

@mingjiansi
Copy link
Author

Keep me posted if the 4 extra lines in #26338 unblocks you. If so I will work on getting it merged (or if you want to upload the PR I can close mine). If you hit another issue please upload the stack trace

Doesn't seem to fix it, ran into an infinite loop of sort.
Also added a few more lines on line 82 in addition to previous.

    DeviceLayer::PlatformMgr().LockChipStack();
    ChipLogError(Ble, "============> LOCK Chip Stack from ~ChipDeviceScanner <================");

    chip::DeviceLayer::SystemLayer().CancelTimer(TimerExpiredCallback, this);
    DeviceLayer::PlatformMgr().UnlockChipStack();
    ChipLogError(Ble, "============> UNLOCK Chip Stack from ~ChipDeviceScanner <================");
mingjian@mingjian-950xdb ~/Git/GE-Matter-Controller/src $ sudo gdb --args python test.py
GNU gdb (Ubuntu 12.1-3ubuntu2) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...
(No debugging symbols found in python)
(gdb) r
Starting program: /usr/bin/python test.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff4c0d6c0 (LWP 94198)]
[1683058380.629215][94195:94195] CHIP:CTL: Setting attestation nonce to random value
[1683058380.629972][94195:94195] CHIP:CTL: Setting CSR nonce to random value
[New Thread 0x7fffeefb86c0 (LWP 94199)]
[1683058380.640635][94195:94195] CHIP:DL: writing settings to file (/tmp/chip_counters.ini-xfej7P)
[1683058380.640782][94195:94195] CHIP:DL: renamed tmp file to file (/tmp/chip_counters.ini)
[1683058380.640792][94195:94195] CHIP:DL: NVS set: chip-counters/reboot-count = 13 (0xD)
[1683058380.641382][94195:94195] CHIP:DL: Got Ethernet interface: enx5cff35d579d4
[1683058380.641697][94195:94195] CHIP:DL: Found the primary Ethernet interface:enx5cff35d579d4
[1683058380.642010][94195:94195] CHIP:DL: Got WiFi interface: wlp45s0
[1683058380.643449][94195:94195] CHIP:DL: Found the primary WiFi interface:wlp45s0
Initializing persistent storage from file: ./chip-ge-stacks.bin
Loading configuration from ./chip-ge-stacks.bin...
[1683058380.644876][94195:94195] CHIP:TS: Last Known Good Time: 2023-05-02T16:06:33
[1683058380.644894][94195:94195] CHIP:TS: Setting Last Known Good Time to firmware build time 2023-05-02T16:11:40
[1683058380.645998][94195:94195] CHIP:FP: Fabric index 0x1 was retrieved from storage. Compressed FabricId 0xFEB3387C2067A0B3, FabricId 0x000000000000000F, NodeId 0x0000000000001F40, VendorId 0x1339
[1683058380.646189][94195:94195] CHIP:FP: Fabric index 0x2 was retrieved from storage. Compressed FabricId 0xFEB3387C2067A0B3, FabricId 0x000000000000000F, NodeId 0x000000000000000A, VendorId 0x1339
[1683058380.646686][94195:94195] CHIP:ZCL: Using ZAP configuration...
[1683058380.649144][94195:94195] CHIP:DL: Avahi client registered
[New Thread 0x7fffee7b76c0 (LWP 94200)]
Loading certificate authorities from storage...
New CertificateAuthority at index 1
Loading fabric admins from storage...
New FabricAdmin: FabricId: 0x000000000000000F, VendorId = 0x1339
Allocating new controller with CaIndex: 1, FabricId: 0x000000000000000F, NodeId: 0x000000000000000A, CatTags: []
[1683058380.686548][94195:94200] CHIP:CTL: Setting attestation nonce to random value
[1683058380.686627][94195:94200] CHIP:CTL: Setting CSR nonce to random value
[1683058380.686661][94195:94200] CHIP:SPT: Using device attestation PAA trust store path /home/ubuntu/trust-cert/.
[1683058380.687159][94195:94200] CHIP:CTL: Generating NOC
[1683058380.687839][94195:94200] CHIP:FP: Validating NOC chain
[1683058380.688491][94195:94200] CHIP:FP: NOC chain validation successful
[1683058380.688553][94195:94200] CHIP:FP: Updated fabric at index: 0x2, Node ID: 0x000000000000000A
[1683058380.688561][94195:94200] CHIP:TS: Last Known Good Time: 2023-05-02T16:11:40
[1683058380.688565][94195:94200] CHIP:TS: New proposed Last Known Good Time: 2021-01-01T00:00:00
[1683058380.688568][94195:94200] CHIP:TS: Retaining current Last Known Good Time
[1683058380.689251][94195:94200] CHIP:FP: Metadata for Fabric 0x2 persisted to storage.
[1683058380.689581][94195:94200] CHIP:TS: Committing Last Known Good Time to storage: 2023-05-02T16:11:40
[1683058380.689890][94195:94200] CHIP:CTL: Joined the fabric at index 2. Fabric ID is 0x000000000000000F (Compressed Fabric ID: FEB3387C2067A0B3)
[1683058380.689900][94195:94200] CHIP:CTL: *** Missing DeviceAttestationVerifier configuration at DeviceCommissioner init: using global default, consider passing one in CommissionerInitParams.
[1683058380.689973][94195:94200] CHIP:SPT: Setting up group data for Fabric Index 2 with Compressed Fabric ID:
[New Thread 0x7fffedeb66c0 (LWP 94201)]
[New Thread 0x7fffed6b56c0 (LWP 94202)]
[1683058380.695180][94195:94199] CHIP:BLE: BLE removing known devices.
[1683058380.695477][94195:94199] CHIP:BLE: BLE initiating scan.
[1683058380.822819][94195:94195] CHIP:BLE: ============> LOCK Chip Stack from StartScan <================
[1683058380.822873][94195:94195] CHIP:BLE: ============> UNLOCK Chip Stack from StartScan <================
DeviceInfo(address=b'CB:2B:3D:39:1F:74', discriminator=2283, vendor=137, product=4921)
DeviceInfo(address=b'CB:2B:3D:39:1F:74', discriminator=2283, vendor=137, product=4921)
DeviceInfo(address=b'CB:2B:3D:39:1F:74', discriminator=2283, vendor=137, product=4921)
DeviceInfo(address=b'CB:2B:3D:39:1F:74', discriminator=2283, vendor=137, product=4921)
DeviceInfo(address=b'CB:2B:3D:39:1F:74', discriminator=2283, vendor=137, product=4921)
DeviceInfo(address=b'CB:2B:3D:39:1F:74', discriminator=2283, vendor=137, product=4921)
DeviceInfo(address=b'CB:2B:3D:39:1F:74', discriminator=2283, vendor=137, product=4921)
[Thread 0x7ffff4c0d6c0 (LWP 94198) exited]
r
bt

@tehampson
Copy link
Contributor

That looks like it ran successfully. What happens if you remove gdb and just do python test.py?

@mingjiansi
Copy link
Author

That looks like it ran successfully. What happens if you remove gdb and just do python test.py?

tried already. just infinite loop after the scanner completes. feels like some timer thread didn't exit.

@mingjiansi
Copy link
Author

fwiw, this is the branch i am working on that already fixed the scan error handler missing.
https://github.com/mingjiansi/connectedhomeip/tree/Add-Python-BLE-Scan-Error-Handler

@tehampson
Copy link
Contributor

Are you on slack?

Was this ever working for you before? Or are you only recently trying to use this? I will have to context switch for the day. But I do think we are making some progress here

@tehampson
Copy link
Contributor

Also I am pretty sure we are stuck in here.

If you change your ScanError to, it won't hang there:

    def ScanError(self, errorCode):
        self.queue.put(None)

@mingjiansi
Copy link
Author

Are you on slack?

Was this ever working for you before? Or are you only recently trying to use this? I will have to context switch for the day. But I do think we are making some progress here

I have to run as well, not sure if you can find me on slack via my email or not. mingjian.si@savant.com

@mingjiansi
Copy link
Author

Also I am pretty sure we are stuck in here.

If you change your ScanError to, it won't hang there:

    def ScanError(self, errorCode):
        self.queue.put(None)

tried it. same thing, something is not exit correctly. whole thing hangs.

@tehampson
Copy link
Contributor

Hmm... it won't let me DM you 🤷 . I guess we can just go back and forth here. I do think that your issue is that you are waiting on that queue forever, and that if you handle it in your error callback you should be unstuck. I do think those error related changes can be in a follow up PR since that is working on an adjacent but different issue

@tehampson
Copy link
Contributor

tehampson commented May 2, 2023

You can also add a timeout on the receiver.queue.get. This is working for me (note I have not added the error callback plumbing yet)

    timeout = timeoutMs/1000 +1
    while True:
        data = receiver.queue.get(timeout=timeout)
        if not data:
            break
        yield data

Edit: Finger slipped I did not mean to close this issue, also I was not done writing my comment. I was going to add that if that timeout doesn't work for you I think we will need to use PDB to understand why

@tehampson tehampson reopened this May 2, 2023
@mingjiansi
Copy link
Author

@tehampson I have implemented the queue change that you have suggested stilled ended up with hanging same issue. By using the dumbest method available to me, I have added a bunch of debug console lines to see the path of execution I have made a discovery. From what I can tell it looks like something is wrong with the OnComplete handler. at this point. https://github.com/mingjiansi/connectedhomeip/blob/93a5e8093693fa102c662effdf17903b032dc316/src/platform/Linux/bluez/ChipDeviceScanner.cpp#L245

my observation is as follow.
After the scan is completed, ChipDeviceScanner::TimerExpiredCallback is triggered thus eventually flows to StopScan then MainLoopStopScan in the same file. MainLoopStopScan invokes delegate->OnScanComplete();.
At this points, it eventually goes to the src/controller/python/chip/ble/LinuxImpl.cpp's OnScanComplete where delete this eventually triggers the destructor for ChipDeviceScanner. But for whatever reason, delegate->OnScanComplete(); never completed after this.

I guess the problem currently lies in the OnScanComplete handler, but nothing really is jumping at me.

@tehampson
Copy link
Contributor

tehampson commented May 3, 2023

I have implemented the queue change that you have suggested stilled ended up with hanging same issue.

@mingjiansi , weird I am not reproducing the issue you are seeing. You are now running your script without gdb right? When you use PDB, and add breakpoint() are your callbacks getting called? I am trying to understand if there is something on the python side that is hanging the call to delegate->OnScanComplete(); for you

Also I am now context switched into this issue again and should be more responsive to this issue for the next 1-2 hours

@tehampson
Copy link
Contributor

Hm... Using GDB I think I can see what you mean by delegate->OnScanComplete(); never returns. I think I can debug myself from here

@mingjiansi
Copy link
Author

Hm... Using GDB I think I can see what you mean by delegate->OnScanComplete(); never returns. I think I can debug myself from here

great. at this point, it almost seems like there is a deadlock somewhere.

@tehampson
Copy link
Contributor

tehampson commented May 3, 2023

it almost seems like there is a deadlock somewhere.

Because there is!

Thread running main event loop:

#0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fffedaa9c40) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7fffedaa9c40, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true)
    at ./nptl/futex-internal.c:87
#2  0x00007ffff7d25d9b in __GI___futex_abstimed_wait_cancelable64
    (futex_word=futex_word@entry=0x7fffedaa9c40, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007ffff7d283f8 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7fffef065fc0 <chip::DeviceLayer::PlatformManagerImpl::sInstance+384>, cond=0x7fffedaa9c18) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7fffedaa9c18, mutex=0x7fffef065fc0 <chip::DeviceLayer::PlatformManagerImpl::sInstance+384>) at ./nptl/pthread_cond_wait.c:618
#5  0x00007fffeeec2e47 in std::condition_variable::wait<chip::DeviceLayer::PlatformManagerImpl::_GLibMatterContextInvokeSync(CHIP_ERROR (*)(void*), void*)::<lambda()> >(std::unique_lock<std::mutex> &, struct {...}) (this=0x7fffedaa9c18, __lock=..., __p=...) at /usr/include/c++/12/condition_variable:102
#6  0x00007fffeeec2b43 in chip::DeviceLayer::PlatformManagerImpl::_GLibMatterContextInvokeSync(chip::ChipError (*)(void*), void*)
    (this=0x7fffef065e40 <chip::DeviceLayer::PlatformManagerImpl::sInstance>, func=0x7fffeeec8c0c <chip::DeviceLayer::Internal::ChipDeviceScanner::MainLoopStopScan(chip::DeviceLayer::Internal::ChipDeviceScanner*)>, userData=0x7fffd400eba0) at ../../src/platform/Linux/PlatformManagerImpl.cpp:311
#7  0x00007fffeeec95f4 in chip::DeviceLayer::PlatformManagerImpl::GLibMatterContextInvokeSync<chip::DeviceLayer::Internal::ChipDeviceScanner>(chip::ChipError (*)(chip::DeviceLayer::Internal::ChipDeviceScanner*), chip::DeviceLayer::Internal::ChipDeviceScanner*)
    (this=0x7fffef065e40 <chip::DeviceLayer::PlatformManagerImpl::sInstance>, func=0x7fffeeec8c0c <chip::DeviceLayer::Internal::ChipDeviceScanner::MainLoopStopScan(chip::DeviceLayer::Internal::ChipDeviceScanner*)>, userData=0x7fffd400eba0) at ../../src/platform/Linux/PlatformManagerImpl.h:72
#8  0x00007fffeeec8b86 in chip::DeviceLayer::Internal::ChipDeviceScanner::StopScan() (this=0x7fffd400eba0) at ../../src/platform/Linux/bluez/ChipDeviceScanner.cpp:181
#9  0x00007fffeeec8a4f in chip::DeviceLayer::Internal::ChipDeviceScanner::TimerExpiredCallback(chip::System::Layer*, void*)
    (layer=0x7fffef062e60 <chip::DeviceLayer::SystemLayerImpl()::gSystemLayerImpl>, appState=0x7fffd400eba0) at ../../src/platform/Linux/bluez/ChipDeviceScanner.cpp:159
#10 0x00007fffeedac7a5 in chip::System::TimerData::Callback::Invoke() const (this=0x7fffedaa9dc0) at ../../src/system/SystemTimer.h:61
#11 0x00007fffeedadf29 in chip::System::TimerPool<chip::System::TimerList::Node>::Invoke(chip::System::TimerList::Node*)
    (this=0x7fffef063468 <chip::DeviceLayer::SystemLayerImpl()::gSystemLayerImpl+1544>, timer=0x5555569bfab0) at ../../src/system/SystemTimer.h:224
#12 0x00007fffeedada2c in chip::System::LayerImplSelect::HandleEvents() (this=0x7fffef062e60 <chip::DeviceLayer::SystemLayerImpl()::gSystemLayerImpl>) at ../../src/system/SystemLayerImplSelect.cpp:514
#13 0x00007fffeeec1520 in chip::DeviceLayer::Internal::GenericPlatformManagerImpl_POSIX<chip::DeviceLayer::PlatformManagerImpl>::_RunEventLoop()
    (this=0x7fffef065e50 <chip::DeviceLayer::PlatformManagerImpl::sInstance+16>) at ../../src/include/platform/internal/GenericPlatformManagerImpl_POSIX.ipp:181

Thread running delete this

#0  futex_wait (private=0, expected=2, futex_word=0x7fffef065e60 <chip::DeviceLayer::PlatformManagerImpl::sInstance+32>) at ../sysdeps/nptl/futex-internal.h:146
146	../sysdeps/nptl/futex-internal.h: No such file or directory.
(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x7fffef065e60 <chip::DeviceLayer::PlatformManagerImpl::sInstance+32>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x7fffef065e60 <chip::DeviceLayer::PlatformManagerImpl::sInstance+32>, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007ffff7d2c262 in lll_mutex_lock_optimized (mutex=0x7fffef065e60 <chip::DeviceLayer::PlatformManagerImpl::sInstance+32>) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x7fffef065e60 <chip::DeviceLayer::PlatformManagerImpl::sInstance+32>) at ./nptl/pthread_mutex_lock.c:93
#4  0x00007fffeeec128a in chip::DeviceLayer::Internal::GenericPlatformManagerImpl_POSIX<chip::DeviceLayer::PlatformManagerImpl>::_LockChipStack()
    (this=0x7fffef065e50 <chip::DeviceLayer::PlatformManagerImpl::sInstance+16>) at ../../src/include/platform/internal/GenericPlatformManagerImpl_POSIX.ipp:76
#5  0x00007fffeec41a02 in chip::DeviceLayer::PlatformManager::LockChipStack() (this=0x7fffef065e40 <chip::DeviceLayer::PlatformManagerImpl::sInstance>) at ../../src/include/platform/PlatformManager.h:453
#6  0x00007fffeeec8520 in chip::DeviceLayer::Internal::ChipDeviceScanner::~ChipDeviceScanner() (this=0x7fffd400eba0, __in_chrg=<optimized out>) at ../../src/platform/Linux/bluez/ChipDeviceScanner.cpp:82
#7  0x00007fffeec427ec in std::default_delete<chip::DeviceLayer::Internal::ChipDeviceScanner>::operator()(chip::DeviceLayer::Internal::ChipDeviceScanner*) const (this=0x5555569b76f8, __ptr=0x7fffd400eba0)
    at /usr/include/c++/12/bits/unique_ptr.h:95
#8  0x00007fffeec424cc in std::unique_ptr<chip::DeviceLayer::Internal::ChipDeviceScanner, std::default_delete<chip::DeviceLayer::Internal::ChipDeviceScanner> >::~unique_ptr()
    (this=0x5555569b76f8, __in_chrg=<optimized out>) at /usr/include/c++/12/bits/unique_ptr.h:396
#9  0x00007fffeec42de2 in (anonymous namespace)::ScannerDelegateImpl::~ScannerDelegateImpl() (this=0x5555569b76f0, __in_chrg=<optimized out>) at ../../src/controller/python/chip/ble/LinuxImpl.cpp:65
#10 0x00007fffeec42e0a in (anonymous namespace)::ScannerDelegateImpl::~ScannerDelegateImpl() (this=0x5555569b76f0, __in_chrg=<optimized out>) at ../../src/controller/python/chip/ble/LinuxImpl.cpp:65
#11 0x00007fffeec42206 in (anonymous namespace)::ScannerDelegateImpl::OnScanComplete() (this=0x5555569b76f0) at ../../src/controller/python/chip/ble/LinuxImpl.cpp:97
#12 0x00007fffeeec8cb4 in chip::DeviceLayer::Internal::ChipDeviceScanner::MainLoopStopScan(chip::DeviceLayer::Internal::ChipDeviceScanner*) (self=0x7fffd400eba0)
    at ../../src/platform/Linux/bluez/ChipDeviceScanner.cpp:204
#13 0x00007fffeeec2995 in operator()(void*) const (__closure=0x0, userData_=0x7fffedaa9bf0) at ../../src/platform/Linux/PlatformManagerImpl.cpp:298
#14 0x00007fffeeec2a09 in _FUN(void*) () at ../../src/platform/Linux/PlatformManagerImpl.cpp:306
#15 0x00007fffef0bc67f in g_main_context_dispatch () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#16 0x00007fffef0bca38 in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#17 0x00007fffef0bccef in g_main_loop_run () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#18 0x00007fffeeec1cdb in chip::DeviceLayer::(anonymous namespace)::GLibMainLoopThread(void*) (loop=0x555555fb27e0) at ../../src/platform/Linux/PlatformManagerImpl.cpp:61
#19 0x00007fffef0e6d0d in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#20 0x00007ffff7d28fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#21 0x00007ffff7da966c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Lock is acquired in these two spots:

  • src/include/platform/internal/GenericPlatformManagerImpl_POSIX.ipp:181 // actually line 179
  • src/platform/Linux/bluez/ChipDeviceScanner.cpp:82 // This is the lock we introduced before cancel

Posting findings for now, but still seeing what we can do about this and why it is happening

@mingjiansi
Copy link
Author

@tehampson Thank you. My god, I have no idea how you are deciphering logs like that. orz

@tehampson
Copy link
Contributor

Deleted comment because it actually did not work

@mingjiansi
Copy link
Author

Deleted comment because it actually did not work

oh good. i thought my reverse midas touch might have ruined it.

@tehampson
Copy link
Contributor

@mingjiansi Do you mind if I just also add in your ScanErrorCallback stuff into my PR. I think I have something, just figured it makes sense to fix all crashes include the one introduced with #24873 that you fixed

@mingjiansi
Copy link
Author

@mingjiansi Do you mind if I just also add in your ScanErrorCallback stuff into my PR. I think I have something, just figured it makes sense to fix all crashes include the one introduced with #24873 that you fixed

Please do.

@tehampson
Copy link
Contributor

I have updated #26338 to what I have right now. There is still one remaining hang during chip-repl shutdown. But let me know if that gets you up and running for the most part

@tehampson
Copy link
Contributor

Okay so #26338 now "works". I say that in quotes since there are still some paths that will crash, just not the path that @mingjiansi you are doing right now. I will figure this out, but for now just sharing the state of this before I log off for the evening.

@mingjiansi
Copy link
Author

I am heading out of the door now, I will give this a try first thing tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants