Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread accessory (Nordic board) issue - Randomly OnOff Toggle throws Timeout error after successful pairing #17477

Closed
phonnakasturi-apple opened this issue Apr 19, 2022 · 13 comments
Labels
stale Stale issue or PR

Comments

@phonnakasturi-apple
Copy link

phonnakasturi-apple commented Apr 19, 2022

Using the SHA b4e4257

Upon successful pairing of thread accessory (nordic board), toggle command throws Timeout error randomly.
This issue was encountered twice.

Steps to reproduce;

  1. Compile and flash thread accessory (nordic board)

  2. Compile CHIPTool cli as follows
    cd connectedhomeip
    source scripts/bootstrap.sh
    source scripts/activate.sh
    gn gen out/debug --args='chip_mdns="platform" chip_crypto="mbedtls" chip_inet_config_enable_ipv4=false'
    ninja -C out/debug

sudo ./out/debug/chip-tool pairing ble-thread node-id hex:operationalDataset setup-pin-code discriminator --pat-trust-store-path
sudo ./out/debug/chip-tool onoff toggle node-id 1

Following is the error;
sudo ./out/debug/chip-tool onoff toggle 2 1
[1650319736780] [35416:4677474] CHIP: [DL] KVS will be written to: file:///tmp/chip_kvs
[1650319736787] [35416:4677474] CHIP: [DL] _Init
[1650319736787] [35416:4677474] CHIP: [IN] UDP::Init bind&listen port=0
[1650319736788] [35416:4677474] CHIP: [IN] UDP::Init bound to port=57846
[1650319736788] [35416:4677474] CHIP: [IN] BLEBase::Init - setting/overriding transport
[1650319736788] [35416:4677474] CHIP: [IN] TransportMgr initialized
[1650319736788] [35416:4677474] CHIP: [DIS] Init fabric pairing table with server storage
[1650319736788] [35416:4677474] CHIP: [IN] Loading from storage for fabric index 0x1
[1650319736788] [35416:4677474] CHIP: [IN] Loading from storage for fabric index 0x2
[1650319736788] [35416:4677474] CHIP: [IN] Loading from storage for fabric index 0x3
[1650319736788] [35416:4677474] CHIP: [DIS] Add fabric pairing table delegate
[1650319736789] [35416:4677474] CHIP: [ZCL] Using ZAP configuration...
[1650319736789] [35416:4677474] CHIP: [CTL] System State Initialized...
[1650319736789] [35416:4677474] CHIP: [CTL] Setting attestation nonce to random value
[1650319736789] [35416:4677474] CHIP: [CTL] Setting CSR nonce to random value
[1650319736789] [35416:4677474] CHIP: [CTL] Setting attestation nonce to random value
[1650319736789] [35416:4677474] CHIP: [CTL] Setting CSR nonce to random value
[1650319736792] [35416:4677474] CHIP: [CTL] Generating NOC
[1650319736794] [35416:4677474] CHIP: [DIS] Verifying the received credentials
[1650319736805] [35416:4677474] CHIP: [DIS] Added new fabric at index: 0x1, Initialized: 1
[1650319736805] [35416:4677474] CHIP: [DIS] Assigned compressed fabric ID: 0x732DA42E801C7C2D, node ID: 0x000000000001B669
[1650319736805] [35416:4677474] CHIP: [CTL] Joined the fabric at index 1. Compressed fabric ID is: 0x732DA42E801C7C2D
[1650319736805] [35416:4677474] CHIP: [CTL] Setting attestation nonce to random value
[1650319736805] [35416:4677474] CHIP: [CTL] Setting CSR nonce to random value
[1650319736807] [35416:4677474] CHIP: [CTL] Generating NOC
[1650319736810] [35416:4677474] CHIP: [DIS] Verifying the received credentials
[1650319736820] [35416:4677474] CHIP: [DIS] Added new fabric at index: 0x2, Initialized: 1
[1650319736820] [35416:4677474] CHIP: [DIS] Assigned compressed fabric ID: 0xE6E75F40B8EDD62F, node ID: 0x000000000001B669
[1650319736820] [35416:4677474] CHIP: [CTL] Joined the fabric at index 2. Compressed fabric ID is: 0xE6E75F40B8EDD62F
[1650319736820] [35416:4677474] CHIP: [CTL] Setting attestation nonce to random value
[1650319736820] [35416:4677474] CHIP: [CTL] Setting CSR nonce to random value
[1650319736823] [35416:4677474] CHIP: [CTL] Generating NOC
[1650319736825] [35416:4677474] CHIP: [DIS] Verifying the received credentials
[1650319736835] [35416:4677474] CHIP: [DIS] Added new fabric at index: 0x3, Initialized: 1
[1650319736835] [35416:4677474] CHIP: [DIS] Assigned compressed fabric ID: 0xD93825D0690DFEF3, node ID: 0x000000000001B669
[1650319736835] [35416:4677474] CHIP: [CTL] Joined the fabric at index 3. Compressed fabric ID is: 0xD93825D0690DFEF3
[1650319736846] [35416:4677482] CHIP: [DL] _OnPlatformEvent
[1650319736846] [35416:4677482] CHIP: [TOO] Sending command to node 0x2
[1650319736846] [35416:4677482] CHIP: [CSM] FindOrEstablishSession: PeerId = 732DA42E801C7C2D:0000000000000002
[1650319736846] [35416:4677482] CHIP: [CSM] FindOrEstablishSession: No existing OperationalDeviceProxy instance found
[1650319736846] [35416:4677482] CHIP: [DIS] Resolving 732DA42E801C7C2D:0000000000000002 ...
[1650319736846] [35416:4677482] CHIP: [CTL] Resolve type=_matter._tcp name=732DA42E801C7C2D-0000000000000002
[1650319737049] [35416:4677482] CHIP: [DIS] Checking node lookup status after 202 ms
[1650319746847] [35416:4677474] CHIP: [CTL] Shutting down the commissioner
[1650319746847] [35416:4677474] CHIP: [CTL] Shutting down the controller
[1650319746847] [35416:4677474] CHIP: [CTL] Shutting down the commissioner
[1650319746847] [35416:4677474] CHIP: [CTL] Shutting down the controller
[1650319746847] [35416:4677474] CHIP: [DIS] Cancelling incomplete address resolution as device is being deleted.
[1650319746847] [35416:4677474] CHIP: [DIS] Discovery does not require any more timeouts
[1650319746848] [35416:4677474] CHIP: [IN] Expiring all connections for fabric 1!!
[1650319746848] [35416:4677474] CHIP: [CTL] Shutting down the commissioner
[1650319746848] [35416:4677474] CHIP: [CTL] Shutting down the controller
[1650319746848] [35416:4677474] CHIP: [IN] Expiring all connections for fabric 2!!
[1650319746848] [35416:4677474] CHIP: [CTL] Shutting down the commissioner
[1650319746848] [35416:4677474] CHIP: [CTL] Shutting down the controller
[1650319746848] [35416:4677474] CHIP: [IN] Expiring all connections for fabric 3!!
[1650319746848] [35416:4677474] CHIP: [CTL] Shutting down the System State, this will teardown the CHIP Stack
[1650319746848] [35416:4677474] CHIP: [DMG] IM WH moving to [Uninitialized]
[1650319746848] [35416:4677474] CHIP: [DMG] IM WH moving to [Uninitialized]
[1650319746848] [35416:4677474] CHIP: [DMG] IM WH moving to [Uninitialized]
[1650319746848] [35416:4677474] CHIP: [DMG] IM WH moving to [Uninitialized]
[1650319746848] [35416:4677474] CHIP: [DMG] All ReadHandler-s are clean, clear GlobalDirtySet
[1650319746848] [35416:4677474] CHIP: [BLE] CancelConnection
[1650319746848] [35416:4677474] CHIP: [DL] Inet Layer shutdown
[1650319746848] [35416:4677474] CHIP: [DL] BLE shutdown
[1650319746848] [35416:4677474] CHIP: [DL] System Layer shutdown
[1650319746848] [35416:4677474] CHIP: [TOO] Run command failure: ../../examples/chip-tool/commands/common/CHIPCommand.cpp:390: CHIP Error 0x00000032: Timeout

Attaching the complete log file from CHIPTool cli :

ThreadAccessoryToggleErrorLog.txt

@phonnakasturi-apple phonnakasturi-apple changed the title Thread accessory (Nordic board) issue - Randomly OnOff Toggle throws Timeout error upon successful pairing Thread accessory (Nordic board) issue - Randomly OnOff Toggle throws Timeout error after successful pairing Apr 19, 2022
@bzbarsky-apple
Copy link
Contributor

bzbarsky-apple commented Apr 19, 2022

That looks like we hit the 10-second command timeout before we ever managed to resolve the operational identifier and find the IP/port for the node.

@phonnakasturi-apple would it be possible to run log stream --line-buffered --system --level=info --filter 'process:"mDNSResponder"' in parallel with the above steps, with the debugging profile from https://developer.apple.com/bug-reporting/profiles-and-logs/?name=dns installed, so we can see what is happening on the mdns level here, and attach both the mDNSResponder log and the corresponding client failure log (so we can correlate the timestamps)?

@phonnakasturi-apple
Copy link
Author

phonnakasturi-apple commented Apr 20, 2022

@bzbarsky-apple
Copy link
Contributor

@phonnakasturi-apple That looks like just the mdnsresponder log....

@phonnakasturi-apple
Copy link
Author

Updated with the log files

@bzbarsky-apple
Copy link
Contributor

Thank you. So looking at ThreadAccessoryToggleErrorCLI-Log.txt I see:

  1. sudo ./out/debug/chip-tool pairing ble-thread 1 which fails to actually commission the device:
    [1650490471954] [25852:6806811] CHIP: [CTL] Successfully finished commissioning step 'SendNOC'
    [1650490471954] [25852:6806811] CHIP: [CTL] Required network information not provided in commissioning parameters
    [1650490471954] [25852:6806811] CHIP: [CTL] Parameters supplied: wifi (no) thread (yes)
    [1650490471954] [25852:6806811] CHIP: [CTL] Device supports: wifi (yes) thread(no)
    [1650490471954] [25852:6806811] CHIP: [CTL] Going from commissioning step 'SendNOC' with lastErr = '../../src/controller/AutoCommissioner.cpp:203: CHIP Error 0x0000002F: Invalid argument' -> 'Cleanup'
    
    but might look like it successfully paired because it does not obviously throw any errors?
  2. Two attempts to toggle node id 1 which time out, because the device is not actually commissioned and not on the network, so operational lookup fails.
  3. sudo ./out/debug/chip-tool pairing ble-thread 11 which succeeds.
  4. Two attempts to toggle node id 11 which succeed.
  5. sudo ./out/debug/chip-tool pairing ble-thread 12 which succeeds.
  6. Two attempts to toggle node id 12 which succeed.
  7. sudo ./out/debug/chip-tool pairing ble-thread 13 which succeeds
  8. Two attempts to toggle node id 13 which succeed.
  9. sudo ./out/debug/chip-tool pairing ble-thread 14 which succeeds
  10. Two attempts to toggle node id 14 which succeed.

etc up through node id 21, all of which succeed.

@bzbarsky-apple
Copy link
Contributor

bzbarsky-apple commented Apr 21, 2022

Looking at the first log in this issue, it has a similar thing going on. It does not include the actual pairing command that was run, but then it has:

[1650319729543] [35348:4676903] CHIP: [CTL] Successfully finished commissioning step 'SendNOC'
[1650319729543] [35348:4676903] CHIP: [CTL] Required network information not provided in commissioning parameters
[1650319729543] [35348:4676903] CHIP: [CTL] Parameters supplied: wifi (no) thread (yes)
[1650319729543] [35348:4676903] CHIP: [CTL] Device supports: wifi (yes) thread(no)
[1650319729543] [35348:4676903] CHIP: [CTL] Going from commissioning step 'SendNOC' with lastErr = '../../src/controller/AutoCommissioner.cpp:203: CHIP Error 0x0000002F: Invalid argument' -> 'Cleanup'

And indeed, if I look at the actual reads of the feature map in both those logs, I see:

[1650319723603] [35348:4676903] CHIP: [DMG] 					Endpoint = 0x0,
[1650319723603] [35348:4676903] CHIP: [DMG] 					Cluster = 0x31,
[1650319723603] [35348:4676903] CHIP: [DMG] 					Attribute = 0x0000_FFFC,
[1650319723603] [35348:4676903] CHIP: [DMG] 				}
[1650319723603] [35348:4676903] CHIP: [DMG] 					
[1650319723603] [35348:4676903] CHIP: [DMG] 					Data = 1, 

for the failing cases and

[1650490736858] [26235:6813385] CHIP: [DMG] 					Endpoint = 0x0,
[1650490736858] [26235:6813385] CHIP: [DMG] 					Cluster = 0x31,
[1650490736858] [26235:6813385] CHIP: [DMG] 					Attribute = 0x0000_FFFC,
[1650490736858] [26235:6813385] CHIP: [DMG] 				}
[1650490736858] [26235:6813385] CHIP: [DMG] 					
[1650490736858] [26235:6813385] CHIP: [DMG] 					Data = 2, 

for the success cases, which corresponds to the WI and TH bits in the feature map.

@bzbarsky-apple
Copy link
Contributor

@phonnakasturi-apple which exact example app (i.e. which path in the tree), built from which SHA is running on the Nordic board? Is it examples/all-clusters-app/nrfconnect or something else?

And is it possible that there are other BLE things around (WiFi devices) using the same discriminator and passcode and we just picked up the wrong one as a result?

@phonnakasturi-apple
Copy link
Author

Yes its from examples/all-clusters-app/nrfconnect

and also other tests work fine with the same discriminator and passcode

@bzbarsky-apple
Copy link
Contributor

@phonnakasturi-apple My point is this: is it possible that we are talking to the wrong device during commissioning because there are multiple devices all powered on at once using the same discriminator?

Either that's going on or the Thread device randomly claims to be a WiFi device.

We really need to figure out which of those is going on, because if it's the latter we need to figure out why, and if it's the former trying to analyze the device code is a waste of time. I took a quick look at examples/all-clusters-app/nrfconnect and I really don't see how it would end up with a wifi driver offhand....

Is there any way to confirm whether when we think we're pairing the thread device and it fails we're actually in fact talking to the thread device?

@phonnakasturi-apple
Copy link
Author

@bzbarsky-apple I can confirm that when testing with thread device, no other devices are connected (wifi devices)

@bzbarsky-apple
Copy link
Contributor

@phonnakasturi-apple So specifically, during a run that shows the failure the device is logging things as we go through the commissioning flow where it tells us it is in fact a wifi device?

@stale
Copy link

stale bot commented Oct 19, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale Stale issue or PR label Oct 19, 2022
@stale
Copy link

stale bot commented Oct 28, 2022

This stale issue has been automatically closed. Thank you for your contributions.

@stale stale bot closed this as completed Oct 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale Stale issue or PR
Projects
None yet
Development

No branches or pull requests

2 participants