Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Out of order pairing steps when using HomeSpan (async race condition?) #343

Open
phybros opened this issue Nov 11, 2023 · 15 comments
Open

Comments

@phybros
Copy link

phybros commented Nov 11, 2023

HomeSpan is a library for ESP32 devices to enable them to speak HomeKit. After a huge amount of tests, it seems that when pairing these devices with Home Assistant via the HomeKit Device integration, pairing intermittently fails in a variety of ways.

  • Sometimes Home Assistant starts pairing and then complains that the device is already paired
  • Sometimes it all works perfectly
  • Sometimes Home Assistant fails on step 04

I've included a massive amount of logs in my original HomeSpan bug report, but the description of what's happening from the devs makes it sound like there's some sort of race condition in aiohomekit which is causing the pairing steps to happen in a random order (nondeterministically).

The HomeSpan issue link is here: HomeSpan/HomeSpan#680

but the important part I think is here:

Thanks -- these details are helpful, though I'm not sure there is much I can do. As you can see from your first log output, the first thing HA is doing is asking for a pair-verify. That makes no sense - it first needs to set up pairing before it can open a secure paired channel. However, this won't cause a problem for HomeSpan as it just reports the error and waits for the next input. Based on your log file, HA does send the correct pair-setup request, but waits for over a minute before sending. I have no idea why HA would wait so long. As you can see, it correctly send the 3 steps for pairing in the correct order (M1, M3, and then M5) and the device properly pairs.

However, 10 seconds later HA seems to open a new connection and tries to pair again. Since the device is already paired, HomeKit responds with an error in accordance with the HAP protocol. Next, HA tries to create a secure connection using the steps for a pair-verify. Since the device is paired, this works and the verification completes. I do not know why HA does not think the device is paired since it has gone as far to open a secure connection.

As you noted, the second log file shows a correct pairing and verification as HA sent all the right steps in the correct order.

In the third log file, HA seems to be trying to verify a secure connection before it has finished all the pairing steps. HAP does not specify what is supposed to happen if a device interrupts a pair-setup with a misplaced pair-verify. I coded HomeSpan to reset the pairing procedure whenever that occurs so things can start fresh, but it looks like HA tries to continue the pairing setup by sending an M3 (which is the second step of pairing) --- by that time HomeSpan has already reset the pairing sequence and is expected to see M1 (which is the first step).

Based on the above logs, it almost seems as if HA is running two parallel processes to try and pair/verify, such that depending on the exact timing, if one process completes everything works, but if they interfere, the behavior is indeterminant.

Could this be a bug in aiohomekit? Or maybe the Home Assistant integration itself? HomeSpan devices always pair flawlessly with "real" HomeKit controllers (e.g. via the Home app) and claim to be fully compliant with the spec, so I'm not sure what's going on

@Jc2k
Copy link
Owner

Jc2k commented Nov 11, 2023

It sounds like pairing worked once and then you reset the esp but not the home assistant integration. So the existing pairing in ha is trying to establish a connection, but now you are trying to pair it again. So the device sees the results of 2 concurrent tasks - one trying to use the previous (stale) pairing, and one trying to make a new one.

@phybros
Copy link
Author

phybros commented Nov 11, 2023 via email

@phybros
Copy link
Author

phybros commented Nov 11, 2023

Incidentally I did also try it on a completely different instance of HA on a different network at a different house and got the same issue the very first time trying to pair it.

Just to be sure, what are the steps you would suggest going through to get the HA integration to stop trying to contact the device after removing it from the integrations page?

@Jc2k
Copy link
Owner

Jc2k commented Nov 11, 2023

The pair verify code path only happens when there is a pairing. It loads secrets from a config entry, and there is no config entry until pairing has happened.

It would probably fill your logs with exceptions if it tried.

That should mean that it'd be impossible to see pair verifys from ha if you removed the integration and then restarted ha to make sure it was gone. The restart shouldn't be needed but it's possible you've discovered an edge case where a faulty pairing can't be cleanly unloaded.

@Jc2k
Copy link
Owner

Jc2k commented Nov 11, 2023

We know there are ~28k installs of this integration, and this is the first time I can remember this behaviour coming up, and you can recreate it in 2 distinct environments.

That makes me wonder if there is an environmental variable (eg we have had trouble with mdns repeaters in the past, and that's how paired/unpaired status is relayed, rather than via the api). We might want to examine logs from the zeroconf integration to make sure it's getting the right data there.

Duplicate device ids could also be a problem and cause this kind of confusion. Hopefully they are automated by homespan and fully random so I wouldn't expect that to be a problem.

@phybros
Copy link
Author

phybros commented Nov 11, 2023

Thanks for all this info. I also suspected network weirdness, which is why I tried it somewhere else with a super normal wifi/networking situation. And now that I'm thinking back maybe it did work fine the very first time in that environment and then had issues subsequent times.... hmm. I'll go away and try to do a more scientific set of tests and let you know!

@phybros
Copy link
Author

phybros commented Nov 11, 2023

I tried just un-pairing and re-pairing from within HA a few times. When unpairing, it all works well, the HomeSpan devices sets itself back to unpaired etc, the integration disappears from HA. The device is immediately re-discovered by HA and I can re-pair no problem (MOST of the time), however it does put two things into the HA logs:

Logger: homeassistant.components.homekit_controller.connection
Source: components/homekit_controller/connection.py:782
Integration: HomeKit Device (documentation, issues)
First occurred: 2:22:31 PM (1 occurrences)
Last logged: 2:22:31 PM

HomeKit device update skipped as previous poll still in flight: CF:97:67:3A:BC:A6

and

Logger: aiohomekit.utils
Source: /usr/local/lib/python3.11/site-packages/aiohomekit/utils.py:45
First occurred: 2:22:35 PM (1 occurrences)
Last logged: 2:22:35 PM

Failure running background task: Task-4125
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/utils.py", line 40, in _handle_task_result
    task.result()
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/controller/ip/connection.py", line 590, in _reconnect
    return await self._connect_once()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/controller/ip/connection.py", line 698, in _connect_once
    request, expected = state_machine.send(response)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/protocol/__init__.py", line 596, in get_session_keys
    handle_state_step(response_tlv, TLV.M4)
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/protocol/__init__.py", line 90, in handle_state_step
    error_handler(tlv_dict[TLV.kTLVType_Error], f"step {expected_state}")
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/protocol/__init__.py", line 64, in error_handler
    raise AuthenticationError(stage)
aiohomekit.exceptions.AuthenticationError: step bytearray(b'\x04')

But as I said, the device is configured fine, and everything works as it should MOST of the time. Occasionally it will give the step bytearray(b'\x04') but if I try again 2 or 3 more times, getting that error in the UI each time, it eventually works

One intersting thing I noticed while looking at HomeSpan's debug logs is that even after I un-pair the device, and the integration vanishes from HA, the device is still receiving HTTP requests periodically like this:

2023-11-11 14:34:04 --> ** Client #12 Connected: (87 sec) 192.168.86.201 on Socket 3/16
2023-11-11 14:34:04 --> 
2023-11-11 14:34:04 --> <<<<<<<<< 192.168.86.201 <<<<<<<<<
2023-11-11 14:34:04 --> POST /pair-verify HTTP/1.1
2023-11-11 14:34:04 --> Host: 192.168.86.46
2023-11-11 14:34:04 --> Content-Length: 37
2023-11-11 14:34:04 --> Content-Type: application/pairing+tlv8
2023-11-11 14:34:04 --> ------------ END BODY! ------------
2023-11-11 14:34:04 --> STATE(1) 01
2023-11-11 14:34:04 --> PUBKEY(32) 7D278B1D4A959D2C111F91B1D4ABB04220BCF7CFA38E29A0B14F16BA70DBBE59
2023-11-11 14:34:04 --> -------
2023-11-11 14:34:04 --> ----- END TLVS! ------------
2023-11-11 14:34:04 --> In Pair Verify #12 (192.168.86.201)...
2023-11-11 14:34:04 --> *** ERROR: Device not yet paired!
2023-11-11 14:34:04 --> 
2023-11-11 14:34:04 --> 
2023-11-11 14:34:04 --> >>>>>>>>>> 192.168.86.201 >>>>>>>>>>
2023-11-11 14:34:04 --> HTTP/1.1 200 OK
2023-11-11 14:34:04 --> Content-Type: application/pairing+tlv8
2023-11-11 14:34:04 --> Content-Length: 6
2023-11-11 14:34:04 --> 
2023-11-11 14:34:04 --> STATE(1) 02
2023-11-11 14:34:04 --> ERROR(1) 01
2023-11-11 14:34:04 --> ------------ SENT! --------------

Almost like HA is still doing things in the background even though the integration is gone. This still happens after I reboot the ESP device also. It wakes up and starts getting these requests. The IP ending 201 is my home assistant server BTW.

If I restart HA in this situation, the HTTP requests stop, and they don't start again once HA is back up.

After HA is back up, I try to pair and almost always get the "device is already paired", and I can see in HomeSpan's logs that HA is requesting to pair, pairing, and then requesting to pair a second time, resulting in the error.

The second set of tests I did also involved wiping the ESP device and having it re-generate a new unique device ID. Each time I did this, I also restarted my HA instance to get rid of any cached zeroconf discovery things. This seemed most successful when I wiped the ESP and also powered it totally off during the HA restart. Even then, sometimes there was a device "detected" by HA. Definitely seems like some mdns/zeroconf things maybe?

A couple of times I got this, if that helps.

Logger: aiohomekit.utils
Source: /usr/local/lib/python3.11/site-packages/aiohomekit/utils.py:45
First occurred: 2:16:39 PM (2 occurrences)
Last logged: 2:18:15 PM

Failure running background task: Task-2638
Failure running background task: Task-3279
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/controller/ip/pairing.py", line 140, in _ensure_connected
    await connection.ensure_connection()
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/controller/ip/connection.py", line 313, in ensure_connection
    await asyncio.shield(self._connector)
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/controller/ip/pairing.py", line 139, in _ensure_connected
    async with asyncio_timeout(10):
  File "/usr/local/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
    raise TimeoutError from exc_val
TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/utils.py", line 40, in _handle_task_result
    task.result()
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/controller/ip/pairing.py", line 421, in _process_config_changed
    await self.list_accessories_and_characteristics()
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/controller/ip/pairing.py", line 177, in list_accessories_and_characteristics
    await self._ensure_connected()
  File "/usr/local/lib/python3.11/site-packages/aiohomekit/controller/ip/pairing.py", line 146, in _ensure_connected
    raise AccessoryDisconnectedError(
aiohomekit.exceptions.AccessoryDisconnectedError: Timeout while waiting for connection to device 192.168.86.46:80

@bdraco
Copy link
Collaborator

bdraco commented Nov 11, 2023

It looks like HA isn't fully shutting down the connection when its unpaired.

@phybros
Copy link
Author

phybros commented Nov 12, 2023

Is there anything I can do to confirm that @bdraco ?

@Jc2k
Copy link
Owner

Jc2k commented Nov 12, 2023

If it's still doing pair verify after removing the integration and stopping after a restart then it's pretty much confirmed.

It's probably not been a problem before because most users add it once and don't need to remove it and add it again.

@phybros
Copy link
Author

phybros commented Nov 14, 2023

OK should I report this to HA then? What's the next thing to do for this issue?

@Jc2k
Copy link
Owner

Jc2k commented Nov 14, 2023

No reporting it there would just ping the same people that you are speaking to here :) Just need to wait for someone to have time to look into it.

@phybros
Copy link
Author

phybros commented Nov 14, 2023

Haha ok no problem! Thanks for all your responses

@phybros
Copy link
Author

phybros commented Dec 12, 2023

Hey there! I have definitively confirmed that this can happen if the device is brand new and has never been paired with HA. Also this was on a completely fresh HA that has never paired any devices before. So it seems like it's not just pair-unpair scenario from above.

@phybros
Copy link
Author

phybros commented Dec 12, 2023

To be clear I think that HA DOES continue to send messages to the device after unpairing it (sometimes), but it can ALSO fail on totally fresh pairing. I think they are related but separate bugs. The logs from the "fresh" scenario are identical to the ones I provided already - seems like multiple threads both trying to pair at the same time for some reason.

I've never had to debug anything this deep in HA before, but I'm going to try the dev env set up locally and see if I can do anything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants