Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Hub took too long to respond #256

Closed
laurensvalk opened this issue Feb 13, 2021 · 54 comments
Closed

[Bug] Hub took too long to respond #256

laurensvalk opened this issue Feb 13, 2021 · 54 comments
Labels
bug Something isn't working software: pybricks-code Issues with https://code.pybricks.com application topic: bluetooth Issues involving bluetooth topic: flashing firmware Issues related to fhe firmware flashing process

Comments

@laurensvalk
Copy link
Member

laurensvalk commented Feb 13, 2021

Describe the bug
Cannot install Technic Hub firmware in Microsoft Edge and Google Chrome on Windows.

It pairs just fine, and appears to erase the flash as usual. The light is off for a bit, then starts blinking red/green/blue. But Pybricks Code does not appear to pick it up after.

Pybricks Code v1.0.0-beta.6
Chrome 88.0.4324.150
Edge 88.0.705.69
Windows 10 Home 10.0.19042 running in VirtualBox on Ubuntu 18.04.

Expected behavior
Start installation after erasing flash.

Screenshots
image

@laurensvalk laurensvalk added bug Something isn't working software: pybricks-code Issues with https://code.pybricks.com application topic: flashing firmware Issues related to fhe firmware flashing process labels Feb 13, 2021
@dlech
Copy link
Member

dlech commented Feb 13, 2021

Can you capture the Bluetooth traffic when this happens? https://bleak.readthedocs.io/en/latest/troubleshooting.html#capture

@laurensvalk
Copy link
Member Author

Will do. I'm stuck on the decoding step though (This app can't run on your PC); will have to try it later.

@dlech
Copy link
Member

dlech commented Feb 14, 2021

I can decode it if that helps.

@laurensvalk
Copy link
Member Author

It appears to have loaded something; there must be something trivial with the command line interface not working.

image

@laurensvalk
Copy link
Member Author

Here's the etl file if I've done it right. Not sure this type of file contains any sysinfo? Feel free to delete if so.

bth_hci.etl.txt

@dlech
Copy link
Member

dlech commented Apr 3, 2021

image

For some reason, it looks like Windows is hitting a 30 second timeout waiting for the write response. At this point, tries to disconnect the device. But as soon as the disconnect request is made, we get the write response and the notification that we were expecting. It's like they are stuck in the Bluetooth stack or something (would need to get a proper Bluetooth sniffer to see for sure).

@dlech
Copy link
Member

dlech commented Apr 3, 2021

bth_hci.etl.txt.zip

Decoded file for the record.

@dlech
Copy link
Member

dlech commented Apr 12, 2021

Does this still happen after rebooting Windows?

@laurensvalk
Copy link
Member Author

laurensvalk commented Apr 13, 2021

Yes. Still occurs consistently in Pybricks Code v1.0.0-beta.14 in Microsoft Edge and Google Chrome.

@laurensvalk
Copy link
Member Author

Need to go through old builds of Pybricks Code to see if it worked before.

@dlech
Copy link
Member

dlech commented May 12, 2021

Can you also share the output of chrome://device-log for Bluetooth devices with the log level set to Debug?

@laurensvalk
Copy link
Member Author

BluetoothDebug[2021/05/12 17:00:36.031250] bluetooth_device_winrt.cc:804 OnGattDiscoveryComplete() success=1

BluetoothDebug[2021/05/12 17:00:35.387774] bluetooth_device_winrt.cc:788 StartGattDiscovery()

BluetoothDebug[2021/05/12 17:00:35.387772] bluetooth_device_winrt.cc:776 Discovering GATT services anew

BluetoothDebug[2021/05/12 17:00:35.387755] bluetooth_device_winrt.cc:760 OnGattServicesChanged()

BluetoothDebug[2021/05/12 17:00:35.203580] bluetooth_device_winrt.cc:788 StartGattDiscovery()

BluetoothDebug[2021/05/12 17:00:35.203576] bluetooth_device_winrt.cc:776 Discovering GATT services anew

BluetoothDebug[2021/05/12 17:00:35.203554] bluetooth_device_winrt.cc:760 OnGattServicesChanged()

BluetoothDebug[2021/05/12 17:00:35.024604] bluetooth_device_winrt.cc:788 StartGattDiscovery()

BluetoothDebug[2021/05/12 17:00:35.024601] bluetooth_device_winrt.cc:776 Discovering GATT services anew

BluetoothDebug[2021/05/12 17:00:35.024590] bluetooth_device_winrt.cc:760 OnGattServicesChanged()

BluetoothDebug[2021/05/12 17:00:35.023874] bluetooth_device_winrt.cc:737 OnConnectionStatusChanged() status=1

BluetoothDebug[2021/05/12 17:00:33.513082] bluetooth_device_winrt.cc:788 StartGattDiscovery()

BluetoothDebug[2021/05/12 17:00:33.411533] bluetooth_adapter_winrt.cc:211 No advertisement flags found.

@dlech
Copy link
Member

dlech commented May 12, 2021

Hmm... not much useful there. I was at least expecting to see a disconnect after 30 seconds like we did in the packet capture.

@dlech
Copy link
Member

dlech commented May 12, 2021

Does flashing with pybricksdev work on the same hardware?

Either way, can you do a Bluetooth packet capture while doing this as well so we can compare it to what Edge does?

@laurensvalk
Copy link
Member Author

Same hardware yes and no --- I'm running Windows on a VM.

And indeed, it seems to go wrong there with Pybricksdev as well, so I could dig a bit deeper.

Connected successfully!
Erasing flash and starting update
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\laure\Desktop\pybricksdev\pybricksdev\cli\__init__.py", line 327, in main
    asyncio.run(subparsers.choices[args.tool].tool.run(args))
  File "C:\Program Files\Python38\lib\asyncio\runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "C:\Program Files\Python38\lib\asyncio\base_events.py", line 616, in run_until_complete
    return future.result()
  File "C:\Users\laure\Desktop\pybricksdev\pybricksdev\cli\__init__.py", line 202, in run
    await updater.flash(firmware, metadata, args.delay/1000)
  File "C:\Users\laure\Desktop\pybricksdev\pybricksdev\flash.py", line 300, in flash
    response = await self.bootloader_request(self.ERASE_FLASH)
  File "C:\Users\laure\Desktop\pybricksdev\pybricksdev\flash.py", line 267, in bootloader_request
    await self.write(data, delay, request.write_with_response)
  File "C:\Users\laure\Desktop\pybricksdev\pybricksdev\ble.py", line 140, in write
    await self.client.write_gatt_char(
  File "C:\Users\laure\Desktop\pybricksdev\.venv\lib\site-packages\bleak\backends\dotnet\client.py", line 733, in write_gatt_char
    raise BleakError(
bleak.exc.BleakError: Could not write value bytearray(b'\x11') to characteristic 00001626-1212-efde-1623-785feabcd123: Unreachable

@dlech
Copy link
Member

dlech commented May 12, 2021

Does it work if you change the erase command to write without response?

@laurensvalk
Copy link
Member Author

Also confirming that I see this in Pybricksdev on both City Hub and Technic Hub.

Does it work if you change the erase command to write without response?

It's been a long time since I worked on this code. The followowing appears to make no difference:

image

@laurensvalk
Copy link
Member Author

Also, the light does turn off briefly, so the hub appears to receive the erase command.

@laurensvalk
Copy link
Member Author

laurensvalk commented May 12, 2021

Without response, it mostly still doesn't work, but I saw it work once...

In that case it mostly just gets stuck awaiting the reply:

2021-05-12 21:23:30,391:   DEBUG: Erasing flash.
2021-05-12 21:23:30,391:   DEBUG: Clear and prepare reply
2021-05-12 21:23:30,392:   DEBUG: Make and write request
2021-05-12 21:23:30,392:   DEBUG: TX CHUNK: bytearray(b'\x11'), without response
2021-05-12 21:23:30,450:   DEBUG: Awaiting reply

@laurensvalk
Copy link
Member Author

laurensvalk commented May 12, 2021

Hmm, looks like write without response and without reply + waiting about 5 minutes does the trick in pybricksdev. Maybe there's something else we can send to wake it up sooner. Might not be able to investigate tonight though.

EDIT: It seems we always wait for a reply of some sort, maybe that's worth skipping. Update: yes, see post below.

@laurensvalk
Copy link
Member Author

laurensvalk commented May 12, 2021

So indeed, this quick hack does the trick on Technic Hub:

diff --git a/pybricksdev/flash.py b/pybricksdev/flash.py
index 8ee5e5f..da8ecd5 100644
--- a/pybricksdev/flash.py
+++ b/pybricksdev/flash.py
@@ -11,6 +11,8 @@ from typing import Dict, Tuple
 from tqdm import tqdm
 import typing
 import zipfile
+import asyncio
+

 from .ble import BLERequestsConnection
 from .compile import save_script, compile_file
@@ -203,6 +205,8 @@ class BootloaderRequest():
         self.reply_len = struct.calcsize(data_format)
         if request_reply:
             self.reply_len += 1
+        if command == 0x11:  # Hack to skip waiting for reply altogether, not even the echo of the command.
+            self.reply_len = 0
         self.write_with_response = write_with_response

     def make_request(self, payload=None):
@@ -223,7 +227,7 @@ class BootloaderConnection(BLERequestsConnection):

     # Static BootloaderRequest instances for particular messages
     ERASE_FLASH = BootloaderRequest(
-        0x11, 'Erase', ['result'], '<B'
+        0x11, 'Erase', ['result'], '<B', False, False
     )
     PROGRAM_FLASH_BARE = BootloaderRequest(
         0x22, 'Flash', [], '', False, False
@@ -273,7 +277,6 @@ class BootloaderConnection(BLERequestsConnection):
             return request.parse_reply(reply)

     async def flash(self, firmware, metadata, delay):
-
         # Firmware information
         firmware_io = io.BytesIO(firmware)
         firmware_size = len(firmware)
@@ -299,6 +302,7 @@ class BootloaderConnection(BLERequestsConnection):
         self.logger.debug("Erasing flash.")
         response = await self.bootloader_request(self.ERASE_FLASH)
         self.logger.debug(response)
+        await asyncio.sleep(4) # wait for flash erase to complete, guess enough time instead of waiting for reply with success

         # Get the bootloader ready to accept the firmware
         self.logger.debug('Request begin update.')

On City Hub I do get something back b'\x11\x00', which we can't currently parse in Pybricksdev so it raises ValueError: Unknown message.

@dlech
Copy link
Member

dlech commented May 12, 2021

Does Move hub work?

Can you do a Bluetooth packet capture for all hubs with the hack in place?

@dlech dlech added the topic: bluetooth Issues involving bluetooth label May 13, 2021
@laurensvalk
Copy link
Member Author

Does Move hub work?

Yes, it works in Pybricks Code.

@laurensvalk
Copy link
Member Author

Can you do a Bluetooth packet capture for all hubs with the hack in place?

bth_hci_pybricksdev_hack.zip

Notes:

  • For Technic Hub I included a log without hack (which fails: Could not write value bytearray(b'\x11')). Uploaded here for comparison with Pybricks Code log from before
  • For Technic Hub I included a log with hack (which succeeds)
  • For City Hub I included a log with the same hack. This fails, but that seems to be mainly because I hacked how the reply is treated, so Pybricksdev doesn't understand b'\x11\x00'. This can probably be worked around in another way.

@dlech
Copy link
Member

dlech commented May 14, 2021

Thanks for this. In bth_hci_technichub_pybricksdev_default, it looks like the bootloader is sending the wrong response to Find By Type Value Request. This probably confuses the Windows Bluetooth stack since it is getting responses to commands it didn't send and not getting responses to commands that it did send. I recall that this request was added in a newer version of Windows and caused problems and had to be fixed in our firmware. However, we can't fix it in the bootloader.

image

The capture doesn't even get to the point of starting to send the first bootloader command to the LWP3 characteristic, so I imagine that if you were to capture again, you might get different results.

@laurensvalk
Copy link
Member Author

laurensvalk commented May 15, 2021

On City Hub, when changing the flash request to write without response it does proceed, and we're seeing the same as before, the reply b'\x11\x00' does come back eventually, but apparently only after the next write, so the responses fail to match:

2021-05-15 11:49:52,894: DEBUG: pybricksdev.flash: Erasing flash.
2021-05-15 11:49:52,894: DEBUG: pybricksdev.flash: Clear and prepare reply
2021-05-15 11:49:52,894: DEBUG: pybricksdev.flash: Make and write request
2021-05-15 11:49:52,895: DEBUG: pybricksdev.ble: TX CHUNK: bytearray(b'\x11'), without response
2021-05-15 11:49:52,899: DEBUG: bleak.backends.dotnet.client: Write Characteristic 00001626-1212-efde-1623-785feabcd123 : bytearray(b'\x11')
2021-05-15 11:49:52,899: DEBUG: pybricksdev.flash: Awaiting reply
2021-05-15 11:50:02,910: INFO: pybricksdev.flash: did not receive erase reply, continuing anyway
2021-05-15 11:50:02,910: DEBUG: pybricksdev.flash: Request begin update.
2021-05-15 11:50:02,910: DEBUG: pybricksdev.flash: Clear and prepare reply
2021-05-15 11:50:02,915: DEBUG: pybricksdev.flash: Make and write request
2021-05-15 11:50:02,916: DEBUG: pybricksdev.ble: TX CHUNK: bytearray(b'D\xccm\x02\x00'), with response
2021-05-15 11:50:02,984: DEBUG: pybricksdev.ble: DATA bytearray(b'\x11\x00')
2021-05-15 11:50:02,985: DEBUG: bleak.backends.dotnet.client: Write Characteristic 00001626-1212-efde-1623-785feabcd123 : bytearray(b'D\xccm\x02\x00')
2021-05-15 11:50:02,986: DEBUG: pybricksdev.flash: Awaiting reply

Then fails with

Exception has occurred: ValueError
Unknown message: bytearray(b'\x11\x00')

Could we send a bogus message (get checksum?) just to clear up any pending replies before starting to send the firmware?

laurensvalk added a commit to pybricks/pybricksdev that referenced this issue May 15, 2021
@laurensvalk
Copy link
Member Author

laurensvalk commented May 15, 2021

Could we send a bogus message (get checksum?) just to clear up any pending replies before starting to send the firmware?

Yes, that works. I've pushed a patch here.

So the flash erase command is now the same again for all hubs (write without response); this just moves the workaround instead of adding another one, so hopefully this isn't too bad.

It can probably be improved by proper handling of the responses instead of sending something and waiting two seconds. If Pybricks Code already does this, then maybe the bogus message isn't needed.

@laurensvalk
Copy link
Member Author

Also, looking at the log above it's almost as if the hub does send that reply, since the script gets it back before the backend send the next one. But just the act of writing to the stack appears to unload that data.

@dlech
Copy link
Member

dlech commented May 15, 2021

Did you try this patch on Linux? I'm pretty sure that it will cause BlueZ to disconnect (it did in the past). If that is the case, we need to do write without response (and the checksum hack) only on Windows.

Does it work on Windows if we just add an async sleep before sending the next message? I wonder if there is some other way we could "poke" the Windows Bluetooth stack to get it to see the notification or if the message really is getting held back on the hub.

@dlech
Copy link
Member

dlech commented May 15, 2021

Could you also do a packet capture of City Hub on Windows with "write with response".

@dlech
Copy link
Member

dlech commented May 15, 2021

Also, what Bluetooth adapter are you using?

@laurensvalk
Copy link
Member Author

Patch has only been tested on windows so far, hence pushed to a windows branch for now.

I also tried a plain async sleep (and a longer timeout, 10s), but this didn't seem to work. The old/bad response always came when attempting the next write.

I'll get back with captures and info on the adapter when I'm back at the PC.

@laurensvalk
Copy link
Member Author

@laurensvalk
Copy link
Member Author

Could you also do a packet capture of City Hub on Windows with "write with response".

With the current Pybricksdev master:

bth_hci_6b0e63bd_cityhub.zip

@laurensvalk
Copy link
Member Author

Did you try this patch on Linux? I'm pretty sure that it will cause BlueZ to disconnect (it did in the past). If that is the case, we need to do write without response (and the checksum hack) only on Windows.

Indeed:

Exception has occurred: BleakError
Not connected

@dlech
Copy link
Member

dlech commented May 15, 2021

I think I have this one: https://www.trust.com/en/product/18187-manga-bluetooth-4-0-adapter

USB VID:PID?

@dlech
Copy link
Member

dlech commented May 15, 2021

Decoded captures from #256 (comment)

bth_hci_pybricksdev_hack (2).zip

Decoded capture from #256 (comment)

bth_hci_6b0e63bd_cityhub (2).zip

@dlech
Copy link
Member

dlech commented May 15, 2021

With the current Pybricksdev master:

bth_hci_6b0e63bd_cityhub.zip

What happened when this ran? It looks like it got cut off before it ever to the point of sending LWP3 commands (probably due to other misbehavior due to buggy implementation of Bluetooth in the bootloader).

EDIT: I now see that this capture contains 2 connections, so the interesting part is in the middle.

dlech added a commit to pybricks/pybricksdev that referenced this issue May 15, 2021
This addresses additional quirks seen on a specific Windows setup with
the city hub.

Issue: pybricks/support#256
@dlech
Copy link
Member

dlech commented May 15, 2021

I pushed a branch with a different workaround that shouldn't break Linux.

@laurensvalk
Copy link
Member Author

laurensvalk commented May 16, 2021

What happened when this ran?

Same as throughout this whole issue; got stuck on trying to erase flash with response.

That patch (pybricks/pybricksdev@c227f02) works too 👍

So does this not break on other Windows machines, given that those were working before the patch?

USB VID:PID?

Bus 001 Device 011: ID 0a12:0001 Cambridge Silicon Radio, Ltd Bluetooth Dongle (HCI mode)

@dlech
Copy link
Member

dlech commented May 16, 2021

So does this not break on other Windows machines, given that those were working before the patch?

It seems to be working fine on my Windows laptop.

@dlech
Copy link
Member

dlech commented May 16, 2021

Bus 001 Device 011: ID 0a12:0001 Cambridge Silicon Radio, Ltd Bluetooth Dongle (HCI mode)

I have a dongle with the same VID:PID. My Windows laptop doesn't seem to be able to load the driver for it, but I was able to use it in a Windows VM without problem (driver worked and pybricksdev worked without hacks).

What is the driver info for this dongle in your Windows VM?

@dlech
Copy link
Member

dlech commented May 16, 2021

After disabling the built-in adapter, I was able to get this dongle to work on my laptop as well. I didn't not encounter any problems with it when flashing firmware in Pybricks Code.

dlech added a commit to pybricks/pybricks-code that referenced this issue May 16, 2021
This works around some issues that have been observed on a specific
Windows computer.

Issue: pybricks/support#256
@dlech
Copy link
Member

dlech commented May 16, 2021

I have pushed a branch to pybricks-code with similar fixes for testing.

dlech added a commit to pybricks/pybricksdev that referenced this issue May 16, 2021
This addresses additional quirks seen on a specific Windows setup with
the city hub.

Issue: pybricks/support#256
@laurensvalk
Copy link
Member Author

laurensvalk commented May 17, 2021

I can confirm that pybricks/pybricks-code@956b6f3 is working for me on Windows for City Hub, Technic Hub, and Move Hub 🎉

Also works on Ubuntu 18.04 with City Hub, Technic Hub, and Move Hub.

@dlech
Copy link
Member

dlech commented May 17, 2021

Hopefully you aren't the only person in the world that had this strange problem. 😉

@laurensvalk
Copy link
Member Author

If we merge it, we'll never know 😄

@dlech
Copy link
Member

dlech commented Aug 13, 2021

Is there anything left to do on this issue?

@laurensvalk
Copy link
Member Author

Reading the messages above, I guess not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working software: pybricks-code Issues with https://code.pybricks.com application topic: bluetooth Issues involving bluetooth topic: flashing firmware Issues related to fhe firmware flashing process
Projects
None yet
Development

No branches or pull requests

2 participants