dwc_otg driver causing complete system freeze in stable 6.6.28 kernel (Home Assistant OS, RPi OS) #6172

Open
sairon opened this issue May 16, 2024 · 25 comments

Comments

@sairon

sairon commented May 16, 2024

Describe the bug

With the upgrade of Home Assistant OS to the latest stable 6.6 kernel, we started to get reports of boot loops when some USB devices are connected: home-assistant/operating-system#3362

Further investigation showed that it's caused by the default dwc_otg driver, which causes a complete system freeze, with the watchdog restarting the device shortly after. I managed to reproduce the same issue on RPi OS (both 32-bit and 64-bit) using the steps described below, with kernel 6.6.20 from the current OS image and the latest 6.6.28 from the APT repo. It's still not completely clear to me whether it's only reproducible with FIQ enabled, because in my testing it seemed stable without it; however, switching to dwc2 seems to reliably resolve the issue.
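
As a rough illustration of the dwc2 switch mentioned above, here is a minimal Python sketch that checks whether a given config.txt would select the dwc2 driver. It assumes the usual Raspberry Pi convention of enabling it with a `dtoverlay=dwc2` line (optionally with `dr_mode=host`); the function name `selects_dwc2` and the sample text are illustrative, not taken from this report, and conditional filters such as `[pi4]` are ignored for simplicity.

```python
def selects_dwc2(config_text: str) -> bool:
    """Return True if any active line enables the dwc2 overlay.

    Simplification (assumption): section filters like [pi4] or [all]
    are ignored, so every non-comment line is treated as active.
    """
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line.startswith("dtoverlay="):
            continue
        # The overlay name is everything before the first comma.
        overlay = line[len("dtoverlay="):].split(",", 1)[0].strip()
        if overlay == "dwc2":
            return True
    return False


# Hypothetical config.txt content, for illustration only.
sample = """
# dtoverlay=dwc2
dtoverlay=vc4-kms-v3d
dtoverlay=dwc2,dr_mode=host
"""
print(selects_dwc2(sample))
```

This only inspects the text of the file; it does not tell you which driver the running kernel actually bound to the controller.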

There are also some reports that other USB devices (Zigbee sticks) trigger the same issue. RPi 3B seems to be the most commonly affected, but there's anecdotal evidence of it happening on RPi 4B as well. We also have reports of degraded performance of Zigbee sticks on RPi 4 and 5 (not leading to a freeze or boot loop), but it's not yet clear whether this is related: home-assistant/operating-system#3352

I'll be happy to perform any further tests or ask other people for more details to get this one sorted out.

Steps to reproduce the behaviour

  1. Install Home Assistant OS 12.3 (based on stable downstream RPi kernel 6.6.28).
  2. Plug in Z-Wave.me UZB stick.
  3. Set up Z-Wave / start the Z-Wave JS add-on, which initiates communication with the USB ACM device.
  4. System immediately freezes.

Alternatively, on RPi OS:

  1. Install Docker.
  2. Plug in Z-Wave.me UZB stick.
  3. Start the Z-Wave JS UI container: docker run --rm -it -p 8091:8091 -p 3000:3000 --device=/dev/serial/by-id/usb-0658_0200-if00:/dev/zwave --mount source=zwave-js-ui,target=/usr/src/app/store zwavejs/zwave-js-ui:latest
  4. Fill in any Z-Wave keys in the web UI and save the config.
  5. System immediately freezes.

Device(s)

Raspberry Pi 3 Mod. B

System

pi@rpios:~ $ cat /etc/rpi-issue 
Raspberry Pi reference 2024-03-15
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 11096428148f0f2be3985ef3126ee71f99c7f1c2, stage2
pi@rpios:~ $ vcgencmd version
Apr 17 2024 17:29:03 
Copyright (c) 2012 Broadcom
version 86ccc427f35fdc604edc511881cdf579df945fb4 (clean) (release) (start)
pi@rpios:~ $ uname -a
Linux rpios 6.6.28+rpt-rpi-v7 #1 SMP Raspbian 1:6.6.28-1+rpt1 (2024-04-22) armv7l GNU/Linux

Logs

[  142.879733] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  142.885902] rcu:     3-...0: (1 GPs behind) idle=5b8c/1/0x4000000000000000 softirq=37511/37513 fqs=5089
[  142.895112] rcu:     (detected by 2, t=21012 jiffies, g=69409, q=393 ncpus=4)

[  121.819373] WARN::dwc_otg_hcd_urb_dequeue:638: Timed out waiting for FSM NP transfer to complete on 3
[  142.879733] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  142.885902] rcu:     3-...0: (1 GPs behind) idle=5b8c/1/0x4000000000000000 softirq=37511/37513 fqs=5089
[  142.895112] rcu:     (detected by 2, t=21012 jiffies, g=69409, q=393 ncpus=4)
[  142.903141] [  156.130975] mmc1: Timeout waiting for hardware interrupt.
Task dump for CPU 3:
[  142.903147] task:node            state:R  running task     stack:0     pid:2956  ppid:2883   flags:0x00000202
[  142.903166] Call trace:
[  142.903172]  __switch_to+0xe8/0x168
[  142.903192]  0x0
[  156.130975] mmc1: Timeout waiting for hardware interrupt.
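
The symptoms in the log above can be picked out of a longer capture with a small script. This is a minimal sketch: the three patterns are derived only from the lines quoted in this report, and the names `PATTERNS` and `scan_kernel_log` are illustrative. It takes the last timestamp before each message, since interleaved console output can put two timestamps on one line.

```python
import re

# Patterns taken from the log lines quoted in this report.
PATTERNS = {
    "rcu_stall": re.compile(r"rcu: INFO: \w+ detected stalls"),
    "dwc_otg_timeout": re.compile(r"WARN::dwc_otg_hcd_urb_dequeue:\d+: Timed out"),
    "mmc_timeout": re.compile(r"mmc\d+: Timeout waiting for hardware interrupt"),
}
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)\]")


def scan_kernel_log(text: str) -> list[tuple[float, str]]:
    """Return (timestamp, event) pairs for known symptoms, sorted by time."""
    events = []
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if not match:
                continue
            # Use the last timestamp that precedes the matched message.
            stamps = TIMESTAMP.findall(line[:match.start()])
            if stamps:
                events.append((float(stamps[-1]), name))
    return sorted(events)


log = """\
[  121.819373] WARN::dwc_otg_hcd_urb_dequeue:638: Timed out waiting for FSM NP transfer to complete on 3
[  142.879733] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  142.903141] [  156.130975] mmc1: Timeout waiting for hardware interrupt.
"""
for stamp, event in scan_kernel_log(log):
    print(f"{stamp:12.6f}  {event}")
```

Ordering the events by timestamp makes it easier to see that the dwc_otg timeout precedes the RCU stall and the mmc timeout.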

Additional context

Might be closely related to #6100, but unlike there, even the latest kernel from the 6.6.y branch (6.6.30) did not fix the issue.

@popcornmix
Collaborator

popcornmix commented May 16, 2024

Is this a regression? i.e. has this ever been reliable with an older kernel?
You can install historical kernels using rpi-update <hash> to confirm.

@sairon
Author

sairon commented May 16, 2024

> Is this a regression? i.e. has this ever been reliable with an older kernel?

It is definitely a regression on Home Assistant OS: it is resolved by reverting (HAOS uses an A/B boot mechanism) to a build using kernel tag stable_20240124 (6.1.73), and it is reproducible with stable_20240423 (6.6.28). I am not aware of any similar issues in the past, and there are no relevant changes in the HAOS tree between those two builds that could be the cause.

I'll test an older RPi OS kernel and report back shortly.

@sairon
Author

sairon commented May 16, 2024

I downgraded 32-bit RPi OS to 6.1 from the stable branch (6.1.73) using rpi-update 6c2b033bf556c2a2ae109ec85d86485fa4c16050 and I can confirm that I cannot reproduce it there either. So I think we can safely call it a 6.6 regression.

@popcornmix
Collaborator

rpi-update 5fc4f643d2e9c5aa972828705a902d184527ae3f should get you the most recent 6.1 kernel (6.1.77).
rpi-update 7fa525a8a7d42235a8eaa52f5e3636ede9073225 should get you the oldest 6.6 kernel (6.6.5).

If the first works and the second fails, then it's likely the switch to the 6.6 tree.
If not, then it's one of the commits on 6.1 or 6.6 and we may be able to narrow it down further.

@sairon
Copy link
Author

sairon commented May 16, 2024

  • 6.1.77 does not manifest the issue.
  • 6.6.5 doesn't boot at all (double-checked on 32bit and 64bit OS):

image

@popcornmix
Collaborator

Possibly the boot failure is due to 4a8f7f7
Maybe rpi-update 07ff8bbae5c5e6a52c61ca062fdb181fd80202bc is the first build (6.6.20) with that fix.

@sairon
Author

sairon commented May 16, 2024

Moved a bit forward in the Git history and re-tested with hash 7c8a2bd9d4cc862929eb49d0c3cef2ffc59a365d (6.6.8). The issue is present; this is the last message on the HDMI console before the system froze:

image

(FWIW USB enumeration errors are another known issue of this particular USB device: home-assistant/operating-system#2995)

@popcornmix
Collaborator

Yes, looks like rpi-update 7c8a2bd9d4cc862929eb49d0c3cef2ffc59a365d (6.6.8) is the first build with the linked commit (and is a very early build on the 6.6 tree).

So it seems it started with the move to the 6.6 tree (which doesn't narrow it down much).
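
To make the narrowing explicit, the outcomes reported in this thread so far can be tabulated and bounded with a short Python sketch. The helper name `regression_window` is illustrative, and it assumes a single good-to-bad transition in chronological build order; builds that could not be tested (6.6.5 did not boot) are skipped.

```python
from typing import Optional

# Outcomes reported in this thread so far: True = stable, False = freezes,
# None = could not be tested (6.6.5 did not boot).
RESULTS = [
    ("6.1.73", True),
    ("6.1.77", True),
    ("6.6.5", None),
    ("6.6.8", False),
    ("6.6.20", False),
    ("6.6.28", False),
]


def regression_window(results):
    """Return (last_good, first_bad) for chronologically ordered results.

    Assumes a single good-to-bad transition; untestable builds are skipped.
    Returns None when no good build precedes the first bad one.
    """
    last_good: Optional[str] = None
    for version, ok in results:
        if ok is True:
            last_good = version
        elif ok is False:
            return (last_good, version) if last_good is not None else None
    return None


print(regression_window(RESULTS))  # ('6.1.77', '6.6.8')
```

The window spans the jump between the 6.1 and 6.6 trees, which is why these results alone cannot point at a single commit.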

@wolfpackt99

Where can I download an older version that works, outside of the Raspberry Pi Imager, and flash it to the device?

@popcornmix
Collaborator

An older version of RPiOS?
There's a lot of historical versions here: https://downloads.raspberrypi.com/

@wolfpackt99

> An older version of RPiOS? There's a lot of historical versions here: https://downloads.raspberrypi.com/

Yes, I have been having lots of other Z-Wave issues, so I kept installing updates in an attempt to fix them. Now I think I am stuck because of this issue. So I was trying to restore a backup by flashing the device from the Imager (version from 5/8/2024). With the USB devices unplugged, the system starts. Is there a command from the console to roll back, without having to image an older version?

@popcornmix
Collaborator

You can revert bootloader/firmware/kernel with rpi-update.
There is no way to revert all of apt.

@wolfpackt99

wolfpackt99 commented May 17, 2024

Reflashed 12.2 to the SD card. Will install a backup from before the 12.3 upgrade, and then wait for a fix.

Update:
I am back running again on 12.2.

@pelwell
Contributor

pelwell commented May 17, 2024

If I remember correctly, at least one of the Z-Wave dongles is/was seriously non-USB standards-compliant. @P33M?

@mvdnes

mvdnes commented May 18, 2024

Ah yes, the stick I have the issue with is mentioned in https://forums.raspberrypi.com/viewtopic.php?f=28&t=245031#p1502030 and #3027.

However, this was causing problems with Pi4 and not the Pi3, on which the current problem presents. It would be interesting to see if adding a hub in between solves the issue or not.

The USB stick from sairon is a different one though.

@ipoupaille

> Might be closely related to #6100 but unlike there, even latest kernel from the 6.6.y branch (6.6.30) did not fix the issue.

The latest kernel does not fix the issue for me. The dwc2 driver fixes it for the Z-Wave stick, but breaks another thing.
I am still on the 6.1.77 kernel.

@P33M
Contributor

P33M commented May 20, 2024

> If I remember correctly, at least one of the Z-Wave dongles is/was seriously non-USB standards-compliant. @P33M?

It was an Aeotec dongle and the symptom there was "failure to enumerate" not a hang during use.
Nothing substantial changed in dwc_otg between 6.1 and 6.6 - the fact that mmc dies as well as USB points to some fundamental breakage.

@bcutter

bcutter commented Jul 18, 2024

Wondering what's the current status on this (obviously kind of major) issue after a silence of 2 months?

@ea7kir

ea7kir commented Aug 11, 2024

For what it's worth, I have a similar problem with an Ethernet-USB adapter.

Raspberry Pi 4, Bookworm Desktop (64-bit), with
1 RJ45 -> LAN
1 USB -> Analog Devices Pluto SDR
1 USB -> ASIX USB/Ethernet Dongle -> RJ45 on a DDMAL HDMI Video Encoder

The Pluto always connects, but the Dongle does not.

pi@txtouch:~ $ uname -a
Linux txtouch 6.6.31+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.31-1+rpt1 (2024-05-29) aarch64 GNU/Linux

pi@txtouch:~ $ lsusb
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 004: ID 0b95:7720 ASIX Electronics Corp. AX88772
Bus 001 Device 003: ID 0456:b673 Analog Devices, Inc. LibIIO based AD9363 Software Defined Radio [ADALM-PLUTO]
Bus 001 Device 002: ID 2109:3431 VIA Labs, Inc. Hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

pi@txtouch:~ $ nmcli device
DEVICE         TYPE      STATE                                   CONNECTION
eth0           ethernet  connected                               Wired connection 1
eth1           ethernet  connected                               Wired connection 2
lo             loopback  connected (externally)                  lo
eth2           ethernet  connecting (getting IP configuration)   Wired connection 3
wlan0          wifi      disconnected                            --
p2p-dev-wlan0  wifi-p2p  disconnected                            --

pi@txtouch:~ $ nmcli monitor
NetworkManager is running
eth2: connection failed
Networkmanager is now in the 'connected (site only)' state
eth2: disconnected
eth2: using connection 'Wired connection 3'
eth2: connecting (prepare)
Networkmanager is now in the 'connecting' state
eth2: connecting (configuring)
eth2: connecting (getting IP configuration)
eth2: connection failed
Networkmanager is now in the 'connected (site only)' state
eth2: disconnected

@espeir

espeir commented Aug 14, 2024

I saw that 13.0 released today. Any idea if this issue has been resolved?

@itCarl

itCarl commented Aug 14, 2024

> I saw that 13.0 released today. Any idea if this issue has been resolved?

Since I updated to HA OS 13 I have this problem. Before the update, mine was running fine with a Zigbee stick and an RPi 3B+.
But now it's broken 😢 and restarting all the time. I was able to deactivate the "Sonoff Zigbee 3.0 USB Dongle Plus" integration; it is now running, but without Zigbee sensors.

@cvladan

cvladan commented Aug 15, 2024

Me also!

https://www.reddit.com/r/homeassistant/comments/1est4zd/update_often_crashes_everything/

@bcutter

bcutter commented Aug 25, 2024

> > I saw that 13.0 released today. Any idea if this issue has been resolved?
>
> Since i have updated to HA OS 13 i have this problem. before the update mine was running fine with a ZigBee stick and RPI3B+. But now its broken 😢 and restarting all the time. I was able to deactivate the "Sonoff Zigbee 3.0 USB Dongle Plus" Integration it is running but without zigbee sensors.

What version did you run before updating to HA OS 13.0?

@itCarl

itCarl commented Aug 26, 2024

> > > I saw that 13.0 released today. Any idea if this issue has been resolved?
> >
> > Since i have updated to HA OS 13 i have this problem. before the update mine was running fine with a ZigBee stick and RPI3B+. But now its broken 😢 and restarting all the time. I was able to deactivate the "Sonoff Zigbee 3.0 USB Dongle Plus" Integration it is running but without zigbee sensors.
>
> What version did you run before updating to HA OS 13.0?

12.3 or 12.4, can't remember which one exactly... Anyway, I fixed it by doing a fresh install of HAOS 13.0. Now it's running like before, even with 13.0.

@Skuair

Skuair commented Aug 27, 2024

Hello,
I have the Sonoff dongle-e plus with Home Assistant OS on an RPi 3B+.
On my side, the problem occurred only when the zigbee2mqtt add-on was starting: the host restarted each time. But it only appeared when I restarted the host yesterday, which was still on 12.2; the last time I had restarted it was months ago.
So I took this opportunity to upgrade the dongle firmware, and when I restarted everything and plugged the dongle back in, there were no more problems, so I thought the firmware update of the dongle had resolved it.

I decided to jump to 13.1 to see how it goes: the host restarted and the same problem appeared again, a restart loop only when the z2m add-on starts.

So, this is how I overcame it after some tests. I have to follow these steps:

  • stop z2m add-on and don't let it start at HA start
  • unplug the dongle
  • restart host
  • wait some minutes after restart
  • plug the dongle
  • start z2m

My conclusion for now is that the host must not restart while the dongle is plugged in. I suppose that for the next host update I will have to repeat these steps before the update and plug in the dongle only at the end.
