[v2022.1+] TP-Link WDR4300 hangs during reboot #2904
Comments
Does this also happen with the very similar WDR3600? |
Probably. We had a few isolated cases where a WDR3600 needed a power cycle after an upgrade but it is not clear if this is at all related to the problem described here. We don't have enough (failing) devices to have a confident answer. |
It might be worth mentioning that the special symbol at the end of the log is printed during a boot as well. I'm not sure if this is printed before or after the bootloader loaded though. EDIT:
|
We also had reports in our community when I rolled out 2022.1 but thought it was random, and we didn't have proper logs or anything else. #2655 |
We observed this when transitioning from 2022.1.2 to 2022.1.4 on WDR4300 and more frequently on Ubiquiti AC lite. In our observation, the update was fine when the machine was rebooted just prior to the update, which may suggest an out-of-memory issue. |
@smoe Just to clarify, we were able to reproduce the issue on a freshly booted device as well. |
One thing that comes to my mind is the usage of the newer ar934x SPI controller driver; at least, no device reported in this issue uses the older ar71xx driver. This driver was first shipped with OpenWrt 21.02, matching the observation that it does not break with older releases based on OpenWrt 19.07 and older. If you are still able to reproduce this issue, you can modify the ar934x DTSI to use the compatible for the ar71xx SPI controller. Ping me in case I should provide you with a patch. If this fixes the reboot issue, we have a better idea of where to look next. |
@blocktrron thank you for looking into this. To avoid misunderstandings, are you suggesting this change in OpenWrt?
diff --git a/target/linux/ath79/dts/ar934x.dtsi b/target/linux/ath79/dts/ar934x.dtsi
index d88c7bfabc..15201b197e 100644
--- a/target/linux/ath79/dts/ar934x.dtsi
+++ b/target/linux/ath79/dts/ar934x.dtsi
@@ -199,15 +199,17 @@
 		};
 		spi: spi@1f000000 {
-			compatible = "qca,ar934x-spi";
-			reg = <0x1f000000 0x1c>;
+			compatible = "qca,ar7240-spi",
+				"qca,ar7100-spi";
+			reg = <0x1f000000 0x10>;
 			clocks = <&pll ATH79_CLK_AHB>;
+			clock-names = "ahb";
+
+			status = "disabled";
 			#address-cells = <1>;
 			#size-cells = <0>;
-
-			status = "disabled";
 		};
 	}; |
@grische Almost. Just revert this commit in the file: openwrt/openwrt@ebf0d8d#diff-45ad725f9ec8cc2da88738047b1d5c4d1e69df19194bd22394d3736e03093613 |
@blocktrron I was able to reproduce a hang after reboot even with the above commit reverted using Gluon v2023.1: Here is the respective branch: https://github.com/grische/site-ffm/commits/test/revert-ath79-add-new-ar934x-spi-driver/ |
@grische Are these hangs only reproducible after writing an upgrade image, or does a regular reboot invocation also trigger a spurious hang? |
I have a test WDR4300 device where I can reproduce the hangs during a reboot every other time. Surprisingly often actually. |
On the exact same setup, I tested it with
|
Add a cache-barrier after the reset-register write. This fixes spurious reboot issues on TP-Link WDR3600 and WDR4300 devices with Zental DDR2 DRAM chips. This issue was fixed in the past, but switching to the reset-driver specific implementation removed the cache barrier which was previously implicitly added by reading back the register in question. Link: freifunk-gluon/gluon#2904 Link: openwrt#13043 Link: https://dev.archive.openwrt.org/ticket/17839 Signed-off-by: David Bauer <mail@david-bauer.net>
Read back the reset register in order to flush the cache. This fixes spurious reboot hangs on TP-Link TL-WDR3600 and TL-WDR4300 with Zentel DRAM chips. This issue was fixed in the past, but switching to the reset-driver specific implementation removed the cache barrier which was previously implicitly added by reading back the register in question. Link: freifunk-gluon/gluon#2904 Link: openwrt#13043 Link: https://dev.archive.openwrt.org/ticket/17839 Link: f8a7bfe1cb2c ("MIPS: ath79: fix system restart") Signed-off-by: David Bauer <mail@david-bauer.net>
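The ordering problem this commit message describes can be pictured with a toy model. The sketch below is not the kernel code and makes no claim about how the ath79 hardware actually buffers writes; `PostedWriteBus`, `RESET_REG` and `FULL_CHIP_RESET` are made-up names and values for illustration only. It merely shows why reading a register back right after writing it guarantees that the write has reached the device.

```python
class PostedWriteBus:
    """Toy bus model: writes are queued ("posted"); a read drains the queue first."""

    def __init__(self):
        self.pending = []   # writes issued by the CPU but not yet seen by the device
        self.device = {}    # register values as the device sees them

    def write(self, addr, value):
        self.pending.append((addr, value))

    def read(self, addr):
        # A read has to observe all earlier writes, so they are completed first.
        for pending_addr, pending_value in self.pending:
            self.device[pending_addr] = pending_value
        self.pending.clear()
        return self.device.get(addr, 0)


RESET_REG = 0x1C            # illustrative register offset, not taken from the patch
FULL_CHIP_RESET = 1 << 24   # illustrative bit, not taken from the patch


def restart_without_readback(bus):
    bus.write(RESET_REG, FULL_CHIP_RESET)
    return bus.device.get(RESET_REG, 0)   # what the device has seen so far


def restart_with_readback(bus):
    bus.write(RESET_REG, FULL_CHIP_RESET)
    bus.read(RESET_REG)                   # read back: forces the queued write out
    return bus.device.get(RESET_REG, 0)
```

In this model the device never sees the reset request unless something drains the queue, which is the role the read-back plays.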
Read back the reset register in order to flush the cache. This fixes spurious reboot hangs on TP-Link TL-WDR3600 and TL-WDR4300 with Zentel DRAM chips. This issue was fixed in the past, but switching to the reset-driver specific implementation removed the cache barrier which was previously implicitly added by reading back the register in question. Link: freifunk-gluon/gluon#2904 Link: #13043 Link: https://dev.archive.openwrt.org/ticket/17839 Link: f8a7bfe1cb2c ("MIPS: ath79: fix system restart") Signed-off-by: David Bauer <mail@david-bauer.net> (cherry picked from commit 2fe8ecd)
The bug was fixed upstream in
|
According to datasheet, on AR9344 the switch and switch analog need to be reset first before initiating a full reset. Resetting these systems fixes spurious reset hangs on Atheros AR9344 SoCs. Link: freifunk-gluon/gluon#2904 Signed-off-by: David Bauer <mail@david-bauer.net>
According to datasheet, on AR9344 the switch and switch analog need to be reset first before initiating a full reset. Resetting these systems fixes spurious reset hangs on Atheros AR9344 SoCs. Link: freifunk-gluon/gluon#2904 Signed-off-by: David Bauer <mail@david-bauer.net> (cherry picked from commit 916af73)
I also ran the reboot tests on main and on the ar9344-reset branch on a TP-Link TL-WDR4300 v1:
Serial console output can be found here: https://gist.github.com/grische/be44b330d7f8c7fff88939f979be9d32 |
I might have bad news: the ar9344-reset branch got stuck during a reboot on my WDR4300 at the 705th reboot. I added the last two reboots into the above gist: https://gist.githubusercontent.com/grische/be44b330d7f8c7fff88939f979be9d32/raw/634a35543445ee220a2c77e8ee6a3b7e285e72d3/ttyUSB_2025-01-05T16%25EF%2580%25BA39%25EF%2580%25BA52+01%25EF%2580%25BA00.0.tail |
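For anyone repeating such a long-running test, a captured serial log can be scanned for restart attempts that never came back up. This is a minimal sketch, assuming each reboot prints a line containing "Restarting system" and each successful boot prints a line containing "Linux version"; both marker strings and the function name `find_hung_reboots` are assumptions, so adjust them to what the log actually contains. It only flags an attempt that ends the log or is followed directly by another restart line; a hang recovered by a power cycle looks like a normal boot here.

```python
def find_hung_reboots(lines, restart_marker="Restarting system",
                      boot_marker="Linux version"):
    """Return (restart attempts, 1-based indices of attempts with no boot banner after them)."""
    hung = []
    attempts = 0
    awaiting_boot = False
    for line in lines:
        if restart_marker in line:
            if awaiting_boot:
                # The previous attempt never produced a boot banner.
                hung.append(attempts)
            attempts += 1
            awaiting_boot = True
        elif boot_marker in line:
            awaiting_boot = False
    if awaiting_boot:
        # The log ends while still waiting for the device to come back.
        hung.append(attempts)
    return attempts, hung
```

Feeding it the lines of a captured log, for example via `open(path, errors="replace")`, returns the total number of restart attempts and which of them look stuck.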
The hanging reboot is 10 seconds faster and these bits here are not happening:
|
The timings across all reboots are not very reliable as I attempt a reboot "every 30 seconds", so the 10s shift could have been coincidental. Some reboot time stats:
Minimum: 82 seconds
grep "Restarting" ttyUSB_2025-01-05T16%EF%80%BA39%EF%80%BA52+01%EF%80%BA00.0 | tr -d '\]' | awk '{ print $2 }' | sort -n | head -n 3
82.410005
82.529950
82.848242
Maximum: 185 seconds
grep "Restarting" ttyUSB_2025-01-05T16%EF%80%BA39%EF%80%BA52+01%EF%80%BA00.0 | tr -d '\]' | awk '{ print $2 }' | sort -n | tail -n 3
179.763697
179.999500
185.094121
Average: 93.6 seconds
grep "Restarting" ttyUSB_2025-01-05T16%EF%80%BA39%EF%80%BA52+01%EF%80%BA00.0 | tr -d '\]' | awk '{s+=$2}END{print "average:",s/NR}'
average: 93.6536
When I grep the log of almost 700 reboots (see gist above), I can find the following parts
|
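The grep/awk one-liners above can also be written as one small script, which is handy when the same statistics are needed for several log files. This is a sketch that assumes the log lines carry the usual kernel timestamp prefix such as `[   82.410005]`; the function names `restart_uptimes` and `summarize` are made up for illustration.

```python
import re

# Kernel log timestamp, e.g. "[   82.410005]"
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)\]")


def restart_uptimes(lines):
    """Collect the timestamp (in seconds) of every line containing "Restarting"."""
    uptimes = []
    for line in lines:
        if "Restarting" not in line:
            continue
        match = TIMESTAMP.search(line)
        if match:
            uptimes.append(float(match.group(1)))
    return uptimes


def summarize(uptimes):
    """Return (minimum, maximum, average) of the collected timestamps."""
    return min(uptimes), max(uptimes), sum(uptimes) / len(uptimes)
```

Calling `summarize(restart_uptimes(open(path, errors="replace")))` on the serial log should give the same minimum, maximum and average as the three pipelines, provided the log has at least one matching line.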
@rotanid the tested patch d3f2342 and the upstreamed patch openwrt/openwrt@0c52c9d are identical. And yes, this would need a backport to Gluon v2023.2.x. At least the first version of the patch was easily backported to kernel 5.15. |
This bump includes two major changes / fixes: - c06d4df974 mac80211: set basic-rate for mesh interfaces See freifunk-gluon/gluon#3185 - 0c52c9d6fc ath79: reset ETH switch for AR9344 freifunk-gluon/gluon#2904
This bump includes two major changes / fixes: - c06d4df974 mac80211: set basic-rate for mesh interfaces See freifunk-gluon/gluon#3185 - 0c52c9d6fc ath79: reset ETH switch for AR9344 freifunk-gluon/gluon#2904 - many more fixes (incl. FB4040 MAC issues)
It seems the OpenWrt-24.10-backported patches used in ab1c311 do not have any effect. I am back at around 1 hang per 5 reboots on average using this commit. The build artifacts can be found here: https://github.com/freifunkMUC/site-ffm/actions/runs/12723841862?pr=558 The patches that are being applied on top of the above commit should have no impact: Any suggestion what could have gone wrong? |
Question: do those hangs occur on (about) 10% of "simple" reboots, or do they need something like an autoupdater performing a job? |
Yes, it's on simple reboots. So if your community does regular reboots, then every now and then a device will stay offline until power-cycled by the user. This issue has been present in earlier versions as well, though. |
For 2021.1 I am very sure that there is no such instability, since we have at least 2 devices which do not have the issue. I have a 3600 on 2023.1.x on my desk which I could reboot via cronjob, if it helps. |
The 3600 on Gluon 2023.1.x has now done more than 500 reboots (one every 4-5 minutes) without any issue. |
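How much a clean run of 500 reboots says depends on the hang rate being tested for. A small worked calculation, assuming each reboot hangs independently with the same probability (an assumption, not something shown in this thread); the function name `prob_all_clean` is made up for illustration.

```python
def prob_all_clean(hang_rate, reboots):
    """Probability that `reboots` independent reboots all succeed."""
    return (1.0 - hang_rate) ** reboots


# Rates mentioned in this thread, used purely as illustrative inputs:
for label, rate in [("1 in 5", 1 / 5), ("10%", 0.10), ("1 in 705", 1 / 705)]:
    print(f"{label}: chance of 500 clean reboots = {prob_all_clean(rate, 500):.3g}")
```

Under that assumption, 500 clean reboots are practically impossible at a 10% hang rate, but would still happen roughly half the time if the device only hung about once in 705 reboots.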
I did another round of tests with larger amounts of reboots (>1k) and I got:
|
According to datasheet, on AR9344 the switch and switch analog need to be reset first before initiating a full reset. Resetting these systems fixes spurious reset hangs on Atheros AR9344 SoCs. Link: freifunk-gluon/gluon#2904 Signed-off-by: David Bauer <mail@david-bauer.net> (cherry picked from commit 144af32)
new update has been pushed by @blocktrron for this issue |
I was able to confirm that the modified patch from nrb (the 1/1/10ms variant) had no hangs after 2500+ reboots (within 72+ hours): I assume the recently upstreamed patch is identical to the tested one? |
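A run without a single hang does not prove a hang rate of zero, but it does put an upper limit on it. The sketch below computes that limit, assuming independent reboots with a constant hang probability; this statistical framing is an addition for illustration, not something the testers stated, and `hang_rate_upper_bound` is a made-up name.

```python
def hang_rate_upper_bound(clean_reboots, confidence=0.95):
    """Largest per-reboot hang rate still compatible with zero observed hangs.

    Solves (1 - rate) ** clean_reboots == 1 - confidence for rate.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_reboots)


bound = hang_rate_upper_bound(2500)
print(f"95% upper bound after 2500 clean reboots: {bound:.4%}")
```

For 2500 clean reboots the bound comes out at roughly 0.12%, i.e. on the order of one hang in 800 reboots or rarer.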
I don't know any C, but to me it looks different from the upstream patch; just look at them in two windows side-by-side |
Bug report
What is the problem?
Occasionally, devices (>10% of all devices) hang after an autoupdate and need a manual power cycle to reboot.
I managed to reproduce this while a serial cable was attached:
I am not sure if this is related to #185, but we were not able to reproduce it (yet) with a reboot.
What is the expected behaviour?
That the WDR4300 comes back up after an update.
Gluon Version:
v2022.1.2 and v2022.1.3
Probably also earlier v2022.x
We experienced similar behaviour during the initial v2022.1 deployment, but discarded it as "random".
It was more severe with the v2022.1.3 deployment (probably just because of chance) and I was able to reproduce it with a serial cable attached when upgrading from v2022.1.3 to v2022.1.4.
Site Configuration:
https://github.com/freifunkMUC/site-ffm/blob/833829e68f97e4781f175bdd688d7f498a7efe53/site.conf
Custom patches:
https://github.com/freifunkMUC/site-ffm/tree/833829e68f97e4781f175bdd688d7f498a7efe53/patches