-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Crash in i2c_bcm2835 in 4.10 due to NULL pointer dereference #89
Comments
If I boot with cpuidle.off=1 to see if it's cpu idle coming into play I get a very similar crash after about 30 mins of uptime too: [ 1819.171351] Unable to handle kernel NULL pointer dereference at virtual address 00000002 [ 1819.179873] pgd = c0204000 [ 1819.182713] [00000002] *pgd=00000000 [ 1819.186475] Internal error: Oops: 5 [#1] SMP ARM [ 1819.191322] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp llc ebtable_nat ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6tab le_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c crc32_arm_ce iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip 6table_filter ip6_tables smsc95xx usbnet mii vc4 drm_kms_helper drm fb_sys_fops syscopyarea sysfillrect sysimgblt dwc2 udc_core bcm2835_wdt bcm2835_rng bcm2835_dma leds_gpio i2c_dev mmc_block sdhci_iproc sdhci_p ltfm sdhci mmc_core pwm_bcm2835 i2c_bcm2835 [ 1819.253338] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-0.rc8.git0.1.fc26.armv7hl #1 [ 1819.261928] Hardware name: Generic DT based system [ 1819.266958] task: c0dc7a80 task.stack: c0dbe000 [ 1819.271725] PC is at bcm2835_i2c_isr+0x34/0x168 [i2c_bcm2835] [ 1819.283681] LR is at __handle_irq_event_percpu+0xbc/0x208 [ 1819.295326] pc : [] lr : [] psr: 000f0193 [ 1819.295326] sp : c0dbfe18 ip : 00000015 fp : ffffe000 [ 1819.319345] r10: c0dbfe5c r9 : ef084200 r8 : 00000000 [ 1819.330647] r7 : 00000014 r6 : 00000053 r5 : ef084200 r4 : ee932818 [ 1819.343199] r3 : 00000052 r2 : 00000000 r1 : ee932818 r0 : 00000053 [ 1819.355637] Flags: nzcv IRQs off FIQs on Mode SVC_32 ISA ARM Segment none [ 1819.368883] Control: 10c5383d Table: 2d0b806a DAC: 00000051 [ 1819.380618] Process swapper/0 (pid: 0, stack limit = 0xc0dbe220) [ 1819.392596] Stack: (0xc0dbfe18 to 0xc0dc0000) [ 1819.402860] fe00: eeb58900 ef084200 [ 1819.422574] fe20: 00000053 c02b0064 ef084200 c0dbfe5c ef084200 ef084200 ef084200 c0dbb298 [ 1819.442330] fe40: 00000014 ef02d880 c0dbe000 c09178dc c0dbff4c c02b01d0 ef084200 00000000 [ 1819.462737] fe60: ef084200 ef084200 ef084260 c02b0258 ef084200 ef084210 c0dbb298 c02b3bc0 [ 1819.483322] fe80: c0dc4300 00000000 c0dbb298 c02af330 c0dc4300 c055fe7c 00000000 c02af330 [ 1819.503893] fea0: 00000000 c02af914 c027b4f8 600f0013 ffffffff c0dbfef4 ef7010c0 c091c9cc [ 1819.524449] fec0: ef303a00 00000002 c0dbe000 0000ed88 ef303a00 ee93cc40 c0dbd0c0 c0dc7a80 [ 1819.544987] fee0: ef7010c0 2e944000 c09178dc c0dbff4c 00000000 c0dbff10 c09178dc c027b4f8 [ 1819.565526] ff00: 600f0013 ffffffff 00000051 00000000 00000001 c0275790 ffffffff 00000000 [ 1819.586413] ff20: 00000002 00000002 eeb8a018 c0dc7a80 ef303a00 ee93cc40 00000000 ef7010c0 [ 1819.607730] ff40: ef7010c0 c09261e0 c0dbffa4 c09178dc 000001a7 c0dc2ddc 410fd034 00000000 [ 1819.629411] ff60: 00000000 c02d3e74 c0917aac 00000000 8f0ef8f4 000001a7 ef6fdb00 8b14da52 [ 1819.651639] ff80: 000001a7 ffffe000 00000001 c0dc2d40 ffffffff c0d35e30 410fd034 00000000 [ 1819.674348] ffa0: c0dbffb4 c0917aac 000000bb 00000001 c0dbffbc c0917df0 00000000 c0299ff4 [ 1819.697300] ffc0: c0ed4cc0 c0c9ac38 ffffffff ffffffff 00000000 c0c9a690 00000000 c0d35e30 [ 1819.720615] ffe0: c0ed5694 c0dc2d5c c0d35e2c c0dc97cc 0020406a 0020807c 00000000 00000000 [ 1819.744346] [] (bcm2835_i2c_isr [i2c_bcm2835]) from [] (__handle_irq_event_percpu+0xbc/0x208) [ 1819.770266] [] (__handle_irq_event_percpu) from [] (handle_irq_event_percpu+0x20/0x5c) [ 1819.795564] [] (handle_irq_event_percpu) from [] (handle_irq_event+0x4c/0x70) [ 1819.820056] [] (handle_irq_event) from [] (handle_level_irq+0xf0/0x124) [ 1819.844022] [] (handle_level_irq) from [] (generic_handle_irq+0x20/0x30) [ 1819.868100] [] (generic_handle_irq) from [] (bcm2836_chained_handle_irq+0x38/0x40) [ 1819.893115] [] (bcm2836_chained_handle_irq) from [] (generic_handle_irq+0x20/0x30) [ 1819.918149] [] (generic_handle_irq) from [] (__handle_domain_irq+0xa8/0xbc) [ 1819.942571] [] (__handle_domain_irq) from [] (__irq_svc+0x6c/0x90) [ 1819.958545] [] (__irq_svc) from [] (finish_task_switch+0x104/0x21c) [ 1819.974554] [] (finish_task_switch) from [] (__schedule+0x614/0x74c) [ 1819.998167] [] (__schedule) from [] (schedule+0x98/0xbc) [ 1820.013104] [] (schedule) from [] (schedule_preempt_disabled+0x14/0x18) [ 1820.036858] [] (schedule_preempt_disabled) from [] (cpu_startup_entry+0x20/0x24) [ 1820.061448] [] (cpu_startup_entry) from [] (start_kernel+0x364/0x3dc) [ 1820.085057] Code: 1a000011 e3130002 0a00001c e5912290 (e1d220b2) [ 1820.107028] ---[ end trace b95fea60a12456d3 ]--- [ 1820.119235] Kernel panic - not syncing: Fatal exception in interrupt [ 1820.133194] CPU3: stopping [ 1820.143265] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G D 4.10.0-0.rc8.git0.1.fc26.armv7hl #1 [ 1820.167332] Hardware name: Generic DT based system [ 1820.179347] [] (unwind_backtrace) from [] (show_stack+0x18/0x1c) [ 1820.194425] [] (show_stack) from [] (dump_stack+0x80/0xa0) [ 1820.208896] [] (dump_stack) from [] (handle_IPI+0x158/0x2a4) [ 1820.223416] [] (handle_IPI) from [] (__irq_svc+0x6c/0x90) [ 1820.237567] [] (__irq_svc) from [] (arch_cpu_idle+0x24/0x40) [ 1820.251917] [] (arch_cpu_idle) from [] (do_idle+0xfc/0x1d8) [ 1820.266102] [] (do_idle) from [] (cpu_startup_entry+0x20/0x24) [ 1820.280493] [] (cpu_startup_entry) from [<00209a2c>] (0x209a2c) [ 1820.293997] CPU2: stopping [ 1820.303163] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G D 4.10.0-0.rc8.git0.1.fc26.armv7hl #1 [ 1820.325433] Hardware name: Generic DT based system [ 1820.336563] [] (unwind_backtrace) from [] (show_stack+0x18/0x1c) [ 1820.350811] [] (show_stack) from [] (dump_stack+0x80/0xa0) [ 1820.364489] [] (dump_stack) from [] (handle_IPI+0x158/0x2a4) [ 1820.378227] [] (handle_IPI) from [] (__irq_svc+0x6c/0x90) [ 1820.391621] [] (__irq_svc) from [] (arch_cpu_idle+0x24/0x40) [ 1820.405268] [] (arch_cpu_idle) from [] (do_idle+0xfc/0x1d8) [ 1820.418749] [] (do_idle) from [] (cpu_startup_entry+0x20/0x24) [ 1820.432444] [] (cpu_startup_entry) from [<00209a2c>] (0x209a2c) [ 1820.445238] CPU1: stopping [ 1820.453696] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G D 4.10.0-0.rc8.git0.1.fc26.armv7hl #1 [ 1820.474640] Hardware name: Generic DT based system [ 1820.485210] [] (unwind_backtrace) from [] (show_stack+0x18/0x1c) [ 1820.498929] [] (show_stack) from [] (dump_stack+0x80/0xa0) [ 1820.512112] [] (dump_stack) from [] (handle_IPI+0x158/0x2a4) [ 1820.525440] [] (handle_IPI) from [] (__irq_svc+0x6c/0x90) [ 1820.538431] [] (__irq_svc) from [] (arch_cpu_idle+0x24/0x40) [ 1820.551731] [] (arch_cpu_idle) from [] (do_idle+0xfc/0x1d8) [ 1820.564903] [] (do_idle) from [] (cpu_startup_entry+0x20/0x24) [ 1820.578345] [] (cpu_startup_entry) from [<00209a2c>] (0x209a2c) [ 1820.591031] ---[ end Kernel panic - not syncing: Fatal exception in interrupt |
Hi Peter, Could you please try this patch? |
Building a test kernel with it now |
So the "i2c-bcm2835 3f805000.i2c: i2c transfer timed out" errors seem to be gone but I get this crash on boot (this is a minimal install so no X/Wayland GUI). The system is still usable. The only difference is the patch. [ 52.462794] ------------[ cut here ]------------ [ 52.471797] ------------[ cut here ]------------ [ 52.471829] WARNING: CPU: 1 PID: 841 at lib/dma-debug.c:612 debug_dma_assert_idle+0x148/0x1e0 [ 52.471838] vc4-drm soc:gpu: DMA-API: cpu touching an active dma mapped cacheline [cln=0x00c421c0] [ 52.471842] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp llc ebtable_nat ip6table_raw ip6table_mangle ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_security iptable_raw iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c crc32_arm_ce iptable_security ebtable_filter ebtables ip6table_filter ip6_tables smsc95xx usbnet mii vc4 drm_kms_helper drm fb_sys_fops syscopyarea dwc2 sysfillrect sysimgblt udc_core leds_gpio bcm2835_dma bcm2835_rng bcm2835_wdt mmc_block sdhci_iproc sdhci_pltfm sdhci mmc_core pwm_bcm2835 i2c_bcm2835 i2c_dev [ 52.472058] CPU: 1 PID: 841 Comm: systemd-udevd Not tainted 4.10.0-0.rc8.git0.1.rpi1.fc25.armv7hl #1 [ 52.472062] Hardware name: Generic DT based system [ 52.472092] [] (unwind_backtrace) from [] (show_stack+0x18/0x1c) [ 52.472109] [] (show_stack) from [] (dump_stack+0x9c/0xd0) [ 52.472129] [] (dump_stack) from [] (__warn+0xe4/0x104) [ 52.472148] [] (__warn) from [] (warn_slowpath_fmt+0x3c/0x4c) [ 52.472167] [] (warn_slowpath_fmt) from [] (debug_dma_assert_idle+0x148/0x1e0) [ 52.472186] [] (debug_dma_assert_idle) from [] (wp_page_copy+0xfc/0x644) [ 52.472201] [] (wp_page_copy) from [] (do_wp_page+0x3a4/0x5ac) [ 52.472214] [] (do_wp_page) from [] (handle_mm_fault+0xd78/0xe3c) [ 52.472232] [] (handle_mm_fault) from [] (do_page_fault+0x240/0x3b4) [ 52.472250] [] (do_page_fault) from [] (do_DataAbort+0x40/0xc0) [ 52.472265] [] (do_DataAbort) from [] (__dabt_usr+0x40/0x60) [ 52.472271] Exception stack(0xec53ffb0 to 0xec53fff8) [ 52.472282] ffa0: 8142a9d8 81423580 81423660 00000000 [ 52.472296] ffc0: 8142a9d8 bed507f4 bed507f0 b6f4ac88 b6f4ac88 8142c2b8 8142bf50 8142c2b8 [ 52.472307] ffe0: 00000000 bed507e0 7f5ab3fc 7f5e05a4 a00f0010 ffffffff [ 52.472401] ---[ end trace e2930fa58afe35f3 ]--- [ 52.472405] Mapped at: [ 52.472482] vc4_bo_create+0x44/0x108 [vc4] [ 52.472546] vc4_overflow_mem_work+0x20/0x170 [vc4] [ 52.472555] process_one_work+0x324/0x84c [ 52.472563] worker_thread+0x294/0x3f4 [ 52.472572] kthread+0x130/0x14c [ 52.896654] WARNING: CPU: 3 PID: 465 at lib/dma-debug.c:612 debug_dma_assert_idle+0x148/0x1e0 [ 52.917240] vc4-drm soc:gpu: DMA-API: cpu touching an active dma mapped cacheline [cln=0x00c421c0] [ 52.938505] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp llc ebtable_nat ip6table_raw ip6table_mangle ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_security iptable_raw iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c crc32_arm_ce iptable_security ebtable_filter ebtables ip6table_filter ip6_tables smsc95xx usbnet mii vc4 drm_kms_helper drm fb_sys_fops syscopyarea dwc2 sysfillrect sysimgblt udc_core leds_gpio bcm2835_dma bcm2835_rng bcm2835_wdt mmc_block sdhci_iproc sdhci_pltfm sdhci mmc_core pwm_bcm2835 i2c_bcm2835 i2c_dev [ 53.050163] CPU: 3 PID: 465 Comm: systemd-udevd Tainted: G W 4.10.0-0.rc8.git0.1.rpi1.fc25.armv7hl #1 [ 53.073586] Hardware name: Generic DT based system [ 53.084817] [] (unwind_backtrace) from [] (show_stack+0x18/0x1c) [ 53.099196] [] (show_stack) from [] (dump_stack+0x9c/0xd0) [ 53.113038] [] (dump_stack) from [] (__warn+0xe4/0x104) [ 53.126566] [] (__warn) from [] (warn_slowpath_fmt+0x3c/0x4c) [ 53.140765] [] (warn_slowpath_fmt) from [] (debug_dma_assert_idle+0x148/0x1e0) [ 53.162452] [] (debug_dma_assert_idle) from [] (wp_page_copy+0xfc/0x644) [ 53.184062] [] (wp_page_copy) from [] (do_wp_page+0x3a4/0x5ac) [ 53.198517] [] (do_wp_page) from [] (handle_mm_fault+0xd78/0xe3c) [ 53.213312] [] (handle_mm_fault) from [] (do_page_fault+0x240/0x3b4) [ 53.235632] [] (do_page_fault) from [] (do_DataAbort+0x40/0xc0) [ 53.250651] [] (do_DataAbort) from [] (__dabt_usr+0x40/0x60) [ 53.265603] Exception stack(0xc2291fb0 to 0xc2291ff8) [ 53.277933] 1fa0: 8142a518 00000009 00000024 00000000 [ 53.300414] 1fc0: b6df17a4 00000020 00020338 7f5abb04 00000000 8142a518 813d5cf4 8142c2b8 [ 53.322682] 1fe0: bed507bc bed507f0 00000003 b6d0f184 20070010 ffffffff [ 53.336720] ---[ end trace e2930fa58afe35f4 ]--- [ 53.348371] Mapped at: [ 53.357587] vc4_bo_create+0x44/0x108 [vc4] [ 53.368574] vc4_overflow_mem_work+0x20/0x170 [vc4] [ 53.380095] process_one_work+0x324/0x84c [ 53.390631] worker_thread+0x294/0x3f4 [ 53.400620] kthread+0x130/0x14c |
And after a while of running I've got this one (which I expected given the warning in the patch) but posting it for reference: [ 2168.372043] ------------[ cut here ]------------ [ 2168.382726] WARNING: CPU: 0 PID: 317 at drivers/i2c/busses/i2c-bcm2835.c:199 bcm2835_i2c_isr+0x80/0x190 [i2c_bcm2835] [ 2168.405576] Got interrupt, but curr_msg is NULL [ 2168.416141] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp llc ebtable_nat ip6table_raw ip6table_mangle ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_security iptable_raw iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c crc32_arm_ce iptable_security ebtable_filter ebtables ip6table_filter ip6_tables smsc95xx usbnet mii vc4 drm_kms_helper drm fb_sys_fops syscopyarea dwc2 sysfillrect sysimgblt udc_core leds_gpio bcm2835_dma bcm2835_rng bcm2835_wdt mmc_block sdhci_iproc sdhci_pltfm sdhci mmc_core pwm_bcm2835 i2c_bcm2835 i2c_dev [ 2168.526466] CPU: 0 PID: 317 Comm: kworker/0:2 Tainted: G W 4.10.0-0.rc8.git0.1.rpi1.fc25.armv7hl #1 [ 2168.549331] Hardware name: Generic DT based system [ 2168.560469] Workqueue: events output_poll_execute [drm_kms_helper] [ 2168.572952] [] (unwind_backtrace) from [] (show_stack+0x18/0x1c) [ 2168.587086] [] (show_stack) from [] (dump_stack+0x9c/0xd0) [ 2168.600643] [] (dump_stack) from [] (__warn+0xe4/0x104) [ 2168.613905] [] (__warn) from [] (warn_slowpath_fmt+0x3c/0x4c) [ 2168.627668] [] (warn_slowpath_fmt) from [] (bcm2835_i2c_isr+0x80/0x190 [i2c_bcm2835]) [ 2168.649538] [] (bcm2835_i2c_isr [i2c_bcm2835]) from [] (__handle_irq_event_percpu+0x118/0x4a8) [ 2168.672485] [] (__handle_irq_event_percpu) from [] (handle_irq_event_percpu+0x20/0x5c) [ 2168.695293] [] (handle_irq_event_percpu) from [] (handle_irq_event+0x40/0x64) [ 2168.717857] [] (handle_irq_event) from [] (handle_level_irq+0xe4/0x118) [ 2168.740506] [] (handle_level_irq) from [] (generic_handle_irq+0x20/0x30) [ 2168.763426] [] (generic_handle_irq) from [] (bcm2836_chained_handle_irq+0x3c/0x44) [ 2168.787525] [] (bcm2836_chained_handle_irq) from [] (generic_handle_irq+0x20/0x30) [ 2168.811673] [] (generic_handle_irq) from [] (__handle_domain_irq+0xa8/0xbc) [ 2168.835207] [] (__handle_domain_irq) from [] (__irq_svc+0x70/0x98) [ 2168.850739] [] (__irq_svc) from [] (_raw_spin_unlock_irqrestore+0x60/0x64) [ 2168.874053] [] (_raw_spin_unlock_irqrestore) from [] (__debug_object_init+0x2ac/0x374) [ 2168.898587] [] (__debug_object_init) from [] (init_timer_on_stack_key+0x24/0x5c) [ 2168.922919] [] (init_timer_on_stack_key) from [] (schedule_timeout+0x84/0x4e0) [ 2168.947281] [] (schedule_timeout) from [] (wait_for_common+0x10c/0x1a4) [ 2168.971395] [] (wait_for_common) from [] (bcm2835_i2c_xfer+0x104/0x190 [i2c_bcm2835]) [ 2168.996971] [] (bcm2835_i2c_xfer [i2c_bcm2835]) from [] (__i2c_transfer+0x3dc/0x9cc) [ 2169.022448] [] (__i2c_transfer) from [] (i2c_transfer+0x90/0xd8) [ 2169.038630] [] (i2c_transfer) from [] (drm_do_probe_ddc_edid+0xa8/0xf8 [drm]) [ 2169.063870] [] (drm_do_probe_ddc_edid [drm]) from [] (drm_probe_ddc+0x1c/0x2c [drm]) [ 2169.089550] [] (drm_probe_ddc [drm]) from [] (vc4_hdmi_connector_detect+0x4c/0x74 [vc4]) [ 2169.115441] [] (vc4_hdmi_connector_detect [vc4]) from [] (output_poll_execute+0xfc/0x1a4 [drm_kms_helper]) [ 2169.142904] [] (output_poll_execute [drm_kms_helper]) from [] (process_one_work+0x324/0x84c) [ 2169.168964] [] (process_one_work) from [] (worker_thread+0x294/0x3f4) [ 2169.192909] [] (worker_thread) from [] (kthread+0x130/0x14c) [ 2169.208354] [] (kthread) from [] (ret_from_fork+0x14/0x24) [ 2169.223530] ---[ end trace e2930fa58afe35f5 ]--- |
This patch doesn't fix the timeout issue. I've heard they aren't reliable to reproduce. Could you please attach your kernel config? |
@lategoodbye define a "stock kernel". This is the Fedora kernel which is pretty close to upstream vanilla 4.10 linus kernel. We carry a couple of patches for the RPi but nothing around i2c and only a single patch for vc4 (to fix a OOPs which anolt has already sent for upstream). The kernel config is currently https://pbrobinson.fedorapeople.org/config-4.10.0-0.rc8.git0.1.fc26.armv7hl This config works fine with <= 4.9 and has only appeared in 4.10. I did pull back some of the i2c patches to 4.9 to try and fix the timeout issue but we saw the same lock ups on the RPi3 so I dropped them as it was the quickest way to move it back to stability. In terms of reproducible, it occurs fairly regularly on a Fedora 25 minimal (or GUI based) install with the 4.10 kernel. In the case of a minimal install it's basically just sitting there doing nothing. With that it'll basically crash within 2 hours, generally a lot less and often within a few minutes. With the i2c patches I back ported to 4.9 (https://pbrobinson.fedorapeople.org/bcm283x-fixes-i2c.patch) we had dozens of reports of issues so I would say at least for Fedora it's very reproducible. |
What is the procedure under which this happens - i.e. what is the driver that is producing this i2c message? What strikes me as strange is that curr_msg is never set to NULL in the code so besides when initializing the I2C driver Does the 4.8 version of the driver work on 4.10 without any issues when you replace the 4.10 version? |
Well the difference between 4.9 and 4.10 are the following patches. If you revert them I fully expect it to operate fine. When I applied these patches to 4.9 we did see the issue on 4.9: 9446f62 i2c: bcm2835: Add support for dynamic clock |
Looking at the traces some more - it seems to be related to hdmi scanning: So I wonder if HDMI i2c gets "probed" and later "removed" in this sequence somehow? Maybe we need to clean BCM2835_I2C_S and Fifos during initialization. |
I believe that i2c is used on the Pi3 for EDID detection via an expander and that might explain the issue happening on the RPi3 and not the RPi2. From the commit message of 9d44abb "On Pi0/1/2, we use an external GPIO line for hotplug detection, since So it might be using i2c0 but @anholt might be able to comment further with more details |
A long time ago Gerd was submitting the pi3 patch series with the fxl6408 gpio expander driver present, which would fight over i2c0 (iirc) with the firmware. I think that patch isn't in your tree, though, given that you've got that series of 10s apart i2c timeout log messages (which we still need to delete) that look like our fallback pi3 monitor detection. So, a curr_msg = NULL means that we haven't even started a transfer on the i2c that's catching the interrupt, since we set it early in bcm2835_i2c_xfer, and don't ever NULL it back out. Here's my guess for what's going on: The firmware is doing activity on i2c0 talking to fxl6408 (undervoltage detection). Generally, the firmware's irq chain is much shorter than Linux's, so it tends to win the race to process the interrupt for i2c0. Occasionally, though, something else is going on and the firmware loses, at which point Linux gets into its irq handler and looks for the current message on i2c0. @popcornmix has told me that the firmware will hit i2c0 on all boards at various times: "e.g. following under-voltage, over-temperature, a turbo request from arm, or a hard GPU job (e.g. MVC video decode triggers turbo mode)". Some of those don't apply to upstream Linux today (VCHI not enabled means no video decode, and we don't have turbo requests because the cpufreq driver never landed), but I think this also forgot instances like "the 7" display is attached and the firmware is hitting i2c0 for touchscreen". I think we need to keep i2c0 status="disabled" in the bcm2835-rpi.dtsi, so that we don't ever try to play on their bus. It would be nice if we could get some confirmation that i2c0 is really the bus that's crashing -- do you think you could get some printk in here for figuring out which one it is? |
If you've got a patch to add the printk to the places you want I'm happy to do a build and provide any information it outputs. |
I think this should do it. Without the curr_msg check, it gave me a bunch of nicely formatted DONE messages. |
@anholt with it running with that patch for 90 mins I've had the following two debugs. I'll leave it running over night (or longer if it's useful). Do they provide confirmation as to whether it's i2c0 or not? [ 4752.959711] i2c-bcm2835 3f205000.i2c: Processing DONE interrupt with no msg |
3f205000.i2c is i2c0. Maybe we should disable the Irq while i2c is not used to avoid such race issues? I guess the only safe way is disabling i2c0 on the rpi3 in the dt. |
I think we should fix the pin configuration for i2c and pwm on RPi 3 before we play with those irqs. |
Is there a patch proposed for the i2c and pwm issues on the RPi3? |
I'll prepare a patch. Please give me some time. |
So much for me thinking that I had hardened the isr against problems... Well well, yes, disabling i2c0 in mainline is a first step. /* Some comment about the firmware and i2c0 */
if (WARN_ON_ONCE(!i2c_dev->curr_msg))
return IRQ_NONE; The downstream 2708 driver handles this situation: https://github.com/raspberrypi/linux/blob/rpi-4.9.y/drivers/i2c/busses/i2c-bcm2708.c#L192 We should probably set I think we need a comment somehwere that explains exactly in which situations the firmware uses i2c0. And I guess this means that pre 4.10 the driver did occasionally acknowledge the i2c0 interrupts which the firmware was waiting for, silently timing out it's transfers. Thanks for fixing it up @lategoodbye. |
@notro |
I prefer Eric's too, the log warning is much less scary for the average user (which means less people filing bug reports). I've pushed Eric's patch into the Fedora 4.10 kernel for a work around to the hard crash while we work out the best course of action in terms of DT, irq or what ever the final outcome is. |
I think we should bail out at once if we get an unexpected interrupt, since that means the firmware is using the device. And if I'm not mistaken, the interrupt will fire again since we don't acknowledge it and the firmware will probably catch it first the next time. dev_warn is ok by me, maybe dev_warn_ratelimited if we don't ack the interrupt. |
We shouldn't try to harden our isr against spurious IRQs, we should just disable the controller in the DT. We don't get to drive it, and if we ever do anything with it from Linux, only suffering will come of it. |
Okay, i send a first patch series to fix the NULL pointer dereference and the invalid pin configuration (without disabling i2c0). If the i2c timeouts still occurs i would add an additional patch to disable i2c0. Since the issue would occur with 4.9 DT blobs and a 4.10 kernel, we also need to fix i2c-bcm2835. |
When aacraid init fails with "AAC0: adapter self-test failed.", shutdown leads to UBSAN warning and then oops: [154316.118423] ================================================================================ [154316.118508] UBSAN: Undefined behaviour in drivers/scsi/scsi_lib.c:2328:27 [154316.118566] member access within null pointer of type 'struct Scsi_Host' [154316.118631] CPU: 2 PID: 14530 Comm: reboot Tainted: G W 4.15.0-dirty anholt#89 [154316.118701] Hardware name: Hewlett Packard HP NetServer/HP System Board, BIOS 4.06.46 PW 06/25/2003 [154316.118774] Call Trace: [154316.118848] dump_stack+0x48/0x65 [154316.118916] ubsan_epilogue+0xe/0x40 [154316.118976] __ubsan_handle_type_mismatch+0xfb/0x180 [154316.119043] scsi_block_requests+0x20/0x30 [154316.119135] aac_shutdown+0x18/0x40 [aacraid] [154316.119196] pci_device_shutdown+0x33/0x50 [154316.119269] device_shutdown+0x18a/0x390 [...] [154316.123435] BUG: unable to handle kernel NULL pointer dereference at 000000f4 [154316.123515] IP: scsi_block_requests+0xa/0x30 This is because aac_shutdown() does struct Scsi_Host *shost = pci_get_drvdata(dev); scsi_block_requests(shost); and that assumes shost has been assigned with pci_set_drvdata(). However, pci_set_drvdata(pdev, shost) is done in aac_probe_one() far after bailing out with error from calling the init function ((*aac_drivers[index].init)(aac)), and when the init function fails, no error is returned from aac_probe_one() so PCI layer assumes there is driver attached, and tries to shut it down later. Fix it by returning error from aac_probe_one() when card-specific init function fails. This fixes reboot on my HP NetRAID-4M with dead battery. Signed-off-by: Meelis Roos <mroos@linux.ee> Reviewed-by: Dave Carroll <david.carroll@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
syzkaller was able to trigger the following panic for AF_XDP: BUG: KASAN: null-ptr-deref in atomic64_sub include/asm-generic/atomic-instrumented.h:144 [inline] BUG: KASAN: null-ptr-deref in atomic_long_sub include/asm-generic/atomic-long.h:199 [inline] BUG: KASAN: null-ptr-deref in xdp_umem_unaccount_pages.isra.4+0x3d/0x80 net/xdp/xdp_umem.c:135 Write of size 8 at addr 0000000000000060 by task syz-executor246/4527 CPU: 1 PID: 4527 Comm: syz-executor246 Not tainted 4.17.0+ #89 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1b9/0x294 lib/dump_stack.c:113 kasan_report_error mm/kasan/report.c:352 [inline] kasan_report.cold.7+0x6d/0x2fe mm/kasan/report.c:412 check_memory_region_inline mm/kasan/kasan.c:260 [inline] check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267 kasan_check_write+0x14/0x20 mm/kasan/kasan.c:278 atomic64_sub include/asm-generic/atomic-instrumented.h:144 [inline] atomic_long_sub include/asm-generic/atomic-long.h:199 [inline] xdp_umem_unaccount_pages.isra.4+0x3d/0x80 net/xdp/xdp_umem.c:135 xdp_umem_reg net/xdp/xdp_umem.c:334 [inline] xdp_umem_create+0xd6c/0x10f0 net/xdp/xdp_umem.c:349 xsk_setsockopt+0x443/0x550 net/xdp/xsk.c:531 __sys_setsockopt+0x1bd/0x390 net/socket.c:1935 __do_sys_setsockopt net/socket.c:1946 [inline] __se_sys_setsockopt net/socket.c:1943 [inline] __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1943 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x49/0xbe In xdp_umem_reg() the call to xdp_umem_account_pages() passed with CAP_IPC_LOCK where we didn't need to end up charging rlimit on memlock for the current user and therefore umem->user continues to be NULL. Later on through fault injection syzkaller triggered a failure in either umem->pgs or umem->pages allocation such that we bail out and undo accounting in xdp_umem_unaccount_pages() where we eventually hit the panic since it tries to deref the umem->user. The code is pretty close to mm_account_pinned_pages() and mm_unaccount_pinned_pages() pair and potentially could reuse it even in a later cleanup, and it appears that the initial commit c0c77d8 ("xsk: add user memory registration support sockopt") got this right while later follow-up introduced the bug via a49049e ("xsk: simplified umem setup"). Fixes: a49049e ("xsk: simplified umem setup") Reported-by: syzbot+979217770b09ebf5c407@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
…from a pageblock syzbot reported the following error from a tree with a head commit of baf76f0 ("slip: make slhc_free() silently accept an error pointer") BUG: unable to handle kernel paging request at ffffea0003348000 #PF error: [normal kernel read fault] PGD 12c3f9067 P4D 12c3f9067 PUD 12c3f8067 PMD 0 Oops: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 28916 Comm: syz-executor.2 Not tainted 5.1.0-rc6+ #89 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:constant_test_bit arch/x86/include/asm/bitops.h:314 [inline] RIP: 0010:PageCompound include/linux/page-flags.h:186 [inline] RIP: 0010:isolate_freepages_block+0x1c0/0xd40 mm/compaction.c:579 Code: 01 d8 ff 4d 85 ed 0f 84 ef 07 00 00 e8 29 00 d8 ff 4c 89 e0 83 85 38 ff ff ff 01 48 c1 e8 03 42 80 3c 38 00 0f 85 31 0a 00 00 <4d> 8b 2c 24 31 ff 49 c1 ed 10 41 83 e5 01 44 89 ee e8 3a 01 d8 ff RSP: 0018:ffff88802b31eab8 EFLAGS: 00010246 RAX: 1ffffd4000669000 RBX: 00000000000cd200 RCX: ffffc9000a235000 RDX: 000000000001ca5e RSI: ffffffff81988cc7 RDI: 0000000000000001 RBP: ffff88802b31ebd8 R08: ffff88805af700c0 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffea0003348000 R13: 0000000000000000 R14: ffff88802b31f030 R15: dffffc0000000000 FS: 00007f61648dc700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffea0003348000 CR3: 0000000037c64000 CR4: 00000000001426e0 Call Trace: fast_isolate_around mm/compaction.c:1243 [inline] fast_isolate_freepages mm/compaction.c:1418 [inline] isolate_freepages mm/compaction.c:1438 [inline] compaction_alloc+0x1aee/0x22e0 mm/compaction.c:1550 There is no reproducer and it is difficult to hit -- 1 crash every few days. The issue is very similar to the fix in commit 6b0868c ("mm/compaction.c: correct zone boundary handling when resetting pageblock skip hints"). When isolating free pages around a target pageblock, the boundary handling is off by one and can stray into the next pageblock. Triggering the syzbot error requires that the end of pageblock is section or zone aligned, and that the next section is unpopulated. A more subtle consequence of the bug is that pageblocks were being improperly used as migration targets which potentially hurts fragmentation avoidance in the long-term one page at a time. A debugging patch revealed that it's definitely possible to stray outside of a pageblock which is not intended. While syzbot cannot be used to verify this patch, it was confirmed that the debugging warning no longer triggers with this patch applied. It has also been confirmed that the THP allocation stress tests are not degraded by this patch. Link: http://lkml.kernel.org/r/20190510182124.GI18914@techsingularity.net Fixes: e332f74 ("mm, compaction: be selective about what pageblocks to clear skip hints") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: syzbot+d84c80f9fe26a0f7a734@syzkaller.appspotmail.com Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Qian Cai <cai@lca.pw> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> # v5.1+ Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After some time of uptime on the Raspberry Pi 3 we're seeing a crash with 4.10 in the i2c_bcm2835 driver. The RPI2 is stable. It happens even on a minimal install with no GUI running.
The crash I captured on the serial console is:
The text was updated successfully, but these errors were encountered: