otg_fs / usbfs potential lock up #59

Open
Codetector1374 opened this issue Nov 1, 2024 · 18 comments

@Codetector1374 (Member) commented Nov 1, 2024

Writing this more or less as a PSA: the otg_fs / usbfs driver seems to be able to enter a locked-up state where an interrupt fires and calls the waker, but the task is never actually woken.

Currently debugging this issue; feel free to help if interested.

Codetector1374 added the bug label Nov 1, 2024
Codetector1374 self-assigned this Nov 1, 2024
@ExplodingWaffle (Contributor)

just to check: are you sure the waker is not being woken, rather than that it is being woken but returns Pending again even though the event you were waiting for should have triggered?

i had a quick look at the usbfs code for the first time just now, and one thing that sticks out to me is that you're checking interrupt flags etc. in the poll_fns. the reason i did that in the interrupt handler instead, in my stab at usbhs, is that the hardware is very prone to letting you miss interrupts if you aren't careful: when you receive a transfer-complete interrupt, you will "lose" the information stored in the registers (including rx_len) if another transfer occurs, which happens fairly often. so the poll_fn never sees the interrupt it's looking for (the hardware sets the endpoint to NAK after the ACK is sent… i thought? it doesn't say this in the RM now that i'm reading it. well, if it did, you won't even get the next transfer's interrupt, which would have a screwed-up data toggle anyway)

in fact, i suppose it is possible for the interrupt handler to be "preempted" by the hardware: say you read int_st, then another transfer completes, then you read rx_len: that rx_len would be for the wrong transfer. race conditions like this should be checked for by clearing the interrupt flag early, then making sure it is still clear at the end; if not, report an error for the endpoint (we can't be sure what we read was correct, since we might have raced) and loop back around to handle the new transfer (or we could just let the interrupt handler fire again… but i don't really trust the interrupt latency on risc-v for that)
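something like this (a minimal sketch of the pattern i mean; all the register-access helpers are illustrative stubs, not the real ch32-hal API):

```rust
// illustrative stubs standing in for real register reads/writes
fn clear_transfer_flag() { /* write-1-to-clear int_fg.transfer */ }
fn transfer_flag_set() -> bool { /* read int_fg.transfer */ false }
fn read_int_st() -> u8 { 0 }
fn read_rx_len() -> usize { 0 }
fn handle_transfer(_st: u8, _len: usize) {}
fn signal_endpoint_error(_st: u8) {}

fn on_usb_interrupt() {
    loop {
        // clear the flag *before* reading the per-transfer registers, so
        // a transfer completing mid-read raises it again and we notice
        clear_transfer_flag();

        let st = read_int_st();
        let len = read_rx_len();

        if !transfer_flag_set() {
            // no new transfer raced us: st/len belong to the same transfer
            handle_transfer(st, len);
            break;
        }

        // a new transfer completed while we were reading, so st/len may
        // be torn: flag an error and loop to handle the new transfer
        signal_endpoint_error(st);
    }
}
```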

two other things:

debug_assert_eq!(len, buf.len());
this assertion is wrong. the host is 100% allowed to send a shorter packet than we were expecting: that's what this check is for: https://github.com/embassy-rs/embassy/blob/6e0b08291b63a0da8eba9284869d1d046bc5dabb/embassy-usb/src/lib.rs#L418
and also:
token => {

checks like this one are not needed, since it should totally be possible to have an IN and an OUT endpoint on the same number (except for control endpoints)

@Codetector1374 (Member, Author) commented Nov 1, 2024

First of all, thank you so much for reviewing our code! We could really use an extra pair of eyes.

just to check: are you sure the waker is not being woken, rather than that it is being woken but returns Pending again even though the event you were waiting for should have triggered?

Well, I believe so, but I will double-check later today. We added some prints; the prints acquire a critical section, so they should be "globally ordered". (We are debugging an IN endpoint write, where we have a more or less consistent repro.)

Code with more prints:
https://github.com/vapor-keeb/ch32-hal/tree/0b97d20327f53677d55baf3dc784002c2d7a730a/src/otg_fs
Resulting log:
https://gist.github.com/Codetector1374/ddccb83b2816a4a1ba06675c170a0abb

[TRACE]endpoint.rs:86  : endpoint 1 IN
[TRACE]endpoint.rs:129 : suprious wakeup
[TRACE]    mod.rs:60  : irq: 6a
[TRACE]    mod.rs:86  : Walking up: 1
[TRACE]   main.rs:74  : <0,1> changed to High

Above is where the "bug" shows up. Seeing both endpoint 1 IN and suprious wakeup (I know it is not actually spurious; a poll_fn is expected to be able to run more than once, e.g. because another future in the same join completed), we know we have for sure registered ourselves with the waker. Then we see irq 6a, which would be a transfer IRQ, and it did in fact wake up endpoint 1. However, we never see another suprious wakeup (or the request actually completing).

This is what led me to believe there is a lost wakeup.

checking interrupt flags etc in the poll_fns

Our understanding has been that with int_busy (below), the hardware will automatically NAK all further host requests, without disturbing any registers, until int_fg.transfer is cleared, which we do in our poll_fns.

w.set_int_busy(true);

So I have actually never seen a case (yet) where the registers are clobbered or lost on otg_fs, as long as int_busy is set (which the WCH example code also does).
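To spell out the flow we rely on, here is a rough sketch (the register helpers and names are illustrative stand-ins, not the actual driver code):

```rust
use core::future::poll_fn;
use core::task::{Poll, Waker};

// illustrative stubs for register access and the endpoint's waker
fn register_waker(_w: &Waker) { /* e.g. AtomicWaker::register */ }
fn transfer_flag_set() -> bool { /* read int_fg.transfer */ false }
fn read_rx_len() -> usize { 0 }
fn clear_transfer_flag() { /* hw stops NAKing after this */ }

// With int_busy set, the hardware NAKs the host and leaves rx_len etc.
// untouched from the moment int_fg.transfer is raised until we clear it,
// so the bookkeeping can safely live in the poll_fn instead of the ISR.
async fn wait_for_transfer() -> usize {
    poll_fn(|cx| {
        register_waker(cx.waker());
        if transfer_flag_set() {
            let len = read_rx_len(); // still valid: hw is NAKing the host
            clear_transfer_flag();   // now the next transfer may proceed
            Poll::Ready(len)
        } else {
            Poll::Pending
        }
    })
    .await
}
```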

this assertion is wrong.

That is a good catch. I don't think we actually thought that through when we wrote the assertion.

checks like this one are not needed since it should totally be possible to have an IN and OUT endpoint on the same number (except for control endpoints)

I see, I did in fact misunderstand the USB spec there. I guess this is "technically not broken yet", but good point, we should fix that in the future.

@ExplodingWaffle (Contributor)

Our understanding has been that with int_busy (below), the hardware will automatically NAK all further host requests, without disturbing any registers, until int_fg.transfer is cleared, which we do in our poll_fns.

Ah, I'd glossed over int_busy. That makes sense (as long as there isn't a bug with it or something)

As for your problem: I'm not really sure. maybe the waker is not registered somehow?! unfortunately AtomicWaker doesn't give you a way to check. my only other guess is an executor bug, but that seems unlikely. might have a proper look tomorrow; if I can repro, I have a usb analyzer and might be able to get a debugger working.

@Codetector1374 (Member, Author)

I have an analyzer too; I don't believe there is much of interest going on on the bus. If you are interested, I'm more than happy to schedule an interactive session / voice call where we can debug this.

@Dummyc0m (Member) commented Nov 2, 2024

Very interesting: turning off integrated-timer in embassy-executor and adding generic-queue in embassy-time made the issue no longer reproducible...

Still unsure what's going on TBH.

@Dummyc0m (Member) commented Nov 2, 2024

In embassy-executor/src/raw/mod.rs:

/// Wake a task by `TaskRef`.
///
/// You can obtain a `TaskRef` from a `Waker` using [`task_from_waker`].
pub fn wake_task(task: TaskRef) {
    let header = task.header();
    if header.state.run_enqueue() {
        // We have just marked the task as scheduled, so enqueue it.
        unsafe {
            let executor = header.executor.get().unwrap_unchecked();
            executor.enqueue(task);
        }
    }
}

/// Wake a task by `TaskRef` without calling pend.
///
/// You can obtain a `TaskRef` from a `Waker` using [`task_from_waker`].
pub fn wake_task_no_pend(task: TaskRef) {
    let header = task.header();
    if header.state.run_enqueue() {
        // We have just marked the task as scheduled, so enqueue it.
        unsafe {
            let executor = header.executor.get().unwrap_unchecked();
            executor.run_queue.enqueue(task);
        }
    }
}

It looks like the result of executor.run_queue.enqueue is not checked. @Codetector1374 found that we have the flag set in the state but no task on the run_queue.

Edit: just kidding, the run_queue.enqueue doesn't return an error.
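For context, the run queue is (roughly) an intrusive atomic stack, so a lost wakeup here would mean the CAS-based push itself misbehaved. A simplified sketch of that kind of push (illustrative, not embassy's exact code) shows why broken LR/SC would strand a task: run_enqueue has already set the "queued" state bit, so if a racing push is silently overwritten, the task never gets polled and never gets re-enqueued:

```rust
use core::sync::atomic::{AtomicPtr, Ordering};

struct Node {
    next: AtomicPtr<Node>, // intrusive link inside the task header
}

struct RunQueue {
    head: AtomicPtr<Node>,
}

impl RunQueue {
    /// Push a node, returning true if the queue was previously empty.
    /// With correct LR/SC, the retry loop can never lose a node; if a
    /// store-conditional can succeed when it should have failed, a
    /// concurrent push can be overwritten and its task stranded.
    unsafe fn enqueue(&self, node: *mut Node) -> bool {
        let mut prev = self.head.load(Ordering::Relaxed);
        loop {
            (*node).next.store(prev, Ordering::Relaxed);
            match self.head.compare_exchange_weak(
                prev,
                node,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return prev.is_null(),
                Err(p) => prev = p, // raced with another push: retry
            }
        }
    }
}
```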

@Codetector1374 (Member, Author)

Hmmm, looks like either we have a hardware bug in the atomic instructions, or there is some bug in embassy_executor.

Basically, TaskHeader::state is 3, which indicates we are "enqueued", but the run_queue in the executor looks empty.

(gdb) p *(0x20000060 as *const embassy_executor::raw::TaskHeader)
$2 = embassy_executor::raw::TaskHeader {
  state: embassy_executor::raw::state::State {
    state: core::sync::atomic::AtomicU32 {
      v: core::cell::UnsafeCell<u32> {
        value: 3
      }
    }
  },
  run_queue_item: embassy_executor::raw::run_queue::RunQueueItem {
    next: embassy_executor::raw::util::SyncUnsafeCell<core::option::Option<embassy_executor::raw::TaskRef>> {
      value: core::cell::UnsafeCell<core::option::Option<embassy_executor::raw::TaskRef>> {
        value: core::option::Option<embassy_executor::raw::TaskRef>::None
      }
    }
  },
  executor: embassy_executor::raw::util::SyncUnsafeCell<core::option::Option<&embassy_executor::raw::SyncExecutor>> {
    value: core::cell::UnsafeCell<core::option::Option<&embassy_executor::raw::SyncExecutor>> {
      value: core::option::Option<&embassy_executor::raw::SyncExecutor>::Some(0x20007fec)
    }
  },
  poll_fn: embassy_executor::raw::util::SyncUnsafeCell<core::option::Option<unsafe fn(embassy_executor::raw::TaskRef)>> {
    value: core::cell::UnsafeCell<core::option::Option<unsafe fn(embassy_executor::raw::TaskRef)>> {
      value: core::option::Option<unsafe fn(embassy_executor::raw::TaskRef)>::Some(0x5152 <embassy_executor::raw::TaskStorage<vapor_keeb::____embassy_main_task::{async_fn_env#0}>::poll<vapor_keeb::____embassy_main_task::{async_fn_env#0}>>)
    }
  },
  expires_at: embassy_executor::raw::util::SyncUnsafeCell<u64> {
    value: core::cell::UnsafeCell<u64> {
      value: 18446744073709551615
    }
  },
  timer_queue_item: embassy_executor::raw::timer_queue::TimerQueueItem {
    next: embassy_executor::raw::util::SyncUnsafeCell<core::option::Option<embassy_executor::raw::TaskRef>> {
      value: core::cell::UnsafeCell<core::option::Option<embassy_executor::raw::TaskRef>> {
        value: core::option::Option<embassy_executor::raw::TaskRef>::None
      }
    }
  }
}



(gdb) p *(0x20007fec as *const embassy_executor::raw::SyncExecutor)
$4 = embassy_executor::raw::SyncExecutor {
  run_queue: embassy_executor::raw::run_queue::RunQueue {
    head: core::sync::atomic::AtomicPtr<embassy_executor::raw::TaskHeader> {
      p: core::cell::UnsafeCell<*mut embassy_executor::raw::TaskHeader> {
        value: 0x0 <defmt::export::fmt<embassy_usb::class::hid::ReportId>>
      }
    }
  },
  pender: embassy_executor::raw::Pender (
    0x0 <defmt::export::fmt<embassy_usb::class::hid::ReportId>>
  ),
  timer_queue: embassy_executor::raw::timer_queue::TimerQueue {
    head: embassy_executor::raw::util::SyncUnsafeCell<core::option::Option<embassy_executor::raw::TaskRef>> {
      value: core::cell::UnsafeCell<core::option::Option<embassy_executor::raw::TaskRef>> {
        value: core::option::Option<embassy_executor::raw::TaskRef>::None
      }
    }
  },
  alarm: embassy_time_driver::AlarmHandle {
    id: 0
  }
}

@Dummyc0m (Member) commented Nov 2, 2024

gg

@Codetector1374 (Member, Author) commented Nov 2, 2024

Ugh.... sounds like we don't have atomics....

Thanks..... WCH

@Dummyc0m (Member) commented Nov 2, 2024

To elaborate: we switched the state/run_queue impl within embassy-executor from atomics to critical sections, and the problem stopped reproducing.

It could be that the atomic impl is wrong, but neither of us could come up with a flaw, and given WCH's track record...
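Roughly the difference between the two (a sketch with illustrative names and bit layout, not embassy's verbatim code; the second variant uses the critical-section crate):

```rust
use core::cell::Cell;
use core::sync::atomic::{AtomicU32, Ordering};

const SPAWNED: u32 = 1 << 0;
const RUN_QUEUED: u32 = 1 << 1;

// atomic variant: fetch_update compiles down to an LR/SC retry loop on
// RV32IMAC, so it depends on the core implementing LR/SC correctly
fn run_enqueue_atomic(state: &AtomicU32) -> bool {
    state
        .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |s| {
            if s & RUN_QUEUED != 0 || s & SPAWNED == 0 {
                None // already queued (or not spawned): don't enqueue
            } else {
                Some(s | RUN_QUEUED)
            }
        })
        .is_ok()
}

// critical-section variant: plain loads/stores with interrupts masked,
// so no A-extension instructions are involved at all
fn run_enqueue_cs(state: &Cell<u32>) -> bool {
    critical_section::with(|_cs| {
        let s = state.get();
        if s & RUN_QUEUED != 0 || s & SPAWNED == 0 {
            false
        } else {
            state.set(s | RUN_QUEUED);
            true
        }
    })
}
```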

@ExplodingWaffle (Contributor)

you know… i was wondering why they removed atomics from some of the new V3 cores. might be why 🙃

(this exists wrt testing atomics on risc-v. never even touched it myself, so no idea how hard it would be to run. all that really matters in practice, though, is "it works to spec" or "nope", so if one actually cared to prove it doesn't work, it might just be easier to come up with a single counterexample)

Codetector1374 added a commit to vapor-keeb/qingke that referenced this issue Nov 3, 2024

This change introduces a compile error if the toolchain is using a target with atomics. It looks like, at least on the QingKe V4, atomics are suspiciously broken.

There is also a feature, `i-understand-that-atomics-are-likely-broken`, that you can enable to remove this compile error.

See issue: ch32-rs/ch32-hal#59
Codetector1374 added a commit to vapor-keeb/ch32-hal that referenced this issue Nov 3, 2024

This introduces a breaking change: atomics are no longer available. DMA can no longer use atomics, so a critical section is used instead.

As discovered in ch32-rs#59, the QingKeV4 atomic implementation is likely broken. As a result, we added a compiler check in ch32-rs/qingke#8 to make sure the atomic extension is disabled. This change updates the dependency to use the new qingke, as well as removing any reference to `core::atomic` in ch32-hal.
Codetector1374 added a commit to vapor-keeb/ch32-hal that referenced this issue Nov 3, 2024

This change removes all use of atomics from ch32-hal.

As discovered in ch32-rs#59, the QingKeV4 atomic implementation is likely broken. As a result, we added a compiler check in ch32-rs/qingke#8 to make sure the atomic extension is disabled. This change updates the dependency to use the new qingke, as well as removing any reference to `core::atomic` in ch32-hal.
Codetector1374 added a commit to vapor-keeb/qingke that referenced this issue Nov 3, 2024

This change introduces a compile error if the toolchain is using a target with atomics. It looks like, at least on the QingKe V4, atomics are suspiciously broken.

There is also a feature, `unsafe-trust-wch-atomics`, that you can enable to remove this compile error.

See issue: ch32-rs/ch32-hal#59
andelf pushed a commit to ch32-rs/qingke that referenced this issue Nov 3, 2024

This change introduces a compile error if the toolchain is using a target with atomics. It looks like, at least on the QingKe V4, atomics are suspiciously broken.

There is also a feature, `unsafe-trust-wch-atomics`, that you can enable to remove this compile error.

See issue: ch32-rs/ch32-hal#59
Dummyc0m pushed a commit to vapor-keeb/ch32-hal that referenced this issue Nov 3, 2024

ExplodingWaffle pointed out [1] that the host is allowed to send a shorter packet than what we were expecting. Remove the assertions and links to embassy.

[1] ch32-rs#59 (comment)

Reported-by: Harry Brooke <harry.brooke1@hotmail.co.uk>
@ExplodingWaffle (Contributor)

you've probably got this down as a lost cause :) but it still strikes me as odd that they could mess this up and still advertise it as a feature, so i did a bit of reading about atomics on risc-v, and there is one smoking gun. (one caveat: I haven't checked whether your issue is caused by LR/SC or AMOs, but i'm assuming LR/SC, because i really can't see what could go wrong with an AMO)

exhibit a, unpriv ISA 14.2. "Zalrsc" Extension for Load-Reserved/Store-Conditional Instructions:

A store-conditional instruction to a scratch word of memory should be used to forcibly
invalidate any existing load reservation:
• during a preemptive context switch, and

exhibit b, priv ISA 3.3.2. Trap-Return Instructions:

If the A extension is supported, the xRET instruction is allowed to clear any outstanding LR address
reservation but is not required to. Trap handlers should explicitly clear the reservation if required (e.g.,
by using a dummy SC) before executing the xRET.

exhibit c, https://github.com/torvalds/linux/blob/a99efa00891b66405ebd25e49868efc701fe1546/arch/riscv/kernel/entry.S#L110

Dangling load reservations can be the result of taking a trap in the
middle of an LR/SC sequence, but can also be the result of a taken
forward branch around an SC -- which is how we implement CAS. As a
result we need to clear reservations between the last CAS and the
jump back to the new context.

it's a little sneaky, especially considering "preemptive context switch" isn't actually defined anywhere, but maybe ISRs should be invalidating the reservation set to avoid problems in CAS everywhere else?
basically: maybe try throwing in a dummy SC before the mret here? https://github.com/ch32-rs/qingke/blob/73a21d1b527327c17bd80b8722d66e10a4fcd46e/qingke-rt/macros/src/lib.rs#L238
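something like the Linux trick, sketched (hypothetical; assumes 0(sp) holds a word from the trap frame that is safe to write back unchanged):

```asm
    # clear any dangling load reservation before returning from the trap:
    # load a word, then store-conditional the same value back. the sc
    # either fails (no valid reservation) or rewrites the identical value;
    # either way, the reservation set is invalidated.
    lw   t0, 0(sp)
    sc.w zero, t0, 0(sp)
    mret
```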

andelf pushed a commit that referenced this issue Nov 4, 2024

This change removes all use of atomics (CAS) from ch32-hal.

As discovered in #59, the QingKeV4 atomic implementation is likely broken. As a result, we added a compiler check in ch32-rs/qingke#8 to make sure the atomic extension is disabled. This change updates the dependency to use the new qingke, as well as removing any reference to `core::atomic` in ch32-hal.
@Codetector1374 (Member, Author) commented Nov 4, 2024

Let me try this today; you might be correct. But, at least from reading the Linux code, it sounds like it is really only required if we are saving/restoring program state onto a different hart, because on the same hart, with interrupts:

lr <variable A>
----- interrupt -----
lr <variable A>
// do stuff
sc <variable A>
mret
----- resume -----
sc <variable A>

The final sc (after resume) should fail, because executing lr multiple times on the same address is a NOP, and the interrupt's own store-conditional should already have cleared the reservation.

@Codetector1374 (Member, Author)

Ohhhhhhh, I had a "shower thought". I think I understand how this could get messed up:

We have the hardware save and restore enabled... What if... what if... the hardware save/restore for the interrupt context is saving and restoring the address reservation? That would allow the second store-conditional to succeed! I should probably try that too.

@Codetector1374 (Member, Author) commented Nov 5, 2024

Actually, I just tried:

  • disabling the hardware interrupt save/restore
  • adding a sc.w before the mret

Neither fixed the problem, unfortunately.

Looks like we will just have to disable atomics for now.

@ExplodingWaffle (Contributor)

worth a try. i should really just find the time to repro this myself 😅.
i think you are correct that a “real” preemptive scheduler would be needed to cause problems (the dangling lr) here, not just interrupts. i don’t even think nested interrupts (which i believe are enabled, and used by time-driver-systick) could cause this one.
thanks for your efforts 👍

@Codetector1374 (Member, Author) commented Nov 5, 2024

Oh, I repro this just by mashing a key in my "hid example"; check vapor-keeb/vapor-keeb's main branch.

I was also using tick timer 1.

@Dummyc0m (Member) commented Nov 5, 2024 via email
