otg_fs / usbfs potential lock up #59

Open
Codetector1374 opened this issue Nov 1, 2024 · 18 comments

@Codetector1374 (Member) commented Nov 1, 2024

Writing this more or less as a PSA: the otg_fs / usbfs driver seems to be able to enter a locked-up state where an interrupt fires and calls the waker, but the task is never actually woken.

Currently debugging this issue; feel free to help if interested.

Codetector1374 added the bug label Nov 1, 2024
Codetector1374 self-assigned this Nov 1, 2024
@ExplodingWaffle (Contributor)

just to check: are you sure the waker is not being woken, rather than that it is being woken but returns Pending again even though the event you were waiting for should have triggered?

i had a quick look at the usbfs code for the first time just now, and one thing that sticks out to me is that you're checking interrupt flags etc. in the poll_fns. the reason i did that in the interrupt handler instead, in my stab at usbhs, is that the hardware is very prone to letting you miss interrupts if you aren't careful: when you receive a transfer-complete interrupt, you will "lose" the information stored in the registers (including rx_len) if another transfer occurs, which happens fairly often. so the poll_fn never sees the interrupt it's looking for (the hardware sets the endpoint to NAK after the ACK is sent… i thought? it doesn't say this in the RM now that i'm reading it. well, if it did, you won't even get the next transfer's interrupt, which would have a screwed-up data toggle anyway)

in fact, i suppose it is possible for the interrupt handler to be "preempted" by the hardware: say you read int_st, then another transfer completes, then you read rx_len: that rx_len would be for the wrong transfer. race conditions like this should be checked for by clearing the interrupt flag early, then making sure it is still clear at the end; if not, report an error for the endpoint (we can't be sure what we read was correct, since we might have raced) and loop back around to handle the new transfer (or we could just let the interrupt handler fire again… but i don't really trust the interrupt latency on risc-v for that)
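something like this (a minimal sketch of the pattern i mean; all the register-access helpers are illustrative stubs, not the real ch32-hal API):

```rust
// illustrative stubs standing in for real register reads/writes
fn clear_transfer_flag() { /* write-1-to-clear int_fg.transfer */ }
fn transfer_flag_set() -> bool { /* read int_fg.transfer */ false }
fn read_int_st() -> u8 { 0 }
fn read_rx_len() -> usize { 0 }
fn handle_transfer(_st: u8, _len: usize) {}
fn signal_endpoint_error(_st: u8) {}

fn on_usb_interrupt() {
    loop {
        // clear the flag *before* reading the per-transfer registers, so
        // a transfer completing mid-read raises it again and we notice
        clear_transfer_flag();

        let st = read_int_st();
        let len = read_rx_len();

        if !transfer_flag_set() {
            // no new transfer raced us: st/len belong to the same transfer
            handle_transfer(st, len);
            break;
        }

        // a new transfer completed while we were reading, so st/len may
        // be torn: flag an error and loop to handle the new transfer
        signal_endpoint_error(st);
    }
}
```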

two other things:

debug_assert_eq!(len, buf.len());
this assertion is wrong. the host is 100% allowed to send a shorter packet than we were expecting: that's what this check is for: https://github.com/embassy-rs/embassy/blob/6e0b08291b63a0da8eba9284869d1d046bc5dabb/embassy-usb/src/lib.rs#L418
and also:
token => {

checks like this one are not needed, since it should totally be possible to have an IN and an OUT endpoint on the same number (except for control endpoints)

@Codetector1374 (Member, Author) commented Nov 1, 2024

First of all, thank you so much for reviewing our code! We could really use an extra pair of eyes.

just to check: are you sure the waker is not being woken, rather than that it is being woken but returns Pending again even though the event you were waiting for should have triggered?

Well, I believe so, but I will double-check later today. We added some prints; the prints acquire a critical section, so they should be "globally ordered". (We are debugging an IN endpoint write, where we have a more or less consistent repro.)

Code with more prints:
https://github.com/vapor-keeb/ch32-hal/tree/0b97d20327f53677d55baf3dc784002c2d7a730a/src/otg_fs
Resulting log:
https://gist.github.com/Codetector1374/ddccb83b2816a4a1ba06675c170a0abb

[TRACE]endpoint.rs:86  : endpoint 1 IN
[TRACE]endpoint.rs:129 : suprious wakeup
[TRACE]    mod.rs:60  : irq: 6a
[TRACE]    mod.rs:86  : Walking up: 1
[TRACE]   main.rs:74  : <0,1> changed to High

Above is where the "bug" shows up. Seeing both endpoint 1 IN and suprious wakeup (I know it is not actually spurious; a poll_fn is expected to be able to run more than once, e.g. because another future in the same join completed), we know we have for sure registered ourselves with the waker. Then we see irq 6a, which would be a transfer IRQ, and it did in fact wake up endpoint 1. However, we never see another suprious wakeup (or the request actually completing).

This is what led me to believe there is a lost wakeup.

checking interrupt flags etc in the poll_fns

Our understanding has been that with int_busy (below), the hardware will automatically NAK all further host requests, without disturbing any registers, until int_fg.transfer is cleared, which we do in our poll_fns.

w.set_int_busy(true);

So I have actually never seen a case (yet) where the registers are clobbered or lost on otg_fs, as long as int_busy is set (which the WCH example code also does).
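To spell out the flow we rely on, here is a rough sketch (the register helpers and names are illustrative stand-ins, not the actual driver code):

```rust
use core::future::poll_fn;
use core::task::{Poll, Waker};

// illustrative stubs for register access and the endpoint's waker
fn register_waker(_w: &Waker) { /* e.g. AtomicWaker::register */ }
fn transfer_flag_set() -> bool { /* read int_fg.transfer */ false }
fn read_rx_len() -> usize { 0 }
fn clear_transfer_flag() { /* hw stops NAKing after this */ }

// With int_busy set, the hardware NAKs the host and leaves rx_len etc.
// untouched from the moment int_fg.transfer is raised until we clear it,
// so the bookkeeping can safely live in the poll_fn instead of the ISR.
async fn wait_for_transfer() -> usize {
    poll_fn(|cx| {
        register_waker(cx.waker());
        if transfer_flag_set() {
            let len = read_rx_len(); // still valid: hw is NAKing the host
            clear_transfer_flag();   // now the next transfer may proceed
            Poll::Ready(len)
        } else {
            Poll::Pending
        }
    })
    .await
}
```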

this assertion is wrong.

That is a good catch. I don't think we actually thought that through when we wrote the assertion.

checks like this one are not needed since it should totally be possible to have an IN and OUT endpoint on the same number (except for control endpoints)

I see, I did in fact misunderstand the USB spec there. I guess this is "technically not broken yet", but good point, we should fix that in the future.

@ExplodingWaffle (Contributor)

Our understanding has been that with int_busy (below), the hardware will automatically NAK all further host requests, without disturbing any registers, until int_fg.transfer is cleared, which we do in our poll_fns.

Ah, I'd glossed over int_busy. That makes sense (as long as there isn't a bug with it or something)

As for your problem: I'm not really sure. maybe the waker is not registered somehow?! unfortunately AtomicWaker doesn't give you a way to check. my only other guess is an executor bug, but that seems unlikely. might have a proper look tomorrow; if I can repro, I have a usb analyzer and might be able to get a debugger working.

@Codetector1374 (Member, Author)

I have an analyzer too; I don't believe there is much of interest going on on the bus. If you are interested, I'm more than happy to schedule an interactive session / voice call where we can debug this.

@Dummyc0m (Member) commented Nov 2, 2024

Very interesting: turning off integrated-timer in embassy-executor and adding generic-queue in embassy-time made the issue no longer reproducible...

Still unsure what's going on TBH.

@Dummyc0m (Member) commented Nov 2, 2024

In embassy-executor/src/raw/mod.rs:

/// Wake a task by `TaskRef`.
///
/// You can obtain a `TaskRef` from a `Waker` using [`task_from_waker`].
pub fn wake_task(task: TaskRef) {
    let header = task.header();
    if header.state.run_enqueue() {
        // We have just marked the task as scheduled, so enqueue it.
        unsafe {
            let executor = header.executor.get().unwrap_unchecked();
            executor.enqueue(task);
        }
    }
}

/// Wake a task by `TaskRef` without calling pend.
///
/// You can obtain a `TaskRef` from a `Waker` using [`task_from_waker`].
pub fn wake_task_no_pend(task: TaskRef) {
    let header = task.header();
    if header.state.run_enqueue() {
        // We have just marked the task as scheduled, so enqueue it.
        unsafe {
            let executor = header.executor.get().unwrap_unchecked();
            executor.run_queue.enqueue(task);
        }
    }
}

It looks like the result of executor.run_queue.enqueue is not checked. @Codetector1374 found that we have the flag set in the state but no task on the run_queue.

Edit: just kidding, the run_queue.enqueue doesn't return an error.
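For context, the run queue is (roughly) an intrusive atomic stack, so a lost wakeup here would mean the CAS-based push itself misbehaved. A simplified sketch of that kind of push (illustrative, not embassy's exact code) shows why broken LR/SC would strand a task: run_enqueue has already set the "queued" state bit, so if a racing push is silently overwritten, the task never gets polled and never gets re-enqueued:

```rust
use core::sync::atomic::{AtomicPtr, Ordering};

struct Node {
    next: AtomicPtr<Node>, // intrusive link inside the task header
}

struct RunQueue {
    head: AtomicPtr<Node>,
}

impl RunQueue {
    /// Push a node, returning true if the queue was previously empty.
    /// With correct LR/SC, the retry loop can never lose a node; if a
    /// store-conditional can succeed when it should have failed, a
    /// concurrent push can be overwritten and its task stranded.
    unsafe fn enqueue(&self, node: *mut Node) -> bool {
        let mut prev = self.head.load(Ordering::Relaxed);
        loop {
            (*node).next.store(prev, Ordering::Relaxed);
            match self.head.compare_exchange_weak(
                prev,
                node,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return prev.is_null(),
                Err(p) => prev = p, // raced with another push: retry
            }
        }
    }
}
```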

@Codetector1374 (Member, Author)

Hmmm, looks like either we have a hardware bug in the atomic instructions, or there is some bug in embassy_executor.

Basically, TaskHeader::state is 3, which indicates we are "enqueued", but the run_queue in the executor looks empty.

(gdb) p *(0x20000060 as *const embassy_executor::raw::TaskHeader)
$2 = embassy_executor::raw::TaskHeader {
  state: embassy_executor::raw::state::State {
    state: core::sync::atomic::AtomicU32 {
      v: core::cell::UnsafeCell<u32> {
        value: 3
      }
    }
  },
  run_queue_item: embassy_executor::raw::run_queue::RunQueueItem {
    next: embassy_executor::raw::util::SyncUnsafeCell<core::option::Option<embassy_executor::raw::TaskRef>> {
      value: core::cell::UnsafeCell<core::option::Option<embassy_executor::raw::TaskRef>> {
        value: core::option::Option<embassy_executor::raw::TaskRef>::None
      }
    }
  },
  executor: embassy_executor::raw::util::SyncUnsafeCell<core::option::Option<&embassy_executor::raw::SyncExecutor>> {
    value: core::cell::UnsafeCell<core::option::Option<&embassy_executor::raw::SyncExecutor>> {
      value: core::option::Option<&embassy_executor::raw::SyncExecutor>::Some(0x20007fec)
    }
  },
  poll_fn: embassy_executor::raw::util::SyncUnsafeCell<core::option::Option<unsafe fn(embassy_executor::raw::TaskRef)>> {
    value: core::cell::UnsafeCell<core::option::Option<unsafe fn(embassy_executor::raw::TaskRef)>> {
      value: core::option::Option<unsafe fn(embassy_executor::raw::TaskRef)>::Some(0x5152 <embassy_executor::raw::TaskStorage<vapor_keeb::____embassy_main_task::{async_fn_env#0}>::poll<vapor_keeb::____embassy_main_task::{async_fn_env#0}>>)
    }
  },
  expires_at: embassy_executor::raw::util::SyncUnsafeCell<u64> {
    value: core::cell::UnsafeCell<u64> {
      value: 18446744073709551615
    }
  },
  timer_queue_item: embassy_executor::raw::timer_queue::TimerQueueItem {
    next: embassy_executor::raw::util::SyncUnsafeCell<core::option::Option<embassy_executor::raw::TaskRef>> {
      value: core::cell::UnsafeCell<core::option::Option<embassy_executor::raw::TaskRef>> {
        value: core::option::Option<embassy_executor::raw::TaskRef>::None
      }
    }
  }
}



(gdb) p *(0x20007fec as *const embassy_executor::raw::SyncExecutor)
$4 = embassy_executor::raw::SyncExecutor {
  run_queue: embassy_executor::raw::run_queue::RunQueue {
    head: core::sync::atomic::AtomicPtr<embassy_executor::raw::TaskHeader> {
      p: core::cell::UnsafeCell<*mut embassy_executor::raw::TaskHeader> {
        value: 0x0 <defmt::export::fmt<embassy_usb::class::hid::ReportId>>
      }
    }
  },
  pender: embassy_executor::raw::Pender (
    0x0 <defmt::export::fmt<embassy_usb::class::hid::ReportId>>
  ),
  timer_queue: embassy_executor::raw::timer_queue::TimerQueue {
    head: embassy_executor::raw::util::SyncUnsafeCell<core::option::Option<embassy_executor::raw::TaskRef>> {
      value: core::cell::UnsafeCell<core::option::Option<embassy_executor::raw::TaskRef>> {
        value: core::option::Option<embassy_executor::raw::TaskRef>::None
      }
    }
  },
  alarm: embassy_time_driver::AlarmHandle {
    id: 0
  }
}

@Dummyc0m (Member) commented Nov 2, 2024

gg

@Codetector1374 (Member, Author) commented Nov 2, 2024

Ugh.... sounds like we don't have atomics....

Thanks..... WCH

@Dummyc0m (Member) commented Nov 2, 2024

To elaborate: we switched the state/run_queue impl within embassy-executor from atomics to critical sections, and the problem stopped reproducing.

It could be that the atomic impl is wrong, but neither of us could come up with a flaw, and given WCH's track record...
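Roughly the difference between the two (a sketch with illustrative names and bit layout, not embassy's verbatim code; the second variant uses the critical-section crate):

```rust
use core::cell::Cell;
use core::sync::atomic::{AtomicU32, Ordering};

const SPAWNED: u32 = 1 << 0;
const RUN_QUEUED: u32 = 1 << 1;

// atomic variant: fetch_update compiles down to an LR/SC retry loop on
// RV32IMAC, so it depends on the core implementing LR/SC correctly
fn run_enqueue_atomic(state: &AtomicU32) -> bool {
    state
        .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |s| {
            if s & RUN_QUEUED != 0 || s & SPAWNED == 0 {
                None // already queued (or not spawned): don't enqueue
            } else {
                Some(s | RUN_QUEUED)
            }
        })
        .is_ok()
}

// critical-section variant: plain loads/stores with interrupts masked,
// so no A-extension instructions are involved at all
fn run_enqueue_cs(state: &Cell<u32>) -> bool {
    critical_section::with(|_cs| {
        let s = state.get();
        if s & RUN_QUEUED != 0 || s & SPAWNED == 0 {
            false
        } else {
            state.set(s | RUN_QUEUED);
            true
        }
    })
}
```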

@ExplodingWaffle (Contributor)

you know… i was wondering why they removed atomics from some of the new V3 cores. might be why 🙃

(this exists wrt testing atomics on risc-v. never even touched it myself, so no idea how hard it would be to run. all that really matters in practice, though, is "it works to spec" or "nope", so if one actually cared to prove it doesn't work, it might just be easier to come up with a single counterexample)

Codetector1374 added a commit to vapor-keeb/qingke that referenced this issue Nov 3, 2024

This change introduces a compile error if the toolchain is using a target with atomics. It looks like, at least on the QingKe V4, atomics are suspiciously broken.

There is also a feature, `i-understand-that-atomics-are-likely-broken`, that you can enable to remove this compile error.

See issue: ch32-rs/ch32-hal#59
Codetector1374 added a commit to vapor-keeb/ch32-hal that referenced this issue Nov 3, 2024

This introduces a breaking change: atomics are no longer available. DMA can no longer use atomics, so a critical section is used instead.

As discovered in ch32-rs#59, the QingKeV4 atomic implementation is likely broken. As a result, we added a compiler check in ch32-rs/qingke#8 to make sure the atomic extension is disabled. This change updates the dependency to use the new qingke, as well as removing any reference to `core::atomic` in ch32-hal.
Codetector1374 added a commit to vapor-keeb/ch32-hal that referenced this issue Nov 3, 2024

This change removes all use of atomics from ch32-hal.

As discovered in ch32-rs#59, the QingKeV4 atomic implementation is likely broken. As a result, we added a compiler check in ch32-rs/qingke#8 to make sure the atomic extension is disabled. This change updates the dependency to use the new qingke, as well as removing any reference to `core::atomic` in ch32-hal.
Codetector1374 added a commit to vapor-keeb/qingke that referenced this issue Nov 3, 2024

This change introduces a compile error if the toolchain is using a target with atomics. It looks like, at least on the QingKe V4, atomics are suspiciously broken.

There is also a feature, `unsafe-trust-wch-atomics`, that you can enable to remove this compile error.

See issue: ch32-rs/ch32-hal#59
andelf pushed a commit to ch32-rs/qingke that referenced this issue Nov 3, 2024

This change introduces a compile error if the toolchain is using a target with atomics. It looks like, at least on the QingKe V4, atomics are suspiciously broken.

There is also a feature, `unsafe-trust-wch-atomics`, that you can enable to remove this compile error.

See issue: ch32-rs/ch32-hal#59
Dummyc0m pushed a commit to vapor-keeb/ch32-hal that referenced this issue Nov 3, 2024

ExplodingWaffle pointed out [1] that the host is allowed to send a shorter packet than what we were expecting. Remove the assertions and links to embassy.

[1] ch32-rs#59 (comment)

Reported-by: Harry Brooke <harry.brooke1@hotmail.co.uk>
@ExplodingWaffle (Contributor)

you've probably got this down as a lost cause :) but it still strikes me as odd that they could mess this up and still advertise it as a feature, so i did a bit of reading about atomics on risc-v, and there is one smoking gun. (one caveat: I haven't checked whether your issue is caused by LR/SC or AMOs, but i'm assuming LR/SC, because i really can't see what could go wrong with an AMO)

exhibit a, unpriv ISA 14.2. "Zalrsc" Extension for Load-Reserved/Store-Conditional Instructions:

A store-conditional instruction to a scratch word of memory should be used to forcibly
invalidate any existing load reservation:
• during a preemptive context switch, and

exhibit b, priv ISA 3.3.2. Trap-Return Instructions:

If the A extension is supported, the xRET instruction is allowed to clear any outstanding LR address
reservation but is not required to. Trap handlers should explicitly clear the reservation if required (e.g.,
by using a dummy SC) before executing the xRET.

exhibit c, https://github.com/torvalds/linux/blob/a99efa00891b66405ebd25e49868efc701fe1546/arch/riscv/kernel/entry.S#L110

Dangling load reservations can be the result of taking a trap in the
middle of an LR/SC sequence, but can also be the result of a taken
forward branch around an SC -- which is how we implement CAS. As a
result we need to clear reservations between the last CAS and the
jump back to the new context.

it's a little sneaky, especially considering "preemptive context switch" isn't actually defined anywhere, but maybe ISRs should be invalidating the reservation set to avoid problems in CAS everywhere else?
basically: maybe try throwing in a dummy SC before the mret here? https://github.com/ch32-rs/qingke/blob/73a21d1b527327c17bd80b8722d66e10a4fcd46e/qingke-rt/macros/src/lib.rs#L238
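something like the Linux trick, sketched (hypothetical; assumes 0(sp) holds a word from the trap frame that is safe to write back unchanged):

```asm
    # clear any dangling load reservation before returning from the trap:
    # load a word, then store-conditional the same value back. the sc
    # either fails (no valid reservation) or rewrites the identical value;
    # either way, the reservation set is invalidated.
    lw   t0, 0(sp)
    sc.w zero, t0, 0(sp)
    mret
```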

andelf pushed a commit that referenced this issue Nov 4, 2024

This change removes all use of atomics (CAS) from ch32-hal.

As discovered in #59, the QingKeV4 atomic implementation is likely broken. As a result, we added a compiler check in ch32-rs/qingke#8 to make sure the atomic extension is disabled. This change updates the dependency to use the new qingke, as well as removing any reference to `core::atomic` in ch32-hal.
@Codetector1374 (Member, Author) commented Nov 4, 2024

Let me try this today; you might be correct. But, at least from reading the Linux code, it sounds like it is really only required if we are saving/restoring program state onto a different hart, because on the same hart, with interrupts:

lr <variable A>
----- interrupt -----
lr <variable A>
// do stuff
sc <variable A>
mret
----- resume -----
sc <variable A>

The final sc (after resume) should fail, because executing lr multiple times on the same address is a NOP, and the interrupt's own store-conditional should already have cleared the reservation.

@Codetector1374 (Member, Author)

Ohhhhhhh, I had a "shower thought". I think I understand how this could get messed up:

We have the hardware save and restore enabled... What if... what if... the hardware save/restore for the interrupt context is saving and restoring the address reservation? That would allow the second store-conditional to succeed! I should probably try that too.

@Codetector1374 (Member, Author) commented Nov 5, 2024

Actually, I just tried:

  • disabling the hardware interrupt save/restore
  • adding a sc.w before the mret

Neither fixed the problem, unfortunately.

Looks like we will just have to disable atomics for now.

@ExplodingWaffle (Contributor)

worth a try. i should really just find the time to repro this myself 😅.
i think you are correct that a “real” preemptive scheduler would be needed to cause problems (the dangling lr) here, not just interrupts. i don’t even think nested interrupts (which i believe are enabled, and used by time-driver-systick) could cause this one.
thanks for your efforts 👍

@Codetector1374 (Member, Author) commented Nov 5, 2024

Oh, I repro this just by mashing a key in my "hid example"; check vapor-keeb/vapor-keeb's main branch.

I was also using tick timer 1.

@Dummyc0m (Member) commented Nov 5, 2024 via email
