Adafruit qtpy rp2040 target starts CDC/ACM USB only once - suspect flash timing #401

Closed
wa1tnr wants to merge 1 commit


wa1tnr

@wa1tnr wa1tnr commented May 12, 2021

Changed:

PICO_FLASH_SPI_CLKDIV from 2 to 4
(to slow the clock, by halving it)

Reference system Linux host PC will not enumerate /dev/ttyACM0
(or related device names) except upon UF2 firmware upload.

UF2 upload is fine and program will run once, until power is removed.
Then, never again.

This patch is meant to allow the (end-user authored) firmware/program
to run repeatedly, including cycling power to the target board.

First (SPI clock divisor) value tried was '4' (from '2').

No other experiments done, in search of the optimal clock divisor.

Not at all sure what the clock divisor 'does'.

Simple guess:

Going from 2, to 4 would 'halve' the clock frequency (divide it
by 2).

Halving the frequency would cause the subsystem to evolve
all events more slowly, which apparently helps in enumeration
of the CDC/ACM device to the host PC.

@lurch
Contributor

lurch commented May 12, 2021

ping @tannewt and @dhalbert and @ladyada for comment?

@ladyada

ladyada commented May 12, 2021

@lurch the QT Py 2040 uses the winbond 25Q64JVXGIQ chip recommended by the pi foundation

@wa1tnr
Author

wa1tnr commented May 12, 2021

ItsyBitsy RP2040 enumerates on my Linux host PC with a clock divisor of 2.

@dhalbert
Contributor

We have one other user who has trouble getting a QT Py RP2040 running CircuitPython to respond to the reset button, but the Itsy works. This sounds similar: https://forums.adafruit.com/viewtopic.php?f=60&t=178640. We will investigate that. Are you running a pico-sdk C program?

@lurch
Contributor

lurch commented May 12, 2021

"This ran fine until I hit the reset button. Then I had to hit it 6 times to get it running again."

@dhalbert Yeah, it does sound like there might be some marginal timing somewhere that works with the flash chip on some QT Py RP2040 boards but not with the flash chip on other QT Py RP2040 boards? 🤷

@wa1tnr
Author

wa1tnr commented May 12, 2021

Hi @dhalbert - Yes, in a pico-sdk C program.

The issue manifests in pico-examples 'hello_world', but I usually test in CamelForth.

Since it happens with pico-examples hello world (serial USB) I didn't investigate further wrt the code being run.

Same code runs fine on Adafruit Feather RP2040, ItsyBitsy RP2040, Raspberry Pi Pico RP2040.

Host PC Dell Optiplex about 10 yrs old, Debian Linux amd64. Suspect it may have slight trouble enumerating USB, as a second device doesn't get enumerated, if two CDC/ACM devices are inserted into the USB jack array on the PC chassis. Only /dev/ttyACM0 presents (not /dev/ttyACM1, for example).

I don't remember this being a problem in the past, with (for example) the Arduino IDE.

No issue, this morning, with enumerating QTPY RP2040 (/dev/ttyACM0) running pico-sdk C program, concurrent with enumeration of (and conversation with) CP2104 Friend (/dev/ttyUSB0), which is talking to STM32F40x Black Pill on the Black Pill's USART, concurrently.

Both conversations CDC/ACM based (text interpreters running Forth).

@wa1tnr
Author

wa1tnr commented May 12, 2021

This prebuilt UF2 exhibits the issue (does not survive removal of power; runs fine upon UF2 upload, one time only):

wa1tnr: CamelForth

@dhalbert
Contributor

The same flash chip (maybe even from the same reel) is being used on the QT Py and the ItsyBitsy. I don't see any board configuration build differences, so this may be electrical, but it sounds like we might need to adjust some clock speed.

@lurch
Contributor

lurch commented May 12, 2021

> a second device doesn't get enumerated, if two CDC/ACM devices are inserted into the USB jack array on the PC chassis.

See Errata RP2040-E5 in the RP2040 datasheet. Easiest fix is to plug the two different RP2040 devices into separate USB hubs.

@wa1tnr
Author

wa1tnr commented May 12, 2021

> > a second device doesn't get enumerated, if two CDC/ACM devices are inserted into the USB jack array on the PC chassis.
>
> See Errata RP2040-E5 in the RP2040 datasheet. Easiest fix is to plug the two different RP2040 devices into separate USB hubs.

Thanks!

The content below is optional reading; tangential to this PR. ;)


I was able to add the enumeration fix call at the start of main():

int main(void) {
    stdio_init_all();
    sleep_ms(1200);
    rp2040_usb_device_enumeration_fix();

and flash to both targets (Adafruit Feather RP2040, QTPy RP2040).

There doesn't seem to be a preference for one over the other.

Both will enumerate, and I'll have both /dev/ttyACM0 and /dev/ttyACM1 available for interactive sessions (minicom, seyon, hyperterm &c.)

The one thing I can't do is 'claim' the interface (by invoking minicom, seyon, hyperterm &c.) and then try to enumerate the other target board (by plugging in its USB cable).

Both cables must be plugged in, and enumeration verified, before proceeding further.

Under those circumstances, I can have concurrent sessions on the two /dev/ttyACM devices, without a special USB hub (I don't own any external USB hubs).

@tannewt
Contributor

tannewt commented May 13, 2021

@hathach has been maintaining these board defs. I found 4 was a reliable divisor for CircuitPython. It needs to account for all of the different command speed limits (not just the read speed), and 62.5 MHz can end up over the common limit of 60 MHz for commands.
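
The arithmetic behind that remark can be sketched as follows. This is only an illustration: it assumes the flash interface is clocked from a 125 MHz clk_peri (the figure used elsewhere in this thread) and that the flash serial clock is simply clk_peri divided by PICO_FLASH_SPI_CLKDIV. The function names and the kHz constants are made up for the example and are not part of pico-sdk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: clk_peri runs at 125 MHz (expressed in kHz to keep integer math exact). */
#define CLK_PERI_KHZ 125000u

/* The "common limit of 60 MHz for commands" quoted above, in kHz. */
#define COMMAND_LIMIT_KHZ 60000u

/* Flash serial clock in kHz for a given divisor (hypothetical helper). */
uint32_t flash_sck_khz(uint32_t clk_peri_khz, uint32_t clkdiv)
{
    assert(clkdiv != 0u); /* a zero divisor is meaningless here */
    return clk_peri_khz / clkdiv;
}

/* True when the resulting serial clock stays at or below the command limit. */
bool within_command_limit(uint32_t clk_peri_khz, uint32_t clkdiv)
{
    return flash_sck_khz(clk_peri_khz, clkdiv) <= COMMAND_LIMIT_KHZ;
}
```

Under those assumptions a divisor of 2 gives 62.5 MHz, just over the 60 MHz figure, while a divisor of 4 gives 31.25 MHz.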

@hathach
Contributor

hathach commented May 14, 2021

Yeah, I noticed this issue as well when using the QT Py as a picoprobe. I also tried setting PICO_FLASH_SPI_CLKDIV=4, but that does not solve the issue. Furthermore, I pulled and compiled the latest CircuitPython (adafruit/circuitpython@35ee4ad), and it has the same issue. To sum up:

  1. Either a reset or a power cycle can cause the device to fail to run (it does not enumerate).
  2. Once that happens, it can occasionally run normally again after a random number of resets and power cycles, which indicates that the QSPI flash is not corrupted.
  3. UPDATE: if the board is left alone for a few minutes, it will start to run again! That suggests there is probably a blocking delay of some kind.

Currently I have no idea why it fails to run; further investigation is needed.

@eightycc

Something to look into is the default output driver strength. The flash part on my QT Py RP2040 is a Winbond Q64JVXGIQ, which according to Winbond's Rev K datasheet defaults to 25% driver strength on read operations (see pg. 17, section 7.1.6). I'm going to give kicking it up to 50, 75, and 100% a try and will report back.

@eightycc

Something else to note: I can get my QT Py RP2040 to rock-solid operation by switching to boot2_generic. That's not ideal as there's a pretty big performance hit, but it does point to a problem in boot2_w25q080. Since I can't bond a probe to SWCLK/SWDIO (those are some small pins), I'm resorting to using the LED to find where it's hanging up.

@eightycc

Well that's weird. I was able to get it to work reliably with boot2_generic_03, but after attempting to set read driver strength I seem to have bricked the flash. Guess I'll try bonding to those tiny pins...

@Wren6991
Contributor

There are a lot of moving parts between changing the SPI speed and USB operations failing. It could be some flash signal integrity issue, or it could be the different cache miss delay bumping against some hidden timing issue in the USB stack or elsewhere.

Are you able to reproduce this with any simpler applications (like blink from pico-examples)?

@Wren6991
Contributor

Like @tannewt said 62.5 MHz is quite high for some flash operations (particularly 03h tends to have a lower frequency limit) but the ROM programming routines use a fixed divisor of 6 (see here) when used with the stock hardware/flash code in the SDK, independently of what is used for XIP, so I wouldn't expect that to be affected by changing the second stage.

@eightycc

A closer reading of the Winbond datasheet reveals that certain SR bits are marked somewhat cryptically "Volatile / Non-Volatile Writable". What this means in practice is that on reset, these bits are copied from flash into the SR flip-flops. Using two different instructions, Write Enable (06h) or Write Enable for Volatile SR (50h), the programmer can permanently alter the SR bit in flash or temporarily (until the next reset) alter the bit in the flip-flop it's been copied into, respectively.
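
A minimal sketch of that two-instruction scheme, for illustration only. It just builds the command bytes and does not talk to any hardware. The 06h and 50h opcodes are the ones named above; the 11h opcode for Write Status Register-3 is my reading of the Winbond datasheet and should be checked against it before use. The type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    CMD_WRITE_ENABLE          = 0x06, /* following SR write is stored in flash (non-volatile) */
    CMD_WRITE_ENABLE_VOLATILE = 0x50, /* following SR write lasts only until the next reset */
    CMD_WRITE_STATUS_REG_3    = 0x11  /* assumption: opcode taken from my reading of the datasheet */
};

/* Two separate chip-select transactions: the enable command, then the register write. */
typedef struct {
    uint8_t enable[1];
    uint8_t write[2];
} sr3_write_sequence;

/* Build the byte sequences for writing sr3_value, persistently or temporarily. */
sr3_write_sequence build_sr3_write(uint8_t sr3_value, bool persistent)
{
    sr3_write_sequence seq;
    seq.enable[0] = persistent ? CMD_WRITE_ENABLE : CMD_WRITE_ENABLE_VOLATILE;
    seq.write[0]  = CMD_WRITE_STATUS_REG_3;
    seq.write[1]  = sr3_value;
    return seq;
}
```

Trying a new register value with persistent set to false first seems the safer order, since a bad value would then be undone by the next reset rather than stored in flash.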

@wa1tnr
Author

wa1tnr commented May 18, 2021

> Are you able to reproduce this with any simpler applications (like blink from pico-examples)?

I don't know who 'you' is. ;)

hello_world in pico-examples exhibits the behavior - that's how I knew not to tear my own code apart looking for a flaw.

i.e. pico-examples/hello_world/usb/hello_usb.c

@eightycc

Sweet success! By setting flash read driver strength to 75% in non-volatile Status Register 3 bits DRV0 and DRV1, I'm able to run with PICO_BOOT_STAGE2_CHOOSE_W25Q080 1 and PICO_FLASH_SPI_CLKDIV 4. If I attempt to kick the divider down to 2, it fails.

@eightycc

Here is the utility I cobbled together to update flash read driver strength:

https://github.com/eightycc/fix_qtpy_rp2040

@dhalbert
Contributor

dhalbert commented May 21, 2021

We have been seeing some problems with long crystal oscillator startup time on a few samples of Qt Py RP2040. I'm not sure that's related to the problem you're seeing, but try changing this line:

xosc_hw->startup = startup_delay;

to

xosc_hw->startup = startup_delay * 32;   // or even * 64

If you are willing to set at least one Qt Py RP2040 that was acting up back to the stock drive strength, and then trying the above, that would be an interesting test. But I don't know why the drive strength should have anything to do with flaky clocking.

@eightycc

> If you are willing to set at least one Qt Py RP2040 that was acting up back to the stock drive strength, and then trying the above, that would be an interesting test.

Will do.

> But I don't know why the drive strength should have anything to do with flaky clocking.

Could be we've got more than one problem in play. What I'm seeing looks more like a signal integrity problem in XIP mode, so adding drive (25% -> 75%) makes sense as a remedy.

@dhalbert
Contributor

What we actually saw on some Saleae traces was that the SCK signal to the flash was irregular and too fast after the xosc was started and used. So we surmised that the crystal oscillator was having trouble starting up and lengthened the startup delay experimentally. We haven't yet looked at analog traces.

@eightycc

eightycc commented May 21, 2021

Stranger and stranger. The QT Py that was failing reliably for me now works with flash read drive at 25% and no additional delay in xosc_init(). Likewise, any combination of 75% drive and additional xosc delay also works. That's with clock divider = 4 and the w25q080 second stage boot. With clock divider = 2, all combinations of drive and delay fail. I'll go over it again tomorrow to be sure I didn't miss anything.

@dhalbert
Contributor

dhalbert commented May 21, 2021

Hmm! I hadn't looked in the bootrom code and assumed that xosc_init() was only being called later, because theoretically you could run the chip without a crystal. EDIT: I realized that you must have the xosc for USB to work. See #401 (comment).

(EDIT) Maybe it's simply the extra delay that's being added in xosc_init() that is helping for some reason. I first tried small delays like 2x, 6x, etc. Even 16x was not consistently reliable - I had to go to 32x.

We are using a clock divider of 4, and we have our own stage2 boot, templatized and written in C:
https://github.com/adafruit/circuitpython/blob/main/ports/raspberrypi/stage2.c.jinja

I'd be very interested in a simple program that just dumped all the NVM parameters in the flash chip. I have several boards that work fine with one date code on the Winbond chip, and one that does not with another date code. I wonder if they have different factory settings.

(I think there is also a difference in the datecodes of the RP2040 chips, but they are both B1)

Date code on QT Py and other boards without problems:

image (2)

Date code on QT Py with problems:

image (3)

@eightycc

Fascinating info on the date codes. I have QT Pys stuffed with 2048 and 2051 date-coded flash on hand.

> I'd be very interested in a simple program that just dumped all the NVM parameters in the flash chip. I have several boards that work fine with one date code on the Winbond chip, and one that does not with another date code. I wonder if they have different factory settings.

I'll get on it later today.

@dhalbert
Contributor

dhalbert commented May 22, 2021

So just the unique id is different - oh well! Thanks for checking, in any case!

@eightycc

I've gone over bootrom and SDK initialization code and have found that initial clk_sys frequency will differ, depending on how the part booted. If bootrom succeeds loading a valid stage2 boot out of power on, then clk_sys will be ~12MHz sourced from rosc; otherwise, if loaded out of a uf2 drop, it will be 48MHz sourced from xosc via its PLL.

Also significant is that by the time we're running clocks_init(), code is executing from flash via XIP. So, we're changing clk_peri frequency in code we're executing via XIP. I wonder why this works at all?

@eightycc

Regarding lack of initialization of xosc_hw->startup in bootrom, the RP2040 defaults to 47, the same value that is calculated by uint32_t startup_delay = (((12 * MHZ) / 1000) + 128) / 256;.
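
To make that number concrete, here is a small sketch of the same calculation plus the wait time it implies. It assumes a 12 MHz crystal and that each count of the startup value stands for 256 crystal cycles (my reading of the formula's divide-by-256, not something confirmed in this thread). The function names are made up for the example, and the wait time only holds if the crystal is actually oscillating at its nominal frequency.

```c
#include <stdint.h>

#define MHZ 1000000u

/* Startup count for roughly 1 ms, rounded to the nearest multiple of 256 cycles. */
uint32_t xosc_startup_delay(uint32_t xosc_hz)
{
    return ((xosc_hz / 1000u) + 128u) / 256u;
}

/* Implied wait in microseconds, assuming one count equals 256 crystal cycles. */
uint32_t xosc_startup_wait_us(uint32_t delay_count, uint32_t xosc_hz)
{
    uint64_t cycles = (uint64_t)delay_count * 256u;
    return (uint32_t)((cycles * 1000000u) / xosc_hz);
}
```

With a 12 MHz crystal, xosc_startup_delay(12 * MHZ) evaluates to 47, which corresponds to about 1 ms. Multiplying that count by 32, as suggested earlier in the thread, stretches the wait to roughly 32 ms under the same assumptions.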

@kilograham
Contributor

@Wren6991 you may have comments on initial clocking

@dshadoff
Contributor

One more data point:
I have a QtPy RP2040 which was working well on initial program, but would neither reset properly nor start properly when connected to power. I could not get it to restart in these cases, even inconsistently.

The lot code on the Winbond memory is similar to the lot code specified above as 'good' (so these lot codes may be a red herring):
Q64JVXGIQ
2051-6036
7B600ZY

When I adjusted the CLK_DIV to 4 as suggested by @wa1tnr, I was able to consistently reset the board once power was applied, but initial startup from USB power was inconsistent. After applying the startup_delay * 32 change suggested by @dhalbert, the initial power-on seems to have been corrected as well.

Let me know if there is any more information I can help provide.

@dhalbert
Contributor

> Regarding lack of initialization of xosc_hw->startup in bootrom, the RP2040 defaults to 47, the same value that is calculated by uint32_t startup_delay = (((12 * MHZ) / 1000) + 128) / 256;.

Did you find this out by reading the register? (Since it's not in the datasheet.)

@dhalbert
Contributor

> I've gone over bootrom and SDK initialization code and have found that initial clk_sys frequency will differ, depending on how the part booted. If bootrom succeeds loading a valid stage2 boot out of power on, then clk_sys will be ~12MHz sourced from rosc; otherwise, if loaded out of a uf2 drop, it will be 48MHz sourced from xosc via its PLL.

This makes sense given the symptoms. Our Saleae traces on the SCK of the flash chip show it clocked quite slowly for a while, I assume while stage2 is being read. Then the xosc is turned on in the pre-main() code, and the SCK clocks become too fast and very irregular.

If instead the bootloader goes into USB mode, either because the boot button was pressed or there was no valid stage, then xosc is turned on and the execution remains in the boot ROM. Any xosc startup issue would probably just delay USB operation briefly.

@eightycc

eightycc commented May 24, 2021

> Did you find this out by reading the register? (Since it's not in the datasheet.)

See datasheet section 2.16.3, "The 1ms default is sufficient for the RP2040 reference design...", which I verified by setting up a GPIO as a trigger and measuring it using my trusty Rigol.

An interesting characteristic of xosc is that STARTUP_DELAY counts crystal pulses, so if the crystal is misbehaving the delay can be unpredictable.

I've brought sys_clk out on a GPIO so I can scope it, and observed that in the failing case the xosc startup is truncated and xosc appears to be running much faster than 12MHz. Once the PLL VCO thinks it's stabilized, it too is running way too fast and produces nothing but a hot mess (term of art) on its output. I can't be precise about the frequencies as they exceed my scope's bandwidth.

The way that clock initialization works between bootrom and SDK initialization is not as well structured as I'd wish, but it does seem to work in most cases. I rewrote clocks_init() and xosc_init() to set up clocks correctly to the best of my understanding, and the failure still occurs.

> Any xosc startup issue would probably just delay USB operation briefly.

Yes, I believe that helps drain more mystery out of this bug.

The bottom line is that some QT Pys have a defect. It may be a wonky crystal or a layout problem. Your fix (increase STARTUP_DELAY) works by delaying long enough for the crystal to finally stabilize. At this point I'd go with that and move on.

@eightycc

One more thing, I'm convinced that the Winbond flash part is not part of the problem. What's coming out of the PLL is such a mess that it's unreasonable to expect anything using it as a clock to work.

@daveythacher
Contributor

@daveythacher
Contributor

@eightycc Figure 117

@eightycc

eightycc commented May 28, 2021

@daveythacher I looked at your references, but I'm not grokking what you're driving at.

Figure 117 of the rp2040 datasheet shows a 2:1 ratio (SCKDV == 2) of ssi_clk (aka peri_clk) and sclk_out illustrating why that is the maximum allowable ratio, i.e., a ratio of 1:1 is not supported.

In the Infineon reference, the pertinent reference for the problem at hand is section 6.3.1 Parasitic RC or LC Oscillation which matches the problem I've observed with the failing QT Py board closely enough that Adafruit engineers should take note.

@eightycc

eightycc commented May 28, 2021

Just in case it gets lost, the original pull should also be incorporated. With SCKDV == 2, the resulting sclk_out frequency, assuming a clk_peri frequency of 125MHz, exceeds the Winbond W25Q64JV fR rating of 50MHz (see section 9.6 AC Electrical Characteristics in its datasheet). Since only even SCKDV values are valid, SCKDV needs to be set to 4.

So, this pull is needed for all boards with a Winbond W25Q64JV. Additionally, the pull by @dhalbert (#457) is needed for QT Py RP2040 parts to work around its xosc starting problem.
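
The divider reasoning in this comment can be sketched as a small helper. It uses only the figures quoted here (a 125 MHz clk_peri, a 50 MHz fR rating, even SCKDV values only) and takes them at face value; whether that fR limit actually governs the read command used for XIP is a separate question. The function name is hypothetical.

```c
#include <stdint.h>

/*
 * Smallest even divider (at least 2) for which clk_peri_hz / divider
 * does not exceed limit_hz. Returns 0 if limit_hz is 0.
 * Illustration only: the hardware's upper bound on SCKDV is not checked.
 */
uint32_t smallest_even_sckdv(uint32_t clk_peri_hz, uint32_t limit_hz)
{
    if (limit_hz == 0u) {
        return 0u;
    }
    /* Ceiling division, done in 64 bits to avoid overflow in the addition. */
    uint64_t div = ((uint64_t)clk_peri_hz + limit_hz - 1u) / limit_hz;
    if (div < 2u) {
        div = 2u;
    }
    if (div & 1u) {
        div += 1u; /* only even values are valid, so round odd results up */
    }
    return (uint32_t)div;
}
```

With 125 MHz and a 50 MHz limit, the exact ratio is 2.5, which rounds up to 3 and then to the next even value, 4.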

@wa1tnr
Author

wa1tnr commented May 28, 2021

When I sent in the PR, I just figured it was early in development - and I got very lucky on a wild guess.

> Just in case it gets lost, the original pull should also be incorporated.

@tannewt mentioned [the subject matter of] what you just wrote as well:

#401 (comment)

Thanks!

@lurch
Contributor

lurch commented May 28, 2021

#457 modifies all of the Adafruit board configs - should this PR do the same? 🤷

@eightycc

eightycc commented May 28, 2021

> #457 modifies all of the Adafruit board configs - should this PR do the same?

Yes, an excellent suggestion. @wa1tnr, could you do that?

@daveythacher
Contributor

> When I sent in the PR, I just figured it was early in development - and I got very lucky on a wild guess.
>
> > Just in case it gets lost, the original pull should also be incorporated.
>
> @tannewt mentioned [the subject matter of] what you just wrote as well:
>
> #401 (comment)
>
> Thanks!

#401 (comment)

@eightycc

eightycc commented May 29, 2021

@daveythacher Mea culpa. After carefully examining the SSI initialization code, I can see that the SSI baud rate will be properly handled through a transition from SDK code to bootrom code and back again. So, why was I seeing a failure with a divider of 2 vs. 4? Bad test hygiene on my part, i.e., building with a locally patched copy of pico-sdk that was picking up the generic stage 2 boot.

Going back to a clean checkout of pico-sdk, I can confirm that the QT Py RP2040 works with (1) #457, (2) PICO_BOOT_STAGE2_CHOOSE_W25Q080 1, and (3)PICO_FLASH_SPI_CLKDIV 2.

Humbly, I conclude that this pull is not necessary. Apologies to everyone.
