SYNC_IN jitter #16

Closed
jordens opened this issue Nov 6, 2018 · 140 comments

@jordens
Member

jordens commented Nov 6, 2018

The jitter on the SYNC_IN signal from Kasli to the AD9910 (through the LVDS buffers and the fanout) is very high in some cases (the tester setup connected to the buildbot).

At validation delay 1 (hold and setup margin 1 tap) the window is just 2 taps wide (a tap is about 75 ps).
http://buildbot.m-labs.hk/builders/artiq/builds/2669/steps/python_unittest_2/logs/stdio

This is the SMP_ERR matrix on tester; rows are increasing validation delay, columns are SYNC_IN delay on the AD9910:

[1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

There also seems to be some bounce around the edges (top row).

On the systems I have here I get about 5-6 tap wide windows at validation delay 1. That's not stellar but OK.
When using the SYNC signal on board from the first DDS, the window at validation delay 1 is 8 taps wide on tester, 8-9 taps here.
Assuming equal tap delay for the validation delays and the SYNC_IN delays, the theoretically best case is validation delay 4 and a window width of ~4 or a validation delay of 1 and a window width of ~10 (i.e. SYNC_IN delay periodicity minus twice the validation delay).

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
[1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
[1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
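The window widths quoted above can be read off a scan row mechanically. The following is a minimal Python sketch, assuming each row is a list of latched SMP_ERR flags indexed by SYNC_IN delay tap. The helper names `zero_runs` and `ideal_window` are hypothetical and not part of ARTIQ, and the 12-tap periodicity is only inferred from the figures in the text (10 + 2·1 = 4 + 2·4 = 12).

```python
def zero_runs(row):
    """Return (start_tap, width) for each contiguous run of error-free taps."""
    runs = []
    start = None
    for tap, err in enumerate(row):
        if not err and start is None:
            start = tap                        # a window opens
        elif err and start is not None:
            runs.append((start, tap - start))  # a window closes
            start = None
    if start is not None:                      # window still open at the last tap
        runs.append((start, len(row) - start))
    return runs


def ideal_window(period_taps, validation_delay):
    """Best-case window: periodicity minus setup and hold margin (in taps)."""
    return max(period_taps - 2 * validation_delay, 0)


# Validation delay 1 row of the on-board SYNC scan above.
row = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
runs = zero_runs(row)
```

A run that touches the first or last tap is cut off by the scan range, so its width is only a lower bound.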

In both cases Kasli/v1.1 and Urukul-AD9910/v1.3, connected via MMCX to Kasli-J1.

The jitter seems to come in part from Urukul and in part (the larger part) from Kasli. And it varies between setups.
I changed a couple things (see the artiq changelog) to optimize jitter on the RTIO clock but there was little effect. From Vivado the max peak-peak jitter on the clock driving the SYNC output buffer in the FPGA is ~90ps.
I tried running the SYNC fanout from the supposedly quieter P1V8A rail but that doesn't seem to work at all.

@gkasprow @marmeladapk could you have a look at the jitter on SYNC_IN (EEM0:7, before and after the sync fanout, compared to the Kasli MMCX clock)?

@cjbe @klickverbot When you were playing with SYNC, did you look at SMP_ERR? How did you select the SYNC_IN delay and the validation delay? Did you scan them?

c.f. m-labs/artiq#1143

@jordens
Member Author

jordens commented Nov 6, 2018

I tried with the fully loaded Tester variant and Urukul on EEM0/1 on the hardware here. That also reproduces the narrow window (2 taps wide at validation delay 1) and the bouncing. The bouncing is something that I can't really explain with regular FPGA supply rail crowding.

Then I tried with a variant containing only that Urukul EEM and nothing else. Both with it on EEM0/1 and EEM5/4. Both are very jittery as well (window size ~4 at validation delay 1).

And I tried clocking the SYNC output register from its own dedicated BUFG. No significant change.

@cjbe
Member

cjbe commented Nov 6, 2018

@jordens we look at SMP_ERRs and scanned to find the optimum point, and see similar things to you (i.e. similarly narrowed window widths for the Kasli clock vs the DDS sync_out).

I am also a bit worried about this, but have not had time to look at it properly yet. Empirically it is working without problem across two systems in our lab.

When I tested this thoroughly (#3 (comment)) I saw several SMP_ERRs per 10^10 samples (300ps validation window), but I did not see any losses of sync. (To see this I logged channels from 2 different Urukuls on a scope on persist, started the system from cold, and ran it for a day to achieve 10^10 resyncs).

@dnadlinger
Member

dnadlinger commented Nov 6, 2018

@jordens We set the SYNC_IN delay by scanning it and manually choosing the centre of the "eye" (i.e. error-free region). Two examples from a while ago, using two Urukul v1.0 connected to the same Kasli v1.0:

rurukul0: 0   [0, 0, 0, 0]
rurukul0: 1   [1000, 1000, 259, 0]
rurukul0: 2   [1000, 1000, 1000, 1000]
rurukul0: 3   [1000, 1000, 1000, 1000]
rurukul0: 4   [1000, 1000, 1000, 1000]
rurukul0: 5   [1000, 1000, 1000, 995]
rurukul0: 6   [1000, 1000, 1000, 1000]
rurukul0: 7   [1000, 1000, 1000, 1000]
rurukul0: 8   [0, 0, 0, 1000]
rurukul0: 9   [0, 0, 0, 0]
rurukul0: 10  [0, 0, 0, 0]
rurukul0: 11  [0, 0, 0, 0]
rurukul0: 12  [0, 0, 0, 0]
rurukul0: 13  [0, 0, 0, 0]
rurukul0: 14  [1000, 758, 499, 49]
rurukul0: 15  [1000, 1000, 1000, 1000]
rurukul0: 16  [1000, 1000, 1000, 1000]
rurukul0: 17  [1000, 1000, 1000, 1000]
rurukul0: 18  [1000, 1000, 1000, 1000]
rurukul0: 19  [1000, 1000, 1000, 1000]
rurukul0: 20  [1000, 1000, 1000, 1000]
rurukul0: 21  [1000, 1000, 870, 1000]
rurukul0: 22  [0, 0, 0, 0]
rurukul0: 23  [0, 0, 0, 0]
rurukul0: 24  [0, 0, 0, 0]
rurukul0: 25  [0, 0, 0, 0]
rurukul0: 26  [0, 0, 0, 0]
rurukul0: 27  [0, 0, 0, 2]
rurukul0: 28  [1000, 1000, 1000, 1000]
rurukul0: 29  [1000, 1000, 1000, 1000]
rurukul0: 30  [1000, 1000, 1000, 1000]
rurukul0: 31  [1000, 1000, 1000, 1000]

rurukul1: 0   [0, 0, 0, 0]
rurukul1: 1   [0, 0, 0, 0]
rurukul1: 2   [0, 0, 0, 0]
rurukul1: 3   [1000, 0, 156, 0]
rurukul1: 4   [1000, 1000, 1000, 1000]
rurukul1: 5   [1000, 1000, 1000, 1000]
rurukul1: 6   [1000, 1000, 1000, 937]
rurukul1: 7   [1000, 1000, 1000, 1000]
rurukul1: 8   [1000, 1000, 1000, 1000]
rurukul1: 9   [1000, 1000, 1000, 1000]
rurukul1: 10  [0, 61, 0, 0]
rurukul1: 11  [0, 0, 0, 0]
rurukul1: 12  [0, 0, 0, 0]
rurukul1: 13  [0, 0, 0, 0]
rurukul1: 14  [0, 0, 0, 0]
rurukul1: 15  [0, 0, 0, 0]
rurukul1: 16  [1000, 0, 1000, 0]
rurukul1: 17  [1000, 900, 1000, 1000]
rurukul1: 18  [1000, 1000, 1000, 1000]
rurukul1: 19  [77, 1000, 1000, 1000]
rurukul1: 20  [1000, 1000, 1000, 1000]
rurukul1: 21  [1000, 1000, 1000, 1000]
rurukul1: 22  [1000, 1000, 1000, 1000]
rurukul1: 23  [0, 1000, 0, 1000]
rurukul1: 24  [0, 0, 0, 0]
rurukul1: 25  [0, 0, 0, 0]
rurukul1: 26  [0, 0, 0, 0]
rurukul1: 27  [0, 0, 0, 0]
rurukul1: 28  [0, 0, 0, 0]
rurukul1: 29  [362, 0, 1000, 0]
rurukul1: 30  [1000, 766, 1000, 27]
rurukul1: 31  [1000, 1000, 1000, 1000]

Urukul v1.1 connected to a Kasli v1.1:

0   [1000, 1000, 1000, 1000]
1   [1000, 1000, 1000, 763]
2   [1000, 1000, 1000, 1000]
3   [60, 0, 1000, 1000]
4   [0, 0, 0, 245]
5   [0, 0, 0, 0]
6   [0, 0, 0, 0]
7   [0, 0, 0, 0]
8   [0, 0, 0, 0]
9   [0, 0, 0, 0]
10  [0, 38, 0, 1]
11  [946, 1000, 0, 511]
12  [1000, 1000, 3, 1000]
13  [1000, 1000, 930, 1000]
14  [1000, 1000, 1000, 977]
15  [1000, 1000, 1000, 1000]
16  [955, 30, 1000, 1000]
17  [0, 0, 1000, 1000]
18  [0, 0, 0, 0]
19  [0, 0, 0, 0]
20  [0, 0, 0, 0]
21  [0, 0, 0, 0]
22  [0, 0, 0, 0]
23  [0, 0, 0, 0]
24  [999, 1000, 0, 217]
25  [1000, 1000, 0, 1000]
26  [1000, 1000, 0, 1000]
27  [1000, 1000, 838, 1000]
28  [1000, 1000, 1000, 1000]
29  [1000, 995, 1000, 1000]
30  [0, 0, 1000, 1000]
31  [0, 0, 1000, 4]

These are the number of SMP_ERRs per 1000 trials for each of the channels, with validation tap setting 0. At 2, the windows are down to 2 taps; completely closed at 4.
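Choosing the centre of the eye from such a table was done by hand here, but the same choice can be sketched in code. This is a minimal Python sketch, assuming the scan is available as a mapping from SYNC_IN delay tap to per-channel SMP_ERR counts. The function name `eye_centre` is hypothetical, and requiring zero errors on all four channels at once is an assumption made for illustration.

```python
def eye_centre(scan):
    """Centre tap of the widest run of taps with zero errors on every channel.

    `scan` maps SYNC_IN delay tap -> list of per-channel SMP_ERR counts.
    Returns None if no tap is error-free on all channels.
    """
    best, run = [], []
    for tap in sorted(scan):
        if any(scan[tap]):
            run = []                 # an error on any channel closes the run
            continue
        if run and tap != run[-1] + 1:
            run = []                 # gap in the scanned taps
        run.append(tap)
        if len(run) > len(best):
            best = list(run)
    return best[len(best) // 2] if best else None


# Taps 3..11 of the Urukul v1.1 / Kasli v1.1 scan above.
scan = {
    3: [60, 0, 1000, 1000],
    4: [0, 0, 0, 245],
    5: [0, 0, 0, 0],
    6: [0, 0, 0, 0],
    7: [0, 0, 0, 0],
    8: [0, 0, 0, 0],
    9: [0, 0, 0, 0],
    10: [0, 38, 0, 1],
    11: [946, 1000, 0, 511],
}
centre = eye_centre(scan)
```

On this excerpt the error-free run spans taps 5 to 9, so the sketch picks the middle of that run.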

As Chris mentioned, we haven't seen any errors in production yet, but we haven't been looking very hard (i.e. only indirectly through relatively crappy quadrupole laser gates).

@jordens
Member Author

jordens commented Nov 6, 2018

At 62.5 MHz SYNC_IN there are already 1e10 resyncs after 3 minutes of running. And since SMP_ERR is latching and checking each one of them, I haven't seen an invalid (re)sync in >1e13. I am also uncertain how "loss of sync" would manifest itself on the outputs if there is no frequency/phase change. My guess is that it would not even be a 16ns transient and even if there is a 16ns transient, you'll capture that only if you mix or diff the channels on a scope. The reasoning is based on conjecture about how the DDS works internally (ADI patents and SAWG: SYNC_CLK will have a pair of short and long cycle-length glitches, but since the outputs run at 1 GHz the output will be exactly the same).
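The resync counts follow directly from the SYNC_IN rate. This is a minimal Python sketch of the arithmetic, using only the 62.5 MHz figure quoted above; the function names are hypothetical.

```python
SYNC_IN_HZ = 62.5e6  # SYNC_IN frequency quoted above


def resyncs(seconds, rate_hz=SYNC_IN_HZ):
    """Number of SYNC_IN events (resyncs) in a given run time."""
    return rate_hz * seconds


def seconds_for(count, rate_hz=SYNC_IN_HZ):
    """Run time needed to accumulate `count` resyncs."""
    return count / rate_hz


period_ns = 1e9 / SYNC_IN_HZ               # spacing between SYNC_IN events
after_3_min = resyncs(3 * 60)              # resyncs in three minutes
hours_for_1e13 = seconds_for(1e13) / 3600  # run time behind the >1e13 figure
```

Three minutes gives a little over 1e10 events at a 16 ns spacing, and 1e13 events correspond to roughly two days of continuous running.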

I.e. I am not worried about using a window that is 6 taps wide at validation delay 1. But I am worried about dealing with a window that is only 2 taps wide at validation delay 1.

@jordens
Member Author

jordens commented Nov 6, 2018

You shouldn't need to repeatedly check SMP_ERR. It's latching. Just let it hammer for a couple µs.
That width of 5 at validation delay 0 (2 at 2 and 0 at 4) is as bad as the data from Tester and worse than the 5-6 at 1 that I get here (with PTB2).

@jordens
Member Author

jordens commented Nov 6, 2018

I forgot your posts on the other issue. Thanks for digging them out.

The IO_UPDATE delay tuning is done. That is now measured without external hardware and can be done at runtime. And it is stable to the ns over all PVT cases I have looked at.

@dnadlinger
Member

Yep, I saw your (nice) commits – we'll definitely have a look at porting our code over from the quick stopgap fix to your driver soon.

@cjbe
Member

cjbe commented Nov 6, 2018

@jordens in our work we are using the 'clear phase accumulator on IO_UPDATE' mode. This means that a sync error looks like a 1ns phase origin glitch, which is very obvious (i.e. 90 degrees at 250 MHz).

For my sync tests I checked the alignment of the RF outputs with an RTIO TTL output - I confirmed that sitting outside of the window caused obvious phase alignment errors, and that sitting at the edge of the window (as measured with 0 validation delay) caused a small but measurable phase alignment error rate.

@cjbe
Member

cjbe commented Nov 6, 2018

You shouldn't need to repeatedly check SMP_ERR. It's latching. Just let it hammer for a couple µs.

Yeah - the repeated checking on the eye scans is just to get an error rate estimate.

@jordens
Member Author

jordens commented Nov 6, 2018

@cjbe Could you clarify what you mean by "sync error"? Not "SMP_ERR", probably. And how are you seeing that "sync error" if the next SYNC_IN event (corrective reset of the SYNC_CLK generator) is just 16 ns away? How long are the "glitches"?
Pretty sure that SYNC_IN hitting the wrong SYSCLK cycle is invisible as long as there is no phase/frequency change at the same time. And as you say, you haven't seen anything and I haven't seen anything either, even when provoking SMP_ERR.
You would see a misalignment between DDS outputs if the SYNC_CLK is misaligned at the same time as the IO_UPDATE event. That's what I can see as well. Is that what you tested?

Yeah - the repeated checking on the eye scans is just to get an error rate estimate.

But then you iterated over the 1000-iteration another 4 times...

@klickverbot ACK.

But we should figure out where that high SYNC_IN jitter comes from. If someone with access to jitter measurement tools could have a look, that would be great. Might also move this to Kasli.

@jordens
Member Author

jordens commented Nov 6, 2018

From a quick look with a spectrum analyzer and scope, SYNC_IN after going through the fanout, another IDC cable and a LVDS-to-CMOS converter is pretty clean. Spurs (from sys_clk logic modulating rtio_clk) on the SYNC_IN fundamental are down ~50 dB, on the 7th harmonic down ~30 dB. Also very clean close in to carrier (1 kHz to 1 MHz). The jitter is on rather fast timescales as already a couple dozen µs of sampling show the problem.

@cjbe
Member

cjbe commented Nov 6, 2018

@jordens

Could you clarify what you mean by "sync error"? Not "SMP_ERR", probably. And how are you seeing that "sync error" if the next SYNC_IN event (corrective reset of the SYNC_CLK generator) is just 16 ns away? How long are the "glitches"?

By 'sync error' I mean observing that the relationship between the DDS phase and an RTIO event is incorrect. The DDS is in a mode where the phase accumulator is reset to zero on IO_UPDATE. If the DDS state machine is not properly synced when it registers the IO_UPDATE the DDS phase is incorrect (i.e. the DDS chooses the wrong edge of the 1 GHz clock as the phase origin, leading to ~90 degree phase shifts for 250 MHz output).
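The ~90 degree figure is the phase a 250 MHz tone accumulates over one 1 ns SYSCLK period. This is a minimal Python sketch of that conversion; the function name is hypothetical and the 100 MHz case is only an illustrative extra.

```python
def phase_error_deg(f_out_hz, dt_s):
    """Phase offset in degrees of a tone at f_out_hz for a time offset dt_s."""
    return (360.0 * f_out_hz * dt_s) % 360.0


SYSCLK_HZ = 1e9              # 1 GHz DDS clock, as stated above
one_cycle = 1.0 / SYSCLK_HZ  # picking the wrong SYSCLK edge shifts the origin by 1 ns

err_250 = phase_error_deg(250e6, one_cycle)  # about 90 degrees
err_100 = phase_error_deg(100e6, one_cycle)  # illustrative: about 36 degrees
```

The error scales with output frequency, so the same one-cycle slip is harder to spot on a scope at low output frequencies.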

I triggered the scope from an RTIO TTL output at a fixed delay from the IO_UPDATE - if everything is working correctly the DDS phase should be fixed relative to the RTIO output event. If this phase is incorrect the DDS was not properly synced at IO_UPDATE.

@cjbe
Member

cjbe commented Nov 6, 2018

@jordens

But then you iterated over the 1000-iteration another 4 times...

Ah - these are the SMP_ERR counts for channels 0..3, so that
24 [999, 1000, 0, 217]
means that at validation delay tap 24 channel 0 had 999 errors out of 1000 samples, channel 1 had 1000 errors out of 1000 samples, etc.

jordens added a commit to m-labs/artiq that referenced this issue Nov 7, 2018
sinara-hw/Urukul#16

Signed-off-by: Robert Jördens <rj@quartiq.de>
@gkasprow
Member

It's worth looking at the CPLD IO supply rail. The SMPS may work in discontinuous mode, causing high ripple on the CPLD supply.

@jordens
Member Author

jordens commented Nov 24, 2018

That signal doesn't go through the CPLD. It would need to be crosstalk from the control lines of the fan out. The fan out supply seemed clean.

@jordens
Member Author

jordens commented Dec 4, 2018

With the current (extremely lenient) algorithm and about two taps of margin even CFL tubes being switched on will reliably cause SMP_ERR to latch here. This is in a grounded, closed enclosure (albeit not RF shielded). There is something wrong here.

@gkasprow
Member

gkasprow commented Dec 4, 2018

You can try with FSEN pin state on LVDS receivers. It may affect the jitter.

@AUTProgram

I have run sync_scan from the ad9910 test suite several times on two cards of the old (v1.0) and two cards of the new (v1.3) hardware versions of Urukul. Overall I did about 30 runs on each card.

For cards of revision 1.0, the errors resulting from different validation delays were quite variable, typical results for one card would look like these:

about 70% of all runs:

[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

about 20% of all runs:

[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

about 10% of all runs:

[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

and for the other card
about 2/3 of runs:

[0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]	
[0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

about 1/3 of runs:

[0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
[0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
[0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

For two cards of the new revision 1.3, all runs basically gave the same result:

[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
[1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
[1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

The results for the (Creotec) v1.3 boards are consistent with pk-pk jitter of approx 400ps.
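One plausible way to arrive at a figure like this (an assumption, not necessarily the method used here) is to take the width of the error regions at validation delay 0 and multiply by the tap size. This is a minimal Python sketch, using the ~75 ps per tap quoted earlier in the thread; the function name is hypothetical.

```python
TAP_PS = 75  # approximate SYNC_IN delay tap size quoted earlier in this thread


def interior_error_runs(row):
    """Widths of runs of 1s that are bounded by 0s on both sides."""
    widths = []
    width = 0
    seen_zero = False
    for err in row:
        if err:
            if seen_zero:        # ignore a leading run cut off by the scan range
                width += 1
        else:
            if width:
                widths.append(width)
            width = 0
            seen_zero = True
    return widths                # a trailing run touching the last tap is dropped


# Validation delay 0 row of the v1.3 scan above.
row = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
widths = interior_error_runs(row)
jitter_ps = [w * TAP_PS for w in widths]
```

Only error regions that lie fully inside the scan range are counted, since a region clipped by the first or last tap would understate the width.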

@AUTProgram

@gkasprow what level of jitter would you expect for the LVDS SYSREF generated by Kasli after going through cabling, LVDS buffers, etc? Any idea why the newer boards seem to behave better than the old ones?

@jordens which versions of the hardware did you test?

@hartytp
Collaborator

hartytp commented Dec 12, 2018

We were setting up to look into the sync in jitter issue and see if we could locate its origin, however we weren't able to reproduce it in our test setup.

Some other details (probably not material, but for completeness):

  • using a Kasli v1.1 (Technosystem) with the -3 speed grade
  • urukul clocked at 1GHz using the 125MHz Kasli clock via MMCX

@gkasprow
Member

gkasprow commented Dec 12, 2018

Jitter is dominated by the LVDS transceivers and could be even 0.3ns.
[image]

@hartytp
Collaborator

hartytp commented Dec 12, 2018

@gkasprow that 300ps is almost all data-dependent jitter, right? I'd have to double check what we're doing in the calibration code, but I'm not sure that could account for what we see.

It also doesn't explain the observation that we see no "eye" for some window sizes on the older hardware.

@gkasprow
Member

Yes. So with square wave pattern it should not be visible. The boards could differ by level of 3V3 rail noise that could affect jitter significantly.

@gkasprow
Member

Such deterministic jitter depends on activity on neighbouring channels. So you can check if during calibration something happens on SPI.

@hartytp
Collaborator

hartytp commented Dec 12, 2018

Such deterministic jitter depends on activity on neighbouring channels. So you can check if during calibration something happens on SPI.

Anyway, even so, 300ps of deterministic jitter still wouldn't explain the issues @jordens and @cjbe observed.

The boards could differ by level of 3V3 rail noise that could affect jitter significantly.

From a quick skim over the schematics, I didn't see any changes to the power supplies which could explain this, but maybe I missed something.

Could also be something to do with the clocking of the Urukuls from Kasli since I'm using the newer Kasli and @cjbe was using the older Kasli with worse clock distribution/floated MMCXs.

@gkasprow
Member

That could be a matter of e.g. the capacitors used. A different vendor means different characteristics.

@hartytp
Collaborator

hartytp commented Dec 12, 2018

@gkasprow I was wondering about that kind of thing. If the decoupling somewhere is a bit marginal, then the quality of the capacitors used could have a large impact on performance. Anyway, even our results with the v1.0 hardware look better than the data @jordens posted at the top of this issue, so I don't think this is just to do with the vendor of Urukul.

@gkasprow
Member

If you have an SSA, you can simply pass a known clock signal to the Urukul and back and see how it gets degraded.
I've just ordered a 6 GHz SSA for my lab, so I won't have to borrow one any more. It will be delivered in a few days. To do such a test I will need one problematic Urukul.

@jordens
Member Author

jordens commented Dec 13, 2018

I had already checked crosstalk from busy SPI lines and I had looked at the signal after the fanout and another lvds-cmos converter, with a SA and not with a SSA though. My suspicion is that there is something going on between the fanout and the dds input. The jitter timescales are not slow (<100µs).

@hartytp
Collaborator

hartytp commented Jun 6, 2019

Okay, so the clock buffer is not giving a deterministic input -> output phase relationship for the DDS clock?

@hartytp
Collaborator

hartytp commented Jun 6, 2019

Are all control pins on that buffer correctly driven (e.g. no floating divider reset etc)?

@marmeladapk
Member

@hartytp During tests IN_SEL on IC19 switches for a moment to high.

@hartytp
Collaborator

hartytp commented Jun 6, 2019

Odd...I'd have to recheck the ARTIQ code to make sure that's unexpected.

@hartytp
Collaborator

hartytp commented Jun 6, 2019

@marmeladapk can you try shorting IN_SEL so that the MMCX clock is always used and then take another eye scan, please? If that looks good then this is just some SW issue with the CPLD config.

@marmeladapk
Member

MMCX OSC sel = 1, IN_SEL = 0, triggered from Kasli MMCX, sync_sel = 0 (FPGA)

[0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0]
[0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
13 0 [0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]

[image: 11000]

@hartytp
Collaborator

hartytp commented Jun 6, 2019

Thanks! Okay, well still bad eye scans even though there are no visible glitches on the SYNC line (previous measurements suggest that there wasn't excessive jitter there either).

I'm out of ideas for things to test on the hw...

@hartytp
Collaborator

hartytp commented Jun 6, 2019

@marmeladapk if you still have that setup intact, can you repeat that last measurement on a finer timebase? I'd like to triple check there are no glitches on the SYNC_IN line.

@marmeladapk
Member

@hartytp I'll do it on Tuesday.

@hartytp
Collaborator

hartytp commented Jun 7, 2019

thanks

@gkasprow
Member

gkasprow commented Jun 7, 2019

We can try to observe both the FPGA clock and the SYNC signal with the same scope, directly on the Kasli. SYNC is available on the EEM connector, and the clock is also easy to observe. That way we would see exactly what the FPGA sees.

@marmeladapk
Member

[image: 12000]

@hartytp
Collaborator

hartytp commented Jun 11, 2019

Thanks @marmeladapk! Nothing there looks suspicious. Since you're looking at SYNC_OUT as well as SYNC_IN, I believe this also rules out things like noise on the DDS PLL (as SYNC_OUT is derived from the internal clock independently of SYNC_IN).

I suggest we leave this for now and produce a release that fixes the other issues.

@hartytp
Collaborator

hartytp commented Aug 16, 2019

@cjbe spent some more time looking at this issue a few weeks back. IIRC he was looking at the output of IC16 on a fast scope using a balun and coax lines soldered to the EEM connector pins. Scope on persist to also catch glitches etc.

Modified the ARTIQ driver so that we can switch between DDS0 and Kasli as the sync sources by only changing CLK_SEL and nothing else.

Running eye scans on DDS0 with the two sync sources. All done with 0 validation window as recommended by ADI (IIRC there is a post on the forum about how other validation delays aren't expected to work, but that's not explained in the data sheet).

Good eye scans with DDS0 as SYNC source, bad with Kasli as source. On the scope, there is no visible difference in the jitter with the two sync sources. Measurement was good enough to rule out there being enough jitter to account for the bad eye scans. The only visible change in waveforms (other than the DC phase being different due to cable lengths etc) was that the Kasli duty cycle is ~55%, while DDS0 is bang on 50%. Hard to see how small changes in duty cycle could cause issues (the DDS is DC coupled and edge-sensitive).

Next thing to do would be to repeat this measurement at the DDS pin, but so far this remains a bit of a mystery.

@gkasprow
Member

Can you include the EEM carrier or at least long ribbon cable to the loop? It might be an issue with crosstalk...

@hartytp
Collaborator

hartytp commented Aug 29, 2019

Can you include the EEM carrier or at least long ribbon cable to the loop? It might be an issue with crosstalk...

That was included and doesn't appear to be the issue.

@WeiDaZhang

The "double-window" that sync_smp_err shows in the sync_receiver tap scan can be resolved by adding an ODDR in rtio.phy.ttl_simple.ClockGen in the gateware on Kasli.

It typically looks like the following.
The rows are sync receiver delay taps, and the columns are the 4 channels on an Urukul.
[0, 0, 0, 0]
[0, 0, 0, 0]
[1000, 1000, 917, 194]
[1000, 1000, 1000, 1000]
[7, 638, 1000, 1000]
[805, 152, 0, 0]
[1000, 1000, 1000, 5]
[1000, 1000, 1000, 1000]
[100, 334, 1000, 1000]
[0, 0, 0, 1000]
[0, 0, 0, 0]
[0, 0, 0, 0]
[0, 0, 0, 0]
[0, 0, 0, 0]
[847, 978, 2, 0]
[1000, 1000, 1000, 1000]
[1000, 1000, 1000, 1000]
[0, 0, 675, 1000]
[1000, 1000, 1000, 0]
[1000, 1000, 1000, 503]
[1000, 1000, 1000, 1000]
[0, 0, 3, 1000]
[0, 0, 0, 0]
[0, 0, 0, 0]
[0, 0, 0, 0]
[0, 0, 0, 0]
[0, 1, 0, 0]
[1000, 1000, 1000, 0]
[1000, 1000, 1000, 1000]
[899, 113, 1000, 1000]
[1000, 1000, 1000, 1]
[1000, 1000, 1000, 0]

Or
[image: thumbnail_clip_image001]

We have done the following few tests in verifying the ODDR:

  • Test 1, to reproduce the "double-window" on a satellite Kasli with two Urukuls

  • Test 2, to swap the two Urukuls with their cables to each other's slots on Kasli

  • Test 3, to swap back and flash the patched gateware with ODDR

  • Test 4, to flash the original gateware back

The result is visualised in the following figure; the bright yellow area is where SYNC_SMP_ERR asserts.
[image: Kasli-Urukul Resolve in phy ion lab1]
All scans were done with Sync validation delay[3:0] tap = 0.

After adding the ODDR, the SYNC_SMP_ERR window becomes clear and wide.
It is not as wide as with the DDS-sourced SYNC_IN.
With the DDS as source, SYNC_SMP_ERR generally asserts over 2 taps, whereas with the FPGA ODDR as source it asserts over 3 taps, occasionally 4.
But the "double window" is avoided, which allows the "eye" detection algorithm to find the optimal tap.
We will make a pull request to artiq soon.
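To illustrate why such a scan confuses an eye finder, one can list the error-free runs of a single channel and flag the narrow ones that sit inside an error region. This is a minimal Python sketch and not the ARTIQ tuning algorithm; the function names and the `min_width=3` threshold are assumptions chosen for illustration.

```python
from itertools import groupby


def zero_runs(counts):
    """(start_tap, width) of each run of taps with zero SMP_ERR counts."""
    runs, tap = [], 0
    for is_clean, group in groupby(counts, key=lambda c: c == 0):
        width = len(list(group))
        if is_clean:
            runs.append((tap, width))
        tap += width
    return runs


def split_runs(counts, min_width=3):
    """Separate plausible eyes from narrow dips inside an error region."""
    runs = zero_runs(counts)
    eyes = [r for r in runs if r[1] >= min_width]
    # A dip is a short clean run with errors on both sides (not at a scan edge).
    dips = [r for r in runs
            if r[1] < min_width and r[0] > 0 and r[0] + r[1] < len(counts)]
    return eyes, dips


# Channel 0 column of the 32-tap scan above (before adding the ODDR).
ch0 = [0, 0, 1000, 1000, 7, 805, 1000, 1000, 100, 0, 0, 0, 0, 0, 847, 1000,
       1000, 0, 1000, 1000, 1000, 0, 0, 0, 0, 0, 0, 1000, 1000, 899, 1000, 1000]
eyes, dips = split_runs(ch0)
```

An algorithm that stops at the first error-free tap after an error region would land on the one-tap dip at tap 17 rather than in one of the real eyes.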

@jordens
Member Author

jordens commented Oct 18, 2019

Interesting. Let's understand this.

ODDR with both phases hooked up to the ttl frequency generator output? Which mode and which kind of pipelining?
I don't understand yet how this helps. Assuming in both cases the register gets packed into the output buffer I don't get why it would make a difference.
But if you accidentally changed it to the negative clock edge then the ground bounce might be smaller...
Does an oserdes give even better results?

@WeiDaZhang

ODDR with both phases hooked up to the ttl frequency generator output? Which mode and which kind of pipelining?

Yes, and SAME_EDGE.

I don't understand yet how this helps. Assuming in both cases the register gets packed into the output buffer I don't get why it would make a difference.

AFAIK, rtio.phy.ttl_simple.ClockGen didn't pack the register into "I/O Tile" before adding the ODDR. If it does the two cases are then implemented identically, and would not make any difference.

Does an oserdes give even better results?

OSERDES and OLOGIC (where the register sits) are alternatives to each other, aren't they?
Didn't try it.
It can't be better than DDS-sourced SYNC_IN really. From DDS we got a window of 10, and 9 from ODDR. I think there isn't much gap there.

@jordens
Member Author

jordens commented Oct 21, 2019

Ok. SAME_EDGE would still leave it toggling on the rising edge.
We generally expect the toolchain to pack these registers. I'd be surprised if it didn't here. If it didn't we should find out and check all similar cases (e.g. TTL I/O or SPI clocks) where we assume packing.
I'm also confused why this would not be visible in my measurements of the jitter or Pawel's.
Yes. OLOGIC and OSERDES are mutually exclusive. My assumption is that OSERDES works with faster bitrates and thus has even better jitter properties.

@WeiDaZhang

If it didn't we should find out and check all similar cases (e.g. TTL I/O or SPI clocks) where we assume packing.

I'll try to have a look.

My assumption is that OSERDES works with faster bitrates and thus has even better jitter properties.

Yes, true, makes sense.

@hartytp
Collaborator

hartytp commented Oct 21, 2019

We generally expect the toolchain to pack these registers. I'd be surprised if it didn't here

@WeiDaZhang did you check the floorplan?

Can you also fix this by setting IOB=TRUE on the final FF in the clk gen?

@WeiDaZhang

@WeiDaZhang did you check the floorplan?

That's what I was about to do.

Can you also fix this by setting IOB=TRUE on the final FF in the clk gen?

I frankly don't know how to do it in migen.
I imagine it would be something like
Instance("FD", i_C=ClockSignal("rio_phy"), i_D=blablalba, o_Q=blablabla, attr={("IOB", "TRUE")}),
But my understanding is that the two ways are identical.
The benefit might be that IOB=TRUE should be compatible with all Xilinx devices.

dnadlinger added a commit to dnadlinger/artiq that referenced this issue Oct 29, 2019
Without this, the final register in the SYNC signal TTLClockGen
isn't (always) placed in the I/O tile, leading to more jitter
than necessary, and causing "double window" artefacts. See
sinara-hw/Urukul#16 for more details.

(Patch based on work by Weida Zhang, testing by various members
of the community in Oxford and elsewhere.)
dnadlinger added a commit to dnadlinger/artiq that referenced this issue Oct 30, 2019
Without this, the final register in the SYNC signal TTLClockGen
isn't (always) placed in the I/O tile, leading to more jitter
than necessary, and causing "double window" artefacts. See
sinara-hw/Urukul#16 for more details.

(Patch based on work by Weida Zhang, testing by various members
of the community in Oxford and elsewhere.)
@hartytp
Collaborator

hartytp commented Oct 30, 2019

This appears to be FPGA jitter, not a hardware issue with Urukul.

@hartytp hartytp closed this as completed Oct 30, 2019
sbourdeauducq pushed a commit to m-labs/artiq that referenced this issue Nov 5, 2019
Without this, the final register in the SYNC signal TTLClockGen
isn't (always) placed in the I/O tile, leading to more jitter
than necessary, and causing "double window" artefacts. See
sinara-hw/Urukul#16 for more details.

(Patch based on work by Weida Zhang, testing by various members
of the community in Oxford and elsewhere.)