Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

periph_timer: systematic proportional error in timer_set #10545

Open
jia200x opened this issue Dec 4, 2018 · 26 comments
Open

periph_timer: systematic proportional error in timer_set #10545

jia200x opened this issue Dec 4, 2018 · 26 comments
Assignees
Labels
Area: timers Area: timer subsystems Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)

Comments

@jia200x
Copy link
Member

jia200x commented Dec 4, 2018

Description

While debugging LoRaWAN I discovered there is a systematic error while using timer_set. When setting a timer to T units of time, it actually gets triggered in T + X/100*T (with X a board specific drift).
This issue is at least present on samr21-xpro (0.48% drift) and b-l072z-lrwan1 (1% drift).

This is causing some problems in LoRaWAN. For the Join Request procedure, the first reception window should be opened 5 seconds right after there's a TX done event. In b-l072z-lrwan1 the RX window is opened 47 ms later than expected. This explains why this compensation works ok on this board but produces problems in others (@danpetry I know you were having issues with this).

Steps to reproduce the issue

I ran the following test (samr21-xpro, 32bit timer) which produces timer cycles (with timeout increasing linearly, 200ms step) and measured the LED pin with a logic analyzer:

#include <stdio.h>
#include "xtimer.h"
#include "timex.h"
#include "periph/timer.h"
#include "random.h"

#include "msg.h"
#include "thread.h"

#define T 200000
#define TIMER TIMER_DEV(1)
#define TIMER_MAX 0xffffffff
#define TIMER_OFFSET 0xffff

#define LED_OFF LED0_OFF
#define LED_TOGGLE LED0_TOGGLE

unsigned int i=T;

static void cb(void *arg, int chan)
{
    (void) arg;
    (void) chan;
    LED_TOGGLE;
    if (i+T < TIMER_MAX)
        timer_set(TIMER, 0, i);
    i += T;
}

int main(void)
{
    // Init timer with 1 MHz
    timer_init(TIMER, 1000000ul, cb, NULL);
    LED_OFF;
    timer_set(TIMER, 0, TIMER_OFFSET);

    return 0;
}

Expected results

The actual timeout timer for the n cycle (e.g how long does the nth cycle take) in milliseconds would be:

t'[n] = TIMER_OVERHEAD + n*200 

Here TIMER_OVERHEAD is the intrinsic delay of timer_set (ISR + callback served). Since TIMER_OVERHEAD should be constant:

t''[n] = diff(t'[n]) = 200

So for a real measurement, I would expect diff(t[n]) to be centered in 200 and have some values around.

Actual results

I got the result of the logic analyzer here. It contains the timestamp when there was a flank. Then, I ran the following using python's pandas to extract the min, avg and max value of the sample's diff(t[n]):

import pandas as pd

data = pd.read_csv("results/untitled.csv")
data = data["Time[s]"][3:]
p_data = data.diff().diff()*1000 #Get derivative of cycles timeout timer in ms
p_data.min(), p_data.mean(), p_data.max()

Got (200.57575000000583, 200.9604469696972, 201.59066666666803)
So, diff(t[n]) is slightly to the right. There's a deviation of ((200.9604469696972 - 200)/200 * 100) = 0.48%, as mentioned before.

TL;DR

This code toggles the LED every 10 seconds:

#include <stdio.h>
#include "timex.h"
#include "periph/timer.h"

#define T 10000000
#define TIMER TIMER_DEV(1)
#define TIMER_MAX 0xffffffff
#define TIMER_OFFSET 0xffff

#define LED_OFF LED0_OFF
#define LED_TOGGLE LED0_TOGGLE

unsigned int i=T;

static void cb(void *arg, int chan)
{
    (void) arg;
    (void) chan;
    LED_TOGGLE;
    if (i+T < TIMER_MAX)
        timer_set(TIMER, 0, T);
    //i += T;
}

int main(void)
{
    //int interval = 0;
    timer_init(TIMER, 1000000ul, cb, NULL);
    LED_OFF;
    timer_set(TIMER, 0, T);

    return 0;
}

Here is the output of the logic analyzer when I ran this snippet on samr21-xpro. Then, I calculated the length between LED toggles:

     Time[s]   Channel 0
2  10.051050         1.0
3  10.050970        -1.0
4  10.050887         1.0
5  10.050525        -1.0
6  10.050817         1.0

Greater than 10 (avg: 10.05085). Drift here is:

((10.05085-10)/10) * 100= 0.51 %

Sort of consistent with the value reported before for samr21-xpro

@jia200x
Copy link
Member Author

jia200x commented Dec 4, 2018

Can someone confirm if this issue is present? @kaspar030 @MichelRottleuthner

@jia200x
Copy link
Member Author

jia200x commented Dec 4, 2018

(didn't put it in the description, but obviously this issue also affects xtimer)

@jia200x
Copy link
Member Author

jia200x commented Dec 4, 2018

this might be also related to #10352

@jia200x jia200x added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Area: timers Area: timer subsystems labels Dec 4, 2018
@jnohlgard
Copy link
Member

Did you try the application tests/bench_timer?
It performs the same kind of test but using another hardware timer on the same CPU as the reference. It produces some statistics about the error distribution as well.

@jia200x
Copy link
Member Author

jia200x commented Dec 4, 2018

Did you try the application tests/bench_timer?

IIRC I got similar results with both TIMER_DEV(0) and TIMER_DEV(1). So, the issue might be invisible to that test :/

I will confirm this tomorrow (and of course, run tests/bench_timer)

@kaspar030
Copy link
Contributor

Got (200.57575000000583, 200.9604469696972, 201.59066666666803)
So, diff(t[n]) is slightly to the right. There's a deviation of ((200.9604469696972 - 200)/200 * 100) = 0.48%, as mentioned before.

But that deviation is (more or less) constant, right? When printing the differences I get these:

 [200.9545     200.97325    200.95366667 200.853      201.03866667
 200.89083333 201.15025    200.77225    201.13441667 200.84225
 201.08283333 200.77816667 200.9925     200.96558333 201.056
 200.95925    200.84258333 200.8585     201.59066667 201.1005
 200.76441667 200.57575   ]

Note that the maximum is at pos n-3, the minimum is the last value.
So the actual measured elapsed time differences are off from min=0.575 to max=1.59, but it is not increasing.

When setting a timer to T units of time, it actually gets triggered in T + X/100*T

According to your data, it is more like T + 0.048*200ms.

Anyways, a milisecond spread when using periph timer is not good at all, and ~10k usec TIMER_OVERHEAD (0.048 * 200ms) is very weird.

Could you, with the oscilioscope, measure the time it takes for just plainly setting the gpios?

LED0_TOGGLE;
LED0_TOGGLE;
LED0_TOGGLE;

Maybe the GPIO is not as fast as it should be.

@kaspar030
Copy link
Contributor

Could you, with the oscilioscope, measure the time it takes for just plainly setting the gpios?

Maybe a couple thousand times in a loop, so we can get a baseline on what accuracy the measurement has?

@kaspar030
Copy link
Contributor

t[n] = TIMER_OVERHEAD + n*200

BTW, your code actually sets relative times (timer_set(i)), thus the expected time for each step is

t[n] = t[n-1] + TIMER_OVERHEAD + n*200.

@jia200x
Copy link
Member Author

jia200x commented Dec 5, 2018

But that deviation is (more or less) constant, right? When printing the differences I get these:

Yes, the deviation is more or less constant. For samr21-xpro is 0.48%. For b-l072z-lrwan1 around 1%.

Could you, with the oscilioscope, measure the time it takes for just plainly setting the gpios?

I already did that and IIRC it was for sure less than 10 us. I will measure again.

Anyways, a milisecond spread when using periph timer is not good at all, and ~10k usec TIMER_OVERHEAD (0.048 * 200ms) is very weird.

With TIMER_OVERHEAD I meant the ISR + callback overhead. That should be always constant.

BTW, your code actually sets relative times (timer_set(i)), thus the expected time for each step is

The description and the variable of the formula is misleading. I will fix it.
Basically this is how long does the nth cycle take (first cycle 200ms, then 400ms, 600ms, 800ms, ...)
In reality each cycle should be slightly longer due to the TIMER_OVERHEAD (as described above), but this one should be constant (the time to receive and serve the interrupt shouldn't be dependent on the timeout value). Then, t[n+1] - t[n] should be constant to 200ms. A slight drift of 0.48% in this value will produce a drift of 0.48% on every call to timer_set :(

@MichelRottleuthner
Copy link
Contributor

The clock configuration of this board looks suspicious regarding this issue.

From boards/samr21-xpro/include/periph_conf.h it looks like there are three possible settings for the clock config:

  1. CLOCK_USE_PLL = 1
  2. CLOCK_USE_XOSC32_DFLL = 1
  3. CLOCK_USE_XOSC32_DFLL = CLOCK_USE_PLL = 0

Then looking at cpu/samd21/cpu.c and cpu/samd21/periph/timer.c gives the following clock sources for the timers:

  1. OSC8M->[GCLK1]->timer{0,1}
  2. OSC8M->[GCLK1]->timer{0,1}
  3. OSC8M->[GCLK0]->timer{0,1}

So to me this looks like independent of the actually available clocks the timer always uses OSC8M.
According to datasheet page 976 the accuracy of this oscillator is bad even when using flash calibration values.
(0.75% to 2% depending on condtions + temperature and supply effects).

@jia200x
Copy link
Member Author

jia200x commented Dec 5, 2018

(fixed formula to make it clearer)

@kaspar030
Copy link
Contributor

In reality each cycle should be slightly longer due to the TIMER_OVERHEAD (as described above), but this one should be constant (the time to receive and serve the interrupt shouldn't be dependent on the timeout value). Then, t[n+1] - t[n] should be constant to 200ms. A slight drift of 0.48% in this value will produce a drift of 0.48% on every call to timer_set :(

To what do you think the error is proportional? 0.48% of what? If the length of each cycle increases, but the error is pretty constant at 0.48% of 200ms, I'd rather assume that the error is constant. It is certainly not in proportion to the cycle length, otherwise the error would increase.

What happens if you use 400ms, 800ms difference between cycles? If the error is proportional to the difference, it would stay close to 0.48% of 400ms, or 0.48% of 800ms.

So to me this looks like independent of the actually available clocks the timer always uses OSC8M.
According to datasheet page 976 the accuracy of this oscillator is bad even when using flash calibration values.
(0.75% to 2% depending on condtions + temperature and supply effects).

That seems like a very probable explanation.

@jia200x
Copy link
Member Author

jia200x commented Dec 5, 2018

To what do you think the error is proportional? 0.48% of what? If the length of each cycle increases, but the error is pretty constant at 0.48% of 200ms, I'd rather assume that the error is constant. It is certainly not in proportion to the cycle length, otherwise the error would increase.

The error is proportional to the length of the timeout window (a.k.a the difference between the callback and the moment were timer_set is called).
As shown in the TL;DR section, for 10 seconds it gives an error of 0.5%. Thus, is not constant.

In fact:
diff(t[n]) = t[n+1] - t[n] = 200 + 0.0048*200 => t[n] = 200*n + 0.0048*200*n = 200*n*(1+0.0048)
If the expected time is e[n] = 200*n, then the drift with the actual value is:
t[n] / e[n] = 0.48%

So it's really important diff(t[n]) is centered in 200 ms (in practice there are always drifts because of clock jitter, but they somehow cancel if it's centered)

@kaspar030
Copy link
Contributor

The error is proportional to the length of the timeout window (a.k.a the difference between the callback and the moment were timer_set is called).

Look again at the data:

[200.9545     200.97325    200.95366667 200.853      201.03866667
 200.89083333 201.15025    200.77225    201.13441667 200.84225
 201.08283333 200.77816667 200.9925     200.96558333 201.056
 200.95925    200.84258333 200.8585     201.59066667 201.1005
 200.76441667 200.57575   ]

There's 0.48% of 200 ms average deviation, with a standard deviation of 0.19. But it is not increasing with increasing interval length. It is not 0.48% of 200*n. The data says, for intervals 200*(1......n), the differences between two n, which should be 200ms, are too high by 1 ms, on average. So actually, the differences are (more or less) centered at 200ms. Or maybe I don't get the data.

As shown in the TL;DR section, for 10 seconds it gives an error of 0.5%. Thus, is not constant.

True that. Too bad you didn't run the "increasing 200ms" capture so it arrives at 10s.

Could you try reading the timer in a loop, toggling the led at specific intervals?

e.g.,

uint32_t n = 0
while(1) {
 n++;
if ((n & 0xffff) == 0) {
LED0_TOGGLE;
}

so we get some data on how accurate the timer actually is?

@jia200x
Copy link
Member Author

jia200x commented Dec 5, 2018

Look again at the data:

Note this is the data of the second discrete derivative of the timestamps.
With the latest description this means:

t[n] = Time where there was a flank
t'[n] = diff(t[n]) = Duration of nth cycle (expected to be  [200, 400, 600, 800...] )
t''[n] = diff(t'[n]) = Time difference between cycle n+1 and n (expected to be [200, 200, 200, 200, ...])

The cycle time t'[n] is equal to TIMER_OVERHEAD (probably in the order of microseconds) plus 200*n.

With this data:

[200.9545     200.97325    200.95366667 200.853      201.03866667
 200.89083333 201.15025    200.77225    201.13441667 200.84225
 201.08283333 200.77816667 200.9925     200.96558333 201.056
 200.95925    200.84258333 200.8585     201.59066667 201.1005
 200.76441667 200.57575   ]

The mean of the differences between cycles (t''[n]) is 200.96. It means the cycle time t'[n] = TIMER_OVERHEAD + 200.96*n.

If TIMER_OVERHEAD is not 0 (which was discarded in #10545#issuecomment) the deviation against the expected time is:

t'[n] / e'[n] = (TIMER_OVERHEAD + 200.96*n)/ (TIMER_OVERHEAD + 200*n) = 1 + 0.96*n/(TIMER_OVERHEAD+200*n)

For low values of TIMER_OVERHEAD or high values of n this tends to 1+0.96/200 = 1.0048 drift (0.48%)

Or maybe I don't get the data.

Sorry, I wasn't clear with the presentation of the data :(

Anyway, I will do more measurements and report the results back

@jia200x
Copy link
Member Author

jia200x commented Dec 5, 2018

For 10 seconds (n=50000) the expected time with this 0.48% drift is approx:

200.96*50000 = 10.048 seconds (pretty close to what I measured before with the 10 seconds test)

@kfessel
Copy link
Contributor

kfessel commented Dec 7, 2018

Please tell me that I am wrong but:
It seems to me that the b-l072z-lrwan1 board is configured to use the stm internal HSI (internal RC oscillator) which is callibrated in factory to 1% @ 25°C -> this is not good enaugh for the LoRa join Procedure.
Please tell me that I am wrong.

@jia200x
Copy link
Member Author

jia200x commented Dec 7, 2018

It seems to me that the b-l072z-lrwan1 board is configured to use the stm internal HSI (internal RC oscillator) which is callibrated in factory to 1% @ 25°C -> this is not good enaugh for the LoRa join Procedure.
Please tell me that I am wrong.

Well, this could explain why we get 1% of drift in the b-l072z-lrwan1 board

@kfessel
Copy link
Contributor

kfessel commented Dec 7, 2018

And also why the thing that should not have worked (50 ms offset) worked for you and not for me

@kfessel
Copy link
Contributor

kfessel commented Dec 7, 2018

but when we fix it we also have to take a look on power management for the board since the external oci is powerd by an configurable jumper which may be powerd by a pin (PA12)

@kaspar030
Copy link
Contributor

The mean of the differences between cycles (t''[n]) is 200.96. It means the cycle time t'[n] = TIMER_OVERHEAD + 200.96*n.

Ok, now I'm with you. I wasn't getting that the 200.96 add up. Thanks for clarifying!

Why didn't you just state that the timer runs 0.48% too fast?

@jia200x
Copy link
Member Author

jia200x commented Dec 7, 2018

When setting a timer to T units of time, it actually gets triggered in T + X/100*T (with X a board specific drift).
This issue is at least present on samr21-xpro (0.48% drift) and b-l072z-lrwan1 (1% drift).

You mean 0.48% slower? (Expected 5s => measured 5.047)

That's what I tried to say with:

While debugging LoRaWAN I discovered there is a systematic error while using timer_set. When setting a timer to T units of time, it actually gets triggered in T + X/100*T (with X a board specific drift).
This issue is at least present on samr21-xpro (0.48% drift) and b-l072z-lrwan1 (1% drift).

@jia200x
Copy link
Member Author

jia200x commented Dec 7, 2018

Anyway, both boards have inaccurate clock sources for the timers. That indeed explains these drifts...

@stale
Copy link

stale bot commented Aug 10, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

@stale stale bot added the State: stale State: The issue / PR has no activity for >185 days label Aug 10, 2019
@stale stale bot closed this as completed Sep 10, 2019
@smlng
Copy link
Member

smlng commented Sep 10, 2019

I think that relates to the discussion we had during the summit regarding the fact that timer_set should disable IRQs while doing timer_read and timer_set_absolute. Hence, reopen

@smlng smlng reopened this Sep 10, 2019
@stale stale bot removed the State: stale State: The issue / PR has no activity for >185 days label Sep 10, 2019
@MichelRottleuthner
Copy link
Contributor

Actually I think it is only because of using an inaccurate clock source as I described in one of the comments above. But maybe we should open a separate issue for the periph timer thing.

@miri64 miri64 added this to the Release 2020.07 milestone Jul 3, 2020
@MrKevinWeiss MrKevinWeiss removed this from the Release 2021.07 milestone Jul 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: timers Area: timer subsystems Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

No branches or pull requests

8 participants