Provide information about the quality of a resampled metric #1021

Open
3 tasks
llucax opened this issue Aug 5, 2024 · 13 comments
Labels
part:data-pipeline Affects the data pipeline type:enhancement New feature or enhancement visible to users
Milestone

Comments

@llucax
Contributor

llucax commented Aug 5, 2024

What's needed?

We need a way to inform users about the quality of a resampled metric.

For example, if a sample was calculated using only one very old value, the data quality should be low, while if it was calculated from many samples, including up-to-date ones, the quality should be high.

This way actors could make more informed decisions on how to use that data.

Proposed solution

  • Expose resampler SourceProperties via the resampling actor
  • Add more relevant statistics to SourceProperties
  • Make FormulaEngines aggregate statistics from the components they use and expose their own statistics
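
To make the third point concrete, here is a minimal sketch of how a formula could combine per-component statistics. `QualityStats` and `aggregate` are hypothetical names invented for illustration; they are not the SDK's `SourceProperties` or any existing API. The sketch also assumes a pessimistic policy where the worst component determines the quality of the result.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class QualityStats:
    """Hypothetical per-output-sample statistics (not an SDK type)."""

    samples_used: int  # input samples passed to the resampling function
    newest_sample_age: timedelta  # age of the freshest input sample


def aggregate(stats: list[QualityStats]) -> QualityStats:
    """Combine component statistics pessimistically: the worst case wins."""
    if not stats:
        raise ValueError("at least one component is required")
    return QualityStats(
        samples_used=min(s.samples_used for s in stats),
        newest_sample_age=max(s.newest_sample_age for s in stats),
    )


battery = QualityStats(samples_used=3, newest_sample_age=timedelta(seconds=0.4))
meter = QualityStats(samples_used=1, newest_sample_age=timedelta(seconds=2.8))
combined = aggregate([battery, meter])
print(combined)
```

Other policies (averaging, or weighting by each component's contribution to the formula) are equally plausible; the worst-case rule is only the simplest one to reason about.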

Use cases

No response

Alternatives and workarounds

No response

Additional context

No response

@cwasicki
Collaborator

cwasicki commented Aug 6, 2024

In my opinion this is interesting for formulas, e.g. to know how many None's were ignored in the calculation.
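
As a rough illustration of that idea, the sketch below sums component values while reporting how many `None` inputs were skipped. `FormulaResult` and `sum_ignoring_none` are hypothetical names, not part of the SDK's formula engine, and ignoring `None` entirely is an assumption about the desired behaviour.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FormulaResult:
    """Hypothetical formula output carrying a simple quality hint."""

    value: float | None  # None when no input had a value
    ignored_nones: int  # how many inputs were None and therefore skipped


def sum_ignoring_none(values: list[float | None]) -> FormulaResult:
    """Sum the available values and count the missing ones."""
    present = [v for v in values if v is not None]
    total = sum(present) if present else None
    return FormulaResult(value=total, ignored_nones=len(values) - len(present))


result = sum_ignoring_none([1500.0, None, 250.0, None])
print(result)
```

A consumer could then decide, for example, to distrust any result where more than half of the inputs were ignored.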

@llucax
Contributor Author

llucax commented Aug 8, 2024

@frequenz-floss/python-sdk-team unless someone steps in and shows a use case for this, I think I will close.

@shsms
Contributor

shsms commented Aug 8, 2024

We have often seen lower data rates from components without warning because of site-specific issues. I have seen this happen many times, including last week.

Apps need to be able to identify degraded data quality so that they know to be more conservative in their goals. Without it, they will assume that the latest values have a higher accuracy and will overshoot.

@llucax
Contributor Author

llucax commented Aug 8, 2024

But if we assume a small sampling period, which is what we want to aim for (1s), then you know that the data rate is low or the quality of the data is bad because the resampler will start producing None, right? I agree we need to know when data is degraded; what I'm not sure about is whether the resampler is the best place to do so. I think the resampler should only cover for very short outages, stuff that should be transparent to app developers. Once data is bad enough that you care, the resampler should be fixing it in the first place, right?

@llucax
Contributor Author

llucax commented Aug 8, 2024

So one suggestion was to use the LatestValueCache, extending it to expire the last value and store the timestamp of the last value.
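
A minimal sketch of what such an extension could look like follows. `ExpiringLatestValue` is a hypothetical stand-alone class written for illustration; it does not reproduce the actual `LatestValueCache` interface, and the `max_age` parameter and explicit `now` argument are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ExpiringLatestValue:
    """Hypothetical helper: keeps the last value and when it was received."""

    max_age: timedelta
    _value: float | None = None
    _timestamp: datetime | None = None

    def update(self, value: float, timestamp: datetime) -> None:
        """Store a new value together with its timestamp."""
        self._value = value
        self._timestamp = timestamp

    def age(self, now: datetime) -> timedelta | None:
        """Return the age of the stored value, or None if nothing is stored."""
        if self._timestamp is None:
            return None
        return now - self._timestamp

    def get(self, now: datetime) -> float | None:
        """Return the stored value unless it is missing or older than max_age."""
        age = self.age(now)
        if age is None or age > self.max_age:
            return None
        return self._value


start = datetime(2024, 8, 8, 12, 0, 0, tzinfo=timezone.utc)
cache = ExpiringLatestValue(max_age=timedelta(seconds=10))
cache.update(42.0, start)
fresh = cache.get(start + timedelta(seconds=4))  # still within max_age
stale = cache.get(start + timedelta(seconds=11))  # expired
```

Passing `now` explicitly keeps the sketch deterministic; a real implementation would more likely read the clock itself.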

@shsms
Contributor

shsms commented Aug 8, 2024

> then you know that the data rate is low or the quality of the data is bad because the resampler will start producing None, right?

I think the resampler shouldn't produce None and expect manual intervention like increasing data age in number of sampling periods to 5. Like Christoph said, that is too disruptive for big locations. The resampler should adjust to max data age, if it determines that data rate is lower than the max data age, such that the buffer will have the latest value. But that's a separate issue I guess.

@shsms
Contributor

shsms commented Aug 8, 2024

> what I'm not sure about is whether the resampler is the best place to do so.

I think it is, because like you said, it tracks source info already and just has to send out one value at startup, and later, whenever the source info is recalculated.

@llucax
Contributor Author

llucax commented Aug 9, 2024

> I think the resampler shouldn't produce None and expect manual intervention like increasing data age in number of sampling periods to 5.

Let's see if we are talking about the same thing.

When? If data is not coming, then yes, it should produce None, there is no data. Right? This might happen temporarily or always. If a site is always producing slow data rates, then there is something fucked with that location, and IMHO in that case, yes, we should fix the location or change the period manually, at least from what I understood @thomas-nicolai-frequenz said, the resampling period can't be changed so lightly or the machine learning part can break.

If it happens sporadically, we should be able to recover when the data comes with the normal rate.

> Like Christoph said, that is too disruptive for big locations. The resampler should adjust to max data age, if it determines that data rate is lower than the max data age, such that the buffer will have the latest value. But that's a separate issue I guess.

What do you mean by "adjust to the max data age"? Do you mean it should adjust the max_data_age_in_periods so that we get at least one sample for the low rate input? If so, I don't think we should do that; it would effectively change the resampling function dynamically depending on the input data rate.

> > what I'm not sure about is whether the resampler is the best place to do so.
>
> I think it is, because like you said, it tracks source info already and just has to send out one value at startup, and later, whenever the source info is recalculated.

Yeah, but it is done for different reasons. Again, the global resampler is just a way to homogenize the input data assuming the data that comes... comes, and comes at a reasonable rate. If we have no data, the resampler should return None, if you still need to work with an old value, you should save the latest value and the age of this latest value yourself.

So this issue is only about knowing if the data for the last 3 seconds (according to the current defaults we use, resampling period of 1s and max_data_age_in_periods of 3) is good or bad, and my question still is, do we even need this kind of granularity?

@llucax
Contributor Author

llucax commented Aug 9, 2024

OK, looking at the code, I have some interesting findings that I forgot about:

  • The resampler supports upsampling, and the max_data_age_in_periods considers the input sampling period in this case, not the (output) resampling period:
    max_data_age_in_periods: float = 3.0
    """The maximum age a sample can have to be considered *relevant* for resampling.

    Expressed in number of periods, where period is the `resampling_period`
    if we are downsampling (resampling period bigger than the input period) or
    the *input sampling period* if we are upsampling (input period bigger than
    the resampling period).

    It must be bigger than 1.0.

    Example:
        If `resampling_period` is 3 seconds, the input sampling period is
        1 and `max_data_age_in_periods` is 2, then data older than 3*2
        = 6 seconds will be discarded when creating a new sample and never
        passed to the resampling function.

        If `resampling_period` is 3 seconds, the input sampling period is
        5 and `max_data_age_in_periods` is 2, then data older than 5*2
        = 10 seconds will be discarded when creating a new sample and never
        passed to the resampling function.
    """
  • If the resampler is downsampling, then the time span of samples passed to the resampling function is constant, but if it is upsampling, it is already dynamic (as it depends on the input sampling period) 😱

  • The input sampling period is recalculated each time a sample arrives, but it is an average over the whole lifetime of the input, so if the input rate changes over time, the value used as the input sampling period will barely change. This might be good or bad depending on how we see it.

So if some location is sending samples every 5 seconds (consistently and from the start), the resampler should be able to cope with it without issues, data for the last 15 seconds should be used to calculate the current sample. If this didn't happen, maybe we have a bug in the resampler.
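
To double-check the numbers, here is a small sketch of the window calculation as described in the docstring above. `relevance_window` is a hypothetical helper written for this comment, not a function from the resampler; it only restates the rule "use the larger of the two periods, multiplied by `max_data_age_in_periods`".

```python
from datetime import timedelta


def relevance_window(
    resampling_period: timedelta,
    input_period: timedelta,
    max_data_age_in_periods: float = 3.0,
) -> timedelta:
    """Return the maximum age a sample can have to still be considered.

    Downsampling (resampling period >= input period) uses the resampling
    period; upsampling uses the input sampling period.
    """
    return max(resampling_period, input_period) * max_data_age_in_periods


# The two docstring examples (max_data_age_in_periods = 2).
downsampling = relevance_window(timedelta(seconds=3), timedelta(seconds=1), 2.0)
upsampling = relevance_window(timedelta(seconds=3), timedelta(seconds=5), 2.0)

# The case discussed here: 5 s input, 1 s resampling period, default of 3.
slow_site = relevance_window(timedelta(seconds=1), timedelta(seconds=5))

print(downsampling, upsampling, slow_site)
```

This gives 6 s and 10 s for the docstring examples and 15 s for a location sending a sample every 5 seconds.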

@cwasicki
Collaborator

cwasicki commented Aug 9, 2024

> it is already dynamic (as it depends on the input sampling period)

Are you sure that this is done if the input data is not on a fixed sampling period? IIUC it can also be None, which I assumed would be used if we use the raw data as input.

@llucax
Contributor Author

llucax commented Aug 13, 2024

I didn't get what you mean by "the input data is not on a fixed sampling period".

@cwasicki
Collaborator

If we resample data with irregular sampling periods, e.g. the raw data from the components, I am not sure we can rely on that.

@llucax
Contributor Author

llucax commented Aug 14, 2024

So, if we are downsampling, the data considered for the current window always covers a fixed time span (max_data_age_in_periods * resampling_period). If we are upsampling though, input samples up to the following age are considered for the current window: max_data_age_in_periods * input_sampling_period, where input_sampling_period is dynamic (it is updated for each received sample as total_time_receiving / total_samples_received). So if the input source rate is stable, it should be more or less constant, but if we often have gaps, then the input_sampling_period will increase, as it is an average.

But also for the downsampling case, if a source is flaky at the beginning, we might consider we are actually upsampling the source, because the data rate is too low. Once it recovers, it should be switched to downsampling.

I'm not saying this is what we want, I'm just saying this is what the resampler is doing right now.
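
A minimal sketch of that averaging, to show how slowly it reacts. `InputPeriodTracker` is a hypothetical class that only mirrors the formula total_time_receiving / total_samples_received from this comment; it is not the resampler's actual code, and timestamps are plain seconds for simplicity.

```python
from __future__ import annotations


class InputPeriodTracker:
    """Hypothetical lifetime-average estimate of the input sampling period."""

    def __init__(self) -> None:
        self._first: float | None = None  # timestamp of the first sample (s)
        self._count = 0  # total samples received so far

    def add(self, timestamp: float) -> float:
        """Register a sample and return the updated average period in seconds."""
        if self._first is None:
            self._first = timestamp
        self._count += 1
        return (timestamp - self._first) / self._count


tracker = InputPeriodTracker()
period = 0.0

# 1000 samples arriving once per second (t = 0 .. 999).
for second in range(1000):
    period = tracker.add(float(second))
steady = period  # 999 / 1000

# The source then degrades to one sample every 5 seconds for 10 samples.
t = 999.0
for _ in range(10):
    t += 5.0
    period = tracker.add(t)
after_slowdown = period  # 1049 / 1010, still close to 1 s

print(steady, after_slowdown)
```

After 50 seconds of degraded input the estimate has only moved from 0.999 s to about 1.04 s, so the resampler would still treat this source as roughly a 1 s input.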

@shsms shsms modified the milestones: v1.0.0-rc800, v1.0.0-rc900 Aug 22, 2024
@llucax llucax modified the milestones: v1.0.0-rc900, v1.0.0-rc1000 Sep 2, 2024
@llucax llucax modified the milestones: v1.0.0-rc1000, v1.0.0-rc1100 Oct 21, 2024
@llucax llucax modified the milestones: v1.0.0-rc1100, v1.0.0-rc1200 Nov 11, 2024
@llucax llucax modified the milestones: v1.0.0-rc1200, v1.0.0-rc1400 Nov 20, 2024
@llucax llucax removed this from the v1.0.0-rc1400 milestone Nov 29, 2024
@llucax llucax added this to the v1.0.0-rc1500 milestone Nov 29, 2024
Projects
Status: To do
Development

No branches or pull requests

3 participants