Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calculating kurtosis from histogram #10

Closed
david-res opened this issue May 21, 2024 · 31 comments
Closed

Calculating kurtosis from histogram #10

david-res opened this issue May 21, 2024 · 31 comments
Assignees
Labels
question Further information is requested

Comments

@david-res
Copy link

How would you suggest that I calculate the Kurtosis of the histogram?
Ideally I would have at least 1024 bin (10 bit adc resolution)

Would you be able to provide such an example?

@RobTillaart RobTillaart self-assigned this May 21, 2024
@RobTillaart RobTillaart added the question Further information is requested label May 21, 2024
@RobTillaart
Copy link
Owner

Thanks for your question, as I am in the middle of a project it might take a few days to come back to it.

@RobTillaart
Copy link
Owner

RobTillaart commented May 21, 2024

1024 bins means you need 4K RAM to hold all bins if you use the Histogram class.
If you use Histogram16 you need 2K RAM to hold the bins.

Some googling showed there is quite some math involved.
These links might help you to get started.

(from the last link, not Arduino code)

#ifndef RUNNINGSTATS_H
#define RUNNINGSTATS_H

class RunningStats
{
public:
    RunningStats();
    void Clear();
    void Push(double x);
    long long NumDataValues() const;
    double Mean() const;
    double Variance() const;
    double StandardDeviation() const;
    double Skewness() const;
    double Kurtosis() const;

    friend RunningStats operator+(const RunningStats a, const RunningStats b);
    RunningStats& operator+=(const RunningStats &rhs);

private:
    long long n;
    double M1, M2, M3, M4;
};

#endif

And here is the implementation file RunningStats.cpp.

#include "RunningStats.h"
#include <cmath>
#include <vector>

RunningStats::RunningStats() 
{
    Clear();
}

void RunningStats::Clear()
{
    n = 0;
    M1 = M2 = M3 = M4 = 0.0;
}

void RunningStats::Push(double x)
{
    double delta, delta_n, delta_n2, term1;

    long long n1 = n;
    n++;
    delta = x - M1;
    delta_n = delta / n;
    delta_n2 = delta_n * delta_n;
    term1 = delta * delta_n * n1;
    M1 += delta_n;
    M4 += term1 * delta_n2 * (n*n - 3*n + 3) + 6 * delta_n2 * M2 - 4 * delta_n * M3;
    M3 += term1 * delta_n * (n - 2) - 3 * delta_n * M2;
    M2 += term1;
}

long long RunningStats::NumDataValues() const
{
    return n;
}

double RunningStats::Mean() const
{
    return M1;
}

double RunningStats::Variance() const
{
    return M2/(n-1.0);
}

double RunningStats::StandardDeviation() const
{
    return sqrt( Variance() );
}

double RunningStats::Skewness() const
{
    return sqrt(double(n)) * M3/ pow(M2, 1.5);
}

double RunningStats::Kurtosis() const
{
    return double(n)*M4 / (M2*M2) - 3.0;
}

RunningStats operator+(const RunningStats a, const RunningStats b)
{
    RunningStats combined;
    
    combined.n = a.n + b.n;
    
    double delta = b.M1 - a.M1;
    double delta2 = delta*delta;
    double delta3 = delta*delta2;
    double delta4 = delta2*delta2;
    
    combined.M1 = (a.n*a.M1 + b.n*b.M1) / combined.n;
    
    combined.M2 = a.M2 + b.M2 + 
                  delta2 * a.n * b.n / combined.n;
    
    combined.M3 = a.M3 + b.M3 + 
                  delta3 * a.n * b.n * (a.n - b.n)/(combined.n*combined.n);
    combined.M3 += 3.0*delta * (a.n*b.M2 - b.n*a.M2) / combined.n;
    
    combined.M4 = a.M4 + b.M4 + delta4*a.n*b.n * (a.n*a.n - a.n*b.n + b.n*b.n) / 
                  (combined.n*combined.n*combined.n);
    combined.M4 += 6.0*delta2 * (a.n*a.n*b.M2 + b.n*b.n*a.M2)/(combined.n*combined.n) + 
                  4.0*delta*(a.n*b.M3 - b.n*a.M3) / combined.n;
    
    return combined;
}

RunningStats& RunningStats::operator+=(const RunningStats& rhs)
{ 
        RunningStats combined = *this + rhs;
        *this = combined;
        return *this;
}

@RobTillaart
Copy link
Owner

Found some time,

Made an experimental sketch + library based upon the above code.
It does not implement the adding of two Kurtosis objects.

Kurtosis.zip

@RobTillaart
Copy link
Owner

Slightly faster skew() as a division is replaced by multiplication (to be verified).

    double skew()
    {
      return sqrt(double(_count)) * M3 * pow(M2, -1.5);
    };

@david-res
Copy link
Author

@RobTillaart that is very kind of you to jump into this and put together that library!

Today my my code does something of this kind:
I have two histogram buffers, I increment a bin on each ADC value (running at 250Khz sample rate) and I swap buffers every 100ms, so that I can process one while the other is being filled.

RAM consumption is not of an issue as I am using a Teensy 4.1 with 1MB of internal RAM and 16MB of PSRAM,

I assume that in order to setup the same kind of architecture, I would need two histogram classes?

@RobTillaart
Copy link
Owner

RobTillaart commented May 21, 2024

(Not a statistician, never worked with the kurtosis/skewness until today, so all disclaimers apply)

Thinking out loud,

Using two buffers sounds logical however swapping the buffers means (1) either you clear and start all over again or (2) you continue with the contents of that histogram.

  • case 1 you loose historical data which is good or not (for you to decide)
  • case 2 you miss 50% of historical data during analysis, which could introduce some interference if your dataset has a similar frequency in it. (e.g. if you are sampling a signal that changes every 100 ms too).

What you should do is to use three histogram objects(x,Y,Z),
two of them (X and Y)are filled alternatingly,
the third (Z) is the sum which adds the data, so you always work on all samples.

in pseudo code

task_filler()
{
  forever()
  {
    value = adc();
    if (flag == x) X[adc]++;
    else Y[adc]++;
  }
}

task_analyzer()
{
  Kurtosis K;
  forever()
  {
    K.reset();
    //  add the Y histo to the Z histo and add the new Z value to K.
    for (idx = 0; idx < max; idx++) 
    {
      if (flag == x)  Z[idx] += Y[idx];   //  while x is being filled we analyze Y
      else Z[idx] += X[idx];  //  and vice versa
      K.add(Z[idx]);
    }
    print(K.count + "\t");
    print(K.kurtosis());

    //  clear process historgram, "swap" X and Y, and start again. 
    if (flag == x) 
    {
      Y.clear(); 
      flag = y;
    }
    else 
    {
      X.clear(); 
      flag = x;
    } 
}

250 KHz samples every 100 ms means that you could have 25000 samples in one bin.
So the X and Y histogram could be a histogram16 type, where Z must be an histogram type

@RobTillaart
Copy link
Owner

Created a new library - https://github.com/RobTillaart/Kurtosis (under development)

The calculation of the kurtosis of a histogram is more work as you need to add every entry, e.g. if bin 13 has a count of 123, you have to add 123 times the value bin 13 represents. I can implement an add(value, times) but the only gain is that you have no function call overhead. Might still be worth it

So it will easier to have one histogram and add all measurements to the Kurtosis object directly after making the measurement.

@RobTillaart
Copy link
Owner

@david-res

Released the version 0.1.0 of the kurtosis library today.

Furthermore in the new 0.1.1 develop branch I have added add(double value, uint8_t times).
This is code under investigation and looks very promising.
There are some overflow risks that can be tackled by using uint64_t for the internal counter.
Another option is to refactor the code in an overflow preventing way.

This add(double value, uint8_t times).allows to add a single value multiple times in one call.
Current implementation is more efficient from times >= 3 and has a near constant runtime as far as tested.
So it is ideal for going through a histogram to determine the skewness and

(need to find time)

@david-res
Copy link
Author

@RobTillaart that's amazing, thanks!!

Could you help me put together a simple example of a 1024 bin histogram (I saw you need to declare the bin list in the class, and it can't be set after - so how would I do over 1k bins without looping over b[] to assign a value to each bin?)
Then use the Kurtosis to calculate the value based on the histogram?

Once I have something basic in front of me, I can scale it up for the double or triple buffers etc

@RobTillaart
Copy link
Owner

RobTillaart commented May 23, 2024

@david-res
I'm sorry It seems that Iwas too optimistic,
in my first test I created a Kurtosis object. filled it with N calls to add(value) and then I called add(value, 100); and it gave the roughly same metrics (within float accuracy) as when I would add the 100 values in a loop.

This morning I did a second test - code below so you can confirm - in which I filled an array with random values.
I calculated the kurtosis in that loop to get the metrics when calling single add(value) for every new value.

Then I filled a 2nd kurtosis object with add(value, times) from the histogram. The metrics failed to be calculated. Reason is that when you start with e.g. 10x value 4 the average is 4 but the variance etc is all zero. Apparently the math is such that when adding the variances it stays zero. This seems to be also affecting the other metrics.

TODO: check if adding order affects metrics.

There might be a (time consuming) trick to handle this, I will dive into it right away

//
//    FILE: kurtosis_histogram.ino
//  AUTHOR: Rob Tillaart
//    DATE: 2024-05-21
// PURPOSE: demo determination of skewness and kurtosis of histogram
//     URL: https://github.com/RobTillaart/Kurtosis

#include "Kurtosis.h"

Kurtosis K1;
Kurtosis K2;

//  simplified histogram 
uint16_t hist[100];

void setup()
{
  Serial.begin(115200);
  Serial.println(__FILE__);
  Serial.print("KURTOSIS_LIB_VERSION: ");
  Serial.println(KURTOSIS_LIB_VERSION);
  Serial.println();

  K1.reset();
  //  fill the histogram with 10000 measurements
  for (int i = 0; i < 10000; i++)
  {
    int x = random(100);  //  simulate measurement
    hist[x]++;
    K1.add(x);
  }

  // print statistics
  Serial.println("K1");
  Serial.print("COUNT:\t");
  Serial.println(K1.count());
  Serial.print("MEAN:\t");
  Serial.println(K1.mean());
  Serial.print("VAR:\t");
  Serial.println(K1.variance());
  Serial.print("STDDEV:\t");
  Serial.println(K1.stddev());
  Serial.print("SKEW:\t");
  Serial.println(K1.skewness());
  Serial.print("KURT:\t");
  Serial.println(K1.kurtosis());
  Serial.println();


  K2.reset();
  for (int i = 0; i < 100; i++)
  {
    K2.add(i, hist[i]);
  }

  // print statistics
  Serial.println("K2");
  Serial.print("COUNT:\t");
  Serial.println(K2.count());
  Serial.print("MEAN:\t");
  Serial.println(K2.mean());
  Serial.print("VAR:\t");
  Serial.println(K2.variance());
  Serial.print("STDDEV:\t");
  Serial.println(K2.stddev());
  Serial.print("SKEW:\t");
  Serial.println(K2.skewness());
  Serial.print("KURT:\t");
  Serial.println(K2.kurtosis());
  Serial.println();
}


void loop()
{

}

//  -- END OF FILE --

output

K1
COUNT:	10000
MEAN:	49.72
VAR:	837.70
STDDEV:	28.94
SKEW:	-0.02
KURT:	-1.21

K2
COUNT:	10000
MEAN:	49.72
VAR:	0.00
STDDEV:	0.00
SKEW:	-0.02
KURT:	4251.64

only skewness seems to be correct

@RobTillaart
Copy link
Owner

There might be a (time consuming) trick to handle this, I will dive into it right away

Trick failed, 😒, learned (again) that adding distributions is different from adding values.

This means that the add(value, times) will be removed from the 0.1.1 API.
So the only way is just add the values when they are produced by the ADC.

Sorry I could not solve your question. For me the investigations will stop for now as I am out of ideas and time.
I will finish the 0.1.1 version of the library asap.

@david-res
Copy link
Author

david-res commented May 23, 2024

@RobTillaart so today what I do is I fill a buffer(histogram) with a count increment based on the ADC value as each bin represents an ADC value, something like this:

uint32_t histogramBuffer[1024] = {0};

void adc0_isr(void) {
    uint16_t measurement = adc->adc0->analogReadContinuous();
    // Increment the corresponding histogram bin
    histogramBuffer[measurement] += 1;
}

//called after switching histogram buffers
void processData(){
    for(int i=0;i>1024;i++){ //Loop over each bin
        uint32_t count = histogramBuffer[i]; //get bin count value
        for(int j=0;j<count;j++){
            K1.add(i);
        }
    }
    Serial.print("KURT:\t");
    Serial.println(K1.kurtosis());
}

My only concern is the Kurtosis is not aware of bins that have a value of 0 - would that not put off the calculation?

@RobTillaart
Copy link
Owner

Use an if statement to filter them

If ("bin not empty") k.add(bin);

@david-res
Copy link
Author

The for loop is doing that already. If count eq 0 then the loop won't run.
But then the mean of the histogram will be off if we are not reporting bins with a value of 0.
That's why its better that the add function have two argument - the bin number and it's value.

What do you think?

@RobTillaart
Copy link
Owner

I see now zero bins are handled correctly.
What is a possible concern is that the inner loop uses an int for j which should be an uint32 to prevent overflow (if int is 16 bits it could happen)

@david-res
Copy link
Author

david-res commented May 23, 2024

Good catch, I missed that for int j.

But what about my comment regarding the mean calculation?

I think I could increment a variable each time the bin value eq 0. Then call K1.add(0) x variable count.
Or if else with the second loop inside the if
This will assure I have the right mean calculated.

@RobTillaart
Copy link
Owner

Simple example
1024 bins, 1 measurements e.g. 654.
If you add zero for every bin that is zero the internal count would be 1024 and the average 0.65... while the average of the measurements is 654.

Bins with value 0 should therefor not be added as zero.

@RobTillaart
Copy link
Owner

K1.add(0)
Means that you want to add the measurements with the value zero. (E.g the ADC reads 0Volts)

@david-res
Copy link
Author

Right right, that makes total sense now.
I was looking at if from the aspect of values and not count of values

@david-res
Copy link
Author

@RobTillaart I got the Kurtosis class working, thanks!

I have another question - how can I create a pointer array to two Histogram objects? This is so that I can swap between them every 100ms in my existing intervalTimer interrupt.

@RobTillaart
Copy link
Owner

Tomorrow I might have some tome

@david-res
Copy link
Author

david-res commented May 25, 2024

I ended up creating two histograms object: H1 and H2

I them created a class pointer array:
Histogram * histogram[2] = {&H1, &H2};

it compiles, but I have not run it yet

@RobTillaart
Copy link
Owner

Looks like nice pointers which will work.
However if your application runs multitasking you have to take some precautions to make swapping them atomic.

It is easier to have an array of histograms and adjust the (byte size) index one may write to.. this index is either 0 or 1 and can only have one value, so atomic by design (note it should be byte sized).

@RobTillaart
Copy link
Owner

RobTillaart commented May 25, 2024

Depending on board swapping two pointers can be made atomic enough.

You have one process that writes to a H1 and a process that reads H2. When the reading process is done it clears H2 and changes the pointer if the write process to pint to H2. (Both pointers point to H2 now however the read process is dine).
The write process continues writing and the read process changes its pointer to H1.

So it is possible if the reader is in control of the swapping.

@RobTillaart
Copy link
Owner

RobTillaart commented May 25, 2024

//
//    FILE: hist_pointer.ino
//  AUTHOR: Rob Tillaart
// PURPOSE: demo histogram pointer
//     URL: https://github.com/RobTillaart/Histogram


#include "histogram.h"

//  boundaries does not need to be equally distributed.
float a[] = {
  0, 50, 125, 200, 250, 350, 500
};
float b[] = {
  0, 100, 200, 300, 325, 350, 375, 400
};

Histogram histA(7, a);
Histogram histB(8, b);
Histogram *phist;


void setup()
{
  Serial.begin(115200);
  Serial.println(__FILE__);
  Serial.print("HISTOGRAM_LIB_VERSION: ");
  Serial.println(HISTOGRAM_LIB_VERSION);
  Serial.println();

  phist = &histA;

}


void loop()
{
  phist->clear();
  for (int i = 0; i < 10000; i++)
  {
    int x = random(600) - 50;
    phist->add(x);
  }

  Serial.print(phist->count());
  for (uint16_t i = 0; i < phist->size(); i++)
  {
    Serial.print("\t");
    Serial.print(phist->frequency(i), 2);
  }
  Serial.println();

  //  swap
  if (phist == &histA) phist = &histB;
  else phist = &histA;
}


//  -- END OF FILE --

compiles and runs.

Sample output shows the different length of the 2 histograms.

...Histogram\examples\hist_pointer\hist_pointer.ino
HISTOGRAM_LIB_VERSION: 0.3.6

10000	0.08	0.08	0.12	0.13	0.08	0.16	0.25	0.09
10000	0.09	0.17	0.16	0.17	0.04	0.04	0.04	0.04	0.24
10000	0.09	0.08	0.13	0.13	0.08	0.16	0.25	0.09
10000	0.08	0.17	0.16	0.17	0.04	0.04	0.04	0.04	0.25
10000	0.09	0.09	0.12	0.12	0.08	0.17	0.25	0.09
10000	0.08	0.17	0.17	0.17	0.04	0.04	0.04	0.04	0.24
10000	0.08	0.08	0.13	0.13	0.08	0.17	0.25	0.08
10000	0.09	0.17	0.17	0.16	0.04	0.04	0.04	0.04	0.24
10000	0.09	0.08	0.13	0.12	0.08	0.17	0.25	0.09

will update the library with the example soon.

@RobTillaart
Copy link
Owner

Created develop branch with pointer and array example + reference to new kurtosis library.

Think the original issue of kurtosis math is solved now.

RobTillaart added a commit that referenced this issue May 26, 2024
- add examples **hist_pointer.ino** and **hist_array.ino**
- issue #10 spin off library https://github.com/RobTillaart/Kurtosis
- minor edits
@RobTillaart
Copy link
Owner

@david-res
Develop branch is merged, you may close the issue

@RobTillaart
Copy link
Owner

Note to self:
thinking about the Kurtosis of a histogram.

(thinking out loud modus)
if a bin representing a value V has a count of C, does in mean it has a C values added in the range R = <V-delta, V+delta>
so when adding a bin to do some statistics me could / should / might add some noise so that we add C values in the range R.

So we do not add C times V but V +-delta, which averages to V but could represent the original values better.
Of course if a bin only handles integer values, one should only regenerate integer values within the range R.

As said before, no statistician, however the idea would recreate the possible spread of the original values.
It might be closer to reality.

@david-res
Copy link
Author

Cheers @RobTillaart!
Already got the Kurtosis implemented with my current histogram.
For some reason when I use your histogram class it stops my UI from rendering, but everything else keeps working - will investigate it soon.

@RobTillaart
Copy link
Owner

For some reason when I use your histogram class it stops my UI from rendering, but everything else keeps working - will investigate it soon.

Please open a new issue if it is a histogram issue.

FYI added a standalone kurtosis and skewness function to https://github.com/RobTillaart/kurtosis develop branch.
Works only if you have a full dataset available as it needs to go through it twice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants