Cumulative distribution function #3905

dagguh · 2015-05-21T00:04:16Z

Cumulative distribution function, e.g.:

This function is invertible, e.g. you can swap the axes:

I am unable to visualise either function in Kibana build 6998, commit d029b34. I understand that the axes depend on each other, i.e. they must be about the same field and must be a pair of percentiles and percentile ranks. I'm aware that e.g. the Line Chart visualisation isolates the axes from each other.

This is why I propose a new visualisation type: Cumulative Distribution. This is very similar to #2704 which would also need a separate visualisation type. Maybe it can be generalised into a Distribution visualisation type.
Both of them only need a single field as an input. Both would benefit greatly from Split Lines and Split Charts.

If Elasticsearch doesn't provide such capabilities, please let me know and I'll raise an issue there.

rashidkpc · 2015-05-21T14:53:11Z

This is not something we want to introduce a new visualization for, rather this is a transformation on existing data as applied to a line chart. Can you explain some use cases? Give some examples on where you'd use this? Concrete questions it would solve?

dagguh · 2015-05-22T13:26:21Z

E.g. A/B testing. These are actual comparisons we did using JMeter:

We need to compare response times for different configs/versions. Quantiles are the most meaningful.
We care about the entire spectrum, so picking a single percentile is not enough. We might give up some of the completeness and only pick a subset, e.g. P₁, P₂₅, P₅₀, P₇₅, P₉₀, P₉₉, but for 4 splits (per config/version) it would result in 6×4 = 24 lines, which would be absolutely unreadable.

sachinzgupta · 2017-06-15T15:58:34Z

Is there any update, method, or plugin for plotting a cumulative distribution function (CDF) or probability density function (PDF) for a KPI?

agirbal · 2019-09-12T17:58:50Z

+1. Much of the analysis we do is based on percentile distribution, exactly like @dagguh shows. Basically a histogram where the X-Axis are the bucketed percentiles (e.g. p25, p50, p75) of a field, and the Y-Axis uses some number function like average of that same field or median of some other field (counts would be equal between percentiles). This lets you answer questions like "what is the gain on the 25% of users who have the worst latency to our service." It'd be super powerful.

This is an older ticket. Any chance this is now doable with pipeline aggs, and maybe Vega / Canvas visualizations?

polyfractal · 2020-01-03T16:50:39Z

I believe a CDF chart should be doable with the Percentiles aggregation in Elasticsearch. A CDF is just the "continuous" function describing percentiles at any arbitrary position.

So Kibana could ask the percentiles agg for the 0-100 percentiles in small increments (0, 5, 10, 15, ... 100), which will approximate the CDF. Smaller increments == better approximation. Asking for more percentiles is essentially free other than some minor computation and a larger response size. The percentile sketches collect all the information from the shards, and when we construct the response Elasticsearch interrogates the CDF of the sketches to generate specific percentiles. So asking for more percentiles just interrogates the sketch a bit more, which is mostly negligible (within reason) compared to building the sketch itself.

It could also be done with PercentileRanks agg (which is basically the inverted chart), but that requires you to know the extents of data ahead of time. Would be easier to use Percentiles since you know the data is always 0-100, then invert the graph client-side if desired.

I agree that a complete plot of the full CDF is very useful in many analyses.
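A minimal Python sketch of the approach described above, assuming the search body is built client-side. The helper names `build_cdf_request` and `invert_axes` and the aggregation name `cdf` are illustrative only; they are not part of Kibana or Elasticsearch.

```python
def build_cdf_request(field, step=5):
    """Build a search body asking the percentiles agg for 0, step, ..., 100."""
    if step <= 0 or 100 % step != 0:
        raise ValueError("step must be a positive divisor of 100")
    percents = list(range(0, 101, step))
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {
            # "cdf" is an arbitrary aggregation name chosen for this sketch
            "cdf": {"percentiles": {"field": field, "percents": percents}}
        },
    }


def invert_axes(points):
    """Swap (percent, value) pairs into (value, percent) pairs."""
    return [(value, percent) for percent, value in points]


body = build_cdf_request("value", step=5)
print(body["aggs"]["cdf"]["percentiles"]["percents"])
```

A smaller `step` gives a finer approximation at the cost of a larger response. `invert_axes` is the client-side inversion mentioned above: it turns the percentile curve into the usual CDF orientation with the field value on the X axis.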

agirbal · 2020-02-20T22:28:58Z

@polyfractal cc @AlonaNadler the issue with the bare Percentiles agg is that Kibana would still need to do a two-step request to ES, since it applies to your X-axis value X. The Y-axis is just getting the average of the Y value for that percentile interval of X (however granular it is).

Would there be a way for Kibana to choose to do a "quantile histogram", give a few parameters like the granularity or specific percentile value to bucket at, and then ES would do the full aggregation in 1 go?

polyfractal · 2020-02-21T03:36:41Z

I'm not sure I follow? A request like this basically gives you the CDF:

GET /test/_search
{
  "size": 0,
  "aggs": {
    "cdf": {
      "percentiles": {
        "field": "value",
        "percents": [ 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 ]
      }
    },
    "stats": {
      "stats": {
        "field": "value"
      }
    }
  }
}

{
  "aggregations" : {
    "cdf" : {
      "values" : {
        "10.0" : 2.6,
        "20.0" : 8.0,
        "30.0" : 15.0,
        "40.0" : 15.0,
        "50.0" : 15.0,
        "60.0" : 19.499999999999996,
        "70.0" : 20.8,
        "80.0" : 41.300000000000004,
        "90.0" : 67.99999999999999,
        "100.0" : 80.0
      }
    },
    "stats" : {
      "count" : 9,
      "min" : 1.0,
      "max" : 80.0,
      "avg" : 24.666666666666668,
      "sum" : 222.0
    }
  }
}

All Kibana needs to do is convert that into a line chart, e.g. with points at (10.0, 2.6), (20.0, 8.0), etc. It doesn't work with the current Kibana visualization setup because the visualization assumes you have to build the X axis out of bucketing aggs (which is accurate in most cases, just not here). I don't know the internal details about how hard that would be to adjust, but all the data needed to build a CDF is available in the percentiles response.
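As a rough illustration of that conversion, here is a short Python sketch that turns a keyed percentiles response into sorted chart points. The function name `cdf_points` is hypothetical, and the sample dict reuses a few values from the response above. Skipping `None` values is an assumption about how empty results should be handled.

```python
def cdf_points(response, agg_name="cdf"):
    """Return (percent, value) points sorted by percent from a keyed percentiles response."""
    values = response["aggregations"][agg_name]["values"]
    # Keys are strings such as "10.0"; convert them so "100.0" sorts after "20.0".
    points = [(float(percent), value)
              for percent, value in values.items()
              if value is not None]  # assumption: drop percentiles with no value
    return sorted(points)


sample = {
    "aggregations": {
        "cdf": {"values": {"100.0": 80.0, "10.0": 2.6, "20.0": 8.0}}
    }
}
print(cdf_points(sample))  # [(10.0, 2.6), (20.0, 8.0), (100.0, 80.0)]
```

Sorting numerically matters because the keys are strings, and a plain string sort would place "100.0" before "20.0".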

We can't make a "bucket" version of percentiles because it's one of those operations where you don't know the real percentile values until all the shards have been merged together. And at that point it's too late to collect documents into buckets because we're merging on the coordinating node. If we had multi-pass aggs it would be theoretically possible, but it would still require two passes (it'd just happen in ES).

If "bucketed" percentiles are needed today, it could be done by Kibana with two passes: one to get the percentiles, and a second to set up a range agg on those returned percentiles. But that's no longer really a CDF imo :)
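The second pass of that two-pass idea could look roughly like the Python sketch below. It only builds the request body. The function name `build_range_request` and the aggregation names `by_percentile` and `avg_metric` are made up for illustration, and the `avg` sub-aggregation is just one possible metric.

```python
def build_range_request(field, boundaries, metric_field):
    """Second pass: one range bucket per pair of adjacent percentile values."""
    # Percentile values can repeat (15.0 appears three times in the response
    # above), so deduplicate before pairing neighbours.
    edges = sorted(set(boundaries))
    if len(edges) < 2:
        raise ValueError("need at least two distinct boundaries")
    # In a range agg "from" is inclusive and "to" is exclusive, so a document
    # exactly equal to the largest boundary falls outside the last bucket.
    ranges = [{"from": lo, "to": hi} for lo, hi in zip(edges, edges[1:])]
    return {
        "size": 0,
        "aggs": {
            "by_percentile": {
                "range": {"field": field, "ranges": ranges},
                "aggs": {"avg_metric": {"avg": {"field": metric_field}}},
            }
        },
    }


# Boundaries taken from the first-pass percentiles response above.
body = build_range_request("value", [2.6, 8.0, 15.0, 15.0, 15.0, 19.5], "value")
print(body["aggs"]["by_percentile"]["range"]["ranges"])
```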

agirbal · 2020-02-24T18:42:25Z

@polyfractal right your last description is what I mean. Drawing a pure CDF is one thing and you are right that it would answer the original premise of this ticket. But I think it'd be very limiting in what you can do with it - I attempted to describe a more generic approach that would let you do more interesting things here elastic/elasticsearch#50386

You could draw the CDF 2 ways:
A) as you describe: get a whole bunch of percentile points for the value and interpolate them into a line. Your Y-axis would probably select "avg of field A" and then X-axis a new "percentile histogram" that does not select a field since it doesn't need one (just 0-100).
B) allow to select what you want on Y-axis, say "avg of field A" and then on X-axis "Histogram" with a new "percentile of field B" option (instead of typical range). With this solution you can achieve CDF too (by picking same field for both) but it's much more interesting because it lets you do any histogram as you would normally do, but with values that are not X-axis friendly due to their distribution (typical long tail prod system metrics).
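To make option B concrete, here is a small in-memory Python sketch of a "quantile histogram": bucket records by the rank of field B and average field A within each bucket. It runs over a plain list, not Elasticsearch, and the function name `quantile_histogram` and the equal-count bucketing rule are assumptions made for illustration.

```python
from statistics import mean


def quantile_histogram(pairs, n_buckets=4):
    """Split (b, a) pairs into equal-count buckets by rank of b; average a per bucket.

    Returns (lower_percent, upper_percent, average_a) tuples.
    """
    if n_buckets <= 0:
        raise ValueError("n_buckets must be positive")
    ordered = sorted(pairs, key=lambda pair: pair[0])
    n = len(ordered)
    result = []
    for i in range(n_buckets):
        chunk = ordered[i * n // n_buckets:(i + 1) * n // n_buckets]
        if chunk:  # skip buckets left empty when there are few records
            lower = 100 * i // n_buckets
            upper = 100 * (i + 1) // n_buckets
            result.append((lower, upper, mean([a for _, a in chunk])))
    return result


# Made-up sample: (latency, revenue) per user.
print(quantile_histogram([(1, 10), (2, 20), (3, 30), (4, 40)], n_buckets=2))
```

Picking the same field for both elements of each pair gives the average value per percentile band, which is the CDF-like special case mentioned in option B.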

wylieconlon · 2021-06-04T16:45:47Z

For anyone who is trying to get this type of chart in Kibana, I have a workaround using Vega. As mentioned earlier in this thread, Elasticsearch already supports the most basic level of fetching data that we can use to render a chart. Vega can do the calculation in your browser and render the chart. Here's my example.

Full Vega-Lite spec

{
  $schema: https://vega.github.io/schema/vega-lite/v5.json
  title: "Cumulative distribution of bytes"
  data: {
    url: {
      %context%: true
      %timefield%: timestamp
      index: kibana_sample_data_logs
      body: {
        aggs: {
          "terms": {
            "terms": {
              "field": "geo.dest"
              size: 3
            },
            "aggs": {
              "cdf": {
                "percentiles": {
                  "field": "bytes",
                  "percents": [ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 99, 100 ],
                  "keyed": true
                }
              }
            }
          }
        }
        size: 0
      }
    }
    format: {property: "aggregations.terms.buckets"}
  }
  
  transform: [
    {
      fold: [
        "cdf.values['0.0']"
        "cdf.values['10.0']"
        "cdf.values['20.0']"
        "cdf.values['30.0']"
        "cdf.values['40.0']"
        "cdf.values['50.0']"
        "cdf.values['60.0']"
        "cdf.values['70.0']"
        "cdf.values['80.0']"
        "cdf.values['90.0']"
        "cdf.values['95.0']"
        "cdf.values['99.0']"
        "cdf.values['100.0']"
      ]
      as: ["bytes", "value"]
    }
    {
      calculate: 'toNumber(substring(datum.bytes, 12, lastindexof(datum.bytes, "\'"))) / 100'
      as: percentile
    }
  ]

  mark: {
    type: line
    point: true
    tooltip: true
  }

  encoding: {
    x: {
      field: value
      type: quantitative
      axis: {
        title: null
      }
    }
    y: {
      field: percentile
      type: quantitative
      axis: {
        title: null
        format: "0%"
      }
    }
    color: {
      field: key
      type: nominal
      legend: {
        title: null
      }
    }
  }
}
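For readers unfamiliar with the `fold` and `calculate` transforms in the spec above, the following Python sketch mirrors what they do to one terms bucket. It is an explanatory equivalent, not code Kibana runs. The function names and the sample bucket are made up.

```python
def percentile_from_key(key):
    """Mirror the calculate expression: "cdf.values['10.0']" -> 0.1."""
    start = 12  # length of the prefix "cdf.values['"
    end = key.rfind("'")  # same role as lastindexof(datum.bytes, "'")
    return float(key[start:end]) / 100


def fold_bucket(bucket, percents):
    """Mirror the fold transform: one output row per requested percentile."""
    rows = []
    for percent in percents:
        key = f"cdf.values['{percent}']"
        rows.append({
            "key": bucket["key"],  # terms bucket key, used for the color channel
            "percentile": percentile_from_key(key),
            "value": bucket["cdf"]["values"][percent],
        })
    return rows


# Hypothetical bucket shaped like an entry of aggregations.terms.buckets.
bucket = {"key": "CN", "cdf": {"values": {"0.0": 0.0, "50.0": 5000.0}}}
for row in fold_bucket(bucket, ["0.0", "50.0"]):
    print(row)
```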

timductive · 2024-03-21T20:28:22Z

Closing this because it's not planned to be resolved in the foreseeable future. It will be tracked in our Icebox and will be re-opened if our priorities change. Feel free to re-open if you think it should be melted sooner.
