GitHub - odeheurles/HdrHistogram: A High Dynamic Range (HDR) Histogram

Branches Tags
Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
lib/test		lib/test
src		src
.gitignore		.gitignore
COPYING.txt		COPYING.txt
LICENSE.txt		LICENSE.txt
README		README
Repository files navigation

HdrHistogram: A High Dynamic Range (HDR) Histogram
[excerpt from the Histogram JavaDoc]

Histogram supports the recording and analyzing sampled data value counts across a
configurable integer value range with configurable value precision within the range.
Value precision is expressed as the number of significant digits in the value
recording, and provides control over value quantization behavior across the value
range and the subsequent value resolution at any given level.

For example, a Histogram could be configured to track the counts of observed integer
values between 0 and 3,600,000,000 while maintaining a value precision of 3 significant
digits across that range. Value quantization within the range will thus be no larger
than 1/1,000th (or 0.1%) of any value. This example Histogram could be used to track
and analyze the counts of observed response times ranging between 1 microsecond and
1 hour in magnitude, while maintaining a value resolution of 1 microsecond up to 1
millisecond, a resolution of 1 millisecond (or better) up to one second, and a
resolution of 1 second (or better) up to 1,000 seconds. At it's maximum tracked value
(1 hour), it would still maintain a resolution of 3.6 seconds (or better).

Histogram maintains a fixed cost in both space and time. A Histogram's memory footprint
is constant, with no allocation operations involved in recording data values or in
iterating through them. The memory footprint is fixed regardless of the number of data
value samples recorded, and depends solely on the dynamic range and precision chosen.
The amount of work involved in recording a sample is constant, and directly computes
storage index locations such that no iteration or searching is ever involved in
recording data values.

A combination of high dynamic range and precision is useful for collection and accurate
post-recording analysis of sampled value data distribution in various forms. Whether
it's calculating or plotting arbitrary percentiles, iterating through and summarizing
values in various ways, or deriving mean and standard deviation values, the fact that
the recorded data information is kept in high resolution allows for accurate
post-recording analysis with low [and ultimately configurable] loss in accuracy when
compared to performing the same analysis directly on the potentially infinite series
of sourced data values samples.

Internally, Histogram data is maintained using a concept somewhat similar to that of
floating point number representation: Using a an exponent and a (non-normalized) mantissa
to support a wide dynamic range at a high but varying (by exponent value) resolution.
Histogram uses exponentially increasing bucket value ranges (the parallel of the exponent
portion of a floating point number) with each bucket containing a fixed number (per bucket)
set of linear sub-buckets (the parallel of a non-normalized mantissa portion of a floating
point number). Both dynamic range and resolution are configurable, with
highestTrackableValue controlling dynamic range, and largestValueWithSingleUnitResolution
controlling resolution.

An common use example of an HDR Histogram would be to record response times in units of
microseconds across a dynamic range stretching from 1 usec to over an hour, with a good
enough resolution to support later performing post-recording analysis on the collected
data. Analysis can including computing, examining, and reporting of distribution by
percentiles, linear or logarithmic value buckets, mean and standard deviation, or by any
other means that can can be easily added by using the various iteration techniques
supported by the Histogram. In order to facilitate the accuracy needed for various
post-recording anaylsis techniques, this example can maintain where a resolution of ~1
usec or better for times ranging to ~2 msec in magnitude, while at the same time
maintaining a resolution of ~1 msec or better for times ranging to ~2 sec, and a
resolution of ~1 second or better for values up to 2,000 seconds. This sort of example
resolution can be thought of as "always accurate to 3 decimal points." Such an example
Histogram would simply be created with a highestTrackableValue of 3,600,000,000, and a
largestValueWithSingleUnitResolution of 2,000, and would occupy a fixed, unchanging
memory footprint of around 369KB (see "Footprint estimation" below).

Synchronization and concurrent access

In the interest of keeping value recording cost to a minimum, Histogram is NOT internally
synchronized, and does NOT use atomic variables. Callers wishing to make potentially
concurrent, multi-threaded updates or queries against the the Histogram object should
take care to externally synchronize and/or order their access.

Iteration

Histogram supports multiple convenient forms of iterating through the histogram data set,
including linear, logarithmic, and percentile iteration mechanisms, as well as means for
iterating through each recorded value or each possible value level. Identical iteration
mechanisms are available for the histogram's default (corrected) data set and for it's
raw data (see "Corrected and Raw data sets" below) set for the HistogramData available
through either getHistogramData() or getRawHistogramData(). The iteration mechanisms all
provide HistogramIterationValue data points along the histogram's iterated data set, and
are available for the default (corrected) histogram data set via the following
HistogramData methods:

percentiles : An Iterable<HistogramIterationValue> through the histogram using aPercentileIterator
linearBucketValues : An Iterable<HistogramIterationValue> through the histogram using a LinearIterator
logarithmicBucketValues : An Iterable<HistogramIterationValue> through the histogram using a LogarithmicIterator
recordedValues : An Iterable<HistogramIterationValue> through the histogram using a RecordedValuesIterator
allValues : An Iterable<HistogramIterationValue> through the histogram using a AllValuesIterator
Iteration is typically done with a for-each loop statement. E.g.: 

 for (HistogramIterationValue v : histogram.getHistogramData().percentiles(percentileTicksPerHalfDistance)) {
     ...
 }
 
or 
  for (HistogramIterationValue v : histogram.getRawHistogramData().linearBucketValues(valueUnitsPerBucket)) {
     ...
 }
 
The iterators associated with each iteration method are resettable, such that a caller
that would like to avoid allocating a new iterator object for each iteration loop can
re-use an iterator to repeatedly iterate through the histogram. This iterator re-use
usually takes the form of a traditional for loop using the Iterator's hasNext() and
next() methods: to avoid allocating a new iterator object for each iteration loop: 

PercentileIterator iter = histogram.getHistogramData().percentiles().iterator(percentileTicksPerHalfDistance);
 ...
 iter.reset(percentileTicksPerHalfDistance);
 for (iter.hasNext() {
     HistogramIterationValue v = iter.next();
     ...
 }
 
Equivalent Values and value ranges

Due to the finite (and configurable) resolution of the histogram, multiple adjacent
integer data values can be "equivalent". Two values are considered "equivalent" if
samples recorded for both are always counted in a common total count due to the
histogram's resolution level. Histogram provides methods for determining the lowest
and highest equivalent values for any given value, as we as determining whether two
values are equivalent, and for finding the next non-equivalent value for a given
value (useful when looping through values, in order to avoid double-counting count).

Corrected and Raw data sets

In order to support a common use case needed when histogram values are used to track
response time distribution, Histogram collects both raw and corrected (weighted)
histogram results by supporting an optional expectedIntervalBetweenValueSamples
parameter to the recordValue method. When a value recorded in the histogram exceeds
the expectedIntervalBetweenValueSamples parameter, the raw histogram data will reflect
only the single reported result, while the default (corrected) histogram data will
reflect an appropriate number of additional results with linearly decreasing values
(down to the last value that would still be higher than
expectedIntervalBetweenValueSamples).

To illustrate why this corrective behavior is critically needed in order to accurately
represent value distribution when large value measurements may lead to missed samples,
imagine a system for which response times samples are taken once every 10 msec to
characterize response time distribution. The hypothetical system behaves "perfectly"
for 100 seconds (10,000 recorded samples), with each sample showing a 1msec response
time value. at each sample for 100 seconds (10,000 logged samples at 1msec each). The
hypethetical system then encounters a 100 sec pause during which only a single sample
is recorded (with a 100 second value). The raw data histogram collected for such a
hypothetical system (over the 200 second scenario above) would show ~99.99% of results
at 1msec or below, which is obviously "not right". The same histogram, corrected with
the knowledge of an expectedIntervalBetweenValueSamples of 10msec will correctly
represent the response time distribution. Only ~50% of results will be at 1msec or
below, with the remaining 50% coming from the auto-generated value records covering
the missing increments spread between 10msec and 100 sec.

The raw and default (corrected) data sets will differ only if at least one value
recorded with the recordValue method was greater than it's associated
expectedIntervalBetweenValueSamples parameter. The raw and default (corrected) data
set will be identical in contents if all values recorded via the recordValue were
smaller than their associated (and optional) expectedIntervalBetweenValueSamples
parameters.

While both the raw and corrected histogram data are tracked and accessible, it is the
(default) corrected numbers that would typically be consulted and reported. When used
for response time characterization, the default (corrected) data set will tend to much
more accurately reflect the response time distribution that a random, uncoordinated
request would have experienced.

Footprint estimation

Due to it's dynamic range representation, Histogram is relatively efficient in memory
space requirements given the accuracy and dynamic range it covers. Still, it is useful
to be able to estimate the memory footprint involved for a given highestTrackableValue
and largestValueWithSingleUnitResolution combination. Beyond a relatively small
fixed-size footprint used for internal fields and stats (which can be estimated as
"fixed at well less than 1KB"), the bulk of a Histogram's storage is taken up by it's
data value recording counts array. The total footprint can be estimated by:

     largestValueWithSingleUnitResolution = 2 * (10 ^ numberOfSignificantValueDigits);

     expectedHistogramFootprintInBytes = 1024 +
          8 * (log2RoundedUp((1.0 * highestTrackableValue) / largestValueWithSingleUnitResolution) + 2) *
           roundedUpToNearestPowerOf2(largestValueWithSingleUnitResolution)

 
A conservative (high) estimate of a Histogram's footprint in bytes is available via
the getEstimatedFootprintInBytes() method.