Rework quantile iteration logic #67

marshallpierce · 2017-10-13T21:55:29Z

See #66.

I've gotta run at the moment but I wanted to get this in front of people. I don't think the quantile iterator needs a nontrivial more(); I'm guessing that was an expedient tweak at some point in the Java impl's past but I don't think it really makes mathematical sense.

Also, in `(1.0 / (1.0 - self.quantile_to_iterate_to)).log2() as u32`, the floating-point calculation yielded inf at quantile 1.0, which became 0 when cast to u32. So now we short-circuit before that logic.
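
To make the failure mode concrete, here is a minimal sketch of the short circuit. The helper names `half_distance_exponent` and `unguarded_log2` are hypothetical and not taken from the PR; only the arithmetic expression comes from the description above. What the cast of inf yields depends on the compiler version: since Rust 1.45, float-to-integer `as` casts saturate (so inf becomes `u32::MAX` rather than 0), which is one more reason to guard before the cast instead of relying on its result.

```rust
/// Hypothetical helper (not the PR's code): how many times the distance to
/// quantile 1.0 has been halved, or `None` once we are already at 1.0.
fn half_distance_exponent(quantile: f64) -> Option<u32> {
    // Short circuit: at quantile 1.0 the division below is 1.0 / 0.0 = inf,
    // so bail out before doing any float-to-integer conversion.
    if quantile >= 1.0 {
        return None;
    }
    Some((1.0 / (1.0 - quantile)).log2() as u32)
}

/// The unguarded floating-point part of the expression, kept separate so the
/// infinite intermediate result can be observed directly.
fn unguarded_log2(quantile: f64) -> f64 {
    (1.0 / (1.0 - quantile)).log2()
}
```

Returning an `Option` is only one way to express the guard; an early `return` inside the iterator would do the same job.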

marshallpierce · 2017-10-13T23:52:05Z

src/lib.rs

@@ -218,7 +219,7 @@ const ORIGINAL_MAX: u64 = 0;
 /// Partial ordering is used for threshholding, also usually in the context of quantiles.
 pub trait Counter
    : num::Num + num::ToPrimitive + num::FromPrimitive + num::Saturating + num::CheckedSub
-    + num::CheckedAdd + Copy + PartialOrd<Self> {
+    + num::CheckedAdd + Copy + PartialOrd<Self> + fmt::Debug {


This is to allow the use of assert_eq! with a T, since assert_eq! needs fmt::Debug to print both operands on failure. Seemed pretty harmless.
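
For readers wondering why the extra bound is needed, a small sketch follows. The function `assert_no_leftover` is a hypothetical stand-in, not code from this crate; it only shows that `assert_eq!` on a generic type compiles when the type is both `PartialEq` and `Debug`.

```rust
use std::fmt::Debug;

/// Hypothetical stand-in for the check in the iterator: panics with both
/// values printed if `count` differs from `zero`.
/// Removing the `Debug` bound makes this fail to compile, because
/// `assert_eq!` formats its operands with `{:?}` in the panic message.
fn assert_no_leftover<T: PartialEq + Debug>(count: T, zero: T) {
    assert_eq!(count, zero);
}
```

Plain `assert!(count == zero)` needs only `PartialEq`, which is why the original code compiled without the bound.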

marshallpierce · 2017-10-13T23:52:52Z

src/iterators/mod.rs

@@ -163,7 +165,7 @@ impl<'a, T: 'a, P> Iterator for HistogramIterator<'a, T, P>
                    // if we've seen all counts, no other counts should be non-zero
                    if self.total_count_to_index == total {
                        // TODO this can fail when total count overflows
-                        assert!(count == T::zero());
+                        assert_eq!(count, T::zero());


Got tired of IntelliJ warning me that this could be assert_eq!.

marshallpierce · 2017-10-14T00:04:54Z

src/iterators/quantile.rs

-            true
-        } else {
-            false
-        }


In the Java impl, self.quantile_to_iterate_to ends up being what is exposed as the quantile iterated to. The Rust impl instead calculates (accumulated count) / (total count) at each iteration point, as does the Java impl for every iterator other than percentile. Thus, in the Java impl, you would be at quantile 0.998... or similar on reaching the last nonzero bucket, and it was aesthetically pleasing to nudge the iterator one more slot forward to get to 1.0.

Hmm... maybe the Java way is better and we should expose that as well? Java uses getPercentile() vs getPercentileIteratedTo() to allow consumers to differentiate. We could expose an extra field as well in IterationValue, or apply type system shenanigans to have per-iterator value types (<V extends IterationValue>?) or bolted-on extra data (IterationValue<T, V> for some per-iterator associated type V). The upside is that it lets quantile iteration like in the dump-to-stdout example show the internal quantile value's small changes to make it clear that forward progress is happening, even when we stay at the same value for a while.

I think having another value in IterationValue sounds perfectly sensible.
Not sure it'll be accessed much, but doesn't hurt to have it there.

OK, I'll go ahead and add it. It will make the quantile iteration output a little easier to visually comprehend.
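
As a rough illustration of the idea discussed in this thread, the sketch below carries both quantiles in one iteration value. The struct `IterationPoint` and its methods are hypothetical and much smaller than the crate's real `IterationValue`; the field names `value_iterated_to` and `quantile_iterated_to` follow the names mentioned in the release notes for this change, while `quantile` is an assumed name for the quantile at the value.

```rust
/// Hypothetical, trimmed-down iteration value carrying both quantiles.
#[derive(Debug, Clone, Copy, PartialEq)]
struct IterationPoint {
    /// Value the iterator is currently positioned at.
    value_iterated_to: u64,
    /// (accumulated count) / (total count) at that value.
    quantile: f64,
    /// The quantile the iterator was asked to step to.
    quantile_iterated_to: f64,
}

impl IterationPoint {
    /// True when the iterator stayed at the same value as `prev`.
    fn same_value_as(&self, prev: &IterationPoint) -> bool {
        self.value_iterated_to == prev.value_iterated_to
    }

    /// True when the internal quantile moved forward since `prev`,
    /// even if the value did not change.
    fn is_progress_from(&self, prev: &IterationPoint) -> bool {
        self.quantile_iterated_to > prev.quantile_iterated_to
    }
}
```

With two consecutive points at the same value, `same_value_as` is true while `is_progress_from` still reports forward movement, which is the property that makes the quantile output easier to read.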

jonhoo · 2017-10-14T15:23:47Z

examples/cli.rs

            .get_matches();

+    let stdin = std::io::stdin();
+    let stdin_handle = stdin.lock();


Any particular reason you don't just want let stdin = std::io::stdin().lock();?

The compiler gets grumpy about the lifetime of stdin() if I remove the intermediate let. Perhaps there is a better way though?

Ah.. This is at least a little nicer:

    let stdin = std::io::stdin();
    let stdin = stdin.lock();
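
A sketch of the shadowing pattern in context. At the time of this PR, `lock()` borrowed from the `Stdin` value, so the temporary returned by `stdin()` had to be bound to a name that outlives the lock; on Rust 1.61 and later `Stdin::lock` returns a `StdinLock<'static>` and the one-liner compiles as well. The functions `sum_lines` and `sum_stdin` are hypothetical and not part of the CLI example; taking any `BufRead` keeps the parsing logic testable without a real stdin.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: sums one unsigned integer per line, skipping blanks.
fn sum_lines<R: BufRead>(reader: R) -> io::Result<u64> {
    let mut total = 0u64;
    for line in reader.lines() {
        let line = line?;
        let trimmed = line.trim();
        if trimmed.is_empty() {
            continue;
        }
        let n: u64 = trimmed
            .parse()
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        total += n;
    }
    Ok(total)
}

/// Same logic fed from stdin, using the shadowing form from the review.
#[allow(dead_code)]
fn sum_stdin() -> io::Result<u64> {
    let stdin = io::stdin(); // owns the handle for the rest of the function
    let stdin = stdin.lock(); // shadows it with the locked, buffered view
    sum_lines(stdin)
}
```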

jonhoo · 2017-10-14T15:30:34Z

src/iterators/mod.rs

@@ -142,7 +142,9 @@ impl<'a, T: 'a, P> Iterator for HistogramIterator<'a, T, P>
                return None;
            }

-            // have we yielded all non-zeros in the histogram?
+            // TODO should check if we've reached max, not count, to avoid early termination
+            // on histograms with very large counts whose total would exceed u64::max_value()


Yeah, that's a good point.

Though doesn't running_total have the same issue?

It absolutely does, but we could limit the damage to a saturated total_count_to_index while still continuing to iterate until we've reached the max. I've got another branch I'm working on that does this.
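
A minimal sketch of that idea over a plain slice of counts, not the crate's internals. The function `scan_counts` is hypothetical: it saturates the running total instead of overflowing, and it decides where iteration ends from the position of the last non-zero count rather than from the total.

```rust
/// Hypothetical sketch: walk every bucket, saturating the running total,
/// and report the index of the last non-zero count (the "max") separately.
fn scan_counts(counts: &[u64]) -> (Option<usize>, u64) {
    let mut running_total = 0u64;
    let mut last_nonzero = None;
    for (index, &count) in counts.iter().enumerate() {
        // Saturate instead of wrapping or panicking on overflow; the total
        // becomes a lower bound, but iteration is not cut short by it.
        running_total = running_total.saturating_add(count);
        if count != 0 {
            last_nonzero = Some(index);
        }
    }
    (last_nonzero, running_total)
}
```

For counts `[u64::MAX, 0, 5, 0]` the total saturates at `u64::MAX`, yet the bucket at index 2 is still visited and reported as the last non-zero one.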

marshallpierce · 2017-10-14T19:03:09Z

Now it's pretty clear what's going on in quantile output, I think:

       Value          QuantileValue      QuantileIteration TotalCount 1/(1-Quantile)

           0 0.00010000000000000000 0.00000000000000000000          1           1.00
   999292927 0.10000000000000000555 0.10000000000000000555       1000           1.11
  1999634431 0.20000000000000001110 0.20000000000000001110       2000           1.25
  3001024511 0.30009999999999997788 0.30000000000000004441       3001           1.43
  4001366015 0.40010000000000001119 0.40000000000000002220       4001           1.67
...
  9982443519 0.99819999999999997620 0.99765625000000002220       9982         555.56
  9982443519 0.99819999999999997620 0.99804687500000000000       9982         555.56
  9990832127 0.99899999999999999911 0.99824218750000004441       9990        1000.00
  9990832127 0.99899999999999999911 0.99843750000000008882       9990        1000.00
  9990832127 0.99899999999999999911 0.99863281250000013323       9990        1000.00
  9990832127 0.99899999999999999911 0.99882812500000017764       9990        1000.00
  9999220735 0.99990000000000001101 0.99902343750000022204       9999       10000.00
  9999220735 0.99990000000000001101 0.99912109375000024425       9999       10000.00
...
  9999220735 0.99990000000000001101 0.99982910156250004441       9999       10000.00
  9999220735 0.99990000000000001101 0.99985351562500002220       9999       10000.00
  9999220735 0.99990000000000001101 0.99987792968750000000       9999       10000.00
  9999220735 0.99990000000000001101 0.99989013671875004441       9999       10000.00
 10007609343 1.00000000000000000000 0.99990234375000008882      10000              ∞
#[Mean       = 5000000916.51, StdDeviation   = 2887040392.98]
#[Max        =  10007609343, Total count    =        10000]
#[Buckets    =           54, SubBuckets     =        56320]

You can see both the pleasing 1.0 quantile when you hit the max value, and also see that the iteration quantile is ticking slowly up during the part where you stay at value 9999220735 for a while.

jonhoo · 2017-10-14T20:02:27Z

examples/cli.rs

@@ -109,9 +109,10 @@ fn quantiles<R: BufRead, W: Write>(mut reader: R, mut writer: W, quantile_precis

    writer.write_all(
        format!(
-            "{:>12} {:>quantile_precision$} {:>10} {:>14}\n\n",
+            "{:>12} {:>quantile_precision$} {:>quantile_precision$} {:>10} {:>14}\n\n",


Neat. I didn't know about this trick.
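
For anyone else who has not seen it: `{:>name$}` takes the field width from a named argument of type `usize`. A small sketch, assuming a hypothetical `header` function; the column titles are copied from the output earlier in this thread.

```rust
/// Hypothetical helper: builds the header row, right-aligning the two
/// quantile columns to a width supplied at runtime.
fn header(quantile_precision: usize) -> String {
    format!(
        "{:>12} {:>quantile_precision$} {:>quantile_precision$} {:>10} {:>14}\n\n",
        "Value",
        "QuantileValue",
        "QuantileIteration",
        "TotalCount",
        "1/(1-Quantile)",
        // Named argument referenced by `quantile_precision$` above.
        quantile_precision = quantile_precision
    )
}
```

The same named argument can be reused by several placeholders, which is what keeps the two quantile columns the same width.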

marshallpierce · 2017-10-14T20:09:14Z

@algermissen can you take a look at the CLI tool's quantile output and see if it makes sense for your data?

algermissen · 2017-10-22T09:58:21Z

@marshallpierce Sorry for the delay - yes, this clarifies the start at '0' (which is just due to lost precision in the output).

However, I am still getting the doubled last line with +Inf at the 1/(1-Quantile) column. But this does not affect the histogram plots. So I am good.

Thanks for doing this so thoroughly.

marshallpierce · 2017-10-23T13:25:52Z

@algermissen "thoroughly" is my favorite way to do things. :)

There are more tweaks to iteration coming down the pipe, including a further adjustment of the logarithmic iterator, but I suspect what is going on is actually correct behavior. Can you paste a base64 of your serialized histogram somewhere so I can inspect it?

Anyway, an example of what could produce multiple lines of 1.0 quantile: suppose you had a count of 1 at value 1, and a count of 1_000_000_000 at value 1000. Almost every quantile beyond 0 would put you at the large value, so the "quantile at the value" would be 1.0 as the "quantile we're iterating to" continues from just above 0 to 1.0.
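
That example can be checked with a few lines over plain `(value, count)` pairs. This is only a sketch of the arithmetic, not the histogram's API; `value_at_quantile` here is a hypothetical free function that returns the first value whose cumulative fraction reaches the requested quantile, together with that fraction.

```rust
/// Hypothetical sketch over sorted (value, count) pairs.
/// Returns (value, quantile at that value) for the requested quantile `q`.
fn value_at_quantile(buckets: &[(u64, u64)], q: f64) -> Option<(u64, f64)> {
    let total: u64 = buckets.iter().map(|&(_, count)| count).sum();
    if total == 0 {
        return None;
    }
    let mut running = 0u64;
    for &(value, count) in buckets {
        running += count;
        // "Quantile at the value": accumulated count over total count.
        let at_value = running as f64 / total as f64;
        if at_value >= q {
            return Some((value, at_value));
        }
    }
    // Only reachable for q > 1.0; report the last value.
    buckets.last().map(|&(value, _)| (value, 1.0))
}
```

With one count at value 1 and 1_000_000_000 counts at value 1000, quantile 0.0 lands on value 1, while 0.5 and 1.0 both land on value 1000 with a quantile at the value of exactly 1.0.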

The change I'm working on makes it so that as soon as the quantile iterator reaches a value that yields quantile 1.0, it does one more iteration so that the quantile iterated to is 1.0 as well. This will allow the iteration to reach its natural end point, but also skip any intermediate points that would also be at value quantile 1.0.
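
A rough sketch of that behavior, under stated assumptions. The function `quantile_ticks` and its closure parameter are hypothetical. The step rule (each tick advances by 1 / (ticks per half distance × 2^(floor(log2(1 / (1 - q))) + 1))) is my reading of the expression quoted in the PR description, and with 5 ticks per half distance it produces the 0.1, 0.2, 0.30000000000000004 steps seen in the output earlier in this thread; treat it as an illustration rather than the crate's implementation.

```rust
/// Hypothetical sketch. `quantile_at(q)` stands in for "the quantile at the
/// value found when iterating to q". Returns (quantile iterated to,
/// quantile at value) pairs.
fn quantile_ticks<F>(ticks_per_half_distance: u32, quantile_at: F) -> Vec<(f64, f64)>
where
    F: Fn(f64) -> f64,
{
    let mut points = Vec::new();
    let mut iterated_to = 0.0_f64;
    loop {
        let at_value = quantile_at(iterated_to);
        if at_value >= 1.0 || iterated_to >= 1.0 {
            // One final point with the iterated-to quantile set to 1.0,
            // skipping every intermediate tick that would repeat this value.
            points.push((1.0, at_value.min(1.0)));
            return points;
        }
        points.push((iterated_to, at_value));
        // Safe to cast: iterated_to < 1.0 here, so the log2 is finite.
        let halvings = (1.0 / (1.0 - iterated_to)).log2() as u32 + 1;
        let step = 1.0 / (f64::from(ticks_per_half_distance) * 2f64.powi(halvings as i32));
        let next = iterated_to + step;
        // If the step is too small to change the float, jump to the end
        // rather than looping forever.
        iterated_to = if next > iterated_to { next } else { 1.0 };
    }
}
```

For the two-value histogram described above, the quantile at the value is already 1.0 on the second tick, so the sketch emits a point at quantile 0 followed directly by the closing point at 1.0.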

This release has a couple of backwards-incompatible changes:
- the old `len()` is now `distinct_values()`
- the new `len()` is the old `count()` (which is deprecated)
- `IterationValue::value` became `value_iterated_to`

Some other API changes:
- iterator values gained `quantile_iterated_to()`
- `Histogram` gained `is_empty()`

Behind the scenes:
- #67 and #68 landed a number of fixes to iterators such that the produced values are more correct and sensible.
- errors were moved into their own module.

marshallpierce added 2 commits October 13, 2017 09:10

Add quantile iteration output to CLI example

2da9f0a

Tweak quantile iteration logic

2804a9f

Make quantile iterator tool show both quantile at the value and quantile iterated to

3b1b3ba

Fix up doc tests that instantiate iteration values

0c5e264

jonhoo approved these changes Oct 14, 2017

marshallpierce merged commit ba4eb03 into master Oct 18, 2017

marshallpierce mentioned this pull request Oct 18, 2017

Strange behaviour of percentile iterator #66

Closed

marshallpierce deleted the quantile-iter-end branch November 1, 2017 01:45
