Rework quantile iteration logic #67

Merged
merged 4 commits into from
Oct 18, 2017

Conversation

marshallpierce
Collaborator

See #66.

I've gotta run at the moment but I wanted to get this in front of people. I don't think the quantile iterator needs a nontrivial `more()`; I'm guessing that was an expedient tweak at some point in the Java impl's past but I don't think it really makes mathematical sense.

Also, in `(1.0 / (1.0 - self.quantile_to_iterate_to)).log2() as u32`, the floating point calculations were yielding `inf` at quantile 1.0, which became 0 as a `u32`. So, now we short circuit before that logic.
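
The short circuit described above can be sketched roughly as follows. This is a minimal illustration, not the crate's actual code: the function name `half_distance_exponent` and its `Option` return type are assumptions made for this example. Note that since Rust 1.45, float-to-integer `as` casts saturate, so `f64::INFINITY as u32` now yields `u32::MAX` rather than 0; the result is meaningless at quantile 1.0 either way, so checking first still seems the safer design.

```rust
/// Hypothetical helper (not part of the crate): evaluates the expression
/// from the PR description, or returns `None` when the target quantile is
/// 1.0 and the division would produce +inf.
fn half_distance_exponent(quantile_to_iterate_to: f64) -> Option<u32> {
    if quantile_to_iterate_to >= 1.0 {
        // 1.0 / (1.0 - 1.0) is +inf; bail out before the lossy cast.
        return None;
    }
    Some((1.0 / (1.0 - quantile_to_iterate_to)).log2() as u32)
}
```

A caller can then treat `None` as "already at the end" instead of relying on whatever the cast happens to produce.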

@@ -218,7 +219,7 @@ const ORIGINAL_MAX: u64 = 0;
/// Partial ordering is used for threshholding, also usually in the context of quantiles.
pub trait Counter
: num::Num + num::ToPrimitive + num::FromPrimitive + num::Saturating + num::CheckedSub
-    + num::CheckedAdd + Copy + PartialOrd<Self> {
+    + num::CheckedAdd + Copy + PartialOrd<Self> + fmt::Debug {
Collaborator Author

This is to allow the use of `assert_eq!` with a `T`. Seemed pretty harmless.

@@ -163,7 +165,7 @@ impl<'a, T: 'a, P> Iterator for HistogramIterator<'a, T, P>
// if we've seen all counts, no other counts should be non-zero
if self.total_count_to_index == total {
// TODO this can fail when total count overflows
-assert!(count == T::zero());
+assert_eq!(count, T::zero());
Collaborator Author

Got tired of IntelliJ warning me that this could be `assert_eq!`.

true
} else {
false
}
Collaborator Author

In the Java impl, `self.quantile_to_iterate_to` is what gets exposed as the quantile iterated to. The Rust impl instead calculates (accumulated count) / (total count) at each iteration point, as does the Java impl for all iterators other than percentile. Thus, in the Java impl, you would end up at quantile 0.998... or similar when you reached the last nonzero bucket, and it was aesthetically pleasing to nudge the iterator one more slot forward to get to 1.0.

Collaborator Author

Hmm... maybe the Java way is better and we should expose that as well? Java uses `getPercentile()` vs `getPercentileIteratedTo()` to let consumers differentiate. We could expose an extra field in `IterationValue`, or apply type system shenanigans to have per-iterator value types (`<V extends IterationValue>`?) or bolted-on extra data (`IterationValue<T, V>` for some per-iterator associated type `V`). The upside is that quantile iteration, like in the dump-to-stdout example, could show the internal quantile value's small changes, making it clear that forward progress is happening even when we stay at the same value for a while.

Collaborator

I think having another value in IterationValue sounds perfectly sensible.
Not sure it'll be accessed much, but doesn't hurt to have it there.

Collaborator Author

OK, I'll go ahead and add it. It will make the quantile iteration output a little easier to visually comprehend.
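
To make the idea in this thread concrete, here is a rough sketch of an iteration value that carries both quantiles. The struct below is a simplified, hypothetical stand-in, not the crate's real `IterationValue`; the field names and the helper `progressed_without_moving` are assumptions for illustration only.

```rust
/// Hypothetical, simplified stand-in for the crate's `IterationValue`.
#[derive(Debug, Clone, PartialEq)]
struct IterationValue {
    value_iterated_to: u64,
    /// (accumulated count) / (total count) at this value.
    quantile: f64,
    /// The quantile the iterator was stepping toward when it stopped here.
    quantile_iterated_to: f64,
}

/// True when two consecutive steps sit on the same value but the
/// iterator's target quantile still moved forward.
fn progressed_without_moving(prev: &IterationValue, next: &IterationValue) -> bool {
    prev.value_iterated_to == next.value_iterated_to
        && next.quantile_iterated_to > prev.quantile_iterated_to
}
```

With only `quantile` exposed, two such steps would look identical to a consumer; the extra field is what makes the forward progress visible.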

examples/cli.rs Outdated
.get_matches();

let stdin = std::io::stdin();
let stdin_handle = stdin.lock();
Collaborator

Any particular reason you don't just want `let stdin = std::io::stdin().lock();`?

Collaborator Author

The compiler gets grumpy about the lifetime of stdin() if I remove the intermediate let. Perhaps there is a better way though?

Collaborator

Ah.. This is at least a little nicer:

let stdin = std::io::stdin();
let stdin = stdin.lock();
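
For context, a minimal sketch of the shadowed-binding pattern in use. The helper names `read_values` and `read_values_from_stdin` are hypothetical and only illustrate passing the lock to a generic `BufRead` consumer. On Rust 1.61 and later, `Stdin::lock` returns a `StdinLock<'static>`, so the one-line form also compiles there; the two-binding form works on older compilers as well.

```rust
use std::io::{self, BufRead};

/// Hypothetical consumer: parses one u64 per line, skipping lines that
/// fail to parse and stopping at the first I/O error.
fn read_values<R: BufRead>(reader: R) -> Vec<u64> {
    reader
        .lines()
        .map_while(Result::ok)
        .filter_map(|line| line.trim().parse::<u64>().ok())
        .collect()
}

/// Reads values from standard input using the shadowed-binding pattern.
fn read_values_from_stdin() -> Vec<u64> {
    let stdin = io::stdin();
    // Shadows the handle; the original `Stdin` stays alive until the end
    // of the function, so the lock's borrow remains valid.
    let stdin = stdin.lock();
    read_values(stdin)
}
```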

@@ -142,7 +142,9 @@ impl<'a, T: 'a, P> Iterator for HistogramIterator<'a, T, P>
return None;
}

// have we yielded all non-zeros in the histogram?
// TODO should check if we've reached max, not count, to avoid early termination
// on histograms with very large counts whose total would exceed u64::max_value()
Collaborator

Yeah, that's a good point.

Collaborator

Though doesn't running_total have the same issue?

Collaborator Author

It absolutely does, but we could limit the damage by only saturating `total_count_to_index` while still continuing to iterate until we've reached the max. I've got another branch I'm working on that does this.
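
A rough sketch of that idea, under stated assumptions: the function `walk_to_max` and the plain `&[u64]` slice of per-index counts are hypothetical stand-ins for the histogram's internals, and the other branch mentioned above may well do this differently. The point is only that the running total saturates while termination is decided by position, not by count.

```rust
/// Hypothetical sketch: walk every index up to the last non-zero count,
/// saturating the running total instead of overflowing. Returns the
/// non-zero indexes visited and the (possibly saturated) running total.
fn walk_to_max(counts: &[u64]) -> (Vec<usize>, u64) {
    // The last non-zero index plays the role of the histogram's max.
    let max_index = match counts.iter().rposition(|&c| c != 0) {
        Some(i) => i,
        None => return (Vec::new(), 0),
    };

    let mut total_count_to_index = 0u64;
    let mut visited = Vec::new();
    for (i, &count) in counts.iter().enumerate().take(max_index + 1) {
        // Saturate rather than wrap or panic on very large counts.
        total_count_to_index = total_count_to_index.saturating_add(count);
        if count != 0 {
            visited.push(i);
        }
    }
    (visited, total_count_to_index)
}
```

Because the loop bound is the max index, a saturated total cannot cause early termination; it only makes the reported running total a lower bound.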

@marshallpierce
Collaborator Author

Now it's pretty clear what's going on in quantile output, I think:

       Value          QuantileValue      QuantileIteration TotalCount 1/(1-Quantile)

           0 0.00010000000000000000 0.00000000000000000000          1           1.00
   999292927 0.10000000000000000555 0.10000000000000000555       1000           1.11
  1999634431 0.20000000000000001110 0.20000000000000001110       2000           1.25
  3001024511 0.30009999999999997788 0.30000000000000004441       3001           1.43
  4001366015 0.40010000000000001119 0.40000000000000002220       4001           1.67
...
  9982443519 0.99819999999999997620 0.99765625000000002220       9982         555.56
  9982443519 0.99819999999999997620 0.99804687500000000000       9982         555.56
  9990832127 0.99899999999999999911 0.99824218750000004441       9990        1000.00
  9990832127 0.99899999999999999911 0.99843750000000008882       9990        1000.00
  9990832127 0.99899999999999999911 0.99863281250000013323       9990        1000.00
  9990832127 0.99899999999999999911 0.99882812500000017764       9990        1000.00
  9999220735 0.99990000000000001101 0.99902343750000022204       9999       10000.00
  9999220735 0.99990000000000001101 0.99912109375000024425       9999       10000.00
...
  9999220735 0.99990000000000001101 0.99982910156250004441       9999       10000.00
  9999220735 0.99990000000000001101 0.99985351562500002220       9999       10000.00
  9999220735 0.99990000000000001101 0.99987792968750000000       9999       10000.00
  9999220735 0.99990000000000001101 0.99989013671875004441       9999       10000.00
 10007609343 1.00000000000000000000 0.99990234375000008882      10000              ∞
#[Mean       = 5000000916.51, StdDeviation   = 2887040392.98]
#[Max        =  10007609343, Total count    =        10000]
#[Buckets    =           54, SubBuckets     =        56320]

You can see both the pleasing 1.0 quantile when you hit the max value and the iteration quantile ticking slowly up during the part where you stay at value 9999220735 for a while.

examples/cli.rs Outdated
@@ -109,9 +109,10 @@ fn quantiles<R: BufRead, W: Write>(mut reader: R, mut writer: W, quantile_precis

writer.write_all(
format!(
-"{:>12} {:>quantile_precision$} {:>10} {:>14}\n\n",
+"{:>12} {:>quantile_precision$} {:>quantile_precision$} {:>10} {:>14}\n\n",
Collaborator

Neat. I didn't know about this trick.
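
For readers unfamiliar with the trick: in a format string, `name$` in the width position takes the width from a named argument. A minimal sketch follows; the function `header` and its column labels are made up for this example and are not the CLI's actual header.

```rust
/// Hypothetical header builder: the middle column's width comes from the
/// named argument `quantile_precision` via the `name$` width syntax.
fn header(quantile_precision: usize) -> String {
    format!(
        "{:>12} {:>quantile_precision$} {:>10}",
        "Value",
        "Quantile",
        "TotalCount",
        quantile_precision = quantile_precision
    )
}
```

The same named argument can be referenced by several placeholders, which is how the diff above adds a second quantile column with matching width.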

@marshallpierce
Collaborator Author

@algermissen can you take a look at the CLI tool's quantile output and see if it makes sense for your data?

@algermissen

@marshallpierce Sorry for the delay - yes, this clarifies the start at '0' (which is just due to lost precision in the output).

However, I am still getting the doubled last line with +Inf in the 1/(1-Quantile) column. But this does not affect the histogram plots. So I am good.

Thanks for doing this so thoroughly.

@marshallpierce
Collaborator Author

@algermissen "thoroughly" is my favorite way to do things. :)

There are more tweaks to iteration coming down the pipe, including a further adjustment of the logarithmic iterator, but I suspect what is going on is actually correct behavior. Can you paste a base64 of your serialized histogram somewhere so I can inspect it?

Anyway, an example of what could produce multiple lines of 1.0 quantile: suppose you had a count of 1 at value 1, and a count of 1_000_000_000 at value 1000. Almost every quantile beyond 0 would put you at the large value, so the "quantile at the value" would be 1.0 as the "quantile we're iterating to" continues from just above 0 to 1.0.
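
The scenario above can be checked with a small sketch. This is not the crate's iterator: `value_at_quantile` is a hypothetical helper over a plain list of `(value, count)` pairs sorted by value, and it assumes that list is non-empty.

```rust
/// Hypothetical helper: for a target quantile, return the first value whose
/// cumulative count reaches that quantile, together with the quantile
/// actually attained at that value (cumulative count / total count).
/// Assumes `counts` is non-empty and sorted by value.
fn value_at_quantile(counts: &[(u64, u64)], quantile: f64) -> (u64, f64) {
    let total: u64 = counts.iter().map(|&(_, c)| c).sum();
    // At least one recorded sample must be covered, even at quantile 0.
    let target = ((quantile * total as f64).ceil() as u64).max(1);

    let mut running = 0u64;
    for &(value, count) in counts {
        running += count;
        if running >= target {
            return (value, running as f64 / total as f64);
        }
    }
    // Only reachable through rounding; fall back to the largest value.
    (counts.last().expect("counts must be non-empty").0, 1.0)
}
```

With a count of 1 at value 1 and 1_000_000_000 at value 1000, every target quantile from 0.5 through 1.0 lands on value 1000 with an attained quantile of exactly 1.0, which is the repeated-1.0 output described above.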

The change I'm working on makes it so that as soon as the quantile iterator reaches a value that yields quantile 1.0, it does one more iteration so that the quantile iterated to is 1.0 as well. This will allow the iteration to reach its natural end point, but also skip any intermediate points that would also be at value quantile 1.0.

jonhoo added a commit that referenced this pull request Oct 24, 2017
This release has a couple of backwards-incompatible changes:

 - the old `len()` is now `distinct_values()`
 - the new `len()` is the old `count()` (which is deprecated)
 - `IterationValue::value` became `value_iterated_to`

Some other API changes:

 - iterator values gained `quantile_iterated_to()`
 - `Histogram` gained `is_empty()`

Behind the scenes:

 - #67 and #68 landed a number of fixes to iterators such that the
   produced values are more correct and sensible.
 - errors were moved into their own module.

marshallpierce deleted the quantile-iter-end branch November 1, 2017 01:45