-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework quantile iteration logic #67
Conversation
@@ -218,7 +219,7 @@ const ORIGINAL_MAX: u64 = 0; | |||
/// Partial ordering is used for threshholding, also usually in the context of quantiles. | |||
pub trait Counter | |||
: num::Num + num::ToPrimitive + num::FromPrimitive + num::Saturating + num::CheckedSub | |||
+ num::CheckedAdd + Copy + PartialOrd<Self> { | |||
+ num::CheckedAdd + Copy + PartialOrd<Self> + fmt::Debug { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to allow the use of assert_eq!
with a T
. Seemed pretty harmless
@@ -163,7 +165,7 @@ impl<'a, T: 'a, P> Iterator for HistogramIterator<'a, T, P> | |||
// if we've seen all counts, no other counts should be non-zero | |||
if self.total_count_to_index == total { | |||
// TODO this can fail when total count overflows | |||
assert!(count == T::zero()); | |||
assert_eq!(count, T::zero()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got tired of IntelliJ warning me that this could be assert_eq
true | ||
} else { | ||
false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the Java impl, self.quantile_to_iterate_to
ends up being what is exposed as the quantile iterated to, rather than calculating (accumulated count) / (total count)
at each iteration point as the Rust impl does (and also the Java impl for all iterators other than percentile). Thus, in the Java impl, you would end up at quantile 0.998...
or similar when you ended up at the last nonzero bucket, and it was aesthetically pleasing to nudge the iterator one more slot forward to get to 1.0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... maybe the Java way is better and we should expose that as well? Java uses getPercentile()
vs getPercentileIteratedTo()
to allow consumers to differentiate. We could expose an extra field as well in IterationValue, or apply type system shenanigans to have per-iterator value types (<V extends IterationValue>?
) or bolted-on extra data (IterationValue<T, V>
for some per-iterator associated type V). The upside is that it lets quantile iteration like in the dump-to-stdout example show the internal quantile value's small changes to make it clear that forward progress is happening, even when we stay at the same value for a while.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having another value in IterationValue
sounds perfectly sensible.
Not sure it'll be accessed much, but doesn't hurt to have it there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'll go ahead and add it. It will make the quantile iteration output a little easier to visually comprehend.
examples/cli.rs
Outdated
.get_matches(); | ||
|
||
let stdin = std::io::stdin(); | ||
let stdin_handle = stdin.lock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason you don't just want let stdin = std::io::stdin().lock();
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The compiler gets grumpy about the lifetime of stdin()
if I remove the intermediate let
. Perhaps there is a better way though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah.. This is at least a little nicer:
let stdin = std::io::stdin();
let stdin = stdin.lock();
@@ -142,7 +142,9 @@ impl<'a, T: 'a, P> Iterator for HistogramIterator<'a, T, P> | |||
return None; | |||
} | |||
|
|||
// have we yielded all non-zeros in the histogram? | |||
// TODO should check if we've reached max, not count, to avoid early termination | |||
// on histograms with very large counts whose total would exceed u64::max_value() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a good point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though doesn't running_total
have the same issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It absolutely does, but we could limit the damage to only saturating total_count_to_index
and still continuing to iterate until we've reached the max. I've got another branch I'm working on that does this.
Now it's pretty clear what's going on in quantile output, I think:
You can see both the pleasing |
examples/cli.rs
Outdated
@@ -109,9 +109,10 @@ fn quantiles<R: BufRead, W: Write>(mut reader: R, mut writer: W, quantile_precis | |||
|
|||
writer.write_all( | |||
format!( | |||
"{:>12} {:>quantile_precision$} {:>10} {:>14}\n\n", | |||
"{:>12} {:>quantile_precision$} {:>quantile_precision$} {:>10} {:>14}\n\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat. I didn't know about this trick.
@algermissen can you take a look at the CLI tool's quantile output and see if it makes sense for your data? |
@marshallpierce Sorry for the delay - yes, this clarifies the start at '0' (which is just due to lost precision in the output). However, I am still getting the doubled last line with +Inf at the 1/(1-Quantile) column. But this does not affect the histogram plots. So I am good. Thanks for doing this so thoroughly. |
@algermissen "thoroughly" is my favorite way to do things. :) There are more tweaks to iteration coming down the pipe, including a further adjustment of the logarithmic iterator, but I suspect what is going on is actually correct behavior. Can you paste a base64 of your serialized histogram somewhere so I can inspect it? Anyway, an example of what could produce multiple lines of 1.0 quantile: suppose you had a count of 1 at value 1, and a count of 1_000_000_000 at value 1000. Almost every quantile beyond 0 would put you at the large value, so the "quantile at the value" would be 1.0 as the "quantile we're iterating to" continues from just above 0 to 1.0. The change I'm working on is to make it so that as soon as the quantile iterator reaches a value that yields quantile 1.0, do one more iteration to make the quantile iterated to be 1.0 as well. This will allow the iteration to reach its natural end point, but also skip any intermediate points that would also be at value quantile 1.0. |
This release has a couple of backwards-incompatible changes: - the old `len()` is now `distinct_values()` - the new `len()` is the old `count()` (which is deprecated) - `IterationValue::value` became `value_iterated_to` Some other API changes: - iterator values gained `quantile_iterated_to()` - `Histogram` gained `is_empty()` Behind the scenes: - #67 and #68 landed a number of fixes to iterators such that the produced values are more correct and sensible. - errors were moved into their own module.
See #66.
I've gotta run at the moment but I wanted to get this in front of people. I don't think the quantile iterator needs a nontrivial
more()
; I'm guessing that was an expedient tweak at some point in the Java impl's past but I don't think it really makes mathematical sense.Also, in
(1.0 / (1.0 - self.quantile_to_iterate_to)).log2() as u32
, the floating point calculations were yieldinginf
at quantile 1.0, which became0
as au32
. So, now we short circuit before that logic.