Weighted choice algorithms #532
Comments
Oh, neat! I hadn't thought of the third option at all! Very cool!

One thing to also keep in mind is that option 1 comes in two flavors: either the caller provides the total weight, or we need to calculate it. For repeated sampling, a CDF does indeed seem like the best option. For single sampling the equation seems more complicated: it's a tradeoff between how fast the iterator is, how fast the RNG is, and more.

For non-weighted sampling we provide all the options and let the caller worry about picking the one that's faster or more convenient for their case. But I don't think that makes sense for weighted sampling, given that it's likely much less commonly used.

My thinking as of late has been to not worry about performance for single sampling. Generally, the performance of operations done once rarely matters. The only case I could think of where it matters is if someone does repeated sampling but the weights can change between each sample. That seems even more rare, and even in that case the optimal solution depends on whether the caller can easily maintain a list of cumulative weights and whether the total weight changes or not.

In short, for single sampling the design space is huge and the performance often does not matter. So my suggestion is to provide a performance-optimized API only for repeated sampling. Single sampling can then use that same API and just sample once; if that doesn't provide good enough performance, callers can implement whatever solution fits their constraints best. But then also provide APIs optimized for convenience, since often that's at least as useful to callers as perf-optimized solutions.
Good reasoning; I agree with you on that. Some single-usage stuff gets used a lot (e.g. …).
Yeah, I suspect …
Should we close this now that #518 has landed?
See also: dhardy#82, #518
Fundamentally I see three types of weighted-choice algorithm:

1. Calculate the total weight, `weight_sum`; take `sample = rng.gen_range(0, weight_sum)`; iterate over elements until the cumulative weight exceeds `sample`, then take the previous item.
2. Calculate the CDF; take `sample` as above, then find the item by binary search; look up the element from the index.
3. The sample code above: draw from the RNG once for every element.

Where one wants to sample from the same set of weights multiple times, calculating a CDF is the obvious choice, since the CDF should require no more memory than the original weights themselves.
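A minimal sketch of option 2 (CDF + binary search) for the repeated-sampling case. The `uniform(bound)` closure is a stand-in for `rng.gen_range(0, bound)`, and the names `make_cdf`/`sample_cdf` are illustrative, not part of the rand API:

```rust
/// Build the CDF (running totals) from the weights; it needs no more
/// memory than the weights themselves.
fn make_cdf(weights: &[u64]) -> Vec<u64> {
    let mut total = 0u64;
    weights.iter().map(|&w| { total += w; total }).collect()
}

/// Draw `sample` in [0, weight_sum), then binary-search the CDF for the
/// first index whose cumulative weight exceeds `sample`.
fn sample_cdf(cdf: &[u64], uniform: impl FnOnce(u64) -> u64) -> usize {
    let weight_sum = *cdf.last().expect("at least one weight");
    let sample = uniform(weight_sum);
    cdf.partition_point(|&c| c <= sample)
}

fn main() {
    let cdf = make_cdf(&[1, 2, 3]);
    assert_eq!(cdf, [1, 3, 6]);
    // A draw of 0 lands in the first bucket, a draw of 5 in the last.
    assert_eq!(sample_cdf(&cdf, |_| 0), 0);
    assert_eq!(sample_cdf(&cdf, |_| 5), 2);
}
```

Building the CDF costs one pass over the weights; each subsequent draw is then O(log n).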
Where one wants to sample a single time from a slice, one of the first two choices makes the most sense; since calculating the total weight requires all the work of calculating the CDF except storing the results, using the CDF may often be the best option but this isn't guaranteed.
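For completeness, option 1 for a one-off draw from a slice might look like this sketch, again with a hypothetical `uniform(bound)` closure standing in for `rng.gen_range(0, weight_sum)`:

```rust
/// Single weighted draw without storing a CDF: sum the weights, draw once,
/// then scan until the cumulative weight exceeds the draw.
fn sample_linear(weights: &[u64], uniform: impl FnOnce(u64) -> u64) -> usize {
    let weight_sum: u64 = weights.iter().sum();
    let sample = uniform(weight_sum);
    let mut cumulative = 0u64;
    for (i, &w) in weights.iter().enumerate() {
        cumulative += w;
        if cumulative > sample {
            return i;
        }
    }
    unreachable!("sample < weight_sum guarantees an early return");
}

fn main() {
    // Weights [1, 2, 3]: draws 0, 2 and 5 select indices 0, 1 and 2.
    assert_eq!(sample_linear(&[1, 2, 3], |_| 0), 0);
    assert_eq!(sample_linear(&[1, 2, 3], |_| 2), 1);
    assert_eq!(sample_linear(&[1, 2, 3], |_| 5), 2);
}
```

This does all the work of the CDF approach except storing the results, which is why the CDF is usually preferred when the results can be reused.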
Where one wants to sample a single time from an iterator, any of the above can be used, but the first two options require either cloning the iterator and iterating twice (not always possible and slightly expensive) or collecting all items into a temporary vector while calculating the sum/CDF, then selecting the required item. In this case the last option may be attractive, though of course sampling the RNG for every item has significant overhead (so probably is only useful for large elements or no allocator).
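One way the per-element option can work for iterators (a sketch under my own assumptions, not the exact sample code from this thread): keep a running weight sum and replace the current choice with probability `w / running_sum` at each element, so no allocation and no second pass are needed. `uniform(bound)` again stands in for `rng.gen_range(0, bound)`:

```rust
/// One-pass weighted choice from an iterator: one RNG draw per element,
/// no allocation required.
fn choose_weighted<T>(
    iter: impl IntoIterator<Item = (T, u64)>,
    mut uniform: impl FnMut(u64) -> u64,
) -> Option<T> {
    let mut running_sum = 0u64;
    let mut chosen = None;
    for (item, w) in iter {
        running_sum += w;
        // Replace the current choice with probability w / running_sum;
        // element i then survives all later replacements with overall
        // probability w_i / weight_sum.
        if uniform(running_sum) < w {
            chosen = Some(item);
        }
    }
    chosen
}

fn main() {
    let items = vec![("a", 1u64), ("b", 2), ("c", 3)];
    // A draw of 0 always replaces, so the last element wins ...
    assert_eq!(choose_weighted(items.clone(), |_| 0), Some("c"));
    // ... while a maximal draw only ever accepts the first element.
    assert_eq!(choose_weighted(items, |b| b - 1), Some("a"));
}
```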
Which algorithm(s) should we include in Rand?
The method calculating the CDF will often be preferred, so it should be included. Unfortunately it requires an allocator (except when weights are provided via a mutable reference to a slice), but we should probably not worry about this.
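The mutable-slice exception mentioned above could look like this sketch (the function name is hypothetical): converting weights to cumulative weights in place avoids the allocation entirely:

```rust
/// Turn a slice of weights into cumulative weights (a CDF) in place,
/// so no extra allocation is required.
fn cdf_in_place(weights: &mut [u64]) {
    let mut total = 0u64;
    for w in weights.iter_mut() {
        total += *w;
        *w = total;
    }
}

fn main() {
    let mut weights = [1u64, 2, 3];
    cdf_in_place(&mut weights);
    assert_eq!(weights, [1, 3, 6]);
}
```

The trade-off is that the caller's weights are destroyed, which is why this only works when they can hand over the slice.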
A convenience method to sample from weighted slices would presumably prefer to use the CDF method normally.
For a method to sample from weighted iterators it is less clear which implementation should be used. Although it will not perform well, the last algorithm (i.e. the sample code above) may be a nice choice in that it does not require an allocator.
My conclusion: perhaps we should accept #518 in its current form (i.e. a `WeightedIndex` distribution using CDF + binary search, plus convenience wrappers for slices), and consider adding the code here to sample from iterators.