Higher quality (0, 1] floats #1346

Closed
vks opened this issue Oct 18, 2023 · 3 comments
Labels
X-stale Outdated or abandoned work

Comments

@vks
Collaborator

vks commented Oct 18, 2023

Background

Motivation: It's possible to get higher-quality floats without having to add a loop.

Application: I don't have a concrete application, but this approach can generate floats smaller than 2^-53 and never generates 0, which should only occur with a probability of 2^-1075. It can also generate more distinct floats than our current approach.

Feature request

Implement another (0, 1] distribution.
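For illustration, here is a minimal sketch of what such a loop-free (0, 1] sampler could look like. It is not the implementation from the article and not rand's code. The bit layout (52 mantissa bits, 12 bits for the exponent, at most one extra draw) is an assumption, and `SplitMix64`, `assemble` and `open_closed01_hq` are hypothetical names used only here.

```rust
/// Tiny SplitMix64 generator so the sketch needs no external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Build 2^-(k+1) * (1 + (mantissa + 1) / 2^52), which lies in (2^-(k+1), 2^-k].
fn assemble(k: u32, mantissa: u64) -> f64 {
    debug_assert!(k <= 76 && mantissa < (1 << 52));
    // Biased exponent 1022 - k encodes 2^-(k+1).
    let bits = ((1022 - k as u64) << 52) | mantissa;
    // Adding 1 to the bit pattern steps to the next float up. An all-ones
    // mantissa carries into the exponent field and yields exactly 2^-k.
    f64::from_bits(bits + 1)
}

/// Hypothetical (0, 1] sampler: geometric exponent, uniform mantissa.
fn open_closed01_hq(rng: &mut SplitMix64) -> f64 {
    let x = rng.next_u64();
    let mantissa = x & ((1 << 52) - 1); // low 52 bits
    let top = x >> 52; // high 12 bits select the exponent
    // trailing_zeros() is k with probability 2^-(k+1); top == 0 gives 64, capped to 12.
    let mut k = top.trailing_zeros().min(12);
    if top == 0 {
        // Second (and last) step: extend the exponent with one more draw.
        k += rng.next_u64().trailing_zeros(); // adds at most 64
    }
    assemble(k, mantissa)
}
```

Under these assumptions the smallest output is just above 2^-77, well below 2^-53, and 0 is never produced. The probability mass below 2^-77 is folded into the last binade, which is the price of stopping after two steps.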

@dhardy
Member

dhardy commented Oct 18, 2023

So it uses a maximum of two steps, a bit like Canon's method. Might be generally preferable to #531, but probably still has a significant cost overhead?

At any rate, it may be worth investigating (implementing and benchmarking at least), but not something I'm going to put on my to-do list.

@josephlr
Member

josephlr commented Oct 24, 2023

Initial benchmarks for f64 on the OpenClosed01 distribution (test distr_openclosed01_f64):

  • Existing int cast + multiply: 1,089 ns/iter (+/- 7) = 7346 MB/s
  • Implementation in the article: 1,528 ns/iter (+/- 17) = 5235 MB/s
  • Implementation w/o resampling the exponent: 1,312 ns/iter (+/- 15) = 6097 MB/s
  • Control, just casting u64 to f64: 991 ns/iter (+/- 10) = 8072 MB/s

EDIT: testing done on a Zen3 x86_64 processor, but I didn't pass -C target-cpu=native, so rep bsf was being used instead of tzcnt. Rerunning with -C target-cpu=native seemed to make all the microbenchmarks slower, even the existing implementation, which is odd.

  • Existing int cast + multiply: 1,217 ns/iter (+/- 25) = 6573 MB/s
  • Implementation in the article: 1,617 ns/iter (+/- 100) = 4947 MB/s
  • Implementation w/o resampling the exponent: 1,458 ns/iter (+/- 20) = 5486 MB/s
  • Control, just casting u64 to f64: 963 ns/iter (+/- 10) = 8307 MB/s
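For readers unfamiliar with the baselines, the "int cast + multiply" and "control" rows can be sketched roughly as below. This is a simplified illustration of a multiply-based (0, 1] conversion, not code copied from rand. The function names are hypothetical, and the assumption is that 53 random bits are kept and shifted up by one.

```rust
/// Sketch of a multiply-based (0, 1] conversion:
/// keep the top 53 bits, add 1, then scale by 2^-53.
fn open_closed01_mul(x: u64) -> f64 {
    const PRECISION: u32 = 53; // 52 fraction bits + 1
    let scale = 1.0 / ((1u64 << PRECISION) as f64); // exactly 2^-53
    let value = x >> (64 - PRECISION); // in [0, 2^53)
    // value + 1 is in [1, 2^53]; it cannot overflow because of the right shift.
    scale * (value + 1) as f64
}

/// The "control": a plain integer-to-float cast with no scaling.
fn control_cast(x: u64) -> f64 {
    x as f64
}
```

With this scheme the outputs are multiples of 2^-53, so the smallest possible value is 2^-53 itself. That is the limit the proposal in this issue aims to go below.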

@dhardy
Member

dhardy commented Oct 24, 2023

Thanks. Overhead there is not negligible but is small enough that it could be offered as an alternative to the current implementation under a feature flag, if there is genuine interest in using it.
