
Metal pseudo random number generation #1533

Merged: 16 commits into main on Jan 22, 2024
Conversation

ivarflakstad (Member)

A hybrid Tausworthe and LCG random number generator, using the Box-Muller transform for the Gaussian (normal) distribution.

Paper
Cuda ref
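
For reference, here is a minimal Metal sketch of the combined generator from the linked paper (three Tausworthe streams XORed with one LCG) plus the Box-Muller step. The function names here are illustrative and the constants are the ones from the paper; the actual implementation lives in candle-metal-kernels/src/random.metal.

```metal
#include <metal_stdlib>
using namespace metal;

// One step of a Tausworthe generator; s1, s2, s3 and the mask m are the
// per-stream constants from the paper.
uint taus_step(thread uint &z, uint s1, uint s2, uint s3, uint m) {
    uint b = ((z << s1) ^ z) >> s2;
    z = ((z & m) << s3) ^ b;
    return z;
}

// One step of a 32-bit linear congruential generator (Numerical Recipes constants).
uint lcg_step(thread uint &z) {
    z = 1664525u * z + 1013904223u;
    return z;
}

// Hybrid generator: XOR of three Tausworthe streams and one LCG,
// scaled by 1/2^32 to a float in [0, 1].
float hybrid_taus(thread uint &z1, thread uint &z2, thread uint &z3, thread uint &z4) {
    uint x = taus_step(z1, 13, 19, 12, 4294967294u)
           ^ taus_step(z2,  2, 25,  4, 4294967288u)
           ^ taus_step(z3,  3, 11, 17, 4294967280u)
           ^ lcg_step(z4);
    return float(x) * 2.3283064365386963e-10f; // 1/2^32
}

// Box-Muller: two uniform samples (u1 in (0, 1] to keep the log finite) give
// one standard normal sample; the sin branch would give the second of the pair.
float box_muller(float u1, float u2) {
    return sqrt(-2.0f * log(u1)) * cos(2.0f * M_PI_F * u2);
}
```

A sample with a given mean and standard deviation then follows from mean + std * box_muller(u1, u2).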

Benchmarks:

Completed 32767 iterations in 5423623461 nanoseconds, estimated execution time is 165520.9039887692 ns
Benchmarking metal_random_uniform/iter: Collecting 100 samples in estimated 5.0153 s (30k iterations)
Benchmarking metal_random_uniform/iter: Analyzing
metal_random_uniform/iter
                        time:   [167.02 µs 167.86 µs 168.90 µs]
                        thrpt:  [23.127 GiB/s 23.271 GiB/s 23.388 GiB/s]
                 change:
                        time:   [-99.732% -99.726% -99.719%] (p = 0.00 < 0.05)
                        thrpt:  [+35541% +36427% +37237%]
                        Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
  12 (12.00%) high mild
  8 (8.00%) high severe


Completed 32767 iterations in 5742029208 nanoseconds, estimated execution time is 175238.17279580067 ns
Benchmarking metal_random_normal/iter: Collecting 100 samples in estimated 5.3097 s (30k iterations)
Benchmarking metal_random_normal/iter: Analyzing
metal_random_normal/iter
                        time:   [173.93 µs 174.08 µs 174.28 µs]
                        thrpt:  [22.414 GiB/s 22.439 GiB/s 22.459 GiB/s]
                 change:
                        time:   [-99.727% -99.723% -99.716%] (p = 0.00 < 0.05)
                        thrpt:  [+35121% +35956% +36581%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe

Randomness verified with ent.
Chi square is obviously not good enough for cryptography, but I think it's good enough for ML.

Uniform:

Entropy = 7.106619 bits per byte.

Optimum compression would reduce the size
of this 16384 byte file by 11 percent.

Chi square distribution for 16384 samples is 105515.09, and randomly
would exceed this value less than 0.01 percent of the times.

Arithmetic mean value of data bytes is 105.7958 (127.5 = random).
Monte Carlo value for Pi is 3.607326007 (error 14.82 percent).
Serial correlation coefficient is -0.027706 (totally uncorrelated = 0.0).

Normal:

Entropy = 6.956191 bits per byte.

Optimum compression would reduce the size
of this 16384 byte file by 13 percent.

Chi square distribution for 16384 samples is 153549.00, and randomly
would exceed this value less than 0.01 percent of the times.

Arithmetic mean value of data bytes is 111.0774 (127.5 = random).
Monte Carlo value for Pi is 3.529670330 (error 12.35 percent).
Serial correlation coefficient is -0.046988 (totally uncorrelated = 0.0).

ivarflakstad self-assigned this on Jan 7, 2024
Narsil (Collaborator) commented on Jan 12, 2024:

There's a big issue with this implementation:

The seed is never updated, meaning all tensors will be generated exactly the same. This is not OK; we definitely need to update the seed from the output of the kernels.

Ideally we wouldn't require a sync and would instead put a real buffer on the device containing the seed (modifying the API in candle-metal-kernels to force users to provide a buffer instead of a plain CPU number).
That way we keep mostly the same API and don't require any sync, yet calling the random functions multiple times yields different results.

For the entropy, I'm getting similar numbers to the CPU-generated numbers, so even though 7 seems quite low I think it's OK. (I got 7.3 by generating over the entire f32 range, saving to safetensors, and calling ent on just the buffer part.) My guess is that it's linked to the actual f32 representation: uniform sampling over the f32 range means you will definitely oversample some numbers (given there's only a handful of representations for large numbers). If we really wanted to check entropy, we'd have to generate truly random bytes (u8) and then run the entropy check on those.
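
To picture the seed-buffer suggestion above, here is a hypothetical kernel shape (names and bindings invented for illustration, not the PR's actual signature): the seed lives in a small device buffer, each thread derives its per-thread state from seed + thread id, and only thread 0 advances the stored seed, so repeated launches produce different tensors without any CPU sync.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical sketch: the seed stays on the device, the kernel both reads it
// and advances it, and only tid == 0 writes back (staying inside the buffer).
kernel void rand_uniform_sketch(
    device float *out        [[buffer(0)]],
    device atomic_uint *seed [[buffer(1)]],
    constant uint &count     [[buffer(2)]],
    uint tid [[thread_position_in_grid]]
) {
    if (tid >= count) {
        return;
    }

    // Read the shared seed once; relaxed ordering is the only one Metal offers.
    uint s = atomic_load_explicit(seed, memory_order_relaxed);

    // Per-thread state: scramble seed + tid (murmur3-style finalizer) so
    // neighbouring threads don't produce correlated values. The real kernel
    // would seed its Tausworthe/LCG states here instead.
    uint z = (s + tid) * 0x9E3779B9u;
    z ^= z >> 16; z *= 0x85EBCA6Bu;
    z ^= z >> 13; z *= 0xC2B2AE35u;
    z ^= z >> 16;

    out[tid] = float(z) * 2.3283064365386963e-10f; // scale to [0, 1]

    // Exactly one thread advances the stored seed, so the next launch starts
    // from a different value without a CPU round-trip.
    if (tid == 0) {
        atomic_store_explicit(seed, s * 1664525u + 1013904223u, memory_order_relaxed);
    }
}
```

The later commits in this PR follow the same outline: set_seed writes into the seed buffer via its contents pointer, and the kernel only touches that buffer from tid == 0.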

Narsil (Collaborator) left a comment:

Looking better

Resolved review threads (now outdated):

* candle-metal-kernels/src/random.metal
* candle-metal-kernels/src/random.metal
* candle-metal-kernels/src/lib.rs
* candle-core/src/metal_backend.rs
* candle-core/src/metal_backend.rs
Commits:

* set_seed via buffer content pointer copy + did_modify_range

* ensure random.metal kernel does not write outside of buffer range when tid==0

ivarflakstad force-pushed the ivarflakstad/metal-prng branch from 13dfe29 to db92351 on January 17, 2024 at 17:04
Narsil (Collaborator) left a comment:

LGTM

ivarflakstad merged commit fd7c856 into main on Jan 22, 2024
12 checks passed
ivarflakstad deleted the ivarflakstad/metal-prng branch on January 22, 2024 at 06:30