Optimize opencl and make it the default gpu feature. #1420
Conversation
Force-pushed 755f0ba to 3faf8ec
Force-pushed 3faf8ec to e264a41
let encoded_data = last_layer_labels
    .read_range(start..end)
    .expect("failed to read layer range")
let mut layer_bytes = vec![0u8; (end - start) * std::mem::size_of::<Fr>()];
Since the code is being optimized, I'm concerned about keeping this large allocation around. I know it's not worse than what was there previously, but I suspect we can both reduce it and gain performance by using mmap sort of like this: https://github.com/filecoin-project/bellperson/blob/master/src/groth16/mapped_params.rs#L141
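For illustration only, here is a minimal sketch of what an mmap-backed read might look like, assuming the `memmap2` crate and a file that stores the elements contiguously in their in-memory layout. The helper name and signature are hypothetical and not taken from bellperson or this PR.

```rust
use std::fs::File;
use std::mem::size_of;
use std::path::Path;

use memmap2::MmapOptions;

// Hypothetical helper: hand the caller a view of elements `start..end` straight
// out of the mapped file, instead of first copying them into a large Vec<u8>.
fn with_mapped_range<Fr: Copy, R>(
    path: &Path,
    start: usize,
    end: usize,
    f: impl FnOnce(&[Fr]) -> R,
) -> std::io::Result<R> {
    let file = File::open(path)?;
    // Safety: the underlying file must not be mutated while the map is alive.
    let mmap = unsafe { MmapOptions::new().map(&file)? };

    let bytes = &mmap[start * size_of::<Fr>()..end * size_of::<Fr>()];
    // Safety: assumes the on-disk bytes are exactly the in-memory representation
    // of `Fr`, and that the offset is suitably aligned (page-aligned map plus a
    // multiple of the element size).
    let elems = unsafe { std::slice::from_raw_parts(bytes.as_ptr() as *const Fr, end - start) };

    // The closure borrows elements paged in on demand by the kernel; no
    // `(end - start) * size_of::<Fr>()` heap buffer is allocated up front.
    Ok(f(elems))
}
```

Whether this actually beats the existing read path would need to be measured; the appeal is mainly that the intermediate allocation disappears.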
I think the lifecycle test I ran for this was misconfigured and failed to exercise the change, and that this is actually giving the wrong result. I'll try to fix it next week. NOTE: it's not great that this was not caught by CI. We should probably at least run the test that would have caught it in CI; I'll adjust the env vars on the relevant CI job too.
Force-pushed 6cbdb5f to ff3152b
I changed this to now just call […].

I tried modifying the CI config to use the GPU tree and column builders when running a lifecycle test on the GPU, but apparently CI is not configured to use GPUs at all. The lifecycle test is passing locally, though.

Here's a benchmark showing a total running time of 11:21 for tree building. That's ten seconds slower than before. We are indeed doing more work with the added conversion, so a small penalty is not surprising.
Here's a passing lifecycle test run on a machine with a GPU:
Force-pushed ff3152b to 9ff7acb
UPDATE: I tried adding a CI test of tree building running on an actual GPU instance. Let's see whether that works.
Force-pushed 3d4b663 to 6f33252
Force-pushed 9f82f1c to 17364e7
Force-pushed 17364e7 to 6d8259b
I am closing this in favor of #1422. We may want to use this branch as a starting point, or just reopen the PR when we are eventually ready to eliminate the `gpu2` feature.
#1397 added support for neptune's `opencl` feature, but did not demonstrate the expected performance gain. (Expectation was ~2x. See: argumentcomputer/neptune#78.)

Apparently the bottleneck was the `read_range` provided by `merkletree::store::disk`, which ended up calling `PoseidonDomain::from_slice` on every `Fr` element when reading from disk. The cleanest way to fix this is probably to optimize `read_range` to perform better. However, I was not able to find a simple way to make this change in the face of the layers of generic types. Instead, I took advantage of definite knowledge that the underlying data is `Fr` and used an unsafe transmute from bytes.

As previously discussed, this PR makes `neptune/opencl` the default feature when the `gpu` feature is active, and removes the `gpu2` feature. Whether this should be merged immediately or wait until the `gpu2` flag has been released and widely tested depends on whether we already have confidence in it. As far as I know, there is no reason to believe `neptune/opencl` is problematic: it's dramatically simpler and performs better, but I will let @cryptonemo and/or @dignifiedquire make the decision.

As shown below, column tree building now takes about 71 seconds, and regular tree building takes about 11 seconds. This is compared with 75 and 10 seconds on the isolated neptune benchmark (gbench). Total time for building both trees is now just over 11 minutes, which is as originally expected.
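The description above does not show the transmute itself. Below is a minimal sketch of the general technique, under the assumption that the on-disk bytes are exactly the in-memory representation of `Fr`; the helper name is hypothetical and this is not the PR's actual code.

```rust
use std::mem::{align_of, size_of};

// Hypothetical illustration of the "transmute from bytes" idea: reinterpret a
// byte buffer read from disk as a slice of `Fr`, skipping the per-element
// `PoseidonDomain::from_slice` conversion that `read_range` was performing.
fn bytes_as_elements<Fr>(layer_bytes: &[u8]) -> &[Fr] {
    let elem_size = size_of::<Fr>();
    assert_eq!(
        layer_bytes.len() % elem_size,
        0,
        "buffer must contain whole elements"
    );
    assert_eq!(
        layer_bytes.as_ptr() as usize % align_of::<Fr>(),
        0,
        "buffer must be aligned for Fr"
    );
    // Safety: sound only because we know the bytes were written as `Fr` values
    // in their in-memory layout; the alignment check above guards the cast.
    unsafe {
        std::slice::from_raw_parts(layer_bytes.as_ptr() as *const Fr, layer_bytes.len() / elem_size)
    }
}
```

A real implementation would also need to guarantee the buffer's alignment up front (for example by allocating it as `Fr` elements rather than bytes), which the `vec![0u8; ...]` line in the diff excerpt above does not by itself ensure.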