
Optimize opencl and make it the default gpu feature. #1420

Closed
wants to merge 2 commits

Conversation

porcuquine
Collaborator

@porcuquine porcuquine commented Feb 25, 2021

#1397 added support for neptune's opencl feature, but did not demonstrate the expected performance gain. (The expectation was ~2x; see argumentcomputer/neptune#78.)

Apparently the bottleneck was the read_range provided by merkletree::store::disk, which ended up calling PoseidonDomain::from_slice on every Fr element when reading from disk. The cleanest way to fix this would probably be to optimize read_range itself, but I was not able to find a simple way to make that change through the layers of generic types.

Instead, I took advantage of definite knowledge that the underlying data is Fr and used an unsafe transmute from bytes.
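
For concreteness, here is a minimal sketch of that idea (not the exact code in this PR). It assumes the on-disk data is a contiguous array of canonical field elements and uses blstrs::Scalar as the concrete Fr type; the helper name is hypothetical. The safety comment spells out exactly the knowledge being relied on.

use blstrs::Scalar as Fr; // assumed concrete Fr type (blst backend)

/// Reinterpret a byte buffer read from disk as a slice of Fr.
///
/// Safety: this is only sound if `bytes` holds exactly `len / size_of::<Fr>()`
/// canonical field elements in Fr's in-memory representation, and if the buffer
/// is suitably aligned for Fr. That is the "definite knowledge" relied on above.
unsafe fn bytes_as_fr_slice(bytes: &[u8]) -> &[Fr] {
    debug_assert_eq!(bytes.len() % std::mem::size_of::<Fr>(), 0);
    std::slice::from_raw_parts(
        bytes.as_ptr() as *const Fr,
        bytes.len() / std::mem::size_of::<Fr>(),
    )
}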

As previously discussed, this PR makes neptune/opencl the default feature when the gpu feature is active, and removes the gpu2 feature. Whether this should be merged immediately or should wait until the gpu2 flag has been released and widely tested depends on how much confidence we already have in it. As far as I know, there is no reason to believe neptune/opencl is problematic: it is dramatically simpler and performs better. That said, I will let @cryptonemo and/or @dignifiedquire make the decision.

As shown below, column tree building now takes about 71 seconds, and regular tree building takes about 11 seconds, compared with 75 and 10 seconds on the isolated neptune benchmark (gbench). Total time for building both trees is now just over 11 minutes, which is what was originally expected.

2021-02-25T01:43:13.825 INFO storage_proofs_porep::stacked::vanilla::proof > generating tree c using the GPU
2021-02-25T01:43:13.825 INFO storage_proofs_porep::stacked::vanilla::proof > Building column hashes
2021-02-25T01:43:13.940 INFO neptune::proteus::gpu > device: Device { brand: Nvidia, name: "GeForce RTX 2080 Ti", memory: 11551440896, bus_id: Some(33), platform: Platform(PlatformId(0x7f914809c1c0)), device: Device(DeviceId(0x7f91480b60e0)) }
2021-02-25T01:44:26.966 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 1/8 of length 153391689
2021-02-25T01:45:38.013 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 2/8 of length 153391689
2021-02-25T01:46:49.541 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 3/8 of length 153391689
2021-02-25T01:48:01.341 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 4/8 of length 153391689
2021-02-25T01:49:12.919 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 5/8 of length 153391689
2021-02-25T01:50:24.656 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 6/8 of length 153391689
2021-02-25T01:51:35.644 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 7/8 of length 153391689
2021-02-25T01:52:46.195 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 8/8 of length 153391689
2021-02-25T01:52:51.734 INFO storage_proofs_porep::stacked::vanilla::proof > tree_c done
2021-02-25T01:52:51.734 INFO storage_proofs_porep::stacked::vanilla::proof > building tree_r_last
2021-02-25T01:52:51.734 INFO storage_proofs_porep::stacked::vanilla::proof > generating tree r last using the GPU
2021-02-25T01:52:53.150 INFO neptune::proteus::gpu > device: Device { brand: Nvidia, name: "GeForce RTX 2080 Ti", memory: 11551440896, bus_id: Some(33), platform: Platform(PlatformId(0x7f914809c1c0)), device: Device(DeviceId(0x7f91480b60e0)) }
2021-02-25T01:52:55.778 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 1/8
2021-02-25T01:53:06.627 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 2/8
2021-02-25T01:53:17.852 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 3/8
2021-02-25T01:53:30.120 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 4/8
2021-02-25T01:53:41.409 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 5/8
2021-02-25T01:53:53.343 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 6/8
2021-02-25T01:54:02.641 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 7/8
2021-02-25T01:54:11.945 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 8/8
2021-02-25T01:54:21.940 INFO storage_proofs_porep::stacked::vanilla::proof > tree_r_last done

@porcuquine porcuquine force-pushed the feat/optimize-tree-building branch 2 times, most recently from 755f0ba to 3faf8ec Compare February 25, 2021 03:30
@porcuquine porcuquine marked this pull request as ready for review February 25, 2021 03:33
let encoded_data = last_layer_labels
    .read_range(start..end)
    .expect("failed to read layer range");
let mut layer_bytes = vec![0u8; (end - start) * std::mem::size_of::<Fr>()];
Collaborator

Since the code is being optimized, I'm concerned about keeping this large allocation around. I know it's not worse than what was there previously, but I suspect we can both reduce it and gain performance by using mmap sort of like this: https://github.com/filecoin-project/bellperson/blob/master/src/groth16/mapped_params.rs#L141
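
For reference, here is a rough sketch of the kind of mmap-based read being suggested, using the memmap crate the way bellperson's mapped_params does. The map_layer_file/layer_range helpers, the path-based interface, and the 32-byte element size are illustrative assumptions, not the actual merkletree::store API.

use memmap::{Mmap, MmapOptions}; // same crate bellperson uses for mapped params
use std::fs::File;
use std::path::Path;

const FR_SIZE: usize = 32; // assumed size of a serialized Fr element on disk

// Map the label file read-only instead of copying the requested range into a
// freshly allocated Vec; the OS pages data in on demand, so the large up-front
// allocation disappears.
fn map_layer_file(path: &Path) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    unsafe { MmapOptions::new().map(&file) }
}

// A read_range-style view then becomes a byte slice into the mapping.
fn layer_range(mmap: &Mmap, start: usize, end: usize) -> &[u8] {
    &mmap[start * FR_SIZE..end * FR_SIZE]
}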

@porcuquine
Collaborator Author

I think the lifecycle test I ran for this was misconfigured and failed to exercise the change, and that this is actually giving the wrong result. I'll try to fix it next week.

NOTE: it's not great that this was not caught by CI. We should probably at least run the test that would have caught it in CI; I'll adjust the env vars on the relevant CI job too.

@porcuquine porcuquine force-pushed the feat/optimize-tree-building branch 2 times, most recently from 6cbdb5f to ff3152b Compare March 2, 2021 02:36
@porcuquine
Collaborator Author

I changed this to just call bytes_into_fr, since we do indeed need that transformation. The first pass was not fast enough: it showed results somewhere between the starting point and the initial benchmark posted for this PR above. By performing the conversion in parallel, I got performance comparable to the first version, so I think we're good here now.
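
Roughly the shape of that parallel conversion, as a sketch rather than the exact code in the PR: it assumes bytes_into_fr lives in the fr32 module, takes a 32-byte little-endian slice, and returns a Result, and it uses rayon's par_chunks to spread the per-element conversions across cores; the blstrs::Scalar alias and the helper name are assumptions as well.

use anyhow::Result;
use blstrs::Scalar as Fr;   // assumed concrete Fr type
use fr32::bytes_into_fr;    // assumed: converts a 32-byte little-endian repr into an Fr
use rayon::prelude::*;

// Convert the raw layer bytes into Fr elements in parallel. Doing the (necessary)
// conversion per 32-byte chunk across all cores keeps the added cost small compared
// to the serial PoseidonDomain::from_slice path that was the original bottleneck.
fn layer_bytes_into_fr(layer_bytes: &[u8]) -> Result<Vec<Fr>> {
    layer_bytes
        .par_chunks(std::mem::size_of::<Fr>())
        .map(bytes_into_fr)
        .collect()
}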

I tried modifying the CI config to use the GPU tree and column builders when running a lifecycle test with the GPU, but CI is apparently not configured to use GPUs at all. The lifecycle test is passing locally, though.

Here's a benchmark showing a total running time of 11:21 for tree building. That's ten seconds slower than before. We are indeed doing more work with the added conversion, so a small penalty is not surprising.

2021-03-02T02:07:10.400 INFO storage_proofs_porep::stacked::vanilla::proof > generating tree c using the GPU
2021-03-02T02:07:10.400 INFO storage_proofs_porep::stacked::vanilla::proof > Building column hashes
2021-03-02T02:07:10.551 INFO neptune::proteus::gpu > device: Device { brand: Nvidia, name: "GeForce RTX 2080 Ti", memory: 11551440896, bus_id: Some(33), platform: Platform(PlatformId(0x7f819809c1c0)), device: Device(DeviceId(0x7f81980b60e0)) }
2021-03-02T02:08:26.882 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 1/8 of length 153391689
2021-03-02T02:09:40.463 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 2/8 of length 153391689
2021-03-02T02:10:54.277 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 3/8 of length 153391689
2021-03-02T02:12:08.194 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 4/8 of length 153391689
2021-03-02T02:13:20.888 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 5/8 of length 153391689
2021-03-02T02:14:31.574 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 6/8 of length 153391689
2021-03-02T02:15:42.719 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 7/8 of length 153391689
2021-03-02T02:16:53.711 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 8/8 of length 153391689
2021-03-02T02:16:59.212 INFO storage_proofs_porep::stacked::vanilla::proof > tree_c done
2021-03-02T02:16:59.212 INFO storage_proofs_porep::stacked::vanilla::proof > building tree_r_last
2021-03-02T02:16:59.212 INFO storage_proofs_porep::stacked::vanilla::proof > generating tree r last using the GPU
2021-03-02T02:17:00.716 INFO neptune::proteus::gpu > device: Device { brand: Nvidia, name: "GeForce RTX 2080 Ti", memory: 11551440896, bus_id: Some(33), platform: Platform(PlatformId(0x7f819809c1c0)), device: Device(DeviceId(0x7f81980b60e0)) }
2021-03-02T02:17:03.539 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 1/8
2021-03-02T02:17:14.632 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 2/8
2021-03-02T02:17:26.127 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 3/8
2021-03-02T02:17:38.286 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 4/8
2021-03-02T02:17:50.742 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 5/8
2021-03-02T02:18:01.777 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 6/8
2021-03-02T02:18:11.560 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 7/8
2021-03-02T02:18:21.320 INFO storage_proofs_porep::stacked::vanilla::proof > building base tree_r_last with GPU 8/8
2021-03-02T02:18:31.414 INFO storage_proofs_porep::stacked::vanilla::proof > tree_r_last done

Here's a passing lifecycle test run on a machine with a GPU:

➜  filecoin-proofs git:(feat/optimize-tree-building) ✗ FIL_PROOFS_USE_GPU_TREE_BUILDER=1 FIL_PROOFS_USE_GPU_COLUMN_BUILDER=1 cargo test --release --features=blst,gpu --no-default-features -- --ignored lifecycle_2k
    Finished release [optimized] target(s) in 0.11s
     Running /home/porcuquine/dev/rust-fil-proofs/target/release/deps/filecoin_proofs-63e948d1681f8da0

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 4 filtered out

     Running /home/porcuquine/dev/rust-fil-proofs/target/release/deps/api-67208300f75bdb4d

running 2 tests
test test_seal_lifecycle_2kib_porep_id_v1_1_base_8 ... ok
test test_seal_lifecycle_2kib_porep_id_v1_base_8 ... ok

test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 19 filtered out

     Running /home/porcuquine/dev/rust-fil-proofs/target/release/deps/constants-1203d7a0c6bab322

@porcuquine
Collaborator Author

UPDATE: I tried adding a CI test of tree building running on an actual GPU instance. Let's see whether that works.

@porcuquine porcuquine force-pushed the feat/optimize-tree-building branch 17 times, most recently from 3d4b663 to 6f33252 Compare March 2, 2021 07:50
@porcuquine porcuquine force-pushed the feat/optimize-tree-building branch 6 times, most recently from 9f82f1c to 17364e7 Compare March 2, 2021 08:49
This commit makes the test actually run.
@porcuquine
Collaborator Author

I am closing this in favor of #1422. We may want to use this branch as a starting point, or just reopen this PR when we are eventually ready to eliminate the gpu2 feature flag, since the changes needed to do so are here.

@porcuquine porcuquine closed this Mar 3, 2021
@porcuquine porcuquine deleted the feat/optimize-tree-building branch September 16, 2021 09:05