Stas/best-practice-ntt #517
Conversation
Very good example @svpolonsky
mkdir -p build/icicle

# Configure and build Icicle
cmake -S ../../../icicle/ -B build/icicle -DCMAKE_BUILD_TYPE=Release -DCURVE=bn254 -DG2=OFF
You can also add -DMSM=OFF to exclude MSM compilation, which speeds up the build significantly.
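With that suggestion applied, the configure step from the diff would look like the sketch below (this assumes the MSM option is exposed as a CMake cache variable toggled with -D, like the CURVE and G2 options already shown):

```shell
# Sketch: same configure command as in the diff, with MSM compilation
# disabled to shorten the build (flag name per the reviewer's suggestion).
cmake -S ../../../icicle/ -B build/icicle \
      -DCMAKE_BUILD_TYPE=Release \
      -DCURVE=bn254 \
      -DG2=OFF \
      -DMSM=OFF
```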
done
cudaMemcpyAsync(&h_out[vec_transfer][i*block_size], &d_vec[vec_transfer][i*block_size], sizeof(E)*block_size, cudaMemcpyDeviceToHost, stream_d2h);
}
if (i>0) {
cudaMemcpyAsync(&d_vec[vec_transfer][(i-1)*block_size], &h_inp[vec_transfer][(i-1)*block_size], sizeof(E)*block_size, cudaMemcpyHostToDevice, stream_h2d);
(1) How do you guarantee that the copy from d_vec -> h_out fully executes before h_inp -> d_vec, when both are async and on different streams? This looks like a race.
(2) I think it would be better to compare it to a naive implementation, both in terms of performance and correctness.
(3) Since it's an example, I think additional comments would be helpful.
Regarding (1): I sync the streams on lines 118 and 119:
cudaStreamSynchronize(stream_d2h);
cudaStreamSynchronize(stream_h2d);
Regarding (2): I referenced the existing NTT example as a baseline in the "Running the example" section:
To compare with the ICICLE baseline (i.e., non-concurrent) NTT, you can run [this example](../ntt/README.md)
Regarding (3): I added comments to explain why we won't have races
The sync blocks until all tasks in that stream's queue are completed, but it doesn't by itself enforce ordering between operations on different streams.
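If the intent is to order individual operations across the two streams without blocking the host on a full stream sync, CUDA events can express the dependency directly. The sketch below is illustrative, not the PR's code: it reuses the example's names (d_vec, h_out, h_inp, block_size, stream_d2h, stream_h2d) and omits error checking.

```cpp
// Sketch: make the H2D copy wait for the D2H copy that reads the same
// device buffer, using a CUDA event instead of cudaStreamSynchronize.
cudaEvent_t d2h_done;
cudaEventCreateWithFlags(&d2h_done, cudaEventDisableTiming);

// Device-to-host copy on stream_d2h, then record completion.
cudaMemcpyAsync(h_out, d_vec, sizeof(E) * block_size,
                cudaMemcpyDeviceToHost, stream_d2h);
cudaEventRecord(d2h_done, stream_d2h);

// All work submitted to stream_h2d after this call waits for the event
// on the device; the host thread is not blocked.
cudaStreamWaitEvent(stream_h2d, d2h_done, 0);
cudaMemcpyAsync(d_vec, h_inp, sizeof(E) * block_size,
                cudaMemcpyHostToDevice, stream_h2d);

cudaEventDestroy(d2h_done);
```

This keeps both copy engines busy across loop iterations while still guaranteeing that d_vec is not overwritten before its contents have been read out.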
Looks good
Describe the changes
Icicle examples: Concurrent Data Transfer and NTT Computation
This PR introduces a Best Practice series of examples in C++. Specifically, the example shows how to concurrently transfer data to/from the device and execute NTT.
Linked Issues
Resolves #