Make nonblocking synchronization robust to errors. #1369

Merged: maleadt merged 2 commits into master from tb/async_errors on Feb 14, 2022

maleadt (Member) commented Feb 11, 2022

Our nonblocking synchronization relied on CUDA notifying an async condition, but that notification may never arrive if the stream encounters an error. Protect against this with a timer that periodically queries the stream directly, using a regular query.
Fixes #1366, may reveal something in #1350.

cc @vchuravy @tkf
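
The approach described above (wait for a notification, but let a timer fall back to a regular query) can be sketched roughly as follows. This is a language-neutral illustration in Python, not the CUDA.jl implementation: `FakeStream`, `StreamError`, `nonblocking_synchronize`, and the `poll_interval` value are all hypothetical names and choices made for the example.

```python
import threading


class StreamError(RuntimeError):
    """Raised when the simulated stream reports a failure."""


class FakeStream:
    """Hypothetical stand-in for a GPU stream; not a real CUDA API."""

    def __init__(self):
        # Set by the simulated driver notification when work completes.
        self.done = threading.Event()
        # Set when the stream hits an error; no notification follows.
        self.failed = False

    def query(self):
        """Regular status query: True if finished, False if still running.

        Raises StreamError if the stream has failed.
        """
        if self.failed:
            raise StreamError("stream encountered an error")
        return self.done.is_set()


def nonblocking_synchronize(stream, poll_interval=0.01):
    """Wait for the notification, re-querying the stream on every timeout."""
    while True:
        # Event.wait returns True only if the notification arrived in time.
        if stream.done.wait(timeout=poll_interval):
            return
        # Timed out: fall back to a regular query so that an error surfaces
        # even though the notification will never arrive.
        if stream.query():
            return
```

Without the timeout branch, a failed stream would leave the caller waiting on `done` forever; with it, the error is raised within roughly one `poll_interval`.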

codecov bot commented Feb 12, 2022

Codecov Report

Merging #1369 (bacd69a) into master (0aa7750) will increase coverage by 0.03%.
The diff coverage is 90.00%.

@@            Coverage Diff             @@
##           master    #1369      +/-   ##
==========================================
+ Coverage   77.98%   78.01%   +0.03%     
==========================================
  Files         121      121              
  Lines        8929     8951      +22     
==========================================
+ Hits         6963     6983      +20     
- Misses       1966     1968       +2     

Impacted Files                 Coverage Δ
lib/cudadrv/execution.jl       94.82% <ø> (ø)
src/pool.jl                    76.72% <0.00%> (ø)
lib/cudadrv/stream.jl          93.58% <94.44%> (+0.04%) ⬆️
src/compiler/execution.jl      86.33% <100.00%> (+1.09%) ⬆️
src/compiler/gpucompiler.jl    83.87% <0.00%> (-6.46%) ⬇️
lib/cudnn/error.jl             25.00% <0.00%> (-2.28%) ⬇️
lib/cufft/error.jl             25.00% <0.00%> (-2.28%) ⬇️
lib/cublas/error.jl            25.00% <0.00%> (-2.28%) ⬇️
lib/curand/error.jl            25.00% <0.00%> (-2.28%) ⬇️
lib/cusolver/error.jl          25.00% <0.00%> (-2.28%) ⬇️
... and 7 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0aa7750...bacd69a.

luraess commented Feb 14, 2022

@maleadt I just tried this branch, as suggested, on the application that was hanging, and everything seems to run fine now with Julia 1.7.1 🙂

maleadt merged commit 4703923 into master on Feb 14, 2022
maleadt deleted the tb/async_errors branch on Feb 14, 2022, 11:00
maleadt added a commit that referenced this pull request Feb 15, 2022
Make nonblocking synchronization robust to errors.
maleadt added a commit that referenced this pull request Feb 15, 2022
Make nonblocking synchronization robust to errors.

Successfully merging this pull request may close these issues.

Vectors in customary structs make julia stuck
4 participants