Finalizer error when rapidly writing to HDF5 #1048
#1024 might be able to help with this. It might help to call … Another alternative would be to use …
What version of HDF5.jl are you using?
Also, it would be helpful if you can give a full minimal reproducible example.
This is with v0.16.13.
Huh, we added locking around API calls in #1021, which should be in that version. Do you still see the same issue if you remove the …?
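(Aside for readers following along: "locking around API calls" means each low-level wrapper takes a package-wide lock before calling into libhdf5. A rough sketch of that shape, purely illustrative and not the actual generated wrappers in `src/api/functions.jl`; the `liblock` name and `ccall` details are assumptions:)

```julia
# Illustrative sketch only: an API wrapper guarded by a library-wide lock.
# The real HDF5.jl wrappers are generated; names and ccall details are assumed.
const liblock = ReentrantLock()

function h5i_is_valid(obj_id::Int64)
    lock(liblock)
    try
        # H5Iis_valid returns a tri-state int; >0 means the id is valid
        return ccall((:H5Iis_valid, "libhdf5"), Cint, (Int64,), obj_id) > 0
    finally
        unlock(liblock)
    end
end
```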
Yes, here is an MWE. Run the whole block at once for it to happen. Adding GC.disable/enable seemingly fixes the problem...

```julia
using HDF5

fid = h5open(tempname() * ".h5", "w")
d = create_dataset(fid, "data", datatype(Int), ((1_000_000,1), (-1,1)), chunk=(1,1))
t = Threads.@spawn begin
    #GC.enable(false)
    for i=1:1_000_000
        d[i,1] = 1
    end
    #GC.enable(true)
end
wait(t)
close(fid)
```
I am unable to replicate it on my Mac. I don't have a Windows machine to try it on, unfortunately. Do you see it if you run it outside VSCode? (i.e. if you just run it in a script from the command line)
I'm also on a Mac at the moment. It also crashes from the cmd line but with a slightly different trace:
Versioninfo for this platform:
I still can't replicate it. What is your full …? Can you also try running it with …?
My guess is that we also need to be under low memory conditions for the GC to be attempting to finalize things in the middle of this loop. Nonetheless, this points to a fundamental issue. We should not be locking and unlocking during finalization from a different task.
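(To make that concrete, here is a minimal sketch of the shape of the problem; `liblock`, `Handle`, and the functions are illustrative stand-ins, not HDF5.jl's actual internals:)

```julia
# Illustrative only: the normal API path and the finalizer take the same lock.
const liblock = ReentrantLock()

mutable struct Handle
    id::Int
end

# Normal API path: every call holds liblock for its duration.
read_value(h::Handle) = lock(() -> h.id, liblock)

# Finalizer path: closing also takes liblock, with a *blocking* `lock`.
close_handle(h::Handle) = lock(() -> (h.id = -1), liblock)

h = Handle(1)
finalizer(close_handle, h)

# If the GC runs close_handle while the task it interrupts is inside (or blocked
# on) liblock, the finalizer's blocking lock call collides with that task's own
# lock state -- the kind of collision the traces in this issue show.
```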
I guess I still don't understand how this can happen after #1021.
The problem was caused by #1021. It is because the finalizer tries to call … I raised this and you responded in #1021 (comment). So either JuliaLang/julia#38487 does not actually address this, or there has been a regression.
You're right: I was able to trigger it with …
Does …?
Yes.
So far I am unable to replicate on Windows.
I can replicate on macOS but not consistently. It seemed to occur more frequently when I first logged in.
Possibly related: JuliaLang/julia#47612
I think we need to use something like:

```julia
function try_close_finalizer(x)
    # Test-and-test lock; see the trylock docstring.
    # Only proceed if the finalizer can acquire the lock; do not wait.
    if !islocked(API.liblock) && trylock(API.liblock)
        try
            close(x)
        finally
            unlock(API.liblock)
        end
    else
        # Lock is contended: re-register the finalizer and try again later.
        # TODO: consider exponential backoff
        finalizer(try_close_finalizer, x)
    end
end
```
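(For concreteness, a sketch of how such a finalizer would be attached at construction time, mirroring the diff further down; `MyHandle` is a hypothetical type, not part of HDF5.jl, and it assumes the `try_close_finalizer` above is in scope:)

```julia
# Hypothetical handle type: register the non-blocking finalizer instead of `close`.
mutable struct MyHandle
    id::Int64
    function MyHandle(id)
        obj = new(id)
        finalizer(try_close_finalizer, obj)   # instead of finalizer(close, obj)
        return obj
    end
end
```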
```julia
t = Threads.@spawn begin
    lock(HDF5.API.liblock)
    try
        for i=1:1_000_000
            d[i,1] = 1
        end
    finally
        unlock(HDF5.API.liblock)
    end
end
```

or maybe

```julia
t = Threads.@spawn begin
    @lock HDF5.API.liblock begin
        for i=1:1_000_000
            d[i,1] = 1
        end
    end
end
```
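(The two spellings should behave the same: as I understand it, `@lock l expr` is just sugar for the explicit lock/try/finally/unlock pattern, roughly as below. This is a sketch, not the exact Base expansion:)

```julia
# Roughly what `@lock HDF5.API.liblock begin ... end` expands to.
let temp = HDF5.API.liblock
    lock(temp)
    try
        for i=1:1_000_000
            d[i,1] = 1
        end
    finally
        unlock(temp)
    end
end
```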
On macOS, once I find a terminal process that causes the error, this reliably creates a problem:

```
$ julia +1.9 -t auto --heap-size-hint=50M --banner=no
julia> begin
       using HDF5
       fid = h5open(tempname() * ".h5", "w")
       d = create_dataset(fid, "data", datatype(Int), ((1_000_000,1), (-1,1)), chunk=(1,1))
       t = Threads.@spawn begin
           #GC.enable(false)
           for i=1:1_000_000
               d[i,1] = 1
           end
           #GC.enable(true)
       end
       wait(t)
       close(fid)
       end
error in running finalizer: ErrorException("val already in a list")
error at ./error.jl:35
push! at ./linked_list.jl:53 [inlined]
_wait2 at ./condition.jl:87
#wait#621 at ./condition.jl:127
wait at ./condition.jl:125 [inlined]
slowlock at ./lock.jl:156
lock at ./lock.jl:147 [inlined]
h5i_is_valid at /Users/kittisopikulm/.julia/packages/HDF5/TcavY/src/api/functions.jl:1960
isvalid at /Users/kittisopikulm/.julia/packages/HDF5/TcavY/src/properties.jl:19 [inlined]
close at /Users/kittisopikulm/.julia/packages/HDF5/TcavY/src/properties.jl:11
unknown function (ip: 0x1170d405f)
ijl_apply_generic at /Users/kittisopikulm/.julia/juliaup/julia-1.9.0-beta3+0.aarch64.apple.darwin14/lib/julia/libjulia-internal.1.9.dylib (unknown line)
run_finalizer at /Users/kittisopikulm/.julia/juliaup/julia-1.9.0-beta3+0.aarch64.apple.darwin14/lib/julia/libjulia-internal.1.9.dylib (unknown line)
jl_gc_run_finalizers_in_list at /Users/kittisopikulm/.julia/juliaup/julia-1.9.0-beta3+0.aarch64.apple.darwin14/lib/julia/libjulia-internal.1.9.dylib (unknown line)
...
```

Meanwhile, the using …

```
$ julia +1.9 -t auto --heap-size-hint=50M --banner=no
julia> begin
       using HDF5
       h5open(tempname() * ".h5", "w") do fid
           d = create_dataset(fid, "data", datatype(Int), ((1_000_000,1), (-1,1)), chunk=(1,1))
           t = Threads.@spawn begin
               for i=1:1_000_000
                   d[i,1] = 1
               end
           end
           wait(t)
       end
       end
```
Hmm... acquiring the lock causes a lot of error message scrolling.

Using the …

```
julia> begin
       using HDF5
       h5open(tempname() * ".h5", "w") do fid
           d = create_dataset(fid, "data", datatype(Int), ((1_000_000,1), (-1,1)), chunk=(1,1))
           t = Threads.@spawn @lock HDF5.API.liblock begin
               #GC.enable(false)
               for i=1:1_000_000
                   d[i,1] = 1
               end
               #GC.enable(true)
           end
           wait(t)
           close(fid)
       end
       end
```
This only avoids the error if I use …
Implementing …

```diff
diff --git a/src/HDF5.jl b/src/HDF5.jl
index fb95abf..9e29340 100644
--- a/src/HDF5.jl
+++ b/src/HDF5.jl
@@ -62,8 +62,24 @@ export @read,
 # H5DataStore, Attribute, File, Group, Dataset, Datatype, Opaque,
 # Dataspace, Object, Properties, VLen, ChunkStorage, Reference
+
+
 h5doc(name) = "[`$name`](https://portal.hdfgroup.org/display/HDF5/$(name))"
+function try_close_finalizer(x)
+    if !islocked(API.liblock) && trylock(API.liblock)
+        try
+            close(x)
+        finally
+            unlock(API.liblock)
+        end
+    else
+        finalizer(try_close_finalizer, x)
+    end
+end
+#const try_close_finalizer = Base.close
+
+
 include("api/api.jl")
 include("properties.jl")
 include("context.jl")
diff --git a/src/properties.jl b/src/properties.jl
index 1ca0033..88f16f9 100644
--- a/src/properties.jl
+++ b/src/properties.jl
@@ -104,7 +104,7 @@ macro propertyclass(name, classid)
         id::API.hid_t
         function $name(id::API.hid_t)
             obj = new(id)
-            finalizer(close, obj)
+            finalizer(try_close_finalizer, obj)
             obj
         end
     end
diff --git a/src/types.jl b/src/types.jl
index eea17a9..980d240 100644
--- a/src/types.jl
+++ b/src/types.jl
@@ -5,6 +5,7 @@
 # Supertype of HDF5.File, HDF5.Group, JldFile, JldGroup, Matlabv5File, and MatlabHDF5File.
 abstract type H5DataStore end
+
 # Read a list of variables, read(parent, "A", "B", "x", ...)
 function Base.read(parent::H5DataStore, name::AbstractString...)
     tuple((read(parent, x) for x in name)...)
@@ -41,7 +42,7 @@ mutable struct File <: H5DataStore
     function File(id, filename, toclose::Bool=true)
         f = new(id, filename)
         if toclose
-            finalizer(close, f)
+            finalizer(try_close_finalizer, f)
         end
         f
     end
@@ -55,7 +56,7 @@ mutable struct Group <: H5DataStore
     function Group(id, file)
         g = new(id, file)
-        finalizer(close, g)
+        finalizer(try_close_finalizer, g)
         g
     end
 end
@@ -69,7 +70,7 @@ mutable struct Dataset
     function Dataset(id, file, xfer=DatasetTransferProperties())
         dset = new(id, file, xfer)
-        finalizer(close, dset)
+        finalizer(try_close_finalizer, dset)
         dset
     end
 end
@@ -84,14 +85,14 @@ mutable struct Datatype
     function Datatype(id, toclose::Bool=true)
         nt = new(id, toclose)
         if toclose
-            finalizer(close, nt)
+            finalizer(try_close_finalizer, nt)
         end
         nt
     end
     function Datatype(id, file::File, toclose::Bool=true)
         nt = new(id, toclose, file)
         if toclose
-            finalizer(close, nt)
+            finalizer(try_close_finalizer, nt)
         end
         nt
     end
@@ -106,7 +107,7 @@ mutable struct Dataspace
     function Dataspace(id)
         dspace = new(id)
-        finalizer(close, dspace)
+        finalizer(try_close_finalizer, dspace)
         dspace
     end
 end
@@ -119,7 +120,7 @@ mutable struct Attribute
     function Attribute(id, file)
         dset = new(id, file)
-        finalizer(close, dset)
+        finalizer(try_close_finalizer, dset)
         dset
     end
 end
```
I've stumbled over this bug too. Any chance to land the fix and cut a quick bugfix release?
Yes, the branch worked.
Hi,
I have the following setup:
If the writes are too fast, the task/thread crashes with the following trace:
Versioninfo: