-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add built-in support to write compound data types #1013
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At minimum I think this should be write_compound_dataset
since one could also write a compound type to an attribute.
My preference would be to extend HDF5.datatype
to work on arbitrary structs and named tuples so that we can just use the normal write_dataset
interface. After enabling that, then we should reevaluate if a special convenience method is still warranted.
Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
f65a0dd
to
fe273a9
Compare
Let me comment at the end since I changed a few things.
so that @mkitti I am still not sure about the type freedom, so let me know what you think! |
@tamasgal, in case you don't follow what I'm talking about in the comments, I mean the following: julia> import HDF5: API, Datatype
julia> HDF5.datatype(x::T) where T = HDF5.datatype(x, Val(isstructtype(T)))
julia> function HDF5.datatype(x::T, isstruct::Val{true}) where T
dtype = API.h5t_create(API.H5T_COMPOUND, sizeof(T))
for (idx, fn) in enumerate(fieldnames(T))
API.h5t_insert(dtype, fn, fieldoffset(T, idx), datatype(fieldtype(T, idx)))
end
Datatype(dtype)
end
julia> datatype(MyStruct(3,5))
HDF5.Datatype: H5T_COMPOUND {
H5T_STD_I64LE "x" : 0;
H5T_STD_I64LE "y" : 8;
}
julia> datatype((x=5, y=3.f0))
HDF5.Datatype: H5T_COMPOUND {
H5T_STD_I64LE "x" : 0;
H5T_IEEE_F32LE "y" : 8;
}
|
Ah yes, I was struggling with that. Did not came up with the |
People may want to override this method still for their specific types. Your implementation relies on Julia's memory aligned version, which may take up more space than needed. julia> struct MyStruct32
x::Int32
y::Int32
end
julia> datatype(MyStruct32(3,5))
HDF5.Datatype: H5T_COMPOUND {
H5T_STD_I32LE "x" : 0;
H5T_STD_I32LE "y" : 4;
}
julia> struct MyStructMixed
x::Int32
y::Int
end
julia> datatype(MyStructMixed(3,5))
HDF5.Datatype: H5T_COMPOUND {
H5T_STD_I32LE "x" : 0;
H5T_STD_I64LE "y" : 8;
}
julia> sizeof(MyStruct32)
8
julia> sizeof(MyStructMixed)
16 As for the julia> Base.@kwdef struct MyStructKW
x::Int32
y::Int
end
MyStructKW
julia> MyStructKW(;(y = 1, x = 2)...)
MyStructKW(2, 1)
julia> Base.convert(::Type{MyStructKW}, nt::NamedTuple) = MyStructKW(; nt...)
julia> convert(MyStructKW, (y = 1, x = 2))
MyStructKW(2, 1) |
You are right about the memory padding, I guess I will just use the actual sizes as offset then. Not sure if I understand regarding |
I'm talking about your comment:
If they wrote a |
We would have to test the conversion closely I think. This may make memory mapping difficult for example. |
Ah yes, understood Btw. regarding the |
I see. I don't know much about the memory mapping features of |
I am still struggling to find out where I hit the |
What is the stack trace? |
So with this it works (but of course not for writing single structs): datatype(x::AbstractArray{T}) where {T} = HDF5.datatype(x, Val(isstructtype(T)))
function datatype(x::AbstractArray{T}, isstruct::Val{true}) where {T}
dtype = API.h5t_create(API.H5T_COMPOUND, sizeof(T))
for (idx, fn) in enumerate(fieldnames(T))
API.h5t_insert(dtype, fn, fieldoffset(T, idx), datatype(fieldtype(T, idx)))
end
Datatype(dtype)
end and if I change the signatures to: datatype(x::T) where {T} = HDF5.datatype(x, Val(isstructtype(T)))
function datatype(x::T, isstruct::Val{true}) where {T} and call with the same tests, like bars = [
[Bar(1, 1.1, true) Bar(2, 2.1, false) Bar(3, 3.1, true)]
[Bar(4, 4.1, false) Bar(5, 5.1, true) Bar(6, 6.1, false)]
]
namedtuples = [(a=1, b=2.3), (a=4, b=5.6)]
fn = tempname()
h5open(fn, "w") do h5f
write_dataset(h5f, "the/bars", bars)
write_dataset(h5f, "the/namedtuples", namedtuples)
end I get
|
I also checked with bars = @SMatrix [
[Bar(1, 1.1, true) Bar(2, 2.1, false) Bar(3, 3.1, true)]
[Bar(4, 4.1, false) Bar(5, 5.1, true) Bar(6, 6.1, false)]
]
namedtuples = @SVector [(a=1, b=2.3), (a=4, b=5.6)] |
src/typeconversions.jl
Outdated
@@ -60,6 +60,25 @@ function datatype(str::VLen{C}) where {C<:CharType} | |||
Datatype(API.h5t_vlen_create(type_id)) | |||
end | |||
|
|||
# Compound types | |||
|
|||
datatype(::T) where {T} = Datatype(hdf5_type_id(T), false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why toclose = false
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I am not sure why this is needed, I took it over from @mkitti
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just following similar methods here:
HDF5.jl/src/typeconversions.jl
Lines 197 to 199 in 987d9c8
datatype(x::ScalarType) = Datatype(hdf5_type_id(typeof(x)), false) | |
datatype(::Type{T}) where {T<:ScalarType} = Datatype(hdf5_type_id(T), false) | |
datatype(A::AbstractArray{T}) where {T<:ScalarType} = Datatype(hdf5_type_id(T), false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are simply accessing pre-defined constants:
HDF5.jl/src/typeconversions.jl
Lines 115 to 127 in 259763f
hdf5_type_id(::Type{Bool}) = API.H5T_NATIVE_B8 | |
hdf5_type_id(::Type{Int8}) = API.H5T_NATIVE_INT8 | |
hdf5_type_id(::Type{UInt8}) = API.H5T_NATIVE_UINT8 | |
hdf5_type_id(::Type{Int16}) = API.H5T_NATIVE_INT16 | |
hdf5_type_id(::Type{UInt16}) = API.H5T_NATIVE_UINT16 | |
hdf5_type_id(::Type{Int32}) = API.H5T_NATIVE_INT32 | |
hdf5_type_id(::Type{UInt32}) = API.H5T_NATIVE_UINT32 | |
hdf5_type_id(::Type{Int64}) = API.H5T_NATIVE_INT64 | |
hdf5_type_id(::Type{UInt64}) = API.H5T_NATIVE_UINT64 | |
hdf5_type_id(::Type{Float32}) = API.H5T_NATIVE_FLOAT | |
hdf5_type_id(::Type{Float64}) = API.H5T_NATIVE_DOUBLE | |
hdf5_type_id(::Type{Reference}) = API.H5T_STD_REF_OBJ | |
hdf5_type_id(::Type{<:AbstractString}) = API.H5T_C_S1 |
In this case you're creating a new datatype handle on each call to the function: either they need to be cleaned up via GC, or you need some sort of caching mechanism so that subsequent invocations won't keep creating new handles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're also defining new hdf5_type_id
below.
Using the finalizer can get messy due to multithreading issues. What we should consider, perhaps in another pull request, is a do
method where we automatically close the Datatype
in the current thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're also defining new
hdf5_type_id
below.
What I meant is that e,g, hdf5_type_id(::Type{Int64})
is just accessing a constant, it doesn't create a new handle. The new method in this PR (https://github.com/JuliaIO/HDF5.jl/pull/1013/files#diff-bf557f1b887a640c4fe55cee7279c9f621001ff726a876b9f7bfbd7f8a7d0b44R70) calls h5t_create
, creating a new handle. They need to be cleaned up, or you'll have a memory leak.
I'm not sure I understand the threading issue: we have finalizers on lots of other objects, why would Datatypes be special in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The finalizers are problematic elsewhere as well. But we also have a do
interface in many of those cases which helps.
Quoting Jeff Benzanson in JuliaLang/julia#11207 :
Finalizers are inefficient and unpredictable. And with the new GC, it might take much longer to get around to freeing an object, therefore tying up its resources longer. Ideally releasing external resources should not be tied to how memory management works.
The problem is particularly acute here since the underlying HDF5 C library is not thread-safe. JuliaLang/julia#45272 improves the situation by making the finalizer calls eager and deterministic, making it likely that the finalizer calls H5T_close
using the same thread.
The better solution here is to either explicitly call close
which should invoke HDF5.API.h5t_close
or we provide a do
syntax to assist with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't disagree, and adding some sort of do
syntax pattern is a good idea, but that doesn't preclude having finalizers as well for when that approach isn't feasible.
Within the scope of this PR, these objects should be finalized, or otherwise you're going to get memory leaks.
I played around a bit but the problem is bigger than I thought. It basically boils down to the low level I/O where pointers are written instead of the values of the primitive types. I am not sure which what's the best approach is but one fast solution would be throwing an error if What do you think? |
We may need to dispatch on |
Yeah, I think it makes sense to error somewhere if the struct type isn't |
I have solution for mutable types. Basically, we need to convert the mutable type to a |
Thanks for jumping in @mkitti ! How should we proceed with the finalisation problem? |
For now, I've changed the second argument to Perhaps we should probably |
One thing we recently added in MPI.jl (which has a similar construction for custom datatypes) is an internal cache, which may be helpful when reading/writing multiple times: |
Should this code be here or should it be in JLD.jl or a third package? |
I am a bit confused which actions need to be taken in order to merge this. Can I do anything else to make progress? |
I'm leaning towards the merge side, but I think @musm needs to review. I think we should consider @simonbyrne suggestion of a cache, but that could be another pull request. |
src/typeconversions.jl
Outdated
|
||
# These will use finalizers. Close them eagerly to avoid issues. | ||
datatype(::T) where {T} = Datatype(hdf5_type_id(T), true) | ||
datatype(::Type{T}) where {T} = Datatype(hdf5_type_id(T), true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this definition redundant with the method right above it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. The first works on instances of a type. The second works on the type itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't seem to need this definition (on L67)
@@ -46,6 +46,12 @@ The default `Dataspace` used for representing a Julia object `data`: | |||
- arrays: a simple `Dataspace` | |||
- `nothing` or an `EmptyArray`: a null dataspace | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we update the docs here to include the fact that the new default is a scalar Dataspace
for struct types?
LGTM sans minor comment Agree cache sounds interesting but probably for seperate PR |
I will see if I can address @musm's comments and then merge. |
Sounds good! |
Co-authored-by: Mustafa M <mus-m@outlook.com>
Great work! I am wondering if it is possible to do a release soon so that this can be consumed by general users without |
done |
This PR contains a utility function (EDIT: actually after some discussions it's now realised via adding a method to
datatype()
) which I carry around my projects and I discovered that there is already an open issue which is related too, see #819It's by all means not fully fledged, but it works for
struct
s with primitive field types very well. It's not reversible, meaning thatread
will yield the usualNamedTuple
. One way to make it reversible would be to attach metadata to the dataset with e.g. the type name and when read back, check if the type is already defined in the namespace and if not, create it dynamically. Kind of what I do in https://github.com/JuliaHEP/UnROOT.jl where generating types during runtime is mandatory.I provided an example in the function docstring which demonstrates that it's fairly intuitive to use and covers the basic needs. Here is it:
The data can be reinterpreted to the original data type using
reinterpret()
: