-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimise depends_on chain resolution by using h5py #253
Conversation
src/scippnexus/nxtransformations.py
Outdated
transform = parent[depends_on] | ||
transform = _locate_depends_on_target(parent, depends_on) | ||
parent = transform.parent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have not looked at the rest of the code yet, but can't you do more here? The parent
logic is still stepping back and forth between h5py and ScippNexus, and I don't know if transform.parent
comes "for free". Can't parent
be an h5py.Group
for the entire loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this now. But it doesn't affect performance.
b9e15b7
to
f72aff4
Compare
self.sizes = _squeezed_field_sizes(self.dataset) | ||
if self.dtype is None: | ||
self.dtype = _dtype_fromdataset(self.dataset) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check that this is sound!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feared that there might be some code that relies on field.sizes
being None
(e.g., to do custom init logic only if dims are not set yet) before calling NXobject.__init__
, but I could not find any (looking mainly at nxdata.py
). So I think it should be ok?
"""Subclasses should call this in their __init__ method, or ensure that they | ||
initialize the fields in `children` with the correct sizes and dtypes.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring is outdated now, at least in parts?
self.sizes = _squeezed_field_sizes(self.dataset) | ||
if self.dtype is None: | ||
self.dtype = _dtype_fromdataset(self.dataset) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feared that there might be some code that relies on field.sizes
being None
(e.g., to do custom init logic only if dims are not set yet) before calling NXobject.__init__
, but I could not find any (looking mainly at nxdata.py
). So I think it should be ok?
src/scippnexus/nxtransformations.py
Outdated
base_parts = list(cwd.parts) | ||
path_parts = [segment for segment in path.parts if segment != '.'] | ||
while path_parts[0] == '..': | ||
if not path_parts: | ||
raise ValueError(f"Relative path beyond root: '{path}'") | ||
base_parts.pop(-1) | ||
path_parts.pop(0) | ||
return PurePosixPath(*base_parts, *path_parts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use something like
scippnexus/src/scippnexus/field.py
Lines 39 to 42 in 349c8df
def absolute_path(self) -> str | None: | |
if self.value == '.': | |
return None | |
return posixpath.normpath(posixpath.join(self.parent, self.value)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is posixpath
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import posixpath
src/scippnexus/nxtransformations.py
Outdated
|
||
The returned object is equivalent to calling ``parent[depends_on]`` | ||
in the context of transformations. | ||
This function does not work in general because it does not handle NXdata attributes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why you mention NXdata attributes? What are you referring to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this is what the comment here
scippnexus/src/scippnexus/base.py
Lines 371 to 373 in 349c8df
# We cannot get the child directly from the HDF5 group, since we need to | |
# create the parent group, to ensure that fields get the correct properties | |
# such as sizes and dtype. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but NXdata is just one case (it is a baseclass for some but not all other NX* implementations).
src/scippnexus/nxtransformations.py
Outdated
transform, base_path = _locate_depends_on_target( | ||
file, base_path, PurePosixPath(depends_on), parent.definitions | ||
) | ||
depends_on = transform.attrs['depends_on'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we are loading the transform anyway, and the loaded Transform
has access to its parent via its depends_on
, do we need the base_path
returned from _locate_depends_on_target
? It seems like a duplicate control flow, unless I am missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this to avoid accessing the parent because of #253 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point was: We are loading the transform anyway (in the line below), so the parent path is just a Python string stored in a Python class? There should be no more snx.Group
stuff at the level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is it stored? Transform
objects don't track their parent path or any other path. And the object returned from _locate_depends_on_target
is Field
or Group
. And accessing its parent
gets us back to #253 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is in Transform.depends_on.parent
, or is that a different one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. The DependsOn
object could even make the whole implementation simpler.
But it doesn't actually work because transform[()]
can return DataArray
if the transform is broken in the file. In that case, the current code still builds a complete transformation chain that contains data arrays instead of Transform
objects. But those don't carry a depends_on
attribute. See, e.g., test_compute_transformation_warns_if_transformation_missing_vector_attr
.
So I don't see an alternative to tracking the parent path manually here that doesn't change the behaviour.
This reduces the time it takes to resolve
depends_on
chains by locating the target using h5py and a custom path resolver. It avoids creating intermediatesnx.Group
objects which is expensive.For loading the 45 BIFROST analysers using
I get these results on my machine: