Using sisl to parse aiida-siesta output #484
Thanks for your request! Indeed this would be nice to have. I think one way to easily do this is to add a I think your option 3 would be the best direction. @pfebrer I just thought about this way of achieving this, what do you think? To test this, I would just inherit one class and add the |
Yes, we already discussed something like this, and Jonas proposed a workaround, here: #205 (comment). Recently I also proposed that it would be nice to be able to avoid the In the case of not being able to avoid using siles, probably something like StringIO and BytesIO are probably more difficult to implement, but with file handles it should be fairly easy, no? Instead of |
I have been playing around a bit with this and I have something which sort of works for the ASCII siles. The way I have implemented it now is by modifying the constructor of the While working on this I ran into the issue that some of the The usage of the iterator for some of the reading routines also leads to problems when using the sile inside a
Because Finally, another small bug I discovered is that If you want to play around with it yourself, my current implementation is here https://github.com/ahkole/sisl/tree/sile-from-file-handle . |
If you added the Lines 472 to 476 in 3f11290
If |
Perhaps you would need to modify the Lines 516 to 544 in 3f11290
I'm not sure. |
That could be an alternative to copying the file handle in Also, this does not fix the issue that I raised where the file handle gets closed when you exit an iterator loop in some of the read routines (at least I suspect that is the reason the file handle gets closed). |
I'm not sure which of the approaches is better because I don't have the whole picture in mind; I think Nick is the only one capable of answering that question :) Anyway, as I have already said in some discussions, and this issue makes me even more confident about it, the best solution would be to write |
Thanks for your work on this.
Why is it difficult to add new features with the classes? I mean, if you want to read an xyz some place in the middle, and something from a cube file, I imagine something like this:
fh = ...
geom = xyzSile(fh).read_geometry()  # this will step the file-handle
grid = cubeSile(fh).read_grid()  # this will also step the file-handle
|
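To make the "stepping" behaviour concrete, here is a minimal standard-library sketch. It does not use sisl: `read_xyz_block` is a hypothetical stand-in for a reader that consumes one section from a shared handle, and the data is made up.

```python
from io import StringIO


def read_xyz_block(fh):
    """Read one xyz-style block: atom count, comment line, then the atom lines."""
    n_atoms = int(fh.readline())
    comment = fh.readline().rstrip("\n")
    atoms = []
    for _ in range(n_atoms):
        symbol, x, y, z = fh.readline().split()
        atoms.append((symbol, float(x), float(y), float(z)))
    return comment, atoms


# Two blocks stored back to back in a single buffer (made-up data).
buffer = StringIO(
    "2\nfirst\nH 0.0 0.0 0.0\nH 0.0 0.0 0.74\n"
    "1\nsecond\nHe 1.0 1.0 1.0\n"
)

first = read_xyz_block(buffer)    # steps the handle past the first block
position = buffer.tell()          # the handle now sits at the second block
second = read_xyz_block(buffer)   # continues where the first call stopped
```

Each call leaves the handle wherever it stopped reading, so the order of calls matters and the caller is responsible for rewinding with `seek` if a section has to be read again.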
For instance, I can get a subclassed element to work like this:
from pathlib import Path
from sisl.io.xyz import xyzSile

class fhxyz(xyzSile):
    def __init__(self, filehandle, comment=None, *args, **kwargs):
        # take the file name and mode from the handle itself
        self._file = Path(filehandle.name)
        self._mode = filehandle.mode
        if isinstance(comment, (list, tuple)):
            self._comment = list(comment)
        elif comment is not None:
            self._comment = [comment]
        else:
            self._comment = []
        self._line = 0
        self.fh = filehandle

    def _open(self):
        # the handle is already open
        pass

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        # we will not close a file-handle
        return False
This could be part of the |
I'm not saying it would be hard to use; of course the API can be very simple. I'm more worried about the internal complexity that it adds unnecessarily. I agree that it might require some work at some points where the reading functionality is bound to the class. But is it that much? If you were to build it from scratch, how would you do it? What will be easier to extend in the future? The added internal complexity also leads to obscure behaviors like the one @ahkole was pointing out, which you can't avoid unless you modify or subclass library code.
I don't see a problem with that. The user should know the position of their file handle, if it matters.
I think the bare reading and writing functions should not be able to backtrack a handle unless it is explicitly specified by the user in cases where it can make sense. |
Agreed, this is a bit worrying, however, the idea would be that this small complexity should be stable, i.e. the above class structure is a bit complex, but very small, and hence manageable. It shouldn't be changed everywhere, only in the
I don't see a huge benefit for functions over classes, I mean that many of these siles are rather simple, some are complex, but those are only the keyworded files (fdf, etc.).
My point would be that siles should generally be agnostic when it comes to file-handles or file-names, i.e. all it does is read and parse content in some buffer. So the classes should be the same, regardless.
But it matters greatly, you can't say,
Because then you have passed the supercell position. My experience is that this will be confusing to end-users. To add to my point about users, @ahkole's bug on the Sorry, I didn't respond to your bug before @ahkole, now I did. I should clarify this in the sile.
I agree. Again, I am not per se against this; if it could aid in the use of files, then by all means. I must admit I just don't see a big improvement considering the effort required to re-implement them. And I would be very happy with a proof of concept, so don't hold back... ;) To come back to this, here is an updated class that also handles
from io import StringIO
from pathlib import Path
from sisl.io.xyz import xyzSile

fh = open("CAP.xyz", "r")
string = StringIO(fh.read())
fh.seek(0)

class handle_or_StringIO_xyz(xyzSile):
    def __init__(self, filehandle, *args, **kwargs):
        try:
            filename = Path(filehandle.name)
        except AttributeError:
            # this is not optimal, it will be the current directory, but one should not be able
            # to write to it
            filename = Path()
        try:
            mode = filehandle.mode
        except AttributeError:
            # a StringIO will always be able to read *and* write
            # to its buffer
            mode = 'rw'
        self.fh = filehandle
        # remember where the handle was so that _open can return to it
        self._fh_init_tell = filehandle.tell()
        super().__init__(filename, mode, *args, **kwargs)

    def _open(self):
        self.fh.seek(self._fh_init_tell)
        self._line = 0

    def __exit__(self, type, value, traceback):
        # we will not close a file-handle
        self._line = 0
        return False

fh_geom = handle_or_StringIO_xyz(fh).read_geometry()
string_geom = handle_or_StringIO_xyz(string).read_geometry()
geom = xyzSile("CAP.xyz").read_geometry()
print(geom.equal(fh_geom))
print(geom.equal(string_geom))
|
Yes perhaps it's more practical to put it inside the However I think that this will make it more difficult for external libraries to rely on sisl for parsing since they have to trust the class inner workings, which can be obscure and create overheads that will not be easy to avoid. At least not in a simple, elegant way. I don't know if this is something to worry about. |
Agreed that our solution should be stable. |
Could you please have a look at the branch 484-filehandle for a first attempt... More tests should be added before final merge, probably... |
@ahkole could you please have a look and see if the current branch level would fix the issues for you? It should be able to do what you requested without copying buffers etc. (I hope ;)) |
@ahkole now it should be in its final stage. Once you approve and have tested with aiida, let me know and I will merge it into main! |
@zerothi I have tested the newest version with aiida and it seems to work very well for all the text-based output files retrieved by aiida! There is still the bug though that if you call a reading routine that does the reading by iterating over the sile object (i.e.
This is not what we want, right? I think the culprit here is the definition of the iterator object for the
If instead we don't yield from the filehandle, but read the lines ourselves and yield those then the bug of the filehandle being closed goes away, i.e. with something like this:
If I change this in the code then |
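The difference between the two iteration styles can be reproduced with the standard library alone. The sketch below is not sisl's code: `ClosingReader` and `NonClosingReader` are hypothetical classes, and the assumption that the handle is closed by a surrounding `with` block is only one plausible mechanism for the behaviour described above.

```python
from io import StringIO


class ClosingReader:
    """Hypothetical reader whose iterator wraps the handle in a `with` block."""

    def __init__(self, fh):
        self.fh = fh

    def __iter__(self):
        with self.fh:              # leaving this block closes the handle
            yield from self.fh


class NonClosingReader:
    """Hypothetical reader that reads the lines itself and never closes the handle."""

    def __init__(self, fh):
        self.fh = fh

    def __iter__(self):
        while True:
            line = self.fh.readline()
            if not line:           # an empty string signals end of file
                return
            yield line


closing_fh = StringIO("a\nb\n")
closing_lines = list(ClosingReader(closing_fh))

open_fh = StringIO("a\nb\n")
open_lines = list(NonClosingReader(open_fh))
```

Both readers yield the same lines, but only the second leaves the caller's handle usable afterwards, which is what matters when the handle is owned by the caller.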
@zerothi I also had a few additional feature requests related to this.
|
You are probably right that we don't want ever to close the buffer. I'll amend. |
Please open a new issue with specific details. :) |
Using the
xyz_sile_class = si.get_sile_class("xyz")
xyz = xyz_sile_class(filebuffer)
# or in one
xyz = si.get_sile_class("xyz")(filebuffer)
this would be my recommended way to do it. |
Ah, yes, that is actually a much better way and works out of the box :) |
I'll try to play around with these python buffers and see if I can open a NetCDF4 Dataset like that and if it works I'll create a new issue with the details on the feature request :) |
Perhaps you should be able to pass the format string to the I.e.: I'm saying this just because two consecutive calls usually make code seem more complex. |
This won't work because sisl tries to figure out file extension before diverting to class arguments... |
But isn't it as simple as modifying |
A PR would be welcome... However, the problem is that |
Hmm, I was thinking of just changing the three lines of Lines 312 to 314 in 7b043ed
to
cls = kwargs.pop('cls', None)
if isinstance(cls, str):
    # a format name such as "xyz" was given instead of a class
    sile = get_sile_class(cls, *args, **kwargs)
else:
    sile = get_sile_class(file, *args, cls=cls, **kwargs)
return sile(Path(str_spec(str(file))[0]), *args, **kwargs)
|
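The dispatch idea in this proposal can be illustrated without sisl. The following is a sketch under stated assumptions: `_REGISTRY`, `XyzReader`, and `get_reader` are hypothetical names invented for illustration, and the suffix lookup is a simplification rather than how sisl resolves file types.

```python
from io import StringIO
from pathlib import Path


class XyzReader:
    """Hypothetical reader that accepts either a path or an open buffer."""

    def __init__(self, source):
        self.source = source


# Hypothetical registry mapping a format name to a reader class.
_REGISTRY = {"xyz": XyzReader}


def get_reader(source, cls=None):
    """Pick a reader from a format string, an explicit class, or the file suffix."""
    if isinstance(cls, str):
        reader_class = _REGISTRY[cls]        # format name given explicitly
    elif cls is not None:
        reader_class = cls                   # class given explicitly
    else:
        # fall back to the extension, which only works for path-like sources
        reader_class = _REGISTRY[Path(source).suffix.lstrip(".")]
    return reader_class(source)


from_buffer = get_reader(StringIO("1\n\nH 0 0 0\n"), cls="xyz")
from_path = get_reader("geometry.xyz")
```

The explicit format string is what makes a buffer usable here, since a buffer carries no extension from which the type could be inferred.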
I'd rather not. I don't think it becomes particularly clean... :( Also, in the future I had in mind that |
What was the problem for using binary streams instead of text streams (i.e. In particular we are encountering a situation where we would like to parse the content of siesta basis files into |
I can't recall, I guess it should be done similarly to the way the However, first we should be sure that the netcdf4 package actually provides this feature, see here: It should be there, Unidata/netcdf4-python#652 |
I see that @ahkole was going to try out a buffer with NetCDF, #484 (comment) any progress there? |
I looked into it a little bit. You can read netCDF data from a
I have not tried using this to create a binary equivalent of the This only works for netCDF binary files by the way. If you want to handle any other binary files using some sort of |
Great, so it sounds like this could be passed in the same mechanism as we did for the regular siles. Perhaps we should subclass |
I leave it to @ahkole if he wants to do it :) |
I could have a look at it next week, sure. It hopefully shouldn't be too difficult to include support for NetCDF files. |
It's not urgent :) |
Describe the feature
I am trying out AiiDA to manage my workflow with siesta. I really like what the aiida-siesta (https://docs.siesta-project.org/projects/aiida-siesta/en/latest/) has already implemented in terms of calculations and workflows. This plugin also has a simple parser for parsing some of the basic siesta output files. I would like to extend the functionality to also be able to parse information from other siesta output files. Sisl has a large collection of parsers available, so it would be great if I could use these to parse the output files produced by siesta. However, I haven't found a good, robust way yet to use sisl for this. AiiDA stores retrieved output files from simulations in a file repository and gives you access to these files through a FolderData API (https://aiida.readthedocs.io/projects/aiida-core/en/stable/topics/data_types.html#topics-data-types-core-folder). This API is able to provide a file handle object for reading or can simply read and return the entire content of the file. On the other hand, to use sisl for parsing a file you need to supply the filename of the file and sisl handles the rest. Unfortunately, I don't have access to the filename, just to either a file handle or the complete file content. Is there a way to make sisl work with AiiDA retrieved files? So far I have come up with a couple of potential solutions:
get_sile to parse this temporary file and then remove it. This should work out of the box but involves a lot of unnecessary IO. This also probably becomes unnecessarily slow if you have large output files.
StringIO and BytesIO classes to provide a file-like interface to the other methods in the sile. The big disadvantage of this is that you cannot derive the filetype from the extension, so you would have to supply it some other way. And this may become slow and require tons of memory (because the entire file needs to be loaded into memory) if you have large files.
What do you think? Is there a way that sisl could be used for parsing files retrieved by AiiDA? And if so, do any of the above solutions look like a good one? Or do you have a better proposal? I would also be happy to help with any implementation and/or testing once we have settled on an approach.
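For reference, the temporary-file workaround mentioned above can be sketched with the standard library only. `parse_via_tempfile` and `count_lines` are hypothetical helpers written for this illustration; `count_lines` merely stands in for any parser that insists on a file name.

```python
import os
import tempfile
from pathlib import Path


def parse_via_tempfile(content, suffix, parse):
    """Write `content` to a temporary file ending in `suffix`, parse it, then delete it."""
    fd, name = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
        return parse(Path(name))
    finally:
        os.remove(name)            # always clean up, even if parsing fails


def count_lines(path):
    # Stand-in for a filename-based parser.
    with open(path) as fh:
        return sum(1 for _ in fh)


n_lines = parse_via_tempfile("1\ncomment\nH 0 0 0\n", ".xyz", count_lines)
```

With sisl, the `parse` argument could be something like `lambda p: sisl.get_sile(p).read_geometry()`; the suffix is what lets the file type be derived from the extension, at the cost of the extra disk IO described above.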