-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write out results schema to filesystem in summarize functions #1729
Write out results schema to filesystem in summarize functions #1729
Conversation
Can one of the admins verify this patch? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
EDIT: Probably not going to be as easy as I was expecting
How come? Because of the pickle error? See below.
As a general comment, it's unclear to me if pickle is the best approach here because the results are not easy to inspect outside of Python, and pickle is not guaranteed to work across Python versions. It's also not exactly how the results are stored in the database since it is the serialized results that are stored. All schemas are (de)serializable. As such, I would recommend the following:
from monty.serialization import dumpfn
dumpfn(task_doc, "results.json") # it will get auto-gzizpped if `GZIP_FILES: bool = True`
You can read it back in a lossless way via
from monty.serialization import loadfn
loadfn("results.json")
Make sure the results are written safely somewhere while the file database problem is not sorted
Sure, that's fair. Most users of quacc
are also MongoDB users and so we haven't had such issues, but I have long been aware that (seemingly convenient) filesystem-based stores are a large challenge for concurrent workflows in general. This is because databases that involve concurrent connections often need a client/server model, which is what most people using a filesystem-based data store are trying to avoid in the first place. This is a non-trivial issue but one I hope can be solved by others on the Materials Project team.
For calculators where nothing is written to the disk (MACE, EMT, ...) this results file gives sense to the currently empty directories
They will only be empty for static calculations, as relaxations will dump out the log and restart files. In any case, it is perhaps worthwhile for us to deal with this directly if it is a nuisance. One option would be to see if CREATE_UNIQUE_DIR
is True
and if directory
is blank. If so, the directory
can be removed. Additionally, we would replace the value of dir_name
in the output with a value of None
to reflect the fact that it doesn't exist on the filesystem anymore.
This is not something you need to implement (for the sake of simplicity, let's not deal with it in this PR regardless). I am just suggesting it as a viable option for later.
I'll take a look later to figure out what's going on here. |
@Andrew-S-Rosen Thanks, I thought the same. It seems that a string is finding its way in the Monty code while it should be a dict. EDIT: Thanks for the comment btw, it's always nice to learn new things, there are some aspect of python I am not familiar with. |
I am sorting this out. Seems to be related to materialsproject/emmet#914. |
I just fixed what I believe is the upstream issue. Just waiting on a new release of Edit: Okay that worked, but now we have the following to deal with: |
Note to self for later: We can circumvent these issues by |
Edited: Alright, the JSON serialization approach does work for the most part. I just had to I have sorted out the pickle errors. The only thing left now should be to uncomment the remaining |
Fix pickle-ability of output schemas, first noticed in #1729.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1729 +/- ##
=======================================
Coverage 99.37% 99.38%
=======================================
Files 81 81
Lines 3200 3227 +27
=======================================
+ Hits 3180 3207 +27
Misses 20 20 ☔ View full report in Codecov by Sentry. |
The light at the end of the tunnel: the checks have emerged (after quite a bit of refactoring in |
Thanks for taking care of that, I think that was a little bit beyond my python abilities, tests will come this weeks probably EDIT: it seems that one particular line takes a lot of time (flat 30 seconds each time I run a calculation) I will turn off get_metadata for now metadata = MoleculeMetadata().from_molecule(mol).model_dump() in quacc.schemas.atoms.py |
Is this a result of the change in the present PR, or are you just commenting more generally? Could you also provide some insight about the |
Ah yes! There is something inside the atoms.info, although it's not a lot just a simple short string. I assume this was related to this commit but it might not be you are right |
Here are the tests! Did not create new ones but made sure to test all summarise functions |
@tomdemeyere: Fantastic, thank you very much! I will proceed with the merge. Later on, if you find needing JSON is critical, let me know because I can likely make it work if needed. |
Just thought about it: Would you mind adding a reserved key name in "additional_fields" which value would be used as a name for the pickle file? {"pickle_label": "water_dimer_5.5"} This is a nice niche feature that helps visual inspection? |
Personally, I'm a bit hesitant to add in "hidden" features where we overload a given method with multiple purposes (as you likely have observed from my comments about disliking the |
Summary of Changes
This has two purposes:
Make sure the results are written safely somewhere while the file database problem is not sorted
For calculators where nothing is written to the disk (MACE, EMT, ...) this results file gives sense to the currently empty directories
What do you think?
EDIT: Probably not going to be as easy as I was expecting
Checklist
main
.black
,isort
, andruff
as described in the style guide.Notes