Dataset storage needs improvement #80
Comments
I'd recommend using JSON over pickle here. Even though JSON will be slower, other people might not be able to unpickle the file if they are on a different architecture, e.g. Intel vs. ARM processors. If data loading turns out to be really slow, perhaps we could look into using HDF5?
@utf have you ever run into "OverflowError: Maximum recursion level reached" when using pandas.DataFrame.to_json? I get that error when converting any DataFrame containing pymatgen structures to JSON.
Yep. I have a solution for it in the Might be easier to import those methods? But essentially, you convert to a dict and serialize as JSON using the Monty encoder.
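The approach described above (convert to a dict, then serialize with a custom JSON encoder) can be sketched with the standard library alone. In monty itself the corresponding classes are MontyEncoder and MontyDecoder from monty.json, passed to json.dumps and json.loads via cls=. The sketch below does not use them, so that it runs without extra dependencies. The Site class, AsDictEncoder, and decode_hook are hypothetical stand-ins for illustration, not pymatgen or monty code.

```python
import json


class Site:
    """Hypothetical stand-in for a pymatgen object; not a real pymatgen class."""

    def __init__(self, species, oxidation_state, coords):
        self.species = species
        self.oxidation_state = oxidation_state
        self.coords = list(coords)

    def as_dict(self):
        # Tag the payload so the decoder knows which class to rebuild.
        return {
            "@class": "Site",
            "species": self.species,
            "oxidation_state": self.oxidation_state,
            "coords": self.coords,
        }

    @classmethod
    def from_dict(cls, d):
        return cls(d["species"], d["oxidation_state"], d["coords"])


class AsDictEncoder(json.JSONEncoder):
    """Fall back to as_dict() for objects json cannot serialize natively."""

    def default(self, o):
        if hasattr(o, "as_dict"):
            return o.as_dict()
        return super().default(o)


def decode_hook(d):
    # Rebuild tagged objects; leave ordinary dicts untouched.
    if d.get("@class") == "Site":
        return Site.from_dict(d)
    return d


# A list of row dicts, the shape DataFrame.to_dict(orient="records") returns.
records = [
    {"formula": "NaCl", "site": Site("Na", 1, (0.0, 0.0, 0.0))},
    {"formula": "NaCl", "site": Site("Cl", -1, (0.5, 0.5, 0.5))},
]

text = json.dumps(records, cls=AsDictEncoder)
restored = json.loads(text, object_hook=decode_hook)
```

Because the objects are reduced to plain dicts before the JSON writer sees them, the writer never has to walk arbitrary object attributes, and the oxidation state survives the round trip as an integer.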
Something I'd like to do soon is get the dataset handling transferred over to the matminer style. Once the seaborn-style handling code is implemented in matminer, it should be as simple as importing the loader code and defining a dictionary of file metadata. If the datasets are going to stay in the release package, it would also be nice to eventually store them, in whatever our final format is, on figshare so they don't take up so much disk space. That is assuming they will be used for examples and not as part of the core package.
@Doppe1g4nger that seems like the best course of action. In fact, it would be good to have all these datasets shared between matminer and matbench if possible.
@Doppe1g4nger and @ADA110 feel free to close this once we finish migrating the data over.
Closed thanks to @Doppe1g4nger and @ADA110. Good work, guys.
Encoding and decoding structures and compositions to/from CSV causes problems with oxidation states and is also slow. Is anyone open to converting the matbench dataset format to JSON or pickle for less hassle when loading?
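As a generic illustration of why CSV is awkward for nested data (not necessarily the exact failure seen in matbench), the sketch below round-trips a hypothetical record through CSV and through JSON using only the standard library. The record contents and field names are assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical record: a composition with per-species oxidation states.
record = {"formula": "Fe2O3", "oxidation_states": {"Fe": 3, "O": -2}}

# CSV round trip: the nested dict is written as its string form.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buf.seek(0)
from_csv = next(csv.DictReader(buf))

# JSON round trip: nesting and integer types are preserved.
from_json = json.loads(json.dumps(record))

print(type(from_csv["oxidation_states"]).__name__)   # str
print(type(from_json["oxidation_states"]).__name__)  # dict
```

After the CSV round trip the oxidation states are a plain string that needs custom parsing to recover, whereas the JSON round trip returns the original structure directly.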