
Dataset storage needs improvement #80

Closed
ardunn opened this issue Oct 1, 2018 · 8 comments


ardunn commented Oct 1, 2018

Encoding and decoding structures and compositions to/from CSV causes problems with oxidation states and is also slow. Is anyone open to converting the matbench dataset format to JSON or pickle for less hassle when loading?


utf commented Oct 1, 2018

I'd recommend using JSON over pickle here: even though JSON will be slower, other people might not be able to unpickle the file if they are on a different architecture, e.g. an Intel vs. ARM processor.

If the data load is really really slow, perhaps we could look into using HDF5?
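The portability point can be illustrated with stdlib-only code: JSON is plain text that any language or platform can parse, while pickle is a Python-specific binary format. The record below is a toy row, not real matbench data.

```python
import json
import pickle

record = {"formula": "Fe2O3", "band_gap": 2.0}  # toy row, not real matbench data

# JSON: human-readable text, parseable from any language or architecture
json_bytes = json.dumps(record).encode("utf-8")

# pickle: Python-only binary; readable only where a compatible Python
# unpickler (and the original classes, for custom objects) are available
pkl_bytes = pickle.dumps(record)
```

Both round-trip the same dict, but only the JSON bytes are usable outside Python.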


ardunn commented Oct 1, 2018

@utf have you ever run into "OverflowError: Maximum recursion level reached" when using pandas.DataFrame.to_json? I get that error when converting any DataFrame containing pymatgen structures to JSON.


utf commented Oct 1, 2018

Yep. I have a solution for it in the store_dataframe_as_json function in matminer: https://github.com/hackingmaterials/matminer/blob/538940afd4816e37333ae07811157328d79074a0/matminer/utils/io.py#L39

Might be easier to import those methods?

But essentially, you convert the DataFrame to a dict and serialize it as JSON using the Monty encoder.
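The pattern utf describes can be sketched with the stdlib alone. Real MSONable objects (like pymatgen's `Structure`) expose an `as_dict()` method, and `monty.json.MontyEncoder` dispatches to it; the `FakeStructure` class and `MontyLikeEncoder` below are illustrative stand-ins, not the actual pymatgen/monty implementations.

```python
import json

class FakeStructure:
    """Stand-in for a pymatgen Structure; real MSONable objects
    implement as_dict()/from_dict() in the same spirit."""
    def __init__(self, formula):
        self.formula = formula

    def as_dict(self):
        return {"@class": "FakeStructure", "formula": self.formula}

class MontyLikeEncoder(json.JSONEncoder):
    """Mimics the idea behind monty.json.MontyEncoder: serialize any
    object that provides an as_dict() method."""
    def default(self, obj):
        if hasattr(obj, "as_dict"):
            return obj.as_dict()
        return super().default(obj)

# Convert the table to plain records first (df.to_dict(orient="records")
# in the pandas case), then serialize with the custom encoder -- this
# sidesteps DataFrame.to_json recursing into objects it cannot handle.
rows = [{"structure": FakeStructure("Fe2O3"), "gap": 2.0}]
text = json.dumps(rows, cls=MontyLikeEncoder)
```

Decoding is the mirror image: load the JSON and rebuild objects from their dicts (monty's `MontyDecoder` does this via the `@class`/`@module` keys).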


ardunn commented Oct 1, 2018

OK, cool. Does anyone have objections to eventually converting all the data over to JSON? @albalu @Qi-max

Also, I think we can eventually have all the data loaded seaborn-style, which is an open issue on matminer right now.

Doppe1g4nger (Contributor) commented

Something I'd like to do soon is get the dataset handling transferred over to the matminer style. Once the seaborn-style handling code is implemented in matminer, it should be as simple as importing the loader code and defining a dictionary of file metadata.

If the datasets are going to stay in the release package, it would also be nice to eventually get whatever our final format is stored on figshare so they don't all take up so much disk space, assuming they will be used for examples and are not part of the core package.
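The "dictionary of file metadata" idea can be sketched as follows. The dataset name, URL, fields, and cache directory here are all hypothetical placeholders, not matminer's actual schema:

```python
import os

# Hypothetical registry: one entry per dataset, pointing at remote
# storage (e.g. figshare) instead of files bundled with the package.
DATASET_METADATA = {
    "elastic_tensor": {
        "url": "https://example.org/elastic_tensor.json",  # placeholder URL
        "file_type": "json",
        "description": "Elastic constants dataset (illustrative entry)",
    },
}

def dataset_path(name, cache_dir="~/.matbench"):
    """Return the local cache path a seaborn-style loader would
    download the named dataset to on first use."""
    meta = DATASET_METADATA[name]
    fname = "{}.{}".format(name, meta["file_type"])
    return os.path.join(os.path.expanduser(cache_dir), fname)
```

A real loader would check this path, download from `url` if the file is missing, then deserialize according to `file_type`; the registry keeps the package itself small.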


ardunn commented Oct 1, 2018

@Doppe1g4nger that seems like the best course of action. In fact, it would be good to have all these datasets shared between matminer and matbench if possible.

@ardunn ardunn changed the title Datasets stored as csv are problematic Dataset storage needs improvement Oct 2, 2018

ardunn commented Oct 19, 2018

@Doppe1g4nger and @ADA110 feel free to close when we get done migrating this data over


ardunn commented Nov 15, 2018

Closed, thanks to @Doppe1g4nger and @ADA110. Good work, guys!

@ardunn ardunn closed this as completed Nov 15, 2018