General problem: Featurization takes too long! #179

Closed
2 tasks done
ardunn opened this issue Feb 4, 2019 · 10 comments

Comments

@ardunn
Contributor

ardunn commented Feb 4, 2019

Featurizing all of MP takes at least one day. This is way, way too long.

Edit:
The easiest way to fix this is by:

  • Converting all featurizers over to inplace=False
  • Adding the ability to cache DataFrames (dfs)
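
The caching idea in the second bullet could look something like the sketch below. It is only an illustration: `cached_call` and `slow_featurize` are hypothetical names, not part of any library, and a plain dict stands in for the featurized DataFrame (any picklable object, including a pandas DataFrame, should work the same way).

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cached_call(func, key, cache_dir):
    """Return func(), storing the result on disk under `key`.

    Hypothetical helper: `func` stands in for an expensive featurization
    call that returns a picklable object (e.g. a DataFrame).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the key so arbitrary strings map to a safe file name.
    name = hashlib.sha256(key.encode("utf-8")).hexdigest() + ".pkl"
    path = cache_dir / name
    if path.exists():
        # Cache hit: skip the expensive call entirely.
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = func()
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []


def slow_featurize():
    """Placeholder for a slow featurization step (assumption)."""
    calls.append(1)
    return {"feature_a": [1.0, 2.0]}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_call(slow_featurize, "elastic_tensor/best", tmp)
    second = cached_call(slow_featurize, "elastic_tensor/best", tmp)
```

The second call reads the pickle instead of recomputing, so `slow_featurize` runs only once. The cache key would need to encode everything that affects the result (dataset, preset, featurizer versions), otherwise stale features could be returned.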
@utf
Member

utf commented Feb 4, 2019

Do you know what is taking the most time?

@ardunn
Contributor Author

ardunn commented Feb 4, 2019

@utf Structure featurization.

@utf
Member

utf commented Feb 4, 2019

Is it SiteStatsFeaturizer (with CrystalNN preset) or something else?

@ardunn
Contributor Author

ardunn commented Feb 4, 2019

Featurizing 83k structures is taking 12+ hours on beefy Lawrencium Xeon compute nodes.

@ardunn
Contributor Author

ardunn commented Feb 4, 2019

It is many of them. SiteStatsFingerprint is among the worst offenders.

@ardunn
Contributor Author

ardunn commented Feb 4, 2019

Try featurizing the elastic tensor dataset with the AutoFeaturizer "best" preset and see for yourself.

@utf
Member

utf commented Feb 4, 2019

It could be that we keep recalculating the bonding, which takes a long time for big structures.

We could check by seeing if there is a big speedup using MultipleFeaturizer with the additional caching. This is not a long-term solution though, as we would then lose a lot of the fidelity of timing per featurizer.

If calculating the bonding is taking the most time, we could alter the featurizers to also accept a BondedStructure object, in which case the bonding would not be recalculated. We would then just need a StructureToBondedStructure conversion featurizer.
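
A toy sketch of that proposal, to make the data flow concrete. Everything here is hypothetical: `BondedStructure`, `structure_to_bonded_structure`, `compute_bonds` and the two featurizer functions are stand-ins invented for illustration, and the "bonding" rule is deliberately trivial. The point is only that the expensive step runs once per structure and every downstream featurizer reuses its result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BondedStructure:
    """Hypothetical container: a structure plus its precomputed bonds."""
    structure: tuple  # stand-in for a real structure object
    bonds: tuple      # pairs of bonded site indices


bonding_calls = []


def compute_bonds(structure):
    """Placeholder for the expensive bonding analysis (assumption)."""
    bonding_calls.append(structure)
    # Toy rule: bond each site to the next one in the list.
    return tuple((i, i + 1) for i in range(len(structure) - 1))


def structure_to_bonded_structure(structure):
    """Conversion step: the only place where bonding is computed."""
    return BondedStructure(structure=structure, bonds=compute_bonds(structure))


def bond_count(bonded):
    """Example featurizer that reads the precomputed bonds."""
    return len(bonded.bonds)


def mean_coordination(bonded):
    """Example featurizer: each bond contributes to two sites."""
    return 2 * len(bonded.bonds) / len(bonded.structure)


structure = ("Si", "O", "O")
bonded = structure_to_bonded_structure(structure)
features = [f(bonded) for f in (bond_count, mean_coordination)]
```

Because each featurizer is still called separately on the `BondedStructure`, per-featurizer timing stays measurable, which is the advantage over caching inside a combined featurizer.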

@ardunn
Contributor Author

ardunn commented Feb 5, 2019

So that is one problem. Another problem is that even a single SiteStatsFingerprint is too slow on many large structures. For example, running it on the larger structures in MP, I get times of ~1 s/sample (on an LR4 node on Lawrencium). As far as I can tell, it is running in fully parallel mode (n_jobs = cpu_count() = 24)...
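
For a rough sense of scale, the numbers quoted in this thread can be plugged into a back-of-the-envelope estimate. The sketch below is an assumption-laden upper bound: it treats ~1 s/sample as the effective throughput for all 83k structures, which only holds if every structure were as slow as the large ones. The helper names are made up for illustration.

```python
import time


def seconds_per_sample(func, samples):
    """Measure the mean wall-clock time of func over the given samples."""
    start = time.perf_counter()
    for sample in samples:
        func(sample)
    return (time.perf_counter() - start) / len(samples)


def estimated_hours(n_samples, secs_per_sample):
    """Total featurization time in hours at a fixed per-sample cost."""
    return n_samples * secs_per_sample / 3600.0


# Figures from this thread: 83k structures at ~1 s/sample.
hours = estimated_hours(83_000, 1.0)
print(f"{hours:.1f} h")  # prints "23.1 h"

# Timing a trivial stand-in function on a few samples.
measured = seconds_per_sample(lambda s: len(s), ["a", "bb", "ccc"])
```

Running `seconds_per_sample` on a small subset with one featurizer at a time would be one way to see which featurizer dominates before committing to a full run.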

ardunn added the priority label Feb 6, 2019
@ardunn
Contributor Author

ardunn commented Feb 20, 2019

After looking into this, I'm not sure caching actually improves the performance (or maybe I am just using it wrong?)

@ardunn
Contributor Author

ardunn commented Feb 21, 2019

done via 7f4f99e

ardunn closed this as completed Feb 21, 2019