
"Store in the DB what you want to query for" #3714

Open
ltalirz opened this issue Jan 15, 2020 · 1 comment

Comments

@ltalirz
Member

ltalirz commented Jan 15, 2020

We currently advise plugin developers to store in the DB what they want to query for, and the rest in the file repository. I think this is sound advice.
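As a concrete illustration of the principle, attributes in the DB can be filtered on directly with the QueryBuilder, while anything in the file repository cannot. A minimal sketch (`exit_status` is used purely as an example of a queryable attribute):

```python
# Minimal sketch of why queryable data belongs in the DB: attributes can be
# filtered on inside the database, while repository files would have to be
# loaded and parsed node by node. `exit_status` is just an example attribute.
from aiida.orm import QueryBuilder, CalcJobNode

qb = QueryBuilder()
qb.append(
    CalcJobNode,
    filters={'attributes.exit_status': 0},  # filter evaluated inside Postgres
    project=['uuid'],
)
print(qb.count(), 'successfully finished calculation jobs')
```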

While analyzing the size of a production DB, I noticed some places in aiida-core where we may be violating this principle.

A) As it turns out, the largest rows in the db_dbnode table come from process nodes whose attributes field is around 2 kB in size.
The reason is that we store the raw squeue output in the last_jobinfo->raw_data field. Is this something we want to query for? Shouldn't it rather be parsed, with the raw output going to a log file?
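For reference, the row-size analysis can be done directly in Postgres; a hedged sketch (assumptions: a standard AiiDA Postgres backend with a jsonb `attributes` column on db_dbnode, psycopg2 available, and placeholder connection details):

```python
# Hedged sketch: list the nodes with the largest stored attributes.
# pg_column_size() reports the on-disk size of a column value.
import psycopg2

conn = psycopg2.connect(dbname='aiidadb', user='aiida')  # placeholders
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT id, node_type, pg_column_size(attributes) AS attr_bytes
        FROM db_dbnode
        ORDER BY attr_bytes DESC NULLS LAST
        LIMIT 10;
        """
    )
    for node_id, node_type, attr_bytes in cur.fetchall():
        print(node_id, node_type, attr_bytes)
```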

B) The largest rows in the db_dblog table are also several kB in size. The log messages can contain a potentially very long Python traceback. Do we want to query for that?
I also noticed that we seem to store this potentially long message twice: once in the top-level message column and once in the message field of the metadata jsonb column.
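A hedged sketch of how the duplication could be checked, assuming db_dblog has the top-level `message` column and `metadata` jsonb column described above (connection details again placeholders):

```python
# Hedged sketch: count log rows whose top-level message is repeated verbatim
# under metadata->'message'.
import psycopg2

conn = psycopg2.connect(dbname='aiidadb', user='aiida')  # placeholders
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT count(*)
        FROM db_dblog
        WHERE message IS NOT NULL
          AND message = metadata->>'message';
        """
    )
    print(cur.fetchone()[0], 'log rows store the same message twice')
```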

The reason I'm asking is that large screening studies (say, 1M materials) involve at least ~10M nodes.
For example, we already designed the CifData class so that the atoms do not need to be stored in the DB. But if AiiDA then stores 10 kB of data per process node in the DB, 1M processes alone imply a database of >= 10 GB, and those savings become irrelevant.

Mentioning @sphuber and @giovannipizzi for comment

@sphuber
Contributor

sphuber commented Jan 17, 2020

Good point. Note that the large process nodes here are really just the CalcJobNodes. I think there is certainly a case to be made for moving the raw last job info to the repository; maybe we can come up with a set of the properties most likely to be queried and keep those in the attributes.

However, until we fix the repository and make it scale, we are most likely just going to shift the problem. For my big databases it is not really the database that is the problem but the repository, which is exploding the file system and is impossible to back up.

The same goes for the exceptions. If we are duplicating information, that should definitely be fixed, and there too we should see whether we are not better off moving things to the repository.

I think many of these things are perfect candidates for discussion during the CINECA hackathon. I will add it to the tentative agenda.
