
"Store in the DB what you want to query for" #3714

Open
ltalirz opened this issue Jan 15, 2020 · 1 comment

Comments

@ltalirz
Member

ltalirz commented Jan 15, 2020

We currently advise plugin developers to store in the DB what they want to query for, and the rest in the file repository. I think this is sound advice.
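As a concrete illustration of the principle, attributes in the DB can be filtered on directly with the QueryBuilder, while anything in the file repository cannot. A minimal sketch (`exit_status` is used purely as an example of a queryable attribute):

```python
# Minimal sketch of why queryable data belongs in the DB: attributes can be
# filtered on inside the database, while repository files would have to be
# loaded and parsed node by node. `exit_status` is just an example attribute.
from aiida.orm import QueryBuilder, CalcJobNode

qb = QueryBuilder()
qb.append(
    CalcJobNode,
    filters={'attributes.exit_status': 0},  # filter evaluated inside Postgres
    project=['uuid'],
)
print(qb.count(), 'successfully finished calculation jobs')
```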

While analyzing the size of a production DB, I noticed some places in aiida-core where we may be violating this principle.

A) As it turns out, the largest rows in the db_dbnode table come from process nodes whose attributes field is around 2 kB in size.
The reason is that we store the raw squeue output in the last_jobinfo->raw_data field. Is this something we want to query for? Shouldn't it rather be parsed, with the raw output going to a log file?
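For reference, the row-size analysis can be done directly in Postgres; a hedged sketch (assumptions: a standard AiiDA Postgres backend with a jsonb `attributes` column on db_dbnode, psycopg2 available, and placeholder connection details):

```python
# Hedged sketch: list the nodes with the largest stored attributes.
# pg_column_size() reports the on-disk size of a column value.
import psycopg2

conn = psycopg2.connect(dbname='aiidadb', user='aiida')  # placeholders
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT id, node_type, pg_column_size(attributes) AS attr_bytes
        FROM db_dbnode
        ORDER BY attr_bytes DESC NULLS LAST
        LIMIT 10;
        """
    )
    for node_id, node_type, attr_bytes in cur.fetchall():
        print(node_id, node_type, attr_bytes)
```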

B) The largest rows in the db_dblog table are also several kB in size. The log messages can contain a potentially very long Python traceback. Do we want to query for that?
I also noticed that we seem to store this potentially long message twice: once in the top-level message column and once in the message field of the metadata jsonb column.
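A hedged sketch of how the duplication could be checked, assuming db_dblog has the top-level `message` column and `metadata` jsonb column described above (connection details again placeholders):

```python
# Hedged sketch: count log rows whose top-level message is repeated verbatim
# under metadata->'message'.
import psycopg2

conn = psycopg2.connect(dbname='aiidadb', user='aiida')  # placeholders
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT count(*)
        FROM db_dblog
        WHERE message IS NOT NULL
          AND message = metadata->>'message';
        """
    )
    print(cur.fetchone()[0], 'log rows store the same message twice')
```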

The reason I'm asking is that large screening studies (say, 1M materials) involve at least ~10M nodes.
For example, we already designed the CifData class so that the atoms do not need to be stored in the DB. But if AiiDA then stores 10 kB of data per process node in the DB, 1M processes alone imply a database of >= 10 GB, and those savings become irrelevant.

Mentioning @sphuber and @giovannipizzi for comment

@sphuber
Contributor

sphuber commented Jan 17, 2020

Good point. Note that the large process nodes here are really just the CalcJobNodes. I think there is certainly a case to be made for moving the raw last job info to the repository; maybe we can come up with a set of the properties most likely to be queried and keep those in the attributes.

However, until we fix the repository and make it scale, we are most likely just going to shift the problem. For my big databases it is not really the database that is the problem but the repository, which is exploding the file system and is impossible to back up.

The same goes for the exceptions. If we are duplicating information, that should definitely be fixed, and there too we should see whether we are not better off moving things to the repository.

I think many of these things are perfect candidates for discussion during the CINECA hackathon. I will add it to the tentative agenda.
