Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

init: Use VACUUM for large, modified SQLite files only #2377

Merged
merged 5 commits into from
Jul 27, 2022

Conversation

wenzeslaus
Copy link
Member

For SQLite database, VACUUM is used when the session is ending. There is a cost in running the task even when the database is small and no vacuuming is needed. This change adds a check for the size of the database and ignores databases under 100MB.

The message is now only verbose and printed when the task is running.

Unlike the original implementation, the vacuum task is now synchronous so that there is no running GRASS process after the session has ended.

The original implementation comes from Trac ticket 3697 (https://trac.osgeo.org/grass/ticket/3697)
and is best visible in a bulk backport commit 73736 (f790d68).

@wenzeslaus wenzeslaus added this to the 8.4.0 milestone May 17, 2022
@wenzeslaus wenzeslaus added the enhancement New feature or request label May 17, 2022
@wenzeslaus
Copy link
Member Author

The original code runs relatively quickly, because it just starts the vacuum subprocess which then runs on its own. This dangling subprocess may cause unexpected behavior. The new code waits for the process to finish, but vacuums only for bigger database files.

Bug fix

The issues:

  1. The vacuum subprocess in the original code prevents new process from using the mapset (the SQLite database is locked assuming that the locking is even enabled; worse things may happen if not) while the main process finished and the mapset is already unlocked.
  2. In scenarios with shared or acquired resources (remote servers, HPC, cloud), running the vacuum subprocess after the main process already finished may cause unexpected behavior such as termination of the vacuum subprocess or violation of resource usage policy.

The fix:

The new code waits for the vacuum process to finish.

Performance

When the database is small, the code now runs much faster. When the database is not small, the code is slower than before and the time depends on the size of the database. The old code has a constant speed regardless of the size of the database.

Tiny database

code time note
original 0.2003 sec +- 0.67% perf with --post "sleep 1" to give SQLite time
original 0.6240 sec +- 4.43% with additional 230+ sec of busy SQLite messages
new 0.0823 sec +- 0.37% no check, database small
new 0.5290 sec +- 1.46% code modified to check always

Repeats: 100

SQLite database size: 25MiB (27MB)

Tested command:

grass ~/grassdata/nc_spm_08_grass7/PERMANENT/ --exec g.mapset -p

perf command:

perf stat -r 100 ...

Smallest not small database

code time note
original 0.21 sec +- 3.64% perf with --post "sleep 4" to give SQLite time
new 4.10 sec +- 6.92% using the default threshold of 300 MiB

Repeats: 10

SQLite database size: 378MiB (397MB)

wenzeslaus added a commit to wenzeslaus/grass that referenced this pull request Jun 18, 2022
This removes automatic vacuum of SQLite database, i.e., reverts the addition of automatic vacuum when a session ends.

Doing vacuum can take long time which the current implementation hides by running the vacuum process in the background. The typical time is much longer than the 0.1 sec suggested by the sleep call.

Doing vacuum only sometimes would be less predictable for the users and doing it only for large database would mean that it will take really long time (with a lot of disk or memory usage) when it actually happens suggesting that that's something user may want to trigger manually.

Notably, SQLite doesn't do the vacuums automatically, although PostgreSQL has autovacuum on by default.

This is an alternative to OSGeo#2377 which uses VACUUM for large, modified files only.
@wenzeslaus
Copy link
Member Author

Or a simpler alternative: #2449 removes the automatic vacuum completely.

Comment on lines +457 to +461
process = gs.start_command("db.execute", sql="VACUUM")
gs.verbose(_("Cleaning up default SQLite database..."))
process.wait()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
process = gs.start_command("db.execute", sql="VACUUM")
gs.verbose(_("Cleaning up default SQLite database..."))
process.wait()
gs.verbose(_("Cleaning up default SQLite database..."))
process = gs.run_command("db.execute", sql="VACUUM")

should do the same?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost the same. The message job runs in parallel. Because the message is a subprocess, it takes some tiny but measurable time to execute, so I tried to shave off some time by showing it in parallel while the vacuum is running. I tried to clarify that now in the comment.

With the greater threshold for vacuum I introduced later, the practical difference is likely negligible, but doesn't do any harm. It breaks the promise of no multiple processes, but not in a data processing context, so likely permissible. Anyway, I can remove it if it seems too strange.

@metzm
Copy link
Contributor

metzm commented Jun 22, 2022

Looks good to me. I prefer this approach over #2449 because the initial motivation for this automatic vacuum was that users may not be aware that they can save potentially GBs of disk space with db.execute sql=VACUUM. Back then we found it to be quite convenient for users that automatic cleaning is performed at the end of a session, even if it might take some time.

@@ -2110,11 +2111,11 @@ def clean_temp():
nul.close()


def clean_all():
def clean_all(*, start_time):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand you like this new syntax (*, keyword), but is it really needed here? Golden rule: Keep code sufficiently simple.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is necessary in this particular function or PR, but it contributes to both documentation and enforcement of correct use and it seems as something which is becoming standard Python now (also judging from introduction of the / to mark positional-only).

Python is using it (at least os.path in 3.10: os.path.realpath(path, *, strict=False)) and scipy does too: scipy.stats.sampling.DiscreteAliasUrn(dist, *, domain=None, urn_factor=1, random_state=None).

As documentation, it is not needed for running the code, but it seems to be increasingly used for documenting Python.

As an enforcement of using start_time as only a keyword argument, * is the minimal code needed to specify that rule.

* says "I may add more parameters in the future and I don't know how I will order them." This is quite fitting here and many other places in the Python API.

@wenzeslaus
Copy link
Member Author

Thanks for the explanation of the original motivation.

...we found it to be quite convenient for users that automatic cleaning is performed at the end of a session, even if it might take some time.

The issue is that it takes significant time (0.5s for 25MB, 4s for 380MB) to complete. With the original code, there was no visible slowdown because the session ended without waiting for the vacuum process, so with interactive session, one did not feel the cost of the vacuum back then.

With the PR, in extreme example, you pay 4s or more for deleting one row from the database. Disk space is now usually not that costly. For example, Git makes the assumption that you have plenty of disk space. We may work with larger data, but perhaps for us, too, time is more valuable than disk space (this would be different from raster compression where we save I/O time and disk space, right?).

...the initial motivation for this automatic vacuum was that users may not be aware that they can save potentially GBs of disk space with db.execute sql=VACUUM.

Maybe, the same can be addressed by a new module which does db.execute sql=VACUUM and/or turns on SQLite auto-vacuum mentioned by @tmszi in #2449. It somewhat promotes the issue and its solution, but it is opt-in (as with the default SQLite behavior) and it is managed separately (as with PostgreSQL).

@wenzeslaus wenzeslaus added bug Something isn't working Python Related code is in Python database Related to database management labels Jun 24, 2022
@metzm
Copy link
Contributor

metzm commented Jun 25, 2022

I approve the change that running time for cleaning up the db is now apparent to the user (no longer running in the background).

It is true that disk space is not that costly, but how many users are able and allowed to change the drive in their laptop for a larger one? IMHO, disk space is still an issue, as is time, and this PR is a good compromise.

@wenzeslaus
Copy link
Member Author

OSGeo4W jobs are failing now in general, but before OSGeo4W check passed.

@wenzeslaus
Copy link
Member Author

I'm ready to merge this. It improves the situation significantly when you don't edit the database. (Any edit, e.g., adding a vector map with a table, triggers the vacuum when the file is larger than MEMORYMB.)

@wenzeslaus wenzeslaus force-pushed the less-sqlite-vacuum branch from 145940f to 0ca3158 Compare July 26, 2022 18:17
For SQLite database, VACUUM is used when the session is ending. There is a cost in running the task even when the database is small and no vacuuming is needed. This change adds a check for the size of the database and ignores databases under 100MB.

The message is now only verbose and printed when the task is running.

Unlike the original implementation, the vacuum task is now synchronous so that there is no running GRASS process after the session has ended.

The original implementation comes from Trac ticket 3697 (https://trac.osgeo.org/grass/ticket/3697)
and is best visible in a bulk backport commit 73736 (f790d68).
… a problem keeping 300MB or more in memory, there is likely enough space on the disk for many files of that size. If database cleanup is needed, user can always run VACUUM manually.
… need to vacuum when it is only read or unused.
@wenzeslaus wenzeslaus force-pushed the less-sqlite-vacuum branch from 0ca3158 to 4d04c70 Compare July 26, 2022 21:42
@wenzeslaus wenzeslaus changed the title init: Use VACUUM for large files only init: Use VACUUM for large, modified SQLite files only Jul 27, 2022
@wenzeslaus wenzeslaus merged commit 8b3b000 into OSGeo:main Jul 27, 2022
@wenzeslaus wenzeslaus deleted the less-sqlite-vacuum branch July 27, 2022 13:56
ninsbl pushed a commit to ninsbl/grass that referenced this pull request Oct 26, 2022
For SQLite database, VACUUM is used when the session is ending. There is a cost in running the task even when the database is small and no vacuuming is needed. This change adds a check for the size of the database and ignores databases under MEMORYMB value or 300MB.

Use MEMORYMB as info for what small file means. If user does not have a problem keeping 300MB or more in memory, there is likely enough space on the disk for many files of that size. If database cleanup is needed, user can always run VACUUM manually or configure SQLite to do automatic vacuums.

Additionally, vacuum runs only if the database was modified since the start of the session because there is no need to vacuum when it is only read or unused. Any edits, including additions, still trigger the vacuum.

Unlike the original implementation, the vacuum task is now synchronous so that there is no running GRASS process after the session has ended. The original code runs relatively quickly, because it just starts the vacuum subprocess which then runs on its own. However, this dangling subprocess may cause unexpected behavior (locked SQLite, process termination on shared hardware). The new code waits for the process to finish, but vacuums only for large, modified database files.

The message is now only verbose and printed in parallel while the task is running.

The original implementation comes from Trac ticket 3697 (https://trac.osgeo.org/grass/ticket/3697) and is best visible in a bulk backport commit f790d68 (https://trac.osgeo.org/grass/changeset/73736).
ninsbl pushed a commit to ninsbl/grass that referenced this pull request Feb 17, 2023
For SQLite database, VACUUM is used when the session is ending. There is a cost in running the task even when the database is small and no vacuuming is needed. This change adds a check for the size of the database and ignores databases under MEMORYMB value or 300MB.

Use MEMORYMB as info for what small file means. If user does not have a problem keeping 300MB or more in memory, there is likely enough space on the disk for many files of that size. If database cleanup is needed, user can always run VACUUM manually or configure SQLite to do automatic vacuums.

Additionally, vacuum runs only if the database was modified since the start of the session because there is no need to vacuum when it is only read or unused. Any edits, including additions, still trigger the vacuum.

Unlike the original implementation, the vacuum task is now synchronous so that there is no running GRASS process after the session has ended. The original code runs relatively quickly, because it just starts the vacuum subprocess which then runs on its own. However, this dangling subprocess may cause unexpected behavior (locked SQLite, process termination on shared hardware). The new code waits for the process to finish, but vacuums only for large, modified database files.

The message is now only verbose and printed in parallel while the task is running.

The original implementation comes from Trac ticket 3697 (https://trac.osgeo.org/grass/ticket/3697) and is best visible in a bulk backport commit f790d68 (https://trac.osgeo.org/grass/changeset/73736).
neteler pushed a commit to nilason/grass that referenced this pull request Nov 7, 2023
For SQLite database, VACUUM is used when the session is ending. There is a cost in running the task even when the database is small and no vacuuming is needed. This change adds a check for the size of the database and ignores databases under MEMORYMB value or 300MB.

Use MEMORYMB as info for what small file means. If user does not have a problem keeping 300MB or more in memory, there is likely enough space on the disk for many files of that size. If database cleanup is needed, user can always run VACUUM manually or configure SQLite to do automatic vacuums.

Additionally, vacuum runs only if the database was modified since the start of the session because there is no need to vacuum when it is only read or unused. Any edits, including additions, still trigger the vacuum.

Unlike the original implementation, the vacuum task is now synchronous so that there is no running GRASS process after the session has ended. The original code runs relatively quickly, because it just starts the vacuum subprocess which then runs on its own. However, this dangling subprocess may cause unexpected behavior (locked SQLite, process termination on shared hardware). The new code waits for the process to finish, but vacuums only for large, modified database files.

The message is now only verbose and printed in parallel while the task is running.

The original implementation comes from Trac ticket 3697 (https://trac.osgeo.org/grass/ticket/3697) and is best visible in a bulk backport commit f790d68 (https://trac.osgeo.org/grass/changeset/73736).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working database Related to database management enhancement New feature or request Python Related code is in Python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants