init: Use VACUUM for large, modified SQLite files only #2377

wenzeslaus · 2022-05-17T18:04:40Z

For SQLite database, VACUUM is used when the session is ending. There is a cost in running the task even when the database is small and no vacuuming is needed. This change adds a check for the size of the database and ignores databases under 100MB.

The message is now only verbose and printed when the task is running.

Unlike the original implementation, the vacuum task is now synchronous so that there is no running GRASS process after the session has ended.

The original implementation comes from Trac ticket 3697 (https://trac.osgeo.org/grass/ticket/3697)
and is best visible in a bulk backport commit 73736 (f790d68).

wenzeslaus · 2022-06-17T13:42:31Z

The original code runs relatively quickly, because it just starts the vacuum subprocess which then runs on its own. This dangling subprocess may cause unexpected behavior. The new code waits for the process to finish, but vacuums only for bigger database files.

Bug fix

The issues:

The vacuum subprocess in the original code prevents new process from using the mapset (the SQLite database is locked assuming that the locking is even enabled; worse things may happen if not) while the main process finished and the mapset is already unlocked.
In scenarios with shared or acquired resources (remote servers, HPC, cloud), running the vacuum subprocess after the main process already finished may cause unexpected behavior such as termination of the vacuum subprocess or violation of resource usage policy.

The fix:

The new code waits for the vacuum process to finish.

Performance

When the database is small, the code now runs much faster. When the database is not small, the code is slower than before and the time depends on the size of the database. The old code has a constant speed regardless of the size of the database.

Tiny database

code	time	note
original	0.2003 sec +- 0.67%	perf with `--post "sleep 1"` to give SQLite time
original	0.6240 sec +- 4.43%	with additional 230+ sec of busy SQLite messages
new	0.0823 sec +- 0.37%	no check, database small
new	0.5290 sec +- 1.46%	code modified to check always

Repeats: 100

SQLite database size: 25MiB (27MB)

Tested command:

grass ~/grassdata/nc_spm_08_grass7/PERMANENT/ --exec g.mapset -p

perf command:

perf stat -r 100 ...

Smallest not small database

code	time	note
original	0.21 sec +- 3.64%	perf with `--post "sleep 4"` to give SQLite time
new	4.10 sec +- 6.92%	using the default threshold of 300 MiB

Repeats: 10

SQLite database size: 378MiB (397MB)

This removes automatic vacuum of SQLite database, i.e., reverts the addition of automatic vacuum when a session ends. Doing vacuum can take long time which the current implementation hides by running the vacuum process in the background. The typical time is much longer than the 0.1 sec suggested by the sleep call. Doing vacuum only sometimes would be less predictable for the users and doing it only for large database would mean that it will take really long time (with a lot of disk or memory usage) when it actually happens suggesting that that's something user may want to trigger manually. Notably, SQLite doesn't do the vacuums automatically, although PostgreSQL has autovacuum on by default. This is an alternative to OSGeo#2377 which uses VACUUM for large, modified files only.

wenzeslaus · 2022-06-18T03:45:43Z

Or a simpler alternative: #2449 removes the automatic vacuum completely.

metzm · 2022-06-22T16:40:52Z

python/grass/script/setup.py

+    process = gs.start_command("db.execute", sql="VACUUM")
+    gs.verbose(_("Cleaning up default SQLite database..."))
+    process.wait()


Suggested change

process = gs.start_command("db.execute", sql="VACUUM")

gs.verbose(_("Cleaning up default SQLite database..."))

process.wait()

gs.verbose(_("Cleaning up default SQLite database..."))

process = gs.run_command("db.execute", sql="VACUUM")

should do the same?

Almost the same. The message job runs in parallel. Because the message is a subprocess, it takes some tiny but measurable time to execute, so I tried to shave off some time by showing it in parallel while the vacuum is running. I tried to clarify that now in the comment.

With the greater threshold for vacuum I introduced later, the practical difference is likely negligible, but doesn't do any harm. It breaks the promise of no multiple processes, but not in a data processing context, so likely permissible. Anyway, I can remove it if it seems too strange.

metzm · 2022-06-22T16:45:40Z

Looks good to me. I prefer this approach over #2449 because the initial motivation for this automatic vacuum was that users may not be aware that they can save potentially GBs of disk space with db.execute sql=VACUUM. Back then we found it to be quite convenient for users that automatic cleaning is performed at the end of a session, even if it might take some time.

metzm · 2022-06-22T16:53:12Z

lib/init/grass.py

@@ -2110,11 +2111,11 @@ def clean_temp():
    nul.close()


-def clean_all():
+def clean_all(*, start_time):


I understand you like this new syntax (*, keyword), but is it really needed here? Golden rule: Keep code sufficiently simple.

I don't think it is necessary in this particular function or PR, but it contributes to both documentation and enforcement of correct use and it seems as something which is becoming standard Python now (also judging from introduction of the / to mark positional-only).

Python is using it (at least os.path in 3.10: os.path.realpath(path, *, strict=False)) and scipy does too: scipy.stats.sampling.DiscreteAliasUrn(dist, *, domain=None, urn_factor=1, random_state=None).

As documentation, it is not needed for running the code, but it seems to be increasingly used for documenting Python.

As an enforcement of using start_time as only a keyword argument, * is the minimal code needed to specify that rule.

* says "I may add more parameters in the future and I don't know how I will order them." This is quite fitting here and many other places in the Python API.

wenzeslaus · 2022-06-22T17:57:38Z

Thanks for the explanation of the original motivation.

...we found it to be quite convenient for users that automatic cleaning is performed at the end of a session, even if it might take some time.

The issue is that it takes significant time (0.5s for 25MB, 4s for 380MB) to complete. With the original code, there was no visible slowdown because the session ended without waiting for the vacuum process, so with interactive session, one did not feel the cost of the vacuum back then.

With the PR, in extreme example, you pay 4s or more for deleting one row from the database. Disk space is now usually not that costly. For example, Git makes the assumption that you have plenty of disk space. We may work with larger data, but perhaps for us, too, time is more valuable than disk space (this would be different from raster compression where we save I/O time and disk space, right?).

...the initial motivation for this automatic vacuum was that users may not be aware that they can save potentially GBs of disk space with db.execute sql=VACUUM.

Maybe, the same can be addressed by a new module which does db.execute sql=VACUUM and/or turns on SQLite auto-vacuum mentioned by @tmszi in #2449. It somewhat promotes the issue and its solution, but it is opt-in (as with the default SQLite behavior) and it is managed separately (as with PostgreSQL).

metzm · 2022-06-25T13:01:48Z

I approve the change that running time for cleaning up the db is now apparent to the user (no longer running in the background).

It is true that disk space is not that costly, but how many users are able and allowed to change the drive in their laptop for a larger one? IMHO, disk space is still an issue, as is time, and this PR is a good compromise.

wenzeslaus · 2022-07-25T17:55:55Z

OSGeo4W jobs are failing now in general, but before OSGeo4W check passed.

wenzeslaus · 2022-07-25T18:38:47Z

I'm ready to merge this. It improves the situation significantly when you don't edit the database. (Any edit, e.g., adding a vector map with a table, triggers the vacuum when the file is larger than MEMORYMB.)

For SQLite database, VACUUM is used when the session is ending. There is a cost in running the task even when the database is small and no vacuuming is needed. This change adds a check for the size of the database and ignores databases under 100MB. The message is now only verbose and printed when the task is running. Unlike the original implementation, the vacuum task is now synchronous so that there is no running GRASS process after the session has ended. The original implementation comes from Trac ticket 3697 (https://trac.osgeo.org/grass/ticket/3697) and is best visible in a bulk backport commit 73736 (f790d68).

… a problem keeping 300MB or more in memory, there is likely enough space on the disk for many files of that size. If database cleanup is needed, user can always run VACUUM manually.

… need to vacuum when it is only read or unused.

For SQLite database, VACUUM is used when the session is ending. There is a cost in running the task even when the database is small and no vacuuming is needed. This change adds a check for the size of the database and ignores databases under MEMORYMB value or 300MB. Use MEMORYMB as info for what small file means. If user does not have a problem keeping 300MB or more in memory, there is likely enough space on the disk for many files of that size. If database cleanup is needed, user can always run VACUUM manually or configure SQLite to do automatic vacuums. Additionally, vacuum runs only if the database was modified since the start of the session because there is no need to vacuum when it is only read or unused. Any edits, including additions, still trigger the vacuum. Unlike the original implementation, the vacuum task is now synchronous so that there is no running GRASS process after the session has ended. The original code runs relatively quickly, because it just starts the vacuum subprocess which then runs on its own. However, this dangling subprocess may cause unexpected behavior (locked SQLite, process termination on shared hardware). The new code waits for the process to finish, but vacuums only for large, modified database files. The message is now only verbose and printed in parallel while the task is running. The original implementation comes from Trac ticket 3697 (https://trac.osgeo.org/grass/ticket/3697) and is best visible in a bulk backport commit f790d68 (https://trac.osgeo.org/grass/changeset/73736).

wenzeslaus added this to the 8.4.0 milestone May 17, 2022

wenzeslaus added the enhancement New feature or request label May 17, 2022

wenzeslaus mentioned this pull request Jun 18, 2022

init: Remove automatic vacuum #2449

Closed

metzm reviewed Jun 22, 2022

View reviewed changes

wenzeslaus added bug Something isn't working Python Related code is in Python database Related to database management labels Jun 24, 2022

wenzeslaus force-pushed the less-sqlite-vacuum branch from 145940f to 0ca3158 Compare July 26, 2022 18:17

wenzeslaus added 5 commits July 26, 2022 17:37

Use MEMORYMB as info for what small file means. If user does not have…

295fcde

… a problem keeping 300MB or more in memory, there is likely enough space on the disk for many files of that size. If database cleanup is needed, user can always run VACUUM manually.

Vacuum only if the db was modified since the start of the session. No…

4867f33

… need to vacuum when it is only read or unused.

Clarify running of two processes in parallel in a comment

2a6a2eb

Fix check for undefined connection reported as None

4d04c70

wenzeslaus force-pushed the less-sqlite-vacuum branch from 0ca3158 to 4d04c70 Compare July 26, 2022 21:42

wenzeslaus changed the title ~~init: Use VACUUM for large files only~~ init: Use VACUUM for large, modified SQLite files only Jul 27, 2022

wenzeslaus merged commit 8b3b000 into OSGeo:main Jul 27, 2022

wenzeslaus deleted the less-sqlite-vacuum branch July 27, 2022 13:56

wenzeslaus mentioned this pull request Mar 22, 2024

db.vacuum: Add module to manage SQL vaccum #2462

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

init: Use VACUUM for large, modified SQLite files only #2377

init: Use VACUUM for large, modified SQLite files only #2377

wenzeslaus commented May 17, 2022

wenzeslaus commented Jun 17, 2022

wenzeslaus commented Jun 18, 2022

metzm Jun 22, 2022

wenzeslaus Jul 25, 2022

metzm commented Jun 22, 2022

metzm Jun 22, 2022

wenzeslaus Jul 25, 2022

wenzeslaus commented Jun 22, 2022

metzm commented Jun 25, 2022

wenzeslaus commented Jul 25, 2022

wenzeslaus commented Jul 25, 2022

init: Use VACUUM for large, modified SQLite files only #2377

init: Use VACUUM for large, modified SQLite files only #2377

Conversation

wenzeslaus commented May 17, 2022

wenzeslaus commented Jun 17, 2022

Bug fix

Performance

Tiny database

Smallest not small database

wenzeslaus commented Jun 18, 2022

metzm Jun 22, 2022

Choose a reason for hiding this comment

wenzeslaus Jul 25, 2022

Choose a reason for hiding this comment

metzm commented Jun 22, 2022

metzm Jun 22, 2022

Choose a reason for hiding this comment

wenzeslaus Jul 25, 2022

Choose a reason for hiding this comment

wenzeslaus commented Jun 22, 2022

metzm commented Jun 25, 2022

wenzeslaus commented Jul 25, 2022

wenzeslaus commented Jul 25, 2022