-
-
Notifications
You must be signed in to change notification settings - Fork 327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
init: Use VACUUM for large, modified SQLite files only #2377
Conversation
The original code runs relatively quickly, because it just starts the vacuum subprocess which then runs on its own. This dangling subprocess may cause unexpected behavior. The new code waits for the process to finish, but vacuums only for bigger database files. Bug fixThe issues:
The fix: The new code waits for the vacuum process to finish. PerformanceWhen the database is small, the code now runs much faster. When the database is not small, the code is slower than before and the time depends on the size of the database. The old code has a constant speed regardless of the size of the database. Tiny database
Repeats: 100 SQLite database size: 25MiB (27MB) Tested command: grass ~/grassdata/nc_spm_08_grass7/PERMANENT/ --exec g.mapset -p perf command: perf stat -r 100 ... Smallest not small database
Repeats: 10 SQLite database size: 378MiB (397MB) |
This removes automatic vacuum of SQLite database, i.e., reverts the addition of automatic vacuum when a session ends. Doing vacuum can take long time which the current implementation hides by running the vacuum process in the background. The typical time is much longer than the 0.1 sec suggested by the sleep call. Doing vacuum only sometimes would be less predictable for the users and doing it only for large database would mean that it will take really long time (with a lot of disk or memory usage) when it actually happens suggesting that that's something user may want to trigger manually. Notably, SQLite doesn't do the vacuums automatically, although PostgreSQL has autovacuum on by default. This is an alternative to OSGeo#2377 which uses VACUUM for large, modified files only.
Or a simpler alternative: #2449 removes the automatic vacuum completely. |
process = gs.start_command("db.execute", sql="VACUUM") | ||
gs.verbose(_("Cleaning up default SQLite database...")) | ||
process.wait() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
process = gs.start_command("db.execute", sql="VACUUM") | |
gs.verbose(_("Cleaning up default SQLite database...")) | |
process.wait() | |
gs.verbose(_("Cleaning up default SQLite database...")) | |
process = gs.run_command("db.execute", sql="VACUUM") |
should do the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost the same. The message job runs in parallel. Because the message is a subprocess, it takes some tiny but measurable time to execute, so I tried to shave off some time by showing it in parallel while the vacuum is running. I tried to clarify that now in the comment.
With the greater threshold for vacuum I introduced later, the practical difference is likely negligible, but doesn't do any harm. It breaks the promise of no multiple processes, but not in a data processing context, so likely permissible. Anyway, I can remove it if it seems too strange.
Looks good to me. I prefer this approach over #2449 because the initial motivation for this automatic vacuum was that users may not be aware that they can save potentially GBs of disk space with |
@@ -2110,11 +2111,11 @@ def clean_temp(): | |||
nul.close() | |||
|
|||
|
|||
def clean_all(): | |||
def clean_all(*, start_time): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand you like this new syntax (*, keyword)
, but is it really needed here? Golden rule: Keep code sufficiently simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is necessary in this particular function or PR, but it contributes to both documentation and enforcement of correct use and it seems as something which is becoming standard Python now (also judging from introduction of the /
to mark positional-only).
Python is using it (at least os.path in 3.10: os.path.realpath(path, *, strict=False)
) and scipy does too: scipy.stats.sampling.DiscreteAliasUrn(dist, *, domain=None, urn_factor=1, random_state=None)
.
As documentation, it is not needed for running the code, but it seems to be increasingly used for documenting Python.
As an enforcement of using start_time
as only a keyword argument, *
is the minimal code needed to specify that rule.
*
says "I may add more parameters in the future and I don't know how I will order them." This is quite fitting here and many other places in the Python API.
Thanks for the explanation of the original motivation.
The issue is that it takes significant time (0.5s for 25MB, 4s for 380MB) to complete. With the original code, there was no visible slowdown because the session ended without waiting for the vacuum process, so with interactive session, one did not feel the cost of the vacuum back then. With the PR, in extreme example, you pay 4s or more for deleting one row from the database. Disk space is now usually not that costly. For example, Git makes the assumption that you have plenty of disk space. We may work with larger data, but perhaps for us, too, time is more valuable than disk space (this would be different from raster compression where we save I/O time and disk space, right?).
Maybe, the same can be addressed by a new module which does |
I approve the change that running time for cleaning up the db is now apparent to the user (no longer running in the background). It is true that disk space is not that costly, but how many users are able and allowed to change the drive in their laptop for a larger one? IMHO, disk space is still an issue, as is time, and this PR is a good compromise. |
OSGeo4W jobs are failing now in general, but before OSGeo4W check passed. |
I'm ready to merge this. It improves the situation significantly when you don't edit the database. (Any edit, e.g., adding a vector map with a table, triggers the vacuum when the file is larger than MEMORYMB.) |
145940f
to
0ca3158
Compare
For SQLite database, VACUUM is used when the session is ending. There is a cost in running the task even when the database is small and no vacuuming is needed. This change adds a check for the size of the database and ignores databases under 100MB. The message is now only verbose and printed when the task is running. Unlike the original implementation, the vacuum task is now synchronous so that there is no running GRASS process after the session has ended. The original implementation comes from Trac ticket 3697 (https://trac.osgeo.org/grass/ticket/3697) and is best visible in a bulk backport commit 73736 (f790d68).
… a problem keeping 300MB or more in memory, there is likely enough space on the disk for many files of that size. If database cleanup is needed, user can always run VACUUM manually.
… need to vacuum when it is only read or unused.
0ca3158
to
4d04c70
Compare
For SQLite database, VACUUM is used when the session is ending. There is a cost in running the task even when the database is small and no vacuuming is needed. This change adds a check for the size of the database and ignores databases under MEMORYMB value or 300MB. Use MEMORYMB as info for what small file means. If user does not have a problem keeping 300MB or more in memory, there is likely enough space on the disk for many files of that size. If database cleanup is needed, user can always run VACUUM manually or configure SQLite to do automatic vacuums. Additionally, vacuum runs only if the database was modified since the start of the session because there is no need to vacuum when it is only read or unused. Any edits, including additions, still trigger the vacuum. Unlike the original implementation, the vacuum task is now synchronous so that there is no running GRASS process after the session has ended. The original code runs relatively quickly, because it just starts the vacuum subprocess which then runs on its own. However, this dangling subprocess may cause unexpected behavior (locked SQLite, process termination on shared hardware). The new code waits for the process to finish, but vacuums only for large, modified database files. The message is now only verbose and printed in parallel while the task is running. The original implementation comes from Trac ticket 3697 (https://trac.osgeo.org/grass/ticket/3697) and is best visible in a bulk backport commit f790d68 (https://trac.osgeo.org/grass/changeset/73736).
For SQLite database, VACUUM is used when the session is ending. There is a cost in running the task even when the database is small and no vacuuming is needed. This change adds a check for the size of the database and ignores databases under MEMORYMB value or 300MB. Use MEMORYMB as info for what small file means. If user does not have a problem keeping 300MB or more in memory, there is likely enough space on the disk for many files of that size. If database cleanup is needed, user can always run VACUUM manually or configure SQLite to do automatic vacuums. Additionally, vacuum runs only if the database was modified since the start of the session because there is no need to vacuum when it is only read or unused. Any edits, including additions, still trigger the vacuum. Unlike the original implementation, the vacuum task is now synchronous so that there is no running GRASS process after the session has ended. The original code runs relatively quickly, because it just starts the vacuum subprocess which then runs on its own. However, this dangling subprocess may cause unexpected behavior (locked SQLite, process termination on shared hardware). The new code waits for the process to finish, but vacuums only for large, modified database files. The message is now only verbose and printed in parallel while the task is running. The original implementation comes from Trac ticket 3697 (https://trac.osgeo.org/grass/ticket/3697) and is best visible in a bulk backport commit f790d68 (https://trac.osgeo.org/grass/changeset/73736).
For SQLite database, VACUUM is used when the session is ending. There is a cost in running the task even when the database is small and no vacuuming is needed. This change adds a check for the size of the database and ignores databases under MEMORYMB value or 300MB. Use MEMORYMB as info for what small file means. If user does not have a problem keeping 300MB or more in memory, there is likely enough space on the disk for many files of that size. If database cleanup is needed, user can always run VACUUM manually or configure SQLite to do automatic vacuums. Additionally, vacuum runs only if the database was modified since the start of the session because there is no need to vacuum when it is only read or unused. Any edits, including additions, still trigger the vacuum. Unlike the original implementation, the vacuum task is now synchronous so that there is no running GRASS process after the session has ended. The original code runs relatively quickly, because it just starts the vacuum subprocess which then runs on its own. However, this dangling subprocess may cause unexpected behavior (locked SQLite, process termination on shared hardware). The new code waits for the process to finish, but vacuums only for large, modified database files. The message is now only verbose and printed in parallel while the task is running. The original implementation comes from Trac ticket 3697 (https://trac.osgeo.org/grass/ticket/3697) and is best visible in a bulk backport commit f790d68 (https://trac.osgeo.org/grass/changeset/73736).
For SQLite database, VACUUM is used when the session is ending. There is a cost in running the task even when the database is small and no vacuuming is needed. This change adds a check for the size of the database and ignores databases under 100MB.
The message is now only verbose and printed when the task is running.
Unlike the original implementation, the vacuum task is now synchronous so that there is no running GRASS process after the session has ended.
The original implementation comes from Trac ticket 3697 (https://trac.osgeo.org/grass/ticket/3697)
and is best visible in a bulk backport commit 73736 (f790d68).