v.db.join: speed up processing by using fewer db.execute commands #3286

griembauer · 2023-12-04T11:41:50Z

This PR speeds up v.db.join for large tables. Previously, individual v.db.addcolumn and db.execute commands were run for each column to join.
This PR alters v.db.addcolumn and v.db.join such that only one db.execute command is executed for each, using a temporary sql_file instead of stdin as input for the SQL statement.
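
A minimal sketch of the SQL-file approach described above, not the code from this PR: the helper `write_sql_file` and the table and column names are hypothetical, and the GRASS call is only shown as a comment so the snippet runs without a GRASS session.

```python
import os
import tempfile


def write_sql_file(statements):
    """Write SQL statements to a temporary file, one per line; return its path."""
    fd, path = tempfile.mkstemp(suffix=".sql", text=True)
    with os.fdopen(fd, "w") as sql_file:
        for statement in statements:
            # Make sure every statement ends with exactly one semicolon.
            sql_file.write(statement.rstrip(";") + ";\n")
    return path


# Hypothetical table and column names, for illustration only.
columns = ["col_a", "col_b", "col_c"]
statements = [
    f"UPDATE base SET {col} = (SELECT {col} FROM other WHERE other.cat=base.cat)"
    for col in columns
]
sql_path = write_sql_file(statements)

# Inside a GRASS session the file would then be passed to one single call:
#   db.execute input=<sql_path>
with open(sql_path) as sql_file:
    print(sql_file.read())
os.remove(sql_path)
```

The point of the design is that the number of db.execute calls no longer grows with the number of columns; only the size of the file does.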

Previous approach (deprecated):
~~In this PR, one single v.db.addcolumn and db.execute command is executed for a chunk of max 100 columns at once.
Here, all individual v.db.addcolumn commands are grouped into one single command. Additionally, all db.execute UPDATE table... commands to add data to the new columns are grouped into a single command. As the SQL command may become very long this way (and incomplete input in sqlite3_prepare() errors may be encountered, see also #3273), commands with a total string length of > ~10.000 are again split into individual SQL UPDATE commands.~~

scripts/v.db.join/v.db.join.py

tmszi · 2023-12-20T08:02:21Z

The problem with db.execute input="-" is the statically sized character buffer that is used when the SQL string is read from stdin.

grass/db/db.execute/main.c

Lines 189 to 191 in ea8f3ea

    
           int get_stmt(FILE *fd, dbString *stmt) 
        
           { 
        
               char buf[DB_SQL_MAX], buf2[DB_SQL_MAX];

grass/include/grass/dbmi.h

Line 142 in ea8f3ea

#define DB_SQL_MAX 65536

The following line truncates the SQL string to 8206 chars.

grass/db/db.execute/main.c

Line 196 in ea8f3ea

if (G_getl2(buf, sizeof(buf), fd) == 0)

Example of a long SQL string:

1. Generate a long SQL string with 30145 chars (create table) with Python:

    sql = "CREATE TABLE soils_test (cat integer, soiltype varchar(10)"
    for i in range(1200):
        sql += f", soiltype{i} varchar(10)"
    sql += ")"

2. Try to execute this SQL string with the `db.execute input="-"` command.

3. The `db.execute` command fails because the SQL string is truncated and no longer valid.

For such a long SQL string, it is better to read the SQL string from a file: `db.execute input=/tmp/create_table.sql`
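
A rough sketch of the failure mode and the workaround, assuming the 8206-char limit reported above; `STDIN_LIMIT` is a hypothetical name, the slicing only imitates the truncation, and the db.execute call is shown as a comment so the snippet runs without GRASS.

```python
import os
import tempfile

# Limit reported in this comment for SQL read from stdin (assumption).
STDIN_LIMIT = 8206

sql = "CREATE TABLE soils_test (cat integer, soiltype varchar(10)"
for i in range(1200):
    sql += f", soiltype{i} varchar(10)"
sql += ")"

# Imitate what is left of the statement when it is cut at the limit.
truncated = sql[:STDIN_LIMIT]
unbalanced = truncated.count("(") != truncated.count(")")
print(f"longer than limit: {len(sql) > STDIN_LIMIT}, unbalanced: {unbalanced}")

# Workaround: write the full statement to a file instead of piping it.
fd, path = tempfile.mkstemp(suffix=".sql", text=True)
with os.fdopen(fd, "w") as sql_file:
    sql_file.write(sql + ";\n")

# Inside a GRASS session:
#   db.execute input=<path>
with open(path) as sql_file:
    stored = sql_file.read()
os.remove(path)
```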

griembauer · 2024-02-26T10:40:23Z

Thanks! I have now updated both v.db.join and v.db.update so that they use db.execute with a temporary SQL file as input instead of stdin. For joining 500 INT columns to a test vector with ~750 objects, this brings the calculation time down from 110 s to 9 s on my local machine. However, the sql_files still contain separate SQL statements:

[...]
UPDATE builtup_vectorized_base SET test_column_491 = (SELECT test_column_491 FROM builtup_vectorized_attributes WHERE builtup_vectorized_attributes.cat=builtup_vectorized_base.cat);
UPDATE builtup_vectorized_base SET test_column_492 = (SELECT test_column_492 FROM builtup_vectorized_attributes WHERE builtup_vectorized_attributes.cat=builtup_vectorized_base.cat);
UPDATE builtup_vectorized_base SET test_column_493 = (SELECT test_column_493 FROM builtup_vectorized_attributes WHERE builtup_vectorized_attributes.cat=builtup_vectorized_base.cat);
UPDATE builtup_vectorized_base SET test_column_494 = (SELECT test_column_494 FROM builtup_vectorized_attributes WHERE builtup_vectorized_attributes.cat=builtup_vectorized_base.cat);
[...]

instead of having just one UPDATE statement. I am not sure if this would significantly improve the processing time again, but I couldn't manage to put together a single working UPDATE statement - input is welcome!
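
One possible form of a single UPDATE statement, sketched here against an in-memory SQLite database rather than a GRASS attribute table: all assignments go into one comma-separated SET clause. The table names `base` and `attrs` and the two columns are hypothetical, and whether this is faster, or works with every GRASS database driver, is not verified here.

```python
import sqlite3

# Hypothetical subset of the columns to join.
columns = ["test_column_1", "test_column_2"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE base (cat INTEGER, test_column_1 INTEGER, test_column_2 INTEGER)"
)
conn.execute(
    "CREATE TABLE attrs (cat INTEGER, test_column_1 INTEGER, test_column_2 INTEGER)"
)
conn.executemany("INSERT INTO base (cat) VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO attrs VALUES (?, ?, ?)", [(1, 10, 100), (2, 20, 200)])

# One correlated subquery per column, all inside a single SET clause.
assignments = ", ".join(
    f"{col} = (SELECT {col} FROM attrs WHERE attrs.cat=base.cat)" for col in columns
)
conn.execute(f"UPDATE base SET {assignments}")

rows = conn.execute("SELECT * FROM base ORDER BY cat").fetchall()
print(rows)  # [(1, 10, 100), (2, 20, 200)]
```

With several hundred columns the statement becomes very long, so it would have to be passed through an SQL file rather than stdin.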

neteler · 2024-02-26T11:21:28Z

Probably a set of SQL commands should be wrapped into a TRANSACTION?

    sql = ["BEGIN TRANSACTION"]
    ...
    sql.append("END TRANSACTION")

Random example:

grass/scripts/v.dissolve/v.dissolve.py

Line 200 in 24365e1

sql = ["BEGIN TRANSACTION"]
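
A small sketch of the transaction wrapping suggested above, run against an in-memory SQLite database so it works without GRASS. The helper `wrap_in_transaction` and the table and column names are hypothetical and not taken from v.dissolve or this PR.

```python
import sqlite3


def wrap_in_transaction(statements):
    """Join SQL statements into one script wrapped in a single transaction."""
    sql = ["BEGIN TRANSACTION"]
    sql.extend(statement.rstrip(";") for statement in statements)
    sql.append("END TRANSACTION")
    return ";\n".join(sql) + ";\n"


# Hypothetical table and column names, for illustration only.
updates = [f"UPDATE base SET col_{i} = {i} WHERE cat = 1" for i in range(3)]
script = wrap_in_transaction(updates)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE base (cat INTEGER, col_0 INTEGER, col_1 INTEGER, col_2 INTEGER)"
)
conn.execute("INSERT INTO base (cat) VALUES (1)")
conn.commit()

# All UPDATE statements are committed together at END TRANSACTION.
conn.executescript(script)
row = conn.execute("SELECT * FROM base").fetchone()
print(row)  # (1, 0, 1, 2)
```

Without the wrapper, each statement is typically committed on its own, which is usually the expensive part for file-based databases.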

griembauer · 2024-02-26T12:02:39Z

Thanks! Adding this to both v.db.addcolumn and v.db.join seems to double the processing speed.

scripts/v.db.addcolumn/v.db.addcolumn.py

scripts/v.db.join/v.db.join.py

tmszi

First, I apologize for the long delay.

Looks good to me. I tested this fix with adding 500 columns and time was reduced from approximately 1 minute to 0.721s.

metzm

Looks good to me, impressive speed-up!

griembauer added 4 commits December 4, 2023 12:24

speed_up

70c1f94

add comments

1150f5d

apply black

536c107

remove unnecessary import

6c58f23

neteler added the enhancement and Python labels Dec 4, 2023

neteler added this to the 8.4.0 milestone Dec 4, 2023

griembauer added 2 commits December 6, 2023 14:59

remove addcolumn part

f790d03

remove unnecessary import

c5dede1

metzm reviewed Dec 6, 2023

griembauer added 3 commits December 7, 2023 10:11

use chunks

9937208

reduce to 100er chunks

5aac6f5

apply black

23fcd33

Merge branch 'main' into v.db.join_addcol

eaa32ce

github-actions bot added the module label Jan 5, 2024

griembauer added 3 commits February 26, 2024 09:28

Merge branch 'main' of https://github.com/OSGeo/grass into v.db.join_addcol

208ede1

sql_files

defa84b

Merge branch 'v.db.join_addcol' of github.com:griembauer/grass into v.db.join_addcol

caa40a7

github-actions bot added the vector label Feb 26, 2024

griembauer added 2 commits February 26, 2024 11:28

remove enumerate

4223d0a

update black version

fa3ec92

remove not required import

18c7158

review MN; add TRANSACTION

79c1535

griembauer requested a review from tmszi March 6, 2024 08:40