Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v.db.join: speed up processing by using fewer db.execute commands #3286

Merged
merged 18 commits into from
Apr 9, 2024

Conversation

griembauer
Copy link
Contributor

@griembauer griembauer commented Dec 4, 2023

This PR speeds up v.db.join for large tables. Previously, individual v.db.addcolumn and db.execute commands were run for each column to join.
This PR alters v.db.addcolumn and v.db.join such that only one db.execute command is executed for each, using a temporary sql_file instead of stdin as input for the SQL statement.

Previous approach (deprecated):
~~In this PR, one single v.db.addcolumn and db.execute command is executed for a chunk of max 100 columns at once.
Here, all individual v.db.addcolumn commands are grouped into one single command. Additionally, all db.execute UPDATE table... commands to add data to the new columns are grouped into a single command. As the SQL command may become very long this way (and incomplete input in sqlite3_prepare() errors may be encountered, see also #3273), commands with a total string length of > ~10.000 are again split into individual SQL UPDATE commands.~~

@neteler neteler added enhancement New feature or request Python Related code is in Python labels Dec 4, 2023
@neteler neteler added this to the 8.4.0 milestone Dec 4, 2023
@tmszi
Copy link
Member

tmszi commented Dec 20, 2023

The problem with db.execute input="-" is the statically defined character buffer size if the SQL string is read from stdin input="-".

grass/db/db.execute/main.c

Lines 189 to 191 in ea8f3ea

int get_stmt(FILE *fd, dbString *stmt)
{
char buf[DB_SQL_MAX], buf2[DB_SQL_MAX];

#define DB_SQL_MAX 65536

This line truncates the SQL string to 8206 chars.

if (G_getl2(buf, sizeof(buf), fd) == 0)

Example of long SQL string:

  1. Generate long SQL string with 30145 chars (create table) with Python
sql = "CREATE TABLE soils_test (cat integer, soiltype varchar(10)"
for i in range(1200):
    sql += f", soiltype{i} varchar(10)"
sql += ")"
  1. Try to execute this SQL string with db.execute input="-" command
  2. db.execute command fails because SQL string is truncated and not valid

For this long SQL string, it is better to use read SQL string from the file db.execute input=/tmp/create_table.sql

@github-actions github-actions bot added the module label Jan 5, 2024
@github-actions github-actions bot added the vector Related to vector data processing label Feb 26, 2024
@griembauer
Copy link
Contributor Author

The problem with db.execute input="-" is the statically defined character buffer size if the SQL string is read from stdin input="-".

grass/db/db.execute/main.c

Lines 189 to 191 in ea8f3ea

int get_stmt(FILE *fd, dbString *stmt)
{
char buf[DB_SQL_MAX], buf2[DB_SQL_MAX];

#define DB_SQL_MAX 65536

This line truncates the SQL string to 8206 chars.

if (G_getl2(buf, sizeof(buf), fd) == 0)

Example of long SQL string:

1. Generate long SQL string with 30145 chars (create table) with Python
sql = "CREATE TABLE soils_test (cat integer, soiltype varchar(10)"
for i in range(1200):
    sql += f", soiltype{i} varchar(10)"
sql += ")"
2. Try to execute this SQL string with `db.execute input="-"` command

3. `db.execute` command fails because SQL string is truncated and not valid

For this long SQL string, it is better to use read SQL string from the file db.execute input=/tmp/create_table.sql

Thanks! I now updated both v.db.join and v.db.update so that they both use db.executewith a temporary sql file as input instead of stdin. For joining 500 INT columns to a test vector with ~750 objects this brings calculation time down from 110s to 9s on my local machine. However, in the sql_files there are still separate sql statements:

[...]
UPDATE builtup_vectorized_base SET test_column_491 = (SELECT test_column_491 FROM builtup_vectorized_attributes WHERE builtup_vectorized_attributes.cat=builtup_vectorized_base.cat);
UPDATE builtup_vectorized_base SET test_column_492 = (SELECT test_column_492 FROM builtup_vectorized_attributes WHERE builtup_vectorized_attributes.cat=builtup_vectorized_base.cat);
UPDATE builtup_vectorized_base SET test_column_493 = (SELECT test_column_493 FROM builtup_vectorized_attributes WHERE builtup_vectorized_attributes.cat=builtup_vectorized_base.cat);
UPDATE builtup_vectorized_base SET test_column_494 = (SELECT test_column_494 FROM builtup_vectorized_attributes WHERE builtup_vectorized_attributes.cat=builtup_vectorized_base.cat);
[...]

instead of having just one UPDATE statement. I am not sure if this would significantly improve the processing time again, but I couldn't manage to put together a single working UPDATE statement - input is welcome!

@neteler
Copy link
Member

neteler commented Feb 26, 2024

Probably a set of SQL commands should be wrapped into a TRANSACTION?

    sql = ["BEGIN TRANSACTION"]
   ...
    sql.append("END TRANSACTION")

Random example:

sql = ["BEGIN TRANSACTION"]

@griembauer
Copy link
Contributor Author

Probably a set of SQL commands should be wrapped into a TRANSACTION?

    sql = ["BEGIN TRANSACTION"]
   ...
    sql.append("END TRANSACTION")

Random example:

sql = ["BEGIN TRANSACTION"]

Thanks! Adding this to both v.db.addcolumn and v.db.join seems to double the processing speed

@griembauer griembauer requested a review from tmszi March 6, 2024 08:40
@griembauer griembauer requested a review from tmszi April 2, 2024 08:40
Copy link
Member

@tmszi tmszi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First I appologize for long delay.

Looks good to me. I tested this fix with adding 500 columns and time was reduced from approximately 1 minute to 0.721s.

Copy link
Contributor

@metzm metzm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, impressive speed-up!

@metzm metzm merged commit f98c3f2 into OSGeo:main Apr 9, 2024
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request module Python Related code is in Python vector Related to vector data processing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants