
Importing huge seed files fails #127

Closed · ghost opened this issue Apr 12, 2021 · 7 comments · Fixed by #151
Labels
good first issue Good for newcomers

Comments

@ghost commented Apr 12, 2021

I have a seed file with 14 columns and 152 rows. It fails with the error message:

[screenshot of the error message]

Note: 14 columns x 152 rows = 2,128 cells

It looks like the seed rows are not batched when they are imported, so they fail when the file is too big.

I think dbt should batch the rows/cells to avoid such issues.

I'm working with SQL Server 2014, but I guess this issue applies to current SQL Server versions as well.

@mikaelene (Collaborator)

There is already a limit on the number of rows in dbt-core. The use of seeds is not advised and may get deprecated in the future. I think it will require some work to make dbt batch-insert the data.

But 152 rows is not that much. I don't know what the limit is in more recent versions of SQL Server.

@ghost (Author) commented Apr 12, 2021

Well, the limit is actually 2,100 parameters, as you can see from the error message, but the dbt adapter for SQL Server seems to turn each cell into an insert parameter, which makes it break: 152 rows x 14 columns = 2,128 parameters, just over the limit. I don't have much insight into the inner logic of what the adapter executes here, but I think 152 rows should be supported.

@ghost (Author) commented Apr 12, 2021

The SQL Server INSERT ... VALUES (..), (..) limit is 1,000 rows, so that is for sure not the issue here. I think it is because each cell value is passed as a parameter, e.g. like this:

INSERT INTO seed_table
(col1, col2, ...)
VALUES
(@row1_col1, @row1_col2, ...),
(@row2_col1, @row2_col2, ...),
...

and SQL Server has a limit on the number of such parameters. But I am just guessing here...

@dataders (Collaborator)

@hz-lschick you're directionally correct here.

If you look at line 18, each column value is a parameter marked by ?. AFAIK, sqlserver__basic_load_csv_rows() takes a batch_size argument, which sqlserver__load_csv_rows() sets to 200 (i.e. 200 rows). I believe that sqlserver__basic_load_csv_rows() will create one INSERT INTO statement for each group of 200 rows.

So a potential workaround would be to override the adapter's sqlserver__load_csv_rows() by re-creating it in your local macros/ dir, but with the 200 batch_size changed to something smaller; an example override follows the macro below. 149 should work for your example because 149 * 14 = 2,086, which stays under the 2,100-parameter limit. For reference, the adapter currently defines it as:

{% macro sqlserver__load_csv_rows(model, agate_table) %}
  {{ return(sqlserver__basic_load_csv_rows(model, 200, agate_table) )}}
{% endmacro %}
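
A minimal sketch of such an override, assuming it is saved in your project's macros/ directory (the batch size of 100 is only an illustration; anything where batch_size * column_count stays under 2,100 should work):

{% macro sqlserver__load_csv_rows(model, agate_table) %}
  {# illustrative override: smaller batch so that rows-per-insert * column-count stays below SQL Server's 2,100-parameter limit #}
  {{ return(sqlserver__basic_load_csv_rows(model, 100, agate_table)) }}
{% endmacro %}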

You might try putting synapse's seed macro in your local macros dir and renaming synapse__ to sqlserver__:

{% macro sqlserver__basic_load_csv_rows(model, batch_size, agate_table) %}
    {% set cols_sql = get_seed_column_quoted_csv(model, agate_table.column_names) %}
    {% set bindings = [] %}
    {% set statements = [] %}

    {% for chunk in agate_table.rows | batch(batch_size) %}
        {# flatten the chunk into one bindings list: one entry per cell #}
        {% set bindings = [] %}
        {% for row in chunk %}
            {% set _ = bindings.extend(row) %}
        {% endfor %}

        {# build one parameterised insert per chunk, with a ? placeholder for every cell #}
        {% set sql %}
            insert into {{ this.render() }} ({{ cols_sql }}) values
            {% for row in chunk -%}
                ({%- for column in agate_table.column_names -%}
                    ?
                    {%- if not loop.last%},{%- endif %}
                {%- endfor -%})
                {%- if not loop.last%},{%- endif %}
            {%- endfor %}
        {% endset %}

        {% set _ = adapter.add_query(sql, bindings=bindings, abridge_sql_log=True) %}

        {% if loop.index0 == 0 %}
            {% set _ = statements.append(sql) %}
        {% endif %}
    {% endfor %}

    {# return the first statement so it can be rendered into the compiled files #}
    {{ return(statements[0]) }}
{% endmacro %}
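
For illustration only (not output captured from the adapter), with a 3-column seed and a batch of 2 rows the SQL built by the loop above would look roughly like:

insert into <database>.<schema>.<seed_table> (col1, col2, col3) values
(?, ?, ?),
(?, ?, ?)

Each ? is bound from the flattened bindings list, so a single batch consumes batch_size * column_count parameters, and that product is what has to stay at or below SQL Server's 2,100-parameter limit.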

@mikaelene (Collaborator)

I guess we could change the 200 setting to 100 instead. It won't make a difference in performance. Anyone interested in doing a pull request?

@mikaelene added the "good first issue" label Apr 13, 2021
@ghost (Author) commented Apr 13, 2021

Hmm, I think the offered solutions are both good and will work for me, but I would prefer that this calculation be done in the code. E.g. when the batch size is 200 and the seed has more than 10 columns, the batch size should automatically be reduced so that one insert query never uses more than 2,100 parameters/cells.
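
A minimal sketch of that calculation, reusing the macro signatures shown earlier in the thread (illustrative only; the actual fix in #151 may differ):

{% macro sqlserver__load_csv_rows(model, agate_table) %}
    {# SQL Server allows at most 2,100 parameters per statement, and every cell in a batch becomes one parameter #}
    {% set column_count = agate_table.column_names | length %}
    {# pick the largest row batch that keeps batch_size * column_count below the limit, capped at the old default of 200 #}
    {% set batch_size = [200, (2100 // column_count) - 1] | min %}
    {{ return(sqlserver__basic_load_csv_rows(model, batch_size, agate_table)) }}
{% endmacro %}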

@mikaelene (Collaborator)

From my point of view that is a little overkill. I guess we could set it to 42 and add a disclaimer that seed files with more than 50 columns are not supported (42 rows * 50 columns = 2,100 parameters, right at the limit). I think SQL Server would handle the inserts quite fast even with the small batch size.

But I will approve a PR with this logic if it works OK :-).

Seed files live in your project directory and are version controlled. This means that if you have a large file, it will be sent to git on every change, making your project heavy and slow. If you have larger files, you may want to look into another solution for loading them.

dataders added a commit that referenced this issue Jul 19, 2021
Issue #127 - Respect SQL Servers Parameter Limit