
Importing huge seed files fails #127

Closed · ghost opened this issue Apr 12, 2021 · 7 comments · Fixed by #151
Labels
good first issue Good for newcomers

Comments

@ghost commented Apr 12, 2021

I have a seed file with 14 columns and 152 rows. It fails with the error message:

[screenshot of the error message]

Note: 14 columns x 152 rows = 2,128 cells

It looks like the seed rows are not batched when they are imported, so they fail when the file is too big.

I think dbt should batch the rows/cells to avoid such issues.

I'm working with SQL Server 2014, but I guess this issue applies to current SQL Server versions as well.

@mikaelene (Collaborator)

There is already a limit on the number of rows in dbt-core. The use of seeds is not advised and may get deprecated in the future. I think it will require some work to make dbt batch-insert the data.

But 152 rows is not that much. I don't know what the limit is in more recent versions of SQL Server.

@ghost (Author) commented Apr 12, 2021

Well, the limit is actually 2,100 parameters, as you can see from the error message, but the dbt adapter for SQL Server seems to turn each cell into an insert parameter, which makes it break: 152 rows x 14 columns = 2,128 parameters, just over the limit. I don't have much insight into the inner logic of what the adapter executes here, but I think 152 rows should be supported.

@ghost (Author) commented Apr 12, 2021

The SQL Server INSERT ... VALUES (..), (..) limit is 1,000 rows, so that is for sure not the issue here. I think it is because each cell value is passed as a parameter, e.g. like this:

INSERT INTO seed_table
(col1, col2, ...)
VALUES
(@row1_col1, @row1_col2, ...),
(@row2_col1, @row2_col2, ...),
...

and SQL Server has a limit on the number of such parameters. But I am just guessing here...

@dataders (Collaborator)

@hz-lschick you're directionally correct here.

If you look at line 18, each column value is a parameter marked by ?. AFAIK, sqlserver__basic_load_csv_rows() takes a batch_size argument, which sqlserver__load_csv_rows() sets to 200 (i.e. 200 rows). I believe that sqlserver__basic_load_csv_rows() will create one INSERT INTO statement for each group of 200 rows.

So a potential workaround would be to override the adapter's sqlserver__load_csv_rows() by re-creating it in your local macros/ dir, but with the 200 batch_size changed to something smaller; an example override follows the macro below. 149 should work for your example because 149 * 14 = 2,086, which stays under the 2,100-parameter limit. For reference, the adapter currently defines it as:

{% macro sqlserver__load_csv_rows(model, agate_table) %}
  {{ return(sqlserver__basic_load_csv_rows(model, 200, agate_table) )}}
{% endmacro %}
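
A minimal sketch of such an override, assuming it is saved in your project's macros/ directory (the batch size of 100 is only an illustration; anything where batch_size * column_count stays under 2,100 should work):

{% macro sqlserver__load_csv_rows(model, agate_table) %}
  {# illustrative override: smaller batch so that rows-per-insert * column-count stays below SQL Server's 2,100-parameter limit #}
  {{ return(sqlserver__basic_load_csv_rows(model, 100, agate_table)) }}
{% endmacro %}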

You might try putting synapse's seed macro in your local macros dir and renaming synapse__ to sqlserver__:

{% macro sqlserver__basic_load_csv_rows(model, batch_size, agate_table) %}
    {% set cols_sql = get_seed_column_quoted_csv(model, agate_table.column_names) %}
    {% set bindings = [] %}
    {% set statements = [] %}

    {% for chunk in agate_table.rows | batch(batch_size) %}
        {# flatten the chunk into one bindings list: one entry per cell #}
        {% set bindings = [] %}
        {% for row in chunk %}
            {% set _ = bindings.extend(row) %}
        {% endfor %}

        {# build one parameterised insert per chunk, with a ? placeholder for every cell #}
        {% set sql %}
            insert into {{ this.render() }} ({{ cols_sql }}) values
            {% for row in chunk -%}
                ({%- for column in agate_table.column_names -%}
                    ?
                    {%- if not loop.last%},{%- endif %}
                {%- endfor -%})
                {%- if not loop.last%},{%- endif %}
            {%- endfor %}
        {% endset %}

        {% set _ = adapter.add_query(sql, bindings=bindings, abridge_sql_log=True) %}

        {% if loop.index0 == 0 %}
            {% set _ = statements.append(sql) %}
        {% endif %}
    {% endfor %}

    {# return the first statement so it can be rendered into the compiled files #}
    {{ return(statements[0]) }}
{% endmacro %}
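
For illustration only (not output captured from the adapter), with a 3-column seed and a batch of 2 rows the SQL built by the loop above would look roughly like:

insert into <database>.<schema>.<seed_table> (col1, col2, col3) values
(?, ?, ?),
(?, ?, ?)

Each ? is bound from the flattened bindings list, so a single batch consumes batch_size * column_count parameters, and that product is what has to stay at or below SQL Server's 2,100-parameter limit.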

@mikaelene (Collaborator)

I guess we could change the 200 setting to 100 instead. It won't make a difference in performance. Anyone interested in doing a pull request?

@mikaelene added the "good first issue" label Apr 13, 2021
@ghost (Author) commented Apr 13, 2021

Hmm, I think the offered solutions are both good and will work for me, but I would prefer that this calculation be done in the code. E.g. when the batch size is 200 and the seed has more than 10 columns, the batch size should automatically be reduced so that one insert query never uses more than 2,100 parameters/cells.
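
A minimal sketch of that calculation, reusing the macro signatures shown earlier in the thread (illustrative only; the actual fix in #151 may differ):

{% macro sqlserver__load_csv_rows(model, agate_table) %}
    {# SQL Server allows at most 2,100 parameters per statement, and every cell in a batch becomes one parameter #}
    {% set column_count = agate_table.column_names | length %}
    {# pick the largest row batch that keeps batch_size * column_count below the limit, capped at the old default of 200 #}
    {% set batch_size = [200, (2100 // column_count) - 1] | min %}
    {{ return(sqlserver__basic_load_csv_rows(model, batch_size, agate_table)) }}
{% endmacro %}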

@mikaelene (Collaborator)

From my point of view that is a little overkill. I guess we could set it to 42 and add a disclaimer that seed files with more than 50 columns are not supported (42 rows * 50 columns = 2,100 parameters, right at the limit). I think SQL Server would handle the inserts quite fast even with the small batch size.

But I will approve a PR with this logic if it works OK :-).

Seed files live in your project directory and are version controlled. This means that if you have a large file, it will be sent to git on every change, making your project heavy and slow. If you have larger files, you may want to look into another solution for loading them.

dataders added a commit that referenced this issue Jul 19, 2021
Issue #127 - Respect SQL Servers Parameter Limit