Decrease batch size for seeds #35

Closed

tomreitz opened this issue Jan 10, 2023 · 1 comment

Comments

@tomreitz

Thanks for your work on this adapter; it's great.

The batch size used when loading seeds is hard-coded as 100000 in the adapter's seed materialization (macros/materializations/seed/seed.sql).

I'm not sure how the 100k value was determined, but I ran into the following SQLite error when trying to load a seed file of 70k rows, 5 cols, 2.37MB (a dictionary of common words in 14 languages):

17:46:24  7 of 7 ERROR loading seed file main.synth_words ................................ [ERROR in 2.54s]
...
17:46:24  Database Error in seed synth_words (seeds/synth_words.csv)
17:46:24    too many SQL variables

I resolved the error by manually changing the batch size in macros/materializations/seed/seed.sql from 100k down to 10k.

SQLite's limits include SQLITE_MAX_VARIABLE_NUMBER, which defaults to 999 for SQLite versions prior to 3.32.0 (2020-05-22) and to 32766 for 3.32.0 and later.

(I'm running SQLite 3.31.1.) It seems that the number of variables used in the seed query scales with the batch size, presumably one bound variable per cell, i.e. rows per batch times columns.
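
To illustrate, here is a minimal sketch using Python's sqlite3 module. It is not the adapter's actual seed code, and it assumes one bound "?" per cell; under that assumption a 100k-row batch of a 5-column seed needs about 500,000 variables, well past any default SQLITE_MAX_VARIABLE_NUMBER:

```python
import sqlite3

# Sketch only (not the adapter's real seed macro): assume each cell of the
# batch becomes one bound "?" variable, so one INSERT statement needs
# rows_per_batch * num_columns variables.
num_columns = 5
rows_per_batch = 100_000
variables_needed = rows_per_batch * num_columns  # 500,000 for the 100k default

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE synth_words (c1, c2, c3, c4, c5)")

row_placeholder = "(" + ", ".join(["?"] * num_columns) + ")"
sql = ("INSERT INTO synth_words VALUES "
       + ", ".join([row_placeholder] * rows_per_batch))

try:
    conn.execute(sql, ["x"] * variables_needed)
except sqlite3.OperationalError as exc:
    # Fails once variables_needed exceeds the library's compiled
    # SQLITE_MAX_VARIABLE_NUMBER (999, 32766, or a distro-specific value).
    print(exc)  # "too many SQL variables"
```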

Sidebar: dbt itself specifies a maximum file size for seeds but this is only used to determine whether or not to hash the file. dbt discourages using very large seed files, but does not impose a limit as far as I can tell.

Would it be possible to decrease the batch size from 100k to, say, 10k, and/or make it configurable via a seed_batch_size parameter?

My guess is that the optimal value depends on both the number of rows and the number of columns in the seed file, so it may be difficult to hard-code and justify any specific batch size. But something smaller, while resulting in more queries against the database, should still be reasonably performant.
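
For what it's worth, one way to make this adaptive (a sketch only; seed_batch_size is a hypothetical setting here, not an existing dbt or adapter option) would be to cap the batch so that rows times columns stays under the variable limit:

```python
def safe_seed_batch_size(num_columns: int,
                         configured: int = 10_000,
                         max_variables: int = 999) -> int:
    """Largest batch keeping rows * columns under SQLite's variable limit.

    `configured` stands in for a hypothetical seed_batch_size setting;
    `max_variables` should match the linked library's
    SQLITE_MAX_VARIABLE_NUMBER (999 before 3.32.0, 32766 from 3.32.0 on,
    or whatever the distro compiled in).
    """
    return max(1, min(configured, max_variables // num_columns))

# 5-column seed on a pre-3.32.0 default build: min(10000, 999 // 5) = 199 rows/batch.
print(safe_seed_batch_size(num_columns=5))
# Same seed on a 3.32.0+ default build: min(10000, 32766 // 5) = 6553 rows/batch.
print(safe_seed_batch_size(num_columns=5, max_variables=32766))
```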

@codeforkjeff
Owner

Bumping it down to 10k sounds good; thanks for reporting this. I'll cut a new release containing this fix.
