Introduce "delimiter" parameter for seeds? Allow non-csv file types #3990

sgoley · 2021-10-01T17:48:35Z

Describe the feature

Use alternative delimiters for seeds besides "," (comma)

Describe alternatives you've considered

Currently, the only option is to load a non-csv seed into the database as a temp table and then reference that as a "source".

Additional context

No, not database specific.

Who will this benefit?

Anyone who uses seeds.

jtcohen6 · 2021-10-04T12:25:54Z

@sgoley When you say "non-csv file types," are you talking only about tabular data with non-comma delimiters (e.g. TSVs)? Or are you also thinking about other formats for structured data (JSON, XML, ...)?

If it's just a question of delimiters, could you provide a bit more background on the difficulty you're encountering? Is there a blocker to pre-processing your files, so as to switch from tab/semicolon/other delimiter to commas?

We've got another issue already open for newline-delimited JSON support: #2365

sgoley · 2021-10-05T02:10:14Z

Yes, in my case I am specifically talking about non-comma delimiters: TSVs, semicolon-delimited files (common in some European countries, where the comma is the decimal separator), or "|" ('VERTICAL LINE', U+007C) delimited files (infrequent, but used within US financial systems).

I do completely understand that pre-processing is the only workable solution at present. I'm just opening the issue here because I searched and found no truly similar issue, open or closed.
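
For readers who need the pre-processing workaround in the meantime, here is a minimal sketch using only Python's standard csv module. The function name convert_to_comma_csv and the sample data are hypothetical, made up for illustration; nothing here is part of dbt.

```python
import csv
import io


def convert_to_comma_csv(text: str, delimiter: str) -> str:
    """Re-serialize delimited text as comma-separated CSV (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    out = io.StringIO()
    # lineterminator="\n" keeps the output identical across platforms.
    writer = csv.writer(out, lineterminator="\n")
    # The writer's default QUOTE_MINIMAL quoting wraps any field that
    # contains a comma, so decimal-comma values stay in one column.
    writer.writerows(reader)
    return out.getvalue()


# Semicolon-delimited sample with decimal commas (made-up data).
semicolon_data = "id;amount\n1;3,50\n2;10,25\n"
converted = convert_to_comma_csv(semicolon_data, ";")
print(converted)
```

The converted text can then be saved as a regular .csv seed file. Note that values such as 3,50 arrive as quoted strings, so any numeric casting would still have to happen downstream.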

After reading more into agate's csv reader function, it looks like that particular API does not support a "sep" / "delimiter" parameter, which I assume is why this was not supported natively?

jtcohen6 · 2021-10-05T13:50:43Z

@sgoley You're right, we just call agate's from_csv method for this:
https://github.com/dbt-labs/dbt/blob/f7680379fca80f653e8e7e2d45d7165a0fd864da/core/dbt/clients/agate_helper.py#L146

The good news: in that method, agate itself is happy to pass through any/all keyword arguments supported by Python's built-in csv reader. It accepts **kwargs and passes them into csv.reader().
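
The pass-through pattern described here can be illustrated with the standard library alone. This is a sketch, not agate's code: read_rows is a hypothetical stand-in for a from_csv-style helper, and it forwards whatever keyword arguments it receives to csv.reader().

```python
import csv
import io


def read_rows(text, **kwargs):
    """Hypothetical stand-in for a from_csv-style helper.

    Any keyword arguments are forwarded unchanged to csv.reader().
    """
    return list(csv.reader(io.StringIO(text), **kwargs))


# The same helper handles different delimiters purely through kwargs.
tsv_rows = read_rows("id\tname\n1\talice\n", delimiter="\t")
pipe_rows = read_rows("id|name\n1|alice\n", delimiter="|")
default_rows = read_rows("id,name\n1,alice\n")  # no kwargs: comma is the default

print(tsv_rows == pipe_rows == default_rows)
```

Because the helper never names the keyword arguments itself, supporting a new csv.reader() option requires no change to the helper, only to the caller.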

So I think this could be as simple as:

- adding a delimiter arg to agate_helper.from_csv
- adding a delimiter config to seeds, with default value either "," or None (i.e. comma)
- adjusting the load_agate_table context method:

        column_types = self.model.config.column_types
        delimiter = self.model.config.delimiter
        try:
            table = agate_helper.from_csv(path, text_columns=column_types, delimiter=delimiter)
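
To make the "default value either , or None" idea concrete, here is a rough standard-library sketch. The names resolve_delimiter and load_rows are hypothetical and do not come from dbt or agate. The length check reflects that csv.reader() only accepts a one-character delimiter; how dbt itself treats an empty or invalid value is not shown in this thread, so that behavior is an assumption of the sketch.

```python
import csv
import io
from typing import List, Optional


def resolve_delimiter(configured: Optional[str]) -> str:
    """Map an optional config value to a usable delimiter (hypothetical helper)."""
    if configured is None:
        return ","  # None means "not configured": fall back to comma
    if len(configured) != 1:
        # Assumption for this sketch: reject empty or multi-character values
        # early, since csv.reader() requires a one-character delimiter.
        raise ValueError(f"delimiter must be a single character, got {configured!r}")
    return configured


def load_rows(text: str, delimiter: Optional[str] = None) -> List[List[str]]:
    """Parse delimited text, using comma when no delimiter is configured."""
    reader = csv.reader(io.StringIO(text), delimiter=resolve_delimiter(delimiter))
    return list(reader)


print(load_rows("a,b\n1,2\n"))
print(load_rows("a;b\n1;2\n", delimiter=";"))
```

Resolving the default in one place keeps existing comma-separated seeds working unchanged while letting a seed opt in to another delimiter.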

So, I'm pretty close to tagging this as a good first issue. I have just a few more questions:

- What if you need custom quoting, escape characters, etc? Should we seek to add generalized support for all csv.reader() kwargs as configs? Or do we think a configurable delimiter covers 90% of the bases?
- Would you expect to be able to define these seed files with other file extensions (e.g. *.tsv)? That change would need to happen in a different part of the codebase.
- Would we be better off saying "no" to all of the above, keeping seed support very simple (dbt isn't a data loader!), even at the cost of pre-processing?

jameseon · 2022-01-17T13:25:11Z

Hello @jtcohen6

I'm following this issue along with the related JSON one (#2365).
As a dbt community member, if I may, I'd like to answer your questions to @sgoley.

I understand the poor use cases for seeds outlined here: https://docs.getdbt.com/docs/building-a-dbt-project/seeds. The valid/good use cases are seeding small amounts of data as part of a build process (sampling), writing tests, documentation, etc.

What if you need custom quoting, escape characters, etc? Should we seek to add generalized support for all csv.reader() kwargs as configs? Or do we think a configurable delimiter covers 90% of the bases?

-- I believe the community would benefit from generalized CSV support. The comma is a very common character and shows up everywhere. If the data is not escaped or quoted, columns may shift even in simple files.

At a minimum: support for different delimiters, text enclosure, and escape character.
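
A small standard-library illustration of the column-shift problem, and of how enclosure and escape settings avoid it. The sample rows are made up, and the snippet only demonstrates csv.reader() behavior, not anything dbt does.

```python
import csv
import io

# An unquoted comma inside a value splits it into an extra column.
unquoted = "id,city\n1,Portland, OR\n"
rows_unquoted = list(csv.reader(io.StringIO(unquoted)))

# Enclosing the value in quotes keeps it in a single column.
quoted = 'id,city\n1,"Portland, OR"\n'
rows_quoted = list(csv.reader(io.StringIO(quoted)))

# A pipe-delimited file with a custom enclosure (') and escape character (\).
custom = "id|city\n1|'Portland| OR'\n2|Salem\\|OR\n"
rows_custom = list(
    csv.reader(io.StringIO(custom), delimiter="|", quotechar="'", escapechar="\\")
)

print(len(rows_unquoted[1]), len(rows_quoted[1]))
print(rows_custom)
```

The unquoted data row parses into three fields while the quoted one parses into two, which is the "shift" described above.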

Would you expect to be able to define these seed files with other file extensions (e.g. *.tsv)? That change would need to happen in a different part of the codebase.

-- Another community benefit: sometimes we need to work with small datasets in different formats. I believe there should be support for CSV (with a configurable delimiter, text enclosure, and escape character, this covers any character-delimited file: .txt, .tsv, pipe-delimited, etc.; csv.reader() kwargs should handle this beautifully), JSON, XML, Parquet, and ORC. These are becoming common.

Would we be better off saying "no" to all of the above, keep seed support very simple (dbt isn't a data loader!), even at the cost of pre-processing?

-- Keeping it simple is great, but this limits the entry point to dbt to comma-separated files only. Going through

---- As the community grows, this is going to be a growing need. dbt seed is still not ideal for loading data into a warehouse; as part of the build process for a data project, sampling should be recommended as best practice (maybe with size limits to avoid abuse of the functionality) to keep the functionality as light as it is. Yes, dbt is not a data loader, but can it be used to build an ELT pipeline based on small samples of different file formats? That is, mocking the EL process for common data formats and building the pipeline on a small amount of data. Let's not say "no", let's say yes 😄 - the community needs this functionality. Pre-processing a file before it enters dbt is a little hectic and expensive for a modern-day tool.

Thank you for considering this. Keep up the great work.

github-actions · 2022-07-17T02:14:19Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

github-actions · 2022-07-24T02:14:40Z

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.


ramonvermeulen · 2023-03-17T13:55:08Z

I opened a pull request that adds this feature into dbt-core: #7186

jtcohen6 · 2023-03-20T08:38:18Z

Thanks @ramonvermeulen! Reopening, given you've picked this up

… (#7186)
* Support configurable delimiter for seed files, default to comma (#3990)
* Update Features-20230317-144957.yaml
* Moved "delimiter" to seed config instead of node config
* Update core/dbt/clients/agate_helper.py Co-authored-by: Cor <jczuurmond@protonmail.com>
* Update test_contracts_graph_parsed.py
* fixed integration tests
* Added functional tests for seed files with a unique delimiter
* Added docstrings
* Added a test for an empty string configured delimiter value
* whitespace
* ran black
* updated changie entry
* Update Features-20230317-144957.yaml
---------
Co-authored-by: Cor <jczuurmond@protonmail.com>

sgoley added enhancement New feature or request triage labels Oct 1, 2021

jtcohen6 removed the triage label Oct 4, 2021

jtcohen6 added the seeds Issues related to dbt's seed functionality label Oct 5, 2021

jtcohen6 added the Team: Execution label Nov 18, 2021

github-actions bot added the stale Issues that have gone stale label Jul 17, 2022

github-actions bot closed this as completed Jul 24, 2022

ramonvermeulen added a commit to ramonvermeulen/dbt-core that referenced this issue Mar 17, 2023

Support configurable delimiter for seed files, default to comma (dbt-…

596856b

…labs#3990)

ramonvermeulen mentioned this issue Mar 17, 2023

Support configurable delimiter for seed files, default to comma (#3990) #7186

Merged

6 tasks

jtcohen6 reopened this Mar 20, 2023

jtcohen6 added help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors Team:Language and removed stale Issues that have gone stale Team:Execution labels Mar 20, 2023

jtcohen6 removed the Team:Language label Jul 19, 2023

QMalcolm closed this as completed in #7186 Aug 1, 2023

tlento mentioned this issue Nov 6, 2023

Update typing-extensions version to >=4.4 #9012

Merged

5 tasks

tlento mentioned this issue Feb 26, 2024

Update dbt-semantic-interfaces dependency to compatible range #9671

Merged
