Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-399] [Bug] seed : value made of underscore-separated numbers gets detected as int ("integer out of range" error) #4740

Closed
1 task done
mdutoo opened this issue Feb 17, 2022 · 5 comments
Labels
bug Something isn't working jira seeds Issues related to dbt's seed functionality

Comments

@mdutoo
Copy link

mdutoo commented Feb 17, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

On dbt 1.0.1 / Python 3.8.10 / Ubuntu 20.04, running dbt seed on the following file :
seed_underscore_detected_as_int_example.csv

raises the following error:
14:59:39 Database Error in seed seed_underscore_detected_as_int_example (seeds/seed_underscore_detected_as_int_example.csv)
14:59:39 integer out of range

The reason seems to be that the "29051_00040" value is wrongly read as "2905100040" and the column type wrongly detected as integer.

Workaround : enforcing the column type as text solves it.

Expected Behavior

Type of a column with value "29051_00040" should be detected as text.

Steps To Reproduce

See above

Relevant log output

No response

Environment

- OS: Ubuntu 20.04
- Python: 3.8.10
- dbt: 1.0.1

What database are you using dbt with?

postgres

Additional Context

No response

@mdutoo mdutoo added bug Something isn't working triage labels Feb 17, 2022
@jtcohen6 jtcohen6 added Team:Execution seeds Issues related to dbt's seed functionality labels Feb 17, 2022
@ChenyuLInx
Copy link
Contributor

Hey @mdutoo , thanks for submitting this, and sorry for getting back to you late!
I think this might be a known issue for seed.
We use agate to infer the column types for seeds here. And somehow it decided to cut the _ in the example you provided and convert it to an integer. One quick way to get around this is to define the column type in dbt_project.yml in your project repo following the guide here

This is what I had locally that solves the issue

seeds:
  toy: # you must include the project name
    seed_underscore_detected_as_int_example:
      +column_types:
        underscore_detected_as_int: varchar(32)

@jtcohen6 One thing I noticed is that we get the agate table from the CSV, but looks like we just used the column type but not the value from it when running the seed. Is this the intended behavior?

@ChenyuLInx ChenyuLInx removed the triage label Mar 1, 2022
@jtcohen6
Copy link
Contributor

jtcohen6 commented Mar 2, 2022

This is a weird one indeed. The cause for this issue seems to lie in agate.Table.from_csv():

ipdb> with open(path) as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            print(", ".join(row))
underscore_detected_as_int
29051_00040
ipdb> import agate
ipdb> table = agate.Table.from_csv(path)
ipdb> table.print_table()
| underscore_detect... |
| -------------------- |
|        2,905,100,040 |

@leahwicz leahwicz added the jira label Mar 21, 2022
@github-actions github-actions bot changed the title [Bug] seed : value made of underscore-separated numbers gets detected as int ("integer out of range" error) [CT-399] [Bug] seed : value made of underscore-separated numbers gets detected as int ("integer out of range" error) Mar 21, 2022
@iknox-fa
Copy link
Contributor

@jtcohen6 This ticket came up in BLG today, but it seems that adding an explicit +column_types to the project config fixes this specific case. Do we want to make explicit changes to the codebase so this happens automatically? We're concerned it may cause other mis-identified column types.

@mdutoo
Copy link
Author

mdutoo commented Apr 11, 2022

Hi, thanks for having a look at this bug.
My 2 cents:
I get it that it's not really in DBT's yard and so hard to patch nicely.
But the real problem is that it is a "gotcha" that makes developers loose time finding out the cause unless they have already seen it once, which is frustrating.

To give you an example, I've considered enforcing column types in all my seeds to prevent other similar cases, but that's quite some work, and so even wrote some scripts to automate that ! But that's making my dbt working differently from others, which is not a good practice. And BTW I don't like putting configuration in dbt_project.yml and prefer YAML config files.

My idea is that the nicest way is maybe to make all these types explicit, for instance in a crude manner by generating the appropriate configuration to dbt_project.yml , using an operation or depending on a config variable. This way, finding out the cause of the problem (and all other similar ones) and solving it (changing the enforced type) becomes really quick.
The problem is that it ideally requires some "project start / archetype generation" or at least code generation practices that are not already really in dbt for now.

So the easiest solution is probably merely to tell developers in the documentation that type autodetection is there to help developers start out faster, but beyond that enforcing column types is strongly advised because type autodetection it is notoriously not reliable (and then see how to make that easier as said).

@jtcohen6
Copy link
Contributor

@mdutoo Thanks for the response!

And BTW I don't like putting configuration in dbt_project.yml and prefer YAML config files.

Good news: Since v0.21, you can also specify this config in a one-off yaml file:

# seeds/config.yml
version: 2

seeds:
  - name: seed_underscore_detected_as_int_example
    config:
      column_types:
        underscore_detected_as_int: varchar(32)

So the easiest solution is probably merely to tell developers in the documentation that type autodetection is there to help developers start out faster, but beyond that enforcing column types is strongly advised because type autodetection it is notoriously not reliable (and then see how to make that easier as said).

I agree with this! Would you be up for opening an issue (or even a PR) over at https://github.com/dbt-labs/docs.getdbt.com/issues? The current docs on the column_types config understate this a bit:

Without specifying this, dbt will infer the datatype based on the column values in your seed file.

In the meantime, I'm going to close the issue. We know that agate can be... interesting, in undesirable ways: #3413

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working jira seeds Issues related to dbt's seed functionality
Projects
None yet
Development

No branches or pull requests

5 participants