[CT-399] [Bug] seed : value made of underscore-separated numbers gets detected as int ("integer out of range" error) #4740

mdutoo · 2022-02-17T15:08:13Z

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

On dbt 1.0.1 / Python 3.8.10 / Ubuntu 20.04, running dbt seed on the following file :
seed_underscore_detected_as_int_example.csv

raises the following error:
14:59:39 Database Error in seed seed_underscore_detected_as_int_example (seeds/seed_underscore_detected_as_int_example.csv)
14:59:39 integer out of range

The reason seems to be that the "29051_00040" value is wrongly read as "2905100040" and the column type wrongly detected as integer.

Workaround : enforcing the column type as text solves it.

Expected Behavior

Type of a column with value "29051_00040" should be detected as text.

Steps To Reproduce

See above

Relevant log output

No response

Environment

- OS: Ubuntu 20.04
- Python: 3.8.10
- dbt: 1.0.1

What database are you using dbt with?

postgres

Additional Context

No response

ChenyuLInx · 2022-03-01T21:36:52Z

Hey @mdutoo , thanks for submitting this, and sorry for getting back to you late!
I think this might be a known issue for seed.
We use agate to infer the column types for seeds here. And somehow it decided to cut the _ in the example you provided and convert it to an integer. One quick way to get around this is to define the column type in dbt_project.yml in your project repo following the guide here

This is what I had locally that solves the issue

seeds:
  toy: # you must include the project name
    seed_underscore_detected_as_int_example:
      +column_types:
        underscore_detected_as_int: varchar(32)

@jtcohen6 One thing I noticed is that we get the agate table from the CSV, but looks like we just used the column type but not the value from it when running the seed. Is this the intended behavior?

jtcohen6 · 2022-03-02T14:55:45Z

This is a weird one indeed. The cause for this issue seems to lie in agate.Table.from_csv():

ipdb> with open(path) as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            print(", ".join(row))
underscore_detected_as_int
29051_00040

ipdb> import agate
ipdb> table = agate.Table.from_csv(path)
ipdb> table.print_table()
| underscore_detect... |
| -------------------- |
|        2,905,100,040 |

iknox-fa · 2022-04-11T15:16:47Z

@jtcohen6 This ticket came up in BLG today, but it seems that adding an explicit +column_types to the project config fixes this specific case. Do we want to make explicit changes to the codebase so this happens automatically? We're concerned it may cause other mis-identified column types.

mdutoo · 2022-04-11T15:57:25Z

Hi, thanks for having a look at this bug.
My 2 cents:
I get it that it's not really in DBT's yard and so hard to patch nicely.
But the real problem is that it is a "gotcha" that makes developers loose time finding out the cause unless they have already seen it once, which is frustrating.

To give you an example, I've considered enforcing column types in all my seeds to prevent other similar cases, but that's quite some work, and so even wrote some scripts to automate that ! But that's making my dbt working differently from others, which is not a good practice. And BTW I don't like putting configuration in dbt_project.yml and prefer YAML config files.

My idea is that the nicest way is maybe to make all these types explicit, for instance in a crude manner by generating the appropriate configuration to dbt_project.yml , using an operation or depending on a config variable. This way, finding out the cause of the problem (and all other similar ones) and solving it (changing the enforced type) becomes really quick.
The problem is that it ideally requires some "project start / archetype generation" or at least code generation practices that are not already really in dbt for now.

So the easiest solution is probably merely to tell developers in the documentation that type autodetection is there to help developers start out faster, but beyond that enforcing column types is strongly advised because type autodetection it is notoriously not reliable (and then see how to make that easier as said).

jtcohen6 · 2022-04-21T10:29:45Z

@mdutoo Thanks for the response!

And BTW I don't like putting configuration in dbt_project.yml and prefer YAML config files.

Good news: Since v0.21, you can also specify this config in a one-off yaml file:

# seeds/config.yml
version: 2

seeds:
  - name: seed_underscore_detected_as_int_example
    config:
      column_types:
        underscore_detected_as_int: varchar(32)

So the easiest solution is probably merely to tell developers in the documentation that type autodetection is there to help developers start out faster, but beyond that enforcing column types is strongly advised because type autodetection it is notoriously not reliable (and then see how to make that easier as said).

I agree with this! Would you be up for opening an issue (or even a PR) over at https://github.com/dbt-labs/docs.getdbt.com/issues? The current docs on the column_types config understate this a bit:

Without specifying this, dbt will infer the datatype based on the column values in your seed file.

In the meantime, I'm going to close the issue. We know that agate can be... interesting, in undesirable ways: #3413

mdutoo added bug Something isn't working triage labels Feb 17, 2022

jtcohen6 added Team:Execution seeds Issues related to dbt's seed functionality labels Feb 17, 2022

ChenyuLInx removed the triage label Mar 1, 2022

leahwicz added the jira label Mar 21, 2022

github-actions bot changed the title ~~[Bug] seed : value made of underscore-separated numbers gets detected as int ("integer out of range" error)~~ [CT-399] [Bug] seed : value made of underscore-separated numbers gets detected as int ("integer out of range" error) Mar 21, 2022

jtcohen6 closed this as completed Apr 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CT-399] [Bug] seed : value made of underscore-separated numbers gets detected as int ("integer out of range" error) #4740

[CT-399] [Bug] seed : value made of underscore-separated numbers gets detected as int ("integer out of range" error) #4740

mdutoo commented Feb 17, 2022

ChenyuLInx commented Mar 1, 2022

jtcohen6 commented Mar 2, 2022

iknox-fa commented Apr 11, 2022

mdutoo commented Apr 11, 2022

jtcohen6 commented Apr 21, 2022

[CT-399] [Bug] seed : value made of underscore-separated numbers gets detected as int ("integer out of range" error) #4740

[CT-399] [Bug] seed : value made of underscore-separated numbers gets detected as int ("integer out of range" error) #4740

Comments

mdutoo commented Feb 17, 2022

Is there an existing issue for this?

Current Behavior

Expected Behavior

Steps To Reproduce

Relevant log output

Environment

What database are you using dbt with?

Additional Context

ChenyuLInx commented Mar 1, 2022

jtcohen6 commented Mar 2, 2022

iknox-fa commented Apr 11, 2022

mdutoo commented Apr 11, 2022

jtcohen6 commented Apr 21, 2022