Prevent Agate from coercing values in query result sets #3499

drewbanin · 2021-06-26T17:07:33Z

resolves #2984, #2666
related to #3413

Description

dbt uses Agate to represent internal dataframes (eg. seed file tabular representations, or in-memory query result sets). Agate's type coercions are interesting and helpful for identifying column types in csv files, but they can lead to incorrect/confusing results when applied to query result sets. This PR prevents type coercion from happening when marshaling a query resultset into an Agate table.

Example

I used the following macro to debug / reproduce some type coercion issues locally (using Snowflake):

{% macro debug() %}

    {% set query %}

        select '0005' as stringnum, 'false' as stringbool, 10 as intfield, false as boolfield, '01T00000aabbccdd' as dtstring
        union all
        select '0006' as stringnum, 'true' as stringbool, 20 as intfield, true as boolfield, '01T00000aabbccde' as dtstring

    {% endset %}

    {% set res = run_query(query) %}
    {% do res.print_table() %}

{% endmacro %}

Before this change:

STRINGNUM (bad)	STRINGBOOL (bad)	INTFIELD	BOOLFIELD	DTSTRING (would be funny if it wasn't so bad)
5	False	10	False	0101-01-01 00:00:00
6	True	20	True	0101-01-01 00:00:00

After this change:

STRINGNUM	STRINGBOOL	INTFIELD	BOOLFIELD	DTSTRING
0005	false	10	False	01T00000aabbccdd
0006	true	20	True	01T00000aabbccde

About the implementation

This PR assumes that we never want to coerce types in tables returned by the database, which I think is correct, but would welcome any holes that can be poked in that assumption. I was originally going to try to use the cursor.description provided by dbapi implementation to map columns in a query resultset to Python types and then use those Python types to "force" types for all of the columns in the Agate table. In practice, this would require us to add some logic to every adapter plugin which felt both onerous and unnecessary. Instead, we can just peek at the actual query results and "force" string columns in the resultset to be retained as strings in the Agate table that dbt produces. This works because:

If the database is returning a string value in a field, we can feel 100% confident that the right type for the column in Agate is a string
Columns in a database result cannot be of mixed types, so if we find one string value in a column, we know the whole column should be treated as a string
Agate won't coerce a not-string into an unexpected type (eg. an int returned by the database can't become a bool in Agate)

What else

I tried to make a surgical change here. By putting this logic in the agate_helper client, all adapter plugins should be able to take advantage of the change. Longer term, I think we should consider making a less-surgical change that removes Agate entirely. Let's pick up that discussion in #3413 though :)

Checklist

I have signed the CLA
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

drewbanin · 2021-06-26T17:11:32Z

core/dbt/clients/agate_helper.py

    for _row in data:
        row = []
-        for value in list(_row.values()):
+        for col_name in column_names:


I think our existing implementation assumed that the order of keys in a dictionary was guaranteed... but IIRC that's not true for all versions of Python. My Python is rusty, so maybe this change is not necessary. I can revert if need be :)

Not my area, but I do remember hearing that this changed in recent versions of python. Indeed this blog post suggests that dictionaries are ordered by insertion as of python 3.6, and that this is officially guaranteed as of python 3.7. dbt currently requires py36 or higher.

drewbanin · 2021-06-26T17:12:32Z

core/dbt/clients/agate_helper.py

-                row.append(value)
+                # Represent container types as json strings
+                value = json.dumps(value, cls=dbt.utils.JSONEncoder)
+                text_only_columns.add(col_name)


I was originally just going to peek at the first row in the result set, but it's possible that the value in a string column returned by the database could be NULL. This implementation looks at all of the values in the result set, which we're actually already doing anyway to json-encode dict/list/tuple types.

drewbanin · 2021-06-26T17:19:00Z

CHANGELOG.md

@@ -21,6 +21,7 @@ Contributors:
 ### Fixes

 - Handle quoted values within test configs, such as `where` ([#3458](https://github.com/fishtown-analytics/dbt/issues/3458), [#3459](https://github.com/fishtown-analytics/dbt/pull/3459))
+- Fix type coercion issues when fetching query result sets ([#2984](https://github.com/fishtown-analytics/dbt/issues/2984), [#3499](https://github.com/fishtown-analytics/dbt/pull/3499))


Is 0.20.0 the right home for this? Or should I slot it into 0.21.0?

I'd prefer 0.21, rather than sneaking it in during RC time. Even though it is technically a bug fix, I imagine it might be breaking behavior for folks who do a lot of custom query-results munging with Jinja + agate.

gshank

Looks good to me!

…agate-undesirable-casting

drewbanin · 2021-06-30T00:22:17Z

Right on, thanks. I updated the changelog to include this fix in the 0.21 section. What's the right base branch for this PR? Is it develop or something else?

jtcohen6 · 2021-06-30T01:49:19Z

develop is right!

cla-bot bot added the cla:yes label Jun 26, 2021

drewbanin temporarily deployed to Redshift June 26, 2021 17:13 Inactive

drewbanin temporarily deployed to Postgres June 26, 2021 17:13 Inactive

drewbanin had a problem deploying to Bigquery June 26, 2021 17:13 Error

drewbanin had a problem deploying to Bigquery June 26, 2021 17:13 Failure

drewbanin temporarily deployed to Snowflake June 26, 2021 17:13 Inactive

drewbanin force-pushed the fix/agate-undesirable-casting branch 2 times, most recently from 621647d to a5a2dc7 Compare June 26, 2021 17:18

drewbanin commented Jun 26, 2021

View reviewed changes

drewbanin temporarily deployed to Redshift June 26, 2021 17:23 Inactive

drewbanin temporarily deployed to Bigquery June 26, 2021 17:23 Inactive

drewbanin temporarily deployed to Postgres June 26, 2021 17:23 Inactive

drewbanin temporarily deployed to Snowflake June 26, 2021 17:23 Inactive

(#2984) Prevent Agate from coercing values in query result sets

1ad1c83

drewbanin force-pushed the fix/agate-undesirable-casting branch from a5a2dc7 to 1ad1c83 Compare June 26, 2021 17:24

drewbanin marked this pull request as ready for review June 26, 2021 17:27

drewbanin temporarily deployed to Postgres June 26, 2021 17:29 Inactive

drewbanin had a problem deploying to Bigquery June 26, 2021 17:29 Error

drewbanin had a problem deploying to Bigquery June 26, 2021 17:29 Failure

drewbanin temporarily deployed to Redshift June 26, 2021 17:29 Inactive

drewbanin temporarily deployed to Snowflake June 26, 2021 17:29 Inactive

drewbanin temporarily deployed to Postgres June 26, 2021 19:42 Inactive

drewbanin temporarily deployed to Redshift June 26, 2021 19:42 Inactive

jtcohen6 requested a review from leahwicz June 28, 2021 22:59

leahwicz requested a review from gshank June 29, 2021 12:20

gshank approved these changes Jun 29, 2021

View reviewed changes

drewbanin added 2 commits June 29, 2021 20:20

Merge branch 'develop' of github.com:fishtown-analytics/dbt into fix/…

eba3fd2

…agate-undesirable-casting

Move to 0.21 changelog section

85fe32b

drewbanin temporarily deployed to Bigquery June 30, 2021 00:28 Inactive

drewbanin temporarily deployed to Snowflake June 30, 2021 00:28 Inactive

drewbanin temporarily deployed to Postgres June 30, 2021 00:28 Inactive

drewbanin temporarily deployed to Redshift June 30, 2021 00:28 Inactive

drewbanin merged commit 83b98c8 into develop Jul 1, 2021

drewbanin deleted the fix/agate-undesirable-casting branch July 1, 2021 14:53

drewbanin mentioned this pull request Jul 1, 2021

rpc returns empty string values as null values #2666

Closed

5 tasks

clrcrl mentioned this pull request Jul 8, 2021

dbt docs generate fails when a column name is " " #2564

Closed

5 tasks

jtcohen6 mentioned this pull request Aug 9, 2021

correctly parsing count(*) as integer instead of string dbt-labs/dbt-adapter-tests#22

Closed

jtcohen6 mentioned this pull request Aug 17, 2021

run_sql rpc method casts number-like strings to numbers #3756

Closed

5 tasks

jtcohen6 mentioned this pull request Sep 26, 2021

IDE displaying empty strings as NULL #3950

Closed

5 tasks

jtcohen6 mentioned this pull request Oct 5, 2021

Remove agate dependency? #3413

Closed

2rara mentioned this pull request Oct 11, 2021

modules.datetime.date regared as str #4036

Closed

5 tasks

jtcohen6 mentioned this pull request Oct 13, 2021

[CT-497] rpc silently rounds numeric results >16 digits dbt-labs/dbt-rpc#50

Closed

This was referenced Nov 7, 2021

Update agate requirement from <1.6.2,>=1.6 to >=1.6,<1.6.4 in /core #3585

Merged

[Bug] Cannot update views in numerically named database #4215

Closed

jtcohen6 mentioned this pull request Dec 17, 2021

Bools coerced to 0/1 in query results #4511

Closed

ChenyuLInx mentioned this pull request Feb 10, 2022

[CT-176] [Bug] Github source freshness check error #4698

Closed

1 task

ntnhaatj mentioned this pull request Feb 17, 2023

fix: coerce some types from query result set cloudera/dbt-spark-livy#52

Closed

6 tasks

jtcohen6 mentioned this pull request Jul 20, 2023

[CT-2853] dbt show --output json converts integers to scale-1 numerics #8167

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent Agate from coercing values in query result sets #3499

Prevent Agate from coercing values in query result sets #3499

drewbanin commented Jun 26, 2021 •

edited

Loading

drewbanin Jun 26, 2021

jtcohen6 Jun 28, 2021

drewbanin Jun 26, 2021

drewbanin Jun 26, 2021

jtcohen6 Jun 28, 2021

gshank left a comment

drewbanin commented Jun 30, 2021

jtcohen6 commented Jun 30, 2021

Prevent Agate from coercing values in query result sets #3499

Prevent Agate from coercing values in query result sets #3499

Conversation

drewbanin commented Jun 26, 2021 • edited Loading

Description

Example

About the implementation

What else

Checklist

drewbanin Jun 26, 2021

Choose a reason for hiding this comment

jtcohen6 Jun 28, 2021

Choose a reason for hiding this comment

drewbanin Jun 26, 2021

Choose a reason for hiding this comment

drewbanin Jun 26, 2021

Choose a reason for hiding this comment

jtcohen6 Jun 28, 2021

Choose a reason for hiding this comment

gshank left a comment

Choose a reason for hiding this comment

drewbanin commented Jun 30, 2021

jtcohen6 commented Jun 30, 2021

drewbanin commented Jun 26, 2021 •

edited

Loading