Prevent Agate from coercing values in query result sets #3499
for _row in data:
row = []
for value in list(_row.values()):
for col_name in column_names:
I think our existing implementation assumed that the order of keys in a dictionary was guaranteed... but IIRC that's not true for all versions of Python. My Python is rusty, so maybe this change is not necessary. I can revert if need be :)
Not my area, but I do remember hearing that this changed in recent versions of Python. Indeed, this blog post suggests that dictionaries are ordered by insertion as of Python 3.6, and that this is officially guaranteed as of Python 3.7. dbt currently requires py36 or higher.
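To make the ordering point concrete, here is a minimal sketch in plain Python (not dbt code; the sample rows and the `column_names` list are made up for illustration). Looking values up by column name yields the same row layout regardless of each dict's key order, whereas `list(_row.values())` follows whatever order the keys were inserted in.

```python
# Hypothetical sample data: two rows whose keys were inserted in different orders.
column_names = ["id", "name"]
data = [
    {"id": 1, "name": "a"},
    {"name": "b", "id": 2},
]

# Relies on each dict's insertion order (a language guarantee since Python 3.7).
by_values = [list(_row.values()) for _row in data]

# Looks each value up by name, so the column order always matches column_names.
by_name = [[_row[col_name] for col_name in column_names] for _row in data]

print(by_values)  # [[1, 'a'], ['b', 2]]
print(by_name)    # [[1, 'a'], [2, 'b']]
```

So even on Python versions where insertion order is guaranteed, iterating over `column_names` is the more defensive choice, because it does not assume every row dict was built in the same key order.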
row.append(value)
# Represent container types as json strings
value = json.dumps(value, cls=dbt.utils.JSONEncoder)
text_only_columns.add(col_name)
I was originally just going to peek at the first row in the result set, but it's possible that the value in a string column returned by the database could be NULL. This implementation looks at all of the values in the result set, which we're actually already doing anyway to json-encode dict/list/tuple types.
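The full scan described above can be sketched as follows. This is a simplified, standard-library-only illustration modeled on the diff fragments in this review, not the exact dbt implementation: the function name `find_text_only_columns` is hypothetical, and plain `json.dumps` stands in for the `dbt.utils.JSONEncoder` used in the PR.

```python
import json


def find_text_only_columns(data, column_names):
    """Return (rows, text_only_columns) for a list of dict rows.

    Hypothetical helper for illustration only.
    """
    rows = []
    text_only_columns = set()
    for _row in data:
        row = []
        for col_name in column_names:
            value = _row[col_name]
            if isinstance(value, (dict, list, tuple)):
                # Represent container types as JSON strings.
                value = json.dumps(value)
                text_only_columns.add(col_name)
            elif isinstance(value, str):
                # Any string value marks the whole column as text.
                text_only_columns.add(col_name)
            row.append(value)
        rows.append(row)
    return rows, text_only_columns


# The first row holds NULL (None) in the string column, so peeking at only
# the first row would miss that "zip" is a text column.
data = [
    {"id": 1, "zip": None, "tags": ["x", "y"]},
    {"id": 2, "zip": "02139", "tags": []},
]
rows, text_only = find_text_only_columns(data, ["id", "zip", "tags"])
print(sorted(text_only))  # ['tags', 'zip']
print(rows[0])            # [1, None, '["x", "y"]']
```

Because the loop already visits every value to JSON-encode container types, recording string columns in the same pass adds no extra traversal of the result set.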
CHANGELOG.md
@@ -21,6 +21,7 @@ Contributors:
### Fixes

- Handle quoted values within test configs, such as `where` ([#3458](https://github.com/fishtown-analytics/dbt/issues/3458), [#3459](https://github.com/fishtown-analytics/dbt/pull/3459))
- Fix type coercion issues when fetching query result sets ([#2984](https://github.com/fishtown-analytics/dbt/issues/2984), [#3499](https://github.com/fishtown-analytics/dbt/pull/3499))
Is 0.20.0 the right home for this? Or should I slot it into 0.21.0?
I'd prefer 0.21, rather than sneaking it in during RC time. Even though it is technically a bug fix, I imagine it might be breaking behavior for folks who do a lot of custom query-results munging with Jinja + agate.
Looks good to me!
…agate-undesirable-casting
Right on, thanks. I updated the changelog to include this fix in the 0.21 section. What's the right base branch for this PR? Is it |
|
resolves #2984, #2666
related to #3413
Description
dbt uses Agate to represent internal dataframes (e.g. seed file tabular representations, or in-memory query result sets). Agate's type coercions are interesting and helpful for identifying column types in CSV files, but they can lead to incorrect or confusing results when applied to query result sets. This PR prevents type coercion from happening when marshaling a query result set into an Agate table.
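As a rough illustration of this class of problem, the sketch below uses a toy column-type guesser. It is a hypothetical stand-in written for this example and is not Agate's actual inference logic; the function name `coerce_column` and the sample values are assumptions. It shows how inferring types from string contents can change values the database returned as text, and how forcing a column to stay text avoids that.

```python
def coerce_column(values, force_text=False):
    """Toy stand-in for per-column type inference (NOT Agate's algorithm).

    If every non-null value parses as an integer, the column is cast to int,
    unless the caller forces the column to stay text.
    """
    if force_text:
        return list(values)
    try:
        return [None if v is None else int(v) for v in values]
    except (TypeError, ValueError):
        # At least one value is not an integer, so leave the column alone.
        return list(values)


zip_codes = ["02139", "10001", None]  # strings returned by the database

print(coerce_column(zip_codes))                   # [2139, 10001, None]
print(coerce_column(zip_codes, force_text=True))  # ['02139', '10001', None]
```

Inference of this kind is reasonable for a CSV file, where everything starts out as text, but a query result set already carries the types the database chose, so re-guessing them can only lose information.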
Example
I used the following macro to debug / reproduce some type coercion issues locally (using Snowflake):
Before this change:
After this change:
About the implementation
This PR assumes that we never want to coerce types in tables returned by the database. I think that is correct, but I would welcome any holes that can be poked in that assumption. I was originally going to use the cursor.description provided by the DB-API implementation to map columns in a query result set to Python types, and then use those Python types to "force" types for all of the columns in the Agate table. In practice, this would require us to add some logic to every adapter plugin, which felt both onerous and unnecessary. Instead, we can just peek at the actual query results and "force" string columns in the result set to be retained as strings in the Agate table that dbt produces. This works because:
int returned by the database can't become a bool in Agate)
What else
I tried to make a surgical change here. By putting this logic in the agate_helper client, all adapter plugins should be able to take advantage of the change. Longer term, I think we should consider making a less-surgical change that removes Agate entirely. Let's pick up that discussion in #3413 though :)
Checklist
I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.