fix(csv): Ensure df_to_escaped_csv does not coerce integer columns to float #20151
Codecov Report
@@ Coverage Diff @@
## master #20151 +/- ##
==========================================
- Coverage 66.45% 66.29% -0.17%
==========================================
Files 1721 1721
Lines 64513 64518 +5
Branches 6806 6806
==========================================
- Hits 42875 42772 -103
- Misses 19906 20014 +108
Partials 1732 1732
LGTM
Co-authored-by: John Bodley <john.bodley@airbnb.com> (cherry picked from commit 97ce920)
SUMMARY
When exporting a SQL Lab result set with an integer column containing `NULL` values, NumPy/Pandas coerces them to floats during the `pd.DataFrame.applymap(...)` call, given that `NaN` is actually a float. Type coercion in Pandas is overly magical and often undesirable. The fix is somewhat yuck as well: it iterates over the columns and rows of the DataFrame and escapes only those cells which are actually `str`.
I tried a number of more performant/vectorized solutions, but at the end of the day they rely on `numpy.array`s, which require a priori declared types; otherwise coercion is performed. Note the `numpy.dtype(object)` data type is used to handle both strings and integers (as a byproduct of how data is exported from PyArrow), given that special handling is required for dealing with missing values with integers per here. The TL;DR is this is all very unpleasant.
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
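In lieu of screenshots, the before/after behavior can be illustrated with a small, self-contained sketch. It is not the code from this PR: `escape_value` and `escape_str_cells` are hypothetical stand-ins invented here, and the escaping rule (prefixing a single quote to cells that start with a formula character) is only an assumption about what the escaper might do. The sketch shows that an integer column with a missing value is inferred as `float64`, and that escaping only the `str` cells of `object` columns leaves the integers alone.

```python
import numpy as np
import pandas as pd

# BEFORE: NaN is a float, so an integer column containing a missing value
# is inferred as float64 and 1 is rendered as 1.0 in the CSV.
coerced = pd.DataFrame({"count": [1, None]})
coerced_csv = coerced.to_csv(index=False)


def escape_value(value: str) -> str:
    """Hypothetical escaper (an assumption, not Superset's implementation):
    prefix a single quote when the cell starts with a formula character."""
    if value and value[0] in ("=", "+", "-", "@"):
        return "'" + value
    return value


def escape_str_cells(df: pd.DataFrame) -> pd.DataFrame:
    """Escape only the str cells of object columns; leave everything else as-is.

    Assumes column names are unique.
    """
    df = df.copy()
    for name in df.columns:
        column = df[name]
        if column.dtype == np.dtype(object):
            # Rebuild the column with an explicit object dtype so pandas
            # does not re-infer (and coerce) the values.
            df[name] = pd.Series(
                [escape_value(v) if isinstance(v, str) else v for v in column],
                index=column.index,
                dtype=object,
            )
    return df


# AFTER: object columns keep Python ints next to missing values.
df = pd.DataFrame({"formula": ["=1+1", "ok"], "count": [1, None]}, dtype=object)
escaped = escape_str_cells(df)
escaped_csv = escaped.to_csv(index=False)
```

The per-cell loop is slower than a vectorized approach, but it avoids handing the values back to pandas for type inference, which is where the float coercion comes from.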
TESTING INSTRUCTIONS
Added unit tests.
ADDITIONAL INFORMATION