BUG: Raise on parse int overflow #47167 #47168
Conversation
squash! TST: integer overflow on parsing with insufficient user dtype
@github-actions pre-commit
```diff
             result = _try_uint64(self.parser, i, start, end,
                                  na_filter, na_hashset)
             na_count = 0

         if result is not None and dtype != 'int64':
-            result = result.astype(dtype)
+            casted = result.astype(dtype)
+            if (casted == result).all():
```
This looks expensive. Can you run asvs?
Good point, thanks. I thought the same thing when I saw exactly this check being applied for parsing (and, more generally, setting) with the nullable integer extension dtypes:
pandas/core/arrays/integer.py, lines 53 to 55 in 3bf2cb1:

```python
casted = values.astype(dtype, copy=copy)
if (casted == values).all():
    return casted
```
Just to clarify: the additional check is never performed when parsing with an automatically inferred dtype (since TextReader.dtype_cast_order contains only int64 and no other integer dtypes), nor with a user-specified dtype of int64. I just committed another change that prevents unnecessarily running into this check when parsing with a user-specified dtype of uint64.
The check therefore only affects parsing with a user-specified integer dtype narrower than 64 bits. As far as I can see, none of the existing asvs covers this.
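For context, the round-trip check under discussion amounts to the following (a minimal NumPy sketch of the idea, outside the actual Cython parser):

```python
import numpy as np

# Values are first parsed into an int64 buffer.
result = np.array([1, 300, -7], dtype="int64")

# Downcast to the user-requested dtype, then verify nothing was lost:
# under uint8, 300 wraps to 44 and -7 wraps to 249, so the check fails.
casted = result.astype("uint8")
if not (casted == result).all():
    raise OverflowError("values do not fit into the requested dtype")
```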
Just to be sure, I ran some existing asvs that seemed most relevant and found no significant change:

```
before after ratio
[c355145c] [994a6345]
<raise-on-parse-int-overflow~7> <raise-on-parse-int-overflow>
6.97±0.3ms 7.46±0.7ms 1.07 io.csv.ParseDateComparison.time_read_csv_dayfirst(False)
3.43±0.1ms 3.39±0.04ms 0.99 io.csv.ParseDateComparison.time_read_csv_dayfirst(True)
6.93±0.3ms 7.02±0.1ms 1.01 io.csv.ParseDateComparison.time_to_datetime_dayfirst(False)
3.52±0.3ms 3.28±0.08ms 0.93 io.csv.ParseDateComparison.time_to_datetime_dayfirst(True)
17.2±0.2ms 17.3±0.06ms 1.00 io.csv.ParseDateComparison.time_to_datetime_format_DD_MM_YYYY(False)
3.33±0.2ms 3.35±0.5ms 1.01 io.csv.ParseDateComparison.time_to_datetime_format_DD_MM_YYYY(True)
1.74±0.01ms 1.74±0.09ms 1.00 io.csv.ReadCSVCachedParseDates.time_read_csv_cached(False, 'c')
2.82±0.03ms 2.81±0.03ms 0.99 io.csv.ReadCSVCachedParseDates.time_read_csv_cached(False, 'python')
1.77±0.08ms 1.79±0.02ms 1.01 io.csv.ReadCSVCachedParseDates.time_read_csv_cached(True, 'c')
2.85±0.02ms 2.87±0.1ms 1.01 io.csv.ReadCSVCachedParseDates.time_read_csv_cached(True, 'python')
25.8±0.8ms 26.3±0.8ms 1.02 io.csv.ReadCSVCategorical.time_convert_direct('c')
179±3ms 180±4ms 1.01 io.csv.ReadCSVCategorical.time_convert_direct('python')
44.0±1ms 44.4±1ms 1.01 io.csv.ReadCSVCategorical.time_convert_post('c')
169±4ms 168±3ms 0.99 io.csv.ReadCSVCategorical.time_convert_post('python')
21.8±1ms 21.5±1ms 0.99 io.csv.ReadCSVComment.time_comment('c')
22.5±0.5ms 21.8±0.7ms 0.97 io.csv.ReadCSVComment.time_comment('python')
24.9±0.9ms 25.1±2ms 1.01 io.csv.ReadCSVConcatDatetime.time_read_csv
13.1±1ms 12.0±0.3ms 0.92 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('')
9.37±0.3ms 9.38±0.1ms 1.00 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('0')
14.7±0.4ms 15.0±1ms 1.02 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('nan')
86.1±2ms 87.7±2ms 1.02 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'custom')
1.55±0.02ms 1.57±0.02ms 1.01 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'iso8601')
1.47±0.01ms 1.51±0.04ms 1.03 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'ymd')
4.39±0.06ms 4.45±0.1ms 1.01 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'custom')
1.80±0.05ms 1.86±0.06ms 1.03 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'iso8601')
2.00±0.01ms 2.00±0.02ms 1.00 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'ymd')
19.4±0.3ms 19.8±1ms 1.02 io.csv.ReadCSVEngine.time_read_bytescsv('c')
6.25±0.4ms 6.45±0.6ms 1.03 io.csv.ReadCSVEngine.time_read_bytescsv('pyarrow')
303±2ms 301±3ms 0.99 io.csv.ReadCSVEngine.time_read_bytescsv('python')
20.0±0.6ms 19.3±0.5ms 0.96 io.csv.ReadCSVEngine.time_read_stringcsv('c')
7.43±0.5ms 6.93±0.1ms 0.93 io.csv.ReadCSVEngine.time_read_stringcsv('pyarrow')
264±2ms 262±6ms 0.99 io.csv.ReadCSVEngine.time_read_stringcsv('python')
1.33±0.02ms 1.34±0.02ms 1.01 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'high')
2.23±0.02ms 2.22±0.02ms 1.00 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'round_trip')
1.35±0.06ms 1.36±0.03ms 1.01 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', None)
1.39±0.01ms 1.41±0.01ms 1.01 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'high')
1.41±0.01ms 1.44±0.04ms 1.02 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'round_trip')
1.47±0.2ms 1.43±0.02ms 0.98 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', None)
1.30±0.02ms 1.33±0.04ms 1.03 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'high')
2.21±0.06ms 2.25±0.1ms 1.02 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'round_trip')
1.35±0.07ms 1.33±0.03ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', None)
1.42±0.03ms 1.41±0.01ms 1.00 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', 'high')
1.41±0.01ms 1.47±0.2ms 1.05 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', 'round_trip')
1.40±0.01ms 1.41±0.01ms 1.01 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', None)
3.43±0.3ms 3.40±0.2ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', 'high')
3.45±0.2ms 3.34±0.03ms 0.97 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', 'round_trip')
3.41±0.03ms 3.42±0.2ms 1.00 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', None)
2.83±0.03ms 2.82±0.02ms 1.00 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '_', 'high')
2.81±0.01ms 2.82±0.03ms 1.00 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '_', 'round_trip')
2.83±0.09ms 2.80±0.02ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '_', None)
3.35±0.03ms 3.32±0.03ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', 'high')
3.52±0.2ms 3.55±0.2ms 1.01 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', 'round_trip')
3.50±0.1ms 3.76±0.2ms 1.08 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', None)
2.80±0.02ms 2.79±0.02ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', 'high')
2.84±0.07ms 2.83±0.07ms 1.00 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', 'round_trip')
2.81±0.03ms 2.81±0.04ms 1.00 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', None)
9.98±0.07ms 9.88±0.3ms 0.99 io.csv.ReadCSVIndexCol.time_read_csv_index_col
69.0±0.4ms 74.1±0.3ms 1.07 io.csv.ReadCSVMemMapUTF8.time_read_memmapped_utf8
0 0 n/a io.csv.ReadCSVMemoryGrowth.mem_parser_chunks('c')
0 0 n/a io.csv.ReadCSVMemoryGrowth.mem_parser_chunks('python')
1.45±0.01ms 1.56±0.06ms 1.07 io.csv.ReadCSVParseDates.time_baseline('c')
1.58±0.03ms 1.57±0.02ms 0.99 io.csv.ReadCSVParseDates.time_baseline('python')
1.78±0.01ms 1.81±0.06ms 1.01 io.csv.ReadCSVParseDates.time_multiple_date('c')
1.99±0.03ms 2.30±0.3ms ~1.15 io.csv.ReadCSVParseDates.time_multiple_date('python')
2.99±0.03ms 2.93±0.02ms 0.98 io.csv.ReadCSVParseSpecialDate.time_read_special_date('hm', 'c')
13.3±0.1ms 13.2±0.2ms 0.99 io.csv.ReadCSVParseSpecialDate.time_read_special_date('hm', 'python')
7.70±0.2ms 7.54±0.07ms 0.98 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mY', 'c')
40.0±0.8ms 39.9±2ms 1.00 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mY', 'python')
3.41±0.05ms 3.42±0.03ms 1.00 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mdY', 'c')
14.2±0.8ms 13.9±0.6ms 0.98 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mdY', 'python')
10.3±0.1ms 11.0±0.2ms 1.06 io.csv.ReadCSVSkipRows.time_skipprows(10000, 'c')
8.88±0.6ms 8.13±0.3ms 0.92 io.csv.ReadCSVSkipRows.time_skipprows(10000, 'pyarrow')
46.3±1ms 46.4±2ms 1.00 io.csv.ReadCSVSkipRows.time_skipprows(10000, 'python')
14.8±0.4ms 15.0±0.3ms 1.02 io.csv.ReadCSVSkipRows.time_skipprows(None, 'c')
8.15±0.6ms 8.34±0.3ms 1.02 io.csv.ReadCSVSkipRows.time_skipprows(None, 'pyarrow')
68.0±1ms 64.1±2ms 0.94 io.csv.ReadCSVSkipRows.time_skipprows(None, 'python')
14.1±0.3ms 14.5±0.3ms 1.03 io.csv.ReadCSVThousands.time_thousands(',', ',', 'c')
166±4ms 161±3ms 0.97 io.csv.ReadCSVThousands.time_thousands(',', ',', 'python')
11.4±0.2ms 11.6±0.2ms 1.02 io.csv.ReadCSVThousands.time_thousands(',', None, 'c')
58.4±0.9ms 58.0±1ms 0.99 io.csv.ReadCSVThousands.time_thousands(',', None, 'python')
13.4±0.2ms 13.8±0.2ms 1.02 io.csv.ReadCSVThousands.time_thousands('|', ',', 'c')
168±3ms 164±6ms 0.98 io.csv.ReadCSVThousands.time_thousands('|', ',', 'python')
11.3±0.2ms 11.8±0.4ms 1.04 io.csv.ReadCSVThousands.time_thousands('|', None, 'c')
59.2±1ms 57.7±0.9ms 0.97 io.csv.ReadCSVThousands.time_thousands('|', None, 'python')
3.38±0.6ms 3.55±0.2ms 1.05 io.csv.ReadUint64Integers.time_read_uint64
5.53±0.2ms 5.67±0.2ms 1.03 io.csv.ReadUint64Integers.time_read_uint64_na_values
5.62±0.4ms 5.43±0.2ms 0.96 io.csv.ReadUint64Integers.time_read_uint64_neg_values
114±2ms 112±1ms 0.98 io.csv.ToCSV.time_frame('long')
16.5±0.2ms 16.3±0.2ms 0.99 io.csv.ToCSV.time_frame('mixed')
91.3±1ms 94.2±3ms 1.03 io.csv.ToCSV.time_frame('wide')
7.71±0.4ms 7.79±0.3ms 1.01 io.csv.ToCSVDatetime.time_frame_date_formatting
7.15±0.06ms 7.26±0.08ms 1.02 io.csv.ToCSVDatetimeBig.time_frame(1000)
66.7±1ms 66.7±0.6ms 1.00 io.csv.ToCSVDatetimeBig.time_frame(10000)
671±5ms 668±6ms 1.00 io.csv.ToCSVDatetimeBig.time_frame(100000)
380±2ms 379±7ms 1.00 io.csv.ToCSVDatetimeIndex.time_frame_date_formatting_index
142±1ms 142±2ms 1.00 io.csv.ToCSVDatetimeIndex.time_frame_date_no_format_index
711±9ms 715±9ms 1.01 io.csv.ToCSVIndexes.time_head_of_multiindex
721±5ms 718±6ms 1.00 io.csv.ToCSVIndexes.time_multiindex
580±8ms 579±6ms 1.00 io.csv.ToCSVIndexes.time_standard_index
238±5ms 234±4ms 0.98 io.csv.ToCSVMultiIndexUnusedLevels.time_full_frame
21.5±0.1ms 22.1±0.3ms 1.03 io.csv.ToCSVMultiIndexUnusedLevels.time_single_index_frame
22.5±0.5ms 22.6±0.2ms 1.00 io.csv.ToCSVMultiIndexUnusedLevels.time_sliced_frame
798±60ms 747±10ms 0.94 io.excel.ReadExcel.time_read_excel('odf')
196±10ms 180±3ms 0.92 io.excel.ReadExcel.time_read_excel('openpyxl')
43.5±2ms 41.0±1ms 0.94 io.excel.ReadExcel.time_read_excel('xlrd')
439±20ms 420±7ms 0.96 io.excel.WriteExcel.time_write_excel('openpyxl')
233±10ms 230±10ms 0.99 io.excel.WriteExcel.time_write_excel('xlsxwriter')
220±8ms 227±4ms 1.03 io.excel.WriteExcel.time_write_excel('xlwt')
```
I added a simple asv benchmark that triggers the relevant check. It is likewise unaffected:

```
before after ratio
[c355145c] [994a6345]
<raise-on-parse-int-overflow~7> <raise-on-parse-int-overflow>
1.34±0.1ms 1.31±0.04ms 0.98 io.csv.ReadUint8Integers.time_read_uint8
```
A separate timeit on the comparison `(arr1 == arr2).all()` shows that it takes ~5 µs, compared to ~1 ms for the total read_csv.
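For reference, a rough way to reproduce such a measurement (illustrative only; the array size is an assumption, not the benchmark's actual data):

```python
import timeit
import numpy as np

arr1 = np.arange(10_000, dtype="int64")
arr2 = arr1.astype("uint8")  # lossy for values >= 256

# Time only the round-trip comparison that the new code adds.
per_call = timeit.timeit(lambda: (arr1 == arr2).all(), number=1_000) / 1_000
print(f"{per_call * 1e6:.1f} µs per comparison")
```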
Some additional remarks: to run the asvs in io.csv I had to:
- Adapt asv.conf.json to use Python 3.9 instead of 3.8 in order to get build isolation + the conda compilers env working (the same problem as discussed e.g. in https://stackoverflow.com/questions/71340058/conda-does-not-look-for-libpthread-and-libpthread-nonshared-at-the-right-place-w)
- Bump the Cython version to 0.29.30 in asv.conf.json (see `min_cython_ver` in setup.py)
- Fix the existing `ReadCSVIndexCol.time_read_csv_index_col` benchmark (see 61a36a5)
@jbrockmendel should `_dtype_can_hold_range` be used here to check lossless int64 conversion?
`dtype_can_hold_range` is specific to range objects. Looks like we have an ndarray here?
Yes, comparing ndarrays here. I could change it to `np.array_equal` for readability, but apart from some safety boilerplate the same check is performed there: https://github.com/numpy/numpy/blob/50a74fb65fc752e77a2f9e9e2b7227629c2ba953/numpy/core/numeric.py#L2468
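A quick illustration of that equivalence (assuming equal shapes, which holds here):

```python
import numpy as np

result = np.array([1, 300], dtype="int64")
casted = result.astype("uint8")  # 300 wraps to 44

# Both spellings detect the lossy cast the same way.
assert bool((casted == result).all()) == np.array_equal(casted, result)
```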
```python
if user_dtype and dtype == 'uint64':
    do_try_uint64 = True
else:
    try:
```
The pattern here is to `_try_<dtype>`, then if that fails, try another one. Can you just do that here (rather than all of this if/then logic)? E.g. add a `_try_uint64` if needed.
I tried to stay with the pattern `_try_int64` -> if fail -> `_try_uint64`, but shortcut the two cases with `user_dtype` and `dtype in ["int64", "uint64"]`, where we can fail right after the specific `_try_<dtype>`. I added another `_try_uint64` instead of the `do_try_uint64` flag. Is that okay?
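For concreteness, the flow described here looks roughly like this (Python pseudocode; the real code is Cython in parsers.pyx, and the helper names and signatures are assumptions mirroring the discussion):

```python
def _parse_integer_column(parser, i, start, end, user_dtype, dtype):
    # Shortcut: user explicitly asked for uint64, so skip the int64 attempt.
    if user_dtype and dtype == "uint64":
        return _try_uint64(parser, i, start, end)

    result = _try_int64(parser, i, start, end)
    if result is not None:
        return result

    # Shortcut: user explicitly asked for int64, so a uint64 retry cannot help.
    if user_dtype and dtype == "int64":
        raise OverflowError("values do not fit into int64")

    # Inferred dtype: fall back to uint64 for large positive values.
    return _try_uint64(parser, i, start, end)
```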
Hm I agree with @jreback, this is hard to read
I also agree, and added a `_try_uint64` in commit 927ddad like @jreback suggested. Were you also referring to the former version with `do_try_uint64`, @phofl? The latest version is https://github.com/SandroCasagrande/pandas/blob/5896e017ca2960e0d535c8c0a0b9db978377bc91/pandas/_libs/parsers.pyx#L1182-L1198. Sorry if I did not signal clearly that I had made changes and that the newest version should be reviewed again. Can you please have a look, @jreback or @phofl?
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.
I am still interested in this. I just added changes that address requests raised by @jreback and @simonjayhawkins. Local tests run successfully. I decided to add the check for engine="python", as I outlined above, inside the base parser (with the only difference being that the check is also necessary for user dtype uint64). Concerning the question I raised about having an option to check for overflows in

Please reopen and review.
Thanks @mroeschke for reopening. |
I just recognized that the changes in this PR have a good side effect. Up to now, implicit lossy coercion of float to int could occur for some read_csv engine/dtype variants, while other, similar variants already raised errors.
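For illustration, the calls at issue are of roughly this shape (a hypothetical example, not the exact snippets from the original comment):

```python
from io import StringIO
import pandas as pd

data = "a\n2.5\n"

# Pre-PR, depending on the engine and the integer width, a call like this
# could silently truncate 2.5 to 2 instead of raising, while similar
# variants with another engine or dtype already raised.
df = pd.read_csv(StringIO(data), dtype={"a": "int16"}, engine="c")
```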
The changes in this PR make all of the above variants raise an error, so the "c" and "python" engines behave consistently. Two remarks concerning this:
doc/source/whatsnew/v1.5.0.rst (outdated):

```diff
@@ -1134,6 +1134,8 @@ I/O
 - Bug in :func:`read_parquet` with ``use_nullable_dtypes=True`` where ``float64`` dtype was returned instead of nullable ``Float64`` dtype (:issue:`45694`)
 - Bug in :meth:`DataFrame.to_json` where ``PeriodDtype`` would not make the serialization roundtrip when read back with :meth:`read_json` (:issue:`44720`)
 - Bug in :func:`read_xml` when reading XML files with Chinese character tags and would raise ``XMLSyntaxError`` (:issue:`47902`)
+- Bug in :func:`read_csv` with specified (non-extension) integer ``dtype`` can cause silent overflow or unexpected return dtype (:issue:`47167`)
```
Why "non-extension"? If you want to reference numpy dtypes you can say "numpy integer dtype".
Okay, did that. Should I also move it to target v1.6.0 already?
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.