BUG: Raise on parse int overflow #47167 (PR #47168)

Conversation

SandroCasagrande (Contributor) commented May 30, 2022

SandroCasagrande (Contributor Author):

@github-actions pre-commit

result = _try_uint64(self.parser, i, start, end,
                     na_filter, na_hashset)
na_count = 0

if result is not None and dtype != 'int64':
-    result = result.astype(dtype)
+    casted = result.astype(dtype)
+    if (casted == result).all():
Member:

This looks expensive. Can you run asvs?

SandroCasagrande (Contributor Author):

Good point, thanks. I thought the same thing when I saw exactly this check being applied for parsing (and, more generally, setting) with the nullable integer extension dtypes:

casted = values.astype(dtype, copy=copy)
if (casted == values).all():
    return casted

However, I could not come up with a better version (one that does not repeat code or resort to fused types in _try_int64). And in the end, the impact is negligible (see below).
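For illustration, a minimal standalone sketch of this round-trip check (the function name and the choice of OverflowError are assumptions for the sketch, not the pandas internals):

import numpy as np

def safe_int_cast(values: np.ndarray, dtype: str) -> np.ndarray:
    # Cast, then verify nothing changed: values that wrapped around
    # during the cast no longer compare equal to the originals.
    casted = values.astype(dtype)
    if (casted == values).all():
        return casted
    raise OverflowError(f"values do not fit into {dtype}")

print(safe_int_cast(np.array([1, 200], dtype="int64"), "uint8"))   # [  1 200]
# safe_int_cast(np.array([1, 300], dtype="int64"), "uint8")        # raises OverflowError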

Just to clarify: the additional check is never performed when parsing with an automatically inferred dtype (since TextReader.dtype_cast_order contains only int64 and no other integer dtypes), nor with a user-specified dtype of int64. I just committed another change that prevents unnecessarily running into this check when parsing with a user-specified dtype of uint64.
The check therefore only affects parsing with a user-specified integer dtype smaller than 64 bits. As far as I can see, none of the existing asvs covers this.

Just to be sure, I ran some existing asvs that seemed most relevant and found no significant change:

       before           after         ratio
     [c355145c]       [994a6345]
     <raise-on-parse-int-overflow~7>       <raise-on-parse-int-overflow>
       6.97±0.3ms       7.46±0.7ms     1.07  io.csv.ParseDateComparison.time_read_csv_dayfirst(False)
       3.43±0.1ms      3.39±0.04ms     0.99  io.csv.ParseDateComparison.time_read_csv_dayfirst(True)
       6.93±0.3ms       7.02±0.1ms     1.01  io.csv.ParseDateComparison.time_to_datetime_dayfirst(False)
       3.52±0.3ms      3.28±0.08ms     0.93  io.csv.ParseDateComparison.time_to_datetime_dayfirst(True)
       17.2±0.2ms      17.3±0.06ms     1.00  io.csv.ParseDateComparison.time_to_datetime_format_DD_MM_YYYY(False)
       3.33±0.2ms       3.35±0.5ms     1.01  io.csv.ParseDateComparison.time_to_datetime_format_DD_MM_YYYY(True)
      1.74±0.01ms      1.74±0.09ms     1.00  io.csv.ReadCSVCachedParseDates.time_read_csv_cached(False, 'c')
      2.82±0.03ms      2.81±0.03ms     0.99  io.csv.ReadCSVCachedParseDates.time_read_csv_cached(False, 'python')
      1.77±0.08ms      1.79±0.02ms     1.01  io.csv.ReadCSVCachedParseDates.time_read_csv_cached(True, 'c')
      2.85±0.02ms       2.87±0.1ms     1.01  io.csv.ReadCSVCachedParseDates.time_read_csv_cached(True, 'python')
       25.8±0.8ms       26.3±0.8ms     1.02  io.csv.ReadCSVCategorical.time_convert_direct('c')
          179±3ms          180±4ms     1.01  io.csv.ReadCSVCategorical.time_convert_direct('python')
         44.0±1ms         44.4±1ms     1.01  io.csv.ReadCSVCategorical.time_convert_post('c')
          169±4ms          168±3ms     0.99  io.csv.ReadCSVCategorical.time_convert_post('python')
         21.8±1ms         21.5±1ms     0.99  io.csv.ReadCSVComment.time_comment('c')
       22.5±0.5ms       21.8±0.7ms     0.97  io.csv.ReadCSVComment.time_comment('python')
       24.9±0.9ms         25.1±2ms     1.01  io.csv.ReadCSVConcatDatetime.time_read_csv
         13.1±1ms       12.0±0.3ms     0.92  io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('')
       9.37±0.3ms       9.38±0.1ms     1.00  io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('0')
       14.7±0.4ms         15.0±1ms     1.02  io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('nan')
         86.1±2ms         87.7±2ms     1.02  io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'custom')
      1.55±0.02ms      1.57±0.02ms     1.01  io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'iso8601')
      1.47±0.01ms      1.51±0.04ms     1.03  io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'ymd')
      4.39±0.06ms       4.45±0.1ms     1.01  io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'custom')
      1.80±0.05ms      1.86±0.06ms     1.03  io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'iso8601')
      2.00±0.01ms      2.00±0.02ms     1.00  io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'ymd')
       19.4±0.3ms         19.8±1ms     1.02  io.csv.ReadCSVEngine.time_read_bytescsv('c')
       6.25±0.4ms       6.45±0.6ms     1.03  io.csv.ReadCSVEngine.time_read_bytescsv('pyarrow')
          303±2ms          301±3ms     0.99  io.csv.ReadCSVEngine.time_read_bytescsv('python')
       20.0±0.6ms       19.3±0.5ms     0.96  io.csv.ReadCSVEngine.time_read_stringcsv('c')
       7.43±0.5ms       6.93±0.1ms     0.93  io.csv.ReadCSVEngine.time_read_stringcsv('pyarrow')
          264±2ms          262±6ms     0.99  io.csv.ReadCSVEngine.time_read_stringcsv('python')
      1.33±0.02ms      1.34±0.02ms     1.01  io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'high')
      2.23±0.02ms      2.22±0.02ms     1.00  io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'round_trip')
      1.35±0.06ms      1.36±0.03ms     1.01  io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', None)
      1.39±0.01ms      1.41±0.01ms     1.01  io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'high')
      1.41±0.01ms      1.44±0.04ms     1.02  io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'round_trip')
       1.47±0.2ms      1.43±0.02ms     0.98  io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', None)
      1.30±0.02ms      1.33±0.04ms     1.03  io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'high')
      2.21±0.06ms       2.25±0.1ms     1.02  io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'round_trip')
      1.35±0.07ms      1.33±0.03ms     0.99  io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', None)
      1.42±0.03ms      1.41±0.01ms     1.00  io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', 'high')
      1.41±0.01ms       1.47±0.2ms     1.05  io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', 'round_trip')
      1.40±0.01ms      1.41±0.01ms     1.01  io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', None)
       3.43±0.3ms       3.40±0.2ms     0.99  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', 'high')
       3.45±0.2ms      3.34±0.03ms     0.97  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', 'round_trip')
      3.41±0.03ms       3.42±0.2ms     1.00  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', None)
      2.83±0.03ms      2.82±0.02ms     1.00  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '_', 'high')
      2.81±0.01ms      2.82±0.03ms     1.00  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '_', 'round_trip')
      2.83±0.09ms      2.80±0.02ms     0.99  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '_', None)
      3.35±0.03ms      3.32±0.03ms     0.99  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', 'high')
       3.52±0.2ms       3.55±0.2ms     1.01  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', 'round_trip')
       3.50±0.1ms       3.76±0.2ms     1.08  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', None)
      2.80±0.02ms      2.79±0.02ms     0.99  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', 'high')
      2.84±0.07ms      2.83±0.07ms     1.00  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', 'round_trip')
      2.81±0.03ms      2.81±0.04ms     1.00  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', None)
      9.98±0.07ms       9.88±0.3ms     0.99  io.csv.ReadCSVIndexCol.time_read_csv_index_col
       69.0±0.4ms       74.1±0.3ms     1.07  io.csv.ReadCSVMemMapUTF8.time_read_memmapped_utf8
                0                0      n/a  io.csv.ReadCSVMemoryGrowth.mem_parser_chunks('c')
                0                0      n/a  io.csv.ReadCSVMemoryGrowth.mem_parser_chunks('python')
      1.45±0.01ms      1.56±0.06ms     1.07  io.csv.ReadCSVParseDates.time_baseline('c')
      1.58±0.03ms      1.57±0.02ms     0.99  io.csv.ReadCSVParseDates.time_baseline('python')
      1.78±0.01ms      1.81±0.06ms     1.01  io.csv.ReadCSVParseDates.time_multiple_date('c')
      1.99±0.03ms       2.30±0.3ms    ~1.15  io.csv.ReadCSVParseDates.time_multiple_date('python')
      2.99±0.03ms      2.93±0.02ms     0.98  io.csv.ReadCSVParseSpecialDate.time_read_special_date('hm', 'c')
       13.3±0.1ms       13.2±0.2ms     0.99  io.csv.ReadCSVParseSpecialDate.time_read_special_date('hm', 'python')
       7.70±0.2ms      7.54±0.07ms     0.98  io.csv.ReadCSVParseSpecialDate.time_read_special_date('mY', 'c')
       40.0±0.8ms         39.9±2ms     1.00  io.csv.ReadCSVParseSpecialDate.time_read_special_date('mY', 'python')
      3.41±0.05ms      3.42±0.03ms     1.00  io.csv.ReadCSVParseSpecialDate.time_read_special_date('mdY', 'c')
       14.2±0.8ms       13.9±0.6ms     0.98  io.csv.ReadCSVParseSpecialDate.time_read_special_date('mdY', 'python')
       10.3±0.1ms       11.0±0.2ms     1.06  io.csv.ReadCSVSkipRows.time_skipprows(10000, 'c')
       8.88±0.6ms       8.13±0.3ms     0.92  io.csv.ReadCSVSkipRows.time_skipprows(10000, 'pyarrow')
         46.3±1ms         46.4±2ms     1.00  io.csv.ReadCSVSkipRows.time_skipprows(10000, 'python')
       14.8±0.4ms       15.0±0.3ms     1.02  io.csv.ReadCSVSkipRows.time_skipprows(None, 'c')
       8.15±0.6ms       8.34±0.3ms     1.02  io.csv.ReadCSVSkipRows.time_skipprows(None, 'pyarrow')
         68.0±1ms         64.1±2ms     0.94  io.csv.ReadCSVSkipRows.time_skipprows(None, 'python')
       14.1±0.3ms       14.5±0.3ms     1.03  io.csv.ReadCSVThousands.time_thousands(',', ',', 'c')
          166±4ms          161±3ms     0.97  io.csv.ReadCSVThousands.time_thousands(',', ',', 'python')
       11.4±0.2ms       11.6±0.2ms     1.02  io.csv.ReadCSVThousands.time_thousands(',', None, 'c')
       58.4±0.9ms         58.0±1ms     0.99  io.csv.ReadCSVThousands.time_thousands(',', None, 'python')
       13.4±0.2ms       13.8±0.2ms     1.02  io.csv.ReadCSVThousands.time_thousands('|', ',', 'c')
          168±3ms          164±6ms     0.98  io.csv.ReadCSVThousands.time_thousands('|', ',', 'python')
       11.3±0.2ms       11.8±0.4ms     1.04  io.csv.ReadCSVThousands.time_thousands('|', None, 'c')
         59.2±1ms       57.7±0.9ms     0.97  io.csv.ReadCSVThousands.time_thousands('|', None, 'python')
       3.38±0.6ms       3.55±0.2ms     1.05  io.csv.ReadUint64Integers.time_read_uint64
       5.53±0.2ms       5.67±0.2ms     1.03  io.csv.ReadUint64Integers.time_read_uint64_na_values
       5.62±0.4ms       5.43±0.2ms     0.96  io.csv.ReadUint64Integers.time_read_uint64_neg_values
          114±2ms          112±1ms     0.98  io.csv.ToCSV.time_frame('long')
       16.5±0.2ms       16.3±0.2ms     0.99  io.csv.ToCSV.time_frame('mixed')
         91.3±1ms         94.2±3ms     1.03  io.csv.ToCSV.time_frame('wide')
       7.71±0.4ms       7.79±0.3ms     1.01  io.csv.ToCSVDatetime.time_frame_date_formatting
      7.15±0.06ms      7.26±0.08ms     1.02  io.csv.ToCSVDatetimeBig.time_frame(1000)
         66.7±1ms       66.7±0.6ms     1.00  io.csv.ToCSVDatetimeBig.time_frame(10000)
          671±5ms          668±6ms     1.00  io.csv.ToCSVDatetimeBig.time_frame(100000)
          380±2ms          379±7ms     1.00  io.csv.ToCSVDatetimeIndex.time_frame_date_formatting_index
          142±1ms          142±2ms     1.00  io.csv.ToCSVDatetimeIndex.time_frame_date_no_format_index
          711±9ms          715±9ms     1.01  io.csv.ToCSVIndexes.time_head_of_multiindex
          721±5ms          718±6ms     1.00  io.csv.ToCSVIndexes.time_multiindex
          580±8ms          579±6ms     1.00  io.csv.ToCSVIndexes.time_standard_index
          238±5ms          234±4ms     0.98  io.csv.ToCSVMultiIndexUnusedLevels.time_full_frame
       21.5±0.1ms       22.1±0.3ms     1.03  io.csv.ToCSVMultiIndexUnusedLevels.time_single_index_frame
       22.5±0.5ms       22.6±0.2ms     1.00  io.csv.ToCSVMultiIndexUnusedLevels.time_sliced_frame
         798±60ms         747±10ms     0.94  io.excel.ReadExcel.time_read_excel('odf')
         196±10ms          180±3ms     0.92  io.excel.ReadExcel.time_read_excel('openpyxl')
         43.5±2ms         41.0±1ms     0.94  io.excel.ReadExcel.time_read_excel('xlrd')
         439±20ms          420±7ms     0.96  io.excel.WriteExcel.time_write_excel('openpyxl')
         233±10ms         230±10ms     0.99  io.excel.WriteExcel.time_write_excel('xlsxwriter')
          220±8ms          227±4ms     1.03  io.excel.WriteExcel.time_write_excel('xlwt')

I added a simple asv benchmark that triggers the relevant check. It is likewise entirely unaffected:

       before           after         ratio
     [c355145c]       [994a6345]
     <raise-on-parse-int-overflow~7>       <raise-on-parse-int-overflow>
       1.34±0.1ms      1.31±0.04ms     0.98  io.csv.ReadUint8Integers.time_read_uint8

A separate timeit on the comparison (arr1 == arr2).all() shows that it takes ~5 µs, compared to ~1 ms for the whole read_csv call.
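For reference, the micro-measurement can be reproduced along these lines (array size and iteration count are assumptions, not figures from the PR):

import timeit
import numpy as np

arr1 = np.arange(10_000, dtype="int64") % 128
arr2 = arr1.astype("uint8").astype("int64")  # simulate the round-trip cast

t = timeit.timeit(lambda: (arr1 == arr2).all(), number=10_000)
print(f"{t / 10_000 * 1e6:.1f} µs per comparison")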

SandroCasagrande (Contributor Author):

Some additional remarks: For running the asvs in io.csv I had to:

Member:

@jbrockmendel should _dtype_can_hold_range be used here to check for lossless int64 conversion?

jbrockmendel (Member):

_dtype_can_hold_range is specific to range objects. Looks like we have an ndarray here?

SandroCasagrande (Contributor Author):

Yes, we are comparing ndarrays here. I could change it to np.array_equal for readability, but apart from some safety boilerplate the same check is performed there: https://github.com/numpy/numpy/blob/50a74fb65fc752e77a2f9e9e2b7227629c2ba953/numpy/core/numeric.py#L2468
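To illustrate the equivalence (example arrays are mine):

import numpy as np

a = np.array([1, 2, 3], dtype="int64")
b = a.astype("uint8").astype("int64")

print((a == b).all())         # True: the check used in the PR
print(np.array_equal(a, b))   # True: same result, plus shape checks

# array_equal returns False on shape mismatch, while the bare
# elementwise comparison raises a broadcast error on recent NumPy.
print(np.array_equal(a, a[:2]))  # False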

Resolved review threads (outdated): doc/source/whatsnew/v1.5.0.rst, pandas/tests/io/parser/test_textreader.py
@jreback added the 'Dtype Conversions' label (Unexpected or buggy dtype conversions) on Jun 6, 2022.
if user_dtype and dtype == 'uint64':
    do_try_uint64 = True
else:
    try:
jreback (Contributor):

The pattern here is to _try_<dtype>, then if that fails, try another one. Can you just do that here (rather than all of this if/then logic)?
E.g. add a _try_uint64 if needed.

SandroCasagrande (Contributor Author):

I tried to stay with the pattern _try_int64 -> if fail -> _try_uint64, but shortcut the two cases with user_dtype and dtype in ["int64", "uint64"], where we can fail right after the specific _try_<dtype>. I added another _try_uint64 instead of the do_try_uint64 flag. Is that okay?
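A rough Python sketch of that control flow (the helper bodies are stand-ins for the Cython helpers in parsers.pyx; names and the error type are paraphrased):

import numpy as np

def _try_int64(strs):
    # Stand-in: return None if any value does not fit into int64
    # (NumPy raises OverflowError in that case).
    try:
        return np.array([int(s) for s in strs], dtype="int64")
    except OverflowError:
        return None

def _try_uint64(strs):
    try:
        return np.array([int(s) for s in strs], dtype="uint64")
    except OverflowError:
        return None

def parse_int_column(strs, dtype="int64", user_dtype=False):
    # Shortcut: an explicit uint64 request can only be satisfied by
    # _try_uint64, so skip the int64 attempt entirely.
    if user_dtype and dtype == "uint64":
        result = _try_uint64(strs)
    else:
        result = _try_int64(strs)
        # Fall back to uint64 unless the user explicitly demanded int64.
        if result is None and not (user_dtype and dtype == "int64"):
            result = _try_uint64(strs)
    if result is None:
        raise OverflowError("integer out of range for the requested dtype")
    return result

print(parse_int_column(["1", "2"]))                                     # int64 path
print(parse_int_column([str(2**63)], dtype="uint64", user_dtype=True))  # uint64 shortcut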

phofl (Member):

Hm I agree with @jreback, this is hard to read

SandroCasagrande (Contributor Author) commented Sep 6, 2022:

I also agree, and I added a _try_uint64 in commit 927ddad as @jreback suggested. Were you also referring to the former version with do_try_uint64, @phofl? The latest version is https://github.com/SandroCasagrande/pandas/blob/5896e017ca2960e0d535c8c0a0b9db978377bc91/pandas/_libs/parsers.pyx#L1182-L1198. Sorry if I did not make it clear that I had made changes and that the newest version should be reviewed again. Can you please have a look, @jreback or @phofl?

github-actions bot commented Jul 7, 2022:

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

The github-actions bot added the 'Stale' label on Jul 7, 2022.
mroeschke (Member):

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke closed this on Jul 22, 2022.
SandroCasagrande (Contributor Author):

I am still interested in this. I just added changes that address the requests raised by @jreback and @simonjayhawkins. Local tests run successfully.

I decided to add the check for engine="python", as outlined above, inside the base parser (with the only difference being that there the check is also necessary for user dtype uint64). Concerning the question I raised about having an option to check for overflows in astype_nansafe, I came to the conclusion that this is probably unnecessary: the only relevant situation where I expect to consistently fail on insufficiently large dtypes is parsing, i.e. what is handled in this PR.

Please reopen and review.

@mroeschke reopened this on Jul 24, 2022.
SandroCasagrande (Contributor Author):

Thanks @mroeschke for reopening.

SandroCasagrande (Contributor Author):

I just realized that the changes in this PR have a good side effect.

Until now, implicit lossy coercion of float to int could occur for engine="python" with a non-extension integer dtype, i.e.

>>> import pandas as pd
>>> from io import StringIO
>>> pd.read_csv(StringIO("0.1"), header=None, dtype="int", engine="python").squeeze().item()
0

while these similar variants already raise errors:

>>> pd.read_csv(StringIO("0.1"), header=None, dtype="Int64", engine="python")
Traceback (most recent call last):
  ...
TypeError: cannot safely cast non-equivalent float64 to int64
>>> pd.read_csv(StringIO("0.1"), header=None, dtype="int", engine="c")
Traceback (most recent call last):
  ...
ValueError: cannot safely convert passed user dtype of int64 for float64 dtyped data in column 0

The changes in this PR make all of the above variants raise an error, so the "c" and "python" engines behave consistently.

Two remarks concerning this:

  • I had to change a test on read_fwf that assumed lossy float-to-int coercion. The exact behavior of float-to-int coercion under a given dtype parameter is not explicitly documented for read_csv, read_fwf, and read_table as far as I can see, so the present possibility of lossy coercion to int is not guaranteed. Imho, consistent and safe behavior is better.
  • The behavior of engine="pyarrow" is to silently perform lossy float-to-int coercion and to allow integer overflows.

Resolved review threads: asv_bench/benchmarks/io/csv.py (three)
@@ -1134,6 +1134,8 @@ I/O
- Bug in :func:`read_parquet` with ``use_nullable_dtypes=True`` where ``float64`` dtype was returned instead of nullable ``Float64`` dtype (:issue:`45694`)
- Bug in :meth:`DataFrame.to_json` where ``PeriodDtype`` would not make the serialization roundtrip when read back with :meth:`read_json` (:issue:`44720`)
- Bug in :func:`read_xml` when reading XML files with Chinese character tags and would raise ``XMLSyntaxError`` (:issue:`47902`)
- Bug in :func:`read_csv` with specified (non-extension) integer ``dtype`` can cause silent overflow or unexpected return dtype (:issue:`47167`)
Member:

Why non-extension? If you want to reference numpy dtypes, you can say "numpy integer dtype".

SandroCasagrande (Contributor Author):

Okay, did that. Should I also move it to target v1.6.0 already?

Resolved review threads (outdated): pandas/tests/io/parser/common/test_ints.py (three)
mroeschke (Member):

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke closed this on Jan 12, 2023.
Labels: Dtype Conversions (Unexpected or buggy dtype conversions)

Merging this pull request may close: BUG: integer overflow in csv_reader