REGR: pd.read_csv segfaults with 1.2 (has worked since before pandas 1.0) #38753

snowman2 · 2020-12-28T18:37:15Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

Wish I could give more information, but this is all I have (not sure I can share the input data):

Fatal Python error: Segmentation fault
Current thread 0x00007f658531a740 (most recent call first):
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2056 in read
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1052 in read
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 463 in _read
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 605 in read_csv

Once I added the pin pandas<1.2, everything worked again.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.7.9.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-1020-aws
Version : #29-Ubuntu SMP Wed Jun 14 15:54:52 UTC 2017
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.2.0
numpy : 1.19.1
pytz : 2020.5
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0
Cython : None
pytest : 6.2.1
hypothesis : None
sphinx : 3.4.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.5.4
sqlalchemy : 1.3.22
tables : None
tabulate : 0.8.7
xarray : 0.16.2
xlrd : None
xlwt : None
numba : None
None

mzeitlin11 · 2020-12-28T18:55:43Z

Hi @snowman2, thanks for the report! Unfortunately, without more information it will be difficult to figure out this issue. Is there any way you can mimic the characteristics of the failing example to come up with a reproducible example you can share? Or if you have any experience with it, showing output from running through with a C debugger could also help isolate the issue (https://pandas.pydata.org/docs/development/debugging_extensions.html).

mzeitlin11 · 2020-12-28T18:58:31Z

Maybe #14782 is related?

snowman2 · 2020-12-28T19:25:13Z

Definitely understand. I will try to get some time to get better information to you later.

twoertwein · 2020-12-28T20:53:40Z

Can you please test it one time with engine="c" and another time with engine="python"?

snowman2 · 2020-12-29T00:29:38Z

> Maybe #14782 is related?

I don't think it is related.

> showing output from running through with a C debugger could also help isolate the issue

$ gdb python
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
...
(gdb) run testread.py 
Starting program: ~/pd/bin/python testread.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff47ae700 (LWP 2907)]
[New Thread 0x7ffff3fad700 (LWP 2908)]
[New Thread 0x7fffef7ac700 (LWP 2909)]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffe432eb96 in precise_xstrtod ()
   from ~/pd/lib/python3.8/site-packages/pandas/_libs/parsers.cpython-38-x86_64-linux-gnu.so

I think I am going to see if I can reproduce it with master and get better debug output.

> Can you please test it one time with engine="c" and another time with engine="python"?

I tested this in a local environment with wheels:

Python 3.8.5 (default, Jul 28 2020, 12:59:40) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> pandas.read_csv("data.csv", engine="c")
Segmentation fault (core dumped)

Python 3.8.5 (default, Jul 28 2020, 12:59:40) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> pandas.read_csv("data.csv", engine="python")
Segmentation fault (core dumped)

snowman2 · 2020-12-29T00:48:26Z

From master branch:

$ gdb python
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2

...
(gdb) run testread.py 
Starting program: /home/snowal/pd/bin/python testread.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff47ae700 (LWP 4395)]
[New Thread 0x7ffff1fad700 (LWP 4396)]
[New Thread 0x7fffef7ac700 (LWP 4397)]
[Thread 0x7fffef7ac700 (LWP 4397) exited]
[Thread 0x7ffff1fad700 (LWP 4396) exited]
[Thread 0x7ffff47ae700 (LWP 4395) exited]
[Detaching after fork from child process 4398]
[Detaching after fork from child process 4399]
[Detaching after fork from child process 4404]
[Detaching after fork from child process 4405]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffe882a5bc in precise_xstrtod (str=0xf46495 "0106000020E61000000100000001030000000", endptr=0x7fffffffbd80, decimal=46 '.', sci=69 'E', 
    tsep=0 '\000', skip_trailing=1, error=0x7fffffffbd6c, maybe_int=0x0) at pandas/_libs/src/parser/tokenizer.c:1752
1752	        number /= e[-308 - exponent];

Can reproduce with this as the contents of the CSV file:

data
0106000020E61000000100000001030000000

asishm · 2020-12-29T00:59:06Z

On WSL, reading that CSV file returns the following (on 1.2.0.dev0+1692.g87d9c8f31):

   data
0   NaN

with both the C and Python engines.

mzeitlin11 · 2020-12-29T01:44:30Z

On OS X,

import io
import pandas as pd

data = io.StringIO("data\n0106000020E61000000100000001030000000")
df = pd.read_csv(data)

segfaults about half the time; the other half it gives

   data
0   NaN

With the Python engine it still segfaults (but less often?).
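
Because the crash is intermittent and takes the interpreter down with it, one way to measure how often it happens is to run each attempt in a child process and inspect its exit status. The following is a minimal sketch, not something posted in this thread: `count_crashes` and `REPRO` are hypothetical names, and it assumes a POSIX system, where a child killed by a signal reports a negative return code.

```python
import signal
import subprocess
import sys

# The reproducer from this thread, to be run in a fresh interpreter each time.
REPRO = (
    "import io, pandas as pd\n"
    "data = io.StringIO('data\\n0106000020E61000000100000001030000000')\n"
    "pd.read_csv(data)\n"
)


def count_crashes(code, runs=20):
    """Run `code` in `runs` child interpreters and count SIGSEGV deaths."""
    crashes = 0
    for _ in range(runs):
        proc = subprocess.run(
            [sys.executable, "-c", code],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # On POSIX, a child killed by signal N has returncode == -N.
        if proc.returncode == -signal.SIGSEGV:
            crashes += 1
    return crashes
```

Calling `count_crashes(REPRO)` on an affected build would report how many of the 20 runs died with a segmentation fault. A variant of `REPRO` that passes `engine="python"` to `pd.read_csv` could be used to compare the two engines.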

jorisvandenbossche · 2020-12-29T07:53:53Z

On Linux I can also confirm that it segfaults with that example.

As a temporary workaround, you can use pd.read_csv(data, float_precision="legacy"), which doesn't segfault for me.

The fact that it fails in precise_xstrtod (and that float_precision="legacy" avoids the crash) seems to indicate this is caused by #36228 (that PR itself did not change the implementation, it only switched the default). cc @Dr-Irv

So it seems that if we want to keep float_precision="high" as the new default, we would need to fix precise_xstrtod for those corner cases.
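
The gdb frame earlier in the thread stops at `number /= e[-308 - exponent]`, and the failing field has a long run of digits after the `E`. One plausible reading, sketched below, is that those digits are accumulated into a fixed-width integer that overflows, leaving a large negative exponent and therefore a huge array index. This is an illustrative model only: the 32-bit width, the wraparound behaviour (signed overflow is undefined in C), and the helper names `to_int32` and `accumulate_exponent` are assumptions, not taken from tokenizer.c.

```python
FIELD = "0106000020E61000000100000001030000000"


def to_int32(value):
    """Interpret an integer modulo 2**32 as a signed 32-bit value."""
    return (value + 2**31) % 2**32 - 2**31


def accumulate_exponent(digits):
    """Mimic `n = n * 10 + digit` with an assumed 32-bit wraparound."""
    n = 0
    for ch in digits:
        n = to_int32(n * 10 + int(ch))
    return n


# Split the field the way a float parser would: mantissa, 'E', exponent.
mantissa, exponent_digits = FIELD.split("E")
exponent = accumulate_exponent(exponent_digits)
index = -308 - exponent  # the index expression from the gdb frame

print("exponent digits:", len(exponent_digits))
print("wrapped exponent:", exponent)
print("table index:", index)
```

Under these assumptions the wrapped exponent comes out negative and the index becomes a very large positive number, which would be consistent with an out-of-bounds read that crashes only when the address it lands on happens to be unmapped.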

snowman2 added the Bug and Needs Triage labels Dec 28, 2020

mzeitlin11 added the IO CSV, Segfault, and Needs Info labels and removed the Needs Triage label Dec 28, 2020

mzeitlin11 added the Regression label and removed the Bug label Dec 28, 2020

mzeitlin11 removed the Needs Info label Dec 29, 2020

jorisvandenbossche added this to the 1.2.1 milestone Dec 29, 2020

mzeitlin11 mentioned this issue Dec 29, 2020

BUG: Fix precise_xstrtod segfault on long exponent #38789

Merged

5 tasks

jreback closed this as completed in #38789 Dec 30, 2020

djherbis mentioned this issue Mar 3, 2021

BUG: dataframe concatenation segementation fault (core dump) #39144

Closed

1 task

ggold7046 mentioned this issue Aug 10, 2023

Modified doc/make.py to run sphinx-build -b linkcheck #54265

Merged

5 tasks
