REGR: pd.read_csv segfaults with 1.2 (has worked since before pandas 1.0) #38753

Closed
3 tasks done
snowman2 opened this issue Dec 28, 2020 · 9 comments · Fixed by #38789
Labels
IO CSV, Regression, Segfault
Milestone

Comments

@snowman2

snowman2 commented Dec 28, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

Wish I could give more information, but this is all I have (not sure I can share the input data):

Fatal Python error: Segmentation fault
Current thread 0x00007f658531a740 (most recent call first):
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2056 in read
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1052 in read
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 463 in _read
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 605 in read_csv

Once I added the pin pandas<1.2 everything works again.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.7.9.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-1020-aws
Version : #29-Ubuntu SMP Wed Jun 14 15:54:52 UTC 2017
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.2.0
numpy : 1.19.1
pytz : 2020.5
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0
Cython : None
pytest : 6.2.1
hypothesis : None
sphinx : 3.4.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.5.4
sqlalchemy : 1.3.22
tables : None
tabulate : 0.8.7
xarray : 0.16.2
xlrd : None
xlwt : None
numba : None
None

@snowman2 added the Bug and Needs Triage labels Dec 28, 2020
@mzeitlin11
Member

Hi @snowman2, thanks for the report! Unfortunately, without more information it will be difficult to figure out this issue. Is there any way you can mimic the characteristics of the failing example to come up with a reproducible example you can share? Or if you have any experience with it, showing output from running through with a C debugger could also help isolate the issue (https://pandas.pydata.org/docs/development/debugging_extensions.html).

@mzeitlin11 added the IO CSV, Segfault, and Needs Info labels and removed the Needs Triage label Dec 28, 2020
@mzeitlin11
Member

Maybe #14782 is related?

@mzeitlin11 added the Regression label and removed the Bug label Dec 28, 2020
@snowman2
Author

Definitely understand. I will try to get some time to get better information to you later.

@twoertwein
Member

Can you please test it one time with engine="c" and another time with engine="python"?

@snowman2
Author

Maybe #14782 is related?

I don't think it is related.

showing output from running through with a C debugger could also help isolate the issue

$ gdb python
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
...
(gdb) run testread.py 
Starting program: ~/pd/bin/python testread.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff47ae700 (LWP 2907)]
[New Thread 0x7ffff3fad700 (LWP 2908)]
[New Thread 0x7fffef7ac700 (LWP 2909)]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffe432eb96 in precise_xstrtod ()
   from ~/pd/lib/python3.8/site-packages/pandas/_libs/parsers.cpython-38-x86_64-linux-gnu.so

I think I am going to see if I can reproduce it with master and get better debug output.

Can you please test it one time with engine="c" and another time with engine="python"?

I tested this in a local environment with wheels:

Python 3.8.5 (default, Jul 28 2020, 12:59:40) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> pandas.read_csv("data.csv", engine="c")
Segmentation fault (core dumped)
Python 3.8.5 (default, Jul 28 2020, 12:59:40) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> pandas.read_csv("data.csv", engine="python")
Segmentation fault (core dumped)

@snowman2
Author

snowman2 commented Dec 29, 2020

From master branch:

$ gdb python
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2

...
(gdb) run testread.py 
Starting program: /home/snowal/pd/bin/python testread.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff47ae700 (LWP 4395)]
[New Thread 0x7ffff1fad700 (LWP 4396)]
[New Thread 0x7fffef7ac700 (LWP 4397)]
[Thread 0x7fffef7ac700 (LWP 4397) exited]
[Thread 0x7ffff1fad700 (LWP 4396) exited]
[Thread 0x7ffff47ae700 (LWP 4395) exited]
[Detaching after fork from child process 4398]
[Detaching after fork from child process 4399]
[Detaching after fork from child process 4404]
[Detaching after fork from child process 4405]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffe882a5bc in precise_xstrtod (str=0xf46495 "0106000020E61000000100000001030000000", endptr=0x7fffffffbd80, decimal=46 '.', sci=69 'E', 
    tsep=0 '\000', skip_trailing=1, error=0x7fffffffbd6c, maybe_int=0x0) at pandas/_libs/src/parser/tokenizer.c:1752
1752	        number /= e[-308 - exponent];

Can reproduce with this as the contents of the CSV file:

data
0106000020E61000000100000001030000000
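
A note on why this value might reach the float parser at all: the thread does not spell it out, but the string consists of digits around a single "E", so it can be read as scientific notation. The sketch below only splits the literal the way such a reader would and compares the exponent with the ±308 range hinted at by `e[-308 - exponent]` in the gdb output. That this is the root cause is an assumption, and the code does not call the pandas tokenizer.

```python
import math

# The single data value from the reproducing CSV file in this thread.
text = "0106000020E61000000100000001030000000"

# Split at the "E", as a scientific-notation reader would (illustration only;
# this is not how pandas' tokenizer is implemented).
mantissa, _, exponent = text.partition("E")

print("mantissa:", mantissa)
print("exponent:", exponent)

# The exponent is far outside what a double can represent (roughly 1e-308 to 1e308)
# and also does not fit in a 32-bit signed integer.
print("exponent > 308:", int(exponent) > 308)
print("exponent > INT32_MAX:", int(exponent) > 2**31 - 1)

# Python's own float() treats the literal as an overflow and returns infinity.
print("float(text) is inf:", math.isinf(float(text)))
```

If that reading is right, any column holding long hex-like strings with one "E" in the middle could hit the same code path.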

@asishm
Contributor

asishm commented Dec 29, 2020

On WSL (pandas 1.2.0.dev0+1692.g87d9c8f31), reading that CSV file returns

   data
0   NaN

with both c and python engines

@mzeitlin11
Member

On OS X

import io
import pandas as pd

data = io.StringIO("data\n0106000020E61000000100000001030000000")
df = pd.read_csv(data)

segfaults about half the time; the other half gives

   data
0   NaN

With the python engine it still segfaults (but less often?)
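
Since the crash is intermittent, one way to put numbers on "about half the time" is to run the reproducer repeatedly in child processes and tally the exit statuses. This is a minimal sketch, not something posted in the thread: `CHILD_CODE`, `run_once`, and `tally` are hypothetical helper names, and the run count is arbitrary. On POSIX, `subprocess` reports a child killed by signal N as return code -N, so a segfault on Linux typically shows up as -11.

```python
import subprocess
import sys
from collections import Counter

# Child program: the reproducer from this thread, run in a fresh interpreter
# so that a crash cannot take down the parent process.
CHILD_CODE = """
import io
import sys
import pandas as pd
data = io.StringIO("data\\n0106000020E61000000100000001030000000")
pd.read_csv(data, engine=sys.argv[1])
"""


def run_once(engine: str) -> int:
    """Run the reproducer once and return the child's exit status."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, engine],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return proc.returncode


def tally(engine: str, runs: int = 5) -> Counter:
    """Count how often each exit status occurs over several runs."""
    return Counter(run_once(engine) for _ in range(runs))


if __name__ == "__main__":
    for engine in ("c", "python"):
        # 0 means a clean exit; a negative value means killed by a signal (POSIX).
        print(engine, dict(tally(engine)))
```

A status of 1 usually means the child raised an ordinary Python exception (for example, pandas is not installed), which is a different outcome from a crash.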

@mzeitlin11 removed the Needs Info label Dec 29, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.2.1 milestone Dec 29, 2020
@jorisvandenbossche
Member

On linux I can also confirm that it segfaults with that example.

As a temporary workaround, you can use pd.read_csv(data, float_precision="legacy"), which doesn't segfault for me.
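
Spelled out as a complete script, the workaround could look like the sketch below. It reuses the one-row input from earlier in the thread; the thread does not state what value or dtype the column ends up with, so the script only prints what comes back rather than assuming it.

```python
import io

import pandas as pd

# Same one-row input as the reproducer earlier in this thread.
data = io.StringIO("data\n0106000020E61000000100000001030000000")

# float_precision="legacy" selects the older float converter instead of the
# high-precision one that became the default in pandas 1.2.
df = pd.read_csv(data, float_precision="legacy")

# Inspect the result rather than assuming how the value was interpreted.
print(df.dtypes)
print(df)
```

`float_precision` is an existing `read_csv` parameter that applies to the C engine, so no version pin is needed while a fix is pending.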

The fact that it fails in precise_xstrtod (and that float_precision="legacy" avoids the crash) seems to indicate this is caused by #36228 (that PR itself did not change the implementation, it only switched the default). cc @Dr-Irv

So it seems that if we want to keep float_precision="high" as the new default, we would need to fix precise_xstrtod for those corner cases.
