Seqfault on creation of dataframe with np.empty_like #13717

flatberg · 2016-07-20T12:50:45Z

This code segfaults:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(3, 3), index='A B C'.split())
pd.DataFrame(np.empty_like(df.index))

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-91-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: 0.7.2
IPython: 5.0.0
sphinx: None
patsy: 0.4.1
dateutil: 2.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

jreback · 2016-07-20T13:03:42Z

In [1]: pd.DataFrame(np.empty_like(np.array([1,2])))
Out[1]:
     0
0  109
1    0

In [2]: pd.DataFrame(np.empty_like(np.array(['a', 'b'])))
Out[2]:
  0
0
1

# setfault
In [3]: pd.DataFrame(np.empty_like(Index(['a', 'b'])))
Out[3]:

not really sure what np.empty_like actually does under the hood.

cc @charris
cc @gfyoung

gfyoung · 2016-07-20T15:28:24Z

I suspect it is a case of memory corruption on the numpy end when creating a None for the object dtype, as evidenced below:

>>> import numpy as np
>>> import pandas as pd
>>> arr = np.empty_like(pd.Index(['a', 'b']))
>>> arr
array([None, None], dtype=object)
>>> arr2 = np.array([None, None], dtype=object)
>>> arr == arr2
True
>>> arr.data.tobytes()    # segfault
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> arr2.data.tobytes()   # no segfault
b'0\x91\xad\x8b\xd8\x7f\x00\x000\x91\xad\x8b\xd8\x7f\x00\x00'

You can see here that the data is not corrupted when you have integers:

>>> import numpy as np
>>> import pandas as pd
>>> arr = np.empty_like(pd.Index([1, 2]))
>>> arr
array([140568033367712, 140568033367744], dtype=int64)
>>> arr2 = np.array([140568033367712, 140568033367744], dtype=np.int64)
>>> arr == arr2
True
>>> arr.data.tobytes()    # no segfault
b'\xa0\x16\xb2\x8b\xd8\x7f\x00\x00\xc0\x16\xb2\x8b\xd8\x7f\x00\x00'
>>> arr2.data.tobytes()   # no segfault
b'\xa0\x16\xb2\x8b\xd8\x7f\x00\x00\xc0\x16\xb2\x8b\xd8\x7f\x00\x00'

Also, you don't need an Index to trigger the segfault:

>>> import numpy as np
>>> import pandas as pd
>>> pd.DataFrame(np.empty_like([None]))

bashtage · 2016-07-20T18:11:34Z

You can do more wierd tricks with this:

np.empty_like(np.array([None, None], dtype=object)).data.tobytes()
Out[41]: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

np.empty_like(np.array([None, None], dtype=object)).astype(np.int64).data.tobytes()
Out[42]: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

np.empty_like(np.array([None, None], dtype=object)).astype(np.int64).astype(np.object).data.tobytes()
Out[43]: b'\xd0\x01\x9ac\x00\x00\x00\x00\xd0\x01\x9ac\x00\x00\x00\x00'

This looks like a NumPy bug.

I suspect that an object array initialized using None should contain the value of id(None) if correct. 0-filled "Nones" don't make any sense, and seem to an incorrect treatment of NumPy.

>>> np.array([None]).data.tobytes()[::-1].hex()
'0000000063979420'

>>> hex(id(None))
'0x63979420'

jreback · 2016-07-20T20:54:02Z

agree with @bashtage here - some memory aliasing going on

want to open a numpy side issue?

gfyoung · 2016-07-21T07:14:00Z

@jreback : I opened a PR now in numpy that should fix the issue. However, even if it doesn't get merged, technically, this issue could be closed since the bug is on the numpy side.

jreback · 2016-07-21T12:08:26Z

running this in the debugger actually works. I think there is a reference issue to the Nones; an element-wise copy would fix it I suppose (from pandas perspective). but yes this is a numpy-bug.

bashtage · 2016-07-21T12:10:55Z

It is probably, technically, a Pandas bug. NumPy is returns an empty array filled with NULL. Pandas doesn't seems to know what to do when an object array contains NULLs. This said, it would be better if NumPy addressed this by never returning object arrays containing NULLs, and in instead returned the address of None.

gfyoung · 2016-07-21T16:16:43Z

@bashtage : True, but the annoying part on the pandas end about this issue is that it's hard to step through and catch where exactly the segfault is occurring. I tried throwing Exceptions in the middle of the __init__ for DataFrame and NDFrame without any success.

njsmith · 2016-07-21T19:44:45Z

This should probably be treated as a pandas bug, and definitely not as a bug in empty_like specifically.

Numpy's semantics for interpreting the bytes in object arrays are, consistently:

Zero field values represent None, but without holding a refcount
All other fields values represent a concrete PyObject*, and hold a refcount.

If pandas wants to implement its own logic for reading objects out of object arrays, then it kinda has to implement the same semantics that numpy uses.

Potentially numpy could get rid of the special case for null pointers. I'm not actually sure if this would be a good idea or not, since there is a bit of an optimization there (the special case means we can allocate a large array of None objects, as in np.empty or np.zeroes, in O(1) time instead of the O(n) it would take to initialize all those fields and issue all those calls to Py_INCREF -- but OTOH maybe the win here is trivial and not with the complexity, I dunno).

Getting rid of this special case is more involved than just patching empty_like, though: this logic is implemented consistently in all numpy code dealing with object arrays, and potentially depended on by third party libraries.

gfyoung · 2016-07-21T20:28:17Z

@njsmith: So you have semantically aliased the zero field value to Py_None if I understand properly?

njsmith · 2016-07-22T01:19:33Z

Well not me personally :-). But yes, that's my understanding.

gfyoung · 2016-07-23T06:27:10Z

Okay, given the push-back I'm seeing from the numpy maintainers, digging a bit deeper on the pandas end, the issue is with printing of the DataFrame and the formatting behind the scenes. The segfault can be triggered in a much simpler fashion:

>>> import pandas.lib as lib
>>> import numpy as np
>>> lib.isnullobj(np.empty_like([None]))

This drew my attention when I realised the segfault wasn't occurring with just the mere creation of the DataFrame, but rather when I called str(...). pdb indicated that there was a segfault being signalled in lib.c, which is the Cythonized version of lib.pyx.

So it does seem like pandas does not handle these zero-value Py_None values, but I didn't realise it would be coming from printing! Definitely on the right track, so will continue digging...

gfyoung · 2016-07-23T07:18:24Z

Okay, so the issue is a memory access violation occurring right here (if you follow my example). This definitely looks like a numpy issue to me, as we do nothing to the object array beforehand.

I should also point out that if I define the following function in numpy (e.g. mtrand.pyx):

def getfirstelt(ndarray[object] arr):
    return arr[0]

and then do this:

>>> from numpy.random import getfirstelt
>>> import numpy as np
>>> np.getfirstelt(np.empty_like([None]))

this also segfaults.

@njsmith : what are your thoughts about this?

njsmith · 2016-07-23T13:02:34Z

@gfyoung: I think that your code there is triggering cython to generate some C code that directly pokes around inside the ndarray's memory. Possibly cython is generating buggy C code?

gfyoung · 2016-07-23T18:15:07Z

@njsmith : Perhaps, though FWIW, this will work (i.e. not segfault):

def getfirstelt(ndarray arr):
    return arr[0]

It could be a Cython bug, and the reason why it hasn't surfaced in numpy is because __getitem__ IINM is implemented in C by you guys already, so there is no Cython layer to compile for that.

njsmith · 2016-07-23T18:33:46Z

So look at the C code cython generates for those two snippets and see what the difference is :-)

gfyoung · 2016-07-23T19:35:57Z

@njsmith : maybe when I have A LOT of time on my hands :). FYI, I think we can also close the related numpy issue since we figured out what to do here.

Weird segfault arises when you call lib.isnullobj (or any of its equivalents like lib.isnullobj2d) with an array that uses 0-field values to mean None. Changed input to be a Python object (i.e. no typing), and the segfault went away. Closes pandas-devgh-13717. [ci skip]

Weird segfault arises when you call `lib.isnullobj` with an array that uses 0-field values to mean `None`. Changed input to be a `Python` object (i.e. no typing), and the segfault went away. Discovered when there were segfaults in printing a `DataFrame` containing such an array. Closes #13717. Author: gfyoung <gfyoung17@gmail.com> Closes #13764 from gfyoung/isnullobj-segfault and squashes the following commits: 0338b5d [gfyoung] BUG: Fix segfault in lib.isnullobj

jreback · 2016-07-24T14:01:53Z

closed by #13764

jreback added the Bug label Jul 20, 2016

jreback added this to the Next Major Release milestone Jul 20, 2016

jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jul 20, 2016

flatberg mentioned this issue Jul 20, 2016

Memory (segfault) issues with np.empty_like numpy/numpy#7854

Closed

gfyoung mentioned this issue Jul 21, 2016

BUG: Explicitly fill in None for np.empty_like numpy/numpy#7857

Closed

jreback closed this as completed Jul 21, 2016

jreback modified the milestones: No action, Next Major Release Jul 21, 2016

gfyoung mentioned this issue Jul 23, 2016

BUG: Fix segfault in lib.isnullobj #13764

Closed

jreback modified the milestones: 0.19.0, No action Jul 24, 2016

jreback mentioned this issue Aug 24, 2016

DataFrame construction breaks python on numpy empty_like #14082

Closed

caspervdw mentioned this issue Jul 30, 2021

Pygeos segfaults on arrays constructed with np.empty_like pygeos/pygeos#373

Closed

This was referenced Jun 6, 2022

BUG: Segfault when printing dataframe #46848

Closed

Fix segfault with printing dataframe #47097

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Seqfault on creation of dataframe with np.empty_like #13717

Seqfault on creation of dataframe with np.empty_like #13717

flatberg commented Jul 20, 2016

jreback commented Jul 20, 2016

gfyoung commented Jul 20, 2016 •

edited

Loading

bashtage commented Jul 20, 2016

jreback commented Jul 20, 2016

gfyoung commented Jul 21, 2016

jreback commented Jul 21, 2016

bashtage commented Jul 21, 2016

gfyoung commented Jul 21, 2016 •

edited

Loading

njsmith commented Jul 21, 2016

gfyoung commented Jul 21, 2016 •

edited

Loading

njsmith commented Jul 22, 2016

gfyoung commented Jul 23, 2016 •

edited

Loading

gfyoung commented Jul 23, 2016 •

edited

Loading

njsmith commented Jul 23, 2016

gfyoung commented Jul 23, 2016 •

edited

Loading

njsmith commented Jul 23, 2016

gfyoung commented Jul 23, 2016

jreback commented Jul 24, 2016

Seqfault on creation of dataframe with np.empty_like #13717

Seqfault on creation of dataframe with np.empty_like #13717

Comments

flatberg commented Jul 20, 2016

INSTALLED VERSIONS

jreback commented Jul 20, 2016

gfyoung commented Jul 20, 2016 • edited Loading

bashtage commented Jul 20, 2016

jreback commented Jul 20, 2016

gfyoung commented Jul 21, 2016

jreback commented Jul 21, 2016

bashtage commented Jul 21, 2016

gfyoung commented Jul 21, 2016 • edited Loading

njsmith commented Jul 21, 2016

gfyoung commented Jul 21, 2016 • edited Loading

njsmith commented Jul 22, 2016

gfyoung commented Jul 23, 2016 • edited Loading

gfyoung commented Jul 23, 2016 • edited Loading

njsmith commented Jul 23, 2016

gfyoung commented Jul 23, 2016 • edited Loading

njsmith commented Jul 23, 2016

gfyoung commented Jul 23, 2016

jreback commented Jul 24, 2016

gfyoung commented Jul 20, 2016 •

edited

Loading

gfyoung commented Jul 21, 2016 •

edited

Loading

gfyoung commented Jul 21, 2016 •

edited

Loading

gfyoung commented Jul 23, 2016 •

edited

Loading

gfyoung commented Jul 23, 2016 •

edited

Loading

gfyoung commented Jul 23, 2016 •

edited

Loading