-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Seqfault on creation of dataframe with np.empty_like #13717
Comments
not really sure what |
I suspect it is a case of memory corruption on the >>> import numpy as np
>>> import pandas as pd
>>> arr = np.empty_like(pd.Index(['a', 'b']))
>>> arr
array([None, None], dtype=object)
>>> arr2 = np.array([None, None], dtype=object)
>>> arr == arr2
True
>>> arr.data.tobytes() # segfault
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> arr2.data.tobytes() # no segfault
b'0\x91\xad\x8b\xd8\x7f\x00\x000\x91\xad\x8b\xd8\x7f\x00\x00' You can see here that the data is not corrupted when you have integers: >>> import numpy as np
>>> import pandas as pd
>>> arr = np.empty_like(pd.Index([1, 2]))
>>> arr
array([140568033367712, 140568033367744], dtype=int64)
>>> arr2 = np.array([140568033367712, 140568033367744], dtype=np.int64)
>>> arr == arr2
True
>>> arr.data.tobytes() # no segfault
b'\xa0\x16\xb2\x8b\xd8\x7f\x00\x00\xc0\x16\xb2\x8b\xd8\x7f\x00\x00'
>>> arr2.data.tobytes() # no segfault
b'\xa0\x16\xb2\x8b\xd8\x7f\x00\x00\xc0\x16\xb2\x8b\xd8\x7f\x00\x00' Also, you don't need an >>> import numpy as np
>>> import pandas as pd
>>> pd.DataFrame(np.empty_like([None])) |
You can do more wierd tricks with this: np.empty_like(np.array([None, None], dtype=object)).data.tobytes()
Out[41]: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
np.empty_like(np.array([None, None], dtype=object)).astype(np.int64).data.tobytes()
Out[42]: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
np.empty_like(np.array([None, None], dtype=object)).astype(np.int64).astype(np.object).data.tobytes()
Out[43]: b'\xd0\x01\x9ac\x00\x00\x00\x00\xd0\x01\x9ac\x00\x00\x00\x00' This looks like a NumPy bug. I suspect that an object array initialized using >>> np.array([None]).data.tobytes()[::-1].hex()
'0000000063979420'
>>> hex(id(None))
'0x63979420' |
agree with @bashtage here - some memory aliasing going on want to open a numpy side issue? |
@jreback : I opened a PR now in |
running this in the debugger actually works. I think there is a reference issue to the Nones; an element-wise copy would fix it I suppose (from pandas perspective). but yes this is a numpy-bug. |
It is probably, technically, a Pandas bug. NumPy is returns an empty array filled with NULL. Pandas doesn't seems to know what to do when an object array contains NULLs. This said, it would be better if NumPy addressed this by never returning object arrays containing NULLs, and in instead returned the address of |
@bashtage : True, but the annoying part on the |
This should probably be treated as a pandas bug, and definitely not as a bug in empty_like specifically. Numpy's semantics for interpreting the bytes in object arrays are, consistently:
If pandas wants to implement its own logic for reading objects out of object arrays, then it kinda has to implement the same semantics that numpy uses. Potentially numpy could get rid of the special case for null pointers. I'm not actually sure if this would be a good idea or not, since there is a bit of an optimization there (the special case means we can allocate a large array of None objects, as in Getting rid of this special case is more involved than just patching |
@njsmith: So you have semantically aliased the zero field value to |
Well not me personally :-). But yes, that's my understanding. |
Okay, given the push-back I'm seeing from the >>> import pandas.lib as lib
>>> import numpy as np
>>> lib.isnullobj(np.empty_like([None])) This drew my attention when I realised the segfault wasn't occurring with just the mere creation of the So it does seem like |
Okay, so the issue is a memory access violation occurring right here (if you follow my example). This definitely looks like a I should also point out that if I define the following function in def getfirstelt(ndarray[object] arr):
return arr[0] and then do this: >>> from numpy.random import getfirstelt
>>> import numpy as np
>>> np.getfirstelt(np.empty_like([None])) this also segfaults. @njsmith : what are your thoughts about this? |
@gfyoung: I think that your code there is triggering cython to generate some C code that directly pokes around inside the ndarray's memory. Possibly cython is generating buggy C code? |
@njsmith : Perhaps, though FWIW, this will work (i.e. not segfault): def getfirstelt(ndarray arr):
return arr[0] It could be a Cython bug, and the reason why it hasn't surfaced in |
So look at the C code cython generates for those two snippets and see what the difference is :-) |
@njsmith : maybe when I have A LOT of time on my hands :). FYI, I think we can also close the related |
Weird segfault arises when you call lib.isnullobj (or any of its equivalents like lib.isnullobj2d) with an array that uses 0-field values to mean None. Changed input to be a Python object (i.e. no typing), and the segfault went away. Closes pandas-devgh-13717. [ci skip]
Weird segfault arises when you call `lib.isnullobj` with an array that uses 0-field values to mean `None`. Changed input to be a `Python` object (i.e. no typing), and the segfault went away. Discovered when there were segfaults in printing a `DataFrame` containing such an array. Closes #13717. Author: gfyoung <gfyoung17@gmail.com> Closes #13764 from gfyoung/isnullobj-segfault and squashes the following commits: 0338b5d [gfyoung] BUG: Fix segfault in lib.isnullobj
closed by #13764 |
This code segfaults:
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-91-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: 0.7.2
IPython: 5.0.0
sphinx: None
patsy: 0.4.1
dateutil: 2.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None
The text was updated successfully, but these errors were encountered: