Fix segfault with printing dataframe #47097
Conversation
I have a fix that prevents the segfault, but I am not sure how a regression test could be made, as the failure depends on the width of the terminal screen. I deem this not a workable regression test, but I am at a loss how to trigger the issue otherwise.
Does the segfault arise in one of the lines of the diff of this PR? Or is it only when this result is used later on?
@@ -183,8 +183,15 @@ def take_2d_axis1_{{name}}_{{dest}}(ndarray[{{c_type_in}}, ndim=2] values,
    {{if c_type_in == "uint8_t" and c_type_out == "object"}}
        out[i, j] = True if values[i, idx] > 0 else False
    {{else}}
    {{if c_type_in == "object"}}  # GH46848
        if values[i][idx] is None:
            out[i, j] = None
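For readers unfamiliar with the generated Cython templates, a pure-Python sketch of what the guarded take loop above does (the function name here is hypothetical; the real code is generated from the `take_2d_axis1_{{name}}_{{dest}}` template):

```python
import numpy as np

def take_2d_axis1_object_sketch(values, indexer):
    """Sketch of the patched object-dtype take: None entries (including
    NULL slots that NumPy surfaces as None at the Python level) are
    assigned explicitly instead of being passed through unguarded."""
    out = np.empty((values.shape[0], len(indexer)), dtype=object)
    for i in range(values.shape[0]):
        for j, idx in enumerate(indexer):
            v = values[i, idx]
            out[i, j] = None if v is None else v
    return out

vals = np.array([[1, None, "x"], [None, 2.5, "y"]], dtype=object)
result = take_2d_axis1_object_sketch(vals, [2, 0])
```

At the Python level this guard is redundant (indexing already returns a real object); the point of the Cython-level check is that the raw buffer slot may hold a NULL pointer that must never be dereferenced.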
any idea why this makes a difference? this looks deeply weird
@jbrockmendel i've posted the arguments that cause take_2d_axis1_object_object to segfault in #46848 (comment) if that helps.
perhaps should explicitly check for NULL for clarity. see #46848 (comment)
If we fix here, probably should also patch take_2d_axis0_, see #46848 (comment)
@roberthdevries would changing L159 as done in https://github.com/pandas-dev/pandas/pull/13764/files also work?
Hmmmpff, it seems empty_like does not fill the object array with None, but just NULLs it. NumPy supports this (mostly?): it is perfectly fine if an array is filled with NULLs, and that is treated the same as if the array were filled with None.
It is probably best if you also add support for that in pandas. That said, it would also be better to explicitly fill the array with None in NumPy. (And I do think we should define that as the right thing: even if there are branches where NULLs must be expected, a fully initialized array filled with NULLs should never happen!)
To prove the point:
In [7]: arr = np.empty(3, dtype=object)
In [8]: arr.tobytes()
Out[8]: b'`\xd1\x91\x00\x00\x00\x00\x00`\xd1\x91\x00\x00\x00\x00\x00`\xd1\x91\x00\x00\x00\x00\x00'
In [9]: np.empty_like(arr).tobytes()
Out[9]: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
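The first buffer above holds the same non-zero word three times (the address of None); the second is all zero bytes, i.e. NULL pointers. If one wants the explicit None filling today, np.full avoids the ambiguity entirely — a sketch, not part of this PR's diff:

```python
import numpy as np

arr = np.empty(3, dtype=object)
# np.full writes an actual reference to None into every slot, so
# downstream code never sees a NULL entry regardless of how NumPy
# chose to initialize the raw buffer
safe = np.full(arr.shape, None, dtype=object)
```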
@seberg How would you effectively detect NULL values in a numpy array? I assume that @simonjayhawkins prefers a fix somewhere in the setitem method of DataFrame, but I am at a loss how to intercept those values and convert them to None.
Honestly, it seems to me that Cython may not know about the possibility of NULLs for object-typed buffers (either as buffers or as typed memoryviews). The generated code doesn't look like it. So I am not sure what a decent way forward would be.
So, a work-around that is relatively quick may be the following (note it is the stored PyObject* that can be NULL, so the element pointer has to be dereferenced before the check):

cdef cnp.intp_t[2] idx
idx[0] = i
idx[1] = j
cdef PyObject *obj = (<PyObject **>cnp.PyArray_GetPtr(values, idx))[0]
if obj == NULL:
    out[i, j] = None
else:
    out[i, j] = <object>obj

But maybe it's nicer to just drop that weird PyArray_GetPtr and compute the element address directly:

cdef char *ptr = values.data + i * values.strides[0] + j * values.strides[1]
cdef PyObject *obj = (<PyObject **>ptr)[0]
if obj == NULL:
    out[i, j] = None
else:
    out[i, j] = <object>obj

which will be as fast as possible without any micro-optimizations (not quite sure on Cython 3, but maybe it doesn't matter).
I have been planning to plug this hole in NumPy (numpy/numpy#21817), but I am not sure if it is better there or in Cython.
(IMO, it is clearly right to fix this either in NumPy or in Cython, although if we fix it in NumPy, pandas still needs the work-around since you can rely on a new Cython, but not a new NumPy.)
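For background, an object array's buffer is just an array of PyObject pointers, which is why dereferencing a NULL slot segfaults. A small illustration (relying on CPython, where id() is the object's address, so the assumptions here are CPython-specific):

```python
import numpy as np

arr = np.empty(3, dtype=object)
# Reinterpret the raw buffer as machine-word integers: each word is a
# PyObject*. np.empty on object dtype initializes every slot to None,
# so each word should hold the address of the None singleton.
ptrs = np.frombuffer(arr.tobytes(), dtype=np.uintp)
```

The same trick applied to np.empty_like output is what the Out[8]/Out[9] comparison above shows: all-zero words, i.e. NULL pointers rather than pointers to None.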
The segfault is caused on the original line 186.
can you add a test which replicates
@roberthdevries can you also add a release note to doc/source/whatsnew/v1.4.3.rst and resolve conflicts with main
use MRE from #46848 (comment)
@jbrockmendel thoughts on catching the uninitialised numpy array in setitem instead? A change to setitem was the cause of the regression. Either way I guess there will be a performance regression. According to numpy/numpy#7854 it is a Cython issue, so maybe we should also not fix it.
its not clear to me how we check for this. is there something like arr.flags that would have what we need?
I was thinking along the lines that it is maybe more performant to do a check once on setitem rather than on every subsequent take operation. But thinking some more, we should probably take this suggestion off the table and be consistent with the fix for the similar DataFrame construction issue #13717.
in the issue, the OP is using

for col in df.columns:
    df[col] = np.empty_like(df[col])
print(df)

which results in the segfault. instead

for col in df.columns:
    df.loc[:, col] = np.empty_like(df[col])
print(df)

works fine. #46267 is reporting a significant slowdown for repeated assignment via __setitem__. this sort of supports the "won't fix" (in pandas) argument as the workaround is more performant. @roberthdevries can you run the asvs to see if the current proposed fix affects performance.
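A runnable variant of the .loc workaround, filling explicitly with None instead of empty_like so it can be executed safely on any pandas/NumPy version (a sketch of the pattern, not the original reproducer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["p", "q"], "b": ["x", "y"]}, dtype=object)
for col in df.columns:
    # .loc writes into the existing column block, whereas df[col] = ...
    # replaces the column with the new array wholesale (the path that
    # stored the NULL-filled empty_like buffer in the issue)
    df.loc[:, col] = np.full(len(df), None, dtype=object)
```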
The current proposed fix should greatly slow down things. You probably would need to push the code closer to C to fix it, though (unless cython is "fixed"). Maybe we should prioritize NumPy avoiding this. There are some cases where NumPy probably cannot, but I guess that those cases do not really affect normal cython use cases. Of course, that will only fix the issue on newer NumPy versions.
moving off 1.4.3 milestone.
I have run the asvs, but the output is huge and I cannot find anything obvious in it that would indicate a slowdown. But maybe you could help me by pointing in the direction where I should look, or tell me what you are looking for.
there's a benchmark using … So if we don't see any other slowdowns, maybe that's good, or maybe we should add an explicit benchmark to be sure.
My nightly run did not report any significant performance degradations.
@jbrockmendel wdyt here? The underlying issue this is fixing has been around since at least pandas-0.25.3 (the oldest I checked using your code sample in #46848 (comment)). There have been issues with np.empty_like before, but this current issue has only surfaced for DataFrame printing due to the __setitem__ change in 1.4. The fix here is probably the right place (if fixing in pandas), but the take code, I assume, is performance critical, although benchmarks haven't turned anything up. (do you know if the 2d take path is heavily used?)
Agreed this is the best option currently available. Maybe @scoder can weigh in on future possibilities.
It's object-dtype, so I'm inclined to just not worry about it.
Btw. I have opened a cython PR that should fix this: cython/cython#4859. Even if we say that the NumPy fix is the right path, that might be the best way forward to avoid lingering issues until we can rely on a "corrected" NumPy.
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
That's great. Thanks @seberg. I've not yet tested the fix, but now that that fix is in a released Cython we may be able to close the issue by just merging the test from this PR. Probably best not to add the release note, just in case we have issues at wheel build time, but we can certainly backport the test to 1.4.x.
Yeah, now that cython has a new release the fix is to pin that version and only add the test, if anything. I guess there might be a few other places with hacks in place that can now be removed.
Thanks @roberthdevries for the PR. closing in favor of #47979, tested with @jbrockmendel's minimal example and cython 0.29.32