Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix segfault with printing dataframe #47097

Conversation

roberthdevries
Copy link
Contributor

@roberthdevries roberthdevries commented May 23, 2022

GH 46848

@roberthdevries
Copy link
Contributor Author

I have a fix that prevents the segfault, but I am not sure how a regression test could be made as the failure depends on the width of the terminal screen. I deem this not a workable regression test, but I am at a loss how to trigger the issue otherwise.
As far as I can determine the crash only happens when printing a dataframe requires the use of dots to indicate missing values when the screen is not wide enough.
Apparently there is some logic to determine this situation, and then take another code path. This other code path then takes care of the injection of the dots, but also triggers the code path that produces the segfault.
Ideally we can make a regression test that provokes the code path that produces the segfault without the print logic.

@roberthdevries roberthdevries marked this pull request as draft May 23, 2022 17:35
@roberthdevries roberthdevries force-pushed the 46848-fix-segfault-when-printing-dataframe branch from eb6301c to 9981fb1 Compare May 23, 2022 17:41
@rhshadrach
Copy link
Member

Does the segfault arise in one of the lines of the diff of this PR? Or is it only when this result is used later on?

@simonjayhawkins simonjayhawkins added this to the 1.4.3 milestone May 26, 2022
@simonjayhawkins simonjayhawkins added Regression Functionality that used to work in a prior pandas version Output-Formatting __repr__ of pandas objects, to_string labels May 26, 2022
@@ -183,8 +183,15 @@ def take_2d_axis1_{{name}}_{{dest}}(ndarray[{{c_type_in}}, ndim=2] values,
{{if c_type_in == "uint8_t" and c_type_out == "object"}}
out[i, j] = True if values[i, idx] > 0 else False
{{else}}
{{if c_type_in == "object"}} # GH46848
if values[i][idx] is None:
out[i, j] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any idea why this makes a difference? this looks deeply weird

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel i've posted the arguments that cause take_2d_axis1_object_object to segfault in #46848 (comment) if that helps.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps should explicitly check for NULL for clarity. see #46848 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps should explicitly check for NULL for clarity. see #46848 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we fix here, probably should also patch take_2d_axis0_, see #46848 (comment)

@roberthdevries would changing L159 like done in https://github.com/pandas-dev/pandas/pull/13764/files also work?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmpff, It seems empty_like does not fill the object array with None, but just ensures NULL'ing. NumPy supports this (mostly?), it is perfectly fine if an array is filled with NULL's, this is the same as if the array was filled with None.

It is probably best if you also add support for that in pandas. That said, it would also be better to explicitly fill the array with None in NumPy. (And I do think we should define that as the right thing, even if there are branches where NULL's must be expected a fully initialized array filled with NULLs should never happen!)

Copy link
Contributor

@seberg seberg Jun 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To proof the point:

In [7]: arr = np.empty(3, dtype=object)

In [8]: arr.tobytes()
Out[8]: b'`\xd1\x91\x00\x00\x00\x00\x00`\xd1\x91\x00\x00\x00\x00\x00`\xd1\x91\x00\x00\x00\x00\x00'

In [9]: np.empty_like(arr).tobytes()
Out[9]: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seberg How would you effectively detect NULL values in a numpy array? I assume that @simonjayhawkins prefers a fix somewhere in the setitem method of DataFrame, but I am at a loss how to intercept those values and convert them to None.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, it seems to me that cython may not know about the possibility of NULLs for object typed buffers (either that way or typed memoryviews). The generated code doesn't look like it. So I am not sure what a decent way forward would be.

Copy link
Contributor

@seberg seberg Jun 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, a work-around that is relatively quick may be the following:

    cdef cnp.intp_t[2] idx
    idx[0] = i
    idx[1] = idx

    cdef void *data = cnp.PyArray_GetPtr(values, idx)
    if data == NULL:
        data = <void *>None
    out[i,j] = <object>data

But maybe its nicer to just drop that weird PyArray_GetPtr:

    cdef void *data = arr.data + x * arr.strides[0] + y * arr.strides[1]
    if data == NULL:
        data = <void *>None
    return <object>data

which will be as fast as possible without any micro-optimizations (not quite sure on Cython 3, but maybe it doesn't matter).

I am have planning on plugging this hole in NumPy (numpy/numpy#21817), but I am not sure if it is better there or in Cython.
(IMO, it is clearly right to fix this either in NumPy or in Cython, although if we fix it in NumPy, pandas still needs the work-around since you can rely on a new Cython, but not a new NumPy.)

@roberthdevries
Copy link
Contributor Author

The segfault is caused on the original line 186.
The code generated by cython tries to increment the refcount of the object on the rhs of the assignment expression. However that is a NULL pointer.
Question is where that NULL value comes from.
The extra check for None prevents this call to increase the refcount.

@jreback
Copy link
Contributor

jreback commented May 27, 2022

can you add a test which replicates

@simonjayhawkins
Copy link
Member

can you add a test which replicates

@roberthdevries can you also add a release note to doc/source/whatsnew/v1.4.3.rst and resolve conflicts with main

@simonjayhawkins
Copy link
Member

I have a fix that prevents the segfault, but I am not sure how a regression test could be made as the failure depends on the width of the terminal screen.

use MRE from #46848 (comment)

@simonjayhawkins
Copy link
Member

@jbrockmendel thoughts on catching the uninitialised numpy array in setitem instead? A change to setitem was the cause to the regression. either way I guess there will a performance regression.

according to numpy/numpy#7854 it is a Cython issue. maybe we should also not fix.

@jbrockmendel
Copy link
Member

thoughts on catching the uninitialised numpy array in setitem instead?

its not clear to me how we check for this. is there something in like arr.flags that would have what we need?

@simonjayhawkins simonjayhawkins mentioned this pull request Jun 8, 2022
@simonjayhawkins
Copy link
Member

thoughts on catching the uninitialised numpy array in setitem instead?

its not clear to me how we check for this. is there something in like arr.flags that would have what we need?

I was thinking along the lines that maybe more performant to do a check once on setitem rather than on every subsequent take operation. but thinking some more should probably take this suggestion of the table and be consistent with the fix for the similar issue for DataFrame construction #13717.

@roberthdevries roberthdevries force-pushed the 46848-fix-segfault-when-printing-dataframe branch 3 times, most recently from aa49b89 to 6f51d22 Compare June 10, 2022 12:51
@simonjayhawkins
Copy link
Member

according to numpy/numpy#7854 it is a Cython issue. maybe we should also not fix.

in the issue, the OP is using

for col in df.columns:
    df[col] = np.empty_like(df[col])

print(df)

which results in the segfault.

instead

for col in df.columns:
    df.loc[:, col] = np.empty_like(df[col])

print(df)

works fine.

#46267 is reporting a significant slowdown for repeated assigned via __setitem__

this sort of supports the "won't fix" (in pandas) argument as the workaround is more performant.

@roberthdevries can you run the asvs to see if the current proposed fix affects performance.

@seberg
Copy link
Contributor

seberg commented Jun 21, 2022

The current proposed fix should greatly slow down things. You probably would need to push the code closer to C to fix it, though (unless cython is "fixed").
(That is should not be particularly difficult, but not a typical code pattern here I guess.)

Maybe we should prioritize NumPy avoiding this. There are some cases where NumPy probably cannot, but I guess that those cases do not really affect normal cython use cases. Of course, that will only fix the issue on newer NumPy versions.

@roberthdevries roberthdevries force-pushed the 46848-fix-segfault-when-printing-dataframe branch from 6f51d22 to d8da853 Compare June 21, 2022 17:54
@simonjayhawkins
Copy link
Member

moving off 1.4.3 milestone.

@simonjayhawkins simonjayhawkins modified the milestones: 1.4.3, 1.4.4 Jun 21, 2022
@roberthdevries
Copy link
Contributor Author

@roberthdevries can you run the asvs to see if the current proposed fix affects performance.

I have run the asvs, but the output is huge and I cannot find anything obvious in it that would indicate a slowdown. But maybe you could help me by pointing in the direction where I should look, or tell me what you are looking for.

@roberthdevries roberthdevries force-pushed the 46848-fix-segfault-when-printing-dataframe branch from d8da853 to 1bb002b Compare June 22, 2022 10:18
@simonjayhawkins
Copy link
Member

there's a benchmark using take_nd called parallel_take1d so I assume not relevant. And there's a time_take in asv_bench/benchmarks/indexing.py which is for a Series. So it maybe that we don't have any benchmarks specific to the changed code path.

So if we don't see any other slowdowns, maybe that's good or maybe should add an explicit benchmark to be sure

@roberthdevries
Copy link
Contributor Author

My nightly run did not report any significant performance degradations:

[99.81%] ··· Setting up timedelta:14                                                                                                                                  ok
[99.81%] ··· timedelta.DatetimeAccessor.time_dt_accessor                                                                                                         178±2ns
[99.83%] ··· timedelta.DatetimeAccessor.time_timedelta_days                                                                                                     1.16±0ms
[99.86%] ··· timedelta.DatetimeAccessor.time_timedelta_microseconds                                                                                             1.17±0ms
[99.88%] ··· timedelta.DatetimeAccessor.time_timedelta_nanoseconds                                                                                              1.16±0ms
[99.90%] ··· timedelta.DatetimeAccessor.time_timedelta_seconds                                                                                               1.14±0.01ms
[99.93%] ··· Setting up tslibs.timedelta:55                                                                                                                           ok
[99.93%] ··· tslibs.timedelta.TimedeltaProperties.time_timedelta_days                                                                                            167±2ns
[99.95%] ··· tslibs.timedelta.TimedeltaProperties.time_timedelta_microseconds                                                                                    165±1ns
[99.98%] ··· tslibs.timedelta.TimedeltaProperties.time_timedelta_nanoseconds                                                                                   167±0.6ns
[100.00%] ··· tslibs.timedelta.TimedeltaProperties.time_timedelta_seconds                                                                                         166±1ns
(base) robert@thinkpad-rhdv:~/projects/pandas/asv_bench$

@simonjayhawkins
Copy link
Member

@jbrockmendel wdyt here?

The underlying issue this is fixing has been around since at least pandas-0.25.3 (the oldest I checked with using your code sample in #46848 (comment))

There have been issues with np.empty-like before but this current issue has only surfaced for the DataFrame printing due to the __setitem__ change in 1.4.

The fix here is probably the right place (if fixing in pandas) but the take code, I assume, is performance critical, although benchmarks haven't turned anything up. (do you know if the 2d take path is heavily used?)

@jbrockmendel
Copy link
Member

The fix here is probably the right place (if fixing in pandas) but the take code,

Agreed this is the best option currently available. Maybe @scoder can weigh in on future possibilities.

I assume, is performance critical, although benchmarks haven't turned anything up. (do you know if the 2d take path is heavily used?)

It's object-dtype, so I'm inclined to just not worry about it.

@roberthdevries roberthdevries force-pushed the 46848-fix-segfault-when-printing-dataframe branch from 1bb002b to 450523f Compare June 23, 2022 13:28
@roberthdevries roberthdevries marked this pull request as ready for review June 23, 2022 13:28
@seberg
Copy link
Contributor

seberg commented Jun 23, 2022

Btw. I have opened a cython PR that should fix this: cython/cython#4859

Even if we say that the NumPy fix is the right path, that might be the best way forward to avoid lingering issues until we can rely on a "corrected" NumPy.

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 24, 2022
@simonjayhawkins
Copy link
Member

Btw. I have opened a cython PR that should fix this: cython/cython#4859

That's great. Thanks @seberg

I've not yet tested the fix, but now that that fix is in a released Cython we maybe able to close the issue with just merging the test from this PR.

Probably best not to add the release note, just in case we have issues at wheel build time, but can certainly backport the test to 1.4.x

@seberg
Copy link
Contributor

seberg commented Aug 3, 2022

Yeah, now that cython has a new release the fix is to pin that version and only add the test if anything. I guess there might be a few other places with hacks in place that can now be removed.

@simonjayhawkins simonjayhawkins mentioned this pull request Aug 5, 2022
2 tasks
@simonjayhawkins
Copy link
Member

Thanks @roberthdevries for the PR. closing in favor of #47979

tested with @jbrockmendel minimal example and cython 0.29.32

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Output-Formatting __repr__ of pandas objects, to_string Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Segfault when printing dataframe
6 participants