Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix invalid memory write segmentation fault #118

Merged
merged 10 commits into from
Oct 24, 2022

Conversation

chrispap95
Copy link
Collaborator

This proposed change fixes the segmentation fault that occurs for some input arrays. My investigation indicated that the invalid memory access was happening in the last iteration of this loop. Specifically, the last element of *ptroff seemed to be after the allocated block. After running valgrind on this, I got messages like:

==39853== Invalid write of size 4
==39853==    at 0x14E86C5C: pybind11_init__ext(pybind11::module_&)::{lambda(output_wrapper, double)#1}::operator()(output_wrapper, double) const (_ext.cpp:144)
...
==39853==  Address 0x100e45d8 is 0 bytes after a block of size 1,496 alloc'd

I am still not 100% sure why the allocation is missing 1 byte. Maybe some trailing character is appended by pybind11 arrays? More knowledgable people might know. For the moment being, I decided to reallocate 1 more byte for all definitions of *ptroff. Tests are passing and the behavior seems normal thus far.

@chrispap95 chrispap95 merged commit 4613646 into scikit-hep:main Oct 24, 2022
@chrispap95
Copy link
Collaborator Author

chrispap95 commented Oct 24, 2022

Solves (for the moment) #98

@henryiii
Copy link
Member

This code is kind of terrifying and doesn't look like the right way to cast an array to numpy.

@chrispap95
Copy link
Collaborator Author

Do you have any examples of proper implementations? Was the original code correct?

@jpivarski
Copy link
Member

The "kind of terrifying" got my attention, so I stared at this PR for a while.

I don't see how calling realloc on a NumPy array's pointer is ever the right thing to do. (Probably what @henryiii meant by "terrifying".) The py::buffer_info already allocated memory; the realloc call will either return that memory region (in which case, ptroff and the NumPy array have the same pointer, and filling ptroff actually fills the NumPy array) or realloc will allocate a new one, in which case the NumPy array will not be filled. In either case, I don't know whether NumPy will delete the original allocation or the new one when the array gets garbage collected.

More fundamentally, off and ptroff must be the ListOffsetArray offsets, whose length is one greater than the length of the jagged array (fencepost principle: its first entry is the starting index of the first nested list and its last entry is the stopping index of the last nested list).

But it was allocated with length len:

https://github.com/chrispap95/fastjet/blob/8b37967ba030ff8805e38fdccc098255af49b88c/src/_ext.cpp#L128

and you need it to have length len + 1. Instead of reallocating it, make it the right length in the py::buffer_info constructor. (Similarly for all the edits in this PR.)

How it worked at all—before or after this fix—was probably some accident of how memory was laid out in RAM.

@chrispap95
Copy link
Collaborator Author

Okay, now I think I understand what's the problem. Actually, my first attempt at fixing this was making the allocation
len+1 in py::buffer_info instead of len. However, this made the tests fail because it introduced an extra empty array at the end of the output. Perhaps, I should have adjusted the loop as well?

@chrispap95
Copy link
Collaborator Author

In fact, there is a commit named allocate len+1 instead of len

@jpivarski
Copy link
Member

Then your first thought was right and the test failed for something downstream. Replacing lenlen + 1 will make the resulting ListOffsetArray one list longer.

Either

  • the offsets need to be len + 1 and the kinematic arrays need to be len or
  • the offsets need to be len and the kinematic arrays need to be len - 1

The filling code seems to think that offsets needs to be len + 1, since ptroff++ is executed len + 1 times: here and here.

There's an off-by-one error here somewhere; it depends on what len means. It's also worth considering that the test may be wrong, though I don't think jet-finding would return an empty jet, [], right?

@jpivarski
Copy link
Member

Inspired by this PR: #124.

@chrispap95
Copy link
Collaborator Author

Thanks! This clears things up a lot for me. I will try to put a fixing PR together for this tomorrow.

@chrispap95 chrispap95 deleted the mem-patch branch November 1, 2022 10:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants