Inconsistency between the interface of vector bindings and python list & potential wild pointer problems. #32
I had the exact same thoughts just before reading this passage.
I'll add the notice as soon as possible. @leleogere, sorry to interrupt, but I wanted to ask what tool you used to create your previous diagram. It would be easier to understand if I could draw a schematic of the memory layout.
I did them manually in Inkscape (an open-source vector drawing app). It is quite powerful, but the learning curve might be a bit steep for beginners. About the topic of this issue: I don't know a lot of C++, but personally I would expect a Python library to stick as closely as possible to Python's behavior. Asking people "not to hold a reference to an event for a long time" is kind of vague and might confuse them even more. I'm not sure what the best solution is, but I think I would rather take a (small?) performance hit than a mysterious segmentation fault, eventually happening dozens of lines after the real origin of the problem.
@leleogere Thanks. I agree with you. I'm now discussing this with the author of nanobind (wjakob/nanobind#432). I've come up with a possible solution that has a smaller overhead; I'll be refining this idea over the next few days.
As I was nearing the end of writing the container I had envisioned, I realized that this solution still didn't guarantee semantic alignment with Python lists. I ended up choosing to use a large number of shared_ptrs to solve this problem. Of course, this will have a relatively large impact on performance, and I will test it further once the changes are complete.
@Natooz It's still a ways away from switching to shared_ptr, but I can probably predict its overhead already. This will introduce about … overhead. This may be acceptable, I'm not sure. But the overhead makes me uncomfortable (I admit, I'm a bit of a perfectionist), since most of the time overhead comes from the frequent memory allocation of single objects, like a note. Theoretically this should be avoidable, but I just haven't found a way.

But perhaps we can solve this problem with interface design instead of using shared_ptr. All the problems come from references. So if we copy an object in the list each time we access it, then all the problems would be solved, but that makes it hard for users to modify a list. One possible way is to design an interface like the following:

```python
# There are just a few tracks commonly, so we could use shared_ptr for tracks.
# And then, we could access a track through a reference
notes = s.tracks[0].notes  # reference, no copy
note = notes[1]            # copy
notes[1].pitch += 12       # the original data is not modified because of the copy

# new interface
notes[1, "pitch"] += 12    # the original data is now modified
```

Admittedly this approach is a bit counter-intuitive, but it is currently the most "efficient" way to deal with the dangling pointer problem. What's your opinion on this route?
Hi @Yikai-Liao, thank you for keeping me updated! 😃
I understand your "frustration" here, as I'm also a perfectionist, especially when it comes to basic operations used for many downstream tasks. First, as the issue here comes from the reference inconsistency between Python lists and vector bindings, it would be too bad to add overhead to all operations by switching everything to shared_ptr. I think copying the data instead of returning a reference might be a preferable route, and would probably add less overhead overall while remaining intuitive. For the …
I finally finished writing a container, pyvec, that will be used to solve the lifetime problem. Luckily, after testing, the overhead introduced by this container is much smaller than shared_ptr, in many cases an order of magnitude smaller. This means that after considering the overhead associated with the Python bindings, this overhead is largely negligible. Even when using only C++, I think the overhead is acceptable. Here are the test results of MIDI parsing from VTune, where the …
🥳 I have almost finished the new version. Now, the … But since I have totally rewritten the binding code, I'm not sure whether I missed some functions that existed in the old version. Further tests are needed.
Edit: Well, firstly, I need to make it compile on all platforms 😅.
@Natooz The tests in symusic's GitHub Actions have passed 🥳. If you have time, you could test whether the new version works correctly with MidiTok.
@Natooz Sorry for the late reply. I've been busy with my graduation project. The new version of symusic now passes all the tests in MidiTok. However, this involves a tweak to the copy function call. For now I have set up the interface semantics as follows:
I noticed that in MidiTok's test code, the … However, because of a mistake in my previous implementation (wrong definition of the …), … Therefore, I'm not sure this interface change I'm introducing makes sense.
Just to be on the safe side, I think it's time for me to introduce coverage tests in this new release (although it's long overdue). For the time being, I think it's okay to release this as a beta version. I also need to take the time to write documentation for the whole API and discuss the rationale for the default arguments.
Wonderful! Thank you for the heads-up! I think it makes total sense to bind the methods following the same principles. I'll make sure to update the methods in MidiTok. For the tests, I intended to transfer some of MidiTok's tests directly here; I'm especially thinking of Score splits and concatenations. Would you want me to do it?
@Natooz It would be great if you have time to do so. I will first add some tasks on the C++ side. Do you think we need to separate the Python test files and C++ test files into two subfolders?
I created a draft PR to add more tests. Right now I've just added a few; I'll add more and make sure everything runs smoothly when I have more time later. For the C++ tests, I don't know what the best practice is in this case, but I assume separating them into different directories wouldn't be a bad idea.
@Natooz I have merged … I changed some …
Wonderful!
It's not urgent. Just deal with it when you have time.
I'll take care of the tests tomorrow! 🙌
The few tests that don't pass are due to some error in the equality assertion of …

```python
from pathlib import Path

import pytest
from symusic import Score

# MIDI_PATHS_MULTITRACK and concat_scores come from the test suite's utilities.


@pytest.mark.parametrize("file_path", MIDI_PATHS_MULTITRACK)
def test_split_concat_score(file_path: Path):
    score = Score(file_path)
    if score.end() == 0:
        pytest.skip("empty file")
    tick_split = score.tpq * 4
    score_1 = score.clip(0, tick_split, clip_end=False)
    score_2 = score.clip(tick_split, score.end(), clip_end=True).shift_time(-tick_split)
    # Concatenate split MIDIs and assert it is equal to the original one
    score_concat = concat_scores((score_1, score_2), [tick_split])
    # Assert the concatenated Score is identical to the original one
    # We do not test tempos, time signatures and key signature as they are duplicated
    # in score_concat (same consecutive ones for each chunk).
    t1 = score.tracks
    t2 = score_concat.tracks
    tt = [(tt1, tt2) for tt1, tt2 in zip(t1, t2) if tt1 != tt2]
    for tt1, tt2 in tt:
        neq = tt1.notes == tt2.notes
        ceq = tt1.controls == tt2.controls  # Can be False
        peq = tt1.pedals == tt2.pedals
        pbeq = tt1.pitch_bends == tt2.pitch_bends
        names_eq = tt1.name == tt2.name
        programs_eq = tt1.program == tt2.program
        n1 = [n for n in tt1.notes]
        n2 = [n for n in tt2.notes]
        nn = [(nn1, nn2) for nn1, nn2 in zip(n1, n2) if nn1 != nn2]
        cc = [(cc1, cc2) for cc1, cc2 in zip(tt1.controls, tt2.controls) if cc1 != cc2]  # All equal when compared individually
    assert score.tracks == score_concat.tracks
    assert score.lyrics == score_concat.lyrics
    assert score.markers == score_concat.markers
```

I tried to look into the stubs for any potential cause of this without success. Do you know what could cause this error? Also, I made the lints pass; I will open a PR soon.
@Natooz Maybe tt1.controls and tt2.controls have different lengths? By the way, I think …
Indeed they had different lengths, thank you for the hint.
I find it more convenient and less redundant too!
I know I'm pretty late to the thread, but why not use …? This way you have stable references and pythonic shared-ptr behaviors, without the complexity of custom data structures.
In fact, I tried this option and almost finished the code. But this scheme introduces a big performance overhead, probably because shared_ptr involves a lot of atomic operations to ensure thread safety. For now, I'm using my own implementation, pyvec, to store events like notes, and `shared_ptr<vector<shared_ptr>>` to store tracks. This prevents tracks from being deep-copied when appending.
Which operations were causing performance overheads? This should only penalize you when creating a lot of Python proxy objects, and I would expect the penalty to be dwarfed by the CPython overhead of creating a lot of objects anyway. Are you sure you're not mixing up …?
A vector automatically reallocates memory if it exceeds its capacity during push_back, which invalidates all references held on the Python side and results in dangling pointers. So if you simply use shared_ptr, you must use …

To avoid this overhead, we would need something like Boost's stable_vector, which does not automatically reallocate memory. But stable_vector is in fact a linked list, and I want the data structure to have better locality (cache friendliness). So I wrote pyvec, a container designed precisely to achieve the kind of lazy execution you envision.

I'll draw a schematic of the pyvec data structure later for ease of understanding, but I'm not much of a user of SVG drawing tools, so I'll try to draw it by hand first. The most important features of pyvec are: …
In numpy, any append, stack, or concatenate operation copies the whole array to avoid reallocating memory in place. I don't think that's acceptable, because we expect users to use the list containers in symusic just like a Python list, not a numpy array.
I think you missed my point. If your Python objects hold these references, then no vector allocation would ever invalidate them:

```cpp
std::shared_ptr<std::vector<Note>> array;
int index;
```

The … Your requirements also already describe std::deque. And I never suggested using the numpy append operator; I'm suggesting using std::vector for append. I'm just pointing out that it's a common pattern (in numpy and many other tensor libraries) for user-friendly (pythonic) indexing and slicing to grab a shared_ptr to the root buffer and hold only indexing data, to prevent dangling references, which is exactly the problem you have here.
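A rough Python model of that pattern, to make the suggestion concrete (illustrative only, not real symusic or nanobind code): the proxy stores the whole container plus an index and resolves the element lazily on every access, so growing the container never leaves it pointing at freed storage.

```python
class NoteRef:
    """Proxy holding (container, index) instead of a pointer to one element."""

    def __init__(self, container, index):
        self._container = container   # shared ownership of the root buffer
        self._index = index           # only indexing data, no element pointer

    @property
    def pitch(self):
        return self._container[self._index]["pitch"]   # resolved lazily

    @pitch.setter
    def pitch(self, value):
        self._container[self._index]["pitch"] = value


notes = [{"pitch": 60}, {"pitch": 64}]
ref = NoteRef(notes, 1)
notes.append({"pitch": 67})   # growth/"reallocation" does not invalidate ref
ref.pitch += 12
assert notes[1]["pitch"] == 76
```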
Firstly, I want to make it clear that deque is not always stable.

Secondly, let's think about your code:

```cpp
std::shared_ptr<std::vector<Note>> cpp_array;
```

On the Python side, we get a corresponding Python object, py_array. But I need to make it clear that …

```python
# Build the py_array object and get a reference to one of its elements.
# We just create a shared_ptr<Note> from the shared_ptr<vector<Note>>, so the
# destructor of the corresponding cpp_array is not executed until the Python
# side releases obj0.
obj0 = py_array[0]

# Assume that the size and capacity of py_array are both 8 at this moment.
# Appending one more object then causes a reallocation.
# During this process the reference counts of all shared_ptrs remain unchanged,
# and the destructors of obj0 and py_array are not executed.
# However, the memory where obj0's actual data is stored has already been freed
# by the vector, and shared_ptr cannot sense this because no reference count
# was modified.
py_array.append(...)

# Try to access obj0.
# You may find that obj0 is still accessible sometimes, because that memory has
# not yet been reclaimed by the system, but it is already an out-of-bounds access.
print(obj0)
```

I only mentioned reallocation before not because it's the only problem, but because I think it's deadly enough. Insertion has the same issue even without reallocation:

```python
# Assume that size=5 and capacity=10.
obj4 = py_array[4]
obj3_copied = deepcopy(py_array[3])

# Insert a new object at the front.
# There is enough capacity in the vector, so the insert operation just moves all
# existing elements back by one slot.
py_array.insert(0, ...)

# From obj4's point of view, nothing has happened: the shared_ptr reference
# count has not changed, no destructor has run, and the stored pointer itself
# has not moved.
# So after the insert operation, obj4 points to the element that was previously obj3.
assert obj4 == obj3_copied
```

I have done actual testing of all of the above analysis myself (using both pybind11 and nanobind). I am trying to show that I did think long and hard about the solution you are talking about and experimented with it. I have also been frustrated that adding a shared_ptr directly does not meet the actual requirements, as it increases my workload considerably.

Lastly, I also appreciate you offering suggestions, and I'm happy to share with others the many difficulties I've had dealing with this issue, especially since my initial thought process was in line with yours. If there is anything I have not expressed clearly or have misrepresented, please feel free to keep asking questions.
I know that there is one more solution: when accessing an element in a container, return only a copy, never a reference. As long as references are not allowed, there is no dangling pointer problem. PyO3 uses exactly this strategy to keep Rust's memory safety, and I also wanted to write symusic with PyO3 in the first place. However, I want to make it easy for users to modify the elements in a container, like this:

```python
for note in notes:
    note.pitch += 12
# or
notes[-1].pitch += 12
notes[0].duration = 0
```

If the copy-on-access solution is used, the second way of modifying elements will fail, even if we use some tricks to allow reference access in the iterator. It is for this reason that we discarded the MIDI parsing code we had already written in Rust.
You're still missing my point about never holding note pointers, but rather only container pointers and indexes. This way it's literally impossible to have a dangling reference. But your point about index shifting when inserting at the beginning or middle of an array is valid.

I think this highlights a key divergence in my use case and expectations from yours. I picture the parser (and writer) as standalone components, and editing as a separate process better served by other tools. I ended up writing a prototype last night so I could move a different project forward: https://github.com/wrongbad/tensormidi

Refraining from custom data structures unlocks a ton of optimized manipulations doable with existing tools.

Convert delta microseconds to absolute seconds:

```python
events = tensormidi.MergedTrack('song.mid')
abs_times = numpy.cumsum(events['dt'] / 1e6)
```

Filter by channel and event type:

```python
events = tensormidi.MergedTrack('song.mid')
chan1 = events[events['channel'] == 1]
chan1notes = chan1[(chan1['type'] == NOTE_ON) | (chan1['type'] == NOTE_OFF)]
```

There are infinite ways to splice and mutate MIDI data, and it would be really hard to support all of them with custom APIs when you control the data structure and its mutations.
As you said, having each object hold the vector plus an index is also an unusable solution: sorting, inserting, and other operations invalidate it. Secondly, to better support other libraries such as numpy, I chose to add the …
Based on your code, what I understand is that you are trying to decode MIDI to a MIDI-message-level numpy array or tensor. In symusic, I chose to decode MIDI to the note level, and the SoAs mentioned earlier are all note level as well. I chose the note level because, when I was researching the field of symbolic music generation, there was almost no work left that directly follows the MIDI-message token sequence, except for the early Music Transformer. A great deal of work has demonstrated that placing messages from the same note in adjacent tokens, or in the same token, facilitates model learning. Of course, this also follows miditoolkit and pretty_midi. Note that converting MIDI messages to note-level information requires more complex control flow (if and else), and it is difficult to get high efficiency using numpy alone. This is one of the things I want to speed up in symusic.
It all depends on what you're trying to use it for. I think it's perfectly usable as long as you don't use the library itself to edit the data. Personally I'd prefer the tradeoff where I can parse straight into a contiguous buffer which can be mapped directly into pytorch tensors with the buffer protocol and then sent to the GPU, all in sub-ms time.
This is getting off topic, but I'm actually very curious what you mean here. I'm also curious why you think it can't be done with vectorized numpy (or torch) ops. Are you talking about tokenizing with duration, rather than "note off" events?
In fact, on the C++ side, we have already abstracted the code for the MIDI message layer into minimidi, mentioned in the README. But for simplicity, there is currently no interface binding the MIDI message level.
First, on the subject of saving the index: why would you need to save the index in a scenario where you don't need to edit anything? The size of a note is smaller than a shared_ptr plus a size_t, so copying is obviously better in this case (just like what PyO3 does).
Second, about decoding MIDI messages to the note level: you need to solve a lot of note-on/note-off pairing problems and determine the track, channel, and time corresponding to each note-on. Different events such as control change, set tempo, etc. need corresponding code to decode their specific information, not to mention that meta messages have variable length. All of this suggests that you need to iterate through all messages sequentially, with lots of if and else branches, which does not suit parallel processing. In symusic's decoding performance tests, we also observed that the compiler's automatic SIMD optimization does not affect decoding performance, which indirectly confirms that note-level decoding is difficult to parallelize.
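To illustrate why this step is inherently branchy and sequential, here is a minimal pairing sketch over a simplified event format. This is not minimidi's or symusic's actual API; the tuple layout and function name are made up for illustration.

```python
def pair_notes(events):
    """Pair note-on/note-off events into (start, duration, channel, pitch, velocity)."""
    active = {}   # (channel, pitch) -> (start_time, velocity) of a pending note
    notes = []
    for etype, time, channel, pitch, velocity in events:
        key = (channel, pitch)
        if etype == "note_on" and velocity > 0:
            active[key] = (time, velocity)            # remember the pending note
        elif etype == "note_off" or (etype == "note_on" and velocity == 0):
            if key in active:                         # drop unmatched note-offs
                start, vel = active.pop(key)
                notes.append((start, time - start, channel, pitch, vel))
        # control_change, set_tempo, meta events, ... each need their own branch
    return notes


events = [
    ("note_on", 0, 0, 60, 100),
    ("note_on", 10, 0, 64, 90),
    ("note_off", 120, 0, 60, 0),
    ("note_off", 130, 0, 64, 0),
]
print(pair_notes(events))  # [(0, 120, 0, 60, 100), (10, 120, 0, 64, 90)]
```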
For the development history of symbolic music representations, I recommend reading Music Transformer -> REMI -> Compound Word. It's true that symbolic music representations are still evolving, and even with tokenizers like MidiTok that can be used directly, people are still modifying their representations, but there are some trends we can see in the development process. I wrote symusic to replace miditoolkit and pretty_midi and to speed up the process of decoding MIDI files to note-level information. After decoding to the note level, a lot of operations (such as data augmentation) can be done directly at the numpy level in quick and easy batch operations.
You could still allow in-place note edits without allowing insert/delete, as it would be less surprising for users who try to modify notes. But since writing my initial suggestion, I think the numpy structured array is much better; its reference and mutation semantics are well established.

Yeah, I guess duration requires sequential processing. With a pure-numpy input, you could write a numba kernel for that.

In any case, I'm still not seeing much use for supporting sequence insertions or deletions. The numpy philosophy is for length-altering ops to just copy/fill a new destination array. Since you're generally adding or removing many notes at once, this is actually much faster in common cases than using a data structure with insertion capability and applying each insertion/deletion sequentially.

I've been working on low-latency causal modeling, where durations aren't yet known when you need model predictions. I know duration helps with stuck notes, but I've found that including the note state as an auxiliary input also mitigates this.
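As a small illustration of that copy-into-a-new-array style for bulk edits (the field names here are made up, not tensormidi's or symusic's actual dtype):

```python
import numpy as np

# Structured "array of notes": deletions become boolean masks and insertions
# become a concatenation into a freshly allocated destination array.
notes = np.array(
    [(0, 60), (480, 62), (960, 64)],
    dtype=[("time", "i4"), ("pitch", "i1")],
)
new = np.array([(240, 72), (720, 76)], dtype=notes.dtype)

keep = notes["pitch"] != 62                                 # "delete" by masking
merged = np.concatenate([notes[keep], new])                 # "insert" by rebuilding
merged = merged[np.argsort(merged["time"], kind="stable")]  # restore time order
print(merged["time"])   # [  0 240 720 960]
```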
As for decoding speed, while I know symusic still has some room for optimization, the difference shouldn't be dramatic. We did a lot of profiler-guided optimization, which already improved performance several times over the original version. The reason sub-millisecond decoding times don't show up in the benchmark is that mahler.mid is a very large MIDI file, basically the largest MIDI file we could find.
Can you share a link where I can find mahler.mid? I've been testing on the bread-midi-dataset, measuring aggregates in files/second.
First of all, it's still about saving indexes. If you want to support in-place editing without changing the order, pybind11's and nanobind's default vector binding strategy (returning references) is very efficient, and you don't need to modify it yourself; saving indexes may even be less efficient. Of course, if your goal is to support append, extend, and in-place editing while keeping references to some elements, with operations such as insert, remove, sort, and clear disabled, then preserving the index is indeed a viable option.

Secondly, I also considered using … In symusic, I chose the traditional AoS structure as the main interaction interface and provided a fast bidirectional conversion to the numpy-based SoA structure. This seems like a good tradeoff to me. Of course, you can't expect one library to fulfill everyone's needs, and maybe my design doesn't fit your application scenario.

Third, here is mahler: https://github.com/lzqlzzq/minimidi/blob/main/example/mahler.mid

By the way, if you haven't implemented MIDI decoding before, I recommend taking a look at our minimidi, a single-header C++ library for parsing MIDI files to the message level (with the corresponding encoding code as well). We ran into quite a few pitfalls, or special cases to deal with, when implementing the decoding part, although most of them live in the message-level to note-level conversion.
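For readers less familiar with the AoS/SoA distinction being discussed, here is a generic numpy illustration (not symusic's actual conversion API; the field names are made up):

```python
import numpy as np

# AoS: one record per note, whole notes stored contiguously.
aos = np.array(
    [(0, 480, 60, 100), (480, 480, 64, 100)],
    dtype=[("time", "i4"), ("duration", "i4"), ("pitch", "i1"), ("velocity", "i1")],
)

# SoA: one contiguous array per attribute, which numpy batch operations prefer.
soa = {name: aos[name].copy() for name in aos.dtype.names}

soa["pitch"] += 12          # batch edit touches a single contiguous buffer
print(soa["pitch"])         # [72 76]
```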
Actually, …

```python
import os

import numba
import numpy as np
import tensormidi


@numba.jit
def durations(events):
    n = len(events)
    out = np.zeros(n, dtype=np.uint32)
    durs = np.zeros((16, 128), dtype=np.uint32)
    for i in range(n - 1, -1, -1):
        e = events[i]
        if e.type == 0x90:  # NOTE_ON
            out[i] = durs[e.channel, e.key]
        elif e.type == 0x80:  # NOTE_OFF
            durs[e.channel, e.key] = 0
        durs += e.dt
    return out


fname = os.path.expanduser('~/dev/data/midi/mahler.mid')
events = tensormidi.MergedTrack(fname).events
durs = durations(events)
durs = durs[events['type'] == 0x90]
events = events[events['type'] == 0x90]
print(events[:20], durs[:20])
```

In this case, …
For building scores note by note, is there any downside to using a Python list and then bulk-packing with np.array() at the end? Or just pre-allocating a buffer with the number of notes to generate and setting index …
Sorry for the typo, I meant to say that numba doesn't handle Array-of-Structures data well. As for your question about whether we can use Python lists to solve the problem of creating a score note by note, it's really a matter of whether the library is AoS- or SoA-based, and of course both options are possible. symusic inherits from miditoolkit, which is AoS-based, and since the data structures are all in C++, the conversion to SoA is very fast. If you choose an SoA base and use Python's native classes to implement the AoS side, you will face slow conversion speeds. These tradeoffs need to be weighed based on the usage scenario.
This is a schematic of pyvec's memory layout; essentially it adds a layer of pointers. @wrongbad Since most of the time we don't adjust the order of the notes we read, I'd say most cases keep good memory contiguity. Because ptrs is encapsulated inside pyvec and all pointers in ptrs are required to point to data in buffers, operations such as append force a copy of the object into buffers, which is released at the end together with the other objects in buffers. Such a data structure also allows us to perform fast slicing, i.e. copying the pointers of the corresponding range plus a shared_ptr. This is consistent with the complexity of Python list slicing, O(K), where K is the length of the target slice.
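To make the layout concrete, here is a small Python model of the idea (the real pyvec is a C++ container; the class, chunk size, and method names here are purely illustrative): element data lives in fixed-capacity, append-only chunks, while a separate pointer layer defines the order, hands out stable references, and supports O(K) slices that share ownership of the chunks.

```python
class PyVecModel:
    """Toy model of pyvec: append-only data chunks plus a pointer layer."""

    CHUNK = 4  # fixed chunk capacity, so stored elements never have to move

    def __init__(self):
        self.buffers = [[]]   # the data chunks (shared between slices)
        self.ptrs = []        # the pointer layer: defines order, hands out references

    def append(self, note):
        if len(self.buffers[-1]) == self.CHUNK:
            self.buffers.append([])          # grow by adding a chunk, never reallocate
        self.buffers[-1].append(dict(note))  # copy the object into the buffers
        self.ptrs.append(self.buffers[-1][-1])

    def __getitem__(self, key):
        if isinstance(key, slice):           # O(K): copy K pointers, share the buffers
            out = PyVecModel()
            out.buffers = self.buffers
            out.ptrs = self.ptrs[key]
            return out
        return self.ptrs[key]                # stable reference: the data never moves


notes = PyVecModel()
for pitch in (60, 62, 64, 65, 67):
    notes.append({"pitch": pitch})
ref = notes[1]
notes.append({"pitch": 69})   # growth does not invalidate ref
tail = notes[2:5]             # shares the chunks, copies only three pointers
ref["pitch"] += 12
print(notes[1]["pitch"], [n["pitch"] for n in tail.ptrs])  # 74 [64, 65, 67]
```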
I finally made GitHub Actions work on the … And the …
v0.5.0 has been released. It also fixes some other bugs.
Following #28, I'm considering adding an `inplace` argument for other functions. But I find that the `inplace` operation does bring some inconsistency between the interface of the vector bindings and Python lists. This might also lead to some wild pointers.

An Experiment

Here is an experiment I conducted:
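A minimal sketch of such an experiment; the `NoteTickList` construction, the `Note` argument order, and the exact calls here are assumptions used for illustration.

```python
from symusic import Note
from symusic.core import NoteTickList  # assumed import path for the bound vector

# Plain Python list: the held reference stays valid across inserts, because the
# list only rearranges pointers and never moves the Note objects themselves.
py_notes = [Note(i, 0, 0, 0) for i in range(10)]
note = py_notes[-1]
py_notes.insert(0, Note(-1, 0, 0, 0))
print(note)   # still the original last Note

# Vector binding: the held "reference" is backed by a pointer into the vector's
# contiguous storage, so the insert shifts (or reallocates) the real data under it.
vec_notes = NoteTickList([Note(i, 0, 0, 0) for i in range(10)])  # assumed constructor
note = vec_notes[-1]
vec_notes.insert(0, Note(-1, 0, 0, 0))
print(note)   # now shows the element that used to sit just before it, or dangles
```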
Theoretical Analysis

Essentially, a Python list holds an array of pointers (pointing to the real data), while a std::vector holds an array of the real data itself. Consider what happens in the experiment above.

For the Python list, we always expect `note` (a reference to the last element of that list) to be Note(9, 0, 0, 0) no matter how we insert or remove elements in the list, since such operations only move pointers around and never change the position of any real data.

For the vector bindings, e.g. `NoteTickList`, the real data is moved when the new note is inserted, while the pointer behind `note` remains the same. Normally (if we don't reach the vector's capacity), `note` will still point to the 10th element of `notes`, which is now `Note(8, 0, 0, 0)`. So this is an inconsistency between the bindings and the Python list.

And if we do reach the capacity, the vector will automatically copy all the data into a new, larger block of memory and free the original. Then `note` becomes a "wild pointer" into an invalid block of memory. If you are unlucky, this will cause your program to trigger a segmentation fault.

In conclusion, even if the inconsistency itself were acceptable, any operation that changes the position of the real data can leave behind a potential "wild pointer".
Possible Solution

@Natooz I'm quite concerned about this issue, and no perfect solution has been found so far.

- … `numpy.append` is slow and something that should be avoided.
- Replacing `vector<Note>` with `vector<shared_ptr<Note>>`. This makes C++ vectors behave like Python lists, but introduces a lot of memory fragmentation and overhead.

Both of these schemes introduce a significant overhead, and neither looks elegant or pythonic.

Of course, we could state in the documentation and the README that users should not hold a reference to an event for a long time, but copy it manually instead. But I think a lot of people won't read these instructions and will then run into the problems I've analyzed.

Improving performance always comes at a price, which can take various forms.