Any descendant of the state can be a binary_type #1194
Conversation
…ializing numpy arrays, in addition to shape, dtype etc
Oh, and respect to those responsible for thinking out the method for doing binary transfers; this is truly a good idea!
Why not just save the entire array + size + dtype into a binary format? For example, we could implement this: https://docs.scipy.org/doc/numpy/neps/npy-format.html#format-specification-version-1-0
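(For context, a minimal sketch of that suggestion, assuming plain numpy and no widget machinery at all: the .npy container already carries shape and dtype alongside the data.)

import io
import numpy as np

def array_to_npy_bytes(arr):
    # Write the array, including shape and dtype, into the .npy container format.
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    return buf.getvalue()

def npy_bytes_to_array(data):
    # Rebuild the array from the .npy bytes.
    return np.load(io.BytesIO(data), allow_pickle=False)

ar = np.arange(6, dtype=np.float64).reshape(2, 3)
assert np.array_equal(npy_bytes_to_array(array_to_npy_bytes(ar)), ar)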
That would be an option, but I will also need to serialize a dict of numpy arrays in the future, so even if what you propose covers some cases, this is more flexible. And I guess that would require a memcpy (although it's probably on the order of megabytes, so it shouldn't be an issue).
Regardless of the method used for the case of numpy arrays, I think that allowing binary fields will be necessary for other use cases.
Yes, and also other metadata: units, descriptions, etc. You don't want to have to put it all in binary form on each side of the wire; this would make life much easier.
I hesitate to start doing things that involve picking apart objects and replacing their innards. Are we replacing things arbitrarily deep in the nesting here?
We can probably address all possible use cases by allowing only top-level binary keys.
Yes, any level deep, as far as lists and dicts go. We could disallow that, but it doesn't change the code at all (except for the recursive part). It is not any different than
It would actually require hardcoded checks, so as it is now, it's the easiest to maintain, I think.
I think something that makes this complex (and it was already the case before this PR) is that first an object gets created/deserialized here: https://github.com/ipython/ipywidgets/blob/master/jupyter-js-widgets/src/manager-base.ts#L346. Because now, if say my numpy array 'x' (with the buffer value missing on the js side) gets deserialized, and then later the buffer gets filled in, the deserializer isn't getting called anymore.
This is essentially what we have now, one level up, at the object's state keys. Some keys can be binary, some can be JSON. What would we need to isolate this to a custom serializer (like the widgets serializer)? For example, what if we allowed multiple buffers to be associated with one key (to have possible zero-copy semantics)? (I'm not totally against this idea, but it certainly makes the serialization more complicated, so I want to explore some simpler ideas, or ways to move the complexity out of the base codebase.)
Actually, that is also what @SylvainCorlay thought, but it is more like 0 levels deep.
I agree that it makes it more complex, but I think the complexity starts at the two places where the deserialization is happening. If, say, we already had the buffers available at object construction, that could make this simpler: we pop buffers out of the dicts, replace list values with sentinels, and put them back in place directly before the object is constructed. Now the object is first constructed with anything containing a buffer missing (before, it was only level-0 keys holding buffer objects that were missing), and then an update adds in the missing buffer objects.
@jasongrout the current mechanism allows top-level properties of the object to be memory views, not top-level properties of a given trait attribute.
Right - any particular trait (i.e., a key/value in the object's state) can be binary.
Yes, I understand what we're trying to do. Another way to think about this proposal is that we are simply extending our base JSON encoder/decoder to handle binary objects anywhere in it. (The popping of binary buffers happens after the object's custom serializer has run, right?)

What if we allowed any particular trait to have arbitrary JSON metadata associated with it? We sort of have that now: if a trait is binary, some metadata is associated with it indicating which buffer the data is in. What if the serializer could return both a value and a metadata dict, and the deserializer could take both a data value and a metadata dict?

Also, the difference between what we have now (any trait can be encoded in binary) and a level-1 binary replacer is that the serialization/deserialization happens at the trait level. With the level-1 binary replacer, a single trait can (very easily) be encoded as both binary data and JSON data, whereas with what we have now, there needs to be some extra magic to associate the JSON data with the binary data in two different traits.
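(A rough sketch of that (value, metadata) idea, purely as illustration; the to_binary/from_binary names and the two-return-value convention are hypothetical, not an existing ipywidgets API.)

import numpy as np

def to_binary(value, widget=None):
    # Hypothetical serializer: hand back the raw buffer plus a JSON-safe metadata dict.
    arr = np.ascontiguousarray(value)
    return memoryview(arr), {'shape': list(arr.shape), 'dtype': str(arr.dtype)}

def from_binary(buffer, metadata, widget=None):
    # Hypothetical deserializer: rebuild the array from the buffer and its metadata.
    arr = np.frombuffer(buffer, dtype=metadata['dtype'])
    return arr.reshape(metadata['shape'])

ar = np.arange(12, dtype=np.float32).reshape(3, 4)
buf, meta = to_binary(ar)
assert np.array_equal(from_binary(buf, meta), ar)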
I like the idea, but I can think of an example which doesn't really fit into it. Take a masked array: it would have two binary blobs associated with it, and I wouldn't call the array with the mask the metadata. It would make more sense to write it like this:

return {'data': memoryview(data), 'mask': memoryview(mask), 'shape': data.shape}
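(A possible counterpart on the receiving side, just as a sketch; it assumes the dtype is known or sent along with the shape.)

import numpy as np

def deserialize_masked(state, dtype=np.float64):
    # Rebuild a masked array from the two buffers plus the shape metadata.
    data = np.frombuffer(state['data'], dtype=dtype).reshape(state['shape'])
    mask = np.frombuffer(state['mask'], dtype=bool).reshape(state['shape'])
    return np.ma.MaskedArray(data, mask=mask)

ar = np.ma.MaskedArray([1.0, 2.0, 3.0], mask=[False, True, False])
state = {'data': memoryview(np.ascontiguousarray(ar.data)),
         'mask': memoryview(np.ascontiguousarray(ar.mask)),
         'shape': ar.shape}
assert np.array_equal(deserialize_masked(state).mask, ar.mask)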
Yes, I'm agreeing with this PR's idea more and more, especially as I think about it as being just a way of augmenting our JSON format to take advantage of the binary buffers we have available.
Ok, good to know. The split
Can you show an example of what a deeper nesting path would look like (a state dictionary of ...)? Implementing the serialization/deserialization in js definitely should be done too. Thanks for tackling this! It's been a frustration we've had for a while (particularly with numpy arrays).
Ok, implemented; demo notebook here: https://gist.github.com/maartenbreddels/ac41cdcd22c8f84c9fc639dea7c90848. The state {'data': mv, 'shape': ar.shape, 'nested': {'list': [mv, 'test']}} gets passed along from x to y, just to demonstrate that it handles nested items ok; in the last cell you can see it is properly serialized and deserialized.
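(To make the nested case concrete, here is how I read the split for that state; the variable names are illustrative and the exact wire format may differ.)

import numpy as np

ar = np.arange(4, dtype=np.uint8)
mv = memoryview(ar)
state = {'data': mv, 'shape': ar.shape, 'nested': {'list': [mv, 'test']}}

# Buffers in dicts are popped, buffers in lists are replaced by None,
# and each removed buffer is recorded together with its path:
state_without_buffers = {'shape': ar.shape, 'nested': {'list': [None, 'test']}}
buffer_paths = [['data'], ['nested', 'list', 0]]
buffers = [mv, mv]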
I've answered your question in the docstring of _split_state_buffers. Furthermore, (in a previous commit) I've renamed buffer_keys to buffer_paths, since it better covers its usage. I also tried to keep the code simple (like a for loop in _handle_msg, instead of using reduce), to keep it more readable for those who are not used to functional style, but I don't know what your guidelines are on this.
FYI: Found some bugs in this PR, will commit new changes later.
@maartenbreddels are you still making changes to this PR? I did a rough review and this looks good. What issues did you encounter?
…ut_buffers for py/ts codes are equivalent, unittest for py side covers more cases also putting it back together, bugfix (covered by python test) for _remove_buffers, typo corrections
Ok, it's more consistent now. Both the python and javascript sides have a (_)remove_buffers/(_)put_buffers. remove_buffers doesn't modify existing dicts/objects (it clones when needed), but references unmodified ancestors. The js side has cyclic protection (needed to support Controller); the py side does not. _put_buffers does modify the state, but that should be fine since it is coming off the wire.
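(For readers following along, a simplified Python sketch of the semantics described above; this is not the actual implementation in widget.py, and it ignores the clone-versus-share details and the js-side cycle protection.)

def remove_buffers(state):
    # Return (state_without_buffers, buffer_paths, buffers); the input is not mutated.
    buffer_paths, buffers = [], []

    def separate(substate, path):
        if isinstance(substate, dict):
            out = {}
            for key, value in substate.items():
                if isinstance(value, (bytes, bytearray, memoryview)):
                    buffer_paths.append(path + [key])
                    buffers.append(value)  # dict entries holding buffers are simply left out
                else:
                    out[key] = separate(value, path + [key])
            return out
        if isinstance(substate, (list, tuple)):
            out = []
            for i, value in enumerate(substate):
                if isinstance(value, (bytes, bytearray, memoryview)):
                    buffer_paths.append(path + [i])
                    buffers.append(value)
                    out.append(None)  # list items holding buffers are replaced by None
                else:
                    out.append(separate(value, path + [i]))
            return out
        return substate

    return separate(state, []), buffer_paths, buffers

def put_buffers(state, buffer_paths, buffers):
    # Mutates state in place, writing each buffer back at its recorded path.
    for path, buffer in zip(buffer_paths, buffers):
        obj = state
        for key in path[:-1]:
            obj = obj[key]
        obj[path[-1]] = buffer
    return state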
I think I figured out why you needed the circular dependency check on the js side. Basically, we use JavaScript's toJSON feature (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/stringify#toJSON()_behavior) to serialize models, so the models are not serialized to "IPY_MODEL_..." strings until after remove_buffers runs, which means remove_buffers recurses into these complicated model objects. Looking into it...
Yes, I was surprised to see unpack_models, but not the inverse. But I think if toJSON can handle it, we should be able to handle it as well.
I'll push a few commits with these changes...
self.assertIn('shape', state['y'])
self.assertNotIn('ar', state['x'])
self.assertFalse(state['x'])  # should be empty
self.assertNotIn('data', state['x'])
state['y']
self.assertIsNot(state['y'], y)
self.assertIs(state['y']['shape'], y_shape)

# check that the buffers paths really points to the right buffer
point
self.assertIn('plain', state)
self.assertIn('shape', state['y'])
self.assertNotIn('ar', state['x'])
self.assertFalse(state['x'])  # should be empty
self.assertEqual(state['x'], {})
ipywidgets/widgets/widget.py
Outdated
obj = obj[key]
obj[buffer_path[-1]] = buffer

def _seperate_buffers(substate, path, buffer_paths, buffers):
separate
ipywidgets/widgets/widget.py
Outdated
    buffers.append(v)
    buffer_paths.append(path + [i])
elif isinstance(v, (dict, list, tuple)):
    vnew = _seperate_buffers(v, path + [i], buffer_paths, buffers)
separate
ipywidgets/widgets/widget.py
Outdated
    buffers.append(v)
    buffer_paths.append(path + [k])
elif isinstance(v, (dict, list, tuple)):
    vnew = _seperate_buffers(v, path + [k], buffer_paths, buffers)
separate
ipywidgets/widgets/widget.py
Outdated
[<memory at 0x107ffec48>, <memory at 0x107ffed08>])
"""
buffer_paths, buffers = [], []
state = _seperate_buffers(state, [], buffer_paths, buffers)
separate
ipywidgets/widgets/widget.py
Outdated
args = dict(target_name='jupyter.widget', data=state)
args = dict(target_name='jupyter.widget', data={'state': state, 'buffer_paths': buffer_paths})
dict(..., buffers=buffers)
(and delete the assignment below) to keep the buffer_paths with the buffers...
jupyter-js-widgets/src/utils.ts
Outdated
else {
    var new_value = remove(value, path.concat([key]));
    // only assigned when the value changes, we may serialize objects that don't support assignment
    if(new_value != value) {
!==
jupyter-js-widgets/src/utils.ts
Outdated
else {
    var new_value = remove(value, path.concat([i]));
    // only assigned when the value changes, we may serialize objects that don't support assignment
    if(new_value != value) {
!==
I agree; good that you spotted those bugs in the unit test.
This removes the need for the visited node list.
@maartenbreddels - can you try this out? I think it should work, even with Controller...
With toJSON, the Controller should work fine. Happy to see it get merged! Great work.
You did the great work here! Thanks!
Thanks for the careful and thorough review 👍
…ay, or python 3 bytes. We disregard the python 2 buffer type - it is obsolete, even on python 2. See jupyter-widgets#1194
Before, you could only do something like the first form in the sketch below when serializing a state. However, you'd like to be able to send the shape along as well; this patch makes the second form possible:
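(The inline code examples did not survive in this rendering of the description; the following is a rough reconstruction of the two forms being contrasted. The serializer signature and names are assumptions, not the exact original snippets.)

import numpy as np

# Before: a custom trait serializer could only hand back a single binary value.
def serialize_before(value, widget=None):
    return memoryview(np.ascontiguousarray(value))

# With this patch: the returned structure may mix buffers with plain JSON
# values (shape, dtype, ...), at any nesting depth.
def serialize_after(value, widget=None):
    arr = np.ascontiguousarray(value)
    return {'data': memoryview(arr), 'shape': arr.shape, 'dtype': str(arr.dtype)}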
To be able to do this, the state is separated into a state_without_buffers (the state on the js side) and a state_with_buffers, where every dict item that is a buffer is popped and every list item that is a buffer is replaced by None. If state_with_buffers were not treated separately, the state would first be deserialized completely, and the buffers would be 'patched in' later. By keeping state_with_buffers separate, all the buffers are first put back in place, and deserialization happens afterwards. Maybe it's not the cleanest solution, but my understanding of the code only goes this far.

This does not handle serialization (frontend -> backend); before continuing, I'd like to get 'approval' of this approach.