Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Any descendant of the state can be a binary_type #1194

Merged
merged 21 commits into from
Mar 21, 2017

Conversation

maartenbreddels
Copy link
Member

Before you could only do sth like this when serializing a state:

def array_to_binary(ar, obj=None):
    ar = ar.astype(np.float64)
    return memoryview(ar)

However, you'd like to be able to send shape along. This patch makes this possible, e.g.:

def array_to_binary(ar, obj=None):
	ar = ar.astype(np.float64)
	return {'data': memoryview(ar), 'shape': ar.shape}

To be able to do this, state is seperated in state_without_buffers (state on the js side), and a state_with_bufffers, where every dict items that is a buffer is pop'ed, and every list item replaced by None. If state_with_bufffers was not treated seperately, the state would first be deserialized completely, and later on the buffers were 'patched in. By keeping the state_with_bufffers` seperate, is will first put in all the buffers in place, and deserialize then afterwards.
Maybe it's not the cleanest solution, but my understanding of the code only goes this far.
This does not treat serialization (frontend -> backend), before continuing, I'd like to get 'approval' of this approach.

…ializing numpy arrays, in addition to shape, dtype etc
@maartenbreddels
Copy link
Member Author

Oh, and respect for the ones responsible for thinking out the method on doing binary transfers, this is truly a good idea!

@jasongrout
Copy link
Member

Why not just save the entire array+size+dtype into a binary format? For example, we could implement this: https://docs.scipy.org/doc/numpy/neps/npy-format.html#format-specification-version-1-0

@maartenbreddels
Copy link
Member Author

That would be an option, but I also need serializing a dict of numpy arrays in the future, so even if you could solve some options with what you propose, this is more flexible. And I guess that requires a memcpy (although it's probably in the order of megabytes, so it shouldn't be an issue).

@SylvainCorlay
Copy link
Member

Regardless of the method used for the case of numpy arrays, I think that allowing binary fields will be necessary for other use cases.

@maartenbreddels
Copy link
Member Author

Yes, also other metadata, units, descriptions etc, you don't want to put it in binary form on each side of the wire, this would make life much easier.

@jasongrout
Copy link
Member

I hesitate to start doing things that involve picking apart objects and replacing their innards. Are we replacing things arbitrarily deep in the nesting here?

@SylvainCorlay
Copy link
Member

We can probably address all possible use cases by allowing only top-level binary keys.

@maartenbreddels
Copy link
Member Author

Yes, any level deep, as far as lists and dicts go. We could not allow that, but that doesn't change the code at all (except for the recursive part). It is not any different than _widget_to_json.

@maartenbreddels
Copy link
Member Author

It will actually require hardcoded checks, so as it is now, it's the easiest to maintain I think.

@maartenbreddels
Copy link
Member Author

I think something that makes this complex (and it was already the case before this PR), is that first an object get created/deserialized here: https://github.com/ipython/ipywidgets/blob/master/jupyter-js-widgets/src/manager-base.ts#L346
And then afterwards, the buffer items gets pushed in here: https://github.com/ipython/ipywidgets/blob/master/jupyter-js-widgets/src/widget.ts#L175
and the rest gets deserialized.
Maybe if this can be done differently, like the deserialize part not being called at two places, the state wouldn't required to be split in two parts. Then buffer items could be popped, and list items set to null, and later filled in.

Because now, if say my numpy array 'x' (with the missing buffer value on the js side), get deserialized, and then later the buffer gets filled in, and then the deserialize isn't getting called anymore.

@jasongrout
Copy link
Member

We can probably address all possible use cases by allowing only top-level binary keys.

This is essentially what we have now, one level up at the object's state keys. Some keys can be binary, some can be JSON.

What would we need to isolate this to a custom serializer (like the widgets serializer)? For example, what if we allowed multiple buffers to be associated with one key (to have possible zero-copy semantics).

(I'm not totally against this idea, but it certainly makes the serialization more complicated, so I want to explore some simpler ideas, or ways to move the complexity out of the base codebase)

@maartenbreddels
Copy link
Member Author

Actually, that is also what @SylvainCorlay thought, but it is more like 0 levels deep.

state is a dict with all values, lets say we have a widget with a property 'x', then (igoring all the other meta data), state would be state = {'x': memoryview(ar)}, and that is allowed and get properly treated in _split_state_buffers and on the js side.
However, we'd like to allow at least state = {'x': {'data': memoryview(ar), 'shape': ar.shape, ....}}, so it's an extra level deeper we need to go. And well, 1 level or N levels, it's all the same code.

@maartenbreddels
Copy link
Member Author

I agree that it makes it more complex, but I think the complexity start at the two places where the deserialization is happening. If, say at object construction we would already have the buffers available, that could make this simpler, we just pop buffers out of the dicts, and replace lists values with sentinels, and put them back in place directly before the object is constructed. Now the object is first constructed without anything that has a buffer in it missing (before it was only level 0 keys missing that had buffer object), and then an update is adding in the missing buffer objects.
Ah.... now that I am writing this, I understand where the issue is:
Widget.open first constructs the object without buffers, and then sends the state again. So my guess is, when this issue is fixed, it can at least be a bit simpler (as in not having state and state_with_buffers, but just 1 state)

@SylvainCorlay
Copy link
Member

@jasongrout the current mechanism allows top-level properties of the object to by memory views, not top-level properties of a given trait attribute.

@jasongrout
Copy link
Member

@jasongrout the current mechanism allows top-level properties of the object to by memory views, not top-level properties of a given trait attribute.

Right - any particular trait (i.e., a key/value in the object's state) can be binary.

@jasongrout
Copy link
Member

However, we'd like to allow at least state = {'x': {'data': memoryview(ar), 'shape': ar.shape, ....}}, so it's an extra level deeper we need to go. And well, 1 level or N levels, it's all the same code.

Yes, I understand what we're trying to do. Another way to think about this proposal is that we are simply extending our base JSON encoder/decoder to handle binary objects anywhere in it. (The popping of binary buffers happens after the object's custom serializer has run, right?)

What if we allowed any particular trait to have arbitrary JSON metadata associated with it? We sort of have that now - if a trait is binary, some metadata is associated with it indicating which buffer the data is in. What if the serializer could return both a value and a metadata dict? And the deserializer could take both a data value and a metadata dict.

Also, the difference between what we have now (any trait can be encoded in binary) and a level-1 binary replacer is that the serialization/deserialization happens at the trait level, so with the level-1 binary replacer, a single trait can (very easily) be encoded as both binary data and JSON data, whereas with what we have now, there needs to be some extra magic going on to associate the JSON data with the binary data in two different traits.

@maartenbreddels
Copy link
Member Author

I like the idea, but I can think of an example which doesn't really fit into this idea. Say a masked array, it would have two binary blobs associated with it, and I wouldn't call the array with the mask the meta data. It would make more sense to write it like this.

   ...
   return {'data': memoryview(data), 'mask': memoryview(mask), 'shape' data.shape}

@jasongrout
Copy link
Member

Yes, I'm agreeing with this PR's idea more and more, especially as I think about it as being just a way of augmenting our JSON format to take advantage of the binary buffers we have available.

@maartenbreddels
Copy link
Member Author

Ok, good to know. The split state and state_with_buffers can be avoided if the issue in Widget.open is fixed, but it's not a lot of code. I'll comment the code on what can be changed if this issue is fixed. So I guess the next step would be to implement serialize (from js to py)?

@jasongrout
Copy link
Member

Can you show an example of what a deeper nesting path would look like? a state dictionary of {'x': {'arr': [ar, ar]}} would give [x, 'arr', 0], [x, 'arr', 1] as paths?

Implementing the serialization/deserialization in js definitely should be done too.

Thanks for tackling this! It's been a frustration we've had for a while (particularly with numpy arrays).

@maartenbreddels
Copy link
Member Author

Ok, implemented, demo notebook here: https://gist.github.com/maartenbreddels/ac41cdcd22c8f84c9fc639dea7c90848
it's a model that takes x, and doubles every items, put it to y. As an extra check i also included a 'nested' property:

{'data': mv, 'shape': ar.shape, 'nested': {'list': [mv, 'test']}}

that gets passed along from x to y, just to see demonstate it handles nested items ok, in the last cell you see it is property serialize and deserialized.

@maartenbreddels
Copy link
Member Author

I've answered your question in the docstring of _split_state_buffers. Futhermore, (in a previous commit) I've renamed buffer_keys by buffer_paths, since it better covers its usage.
Comments on the js/ts are welcome, since I'm not so comforable with it as I am with Python.

I also tried to keep the code simple (like a for loop in _handle_msg, instead of using reduce), to keep the code more readable for those who are not used to functional parts, but I don't know what you guidelines are on this.

@maartenbreddels
Copy link
Member Author

FYI: Found some bugs in this PR, will commit new changes later.

@SylvainCorlay
Copy link
Member

SylvainCorlay commented Mar 15, 2017

@maartenbreddels are you still making changes to this PR?

I did a rough review and this looks good. What issues did you encounter?

…ut_buffers for py/ts codes are equivalent, unittest for py side covers more cases also putting it back together, bugfix (covered by python test) for _remove_buffers, typo corrections
@maartenbreddels
Copy link
Member Author

Ok, it's more consistent now. Both python and javascript sides have a (_)remove_buffers/(_)put_buffers. remove buffers doesn't modify existing dicts/objects (clones when needed), but references non-modifies ancestors. The js side has cyclic protection (needed to support Controller), the py side does not. _put_buffers does modify the state, but that should be fine since it coming off the wire.
I must say, the python unittest gives me a high level of trust in the code now, the fact that the ts/js code is practically the same makes me feel comfortable with it as well.

@jasongrout
Copy link
Member

I think I figured out why you needed the circular dependency check on the js side. Basically, we use Javascript's toJSON feature (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/stringify#toJSON()_behavior) to serialize models, so the models are not serialized to "IPY_MODEL_..." strings until after the remove_buffers runs, so remove_buffers recurses into these complicated model objects. Looking into it...

@maartenbreddels
Copy link
Member Author

Yes, I was surprised to see unpack_models, but not the inverse. But I think if the toJSON can handle it, we should be able to handle it as well.

Copy link
Member

@jasongrout jasongrout left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll push a few commits with these changes...

self.assertIn('shape', state['y'])
self.assertNotIn('ar', state['x'])
self.assertFalse(state['x']) # should be empty
self.assertNotIn('data', state['x'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

state['y']

self.assertIsNot(state['y'], y)
self.assertIs(state['y']['shape'], y_shape)

# check that the buffers paths really points to the right buffer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

point

self.assertIn('plain', state)
self.assertIn('shape', state['y'])
self.assertNotIn('ar', state['x'])
self.assertFalse(state['x']) # should be empty
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.assertEqual(state['x'], {})

obj = obj[key]
obj[buffer_path[-1]] = buffer

def _seperate_buffers(substate, path, buffer_paths, buffers):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate

buffers.append(v)
buffer_paths.append(path + [i])
elif isinstance(v, (dict, list, tuple)):
vnew = _seperate_buffers(v, path + [i], buffer_paths, buffers)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate

buffers.append(v)
buffer_paths.append(path + [k])
elif isinstance(v, (dict, list, tuple)):
vnew = _seperate_buffers(v, path + [k], buffer_paths, buffers)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate

[<memory at 0x107ffec48>, <memory at 0x107ffed08>])
"""
buffer_paths, buffers = [], []
state = _seperate_buffers(state, [], buffer_paths, buffers)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate


args = dict(target_name='jupyter.widget', data=state)
args = dict(target_name='jupyter.widget', data={'state': state, 'buffer_paths': buffer_paths})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict(..., buffers=buffers) (and delete the assignment below) to keep the buffer_paths with the buffers...

else {
var new_value = remove(value, path.concat([key]));
// only assigned when the value changes, we may serialize objects that don't support assignment
if(new_value != value) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!==

else {
var new_value = remove(value, path.concat([i]));
// only assigned when the value changes, we may serialize objects that don't support assignment
if(new_value != value) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!==

@maartenbreddels
Copy link
Member Author

I agree, good you spotted those bugs in the unittest.

@jasongrout
Copy link
Member

@maartenbreddels - can you try this out? I think it should work, even with Controller...

@maartenbreddels
Copy link
Member Author

With the toJSON the Controller should work fine, happy to see it get merged! Great work.

@jasongrout
Copy link
Member

You did the great work here! Thanks!

@jasongrout jasongrout merged commit 17a3e45 into jupyter-widgets:master Mar 21, 2017
@maartenbreddels
Copy link
Member Author

Thanks for the careful and thorough review 👍

@jasongrout jasongrout modified the milestone: 7.0 Apr 18, 2017
@jasongrout jasongrout mentioned this pull request Apr 18, 2017
jasongrout added a commit to vidartf/ipywidgets that referenced this pull request Aug 9, 2017
…ay, or python 3 bytes.

We disregard the python 2 buffer type - it is obsolete, even on python 2.

See jupyter-widgets#1194
@github-actions github-actions bot added the resolved-locked Closed issues are locked after 30 days inactivity. Please open a new issue for related discussion. label Feb 15, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 15, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
resolved-locked Closed issues are locked after 30 days inactivity. Please open a new issue for related discussion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants