
accelerate plotly JSON encoder for numpy arrays without nans #2880

Merged (10 commits) on Nov 17, 2020

Conversation

@emmanuelle (Contributor) commented Nov 6, 2020:

This is a proof-of-concept PR. I still need to add a bunch of tests to be sure it behaves well for numpy arrays of different dtypes, but the idea is to skip the encoding-decoding-re-encoding process when JSON-serializing a numpy array with no nans or infs (handling those values is the reason this 3-step process exists in the first place).
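For context, the existing encoder's three-step process looks roughly like this (a simplified sketch of plotly.utils.PlotlyJSONEncoder, not the verbatim source):

```python
import json

class ThreeStepEncoderSketch(json.JSONEncoder):
    @staticmethod
    def coerce_to_strict(const):
        # json.loads reports the non-strict constants by name via
        # parse_constant; map them all to None (i.e. JSON null).
        return None if const in ("Infinity", "-Infinity", "NaN") else const

    def encode(self, o):
        # 1. encode once (fast C encoder, but may emit NaN/Infinity,
        #    which are not valid strict JSON)
        encoded_o = super().encode(o)
        # 2. decode, converting those constants to None
        new_o = json.loads(encoded_o, parse_constant=self.coerce_to_strict)
        # 3. encode again, now producing strict JSON with null
        return json.dumps(new_o)
```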

For large data arrays (for example a 1000x1000 image) there is a substantial performance gain. For example, with

```python
from time import time
import json

import numpy as np
import plotly.graph_objects as go
from plotly.utils import PlotlyJSONEncoder

# A 1000x1000 uint8 image serialized through the plotly encoder.
img = np.random.randint(255, size=(1000, 1000)).astype(np.uint8)
fig = go.Figure(go.Image(z=img))

t1 = time()
_ = json.dumps(fig, cls=PlotlyJSONEncoder)
t2 = time()
print(t2 - t1)
```

the proposed change results in a 2.5x performance gain on my machine (0.11 s instead of 0.27 s). The performance gain is particularly interesting for Dash apps.

@jonmmease what do you think of this?

@jonmmease (Contributor):

Is the basic idea here that you're skipping the recursive encoding of the elements of the array? If so, that makes a lot of sense.

@emmanuelle (Contributor, Author):

I understand (but maybe this is not correct) that there is an additional cycle of decoding and re-encoding here in order to take care of nans and infs. If that is the case, we can check whether there are no nans or infs and skip these steps (the "recursive encoding"), I think.

But maybe I don't completely understand the purpose of the recursive encoding. The CI failure in one of the builds is definitely linked to this PR; I need to investigate.

@jonmmease (Contributor):

ohh, wow, I didn't realize (remember?) that we did that double encoding. That would definitely slow things down. It would be great to get rid of it altogether, but it would take some investigation to work through why it's there in the first place.

What I meant by "recursive" is that JSON serialization handles lists of objects (like fig["data"]) by recursively encoding each element of the list. I'm not certain, but I think this recursive encoding happens for each element of numpy arrays as well, even though it wouldn't be strictly necessary.
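Illustrating that point with the standard library: default() converts an array to a list, and the encoder then walks that list element by element.

```python
import json
import numpy as np

class ArrayEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, np.ndarray):
            # The returned list is then re-encoded recursively,
            # element by element, by the standard json machinery.
            return o.tolist()
        return super().default(o)

print(json.dumps({"data": np.arange(3)}, cls=ArrayEncoder))  # {"data": [0, 1, 2]}
```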

@nicolaskruchten (Contributor):

It'd be nice to get rid of the double-encoding in general, but in the meantime, if there are special cases where we feel safe skipping it (e.g. arrays where every entry is a finite value), then that would be a short-term win for many cases :)

@nicolaskruchten (Contributor):

I'll note that by playing string games with numpy.savetxt I was able to get around a 10% speedup, but it was pretty gross :)
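For the curious, the "string games" were presumably along these lines (a hypothetical reconstruction, not the actual experiment): let numpy format the numbers as text and splice the result into a JSON array literal.

```python
import io
import numpy as np

def array_to_json_str(arr):
    # Hypothetical sketch: numpy formats the values, we add the brackets.
    # Only sensible for 1-D arrays of finite numbers.
    buf = io.StringIO()
    np.savetxt(buf, arr[None, :], fmt="%.17g", delimiter=", ", newline="")
    return "[" + buf.getvalue() + "]"

print(array_to_json_str(np.array([1.0, 2.5, 3.0])))  # [1, 2.5, 3]
```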

A review comment thread on this diff hunk (the surrounding context is reconstructed; the isinstance condition is an assumption):

```python
if isinstance(obj, numpy.ndarray) and numpy.issubdtype(
    obj.dtype, numpy.number
):
    if numpy.all(numpy.isfinite(obj)):
        self.unsafe = True
```

Not sure if I am missing something. We can test all the numpy arrays that are in the dict being encoded, but we wouldn't catch nans and infs in any sublists, right?

@emmanuelle (Contributor, Author) replied:

Yes, this will work only for numpy arrays, but for other kinds of objects the existing process (encoding - decoding - re-encoding) will still be executed. Or at least that is the intended behaviour :-), maybe I am the one missing something! I need to write tests anyway.

Reply, quoting the above:

> but for other kinds of objects the existing process (encoding - decoding - reencoding) will still be executed

I don't think this is the case, since as I understand this class, encode() is not called recursively, or is it?

@emmanuelle (Contributor, Author):

@jonmmease I wonder if you could help me understand the py3.7-orca CI failure, which I cannot reproduce locally in a conda environment with (I think) the same packages installed and py3.7. The problem occurs in the doctest of the create_dendrogram figure factory; there is both an ImportError and a ValueError. Does this make sense to you?

@almarklein commented Nov 9, 2020:

FWIW, just in case anyone was looking to go down this route: last Friday I took a brief stab at encoding nan and inf as null in the first place (instead of replacing such values after the fact with the extra decode-encode pass).

I figured we could overload the JSONEncoder.iterencode method. It defines a function floatstr(), which seems to be exactly the thing we'd want to change. Except ... it's only used when the Python-based low-level encoder is used. It looks like the C version brings its own version of floatstr() :(

Added: I did a quick test forcing the Python (non-C) version of the built-in encoder, but that makes it twice as slow.
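A minimal sketch of that experiment, relying on CPython's private json.encoder._make_iterencode helper (illustrative only, not the actual patch): forcing the pure-Python path makes our floatstr take effect, at roughly 2x the cost.

```python
import json
import math
from json import encoder as _enc

class NullFloatEncoder(json.JSONEncoder):
    def iterencode(self, o, _one_shot=False):
        def floatstr(f, _repr=float.__repr__):
            # Emit strict JSON null directly, instead of NaN/Infinity.
            if math.isnan(f) or math.isinf(f):
                return "null"
            return _repr(f)

        # Always build the pure-Python encoder so floatstr is used;
        # the C encoder brings its own and ignores this hook.
        markers = {} if self.check_circular else None
        _iterencode = _enc._make_iterencode(
            markers, self.default, _enc.encode_basestring_ascii, self.indent,
            floatstr, self.key_separator, self.item_separator,
            self.sort_keys, self.skipkeys, _one_shot,
        )
        return _iterencode(o, 0)

print(json.dumps([1.0, float("nan")], cls=NullFloatEncoder))  # [1.0, null]
```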

@pfbuxton commented Nov 13, 2020:

Hi @emmanuelle,

A while ago I profiled plotly/dash and found the numpy-to-JSON encoding to be one of the bottlenecks, and also found that by not doing the encoding twice the code can be sped up by approximately a factor of 2:
#1842 (comment)

The way I implemented it, I was able to handle numpy arrays containing both +/-inf and nan without the need to re-encode, but some care is needed, as numpy has quite a lot of different dtypes.

Hope this helps.
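A minimal sketch of that idea, based on the linked comment rather than the actual implementation: replace non-finite entries with None while converting to a list, so the encoder never sees NaN or Infinity.

```python
import numpy as np

def array_to_safe_list(arr):
    # Only float dtypes can hold non-finite values; other numeric dtypes
    # (and e.g. datetimes, strings) would need their own handling, which
    # is the "care needed" mentioned above.
    if np.issubdtype(arr.dtype, np.floating):
        out = arr.astype(object)
        out[~np.isfinite(arr)] = None
        return out.tolist()
    return arr.tolist()

print(array_to_safe_list(np.array([1.0, np.nan, np.inf])))  # [1.0, None, None]
```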

@emmanuelle (Contributor, Author):

@almarklein you were right, of course; I had not understood in which order the different methods were called.

My workaround with np.isfinite did not work, since some arguments can be something like x=[1, float("nan"), "platypus"] (this is actually a value used in the tests!), and it's not possible to call np.isfinite on this kind of beast.

Therefore I resorted to a brute-force approach: testing whether the first-pass JSON string contains the substrings "NaN" or "Infinity", and if not, skipping the additional decoding and re-encoding step. For large data arrays the cost of checking these two patterns is small compared to the encoding time (~2% of the time in a couple of examples I tested). This adds a slight cost for figures that do contain NaN or Infinity, either in their data or in text strings (for example titles, labels, etc.), but I think we can live with that, given the large speed-up (x2.5) for non-problematic cases.
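In code, the merged approach amounts to something like this (a simplified sketch of the guarded encode(), not the exact diff):

```python
import json

class GuardedEncoderSketch(json.JSONEncoder):
    def encode(self, o):
        # First pass: fast C encoder (may emit non-strict NaN/Infinity).
        encoded_o = super().encode(o)
        # The substring check can have false positives (e.g. "NaN" inside
        # a title string) but never false negatives, so skipping the
        # decode/re-encode pass here is always safe.
        if not ("Infinity" in encoded_o or "NaN" in encoded_o):
            return encoded_o
        new_o = json.loads(
            encoded_o,
            parse_constant=lambda const: None,  # NaN/Infinity -> null
        )
        return json.dumps(new_o)
```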

@emmanuelle changed the title from "[WIP] accelerate plotly JSON encoder for numpy arrays without nans" to "accelerate plotly JSON encoder for numpy arrays without nans" on Nov 15, 2020.
@jonmmease (Contributor) left a review:

One question, but otherwise lgtm

On this hunk:

```python
# this will raise errors in a normal-expected way
self.hasinfnans = False
```
@jonmmease (Contributor):

Are you using hasinfnans anywhere else now?

@emmanuelle (Contributor, Author) replied:

No, you were right :-). Thanks!

On this hunk:

```python
# but this is ok since the intention is to skip the decoding / reencoding
# step when it's completely safe
if not ("Infinity" in encoded_o or "NaN" in encoded_o):
    return encoded_o
```
A contributor commented:

I'm fine with this for now given your description of the performance characteristics.

Another contributor commented:

maybe check for NaN first, as it feels like that'd appear more frequently?

@almarklein commented Nov 16, 2020:

I like this solution! I find it actually quite elegant given the limitations of the json encoder that we have to work with, and I'm pretty sure that any solution that somehow checks the input beforehand would be slower than this proposed approach.

The only downside is that the speed of encoding depends on the presence of NaN and Inf in the object, which might be an unexpected effect. But these cases are hopefully rare; e.g. the plotly example about gaps in line charts uses None, which becomes null in JSON.

@nicolaskruchten (Contributor):

💃

@nicolaskruchten (Contributor):

@pfbuxton it looks like you went down this path before; I'm sorry no one replied to you more than a year ago :( Thanks for bringing your idea up again in this context!

I think that your idea would layer well onto this one, i.e. now that we only re-encode when we need to, modifying encode_as_list in the way you suggest would safely exploit this fast path.
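Concretely, the combination might look like this (a hypothetical sketch reusing the sanitizing idea from above inside the encode_as_list hook; not actual plotly code):

```python
import numpy as np

def encode_as_list(obj):
    # Hypothetical variant of PlotlyJSONEncoder.encode_as_list: swap
    # non-finite floats for None during the tolist() conversion, so the
    # first-pass JSON never contains NaN/Infinity and the substring guard
    # always takes the fast path.
    if isinstance(obj, np.ndarray) and np.issubdtype(obj.dtype, np.floating):
        out = obj.astype(object)
        out[~np.isfinite(obj)] = None
        return out.tolist()
    return obj.tolist()
```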
