
accelerate plotly JSON encoder for numpy arrays without nans #2880

Merged (10 commits) on Nov 17, 2020

Conversation

@emmanuelle (Contributor) commented Nov 6, 2020:

This is a proof-of-concept PR. I still need to add a bunch of tests to be sure it behaves well for numpy arrays of different dtypes, but the idea is to skip the encoding-decoding-re-encoding process when JSON-serializing a numpy array with no nans or infs (handling those values is the reason this 3-step process exists in the first place).
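For context, the existing encoder's three-step process looks roughly like this (a simplified sketch of plotly.utils.PlotlyJSONEncoder, not the verbatim source):

```python
import json

class ThreeStepEncoderSketch(json.JSONEncoder):
    @staticmethod
    def coerce_to_strict(const):
        # json.loads reports the non-strict constants by name via
        # parse_constant; map them all to None (i.e. JSON null).
        return None if const in ("Infinity", "-Infinity", "NaN") else const

    def encode(self, o):
        # 1. encode once (fast C encoder, but may emit NaN/Infinity,
        #    which are not valid strict JSON)
        encoded_o = super().encode(o)
        # 2. decode, converting those constants to None
        new_o = json.loads(encoded_o, parse_constant=self.coerce_to_strict)
        # 3. encode again, now producing strict JSON with null
        return json.dumps(new_o)
```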

For large data arrays (for example a 1000x1000 image) there is a substantial performance gain. For example, with

```python
from time import time
import json

import numpy as np
import plotly.graph_objects as go
from plotly.utils import PlotlyJSONEncoder

# A 1000x1000 uint8 image serialized through the plotly encoder.
img = np.random.randint(255, size=(1000, 1000)).astype(np.uint8)
fig = go.Figure(go.Image(z=img))

t1 = time()
_ = json.dumps(fig, cls=PlotlyJSONEncoder)
t2 = time()
print(t2 - t1)
```

the proposed change results in a 2.5x performance gain on my machine (0.11 s instead of 0.27 s). The performance gain is particularly interesting for Dash apps.

@jonmmease what do you think of this?

@jonmmease (Contributor):

Is the basic idea here that you're skipping the recursive encoding of the elements of the array? If so, that makes a lot of sense.

@emmanuelle (Contributor, Author):

I understand (but maybe this is not correct) that there is an additional cycle of decoding and re-encoding here in order to take care of nans and infs. If that is the case, we can check whether there are no nans or infs and skip these steps (the "recursive encoding"), I think.

But maybe I don't completely understand the purpose of the recursive encoding. The CI failure in one of the builds is definitely linked to this PR; I need to investigate.

@jonmmease (Contributor):

ohh, wow, I didn't realize (remember?) that we did that double encoding. That would definitely slow things down. It would be great to get rid of it altogether, but it would take some investigation to work through why it's there in the first place.

What I meant by "recursive" is that JSON serialization handles lists of objects (like fig["data"]) by recursively encoding each element of the list. I'm not certain, but I think this recursive encoding happens for each element of numpy arrays as well, even though it wouldn't be strictly necessary.
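Illustrating that point with the standard library: default() converts an array to a list, and the encoder then walks that list element by element.

```python
import json
import numpy as np

class ArrayEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, np.ndarray):
            # The returned list is then re-encoded recursively,
            # element by element, by the standard json machinery.
            return o.tolist()
        return super().default(o)

print(json.dumps({"data": np.arange(3)}, cls=ArrayEncoder))  # {"data": [0, 1, 2]}
```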

@nicolaskruchten (Contributor):

It'd be nice to get rid of the double-encoding in general, but in the meantime, if there are special cases where we feel safe skipping it (e.g. arrays where every entry is a finite value), then that would be a short-term win for many cases :)

@nicolaskruchten (Contributor):

I'll note that by playing string games with numpy.savetxt I was able to get around a 10% speedup, but it was pretty gross :)
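For the curious, the "string games" were presumably along these lines (a hypothetical reconstruction, not the actual experiment): let numpy format the numbers as text and splice the result into a JSON array literal.

```python
import io
import numpy as np

def array_to_json_str(arr):
    # Hypothetical sketch: numpy formats the values, we add the brackets.
    # Only sensible for 1-D arrays of finite numbers.
    buf = io.StringIO()
    np.savetxt(buf, arr[None, :], fmt="%.17g", delimiter=", ", newline="")
    return "[" + buf.getvalue() + "]"

print(array_to_json_str(np.array([1.0, 2.5, 3.0])))  # [1, 2.5, 3]
```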

A review comment thread on this diff hunk (the surrounding context is reconstructed; the isinstance condition is an assumption):

```python
if isinstance(obj, numpy.ndarray) and numpy.issubdtype(
    obj.dtype, numpy.number
):
    if numpy.all(numpy.isfinite(obj)):
        self.unsafe = True
```

Not sure if I am missing something. We can test all the numpy arrays that are in the dict being encoded, but we wouldn't catch nans and infs in any sublists, right?

@emmanuelle (Contributor, Author) replied:

Yes, this will work only for numpy arrays, but for other kinds of objects the existing process (encoding - decoding - re-encoding) will still be executed. Or at least that is the intended behaviour :-), maybe I am the one missing something! I need to write tests anyway.

Reply, quoting the above:

> but for other kinds of objects the existing process (encoding - decoding - reencoding) will still be executed

I don't think this is the case, since as I understand this class, encode() is not called recursively, or is it?

@emmanuelle (Contributor, Author):

@jonmmease I wonder if you could help me understand the py3.7-orca CI failure, which I cannot reproduce locally in a conda environment with (I think) the same packages installed and py3.7. The problem occurs in the doctest of the create_dendrogram figure factory; there is both an ImportError and a ValueError. Does this make sense to you?

@almarklein commented Nov 9, 2020:

FWIW, just in case anyone was looking to go down this route: last Friday I took a brief stab at encoding nan and inf as null in the first place (instead of replacing such values after the fact with the extra decode-encode pass).

I figured we could overload the JSONEncoder.iterencode method. It defines a function floatstr(), which seems to be exactly the thing we'd want to change. Except ... it's only used when the Python-based low-level encoder is used. It looks like the C version brings its own version of floatstr() :(

Added: I did a quick test forcing the Python (non-C) version of the built-in encoder, but that makes it twice as slow.
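A minimal sketch of that experiment, relying on CPython's private json.encoder._make_iterencode helper (illustrative only, not the actual patch): forcing the pure-Python path makes our floatstr take effect, at roughly 2x the cost.

```python
import json
import math
from json import encoder as _enc

class NullFloatEncoder(json.JSONEncoder):
    def iterencode(self, o, _one_shot=False):
        def floatstr(f, _repr=float.__repr__):
            # Emit strict JSON null directly, instead of NaN/Infinity.
            if math.isnan(f) or math.isinf(f):
                return "null"
            return _repr(f)

        # Always build the pure-Python encoder so floatstr is used;
        # the C encoder brings its own and ignores this hook.
        markers = {} if self.check_circular else None
        _iterencode = _enc._make_iterencode(
            markers, self.default, _enc.encode_basestring_ascii, self.indent,
            floatstr, self.key_separator, self.item_separator,
            self.sort_keys, self.skipkeys, _one_shot,
        )
        return _iterencode(o, 0)

print(json.dumps([1.0, float("nan")], cls=NullFloatEncoder))  # [1.0, null]
```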

@pfbuxton commented Nov 13, 2020:

Hi @emmanuelle,

A while ago I profiled plotly/dash and found the numpy-to-JSON encoding to be one of the bottlenecks, and also found that by not doing the encoding twice the code can be sped up by approximately a factor of 2:
#1842 (comment)

The way I implemented it, I was able to handle numpy arrays containing both +/-inf and nan without the need to re-encode, but some care is needed, as numpy has quite a lot of different dtypes.

Hope this helps.
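A minimal sketch of that idea, based on the linked comment rather than the actual implementation: replace non-finite entries with None while converting to a list, so the encoder never sees NaN or Infinity.

```python
import numpy as np

def array_to_safe_list(arr):
    # Only float dtypes can hold non-finite values; other numeric dtypes
    # (and e.g. datetimes, strings) would need their own handling, which
    # is the "care needed" mentioned above.
    if np.issubdtype(arr.dtype, np.floating):
        out = arr.astype(object)
        out[~np.isfinite(arr)] = None
        return out.tolist()
    return arr.tolist()

print(array_to_safe_list(np.array([1.0, np.nan, np.inf])))  # [1.0, None, None]
```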

@emmanuelle (Contributor, Author):

@almarklein you were right, of course; I had not understood in which order the different methods were called.

My workaround with np.isfinite did not work, since some arguments can be something like x=[1, float("nan"), "platypus"] (this is actually a value used in the tests!), and it's not possible to call np.isfinite on this kind of beast.

Therefore I resorted to a brute-force approach: testing whether the first-pass JSON string contains the substrings "NaN" or "Infinity", and if not, skipping the additional decoding and re-encoding step. For large data arrays the cost of checking these two patterns is small compared to the encoding time (~2% of the time in a couple of examples I tested). This adds a slight cost for figures that do contain NaN or Infinity, either in their data or in text strings (for example titles, labels, etc.), but I think we can live with that, given the large speed-up (x2.5) for non-problematic cases.
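In code, the merged approach amounts to something like this (a simplified sketch of the guarded encode(), not the exact diff):

```python
import json

class GuardedEncoderSketch(json.JSONEncoder):
    def encode(self, o):
        # First pass: fast C encoder (may emit non-strict NaN/Infinity).
        encoded_o = super().encode(o)
        # The substring check can have false positives (e.g. "NaN" inside
        # a title string) but never false negatives, so skipping the
        # decode/re-encode pass here is always safe.
        if not ("Infinity" in encoded_o or "NaN" in encoded_o):
            return encoded_o
        new_o = json.loads(
            encoded_o,
            parse_constant=lambda const: None,  # NaN/Infinity -> null
        )
        return json.dumps(new_o)
```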

@emmanuelle changed the title from "[WIP] accelerate plotly JSON encoder for numpy arrays without nans" to "accelerate plotly JSON encoder for numpy arrays without nans" on Nov 15, 2020.
@jonmmease (Contributor) left a review:

One question, but otherwise lgtm

On this hunk:

```python
# this will raise errors in a normal-expected way
self.hasinfnans = False
```
@jonmmease (Contributor):

Are you using hasinfnans anywhere else now?

@emmanuelle (Contributor, Author) replied:

No, you were right :-). Thanks!

On this hunk:

```python
# but this is ok since the intention is to skip the decoding / reencoding
# step when it's completely safe
if not ("Infinity" in encoded_o or "NaN" in encoded_o):
    return encoded_o
```
A contributor commented:

I'm fine with this for now given your description of the performance characteristics.

Another contributor commented:

maybe check for NaN first, as it feels like that'd appear more frequently?

@almarklein commented Nov 16, 2020:

I like this solution! I find it actually quite elegant given the limitations of the json encoder that we have to work with, and I'm pretty sure that any solution that somehow checks the input beforehand would be slower than this proposed approach.

The only downside is that the speed of encoding depends on the presence of NaN and Inf in the object, which might be an unexpected effect. But these cases are hopefully rare; e.g. the plotly example about gaps in line charts uses None, which becomes null in JSON.

@nicolaskruchten (Contributor):

💃

@nicolaskruchten (Contributor):

@pfbuxton it looks like you went down this path before; I'm sorry no one replied to you more than a year ago :( Thanks for bringing your idea up again in this context!

I think that your idea would layer well onto this one, i.e. now that we only re-encode when we need to, modifying encode_as_list in the way you suggest would safely exploit this fast path.
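Concretely, the combination might look like this (a hypothetical sketch reusing the sanitizing idea from above inside the encode_as_list hook; not actual plotly code):

```python
import numpy as np

def encode_as_list(obj):
    # Hypothetical variant of PlotlyJSONEncoder.encode_as_list: swap
    # non-finite floats for None during the tolist() conversion, so the
    # first-pass JSON never contains NaN/Infinity and the substring guard
    # always takes the fast path.
    if isinstance(obj, np.ndarray) and np.issubdtype(obj.dtype, np.floating):
        out = obj.astype(object)
        out[~np.isfinite(obj)] = None
        return out.tolist()
    return obj.tolist()
```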
