
JSON encoding refactor and orjson encoding #2955

Merged: 49 commits into master on May 27, 2021
Conversation

@jonmmease (Contributor) commented Dec 5, 2020

Overview

Initial implementation of the idea from #2944 of refactoring the JSON encoding pipeline and optionally performing JSON encoding with orjson.

orjson is impressively fast, and it includes built-in support for numpy arrays which is many times faster than the current approach of converting them to lists before encoding.

Also, orjson automatically converts all non-finite values to JSON null values, so we don't need workarounds like re-encoding as discussed in #2880.
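As a point of reference, the standard library encoder emits non-JSON literals for non-finite values, which is what motivated the workaround discussed in #2880. A small stdlib-only illustration (not plotly code; the null-substitution is just an approximation of orjson's behavior):

```python
import json

# The stdlib encoder emits JavaScript-style literals (NaN, Infinity)
# that are not valid JSON, so a post-processing step was needed
print(json.dumps({"x": [1.0, float("nan"), float("inf")]}))
# -> {"x": [1.0, NaN, Infinity]}

# orjson's behavior (non-finite -> null) can be approximated by
# substituting None before encoding
values = [1.0, float("nan"), float("inf")]
cleaned = [v if v == v and abs(v) != float("inf") else None for v in values]
print(json.dumps({"x": cleaned}))  # -> {"x": [1.0, null, null]}
```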

JSON config object

To configure the JSON encoding engine, this PR adds a plotly.io.json.config object that mirrors plotly.io.orca.config and plotly.io.kaleido.config. Currently the only option is default_engine, which can be set to "json" (the current encoder based on PlotlyJSONEncoder) or "orjson" (almost always much faster).

The to_json/write_json functions also provide an engine argument to override the default.
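The engine dispatch described above can be pictured roughly like the following sketch (a hypothetical simplification, not the actual plotly.py implementation; the helper names `_to_json_stdlib`, `_to_json_orjson`, and `to_json_sketch` are made up):

```python
import json

def _to_json_stdlib(obj):
    # "json" engine: the existing PlotlyJSONEncoder path (plain stdlib here)
    return json.dumps(obj)

def _to_json_orjson(obj):
    # "orjson" engine: only usable when the optional package is installed
    try:
        import orjson
    except ImportError:
        raise ValueError("The orjson engine requires the orjson package")
    return orjson.dumps(obj).decode("utf-8")

DEFAULT_ENGINE = "json"  # stands in for pio.json.config.default_engine

def to_json_sketch(obj, engine=None):
    # An explicit engine= argument overrides the configured default
    engine = engine or DEFAULT_ENGINE
    dispatch = {"json": _to_json_stdlib, "orjson": _to_json_orjson}
    if engine not in dispatch:
        raise ValueError("Invalid json engine: %s" % engine)
    return dispatch[engine](obj)

print(to_json_sketch({"a": [1, 2, 3]}))  # -> {"a": [1, 2, 3]}
```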

To try it out, install orjson with pip

$ pip install orjson

or conda

$ conda install -c conda-forge orjson

Then configure plotly to use it with

import plotly.io as pio
import numpy as np
import plotly.graph_objects as go

pio.json.config.default_engine = "orjson"

Quick timing example

Then time the encoding speed

N = 1000000
dtype = "float32"
x = np.random.randn(N).astype(dtype)
y = np.random.randn(N).astype(dtype)
size = np.random.rand(N).astype(dtype) * 10
opacity = np.random.rand(N).astype(dtype)
fig = go.Figure(data=[go.Scatter(x=x, y=y, marker_size=size, marker_opacity=opacity)])
%%timeit
res1 = pio.to_json(fig, engine="json")
2.06 s ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
res2 = pio.to_json(fig, engine="orjson")
169 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In this case of encoding a large figure with four one-million-element arrays, the orjson encoding is 12x faster on my machine!

Relationship to base64 encoding

This approach is fully compatible with the base64 encoding work in #2943. I think we should focus on this approach first because there are substantial performance gains to be had without changing the schema of the resulting JSON and (hopefully) without requiring changes in Plotly.js.

After this is merged, we can add base64 encoding on top of it for additional performance improvements.

Correctness testing

In addition to adding a new test suite, this PR has been tested against all of the documentation figures using the slightly modified instrumentation branch at #3012. That branch executes every JSON encoding request with both the json and orjson encoders and checks that the encoded strings are identical.

There are two documentation examples that fail this test: imshow and ml-tsne-umap-projections. Both fail due to differences in how the orjson encoder handles floating-point numbers with less than 64-bit precision (see next section).

Numpy floating point precision

The "json" encoder handles numpy arrays by first converting them to lists and then encoding the lists. All numpy floating-point types are converted to 64-bit Python float values. For numpy floating-point arrays with less than 64-bit precision, converting to 64 bits before encoding artificially inflates the precision of the values, but this is unavoidable with the list-conversion approach.

The orjson encoder accepts numpy arrays directly and outputs the appropriate number of decimal places for the precision of the input array.

So the encoded JSON values between the legacy json encoder and the orjson encoder will not agree when encoding floating point numpy arrays with less than 64-bit precision. This is the only known discrepancy.
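The discrepancy can be illustrated without numpy by using struct to round a value to float32 precision: once such a value is widened to a 64-bit Python float, the stdlib encoder must emit all of the extra digits. This is a stdlib-only sketch of the effect, not plotly code:

```python
import json
import struct

# Round 0.1 to the nearest 32-bit float, then widen it back to a Python
# float (64-bit) -- this mirrors what ndarray.tolist() does to float32 data
widened = struct.unpack("<f", struct.pack("<f", 0.1))[0]

print(json.dumps(0.1))      # -> 0.1
print(json.dumps(widened))  # -> 0.10000000149011612
```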

Benchmarking

To compare the performance of the orjson encoder against the legacy json, #3012 records the encoding time for both encoders and writes the results to a file. Here are plots of the relative timing results across all of the figures in the plotly.py documentation

Note that `length` here is the number of characters in the encoded JSON string.

[Two plots: relative timing results of the orjson vs. json encoders across all documentation figures]

The orjson encoder is almost always faster (up to 40x in one case). The handful of cases with equivalent or slower performance involve values that are not natively supported by orjson (e.g. pd.Timestamp or PIL.Image objects) and that don't contain sizable numpy arrays.

All of the cases where it's slower run in less than half a millisecond.

My conclusion is that defaulting to orjson when the package is installed is a safe default that will almost always improve performance.

TODO

  • The current tests pass with the new encoder, but before merging I want to add more tests specifically around JSON encoding (especially with dates and datetimes) and make sure that all of the encoding engines are tested.

@jonmmease jonmmease marked this pull request as draft December 5, 2020 20:24
@nicolaskruchten (Contributor)

Cool :)

Why not just swap this out right now and make it a hard dependency? Is there a specific risk or subdependency we don't want to pull in or something?

@chriddyp (Member) commented Dec 7, 2020

Looking at https://github.com/ijl/orjson#why-cant-i-install-it-from-pypi, we should verify that we can install this on <=4.0.1 versions of DE with older pip

@jonmmease (Contributor, Author)

> Why not just swap this out right now and make it a hard dependency? Is there a specific risk or subdependency we don't want to pull in or something?

I haven't looked into it deeply, but it has native components and isn't available in the main anaconda channel yet (it is on conda-forge though). If we did make it a hard dependency for 5.0, then it probably would get added to the main anaconda channel as well 🙂

So, adding it as a hard dependency carries the risk of breakage for some folks.

> we should verify that we can install this on <=4.0.1 versions of DE with older pip

Yeah, it'll have the same problem we've hit elsewhere with older versions of pip.

@jonmmease jonmmease marked this pull request as ready for review December 31, 2020 17:49
@jonmmease (Contributor, Author)

Tests written and ready for review

    - pio.json.to_plotly_json -> pio.json.to_json_plotly
    - pio.json.from_plotly_json -> pio.json.from_json_plotly
@nicolaskruchten (Contributor)

I'm not sure I understand why this change and what the implications are...?

> Renamed pio.json.to_plotly_json to pio.json.to_json_plotly because we, unfortunately, use fig.to_plotly_json as a method that converts a graph_object (or Dash component) to a dictionary.

Is this to prevent the legacy encoder from calling the new encoder while recursing through a structure that contains a figure?

@jonmmease (Contributor, Author)

> Renamed pio.json.to_plotly_json to pio.json.to_json_plotly because we, unfortunately, use fig.to_plotly_json as a method that converts a graph_object (or Dash component) to a dictionary.

This naming change was all internal to this PR. Before this PR, there was only to_json which both did figure validation and performed JSON encoding.

I extracted the JSON encoding into to_json_plotly, which is now called by to_json. to_json_plotly doesn't perform figure validation, and is what Dash will use for JSON encoding.
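The layering described here can be sketched as follows (purely illustrative; the real to_json/to_json_plotly live in plotly.io._json, and the validation stand-in below is made up):

```python
import json

def to_json_plotly_sketch(obj):
    # Encoding only: no figure validation, so Dash can call it on plain dicts
    return json.dumps(obj)

def to_json_checked(fig_dict):
    # Figure validation first (a trivial stand-in check), then delegate
    # the actual encoding to the shared helper
    if "data" not in fig_dict:
        raise ValueError("Invalid figure: missing 'data'")
    return to_json_plotly_sketch(fig_dict)

print(to_json_checked({"data": [], "layout": {}}))
```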

@jonmmease (Contributor, Author)

I think the plotlyjs_dev_build failure is due to the role removal on plotly.js master

since it's sometimes slower, and never faster, than current encoder.

Rename "legacy" to "json".
@jonmmease (Contributor, Author)

Updated the overview comment with a description of the correctness testing and the benchmarking results obtained using the #3012 branch on the plotly.py documentation examples.

@jmsmdy (Contributor) commented Mar 16, 2021

> Why not just swap this out right now and make it a hard dependency? Is there a specific risk or subdependency we don't want to pull in or something?

Just wanted to add my opinion against making orjson a hard dependency. orjson is written in Rust (apparently requiring a fairly recent Rust version to build), which is a barrier to running plotly.py on platforms that Rust has trouble compiling to.

  • There were recently issues with orjson on the new apple silicon (see: Apple Silicon Binaries ijl/orjson#155). These have been resolved, but these issues would have held up people trying to use plotly on the new macs had orjson been a hard dependency.

  • orjson does not compile to WebAssembly. Once the "retrying" dependency is replaced by "tenacity" (when this pull request is merged: Replaced 'retrying' dependency with 'tenacity' in plotly package #2911), all (hard) dependencies of plotly.py will be available as universal wheels on PyPI, which enables plotly.py to run on Pyodide (a version of CPython compiled to run in WebAssembly). This would break if orjson were made a hard dependency.

This is not an objection to this pull request. In fact, the preferred solution is to get orjson working in Pyodide (since there is already a need for fast serialization to communicate between JS and Python).

@jonmmease (Contributor, Author)

Thanks for the feedback and perspective @jmsmdy. That all makes a lot of sense. This PR did end up making orjson optional, and that will be the case going forward.

@nicolaskruchten (Contributor)

Just to refresh my memory: with this PR in non-orjson mode, do we still do the trick from #2880 that got us a nice performance boost in some cases?

@nicolaskruchten (Contributor)

Also note to self to re-think-about the comment in #2880 (comment)

# Conflicts:
#	packages/python/plotly/plotly/io/_json.py
#	packages/python/plotly/tox.ini
@nicolaskruchten nicolaskruchten merged commit 5301dcb into master May 27, 2021
@mherrmann3

Regarding numpy floating-point precision and the fact that PlotlyJSONEncoder always casts values to float64 by using tolist()...

This had always bugged me, as it resulted in much larger exports (i.e. html / ipynb file sizes) than necessary (when float16 or float32 is sufficient) and affected not only coordinate data, but also marker sizes, meta info, etc.

Just in case the plotly.py devs or others are interested: I had found a way to avoid this number inflation by modifying (& monkey patching) the encode_as_list method:

@staticmethod
def encode_as_list_patch(obj):
    """Attempt to use `tolist` method to convert to normal Python list."""
    if hasattr(obj, "tolist"):
        numpy = get_module("numpy")
        try:
            # Format reduced-precision floats at their native precision
            # instead of letting tolist() widen them to float64
            if (
                isinstance(obj, numpy.ndarray)
                and obj.dtype in (numpy.float16, numpy.float32)
                and obj.flags.contiguous
            ):
                return [float("%s" % x) for x in obj]
        except AttributeError:
            raise NotEncodable

        return obj.tolist()
    else:
        raise NotEncodable

It's about 30-50x slower than .tolist(), but - being on the order of a few μs - still much faster than the JSON encoding, with the benefit of ~3x smaller exports.

I always wanted to report this, and this PR revived the topic. Could this be relevant for a new issue (especially since orjson will not become the default)?

FYI: for reference, a quick search revealed that a patch of encode_as_list was already suggested before: #1842 (comment), in the context of treating inf & NaN, which got brought up again in #2880 (comment).

@nicolaskruchten (Contributor)

@mherrmann3 thanks! I've broken this out into a separate issue: #3232

@dhirschfeld

I'm super keen to give this a go as I've got medium-sized data and am having performance issues :(

It sounds like orjson will be used automatically if installed:

[screenshot of the code path showing orjson being used automatically when installed]

Assuming that gives a good speedup, is there a way to configure plotly to automatically use a different json parser - e.g. pysimdjson?

@jonmmease (Contributor, Author)

> is there a way to configure plotly to automatically use a different json parser

Not right now; the logic around the JSON library needs to be customized a bit (e.g. for handling datetime formatting). That said, the refactoring that went into the orjson support and the switchable JSON engines would make it much easier to add support for additional JSON libraries in the future.
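A pluggable-engine design of the kind hinted at here might look like a small registry (purely illustrative; plotly.py does not expose such an API, and `register_engine`/`encode_with` are invented names):

```python
import json

_ENGINES = {}

def register_engine(name, encode):
    # encode: a callable taking a plotly dict and returning a JSON string
    _ENGINES[name] = encode

def encode_with(name, obj):
    if name not in _ENGINES:
        raise ValueError("Unknown engine: %s" % name)
    return _ENGINES[name](obj)

# Built-in stdlib engine; a third-party library could register its own
register_engine("json", lambda obj: json.dumps(obj, separators=(",", ":")))

print(encode_with("json", {"a": [1, 2]}))  # -> {"a":[1,2]}
```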
