Orca integration for static image export #1120

Merged
merged 82 commits into from
Aug 25, 2018

@jonmmease jonmmease commented Aug 19, 2018

Overview

This PR integrates orca into plotly.py to support exporting figures as static images 🎉

cc @chriddyp @jackparmer @nicolaskruchten @cldougl @Kully @etpinard @malmaud

Even if you don't have time to look at the code or test out the branch, I'd appreciate any feedback on the architecture and API notes.

Here goes...

Background

See #1105 for background information and discussion of related work.

Architecture

In this PR I went with method (3) from the issue above, "Use orca in server mode".

The first time an image export operation is performed, an orca server process is launched in the background (as a non-blocking subprocess). Image export requests are posted to the server on a local port.

By default, the server process runs until the main process exits. But there is also a timeout configuration option (more on configuration options below) that allows a user to specify that the server should be automatically shut down after a certain period of inactivity.

Regardless of whether a timeout is set, the server may also be shut down and started manually.

Implementation Notes

Starting the server

The server subprocess is launched using subprocess.Popen to create a long-running background process. The server is launched in --graph-only mode to be as lean as possible (this avoids running processes for exporting thumbnails, dashboards, etc.)
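
As a rough sketch of the launch step (not the code in this PR), the snippet below starts a long-running child with subprocess.Popen and returns immediately. The launch_background helper is a hypothetical name, and a sleeping Python process stands in for orca so the example runs without orca installed.

```python
import subprocess
import sys


def launch_background(command):
    """Start `command` as a child process without waiting for it to finish."""
    # Popen returns as soon as the child is spawned. Output is discarded
    # here so an unread pipe can never fill up and stall the child.
    return subprocess.Popen(command,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)


# Stand-in for the orca server: a Python child that sleeps for a while.
# With orca installed, the command would instead be an orca invocation
# such as ['orca', 'serve', '-p', '59997', '--graph-only'].
stand_in = [sys.executable, "-c", "import time; time.sleep(30)"]
proc = launch_background(stand_in)

still_running = proc.poll() is None  # None means the child has not exited

proc.terminate()
proc.wait()
```

Here terminate is enough because the child is a single process; the wrapper-script case described under "Shutting down the server" needs more care.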

Communicating with the Server

Communication with the server is done using requests.post. The request function is wrapped in the @retrying.retry decorator to handle the automatic retrying of failed requests. The retrying logic is very convenient, as it allows an image request to be made right after the server process is launched and the request will simply block until the server responds.
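
To illustrate why retrying makes the startup race harmless, here is a standard-library sketch. It deliberately swaps requests.post and @retrying.retry for urllib and a hand-written loop so it runs without third-party packages; post_with_retry and EchoHandler are hypothetical names, and the echo server is only a toy stand-in for orca that starts half a second after the first request is attempted.

```python
import json
import socket
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Talk to localhost directly, ignoring any system proxy settings.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def post_with_retry(url, payload, max_wait=10.0, pause=0.1):
    """POST `payload` as JSON, retrying until the server accepts the connection."""
    data = json.dumps(payload).encode("utf-8")
    deadline = time.monotonic() + max_wait
    while True:
        request = urllib.request.Request(
            url, data=data, headers={"Content-Type": "application/json"})
        try:
            with _opener.open(request, timeout=5) as response:
                return response.read()
        except urllib.error.HTTPError:
            raise  # the server answered with an error; retrying will not help
        except (urllib.error.URLError, ConnectionError):
            # Typically "connection refused" while the server is still starting.
            if time.monotonic() >= deadline:
                raise
            time.sleep(pause)


class EchoHandler(BaseHTTPRequestHandler):
    """Toy stand-in for the image server: echoes the request body."""

    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


# Pick a free port, but do not listen on it yet.
with socket.socket() as probe:
    probe.bind(("127.0.0.1", 0))
    port = probe.getsockname()[1]

servers = []


def start_server():
    server = HTTPServer(("127.0.0.1", port), EchoHandler)
    servers.append(server)
    server.serve_forever()


starter = threading.Timer(0.5, start_server)  # server comes up late
starter.daemon = True
starter.start()

# Issued before the server exists; blocks and retries until it responds.
reply = post_with_retry("http://127.0.0.1:%d/" % port, {"figure": {}})

servers[0].shutdown()
servers[0].server_close()
```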

Shutting down the server

It's possible to terminate the particular process created using subprocess.Popen with the Popen.terminate method. Unfortunately, this isn't always enough to actually shut down the server. The trouble is that typical orca entry points (orca.sh, orca.js, orca.cmd) are simply wrapper scripts that call the main orca/electron executable. In my testing on OS X, Linux, and Windows I found that Popen.terminate generally only terminates the shell/wrapper process, leaving the orca server running. This is definitely not acceptable, as a user could end up with a new orca process each time they restart their kernel and export images.

I initially tried some workarounds involving process groups and sending different signals, but the result ended up being platform dependent and still not fully reliable. I settled on introducing the psutil library as a new optional dependency. psutil provides a platform-agnostic API for iterating over the children of a process and then terminating them. In my testing, this psutil approach has been fully reliable in terminating the server processes across platforms. Since our CI test suite is Linux-only at this point, I'm especially glad to not need to introduce any OS X/Windows-specific process management logic.
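
A minimal sketch of the psutil approach, assuming psutil is installed: collect the descendants of the launched process, terminate all of them, and force-kill anything that survives. The terminate_process_tree name and the timeout default are illustrative and are not taken from the PR.

```python
try:
    import psutil  # optional dependency, checked when first needed
except ImportError:
    psutil = None


def terminate_process_tree(pid, timeout=3):
    """Terminate the process `pid` together with all of its descendants."""
    if psutil is None:
        raise ImportError("psutil is required to shut down the process tree")
    try:
        parent = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return  # already gone
    # Collect the children first: once the wrapper process dies, its
    # children are re-parented and can no longer be found through it.
    procs = parent.children(recursive=True) + [parent]
    for proc in procs:
        try:
            proc.terminate()
        except psutil.NoSuchProcess:
            pass
    # Give everything a moment to exit, then force-kill any stragglers.
    _, alive = psutil.wait_procs(procs, timeout=timeout)
    for proc in alive:
        try:
            proc.kill()
        except psutil.NoSuchProcess:
            pass
```

Given a Popen handle proc for the wrapper script, calling terminate_process_tree(proc.pid) would replace a bare proc.terminate().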

Shutdown server after timeout

If a timeout is configured when the server process is launched, a threading.Timer object is created to call the shutdown function after timeout seconds.

Each time an image render request is made, any existing Timer object is canceled, and a new Timer is created.

Importantly, each timer thread has the daemon property set to True. This prevents the main process from waiting for the timer to complete before exiting.
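
The cancel-and-recreate pattern can be sketched as follows. InactivityTimer is a hypothetical helper, not a class from this PR, and an Event stands in for the shutdown function.

```python
import threading
import time


class InactivityTimer:
    """Call `callback` once `timeout` seconds pass without a reset."""

    def __init__(self, timeout, callback):
        self._timeout = timeout
        self._callback = callback
        self._timer = None
        self._lock = threading.Lock()

    def reset(self):
        """Cancel any pending timer and start a fresh one."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._timeout, self._callback)
            # Daemon threads do not keep the interpreter alive at exit.
            self._timer.daemon = True
            self._timer.start()


fired = threading.Event()          # stands in for the shutdown function
timer = InactivityTimer(1.0, fired.set)

timer.reset()                      # first "request" at t = 0
time.sleep(0.6)
timer.reset()                      # second "request" moves the deadline to t = 1.6

fired_early = fired.wait(0.6)      # t = 1.2: the original timer was canceled
fired_late = fired.wait(3.0)       # fires once the new deadline passes
```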

Shutdown on exit

The shutdown function is annotated with the @atexit.register decorator to ensure that the server is properly shut down when the main Python process exits.

API Design

This PR introduces the beginning of the plotly.io module.

Image export

Two image export functions are introduced. These functions follow the export conventions proposed in #1098.

plotly.io.write_image(fig, file, format=None, scale=None, width=None, height=None)

This function works very much like the matplotlib savefig function. fig is a Figure or compatible dict. file may be a string referring to a local filesystem path, or a file-like object to be written to. If file is a string, then the file extension is used to infer the image format if possible. The format argument may be used to explicitly specify the format, and it is required if file is not a string with a common extension. Supported formats are png, jpeg (jpg extension supported as well), webp, svg, pdf, and eps (with poppler installed). scale, width, and height work as you would expect.
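
The extension-based inference could look roughly like this. It is an illustrative sketch, not the PR's implementation: infer_format and the lookup table are hypothetical, built only from the formats listed above.

```python
import os

# Extension or format name -> canonical format, from the list above.
_FORMATS = {
    "png": "png", "jpg": "jpeg", "jpeg": "jpeg", "webp": "webp",
    "svg": "svg", "pdf": "pdf", "eps": "eps",
}


def infer_format(file, format=None):
    """Return the image format for `file`, honoring an explicit `format`."""
    if format is None and isinstance(file, str):
        # Fall back to the file extension, without the leading dot.
        format = os.path.splitext(file)[1].lstrip(".")
    if not format:
        raise ValueError("Cannot infer the image format; pass format= explicitly")
    key = format.lower()
    if key not in _FORMATS:
        raise ValueError("Unsupported image format: %r" % format)
    return _FORMATS[key]
```

With this sketch, "figure.JPG" resolves to jpeg, while a file-like object with no format argument raises a ValueError.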

plotly.io.to_image(fig, format=None, width=None, height=None, scale=None)

This function may be used to return the binary representation of the image directly (no temp files or messing with io.BytesIO!). This can be used in conjunction with IPython.display.Image to display static images directly in the notebook or QtConsole.
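
A usage sketch of both functions, following the signatures given above. It assumes a plotly.py build that includes this PR, a working orca installation, and IPython; the imports live inside the function only so that defining it has no third-party requirements, and the file name and sizes are arbitrary.

```python
def export_examples():
    """Usage sketch; needs plotly.py with this PR, orca, and IPython."""
    import plotly.graph_objs as go
    import plotly.io as pio
    from IPython.display import Image

    fig = go.Figure(data=[go.Scatter(x=[1, 2, 3], y=[2, 1, 3])])

    # The format is inferred from the .png extension.
    pio.write_image(fig, "scatter.png", width=600, height=400, scale=2)

    # Raw image bytes, with no temporary file involved.
    png_bytes = pio.to_image(fig, format="png", width=600, height=400)

    # In a notebook or QtConsole, displaying this object shows the image.
    return Image(png_bytes)
```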

Orca management

If users install orca using conda or npm, they should be able to use the above methods immediately, without additional configuration. But for more technical users, and for general users if things go wrong, there is a new plotly.io.orca module.

Manual server management

The server may be manually started using plotly.io.orca.ensure_orca_server(), and it may be manually shut down using plotly.io.orca.shutdown_orca_server()

Orca config

plotly.io.orca.config is an orca configuration/settings object. Here are the properties that can be configured:

orca configuration
------------------
    port: None
    executable: orca
    timeout: None
    default_width: None
    default_height: None
    default_scale: 1
    default_format: png
    mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js
    topojson: None
    mapbox_access_token: None

constants
---------
    plotlyjs: /path/to/plotly/package_data/plotly.min.js 
    config_file: /Users/username/.plotly/.orca

If automatic port selection is not desirable, an explicit port value may be set here. If an executable named orca cannot be found on the path, then the executable property may be set to the absolute path to an orca executable. The timeout property is also set here. The default_width, default_height, default_scale, and default_format properties control the values used by to_image when not otherwise specified.
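
How the port is chosen automatically is not spelled out here. One common technique, shown purely as an assumption about how it could work, is to bind to port 0 and let the operating system pick a free port; find_open_port is a hypothetical name.

```python
import socket


def find_open_port():
    """Ask the OS for a currently unused TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 lets the OS choose
        return s.getsockname()[1]


port = find_open_port()
```

There is a small window between closing this socket and the server binding the port in which another process could claim it, which is one reason an explicit port setting is useful.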

I took the liberty of supplying a default mathjax CDN so that latex image export just works as long as the user is online. For offline use, the mathjax property can be set to the path to a local mathjax installation. When topojson is None the plot.ly CDN will be used, but a local path can be supplied if working offline. Finally, the mapbox_access_token property can store a mapbox token that will automatically be applied when exporting mapbox traces.

Properties can be set using property assignment

plotly.io.orca.config.mapbox_access_token = 'xyz...'

or using the update method

plotly.io.orca.config.update(mapbox_access_token='xyz...')

The constants are not settable and are listed for informational purposes.
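
A toy version of such a settings object is sketched below to show how property assignment, update, and read-only constants can coexist. OrcaConfigSketch is a hypothetical class covering just two properties; the validation rules are assumptions, not the PR's.

```python
import os


class OrcaConfigSketch:
    """Toy settings object: validated properties plus a read-only constant."""

    def __init__(self):
        self._props = {"timeout": None, "mapbox_access_token": None}

    @property
    def timeout(self):
        return self._props["timeout"]

    @timeout.setter
    def timeout(self, value):
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if value is not None and (not is_number or value < 0):
            raise ValueError("timeout must be None or a non-negative number")
        self._props["timeout"] = value

    @property
    def mapbox_access_token(self):
        return self._props["mapbox_access_token"]

    @mapbox_access_token.setter
    def mapbox_access_token(self, value):
        if value is not None and not isinstance(value, str):
            raise ValueError("mapbox_access_token must be None or a string")
        self._props["mapbox_access_token"] = value

    @property
    def config_file(self):
        # Constant: no setter is defined, so assignment raises AttributeError.
        return os.path.join(os.path.expanduser("~"), ".plotly", ".orca")

    def update(self, **kwargs):
        """Set several properties at once, going through the same setters."""
        for name, value in kwargs.items():
            if name not in self._props:
                raise ValueError("Unknown config property: %r" % name)
            setattr(self, name, value)


config = OrcaConfigSketch()
config.mapbox_access_token = "xyz..."   # property assignment
config.update(timeout=30)               # update method
```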

Saving configuration properties

The config values may optionally be saved to the ~/.plotly settings directory as ~/.plotly/.orca using the plotly.io.orca.config.save() method. If present, these settings are automatically loaded on import.

Orca status

The current status of the orca server process can be displayed using the plotly.io.orca.status object.

At initial startup the state will be unvalidated

orca status
-----------
    executable: None
    version: None
    port: None
    pid: None
    state: unvalidated
    command: None

After a valid orca executable has been found, and the server is not yet running, the state will be validated

orca status
-----------
    executable: /anaconda3/envs/plotly_dev/bin/orca
    version: 1.1.0
    port: None
    pid: None
    state: validated
    command: None

Here the user can see which orca executable was found on the path, and what version it is.

When the server process is currently running, the state will be running

orca status
-----------
    executable: /anaconda3/envs/plotly_dev/bin/orca
    version: 1.1.0
    port: 59997
    pid: 83079
    state: running
    command: ['orca', 'serve', '-p', '59997', '--graph-only', '--plotly', '/path/to/plotly/package_data/plotly.min.js', '--mathjax', 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js', '--mapbox-access-token', 'pk...']

Here the user can see the details of the running process (port, pid) and the exact command line arguments that were passed to the orca server at startup.

Error messages

There are a lot of things that can potentially go wrong here, so I've tried to make the error messages as helpful as possible. For example here's the error that is raised if the orca executable cannot be found on the path:

The orca executable is required in order to export figures as static images,
but it could not be found on the system path.

Searched for executable 'orca2' on the following path:
    /anaconda3/envs/plotly_dev/bin
    /usr/local/bin
    /usr/bin
    /bin
    /usr/sbin
    /sbin
    /Applications/VMware Fusion.app/Contents/Public
    /Library/TeX/texbin

If you haven't installed orca yet, you can do so using conda as follows:

    $ conda install -c plotly plotly-orca

After installation is complete, no further configuration should be needed. 
For other approaches to installing orca, see the orca project README at
https://github.com/plotly/orca.

If you have installed orca, then for some reason plotly.py was unable to
locate it. In this case, set the `plotly.io.orca.config.executable`
property to the full path to your orca executable. For example:

    >>> plotly.io.orca.config.executable = '/path/to/orca'

After updating this executable property, try the export operation again.
If it is successful then you may want to save this configuration so that it
will be applied automatically in future sessions. You can do this as follows:

    >>> plotly.io.orca.config.save() 

If you're still having trouble, feel free to ask for help on the forums at
https://community.plot.ly/c/api/python
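
The search that produces the first part of this message can be approximated with shutil.which. This is a sketch under the assumption that the lookup is an ordinary PATH search; locate_executable is a hypothetical helper and its message is abbreviated.

```python
import os
import shutil


def locate_executable(name):
    """Return the full path of `name`, or raise listing the directories searched."""
    path = shutil.which(name)
    if path is None:
        searched = os.environ.get("PATH", os.defpath).split(os.pathsep)
        raise ValueError(
            "Searched for executable '{}' on the following path:\n{}".format(
                name, "\n".join("    " + d for d in searched)))
    return path
```

shutil.which also accepts an absolute path, which matches the suggestion above to set the executable property to the full path of the orca executable.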

Testing

I've added two new test suites.

  • plotly/tests/test_orca/test_orca_server.py. These tests cover the logic for locating and validating the orca executable, and the logic for launching the server and shutting it down. They rely on psutil to check that the process with the right pid is running and then not running, and on pinging the server to make sure it's running on the right port and that it stops responding once it has been shut down.

  • plotly/tests/test_orca/test_to_image.py. These tests cover the image conversion logic. I've generated a set of reference images to compare against. These ensure that valid images are produced where they should be, and that the topojson and mathjax configuration is working properly. Unfortunately, the images are not exactly reproducible between my local mac and CircleCI, so for the time being there is a separate directory of reference images for OS X and Linux (though I'm not sure Linux is fine grained enough).

These tests are working on CircleCI. The new tests follow a new conda environment pathway so that orca can be installed using conda. The tests are run with Python 2.7, 3.5, and 3.7.
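
The "is it responding on the right port" check in the server tests can be approximated at the TCP level. This is a stand-in for illustration, not the test code in this PR; port_is_open is a hypothetical helper and a plain listening socket plays the role of the server.

```python
import socket


def port_is_open(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# A plain listening socket plays the role of the running server.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

open_while_listening = port_is_open(port)

listener.close()                      # "server shut down"
open_after_close = port_is_open(port)
```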

Performance

The whole reason for using this more complex client/server architecture is to improve image export performance. So how well does it do?

This is not an extensive performance comparison, but I did an initial comparison of matplotlib, this branch, and bokeh (setup instructions). The test was to create a 1000 point scatter plot with varying point size and color and then save it to a png.

[Screenshot: export timing comparison of matplotlib, this branch, and bokeh]

So after the orca server is running, the export time here is right on par with matplotlib (~215ms), and much faster than bokeh (~1.7s).

Being on par with matplotlib here is really exciting, and opens up a lot of new use cases for plotly.py. I'm thinking, in particular, of the possibility of a static image backend for interactive use outside of the notebook/browser context.

Side note: bokeh isn't doing anything wrong here. This is just how expensive it is to launch a web browser from scratch. This is also about how long it takes the orca server to start up the first time. The advantage with this orca approach is that the server only needs to start up once per session, instead of once per image.

Produced images

And here are the images produced by matplotlib, this branch, and bokeh:
[Screenshots: the three exported scatter plots]

TODO

Various things still to do/look into:

  • Add validate option to to_image and write_image
  • Look into validating poppler installation, or providing better error message on eps failure.

Works in QtConsole if you initialize the qt event loop with:

from PyQt5.QtWebEngineWidgets import QWebEngineView
%gui qt

QWebEngineView import must precede the %gui qt command.
Added orca server process management tests
 - Save/load settings from ~/.plotly/.orca file
 - More validation
 - write image
 - add image options (format, size, scale)
Let's save some complexity and not support using an external orca
server for now.
The old approach required OS-specific process management and it still
didn't kill the child process for orca installed with npm.  Now
all of the OS-specifics are in psutil.

psutil is an optional import that is checked when the server is first
requested.
We could leave the plotly.io._show module in place, so people could
experiment with the image backend concept.
It emits some errors when children are killed, but these are harmless
This way program exit won't wait for it to complete
…Mathjax CDN

 - Add topojson files to `plotly/package_data`
 - Add new config settings for plotly.js bundle (use local by default), topojson, mathjax, and mapbox access token
 - Add image tests for `topojson` images and mathjax images
 - Remove saving of orca config to ~/.plotly. Need a more holistic settings solution that handles environments
 - Shutdown server when setting config parameters that won't be active until server restarts (e.g. plotlyjs bundle)
 - Make default timeout None. So shutting down the server due to inactivity is now opt-in.
to bypass figure dict validation.

Also improve presentation of orca error messages and added a special
check for EPS failures that might be due to the needed poppler
dependency
… fails to communicate

with the orca server process.
On windows, this avoids Popen being unable to find the orca executable
when it is on the environment path. [ci skip]
If orca returns a 525: 'plotly.js error', and the figure contains
at least one mapbox trace, and no mapbox_access_token is configured,
then include an error message explaining what to do.
@jonmmease
Contributor Author

Alright, time to merge this thing!
