
Optimize expressions over HTTP #29118

Closed
chrisdavies opened this issue Jan 22, 2019 · 27 comments
Labels
Feature:Canvas Team:Presentation Presentation Team for Dashboard, Input Controls, and Canvas Team:Visualizations Visualization editors, elastic-charts and infrastructure

Comments

@chrisdavies
Contributor

chrisdavies commented Jan 22, 2019

Kibana is expanding its use of the Canvas expression DSL. The DSL's interpreter currently requires WebSockets in order to work properly. WebSockets are not well supported in several critical production Kibana installations, and possibly won't be for a long time.

We need to update the interpreter to work efficiently over HTTP/1. I've put a PoC together for how this might work. It involves batching requests and processing the response as a stream so that a long-running request doesn't block others in the batch.

Some possible directions are:

  • Drop the need for WebSockets, and use HTTP/1 with batching / streaming
  • Try to optimize socket.io's fallback to use batching / streaming

Making the interpreter 1-way instead of 2-way

The interpreter can be updated to behave as follows:

  • Server initializes itself by enumerating the functions it can run server-side
  • Server becomes a pretty dumb endpoint (just data fetching / things that can't be done client-side)
  • Client initializes itself by asking the server "what functions can I run on the server?"
  • Client interpreter calls the server as needed, but the server never calls / runs anything on the client
  • Client throws an error if it finds a function that is neither on the client nor on the server.
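
As a rough sketch, the client-side dispatch rule described above might look like the following. The names (`resolveFunctionLocation`, `clientFns`, `serverFns`) are illustrative, not Kibana APIs:

```typescript
// Hypothetical sketch of the client's dispatch rule: prefer local execution,
// fall back to the server's function list, otherwise error out.
type Location = 'client' | 'server';

function resolveFunctionLocation(
  name: string,
  clientFns: Set<string>,
  serverFns: Set<string>
): Location {
  if (clientFns.has(name)) return 'client';
  if (serverFns.has(name)) return 'server';
  // Neither side knows the function: surface an error to the caller.
  throw new Error(`Function not found on client or server: ${name}`);
}
```

The `serverFns` set would be populated at client initialization by asking the server which functions it can run.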

Evaluating the AST

As an example of the 1-way AST evaluation flow, imagine the browser has an AST that looks like this: (servera (clientb (serverc (clientd)))).

It would do the following:

  • Run the clientd function locally (in the browser)
  • Send a request to the server like: POST /api/canvasfn/serverc {result of clientd}
  • Run the clientb function locally (in the browser) with the result of the serverc call
  • Send a request to the server like POST /api/canvasfn/servera {result of clientb}
  • Return the final value

These are all pseudo-implementation details, but the flow is workable.
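
The evaluation walk above can be sketched as a small recursive function. The AST shape, the function names, and the server call are hypothetical stand-ins, not the real interpreter's types; the server call is injected so the flow is visible:

```typescript
// Illustrative-only sketch of the 1-way evaluation walk: evaluate the
// innermost argument first, then run this node's function locally if the
// client has it, otherwise POST the child result to the server.
type Ast = { fn: string; arg?: Ast };

async function evaluate(
  ast: Ast,
  clientFns: Record<string, (input: unknown) => unknown>,
  callServer: (fn: string, input: unknown) => Promise<unknown>
): Promise<unknown> {
  const input = ast.arg ? await evaluate(ast.arg, clientFns, callServer) : undefined;
  if (ast.fn in clientFns) return clientFns[ast.fn](input);
  // e.g. POST /api/canvasfn/<fn> with the child result as the body
  return callServer(ast.fn, input);
}
```

For (servera (clientb (serverc (clientd)))), this runs clientd locally, POSTs its result for serverc, runs clientb locally on that, and POSTs for servera — matching the step list above.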

Chattiness

The chattiness of this implementation can be somewhat mitigated with batching / buffering of requests, and by handling the response as a stream client-side. That plus keepalives will hopefully enable reasonable performance without WebSockets.

It's still possible that we fill the browser's connection pool and end up with long-running requests causing unwanted delays in the UI. If we see this, I have some thoughts about how we can mitigate it, but we'll cross that bridge only if necessary.
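
The batching idea can be sketched as a queue that collects requests issued in the same tick and flushes them as one batch. `createBatcher` and `sendBatch` are hypothetical names; `sendBatch` stands in for a transport such as a single POST carrying many function calls:

```typescript
// Minimal request batcher: callers enqueue individual requests and get back
// individual promises, but requests made in the same tick go over the wire
// as one batch.
function createBatcher<Req, Res>(
  sendBatch: (reqs: Req[]) => Promise<Res[]>
) {
  let queue: { req: Req; resolve: (r: Res) => void; reject: (e: unknown) => void }[] = [];
  let scheduled = false;

  return (req: Req): Promise<Res> =>
    new Promise((resolve, reject) => {
      queue.push({ req, resolve, reject });
      if (!scheduled) {
        scheduled = true;
        // Flush once the current tick's callers have all enqueued.
        queueMicrotask(() => {
          const pending = queue;
          queue = [];
          scheduled = false;
          sendBatch(pending.map((p) => p.req)).then(
            (results) => pending.forEach((p, i) => p.resolve(results[i])),
            (err) => pending.forEach((p) => p.reject(err))
          );
        });
      }
    });
}
```

Streaming the batched response client-side would then let each promise resolve as its result arrives, so one slow function in the batch doesn't block the rest.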

This is a follow up on an offline conversation with @alexbrasetvik, @jimgoodwin, @stacey-gammon, @timroes

@chrisdavies chrisdavies added triage_needed Team:Visualizations Visualization editors, elastic-charts and infrastructure labels Jan 22, 2019
@elasticmachine
Contributor

Pinging @elastic/kibana-app

@stacey-gammon
Contributor

cc @clintandrewhall @w33ble

@chrisdavies chrisdavies changed the title from "Remove Kibana's reliance on WebSockets" to "Optimize WebSocket fallback" Jan 23, 2019
@w33ble
Contributor

w33ble commented Jan 23, 2019

@chrisdavies I'd love to see the poc you're working on.

As for feedback, under Making the interpreter 1-way instead of 2-way, my one concern is that if server functions and REST requests are 1:1, that could be kind of slow since we change context a lot, especially with subexpressions. Since there's a lot of overlap in the common functions, that could be mitigated by letting the server run as much as it can and reporting back where it left off with the resulting output. You'd end up with all the benefits you pointed out, and still get some of the existing context-passing benefits without the socket.

@chrisdavies
Contributor Author

@w33ble let's sync. I'd love to show you and kick some ideas around.

I agree with the bit about "letting the server run as much as it can and reporting back where it left off with the resulting output". Otherwise, something like (servera (serverb (serverc (clienta)))) would mean 2 extra communications via the 1-way approach.

@rayafratkina rayafratkina added Team:Presentation Presentation Team for Dashboard, Input Controls, and Canvas and removed triage_needed labels Jan 29, 2019
@elasticmachine
Contributor

Pinging @elastic/kibana-canvas

@stacey-gammon
Contributor

related: #29653

@chrisdavies
Contributor Author

chrisdavies commented Jan 31, 2019

The game plan, after chatting with @epixa and @ppisljar, @stacey-gammon and others is:

  • Make Canvas use polling-only so there is a consistent experience
  • Move the Canvas functions to the browser, and remove the server from the picture entirely (use elasticsearch API from the browser to make this happen)
  • Stabilize the interpreter to properly surface errors (they seem to get swallowed a lot)
  • Optimize Canvas functions that do call the server, by batching them and streaming the response

We may return to a contextual server / client exchange as a future optimization step, if we see the need, but for simplicity's sake, we want to avoid that right now.

cc @elastic/kibana-canvas

@w33ble
Contributor

w33ble commented Jan 31, 2019

Move the Canvas functions to the browser, and remove the server from the picture entirely (use elasticsearch API from the browser to make this happen)

That seems fine for the Elasticsearch use case, for sure, since we control that. The way you worded that sounds like what we were trying to get away from in Kibana (talking directly to ES), but perhaps that's not what you mean.

Optimize Canvas functions that do call the server, by batching them and streaming the response

Given this item, I'm not really concerned about the other point. It sounds like server functions will still exist, we'll just try to use them less ourselves, which is fine; we don't have a lot of server-only functions right now anyway.

We may return to a contextual server / client exchange...

This part I'm not sure I understand correctly. Are you talking about the architecture specifically? It sounds like you're still going to send things off the server to allow execution there already. Can you clarify what you meant here?

@jimgoodwin

I agree that the "use elasticsearch API from the browser" statement needs some clarification... are we abandoning batching up requests to ES for efficiency? Is that going to create a lot of traffic to ES when we only have a thread or so to manage it?

@chrisdavies
Contributor Author

Sorry. The plan was to move the server functions to the browser, and have them use elasticsearch-browser to call Elastic through the Kibana API. That plan has been canned.

We're instead going to create an API endpoint for each of the current (I think 5) server functions, and we're going to create corresponding browser functions which call those endpoints. That way, the interpreter can run entirely in the browser, and dropping socket.io is a bit simpler once that's done.

@w33ble
Contributor

w33ble commented Jan 31, 2019

That makes sense. Just to be clear, you're not dropping server functions entirely then, just opting to coordinate with an API for our specific use cases, correct? Or are you saying that if a user wants to, for example, write a function that pulls data from some external API, they're going to need to write a Kibana endpoint and route through that?

@rashidkpc
Contributor

Sounds like there's a plan for tackling the use of socket.io. Let me know if I can help with any background on non-obvious existing implementation questions. Happy to Zoom.

@ppisljar
Member

ppisljar commented Feb 1, 2019

@chrisdavies why was the plan to move esdocs/escount to the browser canned? (the idea about moving them to browser functions and using elasticsearch-browser to call Elastic through the Kibana proxy)
And why for the other functions (demodata, pointseries)?

@chrisdavies
Contributor Author

@ppisljar, @tylersmalley and I chatted, and we're trying to deprecate use of elasticsearch-browser. Some of the server functions can be moved to the client, though, such as pointseries. I think we should do that kind of thing as part of a separate PR, though, as it's not really related to removing WebSockets at this point.

@stacey-gammon stacey-gammon changed the title from "Optimize WebSocket fallback" to "Optimize expressions over HTTP" Feb 6, 2019
@chrisdavies
Contributor Author

We've merged HTTP batching into master. That should get us close enough for a GA, I think. We still have optimizations that we could apply:

  • Streaming the batched HTTP responses back to the client and processing them as they come in
  • Return to the previous model of evaluation (where the server tries to evaluate as much as it can before handing evaluation back to the client)

These are both a bit more complex than the current solution, so I'd recommend not optimizing further until we know we need to. I'm happy to sync with anyone about how to stress / perf test this and determine if we need further optimizations.

@mw-ding
Contributor

mw-ding commented Feb 11, 2019

Hey all, I am following this topic about WebSocket (including PR and issues like #29792 and #29653 (comment)) because our Code plugin for Kibana also relies on the socket.io. The plugin now lies in a separate branch of the Kibana repo but is going to merge into master soon.

As I am reading through the discussions here, I realize that the intentions behind the adoption of socket.io in the Canvas and Code plugins are slightly different. Correct me if I'm wrong: the main reason Canvas uses socket.io is to maintain a stateful connection to interpret a series of function calls/expressions in its DSL. The Code plugin, instead, uses the WebSocket as a channel to notify the client/browser of the current progress of repository operations (e.g. clone, index, delete), to deliver a responsive, real-time user experience.

Previously, we mostly followed in the footsteps of the Canvas app in terms of socket.io usage, without fully understanding the context behind it. Now that Canvas has removed its dependency on socket.io, we are thinking about removing it as well, to keep behavior aligned across the entire Kibana platform. Before doing the actual work in the Code plugin, we just want to confirm the proposal with you all, in case anything is missing, since you have more knowledge on this.

To remove the dependency on WebSockets in the Code plugin, we are going to use client-side polling to get the progress data from REST APIs (say, every 5 seconds). This is the most straightforward way to achieve this. We don't need any sort of sticky session in a multi-node Kibana setup, because each request is independent of the others.
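
That polling loop could look roughly like the following. The `done` flag and `fetchStatus` are hypothetical stand-ins for the real status endpoint, and the interval (about 5000ms in practice) is injected so it can be shortened in tests:

```typescript
// Sketch of simple client-side polling: fetch the status, stop when the
// work is reported done, otherwise wait one interval and ask again. Each
// request is independent, so no sticky session is required.
async function pollUntilDone<T extends { done: boolean }>(
  fetchStatus: () => Promise<T>,
  intervalMs: number
): Promise<T> {
  for (;;) {
    const status = await fetchStatus();
    if (status.done) return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```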

Another option is Server-Sent Events (SSE). However, they're not available in IE or some older versions of Chrome/Firefox, because they're part of HTML5. Does Kibana have any existing usages of SSE so far? If so, we can consider this path as well.

Let me know if you have any questions.

@chrisdavies
Contributor Author

chrisdavies commented Feb 11, 2019

Hi Mengwei,

We don’t need any sort of sticky session in a multiple nodes Kibana setup, because each request is independent to each other.

Is your progress / context stored in Elasticsearch? I'm assuming so, but if not, polling won't be reliable in HA Kibana clusters, nor would the socket.io fallback (which is essentially polling).

I don't know what our browser support story is regarding Server-Sent events, but someone from @elastic/kibana-operations could probably speak better to that.

Anyway, the impetus for this issue was that we were seeing degraded performance in some of our production environments. Most of that may have been due to sub-optimal Canvas plugin builds. But some of it seems to have been due to the socket.io fallback. WebSockets + fallback made performance issues more difficult to reason about, and in the case of the interpreter, weren't strictly necessary.

In addition, the interpreter was going to be used across Kibana, meaning that all of Kibana would suddenly have a socket.io dependency, and we started to see regular and unnecessary polling requests happening on all screens.

If your changes are going to only live in Code, I think the impact to Kibana as a whole is not a concern.

So, the question becomes: do our production Kibana deployments have sufficient support to warrant the feature? Does it behave predictably in HA Kibana clusters? Is the support risk worth the reward (e.g. websockets behave very differently in different network configurations, and diagnosing what's going wrong can be tricky)?

@mw-ding
Contributor

mw-ding commented Feb 11, 2019

Is your progress / context stored in Elasticsearch? I'm assuming so, but if not, polling won't be reliable in HA Kibana clusters, nor would the socket.io fallback (which is essentially polling).

Yes, it's persisted in ES as well.

So, the question becomes: do our production Kibana deployments have sufficient support to warrant the feature? Does it behave predictably in HA Kibana clusters? Is the support risk worth the reward (e.g. websockets behave very differently in different network configurations, and diagnosing what's going wrong can be tricky)?

Yes, these are all valid concerns. And given these concerns, I think the simple polling would be a safe path for now.

@w33ble
Contributor

w33ble commented Feb 11, 2019

@mw-ding For what it's worth, this is how the Reporting interface works as well, it just keeps asking Elasticsearch for a list of jobs and their status. The experience feels "real-timey", even though it's not, and that ~5 second delay isn't really noticeable in practice. As long as the UI updates automatically, users don't really care about a small delay in that information, and it's much easier to deal with than the socket connection for all the reasons that have been pointed out. If you just need to update a UI to show the state of things, polling is probably going to lead to the least amount of headaches.

SSE, as far as I know, isn't something anyone has really looked into. I think it presents a lot of the same challenges that websockets do, mostly with regard to proxies and networking in general, but also session and security things, as you've pointed out.

All of that said, it might be nice to have an abstraction in Kibana that makes polling easier to set up, but to my knowledge nothing like that has been proposed. Now that more apps do this, something like that could be worth raising with the Operations team.

@mw-ding
Contributor

mw-ding commented Feb 13, 2019

@mw-ding For what it's worth, this is how the Reporting interface works as well, it just keeps asking Elasticsearch for a list of jobs and their status. The experience feels "real-timey", even though it's not, and that ~5 second delay isn't really noticeable in practice. As long as the UI updates automatically, users don't really care about a small delay in that information, and it's much easier to deal with than the socket connection for all the reasons that have been pointed out. If you just need to update a UI to show the state of things, polling is probably going to lead to the least amount of headaches.

It's good to learn that there are some precedents over there. Thanks for pointing this out.

All of that said, it might be nice to have an abstraction in Kibana that makes polling easier to set up, but to my knowledge nothing like that has been proposed. Now that more apps do this, something like that could be worth raising with the Operations team.

Completely agreed. So it sounds like 5-second polling is the best bet for now; later, if there is a Kibana-level abstraction, we can move to that. Can you share a link to the Reporting code, if possible? I just want to align with its implementation as much as possible for the sake of a potential future abstraction.

@w33ble
Contributor

w33ble commented Feb 13, 2019

@mw-ding here's the code... interestingly enough, it looks like someone already made a polling abstraction, which you can just use if you're working in the X-Pack code (and I believe you are). See the Poller class.

@stacey-gammon
Contributor

I think @joshdover looked into using esqueue for polling too, for the migration assistant work, and ended up doing something else, so that might be another example to use.

@joshdover
Contributor

Correct, we ended up creating our own mechanisms. We have a server-side poll for polling the state of Elasticsearch Tasks (probably not relevant to codesearch) and a client-side poll for updating the UI based on state stored in a saved object in the .kibana index.

For the client-side polling, we used a RxJS BehaviorSubject as the public abstraction for UI that is consuming the data. This makes switching out the polling mechanism under the hood very trivial.
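
For illustration, here is a dependency-free stand-in for that pattern (the real implementation uses RxJS's BehaviorSubject; this class only mimics its latest-value replay). Consumers subscribe to the latest polled value without caring how it was produced, which is what makes the polling mechanism swappable:

```typescript
// Minimal BehaviorSubject-like holder: keeps the latest value, replays it
// to new subscribers immediately, and pushes every update to all listeners.
class LatestValue<T> {
  private listeners = new Set<(v: T) => void>();
  constructor(private value: T) {}

  subscribe(fn: (v: T) => void): () => void {
    fn(this.value); // replay the current value right away, like BehaviorSubject
    this.listeners.add(fn);
    return () => this.listeners.delete(fn); // unsubscribe handle
  }

  next(v: T): void {
    this.value = v;
    this.listeners.forEach((fn) => fn(v));
  }
}
```

A polling loop would call `next(...)` with each fetched status; swapping polling for SSE or websockets later would leave subscribers untouched.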

@epixa
Contributor

epixa commented Feb 15, 2019

+1 to using client-side polling for code.

Some things to keep in mind that have bitten us on previous polling implementations:

  • Ideally we'd only be polling the backend when we know there is something to poll for, rather than just polling forever on the off-chance something pops up that's worth tracking. This generally means that a user action (like cloning a repo) would kick off the polling process. Loading the list of repos could also potentially kick off a polling process if it's not already running and there is a clone in progress.
  • Any state from polling should use Elasticsearch as the source of truth so it scales to multiple instance setups. It sounds like this is already how code works.
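
The first guideline could be sketched as a poller that only runs while tracked work is in flight (all names here are illustrative, not an existing Kibana utility):

```typescript
// Sketch of on-demand polling: a user action (e.g. cloning a repo) tracks a
// job and starts the interval; when the last tracked job completes, the
// interval is torn down so nothing polls "forever on the off-chance".
function createOnDemandPoller(onTick: () => void, intervalMs: number) {
  let inFlight = 0;
  let timer: ReturnType<typeof setInterval> | undefined;

  return {
    track() {
      inFlight++;
      if (!timer) timer = setInterval(onTick, intervalMs);
    },
    complete() {
      inFlight = Math.max(0, inFlight - 1);
      if (inFlight === 0 && timer) {
        clearInterval(timer);
        timer = undefined;
      }
    },
    isPolling: () => timer !== undefined,
  };
}
```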

@chrisdavies
Contributor Author

@stacey-gammon should we close this issue, since the main reason it was created is now resolved? If we need further optimization, we should probably open a specific issue for that. @mw-ding maybe you should open an issue to track replacing websockets with polling in Code?

@mw-ding
Contributor

mw-ding commented Feb 17, 2019

+1 to using client-side polling for code.

Some things to keep in mind that have bitten us on previous polling implementations:

  • Ideally we'd only be polling the backend when we know there is something to poll for, rather than just polling forever on the off-chance something pops up that's worth tracking. This generally means that a user action (like cloning a repo) would kick off the polling process. Loading the list of repos could also potentially kick off a polling process if it's not already running and there is a clone in progress.
  • Any state from polling should use Elasticsearch as the source of truth so it scales to multiple instance setups. It sounds like this is already how code works.

Thanks for the rules of thumb, @epixa. Very useful.

@chrisdavies Thanks, I think we have got what we want.

@Randy-312

I do hope you consider ECE and its proxy server when designing the new browser-to-ES interactions.
Also, what we can and cannot do on AWS with an ALB should be considered.
