
Optimize expressions over HTTP #29118

Closed
chrisdavies opened this issue Jan 22, 2019 · 27 comments
Labels
Feature:Canvas Team:Presentation Presentation Team for Dashboard, Input Controls, and Canvas Team:Visualizations Visualization editors, elastic-charts and infrastructure

Comments

@chrisdavies
Contributor

chrisdavies commented Jan 22, 2019

Kibana is expanding its use of the Canvas expression DSL. The DSL's interpreter currently requires WebSockets in order to work properly. WebSockets are not well supported in several critical production Kibana installations, and possibly won't be for a long time.

We need to update the interpreter to work efficiently over HTTP/1. I've put a PoC together for how this might work. It involves batching requests and processing the response as a stream so that a long-running request doesn't block others in the batch.

Some possible directions are:

  • Drop the need for WebSockets, and use HTTP/1 with batching / streaming
  • Try to optimize socket.io's fallback to use batching / streaming

Making the interpreter 1-way instead of 2-way

The interpreter can be updated to behave as follows:

  • Server initializes itself by enumerating the functions it can run server-side
  • Server becomes a pretty dumb endpoint (just data fetching / things that can't be done client-side)
  • Client initializes itself by asking the server "what functions can I run on the server?"
  • Client interpreter calls the server as needed, but the server never calls / runs anything on the client
  • Client throws an error if it finds a function that is neither on the client nor on the server.
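
As a rough sketch, the client-side dispatch rule described above might look like the following. The names (`resolveFunctionLocation`, `clientFns`, `serverFns`) are illustrative, not Kibana APIs:

```typescript
// Hypothetical sketch of the client's dispatch rule: prefer local execution,
// fall back to the server's function list, otherwise error out.
type Location = 'client' | 'server';

function resolveFunctionLocation(
  name: string,
  clientFns: Set<string>,
  serverFns: Set<string>
): Location {
  if (clientFns.has(name)) return 'client';
  if (serverFns.has(name)) return 'server';
  // Neither side knows the function: surface an error to the caller.
  throw new Error(`Function not found on client or server: ${name}`);
}
```

The `serverFns` set would be populated at client initialization by asking the server which functions it can run.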

Evaluating the AST

As an example of the 1-way AST evaluation flow, imagine the browser has an AST that looks like this: (servera (clientb (serverc (clientd)))).

It would do the following:

  • Run the clientd function locally (in the browser)
  • Send a request to the server like: POST /api/canvasfn/serverc {result of clientd}
  • Run the clientb function locally (in the browser) with the result of the serverc call
  • Send a request to the server like POST /api/canvasfn/servera {result of clientb}
  • Return the final value

These are all pseudo-implementation details, but the flow is workable.
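
The evaluation walk above can be sketched as a small recursive function. The AST shape, the function names, and the server call are hypothetical stand-ins, not the real interpreter's types; the server call is injected so the flow is visible:

```typescript
// Illustrative-only sketch of the 1-way evaluation walk: evaluate the
// innermost argument first, then run this node's function locally if the
// client has it, otherwise POST the child result to the server.
type Ast = { fn: string; arg?: Ast };

async function evaluate(
  ast: Ast,
  clientFns: Record<string, (input: unknown) => unknown>,
  callServer: (fn: string, input: unknown) => Promise<unknown>
): Promise<unknown> {
  const input = ast.arg ? await evaluate(ast.arg, clientFns, callServer) : undefined;
  if (ast.fn in clientFns) return clientFns[ast.fn](input);
  // e.g. POST /api/canvasfn/<fn> with the child result as the body
  return callServer(ast.fn, input);
}
```

For (servera (clientb (serverc (clientd)))), this runs clientd locally, POSTs its result for serverc, runs clientb locally on that, and POSTs for servera — matching the step list above.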

Chattiness

The chattiness of this implementation can be somewhat mitigated with batching / buffering of requests, and by handling the response as a stream client-side. That plus keepalives will hopefully enable reasonable performance without WebSockets.

It's still possible that we fill the browser's connection pool and end up with long-running requests causing unwanted delays in the UI. If we see this, I have some thoughts about how we can mitigate it, but we'll cross that bridge only if necessary.
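
The batching idea can be sketched as a queue that collects requests issued in the same tick and flushes them as one batch. `createBatcher` and `sendBatch` are hypothetical names; `sendBatch` stands in for a transport such as a single POST carrying many function calls:

```typescript
// Minimal request batcher: callers enqueue individual requests and get back
// individual promises, but requests made in the same tick go over the wire
// as one batch.
function createBatcher<Req, Res>(
  sendBatch: (reqs: Req[]) => Promise<Res[]>
) {
  let queue: { req: Req; resolve: (r: Res) => void; reject: (e: unknown) => void }[] = [];
  let scheduled = false;

  return (req: Req): Promise<Res> =>
    new Promise((resolve, reject) => {
      queue.push({ req, resolve, reject });
      if (!scheduled) {
        scheduled = true;
        // Flush once the current tick's callers have all enqueued.
        queueMicrotask(() => {
          const pending = queue;
          queue = [];
          scheduled = false;
          sendBatch(pending.map((p) => p.req)).then(
            (results) => pending.forEach((p, i) => p.resolve(results[i])),
            (err) => pending.forEach((p) => p.reject(err))
          );
        });
      }
    });
}
```

Streaming the batched response client-side would then let each promise resolve as its result arrives, so one slow function in the batch doesn't block the rest.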

This is a follow up on an offline conversation with @alexbrasetvik, @jimgoodwin, @stacey-gammon, @timroes

@chrisdavies chrisdavies added triage_needed Team:Visualizations Visualization editors, elastic-charts and infrastructure labels Jan 22, 2019
@elasticmachine
Contributor

Pinging @elastic/kibana-app

@stacey-gammon
Contributor

cc @clintandrewhall @w33ble

@chrisdavies chrisdavies changed the title from "Remove Kibana's reliance on WebSockets" to "Optimize WebSocket fallback" Jan 23, 2019
@w33ble
Contributor

w33ble commented Jan 23, 2019

@chrisdavies I'd love to see the poc you're working on.

As for feedback, under Making the interpreter 1-way instead of 2-way, my one concern is that if server functions and REST requests are 1:1, that could be kind of slow since we change context a lot, especially with subexpressions. Since there's a lot of overlap in the common functions, that could be mitigated by letting the server run as much as it can and reporting back where it left off with the resulting output. You'd end up with all the benefits you pointed out, and still get some of the existing context-passing benefits without the socket.

@chrisdavies
Contributor Author

@w33ble let's sync. I'd love to show you and kick some ideas around.

I agree with the bit about "letting the server run as much as it can and reporting back where it left off with the resulting output". Otherwise, something like (servera (serverb (serverc (clienta)))) would mean 2 extra communications via the 1-way approach.

@rayafratkina rayafratkina added Team:Presentation Presentation Team for Dashboard, Input Controls, and Canvas and removed triage_needed labels Jan 29, 2019
@elasticmachine
Contributor

Pinging @elastic/kibana-canvas

@stacey-gammon
Contributor

related: #29653

@chrisdavies
Contributor Author

chrisdavies commented Jan 31, 2019

The game plan, after chatting with @epixa and @ppisljar, @stacey-gammon and others is:

  • Make Canvas use polling-only so there is a consistent experience
  • Move the Canvas functions to the browser, and remove the server from the picture entirely (use elasticsearch API from the browser to make this happen)
  • Stabilize the interpreter to properly surface errors (they seem to get swallowed a lot)
  • Optimize Canvas functions that do call the server, by batching them and streaming the response

We may return to a contextual server / client exchange as a future optimization step, if we see the need, but for simplicity's sake, we want to avoid that right now.

cc @elastic/kibana-canvas

@w33ble
Contributor

w33ble commented Jan 31, 2019

Move the Canvas functions to the browser, and remove the server from the picture entirely (use elasticsearch API from the browser to make this happen)

That seems fine for the Elasticsearch use case, for sure, since we control that. The way you worded that sounds like what we were trying to get away from in Kibana (talking directly to ES), but perhaps that's not what you mean.

Optimize Canvas functions that do call the server, by batching them and streaming the response

Given this item, I'm not really concerned about the other point. It sounds like server functions will still exist, we'll just try to use them less ourselves, which is fine; we don't have a lot of server-only functions right now anyway.

We may return to a contextual server / client exchange...

This part I'm not sure I understand correctly. Are you talking about the architecture specifically? It sounds like you're still going to send things off the server to allow execution there already. Can you clarify what you meant here?

@jimgoodwin

I agree that the "use elasticsearch API from the browser" statement needs some clarification... are we abandoning batching up requests to ES for efficiency? Is that going to create a lot of traffic to ES when we only have a thread or so to manage it?

@chrisdavies
Contributor Author

Sorry. The plan was to move the server functions to the browser, and have them use elasticsearch-browser to call Elastic through the Kibana API. That plan has been canned.

We're instead going to create an API endpoint for each of the current (I think 5) server functions, and we're going to create corresponding browser functions which call those endpoints. That way, the interpreter can run entirely in the browser, and dropping socket.io is a bit simpler once that's done.

@w33ble
Contributor

w33ble commented Jan 31, 2019

That makes sense. Just to be clear, you're not dropping server functions entirely then, just opting to coordinate with an API for our specific use cases, correct? Or are you saying that if a user wants to, for example, write a function that pulls data from some external API, they're going to need to write a Kibana endpoint and route through that?

@rashidkpc
Contributor

Sounds like there's a plan for tackling the use of socket.io. Let me know if I can help with any background on non-obvious existing implementation questions. Happy to Zoom.

@ppisljar
Member

ppisljar commented Feb 1, 2019

@chrisdavies why was the plan to move esdocs/escount to the browser canned? (the idea about moving them to browser functions and using elasticsearch-browser to call Elastic through the Kibana proxy)
And why for the other functions (demodata, pointseries)?

@chrisdavies
Contributor Author

@ppisljar, @tylersmalley and I chatted, and we're trying to deprecate use of elasticsearch-browser. Some of the server functions can be moved to the client, though, such as pointseries. I think we should do that kind of thing as part of a separate PR, though, as it's not really related to removing WebSockets at this point.

@stacey-gammon stacey-gammon changed the title from "Optimize WebSocket fallback" to "Optimize expressions over HTTP" Feb 6, 2019
@chrisdavies
Contributor Author

We've merged HTTP batching into master. That should get us close enough for a GA, I think. We still have optimizations that we could apply:

  • Streaming the batched HTTP responses back to the client and processing them as they come in
  • Return to the previous model of evaluation (where the server tries to evaluate as much as it can before handing evaluation back to the client)

These are both a bit more complex than the current solution, so I'd recommend not optimizing further until we know we need to. I'm happy to sync with anyone about how to stress / perf test this and determine if we need further optimizations.

@mw-ding
Contributor

mw-ding commented Feb 11, 2019

Hey all, I am following this topic about WebSocket (including PR and issues like #29792 and #29653 (comment)) because our Code plugin for Kibana also relies on the socket.io. The plugin now lies in a separate branch of the Kibana repo but is going to merge into master soon.

As I am reading through the discussions here, I realize that the intentions behind the adoption of socket.io in the Canvas and Code plugins are slightly different. Correct me if I'm wrong: the main reason Canvas uses socket.io is to maintain a stateful connection to interpret a series of function calls/expressions in its DSL. The Code plugin, instead, uses the WebSocket as a channel to notify the client/browser of the current progress of repository operations (e.g. clone, index, delete), to deliver a responsive, real-time user experience.

Previously, we mostly followed in the footsteps of the Canvas app in terms of socket.io usage, without fully understanding the context behind it. Now that Canvas has removed its dependency on socket.io, we are thinking about removing it as well, to keep behavior aligned across the entire Kibana platform. Before doing the actual work in the Code plugin, we just want to confirm the proposal with you all, in case anything is missing, since you have more knowledge on this.

To remove the dependency on WebSockets in the Code plugin, we are going to use client-side polling to get the progress data from REST APIs (say, every 5 seconds). This is the most straightforward way to achieve this. We don't need any sort of sticky session in a multi-node Kibana setup, because each request is independent of the others.
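
That polling loop could look roughly like the following. The `done` flag and `fetchStatus` are hypothetical stand-ins for the real status endpoint, and the interval (about 5000ms in practice) is injected so it can be shortened in tests:

```typescript
// Sketch of simple client-side polling: fetch the status, stop when the
// work is reported done, otherwise wait one interval and ask again. Each
// request is independent, so no sticky session is required.
async function pollUntilDone<T extends { done: boolean }>(
  fetchStatus: () => Promise<T>,
  intervalMs: number
): Promise<T> {
  for (;;) {
    const status = await fetchStatus();
    if (status.done) return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```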

Another option is Server-Sent Events (SSE). However, they're not available in IE or some older versions of Chrome/Firefox, because they're part of HTML5. Does Kibana have any existing usages of SSE so far? If so, we can consider this path as well.

Let me know if you have any questions.

@chrisdavies
Contributor Author

chrisdavies commented Feb 11, 2019

Hi Mengwei,

We don’t need any sort of sticky session in a multiple nodes Kibana setup, because each request is independent to each other.

Is your progress / context stored in Elasticsearch? I'm assuming so, but if not, polling won't be reliable in HA Kibana clusters, nor would the socket.io fallback (which is essentially polling).

I don't know what our browser support story is regarding Server-Sent events, but someone from @elastic/kibana-operations could probably speak better to that.

Anyway, the impetus for this issue was that we were seeing degraded performance in some of our production environments. Most of that may have been due to sub-optimal Canvas plugin builds. But some of it seems to have been due to the socket.io fallback. WebSockets + fallback made performance issues more difficult to reason about, and in the case of the interpreter, weren't strictly necessary.

In addition, the interpreter was going to be used across Kibana, meaning that all of Kibana would suddenly have a socket.io dependency, and we started to see regular and unnecessary polling requests happening on all screens.

If your changes are going to only live in Code, I think the impact to Kibana as a whole is not a concern.

So, the question becomes: do our production Kibana deployments have sufficient support to warrant the feature? Does it behave predictably in HA Kibana clusters? Is the support risk worth the reward (e.g. websockets behave very differently in different network configurations, and diagnosing what's going wrong can be tricky)?

@mw-ding
Contributor

mw-ding commented Feb 11, 2019

Is your progress / context stored in Elasticsearch? I'm assuming so, but if not, polling won't be reliable in HA Kibana clusters, nor would the socket.io fallback (which is essentially polling).

Yes, it's persisted in ES as well.

So, the question becomes: do our production Kibana deployments have sufficient support to warrant the feature? Does it behave predictably in HA Kibana clusters? Is the support risk worth the reward (e.g. websockets behave very differently in different network configurations, and diagnosing what's going wrong can be tricky)?

Yes, these are all valid concerns. And given these concerns, I think the simple polling would be a safe path for now.

@w33ble
Contributor

w33ble commented Feb 11, 2019

@mw-ding For what it's worth, this is how the Reporting interface works as well, it just keeps asking Elasticsearch for a list of jobs and their status. The experience feels "real-timey", even though it's not, and that ~5 second delay isn't really noticeable in practice. As long as the UI updates automatically, users don't really care about a small delay in that information, and it's much easier to deal with than the socket connection for all the reasons that have been pointed out. If you just need to update a UI to show the state of things, polling is probably going to lead to the least amount of headaches.

SSE, as far as I know, isn't something anyone has really looked into. I think it presents a lot of the same challenges that websockets do, mostly with regard to proxies and networking in general, but also session and security things, as you've pointed out.

All of that said, it might be nice to have an abstraction in Kibana that makes polling easier to set up, but to my knowledge nothing like that has been proposed. Now that more apps do this, something like that could be worth raising with the Operations team.

@mw-ding
Contributor

mw-ding commented Feb 13, 2019

@mw-ding For what it's worth, this is how the Reporting interface works as well, it just keeps asking Elasticsearch for a list of jobs and their status. The experience feels "real-timey", even though it's not, and that ~5 second delay isn't really noticeable in practice. As long as the UI updates automatically, users don't really care about a small delay in that information, and it's much easier to deal with than the socket connection for all the reasons that have been pointed out. If you just need to update a UI to show the state of things, polling is probably going to lead to the least amount of headaches.

It's good to learn that there are some precedents over there. Thanks for pointing this out.

All of that said, it might be nice to have an abstraction in Kibana that makes polling easier to set up, but to my knowledge nothing like that has been proposed. Now that more apps do this, something like that could be worth raising with the Operations team.

Completely agreed. So it sounds like 5-second polling is the best bet for now; later, if there is a Kibana-level abstraction, we can move to that. Can you share a link to the Reporting code, if possible? I just want to align with its implementation as much as possible for the sake of a potential future abstraction.

@w33ble
Contributor

w33ble commented Feb 13, 2019

@mw-ding here's the code... interestingly enough, it looks like someone already made a polling abstraction, which you can just use if you're working in the X-Pack code (and I believe you are). See the Poller class.

@stacey-gammon
Contributor

I think @joshdover looked into using esqueue for polling too, for the migration assistant work, and ended up doing something else, so that might be another example to use.

@joshdover
Contributor

Correct, we ended up creating our own mechanisms. We have a server-side poll for polling the state of Elasticsearch Tasks (probably not relevant to codesearch) and a client-side poll for updating the UI based on state stored in a saved object in the .kibana index.

For the client-side polling, we used a RxJS BehaviorSubject as the public abstraction for UI that is consuming the data. This makes switching out the polling mechanism under the hood very trivial.
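
For illustration, here is a dependency-free stand-in for that pattern (the real implementation uses RxJS's BehaviorSubject; this class only mimics its latest-value replay). Consumers subscribe to the latest polled value without caring how it was produced, which is what makes the polling mechanism swappable:

```typescript
// Minimal BehaviorSubject-like holder: keeps the latest value, replays it
// to new subscribers immediately, and pushes every update to all listeners.
class LatestValue<T> {
  private listeners = new Set<(v: T) => void>();
  constructor(private value: T) {}

  subscribe(fn: (v: T) => void): () => void {
    fn(this.value); // replay the current value right away, like BehaviorSubject
    this.listeners.add(fn);
    return () => this.listeners.delete(fn); // unsubscribe handle
  }

  next(v: T): void {
    this.value = v;
    this.listeners.forEach((fn) => fn(v));
  }
}
```

A polling loop would call `next(...)` with each fetched status; swapping polling for SSE or websockets later would leave subscribers untouched.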

@epixa
Contributor

epixa commented Feb 15, 2019

+1 to using client-side polling for code.

Some things to keep in mind that have bitten us on previous polling implementations:

  • Ideally we'd only be polling the backend when we know there is something to poll for, rather than just polling forever on the off-chance something pops up that's worth tracking. This generally means that a user action (like cloning a repo) would kick off the polling process. Loading the list of repos could also potentially kick off a polling process if it's not already running and there is a clone in progress.
  • Any state from polling should use Elasticsearch as the source of truth so it scales to multiple instance setups. It sounds like this is already how code works.
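
The first guideline could be sketched as a poller that only runs while tracked work is in flight (all names here are illustrative, not an existing Kibana utility):

```typescript
// Sketch of on-demand polling: a user action (e.g. cloning a repo) tracks a
// job and starts the interval; when the last tracked job completes, the
// interval is torn down so nothing polls "forever on the off-chance".
function createOnDemandPoller(onTick: () => void, intervalMs: number) {
  let inFlight = 0;
  let timer: ReturnType<typeof setInterval> | undefined;

  return {
    track() {
      inFlight++;
      if (!timer) timer = setInterval(onTick, intervalMs);
    },
    complete() {
      inFlight = Math.max(0, inFlight - 1);
      if (inFlight === 0 && timer) {
        clearInterval(timer);
        timer = undefined;
      }
    },
    isPolling: () => timer !== undefined,
  };
}
```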

@chrisdavies
Contributor Author

@stacey-gammon should we close this issue, since the main reason it was created is now resolved? If we need further optimization, we should probably open a specific issue for that. @mw-ding maybe you should open an issue to track replacing websockets with polling in Code?

@mw-ding
Contributor

mw-ding commented Feb 17, 2019

+1 to using client-side polling for code.

Some things to keep in mind that have bitten us on previous polling implementations:

  • Ideally we'd only be polling the backend when we know there is something to poll for, rather than just polling forever on the off-chance something pops up that's worth tracking. This generally means that a user action (like cloning a repo) would kick off the polling process. Loading the list of repos could also potentially kick off a polling process if it's not already running and there is a clone in progress.
  • Any state from polling should use Elasticsearch as the source of truth so it scales to multiple instance setups. It sounds like this is already how code works.

Thanks for the rules of thumb, @epixa. Very useful.

@chrisdavies Thanks, I think we have got what we want.

@Randy-312

I do hope you consider ECE and its proxy server when designing the new browser-to-ES interactions.
Also, what we can and cannot do on AWS with an ALB should be considered.
