
Potential high CPU usage causing stuttering UI #802

Open · albrow opened this issue May 5, 2020 · 14 comments
Labels: browser, performance (Related to improving or measuring performance)

albrow (Contributor) commented May 5, 2020

Context

Augur has reported this issue on Discord. They are seeing UI stutters which they believe may be caused by Mesh consuming too much CPU.

  • Are you running Mesh in the browser or as a standalone server? Are you running Mesh inside of Docker or directly running the binary?

In the browser. Specifically Chrome. @pgebheim Do you know if this affects more browsers?

  • What version of Mesh are you running? Be as specific as possible (e.g., 8.0.1 instead of latest or 8).

@pgebheim can you answer this?

Expected Behavior

After an initial spike while compiling the Wasm, Mesh should not be using too much CPU.

Current Behavior

Mesh is apparently taking up a lot of CPU time and causing the UI to stutter.

Failure Information (for bugs)

See below regarding logs.

Steps to Reproduce

@pgebheim can you answer this? You mentioned that you are using a script to create Kovan orders. Can you send that our way so we can reproduce exactly what you are seeing?

Failure Logs

Logs and profiler output from Chrome have been sent to me and @jalextowle but are too big for GitHub. Please reach out to me if you need them.

albrow added the browser and bug (Something isn't working) labels May 5, 2020
albrow (Contributor, Author) commented May 5, 2020

@pgebheim do you have any other WebAssembly running at the same time as Mesh? I noticed a big discrepancy in the profile data that you sent me compared to running the example in the Mesh repo.


Here's our example. This is a profile captured over a 30-second window after Mesh had already warmed up (it had been running for about 60 seconds before I took the capture). I used the Task Manager feature of Chrome to keep an eye on CPU usage. It typically stayed around 7-10%, with occasional spikes to 25% when a new batch of incoming orders arrived. (Update: I did see it spike to 60% after letting it run for a lot longer. It sank back down to 5% after about 1 second, which I believe is the polling rate of the Task Manager.) This is on mainnet with around 2k-4k orders. (See the setup sketch after the notes below.)

[Screenshot: Screen Shot 2020-05-04 at 7 51 56 PM]

Two things to note:

  1. Our example does not have a UI, so I can't easily confirm whether this would cause stutters. However, based on past experience, I would not expect the CPU usage that we're seeing to cause any issues.
  2. Our example does not create any orders in the browser. It is possible that this creates more CPU demand. I would love to see whatever scripts you are using to create orders on Kovan so that we could confirm.
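
For reference, here is a minimal sketch of the kind of setup used for this profiling, assuming the @0x/mesh-browser bindings; the RPC URL and config values are placeholders, not the exact example code:

  import { Mesh } from '@0x/mesh-browser';

  (async () => {
    // Start a browser Mesh node on mainnet; the RPC URL is a placeholder.
    const mesh = new Mesh({
      ethereumChainID: 1,
      ethereumRPCURL: 'https://mainnet.infura.io/v3/YOUR_PROJECT_ID',
    });
    mesh.onError(err => console.error('Mesh error:', err));
    mesh.onOrderEvents(events => console.log(`received ${events.length} order events`));
    // Wasm compilation and peer discovery happen here; profile after this settles.
    await mesh.startAsync();
  })();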

Here is the profile data that you sent me:

[Screenshot: Screen Shot 2020-05-04 at 7 51 48 PM]

While there are a lot of similarities, the top entry when sorted by "Self Time" is something called "wasm-unnamed". I have not seen this before when looking at Mesh profiles, and I think it may be unrelated to Mesh. Is that possible?

albrow added the needs more info (More information is needed to help us understand what is causing the issue) label May 5, 2020
albrow (Contributor, Author) commented May 5, 2020

I captured my own performance profile by following the instructions from #803 and building and running Augur locally. I waited a little while for the application to warm up and then captured 30 seconds. Interestingly, I don't see anything called "wasm-unnamed". I wonder if this is just a difference in Chrome versions or Chrome settings? Here's what it looks like for me:

[Screenshot: Screen Shot 2020-05-05 at 1 48 49 PM]

Note that the top entry is something called "Function Call" which unfortunately doesn't tell us much. Digging into the flame graph gives a better picture of what is going on:

[Screenshot: Screen Shot 2020-05-05 at 1 50 51 PM]

The red marks are pointing out function calls that took longer than usual. Looks like there is one every few seconds. Let's zoom into one of these calls:

[Screenshot: Screen Shot 2020-05-05 at 1 53 22 PM]

I highlighted in red the portion of CPU time that is coming from Mesh. It looks like most of the time comes from mesh.getStatsAsync(). This takes up about 25% of the total CPU time for this particular "Function Call".

Here's the snippet of code that I believe corresponds to this particular "Function Call":

  private notifyNewBlockEvent = async (blockNumber: number, logs: ParsedLog[]): Promise<void> => {
    let lowestBlock = await (await this.db).syncStatus.getLowestSyncingBlockForAllDBs();

    if (lowestBlock === -1) {
      lowestBlock = blockNumber;
    }

    const blocksBehindCurrent = blockNumber - lowestBlock;
    const percentSynced = ((lowestBlock / blockNumber) * 100).toFixed(4);

    const timestamp = await this.augur.getTimestamp();

    // This is the call that shows up in the Mesh portion of the flame graph above.
    let stats: ZeroXStats = { peers: 0, orders: 0 };
    if (this.augur.zeroX) {
      stats = await this.augur.zeroX.getStats();
    }

    this.augur.events.emit(SubscriptionEventName.NewBlock, {
      eventName: SubscriptionEventName.NewBlock,
      highestAvailableBlockNumber: blockNumber,
      lastSyncedBlockNumber: lowestBlock,
      blocksBehindCurrent,
      percentSynced,
      timestamp: timestamp.toNumber(),
      logs,
      ...stats
    });
  };

Digging into the code further, it looks like mesh.getStatsAsync() is relatively expensive in terms of CPU cost due to inefficiencies in our current database implementation in the browser. This will hopefully be addressed after our big database refactor.

However, as far as I can tell, this accounts for only about 25% of the CPU time associated with what the Chrome profiler calls "Function Call". If you want to reduce the stress on the CPU, based on these results the first place to look should be the rest of the notifyNewBlockEvent function, outside of mesh.getStatsAsync().

If you do need to reduce the ~25% of CPU time that Mesh is using without waiting for our database upgrade, we have a few options for temporary workarounds. The most straightforward would be to simply not call mesh.getStatsAsync() for every new block. If that's not an option, we may be able to create a new function or add some new options so that you can get only the information you need in a more efficient way. Let me know what you want to do.
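
To illustrate the first workaround, here is a minimal hypothetical sketch that only refreshes stats every so often instead of on every new block; the interval, the caching variables, and the getStatsThrottled helper are assumptions for illustration, not Augur code:

  // Hypothetical throttling wrapper; ZeroXStats mirrors the shape used above.
  interface ZeroXStats { peers: number; orders: number; }

  const STATS_REFRESH_MS = 30_000; // refresh at most every 30 seconds (illustrative)
  let cachedStats: ZeroXStats = { peers: 0, orders: 0 };
  let lastStatsFetchMs = 0;

  async function getStatsThrottled(zeroX: { getStats(): Promise<ZeroXStats> }): Promise<ZeroXStats> {
    const now = Date.now();
    if (now - lastStatsFetchMs >= STATS_REFRESH_MS) {
      // Only hit the (currently expensive) stats query occasionally.
      cachedStats = await zeroX.getStats();
      lastStatsFetchMs = now;
    }
    return cachedStats;
  }

  // In notifyNewBlockEvent:
  // stats = this.augur.zeroX ? await getStatsThrottled(this.augur.zeroX) : stats;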

albrow (Contributor, Author) commented May 5, 2020

Note that I am not trying to measure CPU time on startup or during the warmup period while Mesh is getting orders for the first time. The reason is that there are a lot of variables that come into play (how good your browser is at compiling WebAssembly, how many orders are in the network, what's in the cache, etc.). I also think #803 is a confounding factor and could potentially be affecting the results. We can take a closer look at CPU time during startup after #803 is addressed.

pgebheim (Contributor) commented May 5, 2020

Oh, we can remove the call to getStatsAsync() right now because we ended up using a coarser-grained method of determining the state of the orderbook to display in the UI. I'll remove this now and see what sort of results we get.

albrow (Contributor, Author) commented May 5, 2020

@pgebheim I removed the call to mesh.getStatsAsync() from inside of notifyNewBlockEvent on my end. Here's what the zoomed in flame graph looks like:

[Screenshot: Screen Shot 2020-05-05 at 3 18 08 PM]

It's very similar to before except that the part of the CPU time that Mesh was responsible for is gone. I think you should take a look at what else is going on inside of notifyNewBlockEvent and whether it could be optimized.

albrow (Contributor, Author) commented May 8, 2020

Just updating this issue for the sake of consistency. While it is true that, under the conditions described above, Mesh is not responsible for a significant portion of the CPU time (after removing the unnecessary call to mesh.getStats()), we have also shifted our attention to different conditions for profiling. During the initial start and warm-up period, Mesh is taking up a significant portion of CPU time, mostly due to inefficiencies in the database and the ordersync protocol. We are working on determining which hot functions make the most sense to optimize, and we will then implement those optimizations one by one. To give you a rough idea, the current top culprits are:

  • Inefficient database operations across the board
  • Converting between Go and JavaScript types
  • Regex matching for custom order filters
  • Keccak-256 hashing

@jalextowle please feel free to update as needed.
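
As a side note, one way to quantify these hot spots from the JavaScript side is to wrap individual Mesh calls with the browser Performance API; the timeAsync helper below is just an illustrative sketch, not something in Mesh or Augur:

  // Illustrative timing helper using the standard browser Performance API.
  async function timeAsync<T>(label: string, fn: () => Promise<T>): Promise<T> {
    performance.mark(`${label}-start`);
    try {
      return await fn();
    } finally {
      performance.mark(`${label}-end`);
      performance.measure(label, `${label}-start`, `${label}-end`);
      const entries = performance.getEntriesByName(label);
      console.log(`${label}: ${entries[entries.length - 1].duration.toFixed(1)}ms`);
    }
  }

  // e.g. const stats = await timeAsync('mesh.getStatsAsync', () => mesh.getStatsAsync());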

mintcloud assigned mintcloud and jalextowle and unassigned mintcloud May 11, 2020
jalextowle (Contributor) commented:
@pgebheim The main focus right now has been on optimizing the current JSON schema validation. This is how order filters are used to validate orders, and it's become apparent that our current JSON schema validation logic is quite slow when it is executed in the browser. I'm currently working on attempting to use a JSON schema validator library called ajv (https://ajv.js.org/) in the WebAssembly version of Mesh instead of gojsonschema. I'll have a better idea of whether or not this will be a performance improvement later today, and I will update with progress then.
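
For context, this is roughly what validation with ajv looks like; the schema below is a made-up stand-in, not Mesh's actual order filter schema:

  import Ajv from 'ajv';

  // Compile the schema once and reuse the validator; this is where ajv gets its speed.
  const ajv = new Ajv();
  const orderFilterSchema = {
    type: 'object',
    properties: {
      makerAssetData: { type: 'string', pattern: '^0x[0-9a-fA-F]*$' },
    },
    required: ['makerAssetData'],
  };
  const validateOrder = ajv.compile(orderFilterSchema);

  if (!validateOrder({ makerAssetData: '0xf47261b0' })) {
    console.log(validateOrder.errors);
  }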

Another optimization that I have worked on is attempting to release the main thread using goroutines. This seems to have been a dead end, but we are communicating with the Go team to see if there is a way to accomplish this goal in Go that has been compiled to WebAssembly. I'm interested to hear how things are working once you are running the browser node in a Web Worker.

pgebheim (Contributor) commented:
@jalextowle -- Thanks for the update. We're working on getting an initial cut of the PR in. According to the PR author, this keeps the UI from freezing at all -- though syncing is still rather slow.

mintcloud modified the milestones: JSON encoding/decoding, Match orders May 26, 2020
loca555 commented Jun 1, 2020

Any news?

jalextowle (Contributor) commented:
@loca555 We have recently released version 9.4.0 of the Mesh network. This release includes a performance optimization that reduces the CPU usage of one hot spot that we identified while digging into this issue.

We are now working on several more optimizations, and I'll post any progress that we make here.

albrow (Contributor, Author) commented Jun 8, 2020

See #820 for another optimization.

albrow added the performance (Related to improving or measuring performance) label and removed the bug and needs more info labels Jun 8, 2020
albrow (Contributor, Author) commented Jun 17, 2020

#840 includes an optimization for FindMiniHeaders.

albrow (Contributor, Author) commented Jun 22, 2020

#846 includes an optimization for computeOptimalChunkSizes.

albrow (Contributor, Author) commented Aug 11, 2020

After a few rounds of optimization, we have greatly reduced CPU usage in the browser. In Chrome, we are seeing that Mesh does not hold onto the main thread for longer than 100ms at a time in most circumstances. In addition, Augur is now using Mesh in a Web Worker which eliminates stuttering altogether. There is still room for further CPU optimization if needed, but I think at this point we have solved the immediate problem of UI stuttering.
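
For anyone following along, the Web Worker approach looks roughly like the sketch below; mesh-worker.js and the message shapes are assumptions for illustration, not Augur's actual implementation:

  // Main thread: run Mesh in a worker so Wasm execution never blocks the UI.
  // 'mesh-worker.js' and the message protocol are illustrative assumptions.
  const meshWorker = new Worker('mesh-worker.js');
  meshWorker.postMessage({ type: 'start', ethereumChainID: 1 });
  meshWorker.onmessage = event => {
    if (event.data.type === 'orderEvents') {
      // Update UI state from worker messages instead of calling into Wasm directly.
      console.log(`received ${event.data.events.length} order events`);
    }
  };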

@pgebheim is it okay if we close this?
