This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

[cherry-pick] Performance PRs #54


john-bodley
Collaborator

@john-bodley john-bodley commented Jun 15, 2018

This PR cherry-picks a couple of perf PRs.

to: @graceguo-supercat @michellethomas @timifasubaa @williaster

xiaohanyu and others added 2 commits June 15, 2018 07:42
By stopping polling when the Presto query has already finished.

When a user runs a query against Presto via SQL Lab, Presto runs the
query and can then return all the data to Superset in one shot.

However, Superset's default implementation polls Presto in order to:

- Show the fancy progress bar
- Fetch the data once the query has finished

However, Superset's polling implementation is flawed: it keeps polling
after Presto has already finished the query.

I profiled a query against a table with about 1 billion rows. Here is
the data:

- Total number of rows: 1.02 billion
- SQL Lab query limit: 1 million
- Output data: 1.5 GB
- Superset memory consumed: about 10-20 GB
- Time: 7 minutes to finish in Presto, plus an additional 15 minutes for
  Superset to fetch and store the data

The problem with the default implementation is that even after Presto
has finished the query (7 minutes in the profile above), Superset still
does a lot of wasted polling. In the profile above, Superset sent about
540 polling requests in total, and at least half of them were
unnecessary.

Part of the simplified polling response:

```
{
  "infoUri": "http://10.65.204.39:8000/query.html?20180525_042715_03742_nza9u",
  "id": "20180525_042715_03742_nza9u",
  "nextUri": "http://10.65.204.39:8000/v1/statement/20180525_042715_03742_nza9u/11",
  "stats": {
    "state": "FINISHED",
    "queuedSplits": 21701,
    "progressPercentage": 35.98264191882267,
    "elapsedTimeMillis": 1029,
    "nodes": 116,
    "completedSplits": 15257,
    "scheduled": true,
    "wallTimeMillis": 2571904,
    "peakMemoryBytes": 0,
    "processedBytes": 40825519532,
    "processedRows": 47734066,
    "queuedTimeMillis": 0,
    "queued": false,
    "cpuTimeMillis": 849228,
    "rootStage": {
      "state": "FINISHED",
      "queuedSplits": 0,
      "nodes": 1,
      "totalSplits": 17,
      "processedBytes": 16829644,
      "processedRows": 11495,
      "completedSplits": 17,
      "stageId": "0",
      "done": true,
      "cpuTimeMillis": 69,
      "subStages": [
        {
          "state": "CANCELED",
          "queuedSplits": 21701,
          "nodes": 116,
          "totalSplits": 42384,
          "processedBytes": 40825519532,
          "processedRows": 47734066,
          "completedSplits": 15240,
          "stageId": "1",
          "done": true,
          "cpuTimeMillis": 849159,
          "subStages": [],
          "wallTimeMillis": 2570374,
          "userTimeMillis": 730020,
          "runningSplits": 5443
        }
      ],
      "wallTimeMillis": 1530,
      "userTimeMillis": 50,
      "runningSplits": 0
    },
    "totalSplits": 42401,
    "userTimeMillis": 730070,
    "runningSplits": 5443
  }
}
```

Superset stops polling only when it finds that `nextUri` is no longer
present in the response. In fact, `["stats"]["state"] == "FINISHED"`
already means that Presto has finished the query, so Superset can stop
polling at that point and fetch the data.
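
The stop condition described above can be sketched roughly as follows.
This is a minimal illustration, not the actual Superset code: the names
`should_stop_polling`, `poll_until_done` and `fetch_status` are
hypothetical, and `fetch_status` stands in for whatever call returns the
polling response shown earlier.

```python
import time
from typing import Any, Callable, Dict


def should_stop_polling(response: Dict[str, Any]) -> bool:
    """Return True once there is nothing left to wait for."""
    # Original condition: Presto no longer advertises a next page.
    if response.get("nextUri") is None:
        return True
    # Additional condition: the query itself is already finished.
    stats = response.get("stats") or {}
    return stats.get("state") == "FINISHED"


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],  # hypothetical status fetcher
    interval_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until the stop condition holds; return the number of polls sent."""
    polls = 0
    while True:
        response = fetch_status()
        polls += 1
        if should_stop_polling(response):
            return polls
        sleep(interval_seconds)
```

With this check, a response like the one above (`"state": "FINISHED"`
but `nextUri` still present) ends the loop immediately instead of
triggering further polls.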

After this simple optimization, we get a 2-5x performance boost for
Presto SQL Lab queries.
(cherry picked from commit b71f551)

* [webpack] setup lazy loading for all visualizations

* [lazy-load] push renderVis function to <Chart /> state

* no mapbox token

* [lazy loading] use native webpack import func to fix chunk names, add babel-plugin-syntax-dynamic-import, fix rebase bug.

* fix geojson import, undefined t, and fix async css bug

* [lazy load] actually add babel-plugin-syntax-dynamic-import

* [webpack] working dev version of webpack v4

* [webpack 4] fix url issues, use mini-css-extract-plugin and webpack-assets-manifest plugins

* [webpack 4] use splitchunks for all files, update templates to multi-file entrypoints

* [webpack 4] multiple theme entry files for markup vis css, don't uglify mapbox

* [webpack 4] lint python manifest changes, update yarn lock.

* [webpack 4] fix tests with babel-plugin-dynamic-import-node

* [webpack 4] only use 'dynamic-import-node' plugin in tests, update <Chart /> vis promise when vis type changes

* [webpack 4] clean up package.json and yarn.lock after rebase

* [webpack 4] lint?

* [webpack 4] lint for real

(cherry picked from commit de0aaf4)
@john-bodley john-bodley changed the title from "[cherry-pick] 4855 5132" to "[cherry-pick] Performance PRs" Jun 15, 2018
@john-bodley john-bodley deleted the john-bodley-cherry-pick-4855-5132 branch June 15, 2018 17:12