Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: speed up uuid column generation #11209

Merged
merged 5 commits into from
Oct 13, 2020
Merged

Conversation

ktmud
Copy link
Member

@ktmud ktmud commented Oct 9, 2020

SUMMARY

We have tens of thousands of dashboards, more than 200k slices, and 1.3 million table columns in our Superset deployment. It takes forever to run the db migration for #11098 . This PR tries to speed up the migration process by

  1. adding pagination to db queries, and
  2. using database built-in SQL functions to generate the UUIDs when possible.

I also tried to utilize ThreadPoolExecutor to parallelize the db operations and Python uuid generation, but it doesn't seem to help much.

Also fixes certain API errors while downgrading in MySQL and added the option to adjust batch size from command line.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Snip20201009_4

TEST PLAN

Make a copy of your very large database, then try:

superset db downgrade
superset db upgrade

ADDITIONAL INFORMATION

  • Has associated issue:
  • Changes UI
  • Requires DB Migration.
  • Confirm DB Migration upgrade and downgrade tested.
  • Introduces new feature or API
  • Removes existing feature or API

@codecov-io
Copy link

codecov-io commented Oct 9, 2020

Codecov Report

Merging #11209 into master will decrease coverage by 4.17%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #11209      +/-   ##
==========================================
- Coverage   65.65%   61.47%   -4.18%     
==========================================
  Files         829      829              
  Lines       39213    39244      +31     
  Branches     3593     3593              
==========================================
- Hits        25744    24126    -1618     
- Misses      13357    14937    +1580     
- Partials      112      181      +69     
Flag Coverage Δ
#cypress ?
#javascript 62.38% <ø> (ø)
#python 60.93% <0.00%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ns/b56500de1855_add_uuid_column_to_import_mixin.py 0.00% <0.00%> (ø)
superset-frontend/src/SqlLab/App.jsx 0.00% <0.00%> (-100.00%) ⬇️
superset-frontend/src/explore/App.jsx 0.00% <0.00%> (-100.00%) ⬇️
superset-frontend/src/dashboard/App.jsx 0.00% <0.00%> (-100.00%) ⬇️
superset-frontend/src/explore/index.jsx 0.00% <0.00%> (-100.00%) ⬇️
superset-frontend/src/dashboard/index.jsx 0.00% <0.00%> (-100.00%) ⬇️
superset-frontend/src/setup/setupColors.js 0.00% <0.00%> (-100.00%) ⬇️
superset-frontend/src/chart/ChartContainer.jsx 0.00% <0.00%> (-100.00%) ⬇️
superset-frontend/src/setup/setupFormatters.js 0.00% <0.00%> (-100.00%) ⬇️
superset-frontend/src/explore/reducers/index.js 0.00% <0.00%> (-100.00%) ⬇️
... and 166 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a6fc3d2...a2a86e2. Read the comment docs.

@@ -71,20 +74,31 @@ class ImportMixin:

models["dashboards"].position_json = sa.Column(utils.MediumText())

default_batch_size = int(os.environ.get("BATCH_SIZE", 200))
Copy link
Member Author

@ktmud ktmud Oct 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the bigger the better. I tested many different numbers (100, 200, 500, 1000, 2000) and 200 seems to be working the best (but obviously this will depend on the machine so providing a way to override it via env variables).

with op.batch_alter_table(model) as batch_op:
batch_op.drop_constraint(f"uq_{table_name}_uuid")
with op.batch_alter_table(model.__tablename__) as batch_op:
batch_op.drop_constraint(f"uq_{table_name}_uuid", type_="unique")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MySQL will throw an error if type_ is not specified.

@ktmud ktmud force-pushed the faster-uuid-migration branch 4 times, most recently from c63c1af to 96e3637 Compare October 9, 2020 20:40
@pull-request-size pull-request-size bot added size/L and removed size/M labels Oct 9, 2020
@ktmud ktmud changed the title perf: add db pagination to speed up uuid column migration perf: speed up uuid column generation Oct 9, 2020
@ktmud ktmud force-pushed the faster-uuid-migration branch 6 times, most recently from 5090b41 to 9162f28 Compare October 9, 2020 23:55
@ktmud ktmud mentioned this pull request Oct 10, 2020
6 tasks
Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, Jessie! Thanks for improving the perf and making the migration more resilient!

@ktmud ktmud merged commit d7eb1d4 into apache:master Oct 13, 2020
@bkyryliuk
Copy link
Member

got a bit too late to this pr, thanks @ktmud for the fix!

@ktmud ktmud deleted the faster-uuid-migration branch October 13, 2020 20:31
auxten pushed a commit to auxten/incubator-superset that referenced this pull request Nov 20, 2020
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.0.0 labels Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/L 🚢 1.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants