fix(migration): Ensure the paginated update is deterministic #21778
Conversation
yield obj
session.merge(obj)

while True:
The options were to use either query.slice(...) or session.execute(query).fetchmany(...). I opted for the latter because:
- Otherwise one would need to include an order_by(...) condition, and there's no guarantee that one would be defined.
- Re-executing the query n times using a different OFFSET per query is likely neither efficient nor a guarantee of correct pagination, given that the filter condition could change (see the sketch below).
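For illustration, here's a minimal sketch of the fetchmany-based approach. It's a simplification with assumed defaults (batch size, signature), not the exact helper, and it assumes the query selects a single ORM entity:

```python
from typing import Any, Iterator

from sqlalchemy.orm import Query


def paginated_update(query: Query, batch_size: int = 1000) -> Iterator[Any]:
    """Yield mapped objects in batches from a single result set, committing per batch."""
    session = query.session
    # Execute the query exactly once; fetchmany() then pages through that one
    # result set, so updates issued while iterating cannot reshuffle or filter
    # out rows that are still pending.
    result = session.execute(query)
    while True:
        rows = result.fetchmany(batch_size)
        if not rows:
            break
        for row in rows:
            yield row[0]  # the mapped entity is the first element of each row
        session.commit()
```

Callers then simply mutate the yielded objects; as noted in the comments below, no explicit merge() or commit() should be needed inside the migration itself.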
end = min(start + batch_size, count)
for obj in query[start:end]:
    yield obj
    session.merge(obj)
There's no need to merge the record. The caller should handle this if required.
@@ -66,7 +66,6 @@ def upgrade():
     state["anchor"] = state["hash"]
     del state["hash"]
     entry.value = pickle.dumps(value)
-    session.commit()
There's no need to commit as paginated_update handles it.
@@ -87,5 +86,3 @@ def downgrade():
     state["hash"] = state["anchor"]
     del state["anchor"]
     entry.value = pickle.dumps(value)
-    session.merge(entry)
-    session.commit()
There's no need to commit as paginated_update handles it.
@@ -87,5 +86,3 @@ def downgrade():
     state["hash"] = state["anchor"]
     del state["anchor"]
     entry.value = pickle.dumps(value)
-    session.merge(entry)
There's no need to merge the existing entry. See the SQLAlchemy documentation on merge() for details:
"Used when you may have more than 1 in-memory objects which map to the same database record with some key."
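To illustrate the distinction, here's a small self-contained sketch (hypothetical Entry model and values, SQLAlchemy 1.4 assumed) of when merge() is and isn't needed:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Entry(Base):  # hypothetical model for illustration only
    __tablename__ = "entry"
    id = Column(Integer, primary_key=True)
    value = Column(String)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Entry(id=1, value="old"))
    session.commit()

    # An object loaded through the session already lives in its identity map:
    # mutating it is enough, and the change is flushed on commit.
    entry = session.get(Entry, 1)
    entry.value = "new"
    session.commit()

    # merge() is for the opposite case: a detached object constructed elsewhere
    # that must be reconciled with an in-session instance of the same primary key.
    detached = Entry(id=1, value="newer")
    merged = session.merge(detached)
    print(merged is entry)  # True: merge() resolved to the already-tracked instance
    session.commit()
```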
Codecov Report
@@            Coverage Diff             @@
##           master   #21778      +/-   ##
==========================================
- Coverage   66.88%   66.87%   -0.01%
==========================================
  Files        1802     1802
  Lines       68987    68988       +1
  Branches     7345     7345
==========================================
  Hits        46139    46139
- Misses      20951    20952       +1
  Partials     1897     1897
😱 Thanks for fixing this nasty bug! It must have taken a lot of time to debug.
Your comment reminded me: what if an object is updated in the loop in a way that makes it no longer match the query's filtering condition? Using the original offset after the previous batch was committed would mean some rows are inevitably skipped, which might be a bigger problem. This paginated update should really be used with caution. Maybe before each batch we should check whether the total count of the filtering query has changed.
Now that the query only executes once this isn't a problem. The result set is paginated rather than the query being sliced, which ensures the logic is indeed safe. Note that by default records from MySQL et al. may be ordered consistently. Your comment actually highlights the issue where the migration mutates the record, resulting in a different result set per iteration. This regression likely only impacted the visualization migrations, which are actually optional.
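To make the failure mode concrete, here's a self-contained sketch (hypothetical Slice model and made-up values, SQLAlchemy 1.4 assumed) showing the old offset-based slicing skipping rows once the migration mutates the filtered column:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Slice(Base):  # hypothetical stand-in for a chart record
    __tablename__ = "slices"
    id = Column(Integer, primary_key=True)
    viz_type = Column(String)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([Slice(id=i, viz_type="legacy") for i in range(1, 5)])
    session.commit()

    query = session.query(Slice).filter_by(viz_type="legacy")
    batch_size = 2

    # Broken pattern: re-slicing the filtered query per batch. Migrating a
    # record removes it from the filter, so the OFFSET of the next batch
    # skips over rows that were never processed.
    migrated = []
    for start in range(0, query.count(), batch_size):
        for obj in query[start : start + batch_size]:
            obj.viz_type = "new"
            migrated.append(obj.id)
        session.commit()

    print(migrated)       # e.g. [1, 2] -- two rows were silently skipped
    print(query.count())  # 2 records still match the old viz_type
```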
SUMMARY
This PR fixes an issue with the paginated_update method, which is used in a number of migrations. The problem is that the pagination was not deterministic: per iteration it sliced the query via a SQL statement using an OFFSET and LIMIT. If the results are not ordered in a consistent way, i.e., by primary key, the ordering of the sliced results is effectively random, meaning that a record may never be processed or may be processed multiple times.
Here is where the paginated_update method is called. It's used by the assign_uuids method, for updating the key_value table, and for migrating visualizations. The latter are likely the only ones of concern, given that those migrations change the visualization type, which is part of the query filter, and thus the pagination continually skips over eligible records. Thankfully these migrations are only optional, as the legacy visualization types are not yet deprecated. A future migration will be necessary regardless when the legacy visualization types are actually deprecated.
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TESTING INSTRUCTIONS
Tested locally.
ADDITIONAL INFORMATION