[SO Migration] fix reindex race on multi-instance mode #104516

pgayvallet · 2021-07-06T17:28:01Z

Summary

Fix #99211
Integration test inspired from #100171

In multi-instance mode, the leading node adds the write block on the temp index as soon as it finishes the source->temp reindex. This could cause the trailing nodes to fail if they were still performing the same reindex, because the temp index was now write-blocked. In theory this had no impact on the migration itself (as long as the leading node successfully completed the next steps), but the trailing nodes still failed and required a restart, and they logged migration failures even though the migration had probably completed correctly.

This PR fixes it by having the bulkOverwriteTransformedDocuments action identify cluster_block_exception failures and return a specific left value in that case (instead of throwing), which allows the model to jump directly to the next step instead of failing.

Checklist

Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenarios

pgayvallet · 2021-07-07T09:42:55Z

src/core/server/saved_objects/migrationsv2/actions/bulk_overwrite_transformed_documents.ts

-      const errors = (res.body.items ?? []).filter(
-        (item) => item.index?.error?.type !== 'version_conflict_engine_exception'
-      );
+      const errors = (res.body.items ?? [])
+        .filter((item) => item.index?.error)
+        .map((item) => item.index!.error!)
+        .filter(({ type }) => type !== 'version_conflict_engine_exception');


Not even sure how this was working previously with non-error items, as

> ({}).index?.error?.type !== 'version_conflict_engine_exception'
< true

but, in doubt, I added more steps for the sake of readability.
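
To make the difference concrete, here is a minimal sketch comparing the two filters on a hand-written bulk response. The `BulkItem` and `ErrorCause` shapes and the sample data are simplified assumptions, not the real Elasticsearch client types.

```typescript
// Simplified, hypothetical shapes; the real client types are richer.
interface ErrorCause {
  type: string;
  reason: string;
}
interface BulkItem {
  index?: { status: number; error?: ErrorCause };
}

const items: BulkItem[] = [
  // Successful item: no `error` property at all.
  { index: { status: 201 } },
  {
    index: {
      status: 409,
      error: { type: 'version_conflict_engine_exception', reason: 'conflict' },
    },
  },
  {
    index: {
      status: 403,
      error: { type: 'cluster_block_exception', reason: 'blocked' },
    },
  },
];

// Old filter: for the successful item, `undefined !== '...'` is true,
// so it is kept even though it carries no error.
const oldErrors = items.filter(
  (item) => item.index?.error?.type !== 'version_conflict_engine_exception'
);

// New filter: keep only items with an error, then drop version conflicts.
const newErrors = items
  .filter((item) => item.index?.error)
  .map((item) => item.index!.error!)
  .filter(({ type }) => type !== 'version_conflict_engine_exception');

console.log(oldErrors.length, newErrors.length); // prints "2 1"
```

With the old filter, the result mixes the successful item with the real error; the new chain yields only the `cluster_block_exception` cause.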

pgayvallet · 2021-07-07T09:44:14Z

src/core/server/saved_objects/migrationsv2/actions/bulk_overwrite_transformed_documents.ts

+        if (errors.every(isWriteBlockException)) {
+          return Either.left({
+            type: 'target_index_had_write_block' as const,
+          });
+        }


It's very likely that if any write_block exception is encountered, all the objects encountered it, but just in case another error was returned, we check that all errors are actually write block exceptions.
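
As a rough sketch of the decision described above: the `Either` stand-ins, the `classifyBulkErrors` helper, and the `'bulk_index_succeeded'` value are assumptions made for illustration; only `isWriteBlockException` and the `target_index_had_write_block` left value come from this PR's diff.

```typescript
// Minimal stand-ins for the Either type used by the action (assumption).
type Left<E> = { _tag: 'Left'; left: E };
type Right<A> = { _tag: 'Right'; right: A };
type Either<E, A> = Left<E> | Right<A>;

// Simplified error shape (assumption).
interface ErrorCause {
  type: string;
  reason: string;
}

// Same check as the helper added in es_errors.ts.
const isWriteBlockException = ({ type, reason }: ErrorCause): boolean =>
  type === 'cluster_block_exception' &&
  reason.match(/index \[.+] blocked by: \[FORBIDDEN\/8\/.+ \(api\)\]/) !== null;

// Hypothetical helper: `errors` is assumed to already exclude version conflicts.
function classifyBulkErrors(
  errors: ErrorCause[]
): Either<{ type: 'target_index_had_write_block' }, 'bulk_index_succeeded'> {
  // Check for "no errors" first: `[].every(...)` is true for an empty array.
  if (errors.length === 0) {
    return { _tag: 'Right', right: 'bulk_index_succeeded' };
  }
  // Only report the write block if every remaining error is one.
  if (errors.every(isWriteBlockException)) {
    return { _tag: 'Left', left: { type: 'target_index_had_write_block' } };
  }
  // Any other error is still treated as a failure.
  throw new Error(JSON.stringify(errors));
}
```

A batch with mixed errors (one write block plus something else) therefore still throws, which matches the "just in case" reasoning in the comment.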

pgayvallet · 2021-07-07T09:48:15Z

rfcs/text/0013_saved_object_migrations.md

+8. Reindex the source index into the new temporary index using a 'client-side' reindex, by reading batches of documents from the source, migrating them, and indexing them into the temp index.
+   1. Use `op_type=index` so that multiple instances can perform the reindex in parallel (last node running will override the documents, with no effect as the input data is the same)
+   2. Ignore `version_conflict_engine_exception` exceptions as they just mean that another node was indexing the same documents
+   3. If a `target_index_had_write_block` exception is encountered for all document of a batch, assume that another node already completed the temporary index reindex, and jump to the next step
+   4. If a document transform throws an exception, add the document to a failure list and continue trying to transform all other documents (without writing them to the temp index). If any failures occured, log the complete list of documents that failed to transform, then fail the migration.


We forgot to update the RFC for the client-side reindex. Fixed it, and added the new target_index_had_write_block special case.
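
The "jump to the next step" behaviour from the RFC text can be sketched as a small transition function. The state names and the `BulkIndexResult` shape below are hypothetical and chosen for illustration; they are not the names used by the actual migration model.

```typescript
// Hypothetical state names and response shape, for illustration only.
type ReindexState = 'INDEX_NEXT_BATCH' | 'STEP_AFTER_REINDEX';

type BulkIndexResult =
  | { _tag: 'Right'; right: 'bulk_index_succeeded' }
  | { _tag: 'Left'; left: { type: 'target_index_had_write_block' } };

function nextStateAfterBulkIndex(res: BulkIndexResult): ReindexState {
  if (res._tag === 'Right') {
    // Batch written: keep reading and indexing batches from the source.
    return 'INDEX_NEXT_BATCH';
  }
  // Every document of the batch hit the write block: another node already
  // finished the temp index, so skip ahead instead of failing.
  return 'STEP_AFTER_REINDEX';
}
```

Because the write block is modelled as a left value rather than a thrown error, the trailing node can continue with the remaining migration steps instead of stopping.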

pgayvallet · 2021-07-07T09:52:34Z

src/core/server/saved_objects/migrationsv2/actions/es_errors.ts

+export const isWriteBlockException = ({ type, reason }: EsErrorCause): boolean => {
+  return (
+    type === 'cluster_block_exception' &&
+    reason.match(/index \[.+] blocked by: \[FORBIDDEN\/8\/.+ \(api\)\]/) !== null
+  );
+};


Depending on the type of operation, the reason identifying a write block can vary, e.g.:

index [.kibana_dolly] blocked by: [FORBIDDEN/8/index write (api)]
index [.kibana_dolly] blocked by: [FORBIDDEN/8/moving to block index write (api)]

I extracted the function previously present in wait_for_reindex_task.ts and made it more generic to match any FORBIDDEN/8/*** (api) text.
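
A quick way to see the generic matching in action is to run the helper against the two reasons quoted above. The `EsErrorCause` interface here is a simplified assumption, and the last two inputs are made-up counter-examples rather than reasons copied from a real cluster.

```typescript
// Simplified stand-in for the real EsErrorCause type (assumption).
interface EsErrorCause {
  type: string;
  reason: string;
}

// Helper as added in es_errors.ts by this PR.
const isWriteBlockException = ({ type, reason }: EsErrorCause): boolean => {
  return (
    type === 'cluster_block_exception' &&
    reason.match(/index \[.+] blocked by: \[FORBIDDEN\/8\/.+ \(api\)\]/) !== null
  );
};

const writeBlockReasons = [
  'index [.kibana_dolly] blocked by: [FORBIDDEN/8/index write (api)]',
  'index [.kibana_dolly] blocked by: [FORBIDDEN/8/moving to block index write (api)]',
];

// Both variants of the write-block reason are recognised.
const matches = writeBlockReasons.map((reason) =>
  isWriteBlockException({ type: 'cluster_block_exception', reason })
);

// Illustrative counter-example: a different block id does not match.
const otherBlock = isWriteBlockException({
  type: 'cluster_block_exception',
  reason: 'index [.kibana_dolly] blocked by: [FORBIDDEN/5/some other block (api)]',
});

// Illustrative counter-example: right reason, wrong error type.
const otherType = isWriteBlockException({
  type: 'some_other_exception',
  reason: writeBlockReasons[0],
});

console.log(matches, otherBlock, otherType); // prints "[ true, true ] false false"
```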

pgayvallet · 2021-07-07T09:53:54Z

src/core/server/saved_objects/migrationsv2/actions/integration_tests/actions.test.ts

+      ).resolves.toMatchInlineSnapshot(`
+              Object {
+                "_tag": "Left",
+                "left": Object {
+                  "type": "target_index_had_write_block",
+                },
+              }
+            `);


Not a fan of snapshots to test resolved results (especially as the resolved object is small), but this is what is done in the other tests of the file, and I didn't want to fix them all in this PR.

pgayvallet · 2021-07-07T10:08:58Z

src/core/server/saved_objects/migrationsv2/actions/integration_tests/actions.test.ts

-    it('rejects if there are errors', async () => {
+    it('resolves left if there are write_block errors', async () => {


I also wanted to add an integration test asserting that it still rejects for other errors, but using a non-existent index surprisingly leads to a timeout (which is handled by catchRetryableEsClientErrors) instead of a more final error, and I couldn't find a way to trigger another kind of error from ES.

If anyone has an idea, I'll take it. Otherwise it's probably fine, as it's covered in unit tests anyway.

pgayvallet · 2021-07-07T10:13:26Z

src/core/server/saved_objects/migrationsv2/integration_tests/multiple_kibana_nodes.test.ts

+  it('migrates saved objects normally when multiple Kibana instances are started at the same time', async () => {
+    const setupContracts = await Promise.all([rootA.setup(), rootB.setup(), rootC.setup()]);
+
+    setupContracts.forEach((setup) => setup.savedObjects.registerType(fooType));
+
+    await startWithDelay([rootA, rootB, rootC], 0);


Added tests with delays of 0, 1, 5, and 20 seconds.

elasticmachine · 2021-07-07T11:50:56Z

Pinging @elastic/kibana-core (Team:Core)

mshustov · 2021-07-07T15:51:39Z

src/core/server/saved_objects/migrationsv2/integration_tests/multiple_kibana_nodes.test.ts

+  };
+
+  afterAll(async () => {
+    await new Promise((resolve) => setTimeout(resolve, 10000));


a mystical delay wandering through all the files 😄

Yeah... I guess I did as everyone else did: wondered what it was for and then, in doubt, still copied it 😄. Do you know if it's safe to delete those?

well, I'd test it and remove it from all the tests at once

mshustov · 2021-07-07T16:00:20Z

src/core/server/saved_objects/migrationsv2/actions/bulk_overwrite_transformed_documents.ts

-      );
+      const errors = (res.body.items ?? [])
+        .filter((item) => item.index?.error)
+        .map((item) => item.index!.error!)


wow! so many ?, ! 😅 I'd write it as

    .map((item) => item.index?.error)
    .filter(Boolean)
    .filter(({ type }) => type !== 'version_conflict_engine_exception');

Yeah, the problem is that TS is stupid with map/filter.

    .map((item) => item.index?.error)
    .filter(Boolean)

is not sufficient to have the | undefined part removed from the error type. The third line complains that ErrorContainer | undefined does not have a type property, which is why I was more or less forced to filter first and then force-cast using !
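
For reference, one common way to get the narrowing without `!` is a user-defined type guard passed to `filter`. This is only a sketch of that alternative, not what the PR uses; `isDefined` and the simplified `BulkItem` and `ErrorCause` shapes are assumptions introduced for the example.

```typescript
// Simplified, hypothetical shapes.
interface ErrorCause {
  type: string;
  reason: string;
}
interface BulkItem {
  index?: { error?: ErrorCause };
}

// User-defined type guard: tells the compiler that `undefined` is filtered out.
const isDefined = <T>(value: T | undefined): value is T => value !== undefined;

const bulkItems: BulkItem[] = [
  { index: {} },
  { index: { error: { type: 'version_conflict_engine_exception', reason: 'conflict' } } },
  { index: { error: { type: 'cluster_block_exception', reason: 'blocked' } } },
];

const remainingErrors = bulkItems
  .map((item) => item.index?.error) // (ErrorCause | undefined)[]
  .filter(isDefined) // ErrorCause[], thanks to the type guard
  .filter(({ type }) => type !== 'version_conflict_engine_exception');

console.log(remainingErrors.map((e) => e.type)); // prints "[ 'cluster_block_exception' ]"
```

The trade-off is one extra helper in exchange for dropping both non-null assertions.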

mshustov · 2021-07-07T16:02:51Z

src/core/server/saved_objects/migrationsv2/actions/es_errors.test.ts

+describe('isWriteBlockError', () => {
+  it('returns true for a `index write` cluster_block_exception', () => {
+    expect(
+      isWriteBlockException({


Maybe we can add an integration test instead? Since it's easy to reproduce for an ES instance.

There is already an integration test for the action using the function, but it doesn't hurt to do it for the helper itself.

mshustov · 2021-07-07T16:05:51Z

src/core/server/saved_objects/migrationsv2/actions/integration_tests/actions.test.ts

+      ).resolves.toMatchInlineSnapshot(`
+              Object {
+                "_tag": "Left",
+                "left": Object {
+                  "type": "target_index_had_write_block",
+                },
+              }
+            `);


not for all the tests

kibana/src/core/server/saved_objects/migrationsv2/actions/integration_tests/actions.test.ts

Lines 134 to 144 in 5cbb075

      expect(res.right).toEqual(
        expect.objectContaining({
          existing_index_with_docs: {
            aliases: {},
            mappings: expect.anything(),
            settings: expect.anything(),
          },
        })
      );
    });
  });

so I wouldn't use snapshots in this case.

kibanamachine · 2021-07-07T19:50:56Z

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

💚 Build #136798 succeeded 5cbb075
💚 Build #136678 succeeded 1e26c56
💔 Build #136651 failed 641e63e

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

* fix reindex race condition
* fix some IT tests
* fix reindex cause detection
* add integration test
* update RFC
* review comments
* add integration test for isWriteBlockException

# Conflicts:
#	rfcs/text/0013_saved_object_migrations.md

… (#104761)

* [SO Migration] fix reindex race on multi-instance mode (#104516)
* fix reindex race condition
* fix some IT tests
* fix reindex cause detection
* add integration test
* update RFC
* review comments
* add integration test for isWriteBlockException

# Conflicts:
#	rfcs/text/0013_saved_object_migrations.md

* fix dataset for 7.14

#104760)

* [SO Migration] fix reindex race on multi-instance mode (#104516)
* fix reindex race condition
* fix some IT tests
* fix reindex cause detection
* add integration test
* update RFC
* review comments
* add integration test for isWriteBlockException

# Conflicts:
#	rfcs/text/0013_saved_object_migrations.md

* fix dataset for 7.15

pgayvallet added 2 commits July 6, 2021 19:26

fix reindex race condition

a4ca9dc

Merge remote-tracking branch 'upstream/master' into kbn-99211-fix-reindex-race

641e63e

pgayvallet added 2 commits July 6, 2021 20:33

fix some IT tests

390adf8

fix reindex cause detection

1e26c56

pgayvallet mentioned this pull request Jul 7, 2021

[saved objects] Add migrations v2 integration test for scenario with multiple Kibana instances. #100171

Closed

pgayvallet added 2 commits July 7, 2021 11:23

add integration test

f9fa0fe

update RFC

5cbb075

pgayvallet commented Jul 7, 2021

pgayvallet marked this pull request as ready for review July 7, 2021 11:50

pgayvallet requested a review from a team as a code owner July 7, 2021 11:50

mshustov approved these changes Jul 7, 2021

spalger added v7.15.0 v8.0.0 labels Jul 7, 2021

pgayvallet added 2 commits July 7, 2021 18:42

review comments

c1bc869

add integration test for isWriteBlockException

7f3f35c

pgayvallet merged commit d64c3fb into elastic:master Jul 7, 2021

pgayvallet mentioned this pull request Jul 7, 2021

[7.x] [SO Migration] fix reindex race on multi-instance mode (#104516) #104760

Merged

pgayvallet mentioned this pull request Jul 7, 2021

[7.14] [SO Migration] fix reindex race on multi-instance mode (#104516) #104761

Merged
