RCORE-1900 Make "next launch" metadata actions multiprocess-safe #7576
Conversation
Force-pushed from fa29e67 to 350ca2c
Coveralls: Pull Request Test Coverage Report for Build thomas.goyne_436
Force-pushed from b6589ac to c28f550
Force-pushed from c28f550 to 3916ad1
Still reviewing, but posting a couple of questions/comments I have so far.
create_metadata_store(config, file_manager);
REQUIRE(File::exists(path));

store_1.reset();

create_metadata_store(config, file_manager);
REQUIRE(File::exists(path));

store_2.reset();
Just to verify: in this case, store_1 and/or store_2 are still open, so reopening the metadata store doesn't run the file actions. It's not until both are closed that the file actions are performed the next time the metadata store is opened.
From the PR description, it sounds like this is the current behavior, but may be changing in the future?
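A rough, self-contained illustration of the behavior being asked about, using a stand-in MetadataStore type rather than the real one (open_count plays the role of the cross-process "is anyone else using this file" check):

```cpp
#include <cassert>
#include <memory>

// Stand-in for the real metadata store, for illustration only.
struct MetadataStore {
    static int open_count;
    bool ran_file_actions = false;
    MetadataStore() {
        if (open_count++ == 0)          // only an exclusive opener
            ran_file_actions = true;    // performs pending "next launch" actions
    }
    ~MetadataStore() { --open_count; }
};
int MetadataStore::open_count = 0;

int main() {
    auto store_1 = std::make_unique<MetadataStore>();  // exclusive: actions run
    auto store_2 = std::make_unique<MetadataStore>();  // store_1 still open: deferred
    assert(!store_2->ran_file_actions);
    store_1.reset();
    store_2.reset();
    auto store_3 = std::make_unique<MetadataStore>();  // everyone closed: actions run again
    assert(store_3->ran_file_actions);
}
```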
data->access_token.token.clear();
data->refresh_token.token.clear();
store->update_user("user 3", *data);
store->update_user("user 3", [](auto& data) {
Should you also check that the current user is "user 3" prior to making these changes?
m_metadata_store->update_user(user->user_id(), [&](auto& data) {
    data.identities = std::move(identities);
    data.profile = UserProfile(get<BsonDocument>(profile_json, "data"));
    user->update_backing_data(data); // FIXME
What's the difference between updating the user in the metadata store and updating the user's backing data?
Is one a cached copy vs. a persistent copy saved to disk? Or is this the trigger for the client app to update its own user metadata storage? Or is there some other purpose?
I see the FIXME comment, so perhaps this will be updated in the future?
The metadata store is the source of truth for the user data, and User stores an in-memory copy of it. The intention is to move to a one-way data flow where the metadata store pushes updates to the active users, but for now everything that updates the store also manually updates the users too.
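A minimal sketch of that two-step flow, using stand-in types rather than the actual realm-core classes:

```cpp
#include <functional>
#include <memory>
#include <string>

struct UserData { std::string access_token, refresh_token; };

// Stand-in for the on-disk metadata store (the source of truth).
struct MetadataStore {
    UserData persisted;
    void update_user(const std::string&, const std::function<void(UserData&)>& fn) { fn(persisted); }
};

// Stand-in for User, which keeps an in-memory copy of its metadata.
struct User {
    UserData cached;
    void update_backing_data(const UserData& d) { cached = d; }
};

// Current two-step flow: mutate the store, then manually mirror the result
// into the in-memory copy (the part a one-way data flow would remove).
void set_tokens(MetadataStore& store, const std::shared_ptr<User>& user,
                std::string access, std::string refresh) {
    store.update_user("some user id", [&](UserData& data) {
        data.access_token = std::move(access);
        data.refresh_token = std::move(refresh);
        user->update_backing_data(data);
    });
}
```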
m_metadata_store->update_user(user->user_id(), *data);
user->update_backing_data(std::move(data));
}
m_metadata_store->update_user(user->user_id(), [&](auto& data) {
Is this function a no-op if the user isn't found?
Yes. Another process could delete the user concurrently with us updating the user, so it's not an error for the user to be missing.
I've added another commit which fixes the race condition mentioned in the PR description and adds more of a description of what's going on.
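A small sketch of the no-op semantics described above, with a plain map standing in for the store's user table:

```cpp
#include <functional>
#include <map>
#include <string>

struct UserData { std::string refresh_token; };

// Stand-in for the metadata store's user table.
std::map<std::string, UserData> users;

// If another process removed the user in the meantime, the callback is simply
// never invoked and no error is raised.
void update_user(const std::string& user_id, const std::function<void(UserData&)>& fn) {
    auto it = users.find(user_id);
    if (it != users.end())
        fn(it->second);
    // user not found: nothing to do, by design
}
```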
Force-pushed from 1b39976 to f696863
I guess there needs to be a guarantee that the sync agent flag on the realm file is not claimed, right? Is it always a single process that claims both? (And so if the metadata one is claimed, it must mean that the realm file is not in use?)
There's a single metadata Realm per app, and claiming the sync agent on it has no connection to claiming the sync agent on any of the Realms which the user opens. One process claiming the sync agent on the metadata Realm and performing launch actions while another process is opening Realms and being the sync agent on those is completely fine.
I thought that deleting a realm file while another process is using it would be an issue.
The point of all this is to avoid deleting a Realm file in use. The metadata Realm is always open at any point where a sync Realm is open, and once the sync agent on the metadata Realm is claimed it remains claimed until everyone has closed the metadata Realm. This means that if we are able to claim the sync agent on the metadata Realm, we know that at that precise moment in time there were no open Realms associated with the current app in any process. It is invalid to reopen a Realm after creating a file action for that path, so once we have a point in time where we know that there are no open Realms, the Realms associated with all file actions created before that point cannot be open. While this is happening another process may be opening other Realms associated with the same app, but they cannot be the same ones as we're deleting.
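Roughly, the launch-action flow looks like the sketch below. The type and helper names (MetadataRealm, try_claim_sync_agent, pending_file_actions, run_file_action) are invented for illustration and are not the actual realm-core API:

```cpp
#include <string>
#include <vector>

// Stand-in types and invented helper names, for illustration only.
struct FileAction { std::string realm_path; };
struct MetadataRealm {
    bool try_claim_sync_agent();                   // exclusive cross-process claim
    std::vector<FileAction> pending_file_actions();
};
void run_file_action(const FileAction&);

// If the claim succeeds, no Realm belonging to this app was open in any
// process at that instant, so every file action created before that instant
// targets a Realm which cannot currently be open.
void perform_launch_actions(MetadataRealm& metadata) {
    if (!metadata.try_claim_sync_agent())
        return;                                    // someone else has app Realms open
    for (auto& action : metadata.pending_file_actions())
        run_file_action(action);
    // The claim is held until every process has closed the metadata Realm.
}
```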
Force-pushed from f696863 to 3855df6
I updated
"Invalid" in that it's a misuse of the API which may lead to data loss without warning. The manual client reset API is basically a giant pile of sharp edges that is very difficult to use correctly. |
The changes look good. You should try rebasing, as the migration test failure was fixed a few weeks ago. There are some other failures though (e.g., SharedRealm: async writes).
Force-pushed from 8d22399 to 3241380
The libuv scheduler used in tests on non-Apple platforms does not support this.
…de the write transaction. This avoids overwriting data with stale values in multiprocess scenarios.
Force-pushed from 3241380 to b2a419b
The CI failures were a few new calls which called
File actions are performed on "next launch" since that's a time when we know it's safe to delete or move Realm files which may have been in use. This previously meant when the metadata store was next constructed, which was incorrect when multiple processes share a metadata file. Instead, the metadata store now tries to claim the sync agent flag on the metadata Realm file and only performs the launch actions if it succeeds.
There is still a race condition here: process 1 could initialize the metadata Realm, claim the flag, then get suspended. Process 2 initializes, opens a Realm, and then removes the user. Process 1 then resumes and proceeds to run file actions, deleting the Realm that process 2 has open. This is probably not actually a problem, but I'll continue to try to find a fix.
Keeping the metadata Realm open is required for this to work and for change notifications (a future change), and it greatly improves the performance of metadata operations. It slightly increases memory usage, but the metadata Realm is typically very small.
update_user() could previously overwrite data with stale values read earlier. Given where and when we called it, it was very difficult for this to actually happen, but it's better to fix the problem anyway.
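A simplified sketch of the difference, with a stand-in Store type and a mutex standing in for the Realm write transaction:

```cpp
#include <functional>
#include <mutex>
#include <string>

struct UserData { std::string access_token; };

// Stand-in store; the mutex plays the role of the write transaction.
struct Store {
    std::mutex write_mutex;
    UserData data;

    // Old shape (racy): the caller read the data earlier, mutated a copy, and
    // wrote the whole copy back, clobbering anything written in between.
    void update_user_from_copy(const UserData& copy) {
        std::lock_guard<std::mutex> lock(write_mutex);
        data = copy;
    }

    // New shape: the callback runs inside the "write transaction", so it
    // always sees and mutates the latest committed state.
    void update_user(const std::function<void(UserData&)>& fn) {
        std::lock_guard<std::mutex> lock(write_mutex);
        fn(data);
    }
};
```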
There are two groups of not-directly-related updates to the tests. There were some spots where we called assert_no_open_realms() at a point where the sync metadata Realm is now expected to be open; these now check that the specific file we expect to be closed is closed instead. To help track down some scheduler-related problems, I added an assertion that the libuv scheduler is only created on the main thread, as it doesn't support background threads. This revealed a bunch of places where we were creating it on background threads, which I then fixed.
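A minimal sketch of that kind of main-thread assertion (the actual check in realm-core may look different):

```cpp
#include <cassert>
#include <thread>

// Captured during static initialization, which runs on the main thread.
static const std::thread::id s_main_thread_id = std::this_thread::get_id();

struct UvMainLoopScheduler {   // stand-in name for the libuv-based scheduler
    UvMainLoopScheduler() {
        // The libuv scheduler only drives the main run loop, so constructing
        // it on a background thread is a programming error.
        assert(std::this_thread::get_id() == s_main_thread_id);
    }
};
```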