RCORE-2063 Fix some client resets potentially failing with AutoClientResetFailed if a new client reset condition occurred before the first one completed #7542

Merged: michael-wb merged 9 commits into master from mwb/fix-migration-test on Jun 4, 2024

Conversation

michael-wb
Contributor

@michael-wb michael-wb commented Apr 3, 2024

What, How & Why?

If a new client reset occurs with a different action (e.g. a rollback to PBS), the original client reset tracking info (e.g. for the migration to FLX) is removed and the new client reset is allowed to continue. (Note: the client reset action and error storage was originally included in this PR, but was moved to a new PR, #7649.)

Added extra sync client hook events to capture different steps during a client reset. This allows the "Test client migration and rollback with recovery" test to pause the client reset and roll back to PBS while the FLX migration client reset is in progress, effectively reproducing the condition that was intermittently failing in the past. This test was previously taking 90+ secs to complete, and about a third of that time was spent waiting for the reconnect timer to expire for the sync session that was active during migration and rollback. Added a handle_reconnect() call after migration/rollback to cancel this timer, shaving around 30 secs off this test.
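
As a rough illustration of how step-level hook events let a test inject work at a precise point in a reset, here is a minimal single-threaded sketch. The names `ResetStep`, `StepHook`, `step_name` and `run_reset` are hypothetical and are not the realm-core `SyncClientHookEvent` API; the real hook runs inside the sync client and is considerably more involved.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical step names for a client reset; NOT the realm-core enum.
enum class ResetStep { Started, FreshRealmDownloaded, MergeComplete };

// Hypothetical hook type: called once after each step is recorded.
using StepHook = std::function<void(ResetStep)>;

// Human-readable name for a step, used only for the log below.
const char* step_name(ResetStep step)
{
    switch (step) {
        case ResetStep::Started:
            return "Started";
        case ResetStep::FreshRealmDownloaded:
            return "FreshRealmDownloaded";
        case ResetStep::MergeComplete:
            return "MergeComplete";
    }
    return "Unknown";
}

// Walks through a pretend client reset, recording each step in `log` and then
// calling the hook so a test can act (e.g. trigger a rollback) at that point.
void run_reset(std::vector<std::string>& log, const StepHook& hook)
{
    const ResetStep steps[] = {ResetStep::Started, ResetStep::FreshRealmDownloaded,
                               ResetStep::MergeComplete};
    for (ResetStep step : steps) {
        log.push_back(step_name(step));
        if (hook)
            hook(step);
    }
}
```

A test would pass a hook that reacts only to the step it cares about (for instance, recording a rollback when it sees `FreshRealmDownloaded`), which makes the interleaving deterministic instead of timing-dependent.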

Fixes #7539

☑️ ToDos

  • 📝 Changelog update
  • 🚦 Tests (or not relevant)
  • [ ] C-API, if public C++ API changed
  • [ ] bindgen/spec.yml, if public C++ API changed

@michael-wb michael-wb self-assigned this Apr 3, 2024
@cla-bot cla-bot bot added the cla: yes label Apr 3, 2024

coveralls-official bot commented Apr 3, 2024

Pull Request Test Coverage Report for Build michael.wilkersonbarker_1147

Details

  • 61 of 66 (92.42%) changed or added relevant lines in 5 files are covered.
  • 84 unchanged lines in 12 files lost coverage.
  • Overall coverage decreased (-0.008%) to 90.835%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| src/realm/sync/noinst/client_reset.cpp | 34 | 39 | 87.18% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| src/realm/array_blobs_big.cpp | 2 | 98.58% |
| src/realm/array_key.cpp | 2 | 96.55% |
| src/realm/cluster.cpp | 2 | 75.6% |
| src/realm/sync/client.cpp | 3 | 91.78% |
| src/realm/sync/instruction_replication.cpp | 3 | 91.48% |
| src/realm/sync/noinst/server/server_history.cpp | 4 | 63.51% |
| test/sync_fixtures.hpp | 7 | 75.56% |
| src/realm/index_string.cpp | 8 | 84.63% |
| src/realm/sync/noinst/client_impl_base.cpp | 8 | 82.47% |
| src/realm/sync/instruction_applier.cpp | 12 | 68.51% |

Totals
Change from base Build 2376: -0.008%
Covered Lines: 214628
Relevant Lines: 236283

💛 - Coveralls

@danieltabacaru
Collaborator

Don't we want to fix the real issue too (clearing the client reset tracking in case of a rollback)?

@michael-wb
Contributor Author

Oh, yes - the test was originally rolling back during client reset on purpose to see what the issues were, and the client reset tracking entry was being left over, causing the original failure.
I'll fix this, too.

@michael-wb
Contributor Author

michael-wb commented Apr 19, 2024

Added client reset error and action tracking to the client reset metadata storage (producing v2). If the new client reset action differs from the current client reset action, the older client reset tracking is removed in favor of the new client reset. Also updated the "Test client migration and rollback with recovery" test to reflect this behavior:

Realm.Sync.Client.Reset - Connection[2]: Session[12]: Found a previous recoverable client reset of type: 'Recover' for 'MigrateToFLX' at: 2024-04-19 14:04:15.848902000
Realm.Sync.Client.Reset - Connection[2]: Session[12]: Originating client reset error: WrongSyncType: Client connected using partition-based sync when app has been migrated to flexible sync
Realm.Sync.Client.Reset - Connection[2]: Session[12]: New recoverable client reset of type: 'Recover' for 'RevertToPBS' is incompatible - clearing previous reset
Realm.Sync.Client.Reset - Connection[2]: Session[12]: New client reset error: WrongSyncType: Client connected using flexible sync when app has been reverted back to partition-based sync
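
The decision these log lines describe could be sketched roughly as follows. This is an illustration only: `ResetAction`, `PendingReset`, `Decision`, `evaluate` and `track` are hypothetical names, not the realm-core types or the actual pending reset store, and treating a repeated reset with the same action as a failed cycle is an assumption made for this sketch.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-ins; these are NOT the realm-core types.
enum class ResetAction { MigrateToFLX, RevertToPBS };

// What is remembered about a client reset that has started but not completed.
struct PendingReset {
    ResetAction action; // what the reset is for
    std::string error;  // originating error text, kept for logging
};

enum class Decision { StartNew, ReplacePrevious, RejectCycle };

// Decide how to treat an incoming client reset given the tracked entry, if any.
Decision evaluate(const std::optional<PendingReset>& tracked, const PendingReset& incoming)
{
    if (!tracked)
        return Decision::StartNew; // nothing in progress
    if (tracked->action != incoming.action)
        return Decision::ReplacePrevious; // incompatible action: clear the previous entry
    return Decision::RejectCycle; // assumption: same action again means the earlier attempt failed
}

// Update the tracked entry; returns true if the incoming reset may proceed.
bool track(std::optional<PendingReset>& tracked, const PendingReset& incoming)
{
    if (evaluate(tracked, incoming) == Decision::RejectCycle)
        return false;
    tracked = incoming; // start fresh or replace the stale entry
    return true;
}
```

With this shape, a tracked `MigrateToFLX` entry followed by a `RevertToPBS` reset is replaced rather than reported as a failure, which mirrors the "is incompatible - clearing previous reset" message above.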

@michael-wb michael-wb linked an issue Apr 19, 2024 that may be closed by this pull request
@michael-wb michael-wb marked this pull request as ready for review April 19, 2024 15:47
@jbreams
Contributor

jbreams commented Apr 29, 2024

Can we break this PR up into three separate PRs? 1) adding the extra info to the client reset tracker to show what the original error was 2) changing it so that if a new client reset of a different type starts it will be allowed to continue 3) any changes to fix up the test?

@michael-wb
Contributor Author

michael-wb commented Apr 29, 2024

> Can we break this PR up into three separate PRs? 1) adding the extra info to the client reset tracker to show what the original error was 2) changing it so that if a new client reset of a different type starts it will be allowed to continue 3) any changes to fix up the test?

Sure - I can do that, although it may be better to create two PRs - one for the additions and another that allows a different type to continue + updates to migration test

@michael-wb michael-wb marked this pull request as draft May 30, 2024 18:52
michael-wb pushed a commit that referenced this pull request May 31, 2024
…et cycles (#7649)

* Broke out the client reset error and action storage from PR #7542
* Removed client reset recovery_allowed flag and other updates from review
* Updated pending_client_reset store to use the schema metadata tables
* Fixed pausing a session does not hold the DB open test
* Moved ownership of reset store to SessionWrapper
* Fixed migration test crash - need to save client reset error in handle fresh realm downloaded
* Updated PendingResetStore to be static functions instead of an initialized object; updates from review
* Make ClientReset::error no longer optional; fixed subscriptions tests
* updated changelog after release
* updates from review
@michael-wb michael-wb marked this pull request as ready for review May 31, 2024 04:08
CHANGELOG.md Outdated
@@ -17,7 +17,7 @@
-----------

### Internals
-* None.
+* Fix client reset failure during sync migration due to previous incomplete client reset. ([PR #7542](https://github.com/realm/realm-core/pull/7542), since v13.11.0)
Contributor

I think this actually belongs under the Fixed section rather than the Internals section.

Contributor

Also would be helpful to describe how the client reset failure would manifest. Like what would the error message be for users?

Contributor Author

Sure

Collaborator

@danieltabacaru danieltabacaru left a comment

Looks good. Should we rename the PR to mention the bug being fixed (and not the test)?

@@ -132,6 +132,11 @@ enum class SyncClientHookEvent {
SessionActivating,
SessionSuspended,
BindMessageSent,
IdentMessageReceived,
Collaborator

we don't seem to use (yet) most of these events, so maybe we should add them when they're needed

Contributor Author

Removed the unused events

@michael-wb michael-wb changed the title from "RCORE-2063 Fix flakey 'Test client migration and rollback with recovery' test" to "RCORE-2063 Fix some client resets potentially failing with AutoClientResetFailed if a new client reset condition occurred before the first one completed" Jun 4, 2024
@michael-wb michael-wb merged commit ab1361f into master Jun 4, 2024
38 checks passed
@michael-wb michael-wb deleted the mwb/fix-migration-test branch June 4, 2024 14:21
@github-actions github-actions bot mentioned this pull request Jun 7, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 4, 2024

Successfully merging this pull request may close these issues.

Fix flakey 'Test client migration and rollback with recovery' test
3 participants