Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RCORE-1386 Track client reset reason in table that detects client reset cycles #7649

Merged
merged 32 commits into from
May 31, 2024

Conversation

michael-wb
Copy link
Contributor

What, How & Why?

This PR with the client reset error and action storage was broken out from PR #7542
These changes add the originating error Status and server requests action values from the SessionErrorInfo into the pending client reset tracking object that is used to provide client reset cycle detection and identify when a client reset is in progress so it can be closed out once the realm has been sync'ed with the server.
Fixes #6154

☑️ ToDos

  • 📝 Changelog update
  • 🚦 Tests (or not relevant)
  • [ ] C-API, if public C++ API changed
  • [ ] bindgen/spec.yml, if public C++ API changed

@michael-wb michael-wb self-assigned this Apr 30, 2024
@cla-bot cla-bot bot added the cla: yes label Apr 30, 2024
@michael-wb michael-wb requested review from danieltabacaru, ironage and jbreams and removed request for ironage April 30, 2024 00:28
Copy link

coveralls-official bot commented Apr 30, 2024

Pull Request Test Coverage Report for Build michael.wilkersonbarker_1128

Details

  • 598 of 636 (94.03%) changed or added relevant lines in 17 files are covered.
  • 120 unchanged lines in 17 files lost coverage.
  • Overall coverage decreased (-0.02%) to 90.825%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/realm/sync/noinst/client_reset.cpp 19 20 95.0%
src/realm/sync/noinst/sync_metadata_schema.cpp 41 42 97.62%
test/object-store/util/sync/sync_test_utils.cpp 5 6 83.33%
src/realm/sync/client.cpp 22 32 68.75%
src/realm/db.hpp 0 12 0.0%
src/realm/sync/noinst/pending_reset_store.cpp 190 203 93.6%
Files with Coverage Reduction New Missed Lines %
src/realm/array_string.cpp 1 87.23%
src/realm/sort_descriptor.cpp 1 94.06%
src/realm/sync/noinst/client_reset.cpp 1 94.38%
src/realm/sync/noinst/server/server_history.cpp 1 63.7%
src/realm/uuid.cpp 1 98.48%
test/test_dictionary.cpp 1 99.83%
test/test_query2.cpp 1 98.73%
src/realm/sync/network/http.hpp 2 82.27%
src/realm/sync/noinst/server/server.cpp 2 73.93%
src/realm/table_view.cpp 2 92.99%
Totals Coverage Status
Change from base Build 2359: -0.02%
Covered Lines: 214684
Relevant Lines: 236371

💛 - Coveralls

test/test_client_reset.cpp Outdated Show resolved Hide resolved
src/realm/sync/noinst/client_reset.cpp Outdated Show resolved Hide resolved
src/realm/sync/noinst/client_reset.cpp Outdated Show resolved Hide resolved
src/realm/sync/noinst/client_reset.cpp Outdated Show resolved Hide resolved
@michael-wb michael-wb requested a review from ironage May 4, 2024 03:15
src/realm/object-store/sync/sync_session.hpp Outdated Show resolved Hide resolved
src/realm/object-store/sync/sync_session.hpp Outdated Show resolved Hide resolved
src/realm/sync/client.cpp Outdated Show resolved Hide resolved
@@ -62,11 +62,12 @@ class SyncSocketProvider;
struct ClientReset {
realm::ClientResyncMode mode;
DBRef fresh_copy;
bool recovery_is_allowed = true;
bool recovery_is_allowed;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just capture the original SessionErrorInfo that caused the client reset here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since SessionErrorInfo can't be default created and there are only a few parts needed, I don't think we need to. I did remove the recovery_is_allowed since it is no longer needed.

src/realm/sync/noinst/client_reset.cpp Outdated Show resolved Hide resolved
auto mode = config(&SyncConfig::client_resync_mode);
if (mode == ClientResyncMode::Recover) {
handle_fresh_realm_downloaded(
nullptr,
{ErrorCodes::RuntimeError,
"A client reset is required but the server does not permit recovery for this client"},
server_requests_action);
error_info);
return;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just capture error_info in a member variable after this line rather than passing it as an argument to handle_fresh_realm_downloaded below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is now being saved to a member variable in download_fresh_realm() and no longer being passed to handle_fresh_realm_downloaded()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my earlier comment was still in "Pending"...

I tried that, but saving it in download_fresh_realm() is too soon and caused issues if the SyncSession was paused/resumed while a client reset is in progress - the error info was being consumed when the main Sync Session was manually resumed instead of when it was resumed after the fresh realm was downloaded.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not accurate anymore. It's still being passed to handle_fresh_realm_downloaded and that's where m_client_reset_error is updated.

src/realm/sync/noinst/client_reset.cpp Outdated Show resolved Hide resolved
TableRef table = get_or_create_client_reset_table(wt);
REALM_ASSERT(table);
// Even if the table is being updated to V2, an existing entry will still throw an exception
size_t table_size = table->size();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 583/584 can just be if(table->size() > 0). it doesn't look like we use the table_size variable anywhere else.

src/realm/sync/client.cpp Outdated Show resolved Hide resolved
else if (type == 1) {
pending.type = ClientResyncMode::Recover;
// Version 2 columns
if (pending.version >= 2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to use the sync metadata schema machinery here to create all these column/look up these colkeys for us?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Migrated the pending reset function to use the sync metadata schema and made it a "store" object that is owned by SyncSession.

CHANGELOG.md Show resolved Hide resolved
@@ -62,11 +62,11 @@ class SyncSocketProvider;
struct ClientReset {
realm::ClientResyncMode mode;
DBRef fresh_copy;
bool recovery_is_allowed = true;
sync::ProtocolErrorInfo::Action action = sync::ProtocolErrorInfo::Action::ClientReset;
std::optional<Status> error;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't there always an error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is - I had made it optional since Status doesn't have a default constructor and ClientReset is default constructed and then populated throughout the code.

I could add an appropriate constructor and remove the optional if you prefer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we just store the reason? I think that's all we need.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be helpful to have the error code as well and it shouldn't be that expensive to store...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right. it does feel a bit wrong to have the error optional though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the ClientReset structure so error is no longer an optional and updated the usages to use the initializers.

src/realm/sync/noinst/client_reset_operation.cpp Outdated Show resolved Hide resolved
bool load_schema(const TransactionRef& rd_tr);
void create_schema(const TransactionRef& rd_tr);

std::optional<PendingReset> read_legacy_pending_reset(const TransactionRef& rd_tr);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of having this, isn't it easier to migrate the v1 data to v2 and use it as if the metadata was always at v2? We do something similar with the SubscriptionStore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose a more "lazy" approach to writing to the pending reset store and not write to the realm file when the session is started, but wait until the pending reset info is written or cleared.
In the original code, there is an instance where the original pending reset code was provided a frozen transaction, so I defaulted the reads to use frozen transactions as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. It's just that we could do the same with less code I think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I streamlined the legacy pending reset code a bit - it had extra logic for future schema updates, but took that out.

REALM_ASSERT(reset_store);
DB& db_remote = *reset_config.fresh_copy;
auto actual_mode =
reset_precheck_guard(reset_store, reset_config.mode, reset_config.action, reset_config.error, logger);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should reset the guard part of the same write transaction (as before), so everything rolls back if it fails.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right - I see what you mean. I'll update it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns there there is one commit transaction when track_reset() is called, so even if it rolls back, the pending reset info is still stored. I updated the PendingResetStore to be a set of static functions (again) with the schema metadata info used under the hood. This allows better control over the transactions and when they are committed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed, I just followed the code path and that's the case. I think there is still a difference because the write lock was not released before.

src/realm/sync/client.cpp Outdated Show resolved Hide resolved
src/realm/sync/noinst/pending_reset_store.cpp Show resolved Hide resolved
m_sess->logger.info("Tracking pending client reset of type \"%1\" from %2", pending_reset->type,
pending_reset->time);
std::optional<PendingReset> pending_reset;
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it doesn't seem to need its own scope

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we return on line 1991, I don't think we need to wrap pending_reset in a std::optional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I moved the PendingResetStore back to being a set of static functions, I ended up keeping the extra scope for the read transaction.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not reusing the old code?

auto pending_reset = _impl::client_reset::has_pending_reset(m_db->start_frozen());
if (!pending_reset)
     return;

// Use the original sync config, not the updated one from the migration store
session_config.client_reset_config =
make_client_reset_config(m_config, m_original_sync_config, std::move(m_client_reset_fresh_copy),
allowed_to_recover, m_previous_schema_version.has_value());
std::move(*m_client_reset_error), m_previous_schema_version.has_value());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you make a copy here anyway, so you could just pass an lvalue

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you could also use auto client_reset_error = std::exchange(m_client_reset_error, std::nullopt); so you don't need to reset it later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rvalue is being passed into make_client_reset_config() and then moved into the Config::ClientReset structure returned by that function. Added the std::exchange() call to remove the reset()

@@ -513,8 +512,9 @@ void SyncSession::download_fresh_realm(sync::ProtocolErrorInfo::Action server_re
auto self = shared_from_this();
using SubscriptionState = sync::SubscriptionSet::State;
fresh_sub.get_state_change_notification(SubscriptionState::Complete)
.then([=](SubscriptionState) -> util::Future<sync::SubscriptionSet> {
if (server_requests_action != sync::ProtocolErrorInfo::Action::MigrateToFLX) {
.then([self, error_info, fresh_sync_session, fresh_sub_store,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've noticed that throughout this PR you're changing lambdas that implicitly capture by value to explicitly list their captures. is this fixing a bug or just making the code style more to your liking?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically in download_fresh_realm(), I was thinking listing the specific value used would reduce the overhead of capture by values so the entire function scope is not copied. I can revert back to the original capture list if the compiler does this optimization for us already..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a default capture of [=] { lambda body } captures any referenced variables by copying them, but doesn't copy unreferenced variables. The old syntax and new syntax should generate the exact same instructions. See this godbolt if you want to double-check https://godbolt.org/z/q7KP41e8K.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That isn't how [=] works. It doesn't capture the function scope; it captures by value the specific things referenced inside the lambda. This isn't a compiler optimization but rather is a pretty important part of the semantics of lambda capturing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK - thanks for the clarification!

}
});
}
fresh_sync_session->revive_if_needed();
}

void SyncSession::handle_fresh_realm_downloaded(DBRef db, Status status,
sync::ProtocolErrorInfo::Action server_requests_action,
std::optional<sync::SessionErrorInfo> client_reset_error,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like we always either pass a Status::OK() and a client_reset_error or a not-okay status. Should we make this a StatusWith<sync::SessionErrorInfo>?

m_sess->logger.info("Tracking pending client reset of type \"%1\" from %2", pending_reset->type,
pending_reset->time);
std::optional<PendingReset> pending_reset;
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we return on line 1991, I don't think we need to wrap pending_reset in a std::optional.


class PendingResetStore {
public:
~PendingResetStore() = default;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to explicitly define the destructor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember why I defined that? I don't think it's needed.

}

// Closing the realm should leave one ref for the SyncSession and one for the local dbref.
REQUIRE_THAT(
[&] {
return dbref.use_count() < 4;
logger->trace("DBRef PAUSED use count: %1", dbref.use_count());
return dbref.use_count() < 5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did this work before if we weren't counting the MigrationStore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are actually 3 refs open, including the MigrationStore - it was increased by 1 when I had originally created the PendingResetStore object...

Realm - DBRef ACTIVE use count: 5
Realm - DBRef PAUSING called use count: 5
Realm - DBRef PAUSED use count: 3
Realm - DBRef PAUSED use count: 3
Realm - DBRef TEARDOWN use count: 1
Realm - DBRef TEARDOWN use count: 1

I updated the count and the comments to indicate the DB references

@michael-wb michael-wb requested a review from jbreams May 28, 2024 20:42
m_sess->logger.info("Tracking pending client reset of type \"%1\" from %2", pending_reset->type,
pending_reset->time);
std::optional<PendingReset> pending_reset;
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not reusing the old code?

auto pending_reset = _impl::client_reset::has_pending_reset(m_db->start_frozen());
if (!pending_reset)
     return;

if (status == ErrorCodes::OperationAborted) {
return;
}
auto& logger = self->m_sess->logger;
if (!status.is_ok()) {
logger.error("Error while tracking client reset acknowledgement: %1", status);
logger.error(util::LogCategory::reset, "Error while tracking client reset acknowledgement: %1", status);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

class PendingResetStore {
public:
// Store the pending reset tracking information - it is an error if the tracking info already
// esists in the store
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exists

Comment on lines +59 to +62
static int64_t from_reset_action(PendingReset::Action action);
static PendingReset::Action to_reset_action(int64_t action);
static ClientResyncMode to_resync_mode(int64_t mode);
static int64_t from_resync_mode(ClientResyncMode mode);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these methods should be private.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left them public so I can create tests for them without needing to use a class friend.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. shouldn't they be tested through the methods invoking them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've spread out the values to try to validate them during the other tests, but the error cases are harder to test.

ColKey mode_col = table->get_column_key(s_v1_reset_mode_col_name);
Obj reset_entry = *table->begin();

if (version_col && static_cast<int>(reset_entry.get<int64_t>(version_col)) == 1) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to cast it. you can probably retrieve an int directly or compare the value with 1LL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, ints aren't supported, but the comparison to 1LL works great

"int realm::Obj::get<int>(realm::ColKey) const", referenced from:
      realm::sync::PendingResetStore::read_legacy_pending_reset(std::__1::shared_ptr<realm::Transaction> const&) in librealm-sync-dbg.a[12](pending_reset_store.cpp.o)
ld: symbol(s) not found for architecture arm64

else {
tr->promote_to_write();
// Create the metadata schema
create_sync_metadata_schema(tr, &internal_tables);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why swapping the order? Is it important? also, why is a new SyncMetadataSchemaVersions needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had originally swapped them to ensure the schema was written before the schema version, but, since they are in the same commit, it really doesn't matter. I also made some updates around this area so the initial creation of the schema versions table is in its own commit.

// Writing is disabled
return false; // Either table is not initialized or version does not exist
}
else { // writable
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for else

bool operator==(const sync::PendingReset& lhs, const sync::PendingReset& rhs);
bool operator==(const sync::PendingReset& lhs, const PendingReset::Action& action);

class PendingResetStore {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it a bit odd that all methods are static in this class, and you create temporary instances used in some of these methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class instances are just for holding the table/column key variables after the schema metadata info is loaded. I had gone back and forth with this and think the static functions provides the caller control over the transactions and when the changes are committed (e.g. remove old and add new pending reset info in a single transaction, like was being done before). Although it did require some updates to the SchemaVersion classes and how they handle transactions.

Plus, this class is only used in a very small window and it seemed excessive to keep an object (and dbref) around during every SessionWrapper lifetime.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right. I wasn't suggesting using a store object, but maybe the same old static functions if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I could pull the static functions outside the class and retain the old namespace for them... Not sure if it's necessary, since this way everything is contained in a single unit/class and it incorporates the internal object for holding the schema info.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, I don't think its worth refactoring this again just to use the old function names.

Michael Wilkerson-Barker added 2 commits May 29, 2024 12:18
@@ -2,6 +2,7 @@

### Enhancements
* <New feature description> (PR [#????](https://github.com/realm/realm-core/pull/????))
* Report the originating error that caused a client reset to occur. ([#6154](https://github.com/realm/realm-core/issues/6154))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we actually surfacing the error? could not find where it's done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example:

Realm.Sync.Client.Reset - Connection[1] Session[5]: Tracking pending client reset of type: 'Recover' at: 2024-05-31 00:38:03.146563000 for error: SyncClientResetRequired: Bad client file identifier (IDENT)
. . .
Realm.Sync.Client.Reset - Connection[1] Session[5]: Server has acknowledged pending client reset of type: 'Recover' at: 2024-05-31 00:38:03.146563000 for error: SyncClientResetRequired: Bad client file identifier (IDENT)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I meant is surfacing it through a SyncError in handle_error so the users know what caused the initial client reset.

@michael-wb michael-wb merged commit 25212bc into master May 31, 2024
37 checks passed
@michael-wb michael-wb deleted the mwb/add-extra-client-reset-fields branch May 31, 2024 02:09
@github-actions github-actions bot mentioned this pull request May 31, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 30, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Track client reset reason in table that detects client reset cycles
5 participants