Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[delta-XDS] delta-XDS state tracking is broken when there is an exception thrown when watchmap loading the config. #20790

Closed
stevenzzzz opened this issue Apr 12, 2022 · 5 comments · Fixed by #23214 or #23343
Assignees
Labels
area/xds bug no stalebot Disables stalebot from closing an issue

Comments

@stevenzzzz
Copy link
Contributor

Title: state tracking is broken when there is a partial accept

Description:
if there is a partial accept in Delta-XDS, i.e. route-table #50's loading throws an exception, out of 100 routeConfigurations in the batch, the RDS state is updated with the other 50 route tables not yet updated/inserted.

Even reconnection won't fix this issue, as the client believes it has the 50 not loaded route tables with some version already.

  {
    const auto scoped_update = ttl_.scopedTtlUpdate();
    if (requested_resource_state_.contains(Wildcard)) {
      for (const auto& resource : message.resources()) {
        addResourceStateFromServer(resource);
      }
    } else {
      // We are not subscribed to wildcard, so we only take resources that we explicitly requested
      // and ignore the others.
      for (const auto& resource : message.resources()) {
        if (requested_resource_state_.contains(resource.name())) {
          addResourceStateFromServer(resource);
        }
      }
    }
  }

  watch_map_.onConfigUpdate(non_heartbeat_resources, message.removed_resources(),
                            message.system_version_info());

Repro steps:
See description.

@stevenzzzz stevenzzzz added bug triage Issue requires triage labels Apr 12, 2022
@stevenzzzz
Copy link
Contributor Author

/cc @yanavlasov
Please add a no-stale-bot to this issue.

@stevenzzzz
Copy link
Contributor Author

One way to resolve this would be tracking state of resources using the actual per-resource subscriptions, with the state dictionary taking resource_name->reference to the source of truth. Requires quite some plumbing tho.

@adisuissa
Copy link
Contributor

cc @adisuissa

@adisuissa adisuissa added area/xds no stalebot Disables stalebot from closing an issue and removed triage Issue requires triage labels Apr 12, 2022
@stevenzzzz stevenzzzz changed the title [delta-XDS] delta-XDS state tracking is broken when there is a partial accept [delta-XDS] delta-XDS state tracking is broken when there is an exception thrown when watchmap loading the config. Sep 14, 2022
@adisuissa
Copy link
Contributor

Envoy should probably delay the update of the tracked resource version (for non-wildcard resource and for wildcard resource) until after the call to the watch_map_.onConfigUpdate() method.

@adisuissa
Copy link
Contributor

Envoy should probably delay the update of the tracked resource version (for non-wildcard resource and for wildcard resource) until after the call to the watch_map_.onConfigUpdate() method.

Note that this impacts a reconnect scenario, as Envoy will send an initial_resource_version with resources and versions that may have been NACK'ed.

@adisuissa adisuissa self-assigned this Sep 22, 2022
htuch pushed a commit that referenced this issue Sep 30, 2022
…23214)

Fixes delta-xDS protocol resource state tracking.
Prior to this PR, the implementation updated the state of a resource (tracked version, TTL), even if the fetched config was NACKed. This PR changes it to only update the tracked state after a resource was successfully ingested, and an ACK will be sent.
Note that this PR impacts the TTL of a resource, as now TTL timer starts after the resource was ingested rather than before.

Risk Level: Medium for delta-xDS usage.
Testing: Added unit tests.
Docs Changes: N/A.
Release Notes: Added release note.
Platform Specific Features: N/A.
Runtime guard: Can be reverted by setting envoy.reloadable_features.delta_xds_subscription_state_tracking_fix to false.
Fixes #20790

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
htuch pushed a commit that referenced this issue Oct 4, 2022
Fixes delta-xDS protocol resource state tracking.
Prior to this PR, the implementation updated the state of a resource (tracked version, TTL), even if the fetched config was NACKed. This PR changes it to only update the tracked state after a resource was successfully ingested, and an ACK will be sent.
Note that this PR impacts the TTL of a resource, as now TTL timer starts after the resource was ingested rather than before.
This PR is essentially #23214 (which was reverted in #23335) with more coverage.

Risk Level: Medium for delta-xDS usage.
Testing: Added unit tests.
Docs Changes: N/A.
Release Notes: Added release note.
Platform Specific Features: N/A.
Runtime guard: Can be reverted by setting envoy.reloadable_features.delta_xds_subscription_state_tracking_fix to false.
Fixes #20790

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/xds bug no stalebot Disables stalebot from closing an issue
Projects
None yet
2 participants