Track fetch exceptions for shard follow tasks #33047

Merged
merged 7 commits into from
Aug 24, 2018

jasontedor
Member

This commit adds tracking and reporting for fetch exceptions. We track fetch exceptions per fetch, keeping at most as many entries as the maximum number of concurrent fetches. With each failing fetch, we associate the from sequence number with the exception that caused the fetch to fail. We report these in the CCR stats endpoint, and add some tests for this tracking.

Relates #30086

@jasontedor jasontedor added review :Distributed/CCR Issues around the Cross Cluster State Replication features labels Aug 22, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@jasontedor
Member Author

jasontedor commented Aug 22, 2018

Here is what a version of the response from this API looks like:

{
  "j" : [
    {
      "shard_id" : 0,
      "leader_global_checkpoint" : 1023,
      "leader_max_seq_no" : 1023,
      "follower_global_checkpoint" : 1023,
      "follower_max_seq_no" : 1023,
      "last_requested_seq_no" : 1023,
      "number_of_concurrent_reads" : 1,
      "number_of_concurrent_writes" : 0,
      "number_of_queued_writes" : 0,
      "index_metadata_version" : 7,
      "total_fetch_time_millis" : 17,
      "number_of_successful_fetches" : 1,
      "number_of_failed_fetches" : 9,
      "operations_received" : 0,
      "total_transferred_bytes" : 0,
      "total_index_time_millis" : 0,
      "number_of_successful_bulk_operations" : 0,
      "number_of_failed_bulk_operations" : 0,
      "number_of_operations_indexed" : 0,
      "fetch_errors" : [
        {
          "from_seq_no" : 1024,
          "reason" : {
            "type" : "exception",
            "reason" : "NoShardAvailableActionException[No shard available for [Request={fromSeqNo=1024, maxOperationsCount=1024, shardId=[i][0], maxOperationsSizeInBytes=9223372036854775807}]]; nested: RemoteTransportException[[p_hIWo8][127.0.0.1:9300][indices:data/read/xpack/ccr/shard_changes[s]]]; nested: IndexNotFoundException[no such index];",
            "caused_by" : {
              "type" : "no_shard_available_action_exception",
              "reason" : "No shard available for [Request={fromSeqNo=1024, maxOperationsCount=1024, shardId=[i][0], maxOperationsSizeInBytes=9223372036854775807}]",
              "caused_by" : {
                "type" : "index_not_found_exception",
                "reason" : "no such index",
                "index_uuid" : "TfDzmXK_RmaQvyzop98mDA",
                "index" : "i"
              }
            }
          }
        }
      ]
    }
  ]
}

Note that I have added an Object#toString implementation in this PR for ShardChangesAction$Request so that the formatted cause in the output above is more useful.
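For illustration, a `toString` that produces the request format seen in the `reason` strings above could look roughly like the following. This is a minimal sketch, not the code from the PR: the class name `ShardChangesRequestSketch` is hypothetical, the field types are assumptions, and the shard ID is held as a plain `String` to keep the example self-contained.

```java
// Hypothetical, simplified stand-in for ShardChangesAction.Request.
// Field names mirror the keys visible in the stats output; types are assumptions.
final class ShardChangesRequestSketch {
    private final long fromSeqNo;
    private final int maxOperationsCount;
    private final String shardId; // assumption: the real class would hold a ShardId object
    private final long maxOperationsSizeInBytes;

    ShardChangesRequestSketch(long fromSeqNo, int maxOperationsCount, String shardId, long maxOperationsSizeInBytes) {
        this.fromSeqNo = fromSeqNo;
        this.maxOperationsCount = maxOperationsCount;
        this.shardId = shardId;
        this.maxOperationsSizeInBytes = maxOperationsSizeInBytes;
    }

    // Without this override, the default Object#toString would print only the
    // class name and an identity hash, which says nothing about the failed request.
    @Override
    public String toString() {
        return "Request={"
            + "fromSeqNo=" + fromSeqNo
            + ", maxOperationsCount=" + maxOperationsCount
            + ", shardId=" + shardId
            + ", maxOperationsSizeInBytes=" + maxOperationsSizeInBytes
            + "}";
    }
}
```

With this in place, an exception message built by concatenating the request into a string carries the from sequence number and limits of the fetch that failed.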

Member

@martijnvg martijnvg left a comment

LGTM

@@ -166,7 +166,7 @@ public Ccr(final Settings settings) {
ShardFollowTask::fromXContent),

// Task statuses
- new NamedXContentRegistry.Entry(ShardFollowNodeTask.Status.class, new ParseField(ShardFollowNodeTask.Status.NAME),
+ new NamedXContentRegistry.Entry(ShardFollowNodeTask.Status.class, new ParseField(ShardFollowNodeTask.Status.STATUS_PARSER_NAME),
Member

This line is too long, causing a checkstyle violation.

Member Author

Thanks. I pushed 477c3bf.

@@ -72,4 +72,9 @@ protected boolean supportsUnknownFields() {
protected ToXContent.Params getToXContentParams() {
return ToXContent.EMPTY_PARAMS;
}

protected boolean assertToXContentEquivalence() {
Member

Ah, this was just missing in this base class. It does exist in the AbstractXContentTestCase base class.
Should we add this change to master / 6.x separately after this PR is merged, or just wait until the ccr branches are merged? I have no strong opinion here.

Member Author

We discussed this offline. We will keep this in this branch for now, and it will come to 6.x/master when we integrate there.

Member Author

I ended up opening #33114 since I have to wait for a build on #33113, and then subsequently a build here.
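The hook discussed in this thread follows a common test base class pattern: the base class runs a round trip and a subclass can opt out of one assertion by overriding a boolean method. The sketch below is only an illustration of that pattern under stated assumptions. `RoundTripTestCaseSketch` and its `createInstance`, `serialize`, and `parse` methods are hypothetical and far simpler than the real Elasticsearch test base classes; only the hook name `assertToXContentEquivalence` comes from the diff above.

```java
// Hypothetical base class illustrating an opt-out hook; not the Elasticsearch test framework.
abstract class RoundTripTestCaseSketch<T> {
    protected abstract T createInstance();

    protected abstract String serialize(T instance);

    protected abstract T parse(String serialized);

    // Hook: subclasses return false when serializing a parsed instance is not
    // expected to reproduce the original output exactly.
    protected boolean assertToXContentEquivalence() {
        return true;
    }

    final void testRoundTrip() {
        final T original = createInstance();
        final String first = serialize(original);
        final String second = serialize(parse(first));
        // The equivalence check runs only when the hook allows it.
        if (assertToXContentEquivalence() && first.equals(second) == false) {
            throw new AssertionError("round trip changed output: [" + first + "] vs [" + second + "]");
        }
    }
}
```

A subclass whose parsed form is lossy would override the hook to return `false` and rely on its own, looser assertions instead.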

@@ -224,6 +241,7 @@ private void sendShardChangesRequest(long from, int maxOperationCount, long maxR
synchronized (ShardFollowNodeTask.this) {
totalFetchTimeMillis += TimeUnit.NANOSECONDS.toMillis(relativeTimeProvider.getAsLong() - startTime);
numberOfSuccessfulFetches++;
+ fetchExceptions.remove(from);
Member

👍 to the approach of tracking exceptions by from seqno in a fixed-size linked hash map.
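As a rough illustration of the approach praised here, a size-bounded map keyed by from sequence number can be built on `java.util.LinkedHashMap` by overriding `removeEldestEntry`. This is a sketch under assumptions rather than the code in the PR: the class `FetchExceptionTracker` and its method names are hypothetical, and only the `fetchExceptions.remove(from)` call on success is taken from the diff above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the tracking inside ShardFollowNodeTask.
final class FetchExceptionTracker {
    private final LinkedHashMap<Long, Exception> fetchExceptions;

    FetchExceptionTracker(final int maxConcurrentFetches) {
        // Insertion-ordered map that evicts its oldest entry once it grows past the bound.
        this.fetchExceptions = new LinkedHashMap<Long, Exception>() {
            @Override
            protected boolean removeEldestEntry(final Map.Entry<Long, Exception> eldest) {
                return size() > maxConcurrentFetches;
            }
        };
    }

    // Record the exception for the fetch that started at fromSeqNo.
    synchronized void onFetchFailure(final long fromSeqNo, final Exception e) {
        fetchExceptions.put(fromSeqNo, e);
    }

    // A later successful fetch from the same sequence number clears the entry.
    synchronized void onFetchSuccess(final long fromSeqNo) {
        fetchExceptions.remove(fromSeqNo);
    }

    // Copy for reporting, so callers never observe the map while it is being mutated.
    synchronized Map<Long, Exception> snapshot() {
        return new LinkedHashMap<>(fetchExceptions);
    }
}
```

Bounding the map by the maximum number of concurrent fetches keeps memory use fixed even if a follower fails repeatedly, while still reporting the most recent failure per outstanding fetch.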

@jasontedor
Member Author

This PR needs #33113 first.

* ccr: (71 commits)
  Make CCR QA tests build again (elastic#33113)
  Add hook to skip asserting x-content equivalence (elastic#33114)
  Muted testListenersThrowingExceptionsDoNotCauseOtherListenersToBeSkipped
  [Rollup] Move getMetadata() methods out of rollup config objects (elastic#32579)
  fixed not returning response instance
  Muted testEmptyAuthorizedIndicesSearchForAllDisallowNoIndices
  Update Google Cloud Storage Library for Java (elastic#32940)
  Remove unsupported Version.V_5_* (elastic#32937)
  Required changes after merging in master branch.
  [DOCS] Add docs for Application Privileges (elastic#32635)
  Add versions 5.6.12 and 6.4.1
  Do NOT allow termvectors on nested fields (elastic#32728)
  [Rollup] Return empty response when aggs are missing (elastic#32796)
  [TEST] Add some ACL yaml tests for Rollup (elastic#33035)
  Move non duplicated actions back into xpack core (elastic#32952)
  Test fix - GraphExploreResponseTests should not randomise array elements Closes elastic#33086
  Use `addIfAbsent` instead of checking if an element is contained
  TESTS: Fix Random Fail in MockTcpTransportTests (elastic#33061)
  HLRC: Fix Compile Error From Missing Throws (elastic#33083)
  [DOCS] Remove reload password from docs cf. elastic#32889
  ...
@jasontedor
Member Author

@elasticmachine test this please

@jasontedor jasontedor merged commit ef9607e into elastic:ccr Aug 24, 2018
jasontedor added a commit that referenced this pull request Aug 24, 2018
@jasontedor jasontedor deleted the ccr-fetch-exceptions branch August 24, 2018 18:41
Labels
:Distributed/CCR Issues around the Cross Cluster State Replication features
3 participants