Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Add support in the SDK to retrieve IndexingPressure information from OpenSearch #655

Open
Tracked by #720
joshpalis opened this issue Apr 6, 2023 · 9 comments
Assignees
Labels
enhancement New feature or request

Comments

@joshpalis
Copy link
Member

joshpalis commented Apr 6, 2023

Is your feature request related to a problem?

The ADResultBulkTransportAction handles bulk indexing requests to OpenSearch to index Anomaly Results. This is used for multi-entity detectors.

This particular transport action requires an object of type IndexingPressure to be injected via guice which tracks the incoming index requests per shard/node in the cluster and provides memory accounting. Anomaly Detection uses this indexing pressure here to calculate an indexing pressure limit. This calculation is then used to determine whether to continue indexing Anomaly Results or to queue them for later.

What solution would you like?

The SDK should provide a mechanism to request the state of the IndexingPressure instantiated in OpenSearch and enable extensions to retrieve this information from the ExtensionRunner to perform these claculations. This workflow should follow a similar design to the SDK's sendClusterStateRequest.

Here is a high level overview of what the workflow should look like :

  1. The SDK's extensionRunner should define a method called sendIndexingPressureRequest(Transportservice) that instantiates a IndexingPressureResponseHandler and uses the transport service to send a request to the ExtensionsManager.
  2. The ExtensionsManager should include the IndexingPressureService as a class field and provide a method to set this field after the IndexingPressureService is instantiated here in Node.java
  3. A request handler should be registered to handle an indexing pressure request, which receives the request and provides the following information from the IndexingPressureService's ShardIndexingPressure object,
  1. the response should be an object with the data mentioned above

Do you have any additional context?

Within Node.java, the IndexingPressureService is instantiated here and is then bound to guice here. Internally this includes an object of type ShardIndexingPressure here that extends the IndexingPressure class.

@joshpalis joshpalis added enhancement New feature or request untriaged labels Apr 6, 2023
@joshpalis joshpalis added good first issue Good for newcomers CCI Part of the College Contributor Initiative more-challenging Good issues for new contributors that present a small challenge and removed untriaged labels Apr 7, 2023
@dbwiddis dbwiddis removed good first issue Good for newcomers CCI Part of the College Contributor Initiative more-challenging Good issues for new contributors that present a small challenge labels Apr 12, 2023
@dbwiddis
Copy link
Member

This is available in the Nodes API:

GET /_nodes/<node_id>/stats/<metric>/<index_metric>

Should be a quick add of the appropriate client call in SDK client similar to Field Mappings.

@dbwiddis dbwiddis self-assigned this Apr 12, 2023
@dbwiddis
Copy link
Member

See example

      "indexing_pressure" : {
        "memory" : {
          "current" : {
            "combined_coordinating_and_primary_in_bytes" : 0,
            "coordinating_in_bytes" : 0,
            "primary_in_bytes" : 0,
            "replica_in_bytes" : 0,
            "all_in_bytes" : 0
          },
          "total" : {
            "combined_coordinating_and_primary_in_bytes" : 40256,
            "coordinating_in_bytes" : 40256,
            "primary_in_bytes" : 45016,
            "replica_in_bytes" : 0,
            "all_in_bytes" : 40256,
            "coordinating_rejections" : 0,
            "primary_rejections" : 0,
            "replica_rejections" : 0
          },
          "limit_in_bytes" : 53687091
        }
      },

@dbwiddis
Copy link
Member

Have tried and failed a few approaches, recording here for posterity:

  1. RestHighLevelClient doesn't have a built in for this.
  2. OpenSearchClient (Java Client) has a built-in for Node Stats but the response object doesn't include indexing_pressure.

This leaves two options:

  1. Use performRequestAsync with the endpoint GET /_nodes/stats/indexing_pressure/ and manually parse the JSON return (above, looks pretty easy to parse)
  2. Do like we do with cluster state and pull the IndexingPressure object from OpenSearch.
    • Big Problem: I think it's node-local, and a transport request will just return whichever node handles the request, so getting it for all nodes (like we can with REST) isn't ideal.

So assuming I go with option 1, I'm thinking to create an SDK-side Transport Action that does the request.

However, that leads to the follow-on question, "what are we going to do with this information"? In the AD application, the indexing pressure is considered on the node which is currently executing the request. In the AD Extension we're processing data on a remote node and sending it back via REST calls, which we don't know which node will handle it (Hello Hash Ring?).

SO I'm thinking this issue needs to be paused pending Hash Ring implementation. Thoughts, anyone?

@dbwiddis
Copy link
Member

Useful blog post, I think this is the way forward: https://opensearch.org/blog/shard-indexing-backpressure-in-opensearch

Current code is local node based and just measures memory as a signal whether to index.

Instead we should query the REST APIs documented in this and use the unthrottled/soft limit thresholds as our critereon.

@dbwiddis
Copy link
Member

Putting this issue back into the backlog for now. Future plans:

  1. In the AD extension where indexing pressure is called, just skip the check for now. We can do our initial performance tests without any throttling. It's entirely possible that allowing OpenSearch to distribute its requests on its own may provide better performance than the highly localized per-node throttling that currently exists. If we end up with too many requests, we can collect data on when/where/why to better address the problem.
  2. Eventually we should add this capability but it should be based on the "primary metrics" leading indicators of node health per the blog post.

@owaiskazi19
Copy link
Member

Isn't this issue a blocker for multi entity detectors?
cd: @joshpalis

@joshpalis
Copy link
Member Author

Yes, Indexing pressure is needed for the ADResultBulkTransportAction. This action is executed by the MultiEntityResultHandler to bulk index AD results for HCAD

@dbwiddis
Copy link
Member

Not if throttling is not needed.

If it is we can look at the above api to determine what thresholds to use.

Existing code assumes you are on the node doing the work, not injecting via api.

@dbwiddis
Copy link
Member

Summary: replicating AD code has no meaning on an extension. we may need a different api call to decide whether to skip some bulk indexing but I don't know what thresholds we should use with the new call. We should try performance testing without any limit. If we end up needing to add it then the test will help us know what to use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants