Compatibility with segment replication #697

dreamer-89 · 2023-06-29T02:20:18Z

Summary

With the 2.9.0 release, many enhancements are landing for the segment replication[1][2] feature (which went GA in 2.7.0), so we need to ensure that plugins are compatible with its current state. Previously, we ran tests on plugin repos to verify this compatibility, but we want plugin owners to be aware of these changes so that any required updates can be made. With the 2.10.0 release, the remote store feature is going GA. It internally uses only the SEGMENT replication strategy, i.e. it forces all indices to use SEGMENT replication. It is therefore important to validate that plugins are compatible with segment replication.

What changed

1. Refresh policy behavior

RefreshPolicy.IMMEDIATE will immediately refresh only primary shards, not replica shards. Instead, after the refresh, the primary starts a round of segment replication to update the replica shard copies, which leads to eventual consistency.
RefreshPolicy.WAIT_UNTIL ensures the indexing operation is searchable in your cluster, i.e. RAW (read-after-write guarantee). With segment replication, this guarantee does not hold because replica shard updates are delayed by asynchronous background refreshes.
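
For readers who exercise these policies over REST rather than through the Java client, the sketch below maps each policy to the value of the `refresh` query parameter on an index request. The helper name `index_doc_path` and the index/document names are hypothetical; only the three `refresh` values are taken from the REST API.

```python
from enum import Enum
from urllib.parse import quote, urlencode


class RefreshPolicy(Enum):
    """The three refresh policies, keyed to their REST `refresh` values."""
    NONE = "false"
    IMMEDIATE = "true"
    WAIT_UNTIL = "wait_for"


def index_doc_path(index: str, doc_id: str, policy: RefreshPolicy) -> str:
    """Build the path and query string for indexing a single document.

    Hypothetical helper: it only assembles the URL, it does not send it.
    """
    query = urlencode({"refresh": policy.value})
    return f"/{quote(index, safe='')}/_doc/{quote(doc_id, safe='')}?{query}"


# Example: an index request that asks for an immediate refresh.
immediate_path = index_doc_path("my-index", "1", RefreshPolicy.IMMEDIATE)
```

With segment replication, a request built this way still refreshes only the primary immediately; the replicas catch up after the next replication round.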

2. Refresh lag on replicas

With segment replication, there is an inherent delay before documents become searchable on replica shard copies, because each replica copies data (segment) files over from the primary. Thus, compared to document replication, replicas will on average take longer to become consistent with primaries.
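
The lag can be pictured with a deliberately simplified toy model. This is not how OpenSearch is implemented; the classes `ToyPrimary` and `ToyReplica` are invented for illustration, and the only point is that a replica's searchable set changes when segments are copied, not when the primary refreshes.

```python
class ToyPrimary:
    """Toy primary shard: indexed docs become searchable on refresh."""

    def __init__(self):
        self.buffer = set()       # indexed but not yet refreshed
        self.searchable = set()   # visible to searches on the primary

    def index(self, doc_id):
        self.buffer.add(doc_id)

    def refresh(self):
        # A refresh turns buffered docs into searchable segments.
        self.searchable |= self.buffer
        self.buffer.clear()


class ToyReplica:
    """Toy replica shard: sees new docs only after copying segments."""

    def __init__(self):
        self.searchable = set()

    def copy_segments_from(self, primary):
        # Stand-in for one round of segment replication.
        self.searchable = set(primary.searchable)


primary, replica = ToyPrimary(), ToyReplica()
primary.index("doc-1")
primary.refresh()
visible_before_copy = "doc-1" in replica.searchable  # replica still lags
replica.copy_segments_from(primary)
visible_after_copy = "doc-1" in replica.searchable
```

A search routed to the replica between the refresh and the copy would miss `doc-1`, which is the window plugins need to tolerate or avoid.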

3. System/hidden indices support

With opensearch-project/OpenSearch#8200, system and hidden indices are now supported with the SEGMENT replication strategy. We need to ensure there are no bottlenecks that prevent system/hidden indices from working with segment replication.
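
To validate this, a test index has to actually be created with SEGMENT replication. A minimal sketch of a create-index request body follows, assuming the `index.replication.type` and `index.hidden` index settings; the function name `create_index_body` is hypothetical, and plugins that create system indices from Java would set the same settings in their own code.

```python
import json


def create_index_body(replication_type: str = "SEGMENT", hidden: bool = False) -> str:
    """Return a JSON create-index body with the given replication type.

    Hypothetical helper for test setup; it only builds the body.
    """
    if replication_type not in ("SEGMENT", "DOCUMENT"):
        raise ValueError(f"unknown replication type: {replication_type}")
    index_settings = {"replication.type": replication_type}
    if hidden:
        index_settings["hidden"] = True
    return json.dumps({"settings": {"index": index_settings}}, sort_keys=True)


# Body for a hidden index that uses segment replication.
hidden_segrep_body = create_index_body(hidden=True)
```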

Next steps

With segment replication, strong reads are not guaranteed. Thus, if the plugin needs strong read guarantees, in particular to work around the changed refresh policy behavior and the lag on replicas (points 1 and 2 above), we need to update search requests to target primary shards only. With opensearch-project/OpenSearch#7375, core now supports searching primary shards only. Please follow the documentation for examples and details.
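
As a rough illustration over REST, a primary-only search is requested by setting the search `preference` to `_primary`. The helper `search_path` and the index name below are hypothetical; check the linked documentation for the exact behavior in your version.

```python
from urllib.parse import quote, urlencode


def search_path(index: str, primary_only: bool = False) -> str:
    """Build a _search path, optionally restricted to primary shards.

    Hypothetical helper: it only assembles the URL, it does not send it.
    """
    path = f"/{quote(index, safe='')}/_search"
    if primary_only:
        # Route the search to primary shard copies only.
        path += "?" + urlencode({"preference": "_primary"})
    return path


# Example: a search that must not read from lagging replicas.
strong_read_path = search_path("my-index", primary_only=True)
```

Routing every search to primaries concentrates read load on them, so this is best reserved for requests that genuinely need strong reads.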

Open questions

In case of any questions or issues, please post them in the core issue.

Reference

[1] Design

[2] Documentation

dreamer-89 · 2023-06-29T19:53:29Z

Requesting owners to add the v2.9.0 label to this issue. Tagging @gaobinlong

gaobinlong · 2023-07-10T08:57:21Z

@dreamer-89 hi, a few questions here. With segment replication, is the delay between a primary shard and its replicas just the processing time of the segment copy? Is the delay bounded, or can it become too long when the cluster's load is high? Can we just set index.replication.type to DOCUMENT for the system indices in this plugin?

dreamer-89 · 2023-07-11T00:37:10Z

Thanks for sharing your use case.

The delay for replicas to catch up with the primary depends on multiple factors. Please go through the segment replication design and documentation linked in the issue description. Yes, this delay can be long when your cluster is under load. You can override system/hidden indices to use DOCUMENT replication, but then we are not truly testing the system indices with segment replication. This validation is important because, going forward, SEGMENT will be the only replication strategy supported for certain configurations (e.g. remote store at the cluster level). Thus, there is no alternative to validating indices that are actually created with SEGMENT replication.

Would primary shard based searching, as mentioned in the issue description, solve your use case?

gaobinlong · 2023-07-11T08:14:17Z

@dreamer-89 thanks for your explanation. Our team discussed this internally. We'll do some validation with system indices created with segment replication for this plugin, and we think the delay is tolerable. We won't make any code changes in this plugin because the 2.9 code freeze date is near; we will use primary shard based searching in the next release.

dreamer-89 · 2023-07-11T22:36:34Z

Thanks @gaobinlong for sharing the update. I just wanted to call out that primary shard based search may add additional load on primary shards and may have performance implications. I assume you do not have a heavy read workload and have already verified that this is not a problem for your use case.

Also, can you please update the label to 2.10.0?

dreamer-89 · 2023-08-21T19:37:28Z

@gaobinlong: Thanks for working on this issue. I just wanted to update that core now supports realtime reads for segment replication enabled indices with https://github.com/opensearch-project/OpenSearch/issues/8536. So, if you are performing realtime reads, requests should now automatically return the latest data (with 2.10.0+ OS core) for segment replication enabled indices. Please check opensearch-project/OpenSearch#8536 for more details.

Also, if there is no action item needed in this plugin, please close the issue.

gaobinlong · 2023-08-22T03:13:37Z

@dreamer-89, thanks for your update. It seems that the issue you mentioned only resolves realtime reads for the GET API, but I've checked the code in this repo and we use both the GET API and the Search API to fetch documents. So for the Search API, do we still need to use primary shard based search by setting preference to '_primary'?

dreamer-89 · 2023-08-22T03:26:01Z

Thanks @gaobinlong for the prompt response. Yes, you are right: the opensearch-project/OpenSearch#8536 fix in core handles only the GET and mGET APIs. I am wondering why the _primary preference would be needed for generic search queries. Even with DOCUMENT replication, search queries may return stale data (if the request hits replica shards).

gaobinlong · 2023-08-22T04:13:11Z

@dreamer-89 with segment replication, the delay between the primary shard and replica shards is longer than with document replication, so I think we should use the _primary preference to ensure read consistency. Also, with segment replication, if some indexing requests use RefreshPolicy.IMMEDIATE, the replica shards will not be refreshed immediately, which may also be a problem when searching documents.

dreamer-89 · 2023-08-22T04:44:35Z

RefreshPolicy.IMMEDIATE with DOCUMENT replication does not provide read-after-write (RAW) guarantees. Today, RAW is provided through the WAIT_UNTIL refresh policy and GET/mGET requests. So, in my opinion, if you are not using either of these mechanisms, your use case does not need RAW guarantees. Please correct me if I am wrong.

Correction: the statement above is not correct. IMMEDIATE does provide strong reads with DOCUMENT replication today.

CC @mch2

dreamer-89 · 2023-08-22T19:39:46Z

@gaobinlong: You are right. Please disregard my last message.
If you are using the IMMEDIATE refresh policy, then adding the _primary preference makes sense to me.

The other alternative is to rely on the get and mget APIs, which return strong reads by default. I will let you decide the right approach for this plugin.
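
For illustration only, here is a minimal sketch of the second alternative over REST: fetching known document IDs through `_mget` instead of searching. The helper `mget_request` and the index and ID values are hypothetical, and this only works when the caller already knows which IDs it needs.

```python
import json
from urllib.parse import quote


def mget_request(index: str, doc_ids):
    """Return (path, JSON body) for a multi-get of the given document IDs.

    Hypothetical helper: it builds the request but does not send it.
    """
    path = f"/{quote(index, safe='')}/_mget"
    body = json.dumps({"ids": list(doc_ids)})
    return path, body


# Example: fetch two documents by ID rather than via a search query.
mget_path, mget_body = mget_request("my-index", ["1", "2"])
```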

gaobinlong · 2023-08-23T07:10:44Z

@dreamer-89 Thanks. I've checked the code in this repo: the IMMEDIATE refresh policy is not used when indexing documents, and we use both the get API and the search API to fetch documents. For the search API, the lag between the primary shard and replica shards may increase with segment replication, but we think it's tolerable. We can close this issue now.

dreamer-89 added the enhancement and untriaged labels Jun 29, 2023

dreamer-89 mentioned this issue Jun 29, 2023

[Meta] Validate plugins compatibility with segment replication opensearch-project/OpenSearch#8211

Closed

37 tasks

gaobinlong removed the untriaged label Jun 29, 2023

gaobinlong added the v2.9.0 label Jun 30, 2023

gaobinlong self-assigned this Jul 5, 2023

Hailong-am mentioned this issue Jul 11, 2023

Compatibility with segment replication opensearch-project/dashboards-notifications#64

Closed

gaobinlong added v2.10.0 and removed v2.9.0 labels Jul 12, 2023

gaobinlong closed this as completed Aug 23, 2023
