Feature Proposal : Pluggable Translog #1319
Comments
I'm curious what other folks think about this. I know some are strongly opposed to giving the option to disable this level of durability. I'm not in that camp. I like the idea of giving the option to completely disable the xlog and save that processing/storage time for simple, fast/unreliable use cases. Similar to folks choosing UDP over TCP and not caring if certain packets are lost.
As of now, do we want to provide control over when to
@jayeshathila I have updated the sequence diagram for translog interaction for a better understanding of the existing system; I would recommend you go through it.
Will this (I know this Path has a method to map a URI)
For my understanding, who (which class) will be the caller to
I'm pretty interested in this feature, primarily for the pluggable transport option. I want the capability to encrypt the translog byte stream with different keys (different indexes must have different encryption keys). The different-keys-per-index requirement is what pushes me away from using EBS; mounting different EBS volumes per index doesn't make sense for my use case either, due to the volume and variability of indexes/keys. @Bukhtawar Looking at the diagram and description, it seems like AbstractStore and the remote store idea are more about a backing store that is used to copy the data locally and then read it from the local filesystem when needed. Is that a correct interpretation? Can we ensure the interfaces provided for the plugin can be used to layer read/write operations on the store? For example: remote-store-to-memory, or for data encryption (the plugin would prevent unencrypted translog data from hitting disk).
@Bukhtawar I saw this didn't have the 2.1 tag. Is this still targeting 2.1?
+1 to that
The translog is really helpful for maintaining the linearizability of atomic commits in OpenSearch clusters, which may lead toward Raft consensus. @Bukhtawar, your thoughts on this? #6369
@Bukhtawar @rramachand21 Is this tracking green for 2.9?
Overview
The purpose of this document is to propose various mechanisms for achieving durability of uncommitted indexing operations and to outline the advantages of making the implementation pluggable. This is not a complete design document; it is a proposal on which to seek community feedback.
Motivation
Background
What is the role of a translog?
All operations that affect an OpenSearch index (i.e. adding, updating, or removing documents) are added to an in-memory buffer, which is periodically flushed to create segments on disk via a Lucene commit. These operations also need to be durably written to a write-ahead transaction log (aka translog) before they are acknowledged back to the client, since Lucene commits are too expensive to be performed per request. In the event of a process crash, recent operations that have been acknowledged but not yet included in the last Lucene commit are recovered from the translog.
The following are the various mechanisms through which the engine interacts with the translog:
Ingestion Interactions
For efficiency, the translog operations are first buffered in an in-memory queue so that not all of the indexing threads have to perform the costly fsync operation to disk. One of the indexing threads acquires a lock to dequeue and subsequently fsync the buffered translog operations to disk. Once the fsync operation is complete, for operation consistency, OpenSearch also writes a checkpoint to the disk with the last sequence number that was written to the translog, which helps prevent translog corruption. Once all translog operations for a given request have been durably persisted to disk, OpenSearch sends an acknowledgment back to the client.
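To make the buffering and group-fsync behaviour concrete, below is a minimal, hypothetical Java sketch of this write path. The class and method names are illustrative and intentionally simplified; the actual OpenSearch implementation also checksums each operation and manages translog generations, which is omitted here.

```java
// Hypothetical, simplified sketch of the buffered translog write path; not the real
// OpenSearch classes. Checksumming and translog generation management are omitted.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

class BufferedTranslogWriter {
    private final FileChannel translogChannel;    // current translog generation file
    private final FileChannel checkpointChannel;  // checkpoint file holding the last synced seq no
    private final List<ByteBuffer> buffer = new ArrayList<>(); // in-memory queue of serialized ops
    private final ReentrantLock syncLock = new ReentrantLock();
    private volatile long lastSyncedSeqNo = -1;
    private long maxBufferedSeqNo = -1;

    BufferedTranslogWriter(Path translog, Path checkpoint) throws IOException {
        this.translogChannel = FileChannel.open(translog, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        this.checkpointChannel = FileChannel.open(checkpoint, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    /** Indexing threads append serialized operations to the in-memory buffer without fsyncing. */
    synchronized void add(long seqNo, ByteBuffer serializedOp) {
        buffer.add(serializedOp);
        maxBufferedSeqNo = Math.max(maxBufferedSeqNo, seqNo);
    }

    /** Called before acknowledging a request: one thread drains the buffer, fsyncs, and checkpoints. */
    void ensureSynced(long seqNo) throws IOException {
        if (seqNo <= lastSyncedSeqNo) {
            return; // already durable; no fsync needed for this operation
        }
        syncLock.lock();
        try {
            if (seqNo <= lastSyncedSeqNo) {
                return; // another thread synced it while we waited for the lock
            }
            final List<ByteBuffer> toWrite;
            final long syncUpTo;
            synchronized (this) {
                toWrite = new ArrayList<>(buffer);
                buffer.clear();
                syncUpTo = maxBufferedSeqNo;
            }
            for (ByteBuffer op : toWrite) {
                translogChannel.write(op);
            }
            translogChannel.force(false);  // fsync the translog data
            writeCheckpoint(syncUpTo);     // then record the last durable sequence number
            lastSyncedSeqNo = syncUpTo;
        } finally {
            syncLock.unlock();
        }
    }

    private void writeCheckpoint(long seqNo) throws IOException {
        ByteBuffer cp = ByteBuffer.allocate(Long.BYTES).putLong(seqNo);
        cp.flip();
        checkpointChannel.write(cp, 0);    // overwrite the checkpoint in place
        checkpointChannel.force(false);    // fsync the checkpoint as well
    }
}
```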
When the translog grows beyond a configurable size, in order to prevent recoveries from taking too long, the OpenSearch engine triggers a flush in the background. An OpenSearch flush performs a Lucene commit, moving segments to disk, and also rolls over and starts a new translog file.
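For reference, the size at which this background flush kicks in is controlled per index. A minimal sketch using the Java Settings builder, assuming the index.translog.flush_threshold_size setting name used by current OpenSearch releases (512mb is shown purely as an example value):

```java
// Illustrative only: building index settings that cap translog size before a background flush.
import org.opensearch.common.settings.Settings;

public class TranslogFlushSettingsExample {
    public static void main(String[] args) {
        Settings indexSettings = Settings.builder()
                .put("index.translog.flush_threshold_size", "512mb") // example value
                .build();
        System.out.println(indexSettings);
    }
}
```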
Shard Recovery
In the event of a crash, existing segment files are first recovered from the local disk. Translog files are then opened, and recent operations that have not been committed to Lucene are recovered by replaying all operations from the translog beyond the sequence number of the last Lucene commit checkpoint. A flush is then triggered, creating Lucene segments on disk and purging all unreferenced translog files.
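A rough sketch of that recovery sequence, with hypothetical interfaces standing in for the engine and translog (names are illustrative, not the actual OpenSearch types):

```java
// Hypothetical sketch of local recovery: replay acked-but-uncommitted operations, then flush.
import java.io.IOException;
import java.util.Iterator;

class LocalRecovery {
    interface Operation {}

    interface Translog {
        /** Snapshot of operations with a sequence number greater than the given checkpoint. */
        Iterator<Operation> newSnapshotFrom(long afterSeqNo) throws IOException;
        void trimUnreferencedFiles() throws IOException;
    }

    interface Engine {
        long lastCommittedSeqNo();                    // sequence number recorded in the last Lucene commit
        void apply(Operation op) throws IOException;  // re-index / re-delete the operation
        void flush() throws IOException;              // Lucene commit + translog roll-over
    }

    void recover(Engine engine, Translog translog) throws IOException {
        long committed = engine.lastCommittedSeqNo();
        // Replay everything acknowledged after the last commit so no acked write is lost.
        Iterator<Operation> ops = translog.newSnapshotFrom(committed);
        while (ops.hasNext()) {
            engine.apply(ops.next());
        }
        // Persist the replayed state and drop translog files that are no longer referenced.
        engine.flush();
        translog.trimUnreferencedFiles();
    }
}
```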
Recovery mechanisms
Cross cluster replication
Cross Cluster Replication in OpenSearch uses a pull-based replication model to pull translog operations into the follower index. The translog in the leader cluster is pulled incrementally, from the current sequence number of the follower shard up to the global checkpoint of the leader shard being replicated from, once the bootstrap from committed segments is complete.
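A hypothetical sketch of one iteration of that pull loop, with placeholder interfaces for the leader and follower shards (these are not the actual CCR plugin classes):

```java
// Hypothetical sketch of the follower's incremental pull; names are illustrative only.
import java.util.List;

class FollowerReplicationTask {
    interface Operation {} // serialized translog operation pulled from the leader

    interface LeaderShardClient {
        /** Fetch translog operations in the range (fromSeqNo, toSeqNo] from the leader shard. */
        List<Operation> fetchOperations(long fromSeqNo, long toSeqNo);
        /** Highest sequence number fully replicated within the leader's replication group. */
        long globalCheckpoint();
    }

    interface FollowerShard {
        long localCheckpoint();          // last sequence number applied on the follower
        void apply(Operation operation); // re-index the pulled operation locally
    }

    /** One pull iteration: catch the follower up from its checkpoint to the leader's global checkpoint. */
    void pullOnce(LeaderShardClient leader, FollowerShard follower) {
        long from = follower.localCheckpoint();
        long to = leader.globalCheckpoint();
        if (to <= from) {
            return; // follower is already caught up
        }
        for (Operation op : leader.fetchOperations(from, to)) {
            follower.apply(op);
        }
    }
}
```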
Downsides of current translog implementation
Use Cases
The pluggable translog support is meant to cater to a broad spectrum of users who would want to customise their durability options based on specific needs.
High Level Proposal
We propose decoupling the translog from the engine by abstracting the translog code out of the server code base and moving it to a separate module. The engine would provide knobs to choose the translog option, i.e. whether or not translogs need to be maintained by the system. The translog module would further support extension points for a remote store, which would be implemented as separate extended plugins. The translog module would support the local store as the default so that the code is fully backward compatible, while giving users an option to customize their own durability by installing specific durability extension plugins. All IO across the different storage implementations would be checksummed to detect and prevent translog corruption.
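As a strawman for what the extension point could look like, here is a hypothetical plugin interface. This is not an existing OpenSearch API; the names and signatures are placeholders for discussion only.

```java
// Hypothetical extension point for a pluggable translog store; names are placeholders.
import java.io.Closeable;
import java.io.IOException;

public interface TranslogStorePlugin {
    /** Identifier users would select per index, e.g. "local", "remote", or "none". */
    String storeType();

    /** Open (or create) a translog for the given shard, backed by this store. */
    TranslogStore openTranslog(String indexName, int shardId) throws IOException;

    interface TranslogStore extends Closeable {
        /** Append a serialized operation; must be durable (and checksummed) before returning. */
        void append(long seqNo, byte[] serializedOperation) throws IOException;

        /** Replay operations after the given sequence number, e.g. during recovery. */
        Iterable<byte[]> readFrom(long afterSeqNo) throws IOException;

        /** Drop operations up to the given sequence number once a Lucene commit covers them. */
        void trimBefore(long seqNo) throws IOException;
    }
}
```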
Translog durability
Local: This is the same as the durability option we have today. Translogs will be co-located on the local disk with the segments and will behave the same for both segment- and document-based replication, i.e. all indexing operations would require the translog to be replicated synchronously within the shard replication group. For a non-replicated index, in case of a node failure, shards cannot be recovered on a different node.
Remote: Translogs can be stored on a remote store as provided by the durability extension plugin, in which case we don’t need translogs to be redundantly written on replicas. In the case of segment-based replication, the indexing operations need not even be replicated to the replicas, as the primary is responsible for both segment creation and translog persistence to a durable store.
The local recovery process will need to pull translogs from the remote store and replay the needed translog operations once segments have been restored.
For a non-replicated index, in case of a node failure, shards can be recovered on a different node if segments are also present on the remote store, by first restoring the segments and then pulling the translogs and replaying the missing operations.
No translog: Users can choose to call commit periodically to make recent changes durable on disk. In the event of a primary node failure, all changes since the previous Lucene commit would be lost and would need to be re-driven.
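To make the three options above concrete, here is a purely hypothetical example of how a per-index choice could be expressed. The setting name index.translog.type and its values are placeholders from this proposal, not existing settings.

```java
// Hypothetical illustration only: index.translog.type is a placeholder, not a real setting.
import org.opensearch.common.settings.Settings;

public class TranslogDurabilityChoiceExample {
    public static void main(String[] args) {
        Settings localDurability  = Settings.builder().put("index.translog.type", "local").build();  // default, today's behaviour
        Settings remoteDurability = Settings.builder().put("index.translog.type", "remote").build(); // backed by a durability extension plugin
        Settings noTranslog       = Settings.builder().put("index.translog.type", "none").build();   // rely on periodic commits only
        System.out.println(localDurability + " " + remoteDurability + " " + noTranslog);
    }
}
```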
Low-level proposal
Future Work
FAQs
What is a Lucene commit?
A Lucene commit is the process of flushing all pending changes (added and deleted documents, segment merges, added indexes, etc.) to the index and syncing all referenced index files. The data on disk is then ready to be used for searching, and the index updates will survive an OS or machine crash or power loss.
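For example, a minimal standalone Lucene snippet showing a commit (assuming a recent Lucene 9.x dependency; the index path is arbitrary):

```java
// Minimal Lucene example: add a document and commit so it survives a crash.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class LuceneCommitExample {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/lucene-index")),
                new IndexWriterConfig())) {
            Document doc = new Document();
            doc.add(new StringField("id", "1", Field.Store.YES));
            writer.addDocument(doc); // buffered in memory until a flush/commit
            writer.commit();         // fsyncs referenced files; the index now survives a crash
        }
    }
}
```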
What are shard retention leases?
Shard retention leases are a mechanism used for peer recoveries, aimed at preventing shard recoveries from having to fall back to expensive file copy operations if shard history is available from a certain sequence number. A lease is acquired by a recovering replica copy to prevent any operations on the primary beyond this sequence number from being merged away.
What is a soft-delete?
Lucene has functionality to keep deleted documents alive in the index, based on time or any other constraint. The soft-deletes merge policy allows control over when soft deletes are reclaimed by merges.
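A minimal Lucene sketch of enabling soft deletes and retaining them via the merge policy (the field name __soft_deletes is shown only as an example):

```java
// Minimal sketch: mark deletes as soft deletes and retain them via a retention merge policy.
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SoftDeletesRetentionMergePolicy;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.search.MatchAllDocsQuery;

public class SoftDeletesExample {
    public static void main(String[] args) {
        IndexWriterConfig config = new IndexWriterConfig()
                .setSoftDeletesField("__soft_deletes") // deletes become marks instead of hard removals
                .setMergePolicy(new SoftDeletesRetentionMergePolicy(
                        "__soft_deletes",
                        MatchAllDocsQuery::new,     // which soft-deleted docs to keep alive (here: all)
                        new TieredMergePolicy()));  // delegate policy that decides the actual merges
        System.out.println(config);
    }
}
```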