Closed shard should never open new engine #47186
Pinging @elastic/es-distributed
Thanks for looking into this @dnhatn. I believe I know what you want to achieve, but I think this might reintroduce blocking of the cluster state applier thread (more details in comments).
@henningandersen Thank you for your thoughtful review.
Great catch. I spent some time on this, but I could not come up with a solution other than introducing a "start" method on Engine. Do you have any suggestions for this?
@dnhatn I also lean towards the "start" method that runs outside the mutex. The other options I can think of are:
Unfortunately, using the generic thread pool for the last approach might lead to issues if it can be called during shutdown. I think I prefer adding a method to do the warmup; it seems simpler overall. The benefit of closing asynchronously, though, is that it avoids waiting for the rest of the IO that happens in the InternalEngine constructor. I am not sure whether this has any practical benefit.
+1. Thank you for your suggestion. I will work on this solution.
@henningandersen I worked on a change that moves the engine warming out of the constructor. It is pretty straightforward. However, it does not eliminate the blocking issue. Closing an engine acquires the writeLock, which can be blocked by engine warming because warming holds the readLock (via refresh). We can fix the refresh, but indexing and flushing can cause the same problem. I will reach out to discuss this with you.
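The blocking described above can be reproduced in isolation with a plain `ReentrantReadWriteLock`. This is only a minimal sketch, not the Elasticsearch code: the class and method names are hypothetical, a "warmer" thread stands in for a refresh that holds the readLock, and the calling thread plays the role of an engine close trying to take the writeLock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-alone demo; none of these names exist in Elasticsearch.
final class CloseBlockedByWarming {

    /** Returns true if "close" got the write lock within the timeout while a warmer held the read lock. */
    static boolean closeSucceedsWhileWarming(long timeoutMillis) {
        ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();
        CountDownLatch warmingStarted = new CountDownLatch(1);
        CountDownLatch finishWarming = new CountDownLatch(1);

        // Stand-in for a refresh that warms a reader while holding the read lock.
        Thread warmer = new Thread(() -> {
            rwl.readLock().lock();
            try {
                warmingStarted.countDown();
                finishWarming.await(); // simulate slow warming
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                rwl.readLock().unlock();
            }
        });
        warmer.start();

        try {
            warmingStarted.await();
            // Stand-in for closing the engine, which needs the write lock.
            boolean acquired = rwl.writeLock().tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
            if (acquired) {
                rwl.writeLock().unlock();
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            finishWarming.countDown(); // let the warmer finish and release the read lock
            try {
                warmer.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("close acquired writeLock while warming: " + closeSucceedsWhileWarming(100));
    }
}
```

The sketch uses a timed `tryLock` so the demo terminates; a plain `lock()` would simply wait until the warmer is done, which is the blocking this comment is about.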
Today, we hold the engine readLock while refreshing. Although this choice simplifies the correctness reasoning, it can block IndexShard from closing if warming an external reader takes time. The current implementation of refresh does not need to hold the readLock, as ReferenceManager can handle errors correctly if the engine is closed midway. This PR is a prerequisite for solving #47186.
With this change, we won't warm up searchers until we externally refresh an engine. We explicitly refresh before allowing reads from a shard (i.e., before moving to the post_recovery state) and during resetting. This guarantees that we have warmed up the engine before exposing the external searcher. Another prerequisite for #47186.
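A rough way to picture this ordering, as a toy model rather than the actual InternalEngine logic: internal refreshes advance the reader without warming, and only an external refresh runs the warmer and publishes a reader to external searchers. All names below are hypothetical.

```java
import java.util.function.Consumer;

// Toy model of "warm only on external refresh"; names are assumptions, not Elasticsearch APIs.
final class LazyWarmEngineSketch {
    private final Consumer<String> warmer;
    private int generation = 0;          // latest internal reader generation
    private int externalGeneration = -1; // generation visible externally; -1 means none yet

    LazyWarmEngineSketch(Consumer<String> warmer) {
        this.warmer = warmer;
    }

    /** Internal refresh: makes changes visible internally, never warms. */
    synchronized void internalRefresh() {
        generation++;
    }

    /** External refresh: warms the new reader first, then exposes it. */
    synchronized void externalRefresh() {
        generation++;
        warmer.accept("reader-" + generation);
        externalGeneration = generation;
    }

    /** Fails until an external refresh has warmed and published a reader. */
    synchronized int acquireExternalSearcher() {
        if (externalGeneration < 0) {
            throw new IllegalStateException("no external reader yet; refresh externally first");
        }
        return externalGeneration;
    }
}
```

In this model, a caller that wants to allow reads has to call `externalRefresh()` first, which mirrors the explicit refresh before moving to post_recovery.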
This reverts commit a56d9ff.
I have two tests in a56d9ff that can reliably reproduce the test failure reported in #47060. However, neither of them works with the latest change, as we now hold engineMutex while closing a shard. I am not sure I can come up with a useful test for this change. Any suggestions would be great.
@henningandersen This is ready again. Can you please take another look? Thank you.
LGTM. Thanks for the large effort on this, @dnhatn.
@henningandersen Thank you so much for your review and discussion on this.
We should not open new engines if a shard is closed. We broke this assumption in #45263, where we stopped verifying the shard state before creating an engine and verified it only before swapping the engine reference. We can fail to snapshot the store metadata or checkIndex a closed shard if some IndexWriter is still holding the index lock.
Closes #47060
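The invariant can be sketched as follows, under the assumption (taken from the discussion above) that both opening and closing happen under a single engineMutex. This is an illustrative model with hypothetical names, not the IndexShard implementation: the state is checked before the engine is created, so a closed shard never ends up with a live engine holding the index lock.

```java
import java.util.function.Supplier;

// Illustrative model only; ShardSketch and its members are hypothetical names.
final class ShardSketch {
    enum State { RECOVERING, CLOSED }

    interface Engine {
        void close();
    }

    private final Object engineMutex = new Object();
    private State state = State.RECOVERING; // guarded by engineMutex
    private Engine engine;                  // guarded by engineMutex

    /** Creates an engine only if the shard is not closed; the check precedes creation. */
    void openEngine(Supplier<Engine> factory) {
        synchronized (engineMutex) {
            if (state == State.CLOSED) {
                throw new IllegalStateException("shard is closed; refusing to open a new engine");
            }
            engine = factory.get();
        }
    }

    /** Marks the shard closed and closes the current engine, if any, under the same mutex. */
    void close() {
        synchronized (engineMutex) {
            state = State.CLOSED;
            if (engine != null) {
                engine.close();
                engine = null;
            }
        }
    }
}
```

Checking the state only before swapping the reference would still let `factory.get()` run on a closed shard, which is the window this change closes.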