-
Notifications
You must be signed in to change notification settings - Fork 509
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDDS-10983. EC Key read corruption when the replica index of container in DN mismatches #6779
Conversation
…r in DN mismatches Change-Id: Ic88575f31305bde9d78b0e4d0c7bbf25c53a7ccb
Change-Id: I4386a0de5d61d1cec51f63ed7f75d531fa389e9c
Change-Id: Ic1175d69441957382fb9213f7e37e3047d284806
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/client/BlockID.java
Outdated
Show resolved
Hide resolved
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/client/BlockID.java
Outdated
Show resolved
Hide resolved
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/client/BlockID.java
Outdated
Show resolved
Hide resolved
@@ -42,6 +42,9 @@ public enum ClientVersion implements ComponentVersion { | |||
"This client version has support for Object Store and File " + | |||
"System Optimized Bucket Layouts."), | |||
|
|||
ERASURE_CODING_READ_CHUNK_CORRUPTION_FIX(4, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this new client version? Can we not simple check on the server that:
if (blockIDProto.hasReplicaIndex()) {
// do the verify.
}
I suspect we add new fields in places fairly often without creating new client versions.
I wonder what the trigger for a new client version should be, verses just checking for the presence of a field in the proto?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need the new version constant (and all the refactoring to be able to use it).
New client version constant is needed when server has to decide whether it should send some piece of information (e.g. a new type of replication config, new bucket type, or new ports) to the client.
If server only wants to decide whether to perform some server-side processing, it can do so based on request content received from client. Like if client sends chunk data as a list, it can handle that, otherwise handle the single blob.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we want to enforce that the block is validated everytime we do a Getblock command or ReadChunk command for newer client versions. This is to ensure that the newer client doesn't have the same regression in some other flows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This failure for newer client would atleast raise an alarm when things fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have ensured to the best of knowledge taken care of all the existing flows. But in future, if we forget to set replicaIndex in some other request, it should fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining why the new version is required.
READ_CHUNK_CORRUPTION_FIX
is too generic.
Let's rename the constant to something like EC_REPLICA_INDEX_REQUIRED_IN_BLOCK_REQUEST
. This describes the expected client behavior.
Please also update the description text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EC_REPLICA_INDEX_REQUIRED_IN_BLOCK_REQUEST sounds good to me too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The approach looks good to me. I have a few questions I added as comments. Thanks for working on this.
Change-Id: Ifd60cb93ccb5e6ff7c81995734459719f6d5ecc3
Change-Id: I8f6e02e917dbc4d99294ca3c6d61109b1dafb897
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM when the find bugs are fixed and CI is green.
* Verify if request block BCSID is supported. | ||
* | ||
* @param container container object. | ||
* @param blockID requested block info | ||
* @throws IOException if cannot support block's blockCommitSequenceId |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: comments leftover from verifyBCSId
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -42,6 +42,9 @@ public enum ClientVersion implements ComponentVersion { | |||
"This client version has support for Object Store and File " + | |||
"System Optimized Bucket Layouts."), | |||
|
|||
ERASURE_CODING_READ_CHUNK_CORRUPTION_FIX(4, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need the new version constant (and all the refactoring to be able to use it).
New client version constant is needed when server has to decide whether it should send some piece of information (e.g. a new type of replication config, new bucket type, or new ports) to the client.
If server only wants to decide whether to perform some server-side processing, it can do so based on request content received from client. Like if client sends chunk data as a list, it can handle that, otherwise handle the single blob.
xceiverClient = xceiverClientFactory.acquireClientForReadData( | ||
pipelineSupplier.get()); | ||
updateDatanodeBlockId(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please store Pipeline pipeline = pipelineSupplier.get()
and pass it to updateDatanodeBlockId()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is one more pipelineSupplier.get()
leftover in updateDatanodeBlockId()
:
ozone/hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/ChunkInputStream.java
Line 301 in 1476d59
int replicaIdx = pipelineSupplier.get().getReplicaIndex(closestNode); |
public void setReplicaIndex(Integer replicaIndex) { | ||
this.replicaIndex = replicaIndex; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unused. If removed, replicaIndex
member can be final
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@adoroszlai We want to enforce replica index is sent by the newer clients all the time. This could cause another regression otherwise in the future if someone forgets to set the replica index. The proto parameter replicaIndex is optional, I see this server side check is the only way to achieve this. If there is some other way you think this could be done, I am open to suggestions. |
Thanks for the explanation. You are right, client version can be used to enforce different client behavior depending on version, too. |
I can add a unit test case to ensure, older client requests don't fail. |
Change-Id: If2af2959ab79ee0ffd275c21909c9dc86e1c2c38
Change-Id: I44de245342a7622d1c3fd4b49dd1db4c2bcac5b4
Change-Id: I9a8905dad8f4db731b110644fa03e9e80e1df859
Thanks @swamirishi for updating the patch. Can you please check test failures, they seem to be related? |
Change-Id: I822685eb14a347cd4dfa256a5c14d58ccad6b12e
Change-Id: If2ca384a0e026617fa8893c5dd306c3470971931
Change-Id: I9744461813f5b6729039becb0d02d3934a8908c4
Change-Id: Ib28d150af433ab1f298d7892cd3e243c7b509522
Change-Id: Id81b80a6c4d6137ac287022cd9386ad29ca70194
Change-Id: I2c01971613bf33a26507d2f6b464feb71bfe5d6e # Conflicts: # hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/container/TestContainerReplication.java
Change-Id: Icb4405b0f899b0a83babd1ea49221f4ffdedf2cd
@adoroszlai @sodonnel Can you take another look on the patch? I have fixed the acceptance test failures. I will run the CI once I get green ci run on my fork. For the getBlock request I am passing the replicaIndexMap since getBlock command is also sent to get the EC checksum of the blocks which means, even though each of the DN would have different replica index it would still have the same checksum info. The nodes in the pipeline would have different replica indexes. ozone/hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/XceiverClientManager.java Line 237 in 8db644c
xceiverClient.getPipeline() & pipelineRef.get() may have different replicaIndexes. This is the reason why the acceptance test was failing. |
Change-Id: I9fda20fc747f5989cc571ce7d088b5b449a2734f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @swamirishi for continuing work on this.
hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/storage/BlockInputStream.java
Outdated
Show resolved
Hide resolved
String traceId = TracingUtil.exportCurrentSpan(); | ||
if (traceId != null) { | ||
builder.setTraceID(traceId); | ||
} | ||
final DatanodeBlockID.Builder datanodeBlockID = blockID.getDatanodeBlockIDProtobufBuilder(); | ||
int replicaIndex = replicaIndexes.getOrDefault(datanode, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get the replica index from the pipeline here, instead of requiring a map to be passed?
int replicaIndex = replicaIndexes.getOrDefault(datanode, 0); | |
int replicaIndex = xceiverClient.getPipeline().getReplicaIndex(datanode); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No we cannot use this. My previous commit had the same thing. XceiverClient maintains a cache for the pipeline. 2 pipelines are same if the list of nodes are the same.
ozone/hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/XceiverClientManager.java
Line 237 in 8db644c
return clientCache.get(key, new Callable<XceiverClientSpi>() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So 2 GetBlock requests to the same pipeline can interfere with one another. This was the reason why acceptance test was failing previously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, so we need to use the refreshed pipeline because indexes may be different. In that case, I think we can still pass the pipeline instead of the replicaIndexes
map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this create a confusion between xceiverClient.getPipeline() vs the pipeline object we are sending. That is why I thought it would be just better to send the replicaIndexMap instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 containers having the same nodes but different replica index ordering
This should not be possible right? Placement policy should not allow this to place 2 different indexes in same node.Do we have a scenario?
Does this impact performance? for every getBlock we refreshPipeline ? Even though client has old index and if it is able to read same index block, then things should be fine right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a chat with @swamirishi offline. Current pipeline ids created per ECBlockOutStream is based on dn uuid. There is a chance two distinct containers can be part os same node and pipeline cache can get connection correctly but client also holds pipeline object, which can be pointing to other container rep index as part of dn->replIndex map.
@swamirishi please provide the description what we discussed here for better understanding. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The standalone pipeline getting initialized sets the pipeline id as the Datanode UUID
ozone/hadoop-hdds/client/src/main/java/org/apache/hadoop/ozone/client/io/ECBlockInputStream.java
Line 179 in c4dc6a0
.setId(PipelineID.valueOf(dataLocation.getUuid())).setReplicaIndexes( |
Consider the case:
DN1-uuid = 1 = Standalone PipelineId
Container 1(Block1), | Container 2(Block2)
DN1 => R3 | DN1 => R2
The XceiverClientManager maitains a cache based on the pipeline id. So for a standalone pipeline it would be nothing but the DN uuid itself. If the client for the DN has been initialized before then the same client would be reused. But the replicaIndexes for the pipelines are different. Now that we have added a validation on the replication index on the getBlock request. The get Block2 call to getting DN1 will fail if we depend on the pipeline object inside the xceiverClient object, since the pipeline object expects a replicaIndex 3 in DN1 but the actual block2 data present in the datanode is for replica index 2. So the server throws an exception because of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same could also occur when a refresh pipeline as well, wherein the DNs have changed the replica index, the client has updated the pipeline object but the cached XceiverClient has not updated the replicaIndexes. This is the same case @adoroszlai previously mentioned in the thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Swami for the explanation.
@@ -204,7 +205,7 @@ private DataStreamOutput setupStream(Pipeline pipeline) throws IOException { | |||
// it or remove it completely if possible | |||
String id = pipeline.getFirstNode().getUuidString(); | |||
ContainerProtos.ContainerCommandRequestProto.Builder builder = | |||
ContainerProtos.ContainerCommandRequestProto.newBuilder() | |||
getContainerCommandRequestProtoBuilder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The patch is huge (122K), hard to review. Size could be reduced by prefactoring:
- introduce this factory method
- replace all calls of
newBuilder()
in a separate patch, without any functional changes, before the fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok will raise a patch for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adoroszlai @sodonnel I have raised another patch since this patch is becomning too big to review #6812
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other PR ,merged now.
...tegration-test/src/test/java/org/apache/hadoop/ozone/container/TestContainerReplication.java
Show resolved
Hide resolved
Change-Id: I371912e557e4770b0bbd0f25e823a186224b79c5
Change-Id: I7868729f0910168e6ed4610943255c79b1b00012
Change-Id: I43551feebf091706bd9ab2bed5d0e671cf3d4077
Change-Id: Id88ef2be0dbbfa1502642f147e9004ffaaa83722
Change-Id: Ib736cce10e071142f9f9c060cc1cbb7e10d700a7
@adoroszlai @sodonnel Can you review this patch? Now that #6812 is merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @swamirishi for updating the patch.
@@ -42,6 +42,9 @@ public enum ClientVersion implements ComponentVersion { | |||
"This client version has support for Object Store and File " + | |||
"System Optimized Bucket Layouts."), | |||
|
|||
ERASURE_CODING_READ_CHUNK_CORRUPTION_FIX(4, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining why the new version is required.
READ_CHUNK_CORRUPTION_FIX
is too generic.
Let's rename the constant to something like EC_REPLICA_INDEX_REQUIRED_IN_BLOCK_REQUEST
. This describes the expected client behavior.
Please also update the description text.
ChunkBuffer data; | ||
try { | ||
BlockID blockID = BlockID.getFromProtobuf( | ||
request.getReadChunk().getBlockID()); | ||
ChunkInfo chunkInfo = ChunkInfo.getFromProtoBuf(request.getReadChunk() | ||
.getChunkData()); | ||
Preconditions.checkNotNull(chunkInfo); | ||
|
||
if (request.hasVersion() && request.getVersion() >= ERASURE_CODING_READ_CHUNK_CORRUPTION_FIX.toProtoValue()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please create a helper function for this condition (and reuse in handleGetBlock
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
BlockData getBlock(Container container, BlockID blockID, boolean isReplicaCheckRequired) | ||
throws IOException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the caller should verify replicaIndex as needed, instead of overloading getBlock
. Verification requires only container
and blockID
, both inputs of the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought since the function verifys BcsID, it would be logical to verifyReplicaIndex as well. Since like how BCSId is for Ratis. ReplicaIndex is for EC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point and having those two verifications in the same place has some merit. But bcsID is always verified, while replicaIndex verification is conditional on yet another parameter. So it is entirely the responsibility of the caller. I think this, and the need to increase the size of the BlockManager
interface makes having the check inside getBlock
worse than having it outside.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately in today's date this is conditional because of the need to support the older clients. We can eventually remove this check when we deprecate older clients and always perform the replica index check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would atleast make the caller aware that there is a replicaIndex check that should happen for EC and would not lead to future regressions. Anyhow there is a default function they need not always make a call to the 3 parameter function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adoroszlai I have made the change and moved the check outside and removed the boolean parameter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we would like to make verifyReplicaIdx
and verifyBCSId
calls consistent (i.e. happen in the same places), we should move verifyBCSId
out of getBlock
.
Current call hierarchy:
verifyBCSId
> BlockManagerImpl.getBlock
> BlockManagerImpl.getBlock
> FilePerChunkStrategy.readChunk
> KeyValueHandler.handleGetSmallFile
> KeyValueHandler.handleReadChunk
> KeyValueHandler.handleGetSmallFile
> KeyValueHandler.handleGetBlock
> KeyValueHandler.handleGetCommittedBlockLength
> KeyValueHandler.handleReadChunk
So all callers originate from 4 methods of KeyValueHandler
. I think that's where verification should happen. Some of them already do, as shown above.
Verifying either bcsID
or replicaIndex
multiple times for the same handleReadChunk
operation is unnecessary.
handleGetSmallFile
calls the old getBlock
, but I think it should also verifyReplicaIndex
.
So the two verify...
calls only need to be added in handleGetBlock
and handleGetSmallFile
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can handleGetSmall have EC replication?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there is no EC replication there we are going to be fetching the data as it is we don't need the verify replica index for Small files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can take up this refactoring task item as part of another patch.
Change-Id: I80025cd3e99c3cf44887e85871261467323d29b9
Change-Id: If7339e4092797573a63d861092fe053b3653454e
*/ | ||
public static void verifyReplicaIdx(Container container, BlockID blockID) | ||
throws IOException { | ||
Integer containerReplicaIndex = container.getContainerData().getReplicaIndex(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we verify replicaIndex only when container is EC?
If this check is happening with any container, can you think what happens when containers upgraded from old version where there are no replica index persisted? I think we are not persisting replica index for Ratis containers, so replica index will be null for them? Can you verify auto conversion are safe from Integer to primitive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could avoid conversion, null
values, etc. with a dedicated value (e.g. -1) for "not set". I didn't want to nitpick on that. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, currently it is assigned replicaIndex to Integer object, so it will attempt to do auto conversion for if check validation in next line. so I am worried whether we are taking care of that, otherwise the code like the below can hit NPE. ( ex: Integer i = null; if(i>0) hits NPE)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Container uses primitive for replicaIndex, so NPE is not a concern here.
Lines 202 to 203 in 59560a1
public int getReplicaIndex() { | |
return replicaIndex; |
However, it is unnecessarily boxed, only to be able to use equals
later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
container.getContainerData().getReplicaIndex() returns a primitive int. So there is no chance of NPE. I am converting primitive to non primitive Integer. I wanted to use .equals method so that I don't have to do the null check for blockID.getReplicaIndex() which can return a null value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks for the details.
Thanks for reviews on the patch @adoroszlai @sodonnel @umamaheswararao |
…r in DN mismatches (apache#6779) (cherry picked from commit 769d09e)
…r in DN mismatches (apache#6779) (cherry picked from commit 769d09e)
…r in DN mismatches (apache#6779) (cherry picked from commit 769d09e)
…r in DN mismatches (apache#6779) (cherry picked from commit 769d09e)
…r in DN mismatches (apache#6779) (cherry picked from commit 769d09e)
…r in DN mismatches (apache#6779) (cherry picked from commit 769d09e)
What changes were proposed in this pull request?
The ReadChunk & GetBlock APIs don't validate replicaIndex parameter in case of replica index mismatch. This leads to read corruption when there is an already existing open stream, when background SCM service moves container data around datanodes.
The patch aims to add a validation check on the Datanode on the getBlock request and also pass replication index as part of the GetBlock and ReadChunk request.
Things done in the patch:
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10983
How was this patch tested?
Unit Tests and Integration tests