Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-10685. Short circuit read support in Ozone #7236

Closed
wants to merge 1 commit into from

Conversation

ChenSammi
Copy link
Contributor

What changes were proposed in this pull request?

support short-circuit read for datanode.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10685

How was this patch tested?

new unit tests

  • TestDomainSocketFactory
  • TestXceiverServerDomainSocket
  • TestShortCircuitChunkInputStream

manual test with freon

sammi@SAMMICHEN-MB0 ozone-1.5.0-SNAPSHOT % bin/ozone --jvmargs -Djava.library.path=/Users/sammi/workspace/hadoop-ozone/hadoop-hdds/client/src/test/resources freon rk --num-of-volumes=1 --num-of-buckets=10 --num-of-keys=1000 --replication=ONE --replication-type=RATIS --validate-writes --validate-channel short-circuit --num-of-validate-threads=10 --key-size 1MB
SLF4J(W): Class path contains multiple SLF4J providers.
SLF4J(W): Found provider [org.slf4j.reload4j.Reload4jServiceProvider@18ef96]
SLF4J(W): Found provider [org.slf4j.simple.SimpleServiceProvider@6956de9]
SLF4J(W): See https://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J(I): Actual provider is of type [org.slf4j.reload4j.Reload4jServiceProvider@18ef96]
2024-09-24 15:37:01,844 [main] INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2024-09-24 15:37:01,877 [main] INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2024-09-24 15:37:01,877 [main] INFO impl.MetricsSystemImpl: ozone-freon metrics system started
2024-09-24 15:37:02,469 [main] INFO storage.DomainSocketFactory: Trying to load the custom-built native-hadoop library...
2024-09-24 15:37:02,469 [main] INFO storage.DomainSocketFactory: Loaded the native-hadoop library
2024-09-24 15:37:02,469 [main] INFO storage.DomainSocketFactory: short-circuit local reads is enabled within 294209 ns.
2024-09-24 15:37:02,516 [main] INFO freon.RandomKeyGenerator: Number of Threads: 10
2024-09-24 15:37:02,524 [main] INFO freon.RandomKeyGenerator: Number of Volumes: 1.
2024-09-24 15:37:02,525 [main] INFO freon.RandomKeyGenerator: Number of Buckets per Volume: 10.
2024-09-24 15:37:02,525 [main] INFO freon.RandomKeyGenerator: Number of Keys per Bucket: 1000.
2024-09-24 15:37:02,525 [main] INFO freon.RandomKeyGenerator: Key size: 1048576 bytes
2024-09-24 15:37:02,525 [main] INFO freon.RandomKeyGenerator: Buffer size: 4096 bytes
2024-09-24 15:37:02,525 [main] INFO freon.RandomKeyGenerator: validateWrites : true
2024-09-24 15:37:02,525 [main] INFO freon.RandomKeyGenerator: Number of Validate Threads: 10
2024-09-24 15:37:02,525 [main] INFO freon.RandomKeyGenerator: cleanObjects : false
2024-09-24 15:37:02,527 [main] INFO freon.RandomKeyGenerator: Data validation is enabled.
2024-09-24 15:37:02,527 [main] INFO freon.RandomKeyGenerator: Starting progress bar Thread.

 0.00% |█                                                                                                    |  0/10000 Time: 0:00:00|  2024-09-24 15:37:02,550 [pool-3-thread-1] INFO rpc.RpcClient: Creating Volume: vol-0-30931, with sammi as owner and space quota set to -1 bytes, counts quota set to -1
2024-09-24 15:37:02,616 [pool-3-thread-8] INFO rpc.RpcClient: Creating Bucket: vol-0-30931/bucket-6-04589, with server-side default bucket layout, sammi as owner, Versioning false, Storage Type set to DISK and Encryption set to false, Replication Type set to server-side default replication type, Namespace Quota set to -1, Space Quota set to -1 
2024-09-24 15:37:02,616 [pool-3-thread-2] INFO rpc.RpcClient: Creating Bucket: vol-0-30931/bucket-0-10826, with server-side default bucket layout, sammi as owner, Versioning false, Storage Type set to DISK and Encryption set to false, Replication Type set to server-side default replication type, Namespace Quota set to -1, Space Quota set to -1 
2024-09-24 15:37:02,616 [pool-3-thread-5] INFO rpc.RpcClient: Creating Bucket: vol-0-30931/bucket-3-67064, with server-side default bucket layout, sammi as owner, Versioning false, Storage Type set to DISK and Encryption set to false, Replication Type set to server-side default replication type, Namespace Quota set to -1, Space Quota set to -1 
2024-09-24 15:37:02,616 [pool-3-thread-7] INFO rpc.RpcClient: Creating Bucket: vol-0-30931/bucket-5-65340, with server-side default bucket layout, sammi as owner, Versioning false, Storage Type set to DISK and Encryption set to false, Replication Type set to server-side default replication type, Namespace Quota set to -1, Space Quota set to -1 
2024-09-24 15:37:02,616 [pool-3-thread-1] INFO rpc.RpcClient: Creating Bucket: vol-0-30931/bucket-9-04405, with server-side default bucket layout, sammi as owner, Versioning false, Storage Type set to DISK and Encryption set to false, Replication Type set to server-side default replication type, Namespace Quota set to -1, Space Quota set to -1 
2024-09-24 15:37:02,616 [pool-3-thread-4] INFO rpc.RpcClient: Creating Bucket: vol-0-30931/bucket-2-22496, with server-side default bucket layout, sammi as owner, Versioning false, Storage Type set to DISK and Encryption set to false, Replication Type set to server-side default replication type, Namespace Quota set to -1, Space Quota set to -1 
2024-09-24 15:37:02,619 [pool-3-thread-3] INFO rpc.RpcClient: Creating Bucket: vol-0-30931/bucket-1-81033, with server-side default bucket layout, sammi as owner, Versioning false, Storage Type set to DISK and Encryption set to false, Replication Type set to server-side default replication type, Namespace Quota set to -1, Space Quota set to -1 
2024-09-24 15:37:02,619 [pool-3-thread-6] INFO rpc.RpcClient: Creating Bucket: vol-0-30931/bucket-4-66996, with server-side default bucket layout, sammi as owner, Versioning false, Storage Type set to DISK and Encryption set to false, Replication Type set to server-side default replication type, Namespace Quota set to -1, Space Quota set to -1 
2024-09-24 15:37:02,621 [pool-3-thread-10] INFO rpc.RpcClient: Creating Bucket: vol-0-30931/bucket-8-83075, with server-side default bucket layout, sammi as owner, Versioning false, Storage Type set to DISK and Encryption set to false, Replication Type set to server-side default replication type, Namespace Quota set to -1, Space Quota set to -1 
2024-09-24 15:37:02,626 [pool-3-thread-9] INFO rpc.RpcClient: Creating Bucket: vol-0-30931/bucket-7-63535, with server-side default bucket layout, sammi as owner, Versioning false, Storage Type set to DISK and Encryption set to false, Replication Type set to server-side default replication type, Namespace Quota set to -1, Space Quota set to -1 
2024-09-24 15:37:02,817 [pool-3-thread-6] WARN impl.MetricsSystemImpl: ozone-freon metrics system already initialized!
2024-09-24 15:37:03,022 [pool-3-thread-4] INFO metrics.MetricRegistries: Loaded MetricRegistries class org.apache.ratis.metrics.dropwizard3.Dm3MetricRegistriesImpl
 0.10% |█                                                                                                    |  10/10000 Time: 0:00:01|  2024-09-24 15:37:03,773 [pool-5-thread-7] INFO scm.XceiverClientShortCircuit: Pipeline-df315eb9-30ee-4765-9e68-7fc16335ce48-XceiverClientShortCircuit is created
2024-09-24 15:37:03,775 [pool-5-thread-7] INFO storage.DomainSocketFactory: DomainSocket(fd=311,path=/Users/sammi/ozone_dn_socket) is created within 1572500 ns
 100.00% |█████████████████████████████████████████████████████████████████████████████████████████████████████|  10000/10000 Time: 0:00:24|  
2024-09-24 15:37:27,539 [main] INFO freon.RandomKeyGenerator: Data generation is completed
2024-09-24 15:37:27,539 [main] INFO freon.RandomKeyGenerator: Data validation is completed
2024-09-24 15:37:31,231 [Pipeline-df315eb9-30ee-4765-9e68-7fc16335ce48-XceiverClientShortCircuit-ReceiveResponse] INFO scm.XceiverClientShortCircuit: receiveResponseTask is closed with java.nio.channels.ClosedChannelException
2024-09-24 15:37:31,243 [main] INFO scm.XceiverClientShortCircuit: DomainSocket(fd=311,path=/Users/sammi/ozone_dn_socket) is closed for 249539d9-f678-4ecf-aa6d-fcbcea75f627(localhost/127.0.0.1)
2024-09-24 15:37:31,243 [Pipeline-df315eb9-30ee-4765-9e68-7fc16335ce48-XceiverClientShortCircuit-SendRequest] INFO scm.XceiverClientShortCircuit: sendRequestTask is interrupted

***************************************************
Status: Success
Git Base Revision: bab77db646e17201dd953cc9f0a805502d24ea45
Number of Volumes created: 1
Number of Buckets created: 10
Number of Keys added: 10000
Replication: RATIS/ONE
Average Time spent in volume creation: 00:00:00,008
Average Time spent in bucket creation: 00:00:00,014
Average Time spent in key creation: 00:00:03,689
Average Time spent in key write: 00:00:02,927
Total bytes written: 10485760000
Total number of writes validated: 10000
Writes validated: 100.0 %
Successful validation: 10000
Unsuccessful validation: 0
Total Execution time: 00:00:29,326
***************************************************

Still working on whether it's possible to test it with robot test.

@jojochuang jojochuang self-requested a review September 30, 2024 19:28
Copy link
Contributor

@jojochuang jojochuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Since this is a big PR, it'll take me more time to go through it entirely but here are some comments.

defaultValue = "600",
type = ConfigType.LONG,
description = "If some unknown IO error happens on Domain socket read, short circuit read will be disabled " +
"temporary for this period of time(seconds).",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

Suggested change
"temporary for this period of time(seconds).",
"temporarily for this period of time(seconds).",

type = ConfigType.SIZE,
description = "Buffer size of reader/writer.",
tags = { ConfigTag.CLIENT, ConfigTag.DATANODE })
private int shortCircuitBufferSize = 128 * 1024;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious to know how the default value is determined.

The HDFS "dfs.client.read.shortcircuit.buffer.size" which is similar, is 1MB in size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The short-circuit channel only exchange getBlock request and response. The value is determined by how big will a request and a response of a 256MB block size, 4MB chunk size, 16KB checksum size block. The request is round 500 bytes, and the response is around
30 + 53 * 64 + 6 * 16384 ~ 100k
block size (exclude chunks) - 30 bytes
chunk size (one checksums) - 53 bytes
one checksum size - 6 bytes

dataOut.write(bytes);
dataOut.flush();
}
// LOG.info("send {} {} bytes with id {}", type, bytes.length, key);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

port = dn.getPort(DatanodeDetails.Port.Name.STANDALONE).getValue();
InetSocketAddress addr = NetUtils.createSocketAddr(dn.getIpAddress(), port);
if (OzoneNetUtils.isAddressLocal(addr) &&
dn.getCurrentVersion() >= SHORT_CIRCUIT_READS.toProtoValue()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'll need compatibility and upgrade tests for this.

See #7110 for a reference to the upgrade tests
and #7130 for compat tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. It will be addressed with a new patch.

LOG.error("Short-circuit read is not enabled");
return null;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should abort if it's neither grpc nor short-circuit.

Comment on lines +30 to +32
import org.apache.hadoop.hdds.utils.IOUtils;
import org.apache.hadoop.hdfs.server.datanode.DataNode;
import org.apache.hadoop.hdfs.server.datanode.DatanodeUtil;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these org.apache.hadoop.hdfs.* dependency required? Ideally we should get away with HDFS.

Suggested change
import org.apache.hadoop.hdds.utils.IOUtils;
import org.apache.hadoop.hdfs.server.datanode.DataNode;
import org.apache.hadoop.hdfs.server.datanode.DatanodeUtil;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

Comment on lines +33 to +36
import org.apache.hadoop.hdfs.protocol.ExtendedBlock;
import org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier;
import org.apache.hadoop.hdfs.server.datanode.DataNode;
import org.apache.hadoop.hdfs.server.datanode.DatanodeUtil;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these org.apache.hadoop.hdfs.* dependencies are needed.

Suggested change
import org.apache.hadoop.hdfs.protocol.ExtendedBlock;
import org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier;
import org.apache.hadoop.hdfs.server.datanode.DataNode;
import org.apache.hadoop.hdfs.server.datanode.DatanodeUtil;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Comment on lines +26 to +33
import static org.apache.hadoop.hdfs.DFSConfigKeys.DFS_DATANODE_KERBEROS_PRINCIPAL_KEY;

/**
* Security protocol for a secure OzoneManager.
*/
@KerberosInfo(
serverPrincipal = DFS_DATANODE_KERBEROS_PRINCIPAL_KEY
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can these be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Comment on lines +112 to +113
} catch (IOException e) {
e.printStackTrace();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO an IOException should fail.

Comment on lines +158 to +159
} catch (IOException | InterruptedException e) {
e.printStackTrace();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here, too.

Copy link
Contributor

@jojochuang jojochuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also https://issues.apache.org/jira/browse/HDFS-16198
we will need to handle block token expirations in secure environment too.

public static final String OZONE_READ_SHORT_CIRCUIT = "ozone.client.read.short-circuit";
public static final boolean OZONE_READ_SHORT_CIRCUIT_DEFAULT = false;
public static final String OZONE_DOMAIN_SOCKET_PATH = "ozone.domain.socket.path";
public static final String OZONE_DOMAIN_SOCKET_PATH_DEFAULT = "/var/lib/ozone/dn_socket";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to document this property somewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan to add a document about this feature. Will explain all these new properties there.

Comment on lines +42 to +43
public static final String OZONE_READ_SHORT_CIRCUIT = "ozone.client.read.short-circuit";
public static final boolean OZONE_READ_SHORT_CIRCUIT_DEFAULT = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe this is redundant; See: shortCircuitEnabled.

Suggested change
public static final String OZONE_READ_SHORT_CIRCUIT = "ozone.client.read.short-circuit";
public static final boolean OZONE_READ_SHORT_CIRCUIT_DEFAULT = false;

Comment on lines +76 to +86
try {
request = ContainerProtos.ContainerCommandRequestProto.parseFrom(requestInBytes);
assertTrue(request.hasGetBlock());
assertEquals(ContainerProtos.Type.GetBlock, request.getCmdType());
assertEquals(containerID, request.getContainerID());
assertEquals(datanodeID, request.getDatanodeUuid());
assertEquals(localBlockID, request.getGetBlock().getBlockID().getLocalID());
assertEquals(bcsid, request.getGetBlock().getBlockID().getBlockCommitSequenceId());
} catch (InvalidProtocolBufferException e) {
throw new RuntimeException(e);
}
Copy link
Contributor

@jojochuang jojochuang Oct 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just let it throw exception and exit. Change method signature accordingly too.

Suggested change
try {
request = ContainerProtos.ContainerCommandRequestProto.parseFrom(requestInBytes);
assertTrue(request.hasGetBlock());
assertEquals(ContainerProtos.Type.GetBlock, request.getCmdType());
assertEquals(containerID, request.getContainerID());
assertEquals(datanodeID, request.getDatanodeUuid());
assertEquals(localBlockID, request.getGetBlock().getBlockID().getLocalID());
assertEquals(bcsid, request.getGetBlock().getBlockID().getBlockCommitSequenceId());
} catch (InvalidProtocolBufferException e) {
throw new RuntimeException(e);
}
request = ContainerProtos.ContainerCommandRequestProto.parseFrom(requestInBytes);
assertTrue(request.hasGetBlock());
assertEquals(ContainerProtos.Type.GetBlock, request.getCmdType());
assertEquals(containerID, request.getContainerID());
assertEquals(datanodeID, request.getDatanodeUuid());
assertEquals(localBlockID, request.getGetBlock().getBlockID().getLocalID());
assertEquals(bcsid, request.getGetBlock().getBlockID().getBlockCommitSequenceId());

Copy link
Contributor

@jojochuang jojochuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also take a look at https://issues.apache.org/jira/browse/HDFS-15202
where having multiple short circuit cache instances imporves throughput.

dfs.client.short.circuit.num

Copy link
Contributor

@jojochuang jojochuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adoroszlai
Copy link
Contributor

Is it possible to split this into subtasks? 220K patch is hard to review.

@adoroszlai adoroszlai marked this pull request as draft October 17, 2024 09:11
BlockID blockID = BlockID.getFromProtobuf(
request.getGetBlock().getBlockID());
if (replicaIndexCheckRequired(request)) {
BlockUtils.verifyReplicaIdx(kvContainer, blockID);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we keep this line to verify replica index?

}

@Override
public CompletableFuture<XceiverClientReply> watchForCommit(long index) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the latest master changed the interface defintion of this method. It is now returning XceiverClientReply instead.

@ChenSammi
Copy link
Contributor Author

See also https://issues.apache.org/jira/browse/HDFS-16198 we will need to handle block token expirations in secure environment too.

Same KeyValueHandler#handleGetBlock is used to process the getBlock request from the short-circuit channel. Block token will be verified. If its expires, block will not be opened at server side. We don't use shared memory in current implementation. So I think we should be fine.

@ChenSammi
Copy link
Contributor Author

I will seperate the patch into sub patches. Close this one.

@ChenSammi ChenSammi closed this Oct 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants