
rbd: add capability to automatically enable read affinity #3639

Merged
2 commits merged into ceph:devel on Feb 14, 2023

Conversation

Rakshith-R (Contributor)

Describe what this PR does

  • util: add new cmdline args to enable read affinity

This commit adds the new cmdline args enable-read-affinity
and crush-location-labels to facilitate the new feature of
using CRUSH location labels, derived from node labels, to enable
read affinity via Ceph's read_from_replica=localize
option. This option redirects reads to the nearest OSD.
It is currently used only during the rbd map cmd.

  • rbd: add capability to automatically enable read affinity

This commit makes use of CRUSH location labels from node
labels to supply the crush_location and read_from_replica=localize
options during the rbd map cmd. Using these options, Ceph
can redirect reads to the closest OSD,
improving read performance.

  • docs: add documentation regarding read affinity

This commit adds documentation about read affinity support
for rbd volumes.
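The flow described in the commit messages can be sketched as a small program. This is a hedged illustration, not the actual ceph-csi implementation (the real code lives in internal/util/crushlocation.go and internal/rbd/rbd_attach.go); the helper names getCrushLocationMap and crushLocationOptions are made up here, and keys are sorted only to keep the example deterministic:

```go
// Hypothetical sketch: turning Kubernetes node labels into the krbd
// options string described in the commits above.
package main

import (
	"fmt"
	"sort"
	"strings"
)

// getCrushLocationMap picks the configured label keys out of the node's
// labels and keys the result by the part after "/" (the CRUSH bucket type).
func getCrushLocationMap(crushLabels []string, nodeLabels map[string]string) map[string]string {
	out := map[string]string{}
	for _, l := range crushLabels {
		v, ok := nodeLabels[l]
		if !ok {
			continue
		}
		typ := l
		if i := strings.IndexRune(l, '/'); i != -1 {
			typ = l[i+1:]
		}
		// dots are not valid in CRUSH bucket names, so replace them with dashes
		out[typ] = strings.ReplaceAll(strings.TrimSpace(v), ".", "-")
	}
	return out
}

// crushLocationOptions renders the map as the rbd map options string.
func crushLocationOptions(m map[string]string) string {
	if len(m) == 0 {
		return ""
	}
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order for the example only
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+":"+m[k])
	}
	return "read_from_replica=localize,crush_location=" + strings.Join(parts, "|")
}

func main() {
	m := getCrushLocationMap(
		[]string{"topology.kubernetes.io/zone"},
		map[string]string{"topology.kubernetes.io/zone": "east"},
	)
	fmt.Println(crushLocationOptions(m))
}
```

For a node labeled topology.kubernetes.io/zone=east this yields read_from_replica=localize,crush_location=zone:east, matching the option string shown in the documentation excerpt reviewed below.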

@Rakshith-R Rakshith-R requested review from nixpanic and Madhu-1 and removed request for nixpanic February 2, 2023 12:05
@mergify mergify bot added the component/rbd Issues related to RBD label Feb 2, 2023
Review threads (outdated, resolved) on: docs/deploy-rbd.md, internal/rbd/rbd_attach.go
nixpanic (Member) left a comment:

How can this be tested? Some steps/guidance and complete examples for that should probably be documented too.

topology map[string]string
topology map[string]string
// crushLocationMap that will be used to enable read affinity.
CrushLocationMap map[string]string
Member:

does this need to be exported? Ideally it isn't.

Contributor (author):

does this need to be exported? Ideally it isn't.

Yes, it needs to be exported so that it is accessible to the individual drivers for use in RPC function calls.

Member:

Why is then this different from topology?

Contributor (author):

Why is then this different from topology?

Currently, the processing required for topology is identical for cephfs and rbd (the topology feature is itself in beta, with lots of TODOs),
whereas the processing required for CrushLocation may differ between cephfs and rbd (only rbd supports read affinity at the moment). Therefore, it's better to let individual drivers handle it.

Member:

The CSIDriver struct does not expose any attributes at the moment. It would be better to provide accessor/get function(s) for it instead. Sort of what you have in internal/util/crushlocation.go, but maybe going even further and generating the crush_location= rbd commandline option?

Review threads (outdated, resolved) on: internal/rbd/rbd_util.go, internal/rbd/rbd_attach.go
Rakshith-R (Contributor, author):

How can this be tested? Some steps/guidance and complete examples for that should probably be documented too.

I've documented it here https://github.com/ceph/ceph-csi/pull/3639/files#diff-ffd3d77379c6dad83a2a35da3df2161ac6e62ee31396ce0657ba06662ef0b965 .

We just need to add labels to the kubernetes node and set those cmdline args on the rbd nodeplugin daemonset container, and the
required options will be passed during the rbd map command.

The rendered form will look like this https://github.com/ceph/ceph-csi/blob/37925274b7de1150ca0adc79f703698d89d539ad/docs/deploy-rbd.md#read-affinity-using-crush-locations-for-rbd-volumes
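As a hedged illustration of those two steps (the label name and its value are made up for the example; the flags are the ones added by this PR):

```yaml
# Hypothetical excerpt: args for the "csi-rbdplugin" container in the rbd
# nodeplugin daemonset. The node itself would first be labeled, e.g.:
#   kubectl label node <node-name> topology.kubernetes.io/zone=east
args:
  - "--enable-read-affinity=true"
  - "--crush-location-labels=topology.kubernetes.io/zone"
```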

@idryomov @travisn , Can you please take a look too?

Madhu-1 (Collaborator) left a comment:

Good to add an e2e test for this one, to make sure the labels are set on the mapped rbd device.

@@ -81,6 +81,13 @@ func init() {
"",
"list of kubernetes node labels, that determines the topology"+
" domain the node belongs to, separated by ','")
flag.BoolVar(&conf.EnableReadAffinity, "enable-read-affinity", false, "enable read affinity")
flag.StringVar(
Collaborator:

Please document that setting this will apply the same affinity to all the Ceph clusters that cephcsi is connected to.

Contributor (author):

Please document that setting this will apply the same affinity to all the Ceph clusters that cephcsi is connected to.

I think that is implied. There's no statement informing users otherwise.

Collaborator:

Having a note doesn't hurt, and it helps users understand the impact of it.

Member:

A single Ceph-CSI deployment supports multiple Ceph clusters... What is the reason to have this as a commandline parameter, and not a setting in the StorageClass or the ConfigMap?

Contributor (author):

A single Ceph-CSI deployment supports multiple Ceph clusters... What is the reason to have this as a commandline parameter, and not a setting in the StorageClass or the ConfigMap?

It's simpler to have it as a commandline parameter.
In most cases, clusters will share the same crush map characteristics, and this design
should be sufficient.

StorageClasses are immutable, therefore existing rbd volumes would not be able to use this feature.

The current implementation covers most of the cases, and there should be no need to have this in
a configmap with unique labels per cluster. If such a need arises in the future, it will be easy to
override the cmdline labels and prefer per-cluster labels in the configmap.

One more reason is performance: using cmdline args lets us restart the container and
fetch the labels only at restart time, instead of having to fetch them for each map/mount request.

Member:

It's simpler to have it as a commandline parameter.

For whom? You as a developer, or administrators deploying Ceph-CSI? I'd guess a StorageClass+ConfigMap option would be much easier for administrators.

I have no idea if CRUSH maps are commonly the same on different clusters. My assumption would be that it is highly unlikely, and they would be different per cluster.

IMHO, putting this change in the StorageClass would be suitable, as a SC already refers to a particular cluster.

In case of users that upgrade and want to benefit from this feature, placing it in the ConfigMap should work too?

Fetching labels of nodes needs to be dynamic anyway, I guess? What if a cluster changes, nodes get removed, added, relocated? Autoscaling a Kubernetes cluster is a thing, so fetching the labels only at startup of the provisioner isn't correct.

Rakshith-R (Contributor, author), Feb 9, 2023:

It's simpler to have it as a commandline parameter.

For whom? You as a developer, or administrators deploying Ceph-CSI? I'd guess a StorageClass+ConfigMap option would be much easier for administrators.

Both; I'd like to think of this feature as set-and-forget.
The topology of a cluster with existing nodes rarely changes. OSDs are assigned using topology labels on the node.
rbd volumes use the same labels (with cephcsi's help) to figure out their nearest OSD for read affinity.

I have no idea if CRUSH maps are commonly the same on different clusters. My assumption would be that it is highly unlikely, and they would be different per cluster.

https://docs.ceph.com/en/quincy/rados/operations/crush-map/#types-and-buckets

We really only care about the CRUSH map type and value, not the exact structure.
If the cluster is situated in Kubernetes, then
OSDs are assigned according to the topology labels of the node. This is the same set of topology labels that cephcsi will use to extract the crush location and pass it on to the device while mapping.

CRUSH maps will not be very different from each other among Ceph clusters running on the same Kubernetes cluster.

IMHO, putting this change in the StorageClass would be suitable, as a SC already refers to a particular cluster.

Existing PVCs/StorageClasses would not be able to take advantage of this feature.

In case of users that upgrade and want to benefit from this feature, placing it in the ConfigMap should work too?

Fetching labels of nodes needs to be dynamic anyway, I guess? What if a cluster changes, nodes get removed, added, relocated? Autoscaling a Kubernetes cluster is a thing, so fetching the labels only at startup of the provisioner isn't correct.

This happens in the nodeplugin pod and is used during mapping.
In any of the scenarios described above, the cephcsi nodeplugin pod will be restarted and we will fetch the labels again.
Each nodeplugin pod cares only about its own node's labels.

  • It also avoids repeating the same labels in each storageclass / per-cluster configmap section.

Member:

I guess we can always move it to the ConfigMap if more flexibility is required.

Right, of course it is only used in the node-plugin, so changes to the number of nodes in the Kubernetes cluster is handled automatically.

Review threads (outdated, resolved) on: docs/deploy-rbd.md, internal/rbd/rbd_attach.go, internal/rbd/rbd_attach_test.go, internal/util/crushlocation.go, internal/util/crushlocation_test.go
@@ -168,7 +182,7 @@ func (r *Driver) Run(conf *util.Config) {
if err != nil {
Rakshith-R (Contributor, author), Feb 3, 2023:

@Madhu-1 @nixpanic
I am removing this segment of code.

if !conf.IsControllerServer && !conf.IsNodeServer {
topology, err = util.GetTopologyFromDomainLabels(conf.DomainLabels, conf.NodeID, conf.DriverName)
if err != nil {
log.FatalLogMsg(err.Error())
}
r.ns, err = NewNodeServer(r.cd, conf.Vtype, topology)
if err != nil {
log.FatalLogMsg("failed to start node server, err %v\n", err)
}
r.cs = NewControllerServer(r.cd)
}

Changes to the standalone controllerserver and nodeserver if statements have not been propagated to this part for a long time (see the krbd feature and the replication server),
and I don't think this section is necessary.

The complexity of this function will go down and it'll be much clearer.
wdyt?

Collaborator:

Sounds good to remove it if it's not required and make the function less complex.

Madhu-1 previously approved these changes Feb 3, 2023
Madhu-1 (Collaborator) left a comment:

LGTM. It would be good if you can add an E2E test for this one, to make sure the functionality works and never breaks in the future.

Comment on lines -166 to -176
if !conf.IsControllerServer && !conf.IsNodeServer {
topology, err = util.GetTopologyFromDomainLabels(conf.DomainLabels, conf.NodeID, conf.DriverName)
if err != nil {
log.FatalLogMsg(err.Error())
}
r.ns, err = NewNodeServer(r.cd, conf.Vtype, topology)
if err != nil {
log.FatalLogMsg("failed to start node server, err %v\n", err)
}
r.cs = NewControllerServer(r.cd)
}
Collaborator:

This removes the possibility of running cephcsi as both controller and provisioner; good to add this to the release notes.

Contributor (author):

Done, added a note in tracker issue #3336 (comment).

Opened a todo issue for the e2e and helm options:
#3642

}

var b strings.Builder
b.WriteString("read_from_replica=localize,crush_location=")
Collaborator:

Maybe use the input string as a const?

Contributor (author):

Maybe use the input string as a const?

This is currently used in only one place; let's do that when we use it in multiple places.
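For reference, the reviewer's suggestion would amount to something like the following sketch. This is hypothetical (the constant name readAffinityMapOptions and helper buildOption are made up; the actual code writes the literal directly into a strings.Builder, as quoted above):

```go
// Hypothetical sketch: hoist the option prefix into a named constant
// instead of a string literal inside the function.
package main

import (
	"fmt"
	"strings"
)

const readAffinityMapOptions = "read_from_replica=localize,crush_location="

// buildOption joins pre-formatted "type:value" pairs onto the prefix.
func buildOption(pairs []string) string {
	var b strings.Builder
	b.WriteString(readAffinityMapOptions)
	b.WriteString(strings.Join(pairs, "|"))
	return b.String()
}

func main() {
	fmt.Println(buildOption([]string{"zone:east"}))
}
```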

@Rakshith-R Rakshith-R requested review from humblec and a team February 6, 2023 10:36
@@ -68,6 +68,15 @@ spec:
# and pass the label names below, for CSI to consume and advertise
# its equivalent topology domain
# - "--domainlabels=failure-domain/region,failure-domain/zone"
#
# Options to enable read affinity.
# If enabled Ceph CSI will fetch labels from kubernetes node
Contributor:

Suggested change
# If enabled Ceph CSI will fetch labels from kubernetes node
# If enabled Ceph CSI will fetch labels from kubernetes node and

cmd/cephcsi.go Outdated
"crush-location-labels",
"",
"list of kubernetes node labels, that determines the"+
" crush location the node belongs to, separated by ','")
Contributor:

CRUSH is an abbreviation so it should be capitalized in the option descriptions and documentation:

Suggested change
" crush location the node belongs to, separated by ','")
" CRUSH location the node belongs to, separated by ','")

args: input{
crushLocationLabels: "topology.io/zone,topology.io/rack",
nodeLabels: map[string]string{
"topology.io/region": "zone1",
Contributor:

Suggested change
"topology.io/region": "zone1",
"topology.io/region": "region1",

nameIdx := strings.IndexRune(key, keySeparator)
crushLocationType := strings.TrimSpace(key[nameIdx+1:])
// replace "." with "-" to satisfy ceph crush map.
value = strings.Replace(strings.TrimSpace(value), ".", "-", -1)
Contributor:

Is this really needed? Are dots in zone and region names expected?

If nothing else, these kinds of transformations should be highlighted in the documentation.

Contributor (author):

Is this really needed? Are dots in zone and region names expected?

Yes, there may be dots in the values; there's no such restriction on the Kubernetes side.

If nothing else, these kinds of transformations should be highlighted in the documentation.

This is adapted from Rook code, which said the same thing.
I'll add a note in the documentation.

to Ceph CSI RBD daemonset pod "csi-rbdplugin" container, resulting in Ceph CSI adding
`"--options read_from_replica=localize,crush_location=zone:east"` krbd options during
rbd map operation.
If enabled, this option will be added to all RBD volumes mapped by Ceph CSI.
Contributor:

It would be nice to also note that, for this setting to be effective, the OSDs should be assigned to the same zones, regions, etc respectively on the cluster side (in the CRUSH map). Just declaring that a particular node belongs to a particular zone on the client side is insufficient.

Contributor (author):

I've modified the first sentence to elaborate on it,
but users should find more info on the Ceph krbd map options page, which is linked in the second sentence.

Comment on lines 391 to 392
// appendCrushLocationMapOptions parses crushLocationMap to the format
// "--read_from_replica=localize,--crush_location=key1:value1|key3:value2"
Contributor:

Actual parsing happens in getCrushLocationMap:

Suggested change
// appendCrushLocationMapOptions parses crushLocationMap to the format
// "--read_from_replica=localize,--crush_location=key1:value1|key3:value2"
// appendCrushLocationMapOptions uses crushLocationMap to the format
// "read_from_replica=localize,crush_location=key1:value1|key2:value2"

Contributor (author):

Changed it to "converts" now.

{
name: "filled mapOptions and crushLocationMap",
args: input{
mapOptions: "read-only=true",
Contributor:

It doesn't matter for this unit test, but all krbd options use underscores (so read_only, not read-only). Someone grepping for mapOptions examples in the future may get confused.

Contributor (author):

https://docs.ceph.com/en/latest/man/8/rbd/#kernel-rbd-krbd-options

Changed it to notrim instead, to avoid confusion.

@mergify mergify bot dismissed Madhu-1’s stale review February 6, 2023 12:30

Pull request has been modified.

github-actions triggered CI:

/test ci/centos/mini-e2e-helm/k8s-1.26
/test ci/centos/mini-e2e/k8s-1.23
/test ci/centos/mini-e2e/k8s-1.24
/test ci/centos/mini-e2e/k8s-1.25
/test ci/centos/mini-e2e/k8s-1.26
/test ci/centos/upgrade-tests-cephfs
/test ci/centos/upgrade-tests-rbd

nixpanic (Member):

/retest ci/centos/mini-e2e/k8s-1.25


Rakshith-R (Contributor, author):

@Mergifyio refresh

mergify bot commented Feb 14, 2023:

refresh

✅ Pull request refreshed

@mergify mergify bot added ok-to-test Label to trigger E2E tests and removed ok-to-test Label to trigger E2E tests labels Feb 14, 2023
@mergify mergify bot merged commit db8320c into ceph:devel Feb 14, 2023
@mergify mergify bot removed the ok-to-test Label to trigger E2E tests label Feb 14, 2023
github-actions triggered CI:

/test ci/centos/k8s-e2e-external-storage/1.23
/test ci/centos/k8s-e2e-external-storage/1.24
/test ci/centos/k8s-e2e-external-storage/1.25
/test ci/centos/k8s-e2e-external-storage/1.26
/test ci/centos/mini-e2e-helm/k8s-1.23
/test ci/centos/mini-e2e-helm/k8s-1.24
/test ci/centos/mini-e2e-helm/k8s-1.25
/test ci/centos/mini-e2e-helm/k8s-1.26
/test ci/centos/mini-e2e/k8s-1.23
/test ci/centos/mini-e2e/k8s-1.24
/test ci/centos/mini-e2e/k8s-1.25
/test ci/centos/mini-e2e/k8s-1.26
/test ci/centos/upgrade-tests-cephfs
/test ci/centos/upgrade-tests-rbd

Labels: component/rbd (Issues related to RBD)
Projects: none
5 participants