Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement metrics for cephfs CSI driver. #310

Merged
merged 1 commit into from
Jul 25, 2019
Merged

Conversation

humblec
Copy link
Collaborator

@humblec humblec commented Apr 4, 2019

Depends on

#kubernetes/kubernetes#76188
##439

Signed-off-by: Humble Chirammal hchiramm@redhat.com

@humblec humblec changed the title Implement metrics for ceph-csi driver. [WIP] [DNM]Implement metrics for ceph-csi driver. Apr 4, 2019
@Madhu-1 Madhu-1 added the DNM DO NOT MERGE label Apr 4, 2019
@humblec humblec force-pushed the metrics branch 3 times, most recently from 26316b6 to 5a12f32 Compare April 4, 2019 13:36
@mykaul
Copy link
Contributor

mykaul commented Apr 14, 2019

@humblec - any updates?

@ShyamsundarR
Copy link
Contributor

@humblec discussions around dropping credential passing to the plugins, and using what is passed down form the StorageClass make it so that, the NodeGetVolumeStats RPC will have to function without credentials. Noting it here as you may want to factor that into your implementation.

@humblec
Copy link
Collaborator Author

humblec commented May 5, 2019

@humblec - any updates?

This PR is waiting for kubernetes support . I will update the PR once the support is added in upstream

@anmolsachan
Copy link

Dependency merged to kubernetes kubernetes/kubernetes#76188

@humblec
Copy link
Collaborator Author

humblec commented Jun 4, 2019

Yeah, the dependency patch in kubernetes got merged on May 17th which was before 1.15 code freeze.

@humblec
Copy link
Collaborator Author

humblec commented Jun 27, 2019

Couple of days back the kubernetes dependency of our CSI repo has been upgraded to 1.15 which support metrics code base. This PR also depends on new CephFS volume provisioning

@humblec
Copy link
Collaborator Author

humblec commented Jul 17, 2019

The final dependency has also got merged last week and part of 1.1.0 release https://github.com/ceph/ceph-csi/releases . Revisiting this PR as all the dependencies are in.

@humblec humblec force-pushed the metrics branch 2 times, most recently from 8ffa64e to 2ab5c28 Compare July 17, 2019 04:55
@humblec humblec changed the title [WIP] [DNM]Implement metrics for ceph-csi driver. Implement metrics for ceph-csi driver. Jul 17, 2019
@humblec humblec force-pushed the metrics branch 6 times, most recently from d7a39f5 to e4e7600 Compare July 17, 2019 07:19
@humblec humblec added release-1.2.0 Priority-0 highest priority issue and removed DNM DO NOT MERGE labels Jul 17, 2019
@Madhu-1
Copy link
Collaborator

Madhu-1 commented Jul 17, 2019

is this for rbd also? if not please change the title and also add an E2E for this feature

@humblec humblec changed the title Implement metrics for ceph-csi driver. Implement metrics for cephfs CSI driver. Jul 17, 2019
@humblec
Copy link
Collaborator Author

humblec commented Jul 17, 2019

is this for rbd also? if not please change the title and also add an E2E for this feature

Its for CephFS , updated the title.


// getFsInfo writes metrics.Capacity, metrics.Used and metrics.Available from the filesystem info
func (md *cephfsStatFS) getFsInfo(metrics *volume.Metrics) error {
available, capacity, usage, inodes, inodesFree, inodesUsed, err := fs.FsInfo(md.path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@humblec (tagging @ajarr as well) The call to fs.FsInfo uses the syscall statfs. I checked that statfs on a CephFS subvolume mount, returns used space as per the set attrs on the subvolume. I wanted to know if the same applies to inodes, IOW, statfs fields,

fsfilcnt_t   f_files;   /* total file nodes in file system */
fsfilcnt_t   f_ffree;   /* free file nodes in fs */

and hence can be used to generate metrics as below.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ShyamsundarR not sure I got the comment, I would like to mention that, whatever statfs returns is good enough for us. If subvol is not responding properly to statfs its totally a different bug which Ceph has to take care. I dont think there is a bug though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@humblec I tested and determined that cephfs subvolume mounts as used in ceph-csi returns the correct total space as needed by us (i.e it sticks to the quota set on the subvolume). I am however not able to determine if it returns inode count information as required (or used space, which is a short test away if needed). Hence I need confirmation that it does, in case you know it or can test it, else from CephFS developers.

We need to understand statfs limitations if any for CephFS subvolumes, before interpreting it's output the way we want OR assuming any errors is a bug in CephFS subvolume implementation.

The provided image shows used space, which was never my concern as stated in the comment.

Hope this provides the required clarity.

Copy link
Collaborator Author

@humblec humblec Jul 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to understand statfs limitations if any for CephFS subvolumes, before interpreting it's output the way we want OR assuming any errors is a bug in CephFS subvolume implementation.

@ShyamsundarR If you mount the subvolume and call stat against, you should be able to verify it , Isnt it ? Agree that, its good to verify with cephfs team too.

@humblec
Copy link
Collaborator Author

humblec commented Jul 18, 2019

Btw @anmolsachan was testing this PR and we got the expected result in his testing.

@anmolsachan
Copy link

anmolsachan commented Jul 18, 2019

Btw @anmolsachan was testing this PR and we got the expected result in his testing.

yes, thanks Humble for the PR. Attaching screenshot

Screenshot from 2019-07-17 23-01-14

pkg/cephfs/cephfs_statfs.go Outdated Show resolved Hide resolved
pkg/cephfs/cephfs_statfs.go Outdated Show resolved Hide resolved
"k8s.io/kubernetes/pkg/volume/util/fs"
)

const (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@humblec I think this entire code is already there in https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/metrics_statfs.go can we just make use of it?

Copy link
Collaborator Author

@humblec humblec Jul 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Madhu-1 Its a mix or more than that with errors, It can also have ceph specific errors and can be enhanced for more functions. Thats why it has been modified and kept it separately. copyright carries for both for the same reason.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any way, this has been shrinked now.

pkg/cephfs/nodeserver.go Show resolved Hide resolved
pkg/cephfs/nodeserver.go Outdated Show resolved Hide resolved

if err != nil {
klog.Errorf("stat failed on path: %v, error: %v", targetPath, err)
return nil, status.Error(codes.Internal, err.Error())
Copy link
Collaborator

@Madhu-1 Madhu-1 Jul 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return invalid agrument if targetPath does not exist

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still this need to be handled

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@Madhu-1
Copy link
Collaborator

Madhu-1 commented Jul 19, 2019

@humblec please resolve the merge conflict

@humblec
Copy link
Collaborator Author

humblec commented Jul 19, 2019

@Madhu-1 Done. Thanks!

@ShyamsundarR
Copy link
Contributor

is this for rbd also? if not please change the title and also add an E2E for this feature

@humblec are we adding an E2E test for this? Or is this not feasible?

@humblec
Copy link
Collaborator Author

humblec commented Jul 19, 2019

is this for rbd also? if not please change the title and also add an E2E for this feature

@humblec are we adding an E2E test for this? Or is this not feasible?

No @ShyamsundarR. Its not feasible.

@ShyamsundarR
Copy link
Contributor

is this for rbd also? if not please change the title and also add an E2E for this feature

@humblec are we adding an E2E test for this? Or is this not feasible?

No @ShyamsundarR. Its not feasible.

What are the alternatives for the future? csi-test is one that I can see. Also feasibility is an issue because there are no callers to this?

@humblec
Copy link
Collaborator Author

humblec commented Jul 19, 2019

@ShyamsundarR we can add csi-test later for the RPC call.

@ShyamsundarR
Copy link
Contributor

@ShyamsundarR we can add csi-test later for the RPC call.

Yes I am also looking at the "future" and not at in this PR, but wanted clarity on the feasibility issue, so that we know how to tackle it later.

@humblec
Copy link
Collaborator Author

humblec commented Jul 19, 2019

@ShyamsundarR we can add csi-test later for the RPC call.

Yes I am also looking at the "future" and not at in this PR, but wanted clarity on the feasibility issue, so that we know how to tackle it later.

@ShyamsundarR Yeah, we could start with csi-test . Meanwhile I will think about the E2E possibilities, if any.


isMnt, err := isMountPoint(targetPath)

if err != nil {
Copy link
Collaborator

@Madhu-1 Madhu-1 Jul 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion

if err != nil{
		klog.Errorf("stat failed on path: %v, error: %v", targetPath, err)
     if os.IsNotExist(err){
         return nil, status.Error(codes.InvalidArgument, err.Error())
      }
     else{
           return nil, status.Error(codes.Internal, err.Error())
       }
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isMountPoint is always returning status.Error(codes.Internal, err.Error()) on errors, but the current error message does need some massaging around "stat failing etc."

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

humble can you just return an error here?

@Madhu-1
Copy link
Collaborator

Madhu-1 commented Jul 19, 2019

question:
@humblec if the implementation remains same for rbd also can we move this to csi-common?

@ShyamsundarR
Copy link
Contributor

question:
@humblec if the implementation remains same for rbd also can we move this to csi-common?

Implementation is common for RBD and CephFS as long as we are dealing with RWX/RWO volume types. VolumeID validation would differ for each, but that can be layered I guess. So yes there is commonality at the moment. Also, for block volumes we need to return only total size, and possibly should do so (when we implement the same).

For RBD we should not need secrets as we can get the image name from the VolumeID decomposing the image ID and pool ID and using the same to validate pool_id and 'name' contents at /sys/devices/rbd/X (where X is the device number) like in the call at getRbdDevFromImageAndPool. There is a slim chance that we have the same 'imagename' and 'poolID' across 2 clusters, that are mapped to the same node, but that seems slim. Tagging @dillaman to rectify any misconceptions.

For CephFS need to think a little more and see if we can map the mount options to check the subvolume name maps to the image name, but needs more thought to understand if secrets are not needed or are mandatory.

Copy link
Collaborator

@Madhu-1 Madhu-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for rbd, if it's required to move it to the csi-common folder please do it as it will not affect the functionality

@humblec
Copy link
Collaborator Author

humblec commented Jul 19, 2019

Thanks @Madhu-1 , RBD , I am still testing and will do the changes accordingly. @ShyamsundarR Can you review this in priority? Other teams are waiting for availability of this PR , so the urgency.

Copy link
Contributor

@ShyamsundarR ShyamsundarR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly error handling comments, and some message nits.

@humblec thoughts?


isMnt, err := isMountPoint(targetPath)

if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isMountPoint is always returning status.Error(codes.Internal, err.Error()) on errors, but the current error message does need some massaging around "stat failing etc."

pkg/cephfs/nodeserver.go Outdated Show resolved Hide resolved
pkg/cephfs/nodeserver.go Outdated Show resolved Hide resolved

if isMnt {
klog.Infof("cephfs: volume %s is already mounted to %s", volID, targetPath)
cephfsProvider := volume.NewMetricsStatFS(targetPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error check to ensure cephfsProvider is non-nil?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not required. never going to happen

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, this is not gonna happen.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because the signature does not have a return like (xxx, err) that we can assume this will not fail? Attempting to understand why this is not going to happen? Not a blocker for merge, but looking for information.

pkg/cephfs/nodeserver.go Outdated Show resolved Hide resolved
pkg/cephfs/nodeserver.go Outdated Show resolved Hide resolved
Available: available,
Total: capacity,
Used: used,
Unit: 1,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:
use VolumeUsage_BYTES instead of 1 and use VolumeUsage_INODES instead of 2

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@humblec
Copy link
Collaborator Author

humblec commented Jul 20, 2019

@Madhu-1 @ShyamsundarR comments are addressed. Please go through it, if something else is missing, we can followup.

@humblec
Copy link
Collaborator Author

humblec commented Jul 22, 2019

The travis is failing due to an issue at ceph container image #486

Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
@mergify mergify bot merged commit 561cc26 into ceph:master Jul 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Priority-0 highest priority issue release-1.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants