Use non-blocking claimToken in fs.go prior to running du and find #2022

Open
gautamdivgi opened this issue Aug 16, 2018 · 3 comments

Comments

@gautamdivgi

This follows from issue kubernetes/kubernetes#61999. One problem with the blocking claimToken in https://github.com/google/cadvisor/blob/master/fs/fs.go#L62 is that du and find requests can pile up behind it. If claimToken is made non-blocking and returns an error when no token is available, those du and find requests are simply deferred, so a backlog can never build up. Disk and inode usage are valuable metrics, but we don't need to compromise kubelet health to collect them.

The change in fs/fs.go seems simple enough:

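// claimToken takes a token from the pool without blocking. It returns an
// error when no token is free so the caller can defer usage collection.
func claimToken() error {
	select {
	case <-pool:
		return nil
	default:
		return fmt.Errorf("failed to claim token, deferring usage collection")
	}
}

func (self *RealFsInfo) GetDirDiskUsage(dir string, timeout time.Duration) (uint64, error) {
	// No token available: skip this round instead of queueing behind other du calls.
	if err := claimToken(); err != nil {
		return 0, err
	}
	defer releaseToken()
	return GetDirDiskUsage(dir, timeout)
}

func (self *RealFsInfo) GetDirInodeUsage(dir string, timeout time.Duration) (uint64, error) {
	// No token available: skip this round instead of queueing behind other find calls.
	if err := claimToken(); err != nil {
		return 0, err
	}
	defer releaseToken()
	return GetDirInodeUsage(dir, timeout)
}
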
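
To make the non-blocking behaviour concrete, here is a minimal standalone sketch of the same select/default pattern on a buffered channel. It is not cAdvisor code: tokenPool, newTokenPool, tryClaim, release and errNoToken are hypothetical names used only for illustration, and the pool size of 2 is arbitrary.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoToken is a hypothetical sentinel error; the name is not from cAdvisor.
var errNoToken = errors.New("no token available, deferring usage collection")

// tokenPool is a counting semaphore backed by a buffered channel.
type tokenPool chan struct{}

// newTokenPool returns a pool pre-filled with n tokens.
func newTokenPool(n int) tokenPool {
	p := make(tokenPool, n)
	for i := 0; i < n; i++ {
		p <- struct{}{}
	}
	return p
}

// tryClaim takes a token if one is free and returns errNoToken otherwise.
// The default case means it never blocks.
func (p tokenPool) tryClaim() error {
	select {
	case <-p:
		return nil
	default:
		return errNoToken
	}
}

// release puts a token back into the pool.
func (p tokenPool) release() { p <- struct{}{} }

func main() {
	pool := newTokenPool(2)
	// Three claims against two tokens: the third fails immediately
	// instead of waiting for a release.
	for i := 1; i <= 3; i++ {
		if err := pool.tryClaim(); err != nil {
			fmt.Printf("claim %d: %v\n", i, err)
			continue
		}
		fmt.Printf("claim %d: ok\n", i)
	}
	// Once a token is released, the next claim succeeds again.
	pool.release()
	fmt.Println("after release:", pool.tryClaim())
}
```

The caller sees the failure right away and can retry on its next collection interval, which is the deferral described above.
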
@dashpole
Collaborator

How does a queue of du and find requests affect the kubelet's health? cAdvisor doesn't do housekeeping in parallel, so you can have at most one queued du or find call for each container.

@gautamdivgi
Author

This came out of PR #1576, where an fsHandler is created per container. With N containers there are N fsHandlers, each running its own trackUsage loop (https://github.com/google/cadvisor/blob/master/container/common/fsHandler.go#L111). They all contend for the 20 maxConcurrentOps tokens (https://github.com/google/cadvisor/blob/master/fs/fs.go#L51). So even though cAdvisor doesn't do housekeeping in parallel, I think you can end up with a backlog of N-20 callers waiting on a "token".
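
The N-20 estimate can be checked with a small sketch. It assumes the worst case, where all N handlers ask for a token in the same burst and nobody releases one in the meantime. wouldWait is a hypothetical helper, not part of cAdvisor; only the limit of 20 is taken from the discussion above.

```go
package main

import "fmt"

// maxConcurrentOps mirrors the limit of 20 mentioned above.
const maxConcurrentOps = 20

// wouldWait counts how many of n simultaneous callers find the pool empty,
// assuming no caller releases its token during the burst.
func wouldWait(n, limit int) int {
	pool := make(chan struct{}, limit)
	for i := 0; i < limit; i++ {
		pool <- struct{}{}
	}
	waiting := 0
	for i := 0; i < n; i++ {
		select {
		case <-pool:
			// Token claimed; this caller would start du or find.
		default:
			// Pool empty; a blocking claimToken would park this caller here.
			waiting++
		}
	}
	return waiting
}

func main() {
	for _, n := range []int{10, 20, 100, 1000} {
		fmt.Printf("handlers=%d waiting=%d\n", n, wouldWait(n, maxConcurrentOps))
	}
}
```

Under that assumption the count is zero up to 20 handlers and N-20 beyond it. Whether real handlers ever line up this way is the open question the experiment would need to answer.
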

I have typically seen the high wait times from kubernetes/kubernetes#61999 on kubelets with a large number of threads (>1000), most of them stuck in futex_wait_queue_me.

But I guess I may owe you an experiment here.

@piaoyu

piaoyu commented Sep 28, 2018

HDFS does something similar that could serve as a reference: https://issues.apache.org/jira/browse/HADOOP-9884
