Use non-blocking claimToken in fs.go prior to running du and find #2022

gautamdivgi · 2018-08-16T18:33:11Z

This comes from issue kubernetes/kubernetes#61999. One problem with the blocking claimToken in https://github.com/google/cadvisor/blob/master/fs/fs.go#L62 is that du and find requests can back up behind it. If claimToken is made non-blocking and returns an error when no token is available, the du and find requests are simply deferred, so a backlog never builds up. Disk and inode usage collection is a valuable metric, but we don't need to compromise kubelet health for it.

The change seems simple enough in fs/fs.go.

// claimToken takes a token from the pool without blocking.
// It returns an error if no token is currently available.
func claimToken() error {
	select {
	case <-pool:
		return nil
	default:
		return fmt.Errorf("failed to claim token, deferring usage collection")
	}
}

func (self *RealFsInfo) GetDirDiskUsage(dir string, timeout time.Duration) (uint64, error) {
	// No token available: skip this round instead of queueing behind other du calls.
	if err := claimToken(); err != nil {
		return 0, err
	}
	defer releaseToken()
	return GetDirDiskUsage(dir, timeout)
}

func (self *RealFsInfo) GetDirInodeUsage(dir string, timeout time.Duration) (uint64, error) {
	// No token available: skip this round instead of queueing behind other find calls.
	if err := claimToken(); err != nil {
		return 0, err
	}
	defer releaseToken()
	return GetDirInodeUsage(dir, timeout)
}
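To make the intended semantics concrete, here is a minimal, self-contained sketch of a non-blocking token pool. It is not cAdvisor code: `newPool`, `tryClaim`, `release`, and `errNoToken` are hypothetical names for illustration, and the pool is assumed to be a buffered channel pre-filled with tokens.

```go
package main

import (
	"errors"
	"fmt"
)

// maxConcurrentOps mirrors the limit of 20 mentioned in this thread.
const maxConcurrentOps = 20

// errNoToken is a hypothetical sentinel error, not part of cAdvisor.
var errNoToken = errors.New("failed to claim token, deferring usage collection")

// newPool returns a channel pre-filled with n tokens (assumed layout).
func newPool(n int) chan struct{} {
	pool := make(chan struct{}, n)
	for i := 0; i < n; i++ {
		pool <- struct{}{}
	}
	return pool
}

// tryClaim takes a token if one is available and never blocks.
func tryClaim(pool chan struct{}) error {
	select {
	case <-pool:
		return nil
	default:
		return errNoToken
	}
}

// release returns a token to the pool.
func release(pool chan struct{}) {
	pool <- struct{}{}
}

func main() {
	pool := newPool(maxConcurrentOps)
	for i := 0; i < maxConcurrentOps; i++ {
		if err := tryClaim(pool); err != nil {
			fmt.Println("unexpected:", err)
		}
	}
	fmt.Println(tryClaim(pool)) // all tokens are held, so this reports the error
	release(pool)
	fmt.Println(tryClaim(pool)) // a token is free again, so this prints <nil>
}
```

A sentinel error lets the caller tell "deferred, try again next cycle" apart from a real du or find failure, which may matter when deciding whether to log it.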


dashpole · 2018-08-16T18:49:45Z

How does a queue of du and find requests affect the kubelet's health? cAdvisor doesn't do housekeeping in parallel, so you can have at most one queued du or find call for each container.

gautamdivgi · 2018-08-17T18:14:00Z

This came from PR #1576, where an fsHandler is created per container. If I have N containers, that means N fsHandlers, each running a loop to trackUsage https://github.com/google/cadvisor/blob/master/container/common/fsHandler.go#L111. These all compete for the 20 maxConcurrentOps tokens (https://github.com/google/cadvisor/blob/master/fs/fs.go#L51). So even though cAdvisor doesn't do housekeeping in parallel, I think you will have a backlog of N-20 waiting on a "token".

I have typically seen the high wait times in kubernetes/kubernetes#61999 with a large number of kubelet threads (>1000) with most of them stuck on a futex_wait_queue_me system call.

But I guess I may owe you an experiment here.

piaoyu · 2018-09-28T05:48:10Z

What HDFS does here may be a useful reference: https://issues.apache.org/jira/browse/HADOOP-9884

gautamdivgi mentioned this issue Aug 16, 2018

Recurring high iowait due to kubelet 'du' process kubernetes/kubernetes#61999

Closed
