Using follow log output causes json-log file descriptors to leak #37391
I have been investigating a similar issue which I think has the same root cause. We have a dotnet core application that registers an inotify watch on a config file at startup. It sometimes fails at startup because it cannot create an inotify instance.
This relates to the sysctl limit on inotify instances. When I investigated what was using all the inotify instances, I found dockerd had the vast majority of the inotify file handles, with a correspondingly high number of stale file handles pointing to a container's *-json.log file which had since been deleted. For example, with 5 containers running:
and after restarting docker, the same 5 containers running:
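For reference, counting a process's inotify instances can also be done programmatically. Below is a hedged sketch (not moby code; it only relies on the Linux convention that an inotify fd's /proc/<pid>/fd symlink target reads "anon_inode:inotify"):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// countInotifyFDs counts the file descriptors of the given pid that are
// inotify instances. On Linux, /proc/<pid>/fd/<n> is a symlink whose
// target is "anon_inode:inotify" for inotify fds.
func countInotifyFDs(pid int) (int, error) {
	dir := fmt.Sprintf("/proc/%d/fd", pid)
	entries, err := os.ReadDir(dir)
	if err != nil {
		return 0, err
	}
	count := 0
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join(dir, e.Name()))
		if err != nil {
			continue // the fd may have been closed while iterating
		}
		if strings.Contains(target, "anon_inode:inotify") {
			count++
		}
	}
	return count, nil
}

func main() {
	pid := os.Getpid() // replace with dockerd's pid to inspect the daemon
	n, err := countInotifyFDs(pid)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("pid %d holds %d inotify instance(s)\n", pid, n)
}
```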
No assignees, no labels? Is this issue ignored or something?
Reproduced on latest moby (update: and latest docker/cli).
Let me summarize the current findings; I will continue looking at it tomorrow. I have tracked the issue down to the following code: moby/daemon/logger/loggerutils/logfile.go, lines 622 to 639 at commit 562df8c.
The goroutine is stuck at line 625 of moby/daemon/logger/loggerutils/logfile.go (at 562df8c).
The problem is I can't grok this code. The relevant change is commit a69a59f (PR #27782).
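To illustrate the failure mode in isolation, the following toy Go program (illustrative only, not the actual moby code) mimics the shape of that loop: once the consumer stops reading, the post-cancellation blocking send hangs forever, so the deferred cleanup never runs.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// follow mimics the shape of the old followLogs loop: on ctx.Done() it does
// not return immediately, but keeps sending remaining messages synchronously.
func follow(ctx context.Context, msgs chan<- string) {
	defer fmt.Println("cleanup ran") // stands in for the deferred fd/watch cleanup; never runs
	for i := 0; ; i++ {
		msg := fmt.Sprintf("line %d", i)
		select {
		case msgs <- msg:
		case <-ctx.Done():
			// With the consumer gone, this unguarded send blocks forever,
			// and the goroutine (plus everything it holds open) leaks.
			msgs <- msg
			return
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	msgs := make(chan string)

	go follow(ctx, msgs)

	fmt.Println(<-msgs) // consume a single message, then the consumer goes away
	cancel()

	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines still running:", runtime.NumGoroutine()) // the blocked follower is one of them
}
```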
OK, here is what happens:
Now, I see a few ways to solve it.
@@ -622,20 +634,7 @@ func followLogs(f *os.File, logWatcher *logger.LogWatcher, notifyRotate chan int
select {
case logWatcher.Msg <- msg:
case <-ctx.Done():
- logWatcher.Msg <- msg
- for {
- msg, err := decodeLogLine()
- if err != nil {
- return
- }
- if !since.IsZero() && msg.Timestamp.Before(since) {
- continue
- }
- if !until.IsZero() && msg.Timestamp.After(until) {
- return
- }
- logWatcher.Msg <- msg
- }
+ return
}
}
}
diff --git a/daemon/logger/logger.go b/daemon/logger/logger.go
index 912e855c7..a5a20d33b 100644
--- a/daemon/logger/logger.go
+++ b/daemon/logger/logger.go
@@ -123,6 +123,13 @@ func (w *LogWatcher) Close() {
// only close if not already closed
w.closeOnce.Do(func() {
close(w.closeNotifier)
+ // make sure to read tail from the channel until it's closed
+ // in order for the writer to not block
+ for {
+ if _, ok := <-w.Msg; !ok {
+ return
+ }
+ }
})
}
I did some limited testing -- both (1) and (2) fix the issue. Now, here's my humble opinion about those options. (1) looks scary, as it blatantly removes a chunk of code that presumably is/was useful. Still, I admit I fail to understand the meaning of it -- why is it trying to push some log messages to the consumer if the consumer has already signaled it is no longer interested in them? I have traced this code back to commit c0391bf (it is 3 years old so I'm not sure anyone remembers), and still can't figure out why. (2) looks ugly (why the hell do we write useless data in one place and read it in another?). It can only make sense in the case of a blocking write (like on line 625 in the code above), but if all the writes are non-blocking and include a proper check for context cancellation, there is no need whatsoever to read junk. (3) looks redundant; the rationale is the same as for (2). @cpuguy83 PTAL
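For contrast, the non-blocking-write idiom referred to in the previous comment looks roughly like this (a minimal sketch with illustrative names, not the actual moby code):

```go
package logsend

import "context"

// sendMsg delivers msg to the consumer but gives up as soon as the context
// is cancelled. Because every send is wrapped in a select that also watches
// ctx.Done(), the writer can never block forever, and nobody has to drain
// "junk" messages from the channel on the reader side.
func sendMsg(ctx context.Context, msgs chan<- string, msg string) bool {
	select {
	case msgs <- msg:
		return true
	case <-ctx.Done():
		return false
	}
}
```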
As I recall, at the time ... So I think what's happened here is we've mixed up (or potentially merged?) the client signaling "I'm done" and the container stopping.
Proposed fix: #37576. I am still not sure if it is totally correct and flawless, or whether it breaks anything, but at least it looks like a step in the right direction. Also, I have to admit I never knew logging could be this hard.
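To make the direction of the proposed fix concrete, here is a rough sketch of a LogWatcher with two independent "gone" signals and a follower loop that returns as soon as the consumer disconnects. All types and names here are illustrative; the real API in the PR may differ.

```go
package logwatch

import "sync"

// LogWatcher carries log messages from the follower to a single consumer
// and exposes two independent "gone" signals: one for the producer (the
// container) and one for the consumer (e.g. "docker logs -f").
type LogWatcher struct {
	Msg chan string

	producerOnce sync.Once
	producerGone chan struct{} // closed when the container stops

	consumerOnce sync.Once
	consumerGone chan struct{} // closed when the reader disconnects
}

func NewLogWatcher() *LogWatcher {
	return &LogWatcher{
		Msg:          make(chan string),
		producerGone: make(chan struct{}),
		consumerGone: make(chan struct{}),
	}
}

// ProducerGone signals that the container stopped. The code that waits for
// new data in the log file (not shown here) would watch this channel to
// stop tailing and let the message source run dry.
func (w *LogWatcher) ProducerGone() {
	w.producerOnce.Do(func() { close(w.producerGone) })
}

// ConsumerGone signals that the reader went away.
func (w *LogWatcher) ConsumerGone() {
	w.consumerOnce.Do(func() { close(w.consumerGone) })
}

// followLogs sketches the fixed loop. next stands in for the message source
// (e.g. decoding lines from the json log file); it returns false when there
// is nothing left to send.
func followLogs(w *LogWatcher, next func() (string, bool)) {
	for {
		msg, ok := next()
		if !ok {
			return
		}
		select {
		case w.Msg <- msg:
		case <-w.consumerGone:
			return
		}
	}
}
```

The key point is that every send on w.Msg is paired with a case on consumerGone, so a vanished reader can never pin the goroutine and its deferred cleanup.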
Simple but complete test case (requires dockerd to be run with "--debug"):
#!/bin/bash
set -e
set -o pipefail
set -u
numgor() {
docker info | awk '($1 == "Goroutines:") {print $2; exit}'
}
waitgor() {
local expected=$1
local retries=20
local duration=0.5
echo -n "Waiting for goroutines to finish "
while [ $retries -gt 0 ] ; do
local actual=$(numgor)
test $actual -le $expected && echo " OK" && return 0
let retries--
sleep $duration
echo -n .
done
echo " FAIL"
echo "Goroutine leak! Expecting <= $expected, got $actual" 1>&2
return 1
}
CTID=$(docker run -d --network=none busybox sh -c "yes somewhat long line should be here so here it goes")
sleep 1
GOR=$(numgor)
PIDS=
for i in $(seq 1 10); do
docker logs -f $CTID > /dev/null &
PIDS="$PIDS $!"
done
sleep 1
kill -9 $PIDS
sleep 1
if ! waitgor $GOR; then
echo look for leaks with ID $CTID
else
echo PASS
fi
docker rm -f $CTID >/dev/null
This is a test case for issue moby#37391. The idea is to start a container that produces lots of output, then run "docker logs --follow" on that container, then kill "docker logs" and check whether the numbers of
- daemon goroutines,
- daemon opened file descriptors
are back to whatever they were. Currently, this test reliably detects the leak.

PS here's what it takes to run the test case against the local daemon:
for i in busybox busybox:glibc hello-world debian:jessie; do docker pull $i; done
cd integration
ln -s ../Dockerfile .
cd container
go test -v -run TestLogsFollowGoroutineLeak

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When daemon.ContainerLogs() is called with options.follow=true (as in "docker logs --follow"), the loggerutils.followLogs() function never returns (even when the logs consumer is gone). As a result, all the resources associated with it (including an opened file descriptor for the log file being read, two FDs for a pipe, and two FDs for an inotify watch) are never released. If this is repeated (such as by running "docker logs --follow" and pressing Ctrl-C a few times), this results in a DoS caused by either hitting the limit of inotify watches, or the limit of opened files. The only cure is a daemon restart.

Apparently, what happens is:
1. The logs producer (a container) is gone, calling (*LogWatcher).Close() for all its readers (daemon/logger/jsonfilelog/jsonfilelog.go:175).
2. WatchClose() is properly handled by a dedicated goroutine in followLogs(), cancelling the context.
3. Upon receiving ctx.Done(), the code in followLogs() (daemon/logger/loggerutils/logfile.go#L626-L638) keeps sending messages _synchronously_ (which is OK for now).
4. The logs consumer is gone (Ctrl-C is pressed on a terminal running "docker logs --follow"). Method (*LogWatcher).Close() is properly called (see daemon/logs.go:114). Since it was called before, and due to once.Do(), nothing happens (which is kinda good, as otherwise it would panic on closing a closed channel).
5. The goroutine (see item 3 above) keeps sending log messages synchronously to the logWatcher.Msg channel. Since the channel reader is gone, the channel send operation blocks forever, and the resource cleanup set up in defer statements at the beginning of followLogs() never happens.

Alas, the fix is somewhat complicated:
1. Distinguish between close from the logs producer and the logs consumer. To that effect,
   - yet another channel is added to LogWatcher();
   - {Watch,}Close() are renamed to {Watch,}ProducerGone();
   - {Watch,}ConsumerGone() are added.
2. Most of the code waiting for WatchClose() is modified to wait for either ConsumerGone() or ProducerGone(), which mimics the old behavior, except for a few places where it's perfectly clear which one we are to wait for. TODO: further specification for some cases might be needed to fix other bugs.
3. Code that was doing (logWatcher*).Close() is modified to either call ProducerGone() or ConsumerGone(), depending on the context.
4. followLogs() is modified accordingly:
   - context cancellation happens on WatchProducerGone(), and once it is received the FileWatcher is closed and waitRead() returns errDone on EOF (i.e. log rotation handling logic is disabled);
   - due to this, code that was writing synchronously to logWatcher.Msg can be and is removed, as the code above it handles this case;
   - the function returns once ConsumerGone is received, freeing all the resources -- this is the bugfix itself.

Should fix moby#37391

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
A better test case is at aa3bc5a.
All versions of docker-ce from 17.03 to 18.06 are affected.
FWIW, I think this is probably related to (and thus your PR would fix!) an issue I've seen on Windows where a container can't be deleted because the container logs were followed at some point (and are thus still open and the file cannot be deleted, even though the "consumer" of those followed logs is long gone). 🎉 ❤️
This test case checks that followLogs() exits once the reader is gone. Currently it does not (i.e. this test is supposed to fail) due to moby#37391.

[kolyshkin@: test case by Brian Goff, changelog and all bugs are by me]

Source: https://gist.github.com/cpuguy83/e538793de18c762608358ee0eaddc197

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
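The gist itself is not reproduced here, but a unit test along these lines (a sketch built on the illustrative logwatch types from the earlier comment, not the real moby test) would catch the regression by verifying that the follower goroutine actually exits:

```go
package logwatch

import (
	"runtime"
	"testing"
	"time"
)

// TestFollowLogsExitsOnConsumerGone verifies that the follower goroutine
// terminates once the consumer signals it is gone, i.e. it does not leak.
func TestFollowLogsExitsOnConsumerGone(t *testing.T) {
	before := runtime.NumGoroutine()

	w := NewLogWatcher()
	// An endless producer: there is always another message to send.
	next := func() (string, bool) { return "some log line", true }

	go followLogs(w, next)

	<-w.Msg          // act as the consumer for exactly one message...
	w.ConsumerGone() // ...then disconnect

	// Poll until the follower has exited or we give up.
	deadline := time.After(2 * time.Second)
	for runtime.NumGoroutine() > before {
		select {
		case <-deadline:
			t.Fatalf("goroutine leak: %d goroutines before, %d after",
				before, runtime.NumGoroutine())
		case <-time.After(10 * time.Millisecond):
		}
	}
}
```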
When daemon.ContainerLogs() is called with options.follow=true (as in "docker logs --follow"), the loggerutils.followLogs() function never returns (even when the logs consumer is gone). As a result, all the resources associated with it (including an opened file descriptor for the log file being read, two FDs for a pipe, and two FDs for an inotify watch) are never released. If this is repeated (such as by running "docker logs --follow" and pressing Ctrl-C a few times), this results in a DoS caused by either hitting the limit of inotify watches, or the limit of opened files. The only cure is a daemon restart.

Apparently, what happens is:
1. The logs producer (a container) is gone, calling (*LogWatcher).Close() for all its readers (daemon/logger/jsonfilelog/jsonfilelog.go:175).
2. WatchClose() is properly handled by a dedicated goroutine in followLogs(), cancelling the context.
3. Upon receiving ctx.Done(), the code in followLogs() (daemon/logger/loggerutils/logfile.go#L626-L638) keeps sending messages _synchronously_ (which is OK for now).
4. The logs consumer is gone (Ctrl-C is pressed on a terminal running "docker logs --follow"). Method (*LogWatcher).Close() is properly called (see daemon/logs.go:114). Since it was called before, and due to once.Do(), nothing happens (which is kinda good, as otherwise it would panic on closing a closed channel).
5. The goroutine (see item 3 above) keeps sending log messages synchronously to the logWatcher.Msg channel. Since the channel reader is gone, the channel send operation blocks forever, and the resource cleanup set up in defer statements at the beginning of followLogs() never happens.

Alas, the fix is somewhat complicated:
1. Distinguish between close from the logs producer and the logs consumer. To that effect,
   - yet another channel is added to LogWatcher();
   - {Watch,}Close() are renamed to {Watch,}ProducerGone();
   - {Watch,}ConsumerGone() are added.
   *NOTE* that the ProducerGone()/WatchProducerGone() pair is ONLY needed in order to stop ContainerLogs(follow=true) when a container is stopped; otherwise we're not interested in it. In other words, we're only using it in followLogs().
2. Code that was doing (logWatcher*).Close() is modified to either call ProducerGone() or ConsumerGone(), depending on the context.
3. Code that was waiting for WatchClose() is modified to wait for either ConsumerGone() or ProducerGone(), or both, depending on the context.
4. followLogs() is modified accordingly:
   - context cancellation happens on WatchProducerGone(), and once it is received the FileWatcher is closed and waitRead() returns errDone on EOF (i.e. log rotation handling logic is disabled);
   - due to this, code that was writing synchronously to logWatcher.Msg can be and is removed, as the code above it handles this case;
   - the function returns once ConsumerGone is received, freeing all the resources -- this is the bugfix itself.

While at it:
1. Let's also remove the ctx usage to simplify the code a bit. It was introduced by commit a69a59f ("Decouple removing the fileWatcher from reading") in order to fix a bug. The bug was actually a deadlock in fsnotify, and the fix was just a workaround. Since then the fsnotify bug has been fixed, and a new fsnotify was vendored in. For more details, please see moby#27782 (comment).
2. Since `(*filePoller).Close()` is fixed to remove all the files being watched, there is no need to explicitly call fileWatcher.Remove(name) anymore, so get rid of the extra code.

Should fix moby#37391

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Thank you very much for this fix. We just ran into the problem in production and I'm very glad to find the issue and the fix so well documented.
Description
Using
docker logs --follow
on heavy-logging containers causes *-json.log files not to be closed. This leads to leaked file descriptors, and free space is not reclaimed (if logs rotate). Files are not closed even if the container is removed. Only a restart of Docker helps.
Steps to reproduce the issue:
docker logs --follow CONTAINER_ID
a few times: one by one, gracefully stopping each with Ctrl+C. Then check the open file descriptors under /proc/$(cat /run/docker.pid)/fd
Describe the results you received:
Now, on a second console, run and stop it by pressing Ctrl+C a few times:
and now:
Describe the results you expected:
See above.
Additional information you deem important (e.g. issue happens only occasionally):
Make sure A LOT of logs are generated. Sometimes the fd is closed anyway, so run it at least 10 times.
First seen on v17.12.1, still present on v18.03.1.
Output of docker version:
Output of docker info:
Additional environment details (AWS, VirtualBox, physical, etc.):
Tested on CentOS both on AWS and on Vagrant/VirtualBox - same results.