
Bug - Deadlock when using multithreaded & multiprocessing Environment #231

Closed
shachakz opened this issue Mar 30, 2020 · 13 comments
Labels
bug Something isn't working

Comments


shachakz commented Mar 30, 2020

Hi,

Python version: 3.6.8
loguru version: 0.4.1
OS: Linux
Dev environment: Terminal

Bug Description: Creating a new process while another thread is logging causes a deadlock inside the new process as soon as it calls the logger.
I assume this happens because the handler's _lock is already held when the process is created, so from the new process's perspective _lock remains held forever [see the 1st comment here].

Code to reproduce:

import multiprocessing as mp
import sys
import threading
from random import uniform
from time import sleep
from loguru import logger

logger.remove()

logger.add(
    sys.stdout,
    colorize=False,
    enqueue=True,  # The bug also reproduces with enqueue=False
    serialize=False,
    level="DEBUG",
)


def sub_worker():
    while True:
        logger.trace("This is the inner thread")
        sleep(0.01)

def worker():
    sleep(uniform(0, 0.1))
    logger.debug("Printing some logs")

if __name__ == "__main__":
    threading.Thread(target=sub_worker).start()

    while True:
        w = mp.Process(target=worker)
        w.start()
        w.join()

        logger.debug("Loop ok!")

Delgan commented Mar 30, 2020

Yeah, it's a problem that has been hanging around in my backlog for a while. It's the same problem as with the standard logging library. I didn't have the courage to fix it because I found the workaround inelegant, but I guess there's not much choice. See the relevant commits:

Anyway, thanks a lot for the report and the complete, reproducible example! I will implement a fix for the next version.


Delgan commented May 2, 2020

What a journey! There was not one, not two, but three bugs here! 😄

1. Deadlock while mixing threading with multiprocessing (1122e82)

This was the easy one. It happens if a process is created while the internal lock of a handler (used for thread-safety) is already acquired. This is only possible if the sink is being used by another thread at the moment the child process is created. The child process inherits the locked lock (because fork() copies the entire process state) and hence deadlocks as soon as it tries to log a message. This problem is extensively discussed in the issues related to the commits I linked above; it's very instructive.

It can happen even if enqueue=False and is easily reproducible by using a "slow" sink:

import multiprocessing
import threading
import time
import sys
from loguru import logger

def slow_sink(msg):
    time.sleep(1)
    sys.stderr.write(msg)

def worker():
    logger.info("Working...")

if __name__ == "__main__":
    logger.remove()
    logger.add(slow_sink, enqueue=False)

    thread = threading.Thread(target=worker)
    thread.start()

    process = multiprocessing.Process(target=worker)
    process.start()

    thread.join()
    process.join()

The fix for this relies on the os.register_at_fork() function. We simply need to acquire all handler locks in the main process before a fork occurs, and release them in both the main and child processes once the fork is finished.

Contrary to the logging library, loguru does not use re-entrant locks and does not expose an API to acquire them. This eases the job of protecting handlers during a fork and avoids other possible kinds of deadlocks that the standard library faced.
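
To illustrate the idea, here is a minimal sketch of the pattern (not loguru's actual internals; the lock and callback names are invented for the example):

import os
import threading

_handler_lock = threading.Lock()  # hypothetical lock guarding one sink

def _acquire_locks():
    # Runs in the parent right before fork(), so no thread can hold the lock
    # at the moment the child's memory is copied.
    _handler_lock.acquire()

def _release_locks():
    # Runs after fork() in both the parent and the child, leaving each
    # process with a cleanly released lock.
    _handler_lock.release()

os.register_at_fork(
    before=_acquire_locks,
    after_in_parent=_release_locks,
    after_in_child=_release_locks,
)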

2. Deadlock while using enqueue=True with multiprocessing (8c42511)

So, after I fixed the first issue, I thought I could call it a day and move on, but there were still more troubles to come. 🙂

While using enqueue=True, a thread is started internally to continuously consume the log messages put in the queue. Unfortunately, threading and multiprocessing don't interoperate very well. There was another possible deadlock, due to the fact that sys.stderr (and sys.stdout) itself internally uses a lock which is not protected in case of a fork.

Consequently, if a fork occurs while the handler's internal thread is printing a log message, the child process will run into a deadlock as soon as sys.stderr is manipulated. This happens beyond the scope of loguru, since child processes do not use the sink (they simply add log messages to the queue).

import multiprocessing
import sys
from loguru import logger

if __name__ == "__main__":
    logger.remove()
    logger.add(sys.stderr, enqueue=True)

    while True:
        logger.info("No deadlock yet...")
        process = multiprocessing.Process(target=lambda: None)
        process.start()
        process.join()

The fix is quite straightforward using the new locking mechanism introduced above: the sink has to be protected by a lock even when enqueue=True, so that it is properly acquired and released around a fork.
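
As a rough sketch of that idea (a simplification, not loguru's actual code; all names here are invented): the consumer thread only writes to the sink while holding a lock that is itself registered with os.register_at_fork(), so a fork can never happen in the middle of a write to sys.stderr.

import multiprocessing
import os
import sys
import threading

_queue = multiprocessing.SimpleQueue()  # stands in for the cross-process message queue
_sink_lock = threading.Lock()           # protected across fork(), as sketched above

os.register_at_fork(
    before=_sink_lock.acquire,
    after_in_parent=_sink_lock.release,
    after_in_child=_sink_lock.release,
)

def _consume():
    while True:
        message = _queue.get()          # blocks until a process puts a message
        if message is None:             # sentinel used to stop the worker thread
            break
        with _sink_lock:                # the write to stderr happens entirely under the lock
            sys.stderr.write(message)

threading.Thread(target=_consume, daemon=True).start()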

3. Deadlock while joining child process with enqueue=True (be597bd)

About this one, I was not able to make a reproducible example, but the multiprocessing documentation is quite explicit about it:

Warning: As mentioned above, if a child process has put items on a queue (and it has not used JoinableQueue.cancel_join_thread), then that process will not terminate until all buffered items have been flushed to the pipe.

This means that if you try joining that process you may get a deadlock unless you are sure that all items which have been put on the queue have been consumed. Similarly, if the child process is non-daemonic then the parent process may hang on exit when it tries to join all its non-daemonic children.

So, it seems possible for a child process to deadlock if it exits while items added to the queue have not been fully handled. I'm not sure whether this applies to the SimpleQueue used by loguru, but to be sure, I slightly modified the logger.complete() function. This method can now be called outside of asynchronous functions; it will block until all log messages added to the queue (at the time of the call) have been processed.
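
For example (purely illustrative, based on the description above), a worker using enqueue=True can call logger.complete() before returning, so that joining it cannot hang on unflushed queue items:

import multiprocessing
import sys
from loguru import logger

def worker():
    logger.info("Work done in the child process")
    logger.complete()  # block until the messages enqueued so far are processed

if __name__ == "__main__":
    logger.remove()
    logger.add(sys.stderr, enqueue=True)

    process = multiprocessing.Process(target=worker)
    process.start()
    process.join()  # safe: the child has flushed its pending log messages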

Delgan closed this as completed May 2, 2020

kapsh commented May 14, 2020

@Delgan hello! I am hitting this deadlock while using loguru 0.4.1. Could you please release a new version with these fixes included?

It seems to me that it still reproduces on 981edf0 if Process.run of a custom worker is wrapped in loguru.catch.


Delgan commented May 14, 2020

@kapsh I will try to release a new version this weekend.

However, I don't encounter any deadlock when wrapping Process.run(). Could you please open a new ticket with the snippet that triggered the bug?


kapsh commented May 14, 2020 via email


Delgan commented May 17, 2020

@kapsh You can now update loguru to 0.5.0; the deadlocks should be fixed. 👍

If you encounter any other problem, please open a new issue so I can investigate it. 😉


kapsh commented May 17, 2020

@Delgan Meanwhile, I couldn't catch the deadlock in my script again (it's pretty annoying to reproduce), and I now suspect it was caused by my own interaction with queues.
Thanks for taking care of this!


ahuan2616 commented Jun 4, 2021

I can still reproduce issue "2. Deadlock while using enqueue=True with multiprocessing" with version 0.5.3. Am I missing something? I used this code to test:

import multiprocessing
import sys
from loguru import logger

if __name__ == "__main__":
    logger.remove()
    logger.add(sys.stderr, enqueue=True)

    while True:
        logger.info("No deadlock yet...")
        process = multiprocessing.Process(target=lambda: None)
        process.start()
        process.join()


kapsh commented Jun 4, 2021

@ahuan2616 Which Python version are you running when this happens? In my case I concluded that some 3.6 release was responsible, because it stopped reproducing under 3.9.


Delgan commented Jun 4, 2021

@ahuan2616 Sorry, it seems I forgot to mention that the fix requires at least Python 3.7, as @kapsh suggested.
The solution implemented relies on the os.register_at_fork() function, which was introduced in Python 3.7.
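
A quick way to check whether the fix can apply on a given interpreter (just an illustration, not an API exposed by loguru):

import os
import sys

# os.register_at_fork() only exists on POSIX and on Python >= 3.7.
if hasattr(os, "register_at_fork"):
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: fork protection available")
else:
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: upgrade to 3.7+ to get the fix")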

ahuan2616 commented

@kapsh @Delgan Thanks a lot for your help! My Python version was 3.6; the issue was resolved after I updated to Python 3.9. Thanks for your quick reply!

Helicopt commented

It seems that with Python 3.8.0 and loguru 0.6.0, it still runs into a deadlock.

I don't have spare time for a detailed analysis right now, but I can post some info for you.

Environment:

Linux xxxxxx 5.15.0-50-generic #56-Ubuntu SMP Tue Sep 20 13:23:26 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Python 3.8.0 (default, Nov 6 2019, 21:49:08)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import loguru
>>> loguru.__version__
'0.6.0'

In Process 0: It runs normally and reaches a sync command (after printing try resize)

2022-10-22 06:50:52.848 | DEBUG | yolox.core.trainer:after_iter:280 - print info msg
2022-10-22 06:50:52.848 | INFO | yolox.core.trainer:after_iter:281 - epoch: 26/100, iter: 520/534, mem: 21206Mb, iter_time: 0.651s, data_time: 0.002s, total_loss: 5.378, iou_loss: 2.539, l1_loss: 0.000, conf_loss: 2.225, cls_loss: 0.614, lr: 1.915e-04, size: 1120, ETA: 7:22:11
2022-10-22 06:50:52.848 | DEBUG | yolox.core.trainer:after_iter:291 - clear meters
2022-10-22 06:50:52.848 | DEBUG | yolox.core.trainer:after_iter:293 - clear meters done
2022-10-22 06:50:52.849 | DEBUG | yolox.core.trainer:after_iter:297 - try resize

In Process 1: It gets stuck in logger.info.

2022-10-22 06:50:52.835 | DEBUG | yolox.core.trainer:after_iter:278 - get gpu number
2022-10-22 06:50:52.836 | DEBUG | yolox.core.trainer:after_iter:280 - print info msg

In Process 2: It gets stuck in logger.info before I kill process 1; then it jumps out of the deadlock (because I killed process 1) and continues to run until it reaches a sync command (after printing "try resize"). It was stuck for about an hour, as you can see from the timestamps.

2022-10-22 06:50:52.837 | DEBUG | yolox.core.trainer:after_iter:278 - get gpu number
2022-10-22 06:50:52.837 | DEBUG | yolox.core.trainer:after_iter:280 - print info msg
2022-10-22 06:50:52.838 | INFO | yolox.core.trainer:after_iter:281 - epoch: 26/100, iter: 520/534, mem: 21470Mb, iter_time: 0.649s, data_time: 0.002s, total_loss: 5.432, iou_loss: 2.601, l1_loss: 0.000, conf_loss: 2.211, cls_loss: 0.621, lr: 1.915e-04, size: 1120, ETA: 7:23:12
2022-10-22 07:53:44.065 | DEBUG | yolox.core.trainer:after_iter:291 - clear meters
2022-10-22 07:53:44.066 | DEBUG | yolox.core.trainer:after_iter:293 - clear meters done
2022-10-22 07:53:44.067 | DEBUG | yolox.core.trainer:after_iter:297 - try resize

Short discussion:

So the message string is created normally (the timestamps look normal), and it gets stuck when outputting the message.

Killing a process can release the deadlock.

My temporary workaround is not to log from more than one subprocess.

I will update with more info when I get time.


Delgan commented Oct 22, 2022

@Helicopt There have been some improvements regarding deadlocks made recently in #712; you can use the master branch of loguru to perhaps get more debugging information.

pip install git+https://github.com/Delgan/loguru.git@master
