
Fix incorrect use of reentrant locks #1076

Merged
1 commit merged into prometheus:master from the drop-reentrant-locking branch on Dec 3, 2024

Conversation

suligap
Contributor

suligap commented Dec 1, 2024

The first commit fixes a correctness bug introduced in 0014e97, resulting in e.g. lost updates in some scenarios.

The code being locked is not reentrant safe. It's preferable to deadlock in these situations instead of, for example, silently losing updates or allowing data corruption (mmap_dict).

Consider this example:

from prometheus_client import Counter

my_counter = Counter('my_counter', 'my_counter description', labelnames=["bug"])


class Tracked(float):
    def __radd__(self, other):
        # Executes within the locked section of ValueClass.inc() if used as
        # amount arg to Counter.inc().
        my_counter.labels(bug="isit").inc(1)
        my_counter.labels(bug="race").inc(10)
        return self + other

    def __str__(self):
        return f"Tracked({float(self)})"


def main():
    my_counter.labels(bug="race").inc(Tracked(100))

    for sample in my_counter.collect()[0].samples:
        print(sample)


if __name__ == '__main__':
    main()

Executing this with reentrant locks in place results in the following (in both multiprocess and single-process mode):

$ python3 example.py 
Sample(name='my_counter_total', labels={'bug': 'race'}, value=100.0, timestamp=None, exemplar=None, native_histogram=None)
Sample(name='my_counter_created', labels={'bug': 'race'}, value=1733062486.7625494, timestamp=None, exemplar=None, native_histogram=None)
Sample(name='my_counter_total', labels={'bug': 'isit'}, value=1.0, timestamp=None, exemplar=None, native_histogram=None)
Sample(name='my_counter_created', labels={'bug': 'isit'}, value=1733062486.7625587, timestamp=None, exemplar=None, native_histogram=None)

We would expect the value of the bug=race counter to be 110, however: the nested inc(10) for bug=race runs inside the outer inc()'s critical section (re-entered through the RLock), and its update is then overwritten when the outer inc() stores 100.

With this fix, this code will:

  • With the changes from the first commit only: deadlock, signalling to the user that they have a bug.
  • With both commits: raise a RuntimeError.

It's also likely to happen if someone tries to write metrics during signal handling. Again, it's preferable to deadlock instead of being incorrect IMHO.

I'm open to dropping the second commit (it does add some complexity and maybe not everyone will like the overhead of an extra lock), but it may be a useful quality of life improvement (crashing is usually better than a deadlock).

Let me know what you think!

suligap force-pushed the drop-reentrant-locking branch from a19124d to b0210f5 on December 1, 2024, 20:17
@csmarchbanks
Member

Giving this some more thought, generally I agree that a deadlock or error would be better than incorrect data. Does this only happen for methods like __radd__? If so, I almost wonder if we should have all of the functions cast to a float before taking any locks as it is confusing for what should be a float value (according to typing) to break locking.
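
If it helps, here is a minimal sketch of that idea (a made-up class, not the actual values.py code): coerce the amount to a plain float before acquiring the lock, so user-defined dunder methods such as __float__ or __radd__ run outside the critical section and the arithmetic inside it only ever touches plain floats.

import threading


class SketchValue:
    """Hypothetical, simplified value holder; only illustrates the ordering."""

    def __init__(self):
        self._value = 0.0
        self._lock = threading.Lock()  # deliberately non-reentrant

    def inc(self, amount):
        # Any user-defined __float__ side effects happen here, before the
        # lock is taken, and the addition below no longer dispatches to a
        # subclass's __radd__.
        amount = float(amount)
        with self._lock:
            self._value += amount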

@btimby, I would also like your thoughts as you are the one who needs the RLock behavior. Would having normal locks in values.py still be sufficient for your use case?

@btimby
Contributor

btimby commented Dec 2, 2024

Certainly, a deadlock is preferable to corrupt data. It may be the case that the change I proposed was too heavy-handed. In other words, my particular deadlock did not involve the changing of data, but the changing of the metrics registry. In fact, my problem occurred at startup. Manipulation of metric values may require a different locking strategy than registry access.

#1014

I am actually using django-prometheus which uses client_python, and my usage is basically straight out of their documentation.

@csmarchbanks
Member

In other words, my particular deadlock did not involve the changing of data, but the changing of the metrics registry. In fact, my problem occurred at startup. Manipulation of metric values may require a different locking strategy than registry access.

Great, that is what I was thinking: keep the reentrant lock where possible in the registry and use regular locks elsewhere. Do you have a link to a reproducible example? Otherwise I can try to set something up soon.

@btimby
Contributor

btimby commented Dec 2, 2024

I am sorry, I did not record anything at the time. This was just a small issue that I ran into when setting up a new project. I debugged the code and proposed a fix and moved on. This is what I described in the initial issue:

While using django-prometheus I found a situation where adding a metric to the registry caused a read from cache. I had cache metrics enabled, which attempted to add cache metrics to the registry. This resulted in a deadlock, as the second registry call attempted to obtain the lock already held lower on the call stack.

I did a bit of digging just now to refresh my memory, and here is what I think triggered my initial problem. Keep in mind that I am using django-prometheus, but I don't think there is anything particularly wrong with that project; it just provided the means for me to create this deadlock scenario.

Basically, I was adding a custom Collector in a module and the collect() method (called when adding to the registry) accessed the django cache. I was also using a cache backend with its own metrics, so by accessing the django cache, additional metrics (for cache access) were added to the registry, resulting in a deadlock.

So I think you could recreate my issue simply by creating a Collector that registers another Collector in its collect() method, along the lines of the rough sketch below.
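
For reference, a rough reproduction sketch along those lines (hypothetical names, not the original code), assuming register() learns the metric names via describe() or, failing that, collect(), which matches the "called when adding to the registry" behaviour above:

from prometheus_client.core import GaugeMetricFamily
from prometheus_client.registry import REGISTRY, Collector


class InnerCollector(Collector):
    def collect(self):
        yield GaugeMetricFamily('inner_gauge', 'registered from inside another collect()')


class OuterCollector(Collector):
    def collect(self):
        # This nested register() needs the registry lock that the outer
        # register() call below is still holding: a plain Lock deadlocks
        # here, an RLock silently lets the registry be mutated mid-collect.
        REGISTRY.register(InnerCollector())
        yield GaugeMetricFamily('outer_gauge', 'outer collector')


# register() iterates collect() (no describe() defined), so with
# non-reentrant locks this call never returns.
REGISTRY.register(OuterCollector())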

Here is the code I believe caused the issue:

from typing import Generator

from django.db.models import OuterRef, Subquery
from django.utils import timezone
from django.core.cache import cache

from prometheus_client.core import GaugeMetricFamily, Metric
from prometheus_client.registry import Collector

from schedule.models import MatchTimeSlot
from facility.models import Facility


METRIC_IN_USE = GaugeMetricFamily(
    'two47dink_schedule_courts_in_use',
    'Number of courts reserved for matches',
    labels=['facility']
)


class MatchCollector(Collector):
    """
    Collect court metrics for each facility
    """
    def collect(self) -> Generator[Metric, None, None]:
        now = timezone.now().replace(
            minute=0, second=0, microsecond=0,
        )

        key = f'{__name__}.metrics-results'
        cached_results = cache.get(key)

        if not cached_results:
            match_count = MatchTimeSlot.objects \
                .filter(facility_id=OuterRef('id'), time=now) \
                .values('spanning')
            facilities = Facility.objects.all().annotate(
                match_count=Subquery(match_count),
            )

            cached_results = {
                facility.name: {'match_count': facility.match_count}
                for facility in facilities
            }
            cache.set(key, cached_results, 60)

        for facility_name, results in cached_results.items():
            METRIC_IN_USE.add_metric(
                labels=[facility_name], value=results.get('match_count') or 0.0
            )

        yield METRIC_IN_USE

    def describe(self) -> Generator[Metric, None, None]:
        yield METRIC_IN_USE

I added the optional describe() method later (to address an unrelated problem), which likely also fixes this issue, as it will be called instead of collect() and does NOT access the django cache. I am using this cache backend from django-prometheus:

https://github.com/korfuri/django-prometheus/blob/master/django_prometheus/cache/backends/redis.py

And now that I look at that code, it is indeed modifying counters related to cache access, so it may be modifying data (and not just the registry). OTOH, django does a lot of lazy loading and imports, so it may be the case that the cache backend is not loaded (and thus metrics are not initialized) until the cache is accessed. I know my original description of the problem specifically states that the registry is accessed in a nested manner, and I know that I did run this code under a debugger so I would tend to trust past me on this assessment.

I hope this helps, I can try to provide more information if you need it.

@suligap
Contributor Author

suligap commented Dec 2, 2024

@csmarchbanks

Does this only happen for methods like __radd__?

Oh, I think/hope the __radd__ example is something that doesn't happen often in the real world. But it illustrates nicely that the code being locked is not reentrant safe. I think that the float() treatment might be a good idea on its own but it still does not make this code reentrant safe.

I'd expect that in practice people have mostly been running into deadlocks here (metric updates) while handling signals: a signal handler is invoked during execution of the critical section protected by the lock, and during that signal handling some metrics are updated. This is what we've seen, and that's how I stumbled upon the RLock PR. In that case I think the proper fix is keeping the application's signal handling code very simple and not allowing it to update prometheus metrics (at least not in multiprocess mode).

Great, that is what I was thinking, keep the reentrant lock where possible in the registry and use regular locks elsewhere.

Hmm, the code protected by CollectorRegistry._lock does not look reentrant safe to me either. Same with locking in metrics.py. There's nothing special about the access happening from within the same thread that is currently holding a lock that would make it safe to re-enter these contexts. I think if anything, allowing that will make it harder to spot/debug the resulting issues?

@btimby

Basically, I was adding a custom Collector in a module and the collect() method (called when adding to the registry) accessed the django cache. I was also using a cache backend with its own metrics, so by accessing the django cache, additional metrics (for cache access) were added to the registry, resulting in a deadlock.

This makes sense indeed if your custom Collector's .collect() resulted in lazy-loading django-prometheus' custom cache module, which imports the module that registers the cache metrics in the first place. Perhaps a solution in your case would be to make sure the cache module is already loaded properly somehow. Or, like you say, use the .describe() method, which does not access the cache.

Thank you both for looking at this!

@csmarchbanks
Member

Yep, I agree. What I was thinking is that the same thread should be able to instantiate the same metric lazily, which is why a reentrant lock sounded like a good idea, but yeah, it is not safe in that case either. If your issue can be solved by implementing .describe(), we should just advise users towards that to avoid any deadlocks.

For this PR let's start simple and just move the RLocks back to Locks. I somewhat like the idea of raising an error instead, but I don't think values.py is the right place since it sounds like the use case is for a custom collector.

suligap force-pushed the drop-reentrant-locking branch from b0210f5 to eb874cf on December 3, 2024, 07:13
@suligap
Contributor Author

suligap commented Dec 3, 2024

For this PR let's start simple and just move the RLocks back to Locks.

Sure thing and done.

I somewhat like the idea of raising an error instead, but I don't think value is the right place since it sounds like the use case is for a custom collector.

I think it can happen in all cases, and we have certainly been affected by these elusive deadlocks in production, caused by our gunicorn wrapper instrumenting the gunicorn main process' logging with sentry and related prometheus metrics (recording http metrics for requests sent to sentry). This is a single-threaded context in which the main process was sometimes in the middle of incrementing a prometheus counter as a result of sending an event to sentry, which in turn was triggered by the gunicorn main process logging an error about e.g. terminating an unresponsive worker. In the middle of that, while the global multiprocess mode lock in values.py was held, a signal handler for SIGCHLD was invoked in that main process in response to some other worker terminating. During that signal handling the main process also logged an error message, which caused another sentry event and a corresponding prometheus counter update -> deadlock. But the proper fix in this case is for us to be more careful/conservative about what we do during signal handling, or in this case about what we instrumented gunicorn to do. We'd certainly have found this issue quicker if there was an error instead. If people think it makes sense, I can propose a separate PR with a change that adds this for all locks.
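
For what it's worth, here is a stripped-down sketch of that deadlock shape (hypothetical, Unix-only, not the actual talisker/gunicorn code): a signal handler that needs a non-reentrant lock already held by the interrupted main flow blocks forever even though only one thread is involved. Running it hangs on purpose.

import os
import signal
import threading

lock = threading.Lock()  # stands in for the global lock in values.py


def on_sigchld(signum, frame):
    # In the real scenario this is a counter.inc() triggered by logging an
    # error while reaping a worker.
    with lock:
        pass


signal.signal(signal.SIGCHLD, on_sigchld)

with lock:  # main flow already inside the critical section
    # The handler runs in this same thread and blocks on the lock it can
    # never acquire -> deadlock.
    os.kill(os.getpid(), signal.SIGCHLD)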

This fixes a correctness bug introduced in
0014e97 resulting in lost updates
during some scenarios.

The code being locked is not reentrant safe. It's preferable to deadlock
in these situations instead of, for example, silently losing updates.

Signed-off-by: Przemysław Suliga <mail@suligap.net>
suligap force-pushed the drop-reentrant-locking branch from eb874cf to 19561ae on December 3, 2024, 07:15
csmarchbanks merged commit 00f21e9 into prometheus:master on Dec 3, 2024
11 checks passed
csmarchbanks pushed a commit that referenced this pull request Dec 3, 2024
This fixes a correctness bug introduced in
0014e97 resulting in lost updates
during some scenarios.

The code being locked is not reentrant safe. It's preferable to deadlock
in these situations instead of, for example, silently losing updates.

Signed-off-by: Przemysław Suliga <mail@suligap.net>
@csmarchbanks
Member

Beyond merging, I have backported this and released v0.21.1 with just this fix.

@csmarchbanks
Member

I also do think a separate PR that raises an error is a good idea, or at least worth discussing further in that PR.

@btimby
Contributor

btimby commented Dec 3, 2024

Would it be overly hard to allow the user to select or override the locking object? Perhaps the lock object could be parameterized; then raising could be achieved using a specialized lock.
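
For illustration, a hedged sketch of such a pluggable "raising" lock (not an existing prometheus_client API): it keeps the acquire/release/context-manager surface of threading.Lock, but re-acquisition by the thread that already holds it raises instead of deadlocking.

import threading


class RaisingLock:
    """Hypothetical non-reentrant lock that raises on same-thread re-entry."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None

    def acquire(self, blocking=True, timeout=-1):
        if self._owner == threading.get_ident():
            raise RuntimeError('lock re-acquired by the thread that already holds it')
        acquired = self._lock.acquire(blocking, timeout)
        if acquired:
            self._owner = threading.get_ident()
        return acquired

    def release(self):
        self._owner = None
        self._lock.release()

    __enter__ = acquire

    def __exit__(self, *exc):
        self.release()

The value (and registry) classes could then accept the lock factory as a parameter and default to plain threading.Lock.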

suligap added a commit to suligap/client_python that referenced this pull request Dec 3, 2024
Detect deadlocks caused by misuse of the library, e.g. by user code
injected into the critical sections that itself might want to obtain
the lock.

A follow up to prometheus#1076.

Signed-off-by: Przemysław Suliga <mail@suligap.net>
suligap added a commit to suligap/client_python that referenced this pull request Dec 3, 2024
Detect deadlocks caused by misuse of the library, e.g. by user code
injected into the critical sections that itself might want to obtain
the relevant lock.

A follow up to prometheus#1076.

Signed-off-by: Przemysław Suliga <mail@suligap.net>
suligap added a commit to canonical-ols/talisker that referenced this pull request Dec 4, 2024
We have been affected by deadlocks caused by talisker instrumenting the
gunicorn main process' logging with sentry and related
prometheus metrics. The deadlocks were on a lock in prometheus_client.

This is a single-threaded context in which the main process was
sometimes in the middle of incrementing a prometheus counter as a
result of sending an event to sentry, which in turn was triggered by
the gunicorn main process logging an error about e.g. terminating an
unresponsive worker. In the middle of that, while the global
multiprocess mode lock in values.py was held, a signal handler for
SIGCHLD was invoked in that main process in response to some other
worker terminating. During that signal handling the main process also
logged an error message, which caused another sentry event and a
corresponding prometheus counter update -> deadlock.

So in order to be more careful/conservative with what we do during
signal handling, or in this case what we instrument gunicorn to do, this
change sets up all of the requests related metrics to not be emitted
from the process they were created in (which for these metrics is the
gunicorn arbiter if we're running gunicorn).

More details on prometheus client behavior in
prometheus/client_python#1076
suligap added a commit to canonical-ols/talisker that referenced this pull request Dec 4, 2024
We have been affected by deadlocks caused by talisker instrumenting the
gunicorn main process' logging with sentry and related
prometheus metrics. The deadlocks were on a lock in prometheus_client.

This is a single-threaded context in which the main process was
sometimes in the middle of incrementing a prometheus counter as a
result of sending an event to sentry, which in turn was triggered by
the gunicorn main process logging an error about e.g. terminating an
unresponsive worker. In the middle of that, while the global
multiprocess mode prometheus client lock in values.py was held, a
signal handler for SIGCHLD was invoked in that main process in
response to some other worker terminating. During that signal handling
the main process also logged an error message, which caused another
sentry event and a corresponding prometheus counter update -> deadlock.

So in order to be more careful/conservative with what we do during
signal handling, or in this case what we instrument gunicorn to do, this
change sets up all of the requests related metrics to not be emitted
from the process they were created in (which for these metrics is the
gunicorn arbiter if we're running gunicorn).

More details on prometheus client behavior in
prometheus/client_python#1076
suligap added a commit to canonical-ols/talisker that referenced this pull request Dec 4, 2024
At least for now stop using the TaliskerRequestsTransport.

In practice, for flask apps it was not used in the apps/workers because
FlaskSentry was explicitly setting a None 'transport' in the client
kwargs.

This change makes sure that the same default raven (threaded) transport
is used by logging in the main process. This will mitigate the deadlocks
described in
prometheus/client_python#1076 (comment)
because there will be no prometheus instrumentation for sentry requests.
suligap added a commit to suligap/client_python that referenced this pull request Dec 8, 2024
Detects and prevents deadlocks caused by misuse of the library, e.g.
by user code injected into the critical sections that itself might
want to obtain the relevant lock.

A follow up to prometheus#1076.

Signed-off-by: Przemysław Suliga <mail@suligap.net>