Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Make inflight background metrics more efficient. #7597

Merged
merged 6 commits into from
May 29, 2020

Conversation

erikjohnston
Copy link
Member

Hopefully fixes #7596.

@erikjohnston erikjohnston requested a review from a team May 29, 2020 11:02
Copy link
Member

@anoadragon453 anoadragon453 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I understand this correctly, the code:

  • Creates a set which contains the background procs since the last metrics updates since the last scrape
  • As background processes are started (either for the first time or after a while of being dormant), they're added to this set
  • When a scrape is performed (with collect()), metrics are pulled from each process before the set is emptied again
  • Another metric is created which tracks the amount of background processes that are currently running at any given time.

Is that accurate? I have a few clarifying questions below as well.

synapse/metrics/background_process_metrics.py Outdated Show resolved Hide resolved
synapse/metrics/background_process_metrics.py Show resolved Hide resolved

self._proc = _BackgroundProcess(name, self)

def start(self, rusage: "Optional[resource._RUsage]"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So does this run on every batch, or just when the process stops (runs out of things to process) and then starts up again?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming is a bit confusing here. Basically LoggingContext.{__enter__,__exit__} are called first time we enter the context and last time we leave the context, while {start,stop} get called whenever the logging context starts/stops being the active/current logging context. (The names resume and pause might have been better here)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, and "entering the context" occurs each time a background processes starts processing.

So _background_processes_active_since_last_scrape contains just those processes that have become active since the last scrape, versus those that are continuously active.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, exactly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a quick comment to where _background_process_in_flight_count is instantiated mentioning this distinction please?

Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
@richvdh
Copy link
Member

richvdh commented May 29, 2020

Have you considered writing the metrics straight into a Gauge on each stop() call, rather than having to iterate them when we get a collect() call?

@erikjohnston
Copy link
Member Author

Have you considered writing the metrics straight into a Gauge on each stop() call, rather than having to iterate them when we get a collect() call?

I'm a little nervous of the performance impact on repeatedly calling update_metric, as logging contexts gets started/stopped quite a lot I think.

Copy link
Member

@anoadragon453 anoadragon453 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to do what it says on the tin, and is definitely a step in the right direction.


self._proc = _BackgroundProcess(name, self)

def start(self, rusage: "Optional[resource._RUsage]"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a quick comment to where _background_process_in_flight_count is instantiated mentioning this distinction please?

@erikjohnston erikjohnston merged commit f5353ef into develop May 29, 2020
@erikjohnston erikjohnston deleted the erikj/make_bg_inflight_metrics_efficient branch May 29, 2020 12:25
babolivier added a commit that referenced this pull request Jun 10, 2020
…rg-hotfixes

Synapse 1.15.0rc1 (2020-06-09)
==============================

Features
--------

- Advertise support for Client-Server API r0.6.0 and remove related unstable feature flags. ([\#6585](#6585))
- Add an option to disable autojoining rooms for guest accounts. ([\#6637](#6637))
- For SAML authentication, add the ability to pass email addresses to be added to new users' accounts via SAML attributes. Contributed by Christopher Cooper. ([\#7385](#7385))
- Add admin APIs to allow server admins to manage users' devices. Contributed by @dklimpel. ([\#7481](#7481))
- Add support for generating thumbnails for WebP images. Previously, users would see an empty box instead of preview image. ([\#7586](#7586))
- Support the standardized `m.login.sso` user-interactive authentication flow. ([\#7630](#7630))

Bugfixes
--------

- Allow new users to be registered via the admin API even if the monthly active user limit has been reached. Contributed by @dkimpel. ([\#7263](#7263))
- Fix email notifications not being enabled for new users when created via the Admin API. ([\#7267](#7267))
- Fix str placeholders in an instance of `PrepareDatabaseException`. Introduced in Synapse v1.8.0. ([\#7575](#7575))
- Fix a bug in automatic user creation during first time login with `m.login.jwt`. Regression in v1.6.0. Contributed by @olof. ([\#7585](#7585))
- Fix a bug causing the cross-signing keys to be ignored when resyncing a device list. ([\#7594](#7594))
- Fix metrics failing when there is a large number of active background processes. ([\#7597](#7597))
- Fix bug where returning rooms for a group would fail if it included a room that the server was not in. ([\#7599](#7599))
- Fix duplicate key violation when persisting read markers. ([\#7607](#7607))
- Prevent an entire iteration of the device list resync loop from failing if one server responds with a malformed result. ([\#7609](#7609))
- Fix exceptions when fetching events from a remote host fails. ([\#7622](#7622))
- Make `synctl restart` start synapse if it wasn't running. ([\#7624](#7624))
- Pass device information through to the login endpoint when using the login fallback. ([\#7629](#7629))
- Advertise the `m.login.token` login flow when OpenID Connect is enabled. ([\#7631](#7631))
- Fix bug in account data replication stream. ([\#7656](#7656))

Improved Documentation
----------------------

- Update the OpenBSD installation instructions. ([\#7587](#7587))
- Advertise Python 3.8 support in `setup.py`. ([\#7602](#7602))
- Add a link to `#synapse:matrix.org` in the troubleshooting section of the README. ([\#7603](#7603))
- Clarifications to the admin api documentation. ([\#7647](#7647))

Internal Changes
----------------

- Convert the identity handler to async/await. ([\#7561](#7561))
- Improve query performance for fetching state from a PostgreSQL database. ([\#7567](#7567))
- Speed up processing of federation stream RDATA rows. ([\#7584](#7584))
- Add comment to systemd example to show postgresql dependency. ([\#7591](#7591))
- Refactor `Ratelimiter` to limit the amount of expensive config value accesses. ([\#7595](#7595))
- Convert groups handlers to async/await. ([\#7600](#7600))
- Clean up exception handling in `SAML2ResponseResource`. ([\#7614](#7614))
- Check that all asynchronous tasks succeed and general cleanup of `MonthlyActiveUsersTestCase` and `TestMauLimit`. ([\#7619](#7619))
- Convert `get_user_id_by_threepid` to async/await. ([\#7620](#7620))
- Switch to upstream `dh-virtualenv` rather than our fork for Debian package builds. ([\#7621](#7621))
- Update CI scripts to check the number in the newsfile fragment. ([\#7623](#7623))
- Check if the localpart of a Matrix ID is reserved for guest users earlier in the registration flow, as well as when responding to requests to `/register/available`. ([\#7625](#7625))
- Minor cleanups to OpenID Connect integration. ([\#7628](#7628))
- Attempt to fix flaky test: `PhoneHomeStatsTestCase.test_performance_100`. ([\#7634](#7634))
- Fix typos of `m.olm.curve25519-aes-sha2` and `m.megolm.v1.aes-sha2` in comments, test files. ([\#7637](#7637))
- Convert user directory, state deltas, and stats handlers to async/await. ([\#7640](#7640))
- Remove some unused constants. ([\#7644](#7644))
- Fix type information on `assert_*_is_admin` methods. ([\#7645](#7645))
- Convert registration handler to async/await. ([\#7649](#7649))
phil-flex pushed a commit to phil-flex/synapse that referenced this pull request Jun 16, 2020
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Synchrotron metric collection appears to fail if large numbers of background tasks present.
3 participants