
FEATURE: Metrics #3214

Closed
tillprochaska opened this issue Jul 17, 2023 · 2 comments · Fixed by #3216 or alephdata/servicelayer#111
tillprochaska commented Jul 17, 2023

Is your feature request related to a problem? Please describe.
Aleph currently doesn’t expose any metrics directly. At OCCRP, we track some log-based metrics as well as metrics from Elasticsearch, but we’d like to start instrumenting Aleph directly in order to operate Aleph better and to get insights into how Aleph features are adopted.

Metrics we’re interested in include the following. We already collect some of these metrics indirectly, but we should start collecting them using explicit instrumentation.

  • Queues
    • Number of queued tasks per stage
    • Number of processed tasks (failed/successful)
    • Processing time per stage
  • API
    • Number of requests (authenticated/anonymous, failed/successful)
    • Number of streamed entities
    • Response time (search endpoints and possibly other endpoints too)
    • Elasticsearch query response time
  • Ingest
    • Processing time per ingestor (PDF, Excel, Email, …)
    • Number of bytes ingested per ingestor
    • Number of processed tasks per ingestor (failed/successful)
    • PDF/OCR ingest cache hits/misses
  • Users
    • Number of users (language)
    • Number of new sessions/authentications (failed/successful)
    • Number of active users in the last 7d/30d
  • Feature usage
    • Number of network diagrams/timelines/lists created
    • Number of users that have created at least one diagram/timeline/list
    • Number of bookmarks created
    • Number of users that have created at least one bookmark
    • Number of users that have created at least one investigation
    • Number of users that have uploaded at least one file
  • Collections
    • Number of collections (investigations/datasets, countries)
    • Number of collections accessed in the last 7d/30d
    • Number of collections with at least one source document uploaded
    • Number of collections with at least one non-document entity created
  • Xref
    • Number of candidates per entity
    • ES query time per entity

Describe the solution you'd like
We should expose metrics in a standard format like Prometheus or OpenTelemetry.
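To make this concrete, here is a minimal sketch (assuming the Python `prometheus_client` library) of how a couple of the queue metrics from the wishlist above could be defined. The metric and label names are illustrative, not Aleph’s actual ones; the `aleph_` prefix anticipates the naming convention discussed further down.

```python
# Illustrative metric definitions for task-queue instrumentation.
from prometheus_client import (
    CollectorRegistry,
    Counter,
    Histogram,
    generate_latest,
)

registry = CollectorRegistry()

# Number of processed tasks per stage, split by outcome
TASKS_PROCESSED = Counter(
    "aleph_tasks_processed_total",
    "Number of processed tasks",
    ["stage", "status"],
    registry=registry,
)

# Processing time per stage
TASK_DURATION = Histogram(
    "aleph_task_duration_seconds",
    "Task processing time per stage",
    ["stage"],
    registry=registry,
)

# In a worker loop, instrumentation would look roughly like this:
with TASK_DURATION.labels(stage="ingest").time():
    pass  # ... process the task ...
TASKS_PROCESSED.labels(stage="ingest", status="success").inc()

# Render the registry in the Prometheus text exposition format
print(generate_latest(registry).decode())
```

Counters and histograms like these cover the event-based metrics; the database-backed metrics (users, collections) are a different story, see the custom-collector discussion below in the commit log.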

Additional context:
There are a few challenges implementing this:

  • The Python clients for OpenTelemetry/Prometheus have limited support for multi-process applications (we use Gunicorn as the API’s WSGI server).
  • We should probably expose metrics on a private port.
  • Collecting metrics should be as easy as possible in a K8s/GKE environment, ideally making use of managed services without deploying additional components.
  • Some of these metrics need to be backed by database queries. We must choose a collection/scraping interval for these metrics that ensures a sufficient resolution while avoiding unnecessary load on the database.
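A sketch of the multi-process setup the first bullet alludes to, assuming `prometheus_client`’s multiprocess mode: each Gunicorn worker writes its samples to files in a shared directory, and a scrape-time collector aggregates them. The directory path and app name here are placeholders, not Aleph’s actual configuration.

```python
import os

# Workers and the exporter must agree on this directory before the
# client library is used; normally it is set in the environment.
os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", "/tmp/aleph-metrics")
os.makedirs(os.environ["PROMETHEUS_MULTIPROC_DIR"], exist_ok=True)

from prometheus_client import CollectorRegistry, generate_latest, multiprocess


def metrics_app(environ, start_response):
    """WSGI app serving /metrics, typically bound to a separate, private port."""
    # Build a fresh registry per scrape; the MultiProcessCollector merges
    # the per-worker sample files from PROMETHEUS_MULTIPROC_DIR.
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    payload = generate_latest(registry)
    start_response(
        "200 OK", [("Content-Type", "text/plain; version=0.0.4")]
    )
    return [payload]
```

Serving this app as its own process (rather than inside the API workers) also addresses the second bullet, since the port it binds to never needs to be exposed publicly.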
@tillprochaska tillprochaska added backend Issues related to Aleph’s backend, API, CLI etc. feature-request Requests for new features or enhancements of existing features labels Jul 17, 2023
@tillprochaska tillprochaska self-assigned this Jul 17, 2023
tillprochaska commented:

Reopening because the issue was auto-closed by a subtask

tillprochaska added a commit that referenced this issue Nov 22, 2023
@tillprochaska tillprochaska reopened this Dec 6, 2023
tillprochaska added a commit that referenced this issue Jan 15, 2024
tillprochaska added a commit that referenced this issue Jan 16, 2024
* Add Prometheus instrumentation

Closes #3214

* Fix missing bind argument

* Run Prometheus exporter as a separate service

* Expose number of streaming requests and number of streamed entities as metrics

* Expose number of auth attempts as Prometheus metrics

* Update Helm chart to expose metrics endpoints, setup ServiceMonitors

* Handle requests without Authz object gracefully

* Rename Prometheus label to "api_endpoint" to prevent naming clashes

Prometheus Operator also uses the "endpoint" label and automatically renames clashing "endpoint" labels exposed by the metrics endpoint to "exported_endpoint", which is ugly.

* Add xref metrics

* Use common prefix for all metric names

Even though it is considered an anti-pattern to add a prefix with the name of the software or component to metrics (according to the official Prometheus documentation), I have decided to add a prefix. I’ve found that this makes it much easier to find relevant metrics. The main disadvantage of per-component prefixes is that queries become slightly more complex if you want to query the same metric (e.g. HTTP request duration) across multiple components. This isn’t super important in our case though, so I think the trade-off is acceptable.

* Expose Python platform information as Prometheus metrics

* Remove unused port, network policy from K8s specs

Although I'm not 100% sure, the exposed port 3000 is probably a left-over from the past, possibly from when convert-document was still part of ingest-file. The network policy prevented Prometheus from scraping ingest-file metrics (and, as the metrics port is now the only port exposed by ingest-file, it should otherwise be unnecessary).

* Use keyword args to set Prometheus metric labels

As suggested by @stchris

* Bump servicelayer from 1.22.0 to 1.22.1

* Simplify entity streaming metrics code

There’s no need to do batched metric increments until this becomes a performance bottleneck.

* Limit maximum size of Prometheus multiprocessing directory

* Do not let collector classes inherit from `object`

I copied the boilerplate for custom collectors from the docs without thinking about it too much, but inheriting from `object` really isn’t necessary anymore in Python 3.

The Prometheus client also exports an abstract `Collector` class -- it doesn’t do anything except provide type hints for the `collect` method, which is nice.

* Add `aleph_` prefix to Prometheus API metrics

* Fix metrics name (singular -> plural)

* Add documentation on how to test Prometheus instrumentation in local Kubernetes cluster
simonwoerpel pushed a commit to investigativedata/aleph that referenced this issue Apr 22, 2024