
FEATURE: Metrics #3214

Closed
tillprochaska opened this issue Jul 17, 2023 · 2 comments · Fixed by #3216 or alephdata/servicelayer#111
tillprochaska commented Jul 17, 2023

Is your feature request related to a problem? Please describe.
Aleph currently doesn’t expose any metrics directly. At OCCRP, we track some log-based metrics as well as metrics from Elasticsearch, but we’d like to start instrumenting Aleph directly in order to operate Aleph better and to get insights into how Aleph features are adopted.

Metrics we’re interested in include the following. We already collect some of these metrics indirectly, but we should start collecting them using explicit instrumentation.

  • Queues
    • Number of queued tasks per stage
    • Number of processed tasks (failed/successful)
    • Processing time per stage
  • API
    • Number of requests (authenticated/anonymous, failed/successful)
    • Number of streamed entities
    • Response time (search endpoints and possibly other endpoints too)
    • Elasticsearch query response time
  • Ingest
    • Processing time per ingestor (PDF, Excel, Email, …)
    • Number of bytes ingested per ingestor
    • Number of processed tasks per ingestor (failed/successful)
    • PDF/OCR ingest cache hits/misses
  • Users
    • Number of users (language)
    • Number of new sessions/authentications (failed/successful)
    • Number of active users in the last 7d/30d
  • Feature usage
    • Number of network diagrams/timelines/lists created
    • Number of users that have created at least one diagram/timeline/list
    • Number of bookmarks created
    • Number of users that have created at least one bookmark
    • Number of users that have created at least one investigation
    • Number of users that have uploaded at least one file
  • Collections
    • Number of collections (investigations/datasets, countries)
    • Number of collections accessed in the last 7d/30d
    • Number of collections with at least one source document uploaded
    • Number of collections with at least one non-document entity created
  • Xref
    • Number of candidates per entity
    • ES query time per entity

Describe the solution you'd like
We should expose metrics in a standard format like Prometheus or OpenTelemetry.
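To make this concrete, here is a minimal sketch (assuming the Python `prometheus_client` library) of how a couple of the queue metrics from the wishlist above could be defined. The metric and label names are illustrative, not Aleph’s actual ones; the `aleph_` prefix anticipates the naming convention discussed further down.

```python
# Illustrative metric definitions for task-queue instrumentation.
from prometheus_client import (
    CollectorRegistry,
    Counter,
    Histogram,
    generate_latest,
)

registry = CollectorRegistry()

# Number of processed tasks per stage, split by outcome
TASKS_PROCESSED = Counter(
    "aleph_tasks_processed_total",
    "Number of processed tasks",
    ["stage", "status"],
    registry=registry,
)

# Processing time per stage
TASK_DURATION = Histogram(
    "aleph_task_duration_seconds",
    "Task processing time per stage",
    ["stage"],
    registry=registry,
)

# In a worker loop, instrumentation would look roughly like this:
with TASK_DURATION.labels(stage="ingest").time():
    pass  # ... process the task ...
TASKS_PROCESSED.labels(stage="ingest", status="success").inc()

# Render the registry in the Prometheus text exposition format
print(generate_latest(registry).decode())
```

Counters and histograms like these cover the event-based metrics; the database-backed metrics (users, collections) are a different story, see the custom-collector discussion below in the commit log.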

Additional context:
There are a few challenges implementing this:

  • The Python clients for OpenTelemetry/Prometheus have limited support for multi-process applications (we use Gunicorn as the API’s WSGI server).
  • We should probably expose metrics on a private port.
  • Collecting metrics should be as easy as possible in a K8s/GKE environment, ideally making use of managed services without deploying additional components.
  • Some of these metrics need to be backed by database queries. We must choose a collection/scraping interval for these metrics that ensures a sufficient resolution while avoiding unnecessary load on the database.
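A sketch of the multi-process setup the first bullet alludes to, assuming `prometheus_client`’s multiprocess mode: each Gunicorn worker writes its samples to files in a shared directory, and a scrape-time collector aggregates them. The directory path and app name here are placeholders, not Aleph’s actual configuration.

```python
import os

# Workers and the exporter must agree on this directory before the
# client library is used; normally it is set in the environment.
os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", "/tmp/aleph-metrics")
os.makedirs(os.environ["PROMETHEUS_MULTIPROC_DIR"], exist_ok=True)

from prometheus_client import CollectorRegistry, generate_latest, multiprocess


def metrics_app(environ, start_response):
    """WSGI app serving /metrics, typically bound to a separate, private port."""
    # Build a fresh registry per scrape; the MultiProcessCollector merges
    # the per-worker sample files from PROMETHEUS_MULTIPROC_DIR.
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    payload = generate_latest(registry)
    start_response(
        "200 OK", [("Content-Type", "text/plain; version=0.0.4")]
    )
    return [payload]
```

Serving this app as its own process (rather than inside the API workers) also addresses the second bullet, since the port it binds to never needs to be exposed publicly.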
@tillprochaska tillprochaska added backend Issues related to Aleph’s backend, API, CLI etc. feature-request Requests for new features or enhancements of existing features labels Jul 17, 2023
@tillprochaska tillprochaska self-assigned this Jul 17, 2023
tillprochaska commented:

Reopening because the issue was auto-closed by a subtask

tillprochaska added a commit that referenced this issue Nov 22, 2023
@tillprochaska tillprochaska reopened this Dec 6, 2023
tillprochaska added a commit that referenced this issue Jan 15, 2024
tillprochaska added a commit that referenced this issue Jan 16, 2024
* Add Prometheus instrumentation

Closes #3214

* Fix missing bind argument

* Run Prometheus exporter as a separate service

* Expose number of streaming requests and number of streamed entities as metrics

* Expose number of auth attempts as Prometheus metrics

* Update Helm chart to expose metrics endpoints, setup ServiceMonitors

* Handle requests without Authz object gracefully

* Rename Prometheus label to "api_endpoint" to prevent naming clashes

Prometheus Operator also uses the "endpoint" label and automatically renames clashing "endpoint" labels exposed by the metrics endpoint to "exported_endpoint", which is ugly.

* Add xref metrics

* Use common prefix for all metric names

Even though it is considered an anti-pattern to add a prefix with the name of the software or component to metrics (according to the official Prometheus documentation), I have decided to add a prefix. I’ve found that this makes it much easier to find relevant metrics. The main disadvantage of per-component prefixes is that queries become slightly more complex if you want to query the same metric (e.g. HTTP request duration) across multiple components. This isn’t super important in our case though, so I think the trade-off is acceptable.

* Expose Python platform information as Prometheus metrics

* Remove unused port, network policy from K8s specs

Although I'm not 100% sure, the exposed port 3000 is probably a left-over from the past, possibly from when convert-document was still part of ingest-file. The network policy prevented Prometheus from scraping ingest-file metrics (and, as the metrics port is now the only port exposed by ingest-file, it should otherwise be unnecessary).

* Use keyword args to set Prometheus metric labels

As suggested by @stchris

* Bump servicelayer from 1.22.0 to 1.22.1

* Simplify entity streaming metrics code

There’s no need to do batched metric increments until this becomes a performance bottleneck.

* Limit maximum size of Prometheus multiprocessing directory

* Do not let collector classes inherit from `object`

I copied the boilerplate for custom collectors from the docs without thinking about it too much, but inheriting from `object` really isn’t necessary anymore in Python 3.

The Prometheus client also exports an abstract `Collector` class -- it doesn’t do anything except provide type hints for the `collect` method, which is nice.

* Add `aleph_` prefix to Prometheus API metrics

* Fix metrics name (singular -> plural)

* Add documentation on how to test Prometheus instrumentation in local Kubernetes cluster
simonwoerpel pushed a commit to investigativedata/aleph that referenced this issue Apr 22, 2024