Add Grafana dashboard for monitoring OPEA application scaling in k8s #541

Merged
merged 3 commits into from
Nov 13, 2024


@eero-t eero-t commented Nov 8, 2024

Description

Adds Grafana dashboard for monitoring OPEA application scaling:

  • How many of the application and its TGI + TEI pods are created, ready and in use
  • How many requests they are processing (min and max across all replicas)
  • How many failures they are reporting (sum across replicas)

And a helper script for installing dashboard k8s configMaps for Grafana.
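As a rough illustration of what installing a dashboard as a ConfigMap involves (this is not the helper script from this PR), the sketch below wraps a dashboard JSON document in a ConfigMap manifest. The `grafana_dashboard: "1"` label is an assumption based on the common Grafana sidecar convention, and the `opea-scaling` and `monitoring` names are hypothetical; check what your Grafana installation actually watches for.

```python
import json


def dashboard_configmap(name, namespace, dashboard, label="grafana_dashboard"):
    """Wrap a Grafana dashboard (a dict) in a Kubernetes ConfigMap manifest.

    The label key/value is an assumption: many Grafana sidecar setups
    pick up ConfigMaps labeled grafana_dashboard="1", but this is configurable.
    """
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {label: "1"},
        },
        # One data key per dashboard file; the value is the dashboard JSON text.
        "data": {f"{name}.json": json.dumps(dashboard, indent=2)},
    }


if __name__ == "__main__":
    # Placeholder dashboard content, not the dashboard added by this PR.
    demo = {"title": "OPEA scaling (placeholder)", "panels": []}
    manifest = dashboard_configmap("opea-scaling", "monitoring", demo)
    print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON manifests, the printed output could be piped to `kubectl apply -f -`.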

Unlike the earlier ChatQnA dashboard, this one handles multiple OPEA applications that have the same name but run in separate namespaces. The user selects a namespace and then an OPEA application within it. If the cluster has only one application running, the dashboard defaults to that.

(Therefore it does not make sense to install the dashboard with application-specific Helm charts, as it can cover all apps that use TGI for LLM, i.e. most of them.)
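To make the namespace-then-application scoping concrete, here is a minimal sketch of how such per-panel queries could be assembled. It is not taken from the dashboard JSON in this PR: the `service` label, the `$namespace` and `$service` Grafana variables, and the use of `tgi_queue_size` as the example metric are all assumptions chosen for illustration.

```python
def scoped_query(agg, metric, labels):
    """Build a PromQL expression that aggregates one metric across replicas,
    restricted to the given label selectors (e.g. namespace and service)."""
    selectors = ", ".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{agg}({metric}{{{selectors}}})"


if __name__ == "__main__":
    # Hypothetical Grafana template variables; real label names depend on
    # how the metrics are scraped and relabeled in the cluster.
    scope = {"namespace": "$namespace", "service": "$service"}
    for agg in ("min", "max", "sum"):
        print(scoped_query(agg, "tgi_queue_size", scope))
```

Including the namespace in every selector is what keeps two identically named applications in different namespaces from being mixed into one series.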

Issues

n/a.

Type of change

  • New feature (non-breaking change which adds new functionality)

Dependencies

n/a.

Tests

Manually tested that the script and the dashboard work.

@eero-t eero-t requested a review from daisy-ycguo as a code owner November 8, 2024 18:02

eero-t commented Nov 8, 2024

Currently the dashboard relies on the HTTP inprogress metric for how many pending requests the application has: opea-project/GenAIComps#845

But depending on whether the following PR is merged for v1.1, that particular metric may need to be changed before v1.1: opea-project/GenAIComps#864


eero-t commented Nov 8, 2024

I can also add a blurb about this to the README, but scaling is currently a bit of a corner case, so IMHO that could also come in the next release.

A larger question about the Observability README, and the things it refers to, is what to do with the chatqna/ sub-directory content here, now that the Helm charts have more generic monitoring support for OPEA applications.

Regarding the dashboards under that:

  • queue_size_embedding_rerank_tgi.json: some of its queries have no selectors at all, some use a service selector
  • tgi_grafana.json: queries use a container selector (container="$service")

I.e. neither properly handles the case where the cluster is running multiple OPEA applications with TGI instances. The new dashboard covers the first one to some extent. The TGI details dashboard could be updated to use selectors similar to those in this new dashboard.

@poussa poussa added this to the v1.1 milestone Nov 8, 2024
@poussa poussa requested review from poussa and lianhao and removed request for daisy-ycguo November 12, 2024 15:19

eero-t commented Nov 12, 2024

FYI: I'm going to change the dashboard "Failures" heading to "Incomplete requests". I do not think half of the TGI requests are failures; rather, the frontend needs to request the rest of the reply with another query before TGI deems the request "complete" (successful).

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>

eero-t commented Nov 12, 2024

Dashboard changes:

@poussa poussa merged commit 691bbc5 into opea-project:main Nov 13, 2024
6 checks passed
@eero-t eero-t deleted the grafana branch November 14, 2024 13:13
4 participants