
[Question] Handling vast amount of time series as a result of model_name label #426

GolanLevy opened this issue Sep 6, 2023 · 0 comments

Following the summary in kserve/modelmesh#60 provided by @njhill, I understand that the ModelMesh project is well aware that there can be tens of thousands of models, quickly swapping across many predictor instances, and sometimes used only once before being evicted to make room for other models.

I'm glad to see that adding model_name as a label to metrics is configurable (and it will be off for our use case, of course).
However, this is not the case for KServe transformers' metrics, which we use for pre/post-processing; see kserve/kserve#2589.

I wonder how you deal with the following issues:

  1. Did you find a way to omit the model_name label in KServe transformers?
  2. Is there a way to set a TTL on time series? The accumulation of all possible label combinations for a metric (for example, the tuples of pod and model_name) makes the metrics report so large that it noticeably affects our CPU usage. Most of the time series will not be updated for at least a few hours, and we would be happy to get rid of them.
  3. We frequently see scenarios in which a predictor or a transformer receives a request for a specific model only once.
    In these cases, the time series is created and written exactly once. Any Grafana query that applies a rate function (rate/increase/delta/etc.) over the range vector of that metric is then useless, since there is only one data point.
    The Prometheus maintainers are aware of this issue and have recently started designing a solution.
    The community is also aware of the problem and has proposed workarounds, which usually require computationally heavy queries
    or more sophisticated Prometheus clients (note that this client also addresses the previous bullet).
    Did you find a way to handle these scenarios?
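Since there is no built-in client-side TTL for time series, one workaround for point 2 is to track when each label set was last updated and periodically drop stale ones before scraping. Below is a minimal pure-Python sketch of that idea; the names `TTLCounter`, `inc`, and `expire` are hypothetical and made up for illustration (this is not the prometheus_client API, although with the official Python client the `remove()` method on a labeled metric could serve a similar purpose):

```python
import threading
import time

class TTLCounter:
    """Hypothetical in-process counter that drops label sets which
    have not been updated within `ttl_seconds` (a sketch of the
    client-side TTL idea, not a real Prometheus client)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._values = {}      # label tuple -> counter value
        self._last_seen = {}   # label tuple -> timestamp of last update
        self._lock = threading.Lock()

    def inc(self, *labels, amount=1, now=None):
        # `now` is injectable for testing; defaults to a monotonic clock.
        now = time.monotonic() if now is None else now
        with self._lock:
            self._values[labels] = self._values.get(labels, 0) + amount
            self._last_seen[labels] = now

    def expire(self, now=None):
        """Remove series idle longer than the TTL; call periodically
        (e.g. from a background thread). Returns the number dropped."""
        now = time.monotonic() if now is None else now
        with self._lock:
            stale = [k for k, t in self._last_seen.items() if now - t > self.ttl]
            for k in stale:
                del self._values[k]
                del self._last_seen[k]
        return len(stale)

    def collect(self):
        """Snapshot of live series, e.g. to render a scrape response."""
        with self._lock:
            return dict(self._values)
```

For example, with a one-hour TTL, a (pod, model_name) series touched once and then idle is dropped on the next `expire()` pass, so it no longer inflates the scrape payload:

```python
c = TTLCounter(ttl_seconds=3600)
c.inc("pod-a", "model-1", now=0)
c.inc("pod-a", "model-2", now=5000)
c.expire(now=5000)   # drops the idle ("pod-a", "model-1") series
```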

I feel the issues mentioned here are relevant specifically to ModelMesh (and not KServe in general), since KServe was not designed to manage a huge number of models.

Thanks!
