[RFC]: Metrics Refactoring #1492 Draft PR #1727
Conversation
…sting for passing file path through properties
@lxning @maaquib @msaroufim Hi all, I am opening this PR up for review. I'm not sure if I have permissions to add reviewers, but if there are more reviewers that should be added, please feel free to add them. Apologies in advance for the large PR.
frontend/server/src/main/java/org/pytorch/serve/wlm/WorkerLifeCycle.java
Just added a first round of feedback, may need a few more
examples/Huggingface_Transformers/Transformer_handler_generalized.py
```python
metrics.add_size("GaugeModelMetricNameExample", 42.5)  # adding gauge metric

emit_metrics(metrics.cache)
```
Would it make more sense to have emit_metrics take in a metrics object directly?
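For illustration, a minimal sketch of the suggested signature; the body and the `reset()` call are assumptions based on the surrounding snippets, not the PR's actual implementation:

```python
# Hypothetical sketch: emit_metrics receiving the metrics object itself
# rather than its cache dict; logging stands in for the real emitter.
import logging


def emit_metrics(metrics):
    """Emit every updated metric, then reset it so it is not re-emitted."""
    for metric in metrics.cache.values():
        if metric.is_updated:
            logging.info("[METRICS]%s", metric)
            metric.reset()  # assumed reset-after-emit behaviour from the PR
```

Callers would then write `emit_metrics(metrics)` instead of `emit_metrics(metrics.cache)`.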
```python
class MetricCacheAbstract(metaclass=abc.ABCMeta):
    def __init__(self, request_ids, model_name, file):
```
I thought we didn't yet support a different metric cache per model because we load a single YAML file?
Also wondering: should the metric cache also have a concept of a worker?
```python
    logging.debug(f"Successfully received metric {metric_key}")
    return metric_obj
else:
    raise merrors.MetricsCacheKeyError(
```
Do we need to have the custom wrappers on errors? If we throw an error then we should also see a callstack which would show this line of code
I think it's more of a nice-to-have: it specifies/narrows down the area of the error and gives a bit more context.
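As a rough illustration of the trade-off being discussed (the exception name mirrors the snippet above, but the body is an assumption rather than the PR's code):

```python
# Sketch of a narrow error wrapper for the metrics cache; illustrative only.
class MetricsCacheKeyError(KeyError):
    """Raised when a requested metric is not present in the metrics cache."""


def get_metric(cache, metric_key):
    try:
        return cache[metric_key]
    except KeyError as err:
        # The dedicated type lets callers catch metrics-cache lookups
        # specifically instead of every KeyError, and adds context.
        raise MetricsCacheKeyError(
            f"Metric '{metric_key}' not found in metrics cache"
        ) from err
```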
```python
        dimensions=dimensions,
    )

    def add_percent(
```
I wonder if instead we should have

```python
def add_counter():
def add_gauge():
def add_histogram():
```

Then on top of those we can add convenience wrappers like

```python
def add_percentage():
    add_gauge()
```
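A minimal sketch of that layering, assuming a simple dict-backed cache; the method names follow the suggestion above and the bodies are illustrative, not the PR's implementation:

```python
from enum import Enum


class MetricTypes(Enum):
    counter = "counter"
    gauge = "gauge"
    histogram = "histogram"


class MetricsCache:
    """Illustrative cache with type-specific entry points plus wrappers."""

    def __init__(self):
        self.cache = {}

    def add_metric(self, name, value, unit, metric_type):
        # A real implementation would update an existing Metric object
        # instead of overwriting the cached entry.
        self.cache[name] = {"value": value, "unit": unit, "type": metric_type}

    # Core, type-specific entry points
    def add_counter(self, name, value, unit="count"):
        self.add_metric(name, value, unit, MetricTypes.counter)

    def add_gauge(self, name, value, unit="unit"):
        self.add_metric(name, value, unit, MetricTypes.gauge)

    def add_histogram(self, name, value, unit="unit"):
        self.add_metric(name, value, unit, MetricTypes.histogram)

    # Convenience wrappers layered on top of the core methods
    def add_percentage(self, name, value):
        self.add_gauge(name, value, unit="percent")
```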
```diff
@@ -50,6 +55,8 @@ def __init__(self, name, value,
         self.value = value
         self.dimensions = dimensions
         self.request_id = request_id
+        self.metric_type = metric_type
+        self.is_updated = False if value == 0 else True
```
Nit: Would it make sense to use `None` instead of `0` to indicate that the value is not initialized, since `0` can potentially be a valid updated value for a metric?
So the main purpose of the `is_updated` attribute is to determine whether or not the metric gets emitted to the frontend (if `is_updated` is `True`, then the Metric will be emitted, else it will not). I can understand that `None` may make more sense than using `0`, but my other thought is that a Metric with a value of `0` doesn't provide a lot of information to the user anyhow, so that is why I'm currently using `0`. But if users do want a Metric to have a value of `0`, there is the ability to update the Metric using the `update` method.
The init value of `is_updated` should be `False`.
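For comparison, a sketch of the `None`-as-sentinel alternative discussed above; the attribute names follow the diff, but this is illustrative rather than the PR's final code:

```python
class Metric:
    """Illustrative metric that treats None, not 0, as 'never updated'."""

    def __init__(self, name, value=None, metric_type=None, dimensions=None,
                 request_id=None):
        self.name = name
        self.value = value
        self.metric_type = metric_type
        self.dimensions = dimensions
        self.request_id = request_id
        # None means the metric was never set, so an explicit 0 still counts
        # as real data and gets emitted.
        self.is_updated = value is not None

    def update(self, value):
        self.value = value
        self.is_updated = True

    def reset(self):
        # Mirrors the reset() shown below, using None instead of 0.
        self.value = None
        self.is_updated = False
```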
""" | ||
Reset Metric value to 0 and reset is_updated flag to False | ||
""" | ||
self.value = 0 |
Same comment as above, would it make sense to use `None` instead of `0` when `value` is reset?
```diff
@@ -100,6 +110,12 @@ def load_model(load_model_request):
         if "limitMaxImagePixels" in load_model_request:
             limit_max_image_pixels = bool(load_model_request["limitMaxImagePixels"])

+        metrics = MetricsCacheYaml(
```
Since we are creating an instance of `MetricsCacheYaml` here, which is stored in context at the time of model loading, and we now also have a mechanism to reset metrics after we emit them, I believe we should not be creating instances of `MetricsStore` objects when `predict` is called in `service.py` and replacing the `MetricsCacheYaml` object with the `MetricsStore` object:
https://github.com/pytorch/serve/pull/1727/files#diff-46d0c43f520da01cd15f2b4784a1545eafb2b2c6a8237088f6d79f641b75e19cL95-R114

```python
metrics = MetricsStore(req_id_map, self.context.model_name)
self.context.metrics = metrics
```

If this is the case, then it should be safe to remove `metrics_store.py` since it is no longer used anywhere?
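Roughly, the change being proposed could look like the sketch below; the attribute and method names here are assumptions based on the linked diff, not TorchServe's actual API:

```python
class Service:
    """Illustrative service that reuses the load-time metrics cache."""

    def __init__(self, context):
        # context.metrics is the MetricsCacheYaml created at model-load time
        self.context = context

    def predict(self, batch):
        # Before: a fresh MetricsStore replaced the cache on every call
        #   metrics = MetricsStore(req_id_map, self.context.model_name)
        #   self.context.metrics = metrics
        # After: keep the MetricsCacheYaml object and only refresh the
        # request ids for the current batch (attribute name assumed).
        req_id_map = {idx: req.get("requestId") for idx, req in enumerate(batch)}
        self.context.metrics.request_ids = req_id_map
        return self.run_handler(batch)

    def run_handler(self, batch):
        raise NotImplementedError("placeholder for the actual handler call")
```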
```properties
...
...
# enable_metrics_api=false
```
Let's clarify which configs are actually needed for the example; `enable_metrics_api=false` seems strange.
It is not necessary to have the parameter `enable_metrics_api`. The metrics migration should be seamless for users.
If a `metrics_config` argument is not specified, the default yaml file will be used.

3. Run torchserve and specify the path of the `config.properties` after the `ts-config` flag:
Instead of HuggingFace let's use the densenet161 example since that's simpler to see working
But even then, when I run this command I get this error:

```
2022-10-11T22:41:08,543 [INFO ] W-9000-densenet161_1.0-stdout MODEL_LOG - ts.metrics.metric_cache_errors.MetricsCacheTypeError: File /home/ubuntu/frontend/server/src/test/resources/metrics_default.yaml does not exist.
2022-10-11T22:41:08,544 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2022-10-11T22:41:08,544 [DEBUG] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2022-10-11T22:41:08,544 [DEBUG] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) ~[model-server.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2022-10-11T22:41:08,545 [WARN ] W-9000-densenet161_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: densenet161, error: Worker died.
2022-10-11T22:41:08,545 [DEBUG] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-densenet161_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2022-10-11T22:41:08,545 [WARN ] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-densenet161_1.0-stderr
2022-10-11T22:41:08,545 [WARN ] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-densenet161_1.0-stdout
2022-10-11T22:41:08,546 [INFO ] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2022-10-11T22:41:08,562 [INFO ] W-9000-densenet161_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-densenet161_1.0-stderr
2022-10-11T22:41:08,562 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-densenet161_1.0-stdout
2022-10-11T22:41:09,548 [DEBUG] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/conda/envs/serve/bin/python3.8, /opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000, --metrics-config, /home/ubuntu/frontend/server/src/test/resources/metrics_default.yaml]
2022-10-11T22:41:09,583 [INFO ] main org.pytorch.serve.ModelServer - Torchserve stopped.
java.nio.file.NoSuchFileException: src/test/resources/key.pem
```
The only thing that really worked for me was deleting those resources lines and then copying `config.properties` to my local directory. It might be worth having a `config.properties` in the root directory.
The error "File /home/ubuntu/frontend/server/src/test/resources/metrics_default.yaml does not exist. " means the gradle build integration is missing. The default file path of metrics.yaml should be set in gradle.
(example using [Huggingface_Transformers](https://github.com/pytorch/serve/tree/master/examples/Huggingface_Transformers))

```
torchserve --start --model-store model_store --models my_tc=BERTSeqClassification.mar --ncs --ts-config ../../frontend/server/src/test/resources/config.properties
```
Should mention that logs are visible in `logs/model_metrics.log`, although when I just downloaded the densenet .mar file we have in our getting started guide I didn't see anything populated there. I was hoping to see some default metrics.
* [Custom Metrics API](#custom-metrics-api)
* [Logging the custom metrics](#logging-the-custom-metrics)
* [Log custom metrics](#log-custom-metrics)
* [Metrics YAML Parsing and Metrics API example](#Metrics-YAML-File-Parsing-and-Metrics-API-Custom-Handler-Example)
Should add a known issues section: for example the first inference is now significantly slower
`metric_type=MetricTypes.[counter/gauge/histogram]`.

```python
metrics.add_metric("GenericMetric", value=1, ..., metric_type=MetricTypes.gauge)
```
Is `add_metric()` the right name? This feels more like an `add_measurement()` to me; otherwise why do we need to set up a yaml file?
```diff
@@ -137,27 +297,33 @@ dimN= Dimension(name_n, value_n)

 ### Add generic metrics

 **Generic metrics are defaulted to a `counter` metric type**
```
This whole section should just refer to the codebase instead
```python
from ts.metrics.metric_type_enum import MetricTypes


class CustomHandlerExample:
```
I tried packaging and running this example and still couldn't see any logs printed:

```
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2022-10-11T23:48:34,424 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - File "/opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_service_worker.py", line 120, in load_model
2022-10-11T23:48:34,424 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - service = model_loader.load(
2022-10-11T23:48:34,424 [WARN ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: metrics_model, error: Worker died.
2022-10-11T23:48:34,425 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - File "/opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_loader.py", line 135, in load
2022-10-11T23:48:34,425 [DEBUG] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-metrics_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2022-10-11T23:48:34,425 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - entry_point, initialize_fn = self._get_class_entry_point(module)
2022-10-11T23:48:34,425 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - File "/opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_loader.py", line 197, in _get_class_entry_point
2022-10-11T23:48:34,425 [WARN ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-metrics_model_1.0-stderr
2022-10-11T23:48:34,425 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - raise ValueError(
2022-10-11T23:48:34,425 [WARN ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-metrics_model_1.0-stdout
2022-10-11T23:48:34,425 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - ValueError: Expect handle method in class <class 'metrics_model.CustomHandlerExample'>
2022-10-11T23:48:34,426 [INFO ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 13 seconds.
2022-10-11T23:48:34,426 [INFO ] W-9000-metrics_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-metrics_model_1.0-stdout
2022-10-11T23:48:34,443 [INFO ] W-9000-metrics_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-metrics_model_1.0-stderr
2022-10-11T23:48:47,426 [DEBUG] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/conda/envs/serve/bin/python3.8, /opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000, --metrics-config, /home/ubuntu/serve/frontend/server/src/test/resources/metrics_default.yaml]
2022-10-11T23:48:48,268 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - Listening on port: /tmp/.ts.sock.9000
2022-10-11T23:48:48,269 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - [PID]24436
2022-10-11T23:48:48,269 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - Torch worker started.
2022-10-11T23:48:48,269 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - Python runtime: 3.8.13
2022-10-11T23:48:48,269 [DEBUG] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-metrics_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2022-10-11T23:48:48,269 [INFO ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2022-10-11T23:48:48,271 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.9000.
2022-10-11T23:48:48,271 [INFO ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1665532128271
2022-10-11T23:48:48,293 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - model_name: metrics_model, batchSize: 1
2022-10-11T23:48:48,299 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - Successfully loaded /home/ubuntu/serve/frontend/server/src/test/resources/metrics_default.yaml.
2022-10-11T23:48:48,299 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - Backend worker process died.
2022-10-11T23:48:48,299 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2022-10-11T23:48:48,300 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - File "/opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_service_worker.py", line 223, in <module>
2022-10-11T23:48:48,300 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - worker.run_server()
2022-10-11T23:48:48,300 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - File "/opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_service_worker.py", line 191, in run_server
2022-10-11T23:48:48,300 [INFO ] epollEventLoopGroup-5-8 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2022-10-11T23:48:48,300 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - self.handle_connection(cl_socket)
2022-10-11T23:48:48,301 [DEBUG] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2022-10-11T23:48:48,301 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - File "/opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_service_worker.py", line 156, in handle_connection
2022-10-11T23:48:48,301 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - service, result, code = self.load_model(msg)
2022-10-11T23:48:48,301 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - File "/opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_service_worker.py", line 120, in load_model
2022-10-11T23:48:48,301 [DEBUG] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2022-10-11T23:48:48,301 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - service = model_loader.load(
2022-10-11T23:48:48,301 [WARN ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: metrics_model, error: Worker died.
2022-10-11T23:48:48,302 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - File "/opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_loader.py", line 135, in load
2022-10-11T23:48:48,302 [DEBUG] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-metrics_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2022-10-11T23:48:48,302 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - entry_point, initialize_fn = self._get_class_entry_point(module)
2022-10-11T23:48:48,302 [WARN ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-metrics_model_1.0-stderr
2022-10-11T23:48:48,302 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - File "/opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_loader.py", line 197, in _get_class_entry_point
2022-10-11T23:48:48,302 [WARN ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-metrics_model_1.0-stdout
2022-10-11T23:48:48,302 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - raise ValueError(
2022-10-11T23:48:48,302 [INFO ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 21 seconds.
2022-10-11T23:48:48,303 [INFO ] W-9000-metrics_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-metrics_model_1.0-stdout
2022-10-11T23:48:48,320 [INFO ] W-9000-metrics_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-metrics_model_1.0-stderr
```
According to the log "Expect handle method in class <class 'metrics_model.CustomHandlerExample'>", it seems that the `handle` method is missing in the example.
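For reference, a minimal handler sketch with the missing `handle()` entry point; the metric calls mirror snippets elsewhere in this thread and any extra arguments are assumptions, not the exact example shipped with the PR:

```python
from ts.metrics.metric_type_enum import MetricTypes


class CustomHandlerExample:
    """Illustrative handler that records custom metrics via context.metrics."""

    def initialize(self, context):
        self.context = context

    def handle(self, data, context):
        metrics = context.metrics
        # Counter incremented once per request (signature assumed)
        metrics.add_metric("HandlerExampleCount", value=1,
                           metric_type=MetricTypes.counter)
        # Gauge example taken from the snippet earlier in this PR
        metrics.add_size("GaugeModelMetricNameExample", 42.5)
        # Echo the input; a real handler would run preprocess/inference here
        return [str(row) for row in data]
```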
```python
# Emitting the metrics that have been updated to the frontend and
# then resetting the Metrics' values afterwards
emit_metrics(metrics.cache)
```
Doc is missing a section on Prometheus export.
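If such a section is added, a quick sanity check could look like this sketch; it assumes the metrics API is listening on its default address (`http://127.0.0.1:8082/metrics`):

```python
# Fetch the Prometheus-format metrics exposed by the TorchServe metrics API.
import requests

response = requests.get("http://127.0.0.1:8082/metrics")
print(response.text)  # e.g. lines such as ts_inference_requests_total{...}
```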
### Updating Metrics parsed from the yaml file
Why does a customer need to change a metric defined in metrics.yaml?
Closing in favour of #1954
Fixes #1492
TorchServe defines metrics in a metrics.yaml file, including both frontend metrics (i.e. ts_metrics) and backend metrics (i.e. model_metrics). When TorchServe is started, the metrics definition is loaded into the frontend and backend caches separately. The backend flushes its metrics cache once a load-model or inference request is done.
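A condensed sketch of that backend flow; the constructor arguments follow `MetricCacheAbstract(request_ids, model_name, file)` from the diff, but the module path and wiring are assumptions rather than the PR's exact code:

```python
from types import SimpleNamespace

from ts.metrics.metric_cache_yaml import MetricsCacheYaml  # assumed module path


def load_model(model_name, metrics_config_path):
    # 1. At load time, parse the model_metrics definitions from metrics.yaml
    context = SimpleNamespace()
    context.metrics = MetricsCacheYaml(
        request_ids=None, model_name=model_name, file=metrics_config_path
    )
    return context


def handle_request(context, data, emit_metrics):
    # 2. The handler updates metrics while serving the request ...
    context.metrics.add_size("GaugeModelMetricNameExample", 42.5)
    result = data  # placeholder for actual inference
    # 3. ... and the backend flushes the updated metrics once the request is done
    emit_metrics(context.metrics.cache)
    return result
```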