
[RFC]: Metrics Refactoring #1492 Draft PR #1727

Status: Closed (wanted to merge 55 commits)

Conversation

joshuaan7 (Contributor) commented Jul 7, 2022:

Fixes #1492

TorchServe defines metrics in a metrics.yaml file, including both frontend metrics (i.e. ts_metrics) and backend metrics (i.e. model_metrics). When TorchServe starts, the metrics definitions are loaded into the frontend and backend caches separately. The backend flushes its metrics cache once a load-model or inference request completes.
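To make the frontend/backend split concrete, here is an illustrative sketch of what such a file might look like. The key layout and metric names below are assumptions for illustration, not the exact schema introduced by this PR:

```yaml
# Illustrative sketch only: key names and metric names are assumptions
ts_metrics:          # frontend metrics
  counter:
    - name: Requests2XX
      unit: Count
model_metrics:       # backend metrics
  gauge:
    - name: HandlerTime
      unit: ms
```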

Type of change

  • New feature (non-breaking change which adds functionality)

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

Checklist:

  • [x] Did you have fun?
  • [x] Have you added tests that prove your fix is effective or that this feature works?
  • [x] Has code been commented, particularly in hard-to-understand areas?
  • [x] Have you made corresponding changes to the documentation?

@maaquib maaquib requested a review from lxning July 21, 2022 21:45
@maaquib maaquib added the enhancement (New feature or request) label Jul 21, 2022
@rohithkrn rohithkrn self-requested a review August 3, 2022 17:45
@joshuaan7 joshuaan7 marked this pull request as ready for review August 24, 2022 23:23
joshuaan7 (Contributor, Author):

@lxning @maaquib @msaroufim Hi all, I am opening this PR up for review. I'm not sure if I have permissions to add reviewers, but if there are more reviewers who should be added, please feel free to add them. Apologies in advance for the large PR.

Resolved review threads (outdated): ts/arg_parser.py, ts/metrics/metric.py, ts/metrics/metric_cache_yaml.py
msaroufim (Member) left a comment:

Just added a first round of feedback, may need a few more


metrics.add_size("GaugeModelMetricNameExample", 42.5) # adding gauge metric

emit_metrics(metrics.cache)
Member:
Would it make more sense to have emit_metrics take in a metrics object directly?
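A minimal sketch of the two API shapes under discussion, using hypothetical stand-in classes (not TorchServe's actual ones): passing the raw cache dict exposes the object's internals, while passing the metrics object keeps the cache an implementation detail.

```python
# Hypothetical sketch contrasting emit_metrics(metrics.cache) with
# emit_metrics(metrics). Class and function names are illustrative.

class Metric:
    def __init__(self, name, value):
        self.name = name
        self.value = value

class MetricsCache:
    def __init__(self):
        self.cache = {}

    def add_metric(self, name, value):
        self.cache[name] = Metric(name, value)

def emit_metrics_from_dict(cache):
    # current shape: the caller reaches into the object's internals
    return [f"{m.name}:{m.value}" for m in cache.values()]

def emit_metrics(metrics):
    # proposed shape: the cache stays an implementation detail
    return emit_metrics_from_dict(metrics.cache)

metrics = MetricsCache()
metrics.add_metric("InferenceLatency", 12.5)
assert emit_metrics_from_dict(metrics.cache) == emit_metrics(metrics)
```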

Resolved review thread (outdated): docs/metrics.md


class MetricCacheAbstract(metaclass=abc.ABCMeta):
def __init__(self, request_ids, model_name, file):
Member:
I thought we didn't yet support a different metric cache per model because we load a single YAML file?

Also wondering should the metric cache also have a concept of a worker?

Resolved review threads: ts/metrics/metric_cache_abstract.py
logging.debug(f"Successfully received metric {metric_key}")
return metric_obj
else:
raise merrors.MetricsCacheKeyError(
Member:
Do we need to have the custom wrappers on errors? If we throw an error then we should also see a callstack which would show this line of code

joshuaan7 (Contributor, Author):
I think it's more of a nice-to-have: it narrows down the area the error came from and gives a bit more context.
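As a sketch of that trade-off, a thin exception subclass costs little while letting both the exception type and the message carry context. The code below is hypothetical, loosely modeled on the `MetricsCacheKeyError` shown above:

```python
# Hypothetical sketch: a narrow exception type for cache lookups.
# Callers can catch metrics-cache errors specifically, and the message
# adds context that a bare KeyError callstack would not.

class MetricsCacheKeyError(KeyError):
    """Raised when a metric name is not present in the metrics cache."""

cache = {"HandlerTime": 42}

def get_metric(key):
    try:
        return cache[key]
    except KeyError as exc:
        raise MetricsCacheKeyError(
            f"Metric '{key}' not found; available metrics: {sorted(cache)}"
        ) from exc

print(get_metric("HandlerTime"))  # a valid lookup still works
```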

dimensions=dimensions,
)

def add_percent(
Member:
I wonder if instead we should have

```python
def add_counter(): ...
def add_gauge(): ...
def add_histogram(): ...
```

Then on top of those we can add convenience wrappers like

```python
def add_percentage():
    add_gauge()
```
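A runnable sketch of that layering, with a hypothetical `Metrics` class (method and field names are illustrative): three core methods keyed by metric type, plus thin convenience wrappers built on top of them.

```python
# Hypothetical sketch of the suggested layering: core add_* methods per
# metric type, with convenience wrappers delegating to them.

class Metrics:
    def __init__(self):
        self.cache = []

    def _add(self, metric_type, name, value, unit):
        self.cache.append((metric_type, name, value, unit))

    def add_counter(self, name, value, unit="count"):
        self._add("counter", name, value, unit)

    def add_gauge(self, name, value, unit=""):
        self._add("gauge", name, value, unit)

    def add_histogram(self, name, value, unit=""):
        self._add("histogram", name, value, unit)

    # Convenience wrapper built on the core three:
    def add_percentage(self, name, value):
        self.add_gauge(name, value, unit="percent")

m = Metrics()
m.add_percentage("DiskUtilization", 87.2)
assert m.cache == [("gauge", "DiskUtilization", 87.2, "percent")]
```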


@@ -50,6 +55,8 @@ def __init__(self, name, value,
self.value = value
self.dimensions = dimensions
self.request_id = request_id
self.metric_type = metric_type
self.is_updated = False if value == 0 else True
namannandan (Collaborator) commented Sep 6, 2022:
Nit: Would it make sense to use None instead of 0 to indicate that the value is not initialized, since 0 can potentially be a valid updated value for a metric?

joshuaan7 (Contributor, Author):
The main purpose of the is_updated attribute is to determine whether the metric gets emitted to the frontend (if is_updated is True, the Metric will be emitted; otherwise it will not). I understand that None may make more sense than 0, but my other thought is that a Metric with a value of 0 doesn't provide much information to the user anyhow, which is why I'm currently using 0. If users do want a Metric to have a value of 0, they can still set it via the update method.
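The trade-off can be made concrete with a small sketch (the `Metric` shape below is hypothetical): a `None` sentinel distinguishes "never set" from a genuine zero measurement, which a `value == 0` check cannot.

```python
# Sketch of the None-vs-0 sentinel discussion. Names are illustrative.

class Metric:
    def __init__(self, name, value=None):
        self.name = name
        self.value = value

    @property
    def is_updated(self):
        # None means "never set"; any real measurement, including 0, counts
        return self.value is not None

    def update(self, value):
        self.value = value

    def reset(self):
        self.value = None

m = Metric("QueueLength")
assert not m.is_updated        # freshly created: not emitted
m.update(0)
assert m.is_updated            # a genuine zero still counts as updated
m.reset()
assert not m.is_updated
```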

Collaborator:
The initial value of `is_updated` should be `False`.

"""
Reset Metric value to 0 and reset is_updated flag to False
"""
self.value = 0
Collaborator:
Same comment as above, would it make sense to use None instead of 0 when value is reset?

@@ -100,6 +110,12 @@ def load_model(load_model_request):
if "limitMaxImagePixels" in load_model_request:
limit_max_image_pixels = bool(load_model_request["limitMaxImagePixels"])

metrics = MetricsCacheYaml(
Collaborator:
Since we create an instance of MetricsCacheYaml here, which is stored in the context at model-load time, and we now also have a mechanism to reset metrics after emitting them, I believe we should not be creating MetricsStore instances when predict is called in service.py; the MetricsCacheYaml object should replace the MetricsStore object:
https://github.com/pytorch/serve/pull/1727/files#diff-46d0c43f520da01cd15f2b4784a1545eafb2b2c6a8237088f6d79f641b75e19cL95-R114

metrics = MetricsStore(req_id_map, self.context.model_name)
self.context.metrics = metrics

If this is the case then it should be safe to remove metrics_store.py since it is no longer used anywhere?
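The proposed lifecycle (create the metrics cache once at load time, store it on the context, reuse it for every inference, and reset after emitting) can be sketched with hypothetical stand-ins for TorchServe's classes:

```python
# Hypothetical sketch: one metrics object per loaded model, reused across
# predict calls and reset after each emit. Names are illustrative.

class MetricsCacheYaml:
    def __init__(self, model_name):
        self.model_name = model_name
        self.cache = {}

    def add_metric(self, name, value):
        self.cache[name] = value

    def reset(self):
        self.cache.clear()

class Context:
    def __init__(self, metrics):
        self.metrics = metrics

def load_model(model_name):
    # created once, at load time
    return Context(MetricsCacheYaml(model_name))

def emit(metrics):
    emitted = dict(metrics.cache)
    metrics.reset()  # ready for the next request
    return emitted

def predict(context, data):
    # reuse the cached object instead of building a new MetricsStore
    context.metrics.add_metric("PredictionTime", 1.0)
    emit(context.metrics)
    return data

ctx = load_model("densenet161")
first = ctx.metrics
predict(ctx, b"input")
assert ctx.metrics is first      # same object across requests
assert ctx.metrics.cache == {}   # reset after emit
```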

```properties
...
...
# enable_metrics_api=false
```
Member:
Let's clarify which configs are actually needed for the example

enable_metrics_api=false seems strange

Collaborator:
It is not necessary to have the `enable_metrics_api` parameter. The metrics migration should be seamless for users.

If a `metrics_config` argument is not specified, the default yaml file will be used.


3. Run torchserve and specify the path of the `config.properties` after the `ts-config` flag:
Member:
Instead of HuggingFace, let's use the densenet161 example since that's simpler to see working.

But even then when I run this command I get this error

2022-10-11T22:41:08,543 [INFO ] W-9000-densenet161_1.0-stdout MODEL_LOG - ts.metrics.metric_cache_errors.MetricsCacheTypeError: File /home/ubuntu/frontend/server/src/test/resources/metrics_default.yaml does not exist.
2022-10-11T22:41:08,544 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2022-10-11T22:41:08,544 [DEBUG] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2022-10-11T22:41:08,544 [DEBUG] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) ~[model-server.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
2022-10-11T22:41:08,545 [WARN ] W-9000-densenet161_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: densenet161, error: Worker died.
2022-10-11T22:41:08,545 [DEBUG] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-densenet161_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2022-10-11T22:41:08,545 [WARN ] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-densenet161_1.0-stderr
2022-10-11T22:41:08,545 [WARN ] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-densenet161_1.0-stdout
2022-10-11T22:41:08,546 [INFO ] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2022-10-11T22:41:08,562 [INFO ] W-9000-densenet161_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-densenet161_1.0-stderr
2022-10-11T22:41:08,562 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-densenet161_1.0-stdout
2022-10-11T22:41:09,548 [DEBUG] W-9000-densenet161_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/conda/envs/serve/bin/python3.8, /opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000, --metrics-config, /home/ubuntu/frontend/server/src/test/resources/metrics_default.yaml]
2022-10-11T22:41:09,583 [INFO ] main org.pytorch.serve.ModelServer - Torchserve stopped.
java.nio.file.NoSuchFileException: src/test/resources/key.pem

The only thing that really worked for me was deleting those resources lines and then copying config.properties to my local directory. It might be worth having a config.properties in the root directory

Collaborator:
The error "File /home/ubuntu/frontend/server/src/test/resources/metrics_default.yaml does not exist." means the Gradle build integration is missing. The default file path of metrics.yaml should be set in Gradle.

(example using [Huggingface_Transformers](https://github.com/pytorch/serve/tree/master/examples/Huggingface_Transformers))

```
torchserve --start --model-store model_store --models my_tc=BERTSeqClassification.mar --ncs --ts-config ../../frontend/server/src/test/resources/config.properties
```

Member:
We should mention that logs are visible in logs/model_metrics.log. However, when I downloaded the densenet.mar file from our getting started guide, I didn't see anything populated there; I was hoping to see some default metrics.

* [Custom Metrics API](#custom-metrics-api)
* [Logging the custom metrics](#logging-the-custom-metrics)
* [Log custom metrics](#log-custom-metrics)
* [Metrics YAML Parsing and Metrics API example](#Metrics-YAML-File-Parsing-and-Metrics-API-Custom-Handler-Example)

Member:
Should add a known issues section: for example the first inference is now significantly slower

`metric_type=MetricTypes.[counter/gauge/histogram]`.

```python
metrics.add_metric("GenericMetric", value=1, ..., metric_type=MetricTypes.gauge)
Member:
Is add_metric() the right name? It feels more like add_measurement() to me; otherwise, why do we need to set up a yaml file?

@@ -137,27 +297,33 @@ dimN= Dimension(name_n, value_n)

### Add generic metrics

**Generic metrics are defaulted to a `counter` metric type**

Member:
This whole section should just refer to the codebase instead

from ts.metrics.metric_type_enum import MetricTypes


class CustomHandlerExample:
Member:
I tried packaging and running this example and still couldn't see any logs printed

java.lang.InterruptedException: null
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
2022-10-11T23:48:34,424 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_service_worker.py", line 120, in load_model
2022-10-11T23:48:34,424 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG -     service = model_loader.load(
2022-10-11T23:48:34,424 [WARN ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: metrics_model, error: Worker died.
2022-10-11T23:48:34,425 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_loader.py", line 135, in load
2022-10-11T23:48:34,425 [DEBUG] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-metrics_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2022-10-11T23:48:34,425 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG -     entry_point, initialize_fn = self._get_class_entry_point(module)
2022-10-11T23:48:34,425 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_loader.py", line 197, in _get_class_entry_point
2022-10-11T23:48:34,425 [WARN ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-metrics_model_1.0-stderr
2022-10-11T23:48:34,425 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG -     raise ValueError(
2022-10-11T23:48:34,425 [WARN ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-metrics_model_1.0-stdout
2022-10-11T23:48:34,425 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - ValueError: Expect handle method in class <class 'metrics_model.CustomHandlerExample'>
2022-10-11T23:48:34,426 [INFO ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 13 seconds.
2022-10-11T23:48:34,426 [INFO ] W-9000-metrics_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-metrics_model_1.0-stdout
2022-10-11T23:48:34,443 [INFO ] W-9000-metrics_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-metrics_model_1.0-stderr
2022-10-11T23:48:47,426 [DEBUG] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/conda/envs/serve/bin/python3.8, /opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000, --metrics-config, /home/ubuntu/serve/frontend/server/src/test/resources/metrics_default.yaml]
2022-10-11T23:48:48,268 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - Listening on port: /tmp/.ts.sock.9000
2022-10-11T23:48:48,269 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - [PID]24436
2022-10-11T23:48:48,269 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - Torch worker started.
2022-10-11T23:48:48,269 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - Python runtime: 3.8.13
2022-10-11T23:48:48,269 [DEBUG] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-metrics_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2022-10-11T23:48:48,269 [INFO ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2022-10-11T23:48:48,271 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.9000.
2022-10-11T23:48:48,271 [INFO ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1665532128271
2022-10-11T23:48:48,293 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - model_name: metrics_model, batchSize: 1
2022-10-11T23:48:48,299 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - Successfully loaded /home/ubuntu/serve/frontend/server/src/test/resources/metrics_default.yaml.
2022-10-11T23:48:48,299 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - Backend worker process died.
2022-10-11T23:48:48,299 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2022-10-11T23:48:48,300 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_service_worker.py", line 223, in <module>
2022-10-11T23:48:48,300 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG -     worker.run_server()
2022-10-11T23:48:48,300 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_service_worker.py", line 191, in run_server
2022-10-11T23:48:48,300 [INFO ] epollEventLoopGroup-5-8 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2022-10-11T23:48:48,300 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG -     self.handle_connection(cl_socket)
2022-10-11T23:48:48,301 [DEBUG] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2022-10-11T23:48:48,301 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_service_worker.py", line 156, in handle_connection
2022-10-11T23:48:48,301 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG -     service, result, code = self.load_model(msg)
2022-10-11T23:48:48,301 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_service_worker.py", line 120, in load_model
2022-10-11T23:48:48,301 [DEBUG] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
2022-10-11T23:48:48,301 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG -     service = model_loader.load(
2022-10-11T23:48:48,301 [WARN ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: metrics_model, error: Worker died.
2022-10-11T23:48:48,302 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_loader.py", line 135, in load
2022-10-11T23:48:48,302 [DEBUG] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-metrics_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2022-10-11T23:48:48,302 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG -     entry_point, initialize_fn = self._get_class_entry_point(module)
2022-10-11T23:48:48,302 [WARN ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-metrics_model_1.0-stderr
2022-10-11T23:48:48,302 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG -   File "/opt/conda/envs/serve/lib/python3.8/site-packages/ts/model_loader.py", line 197, in _get_class_entry_point
2022-10-11T23:48:48,302 [WARN ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-metrics_model_1.0-stdout
2022-10-11T23:48:48,302 [INFO ] W-9000-metrics_model_1.0-stdout MODEL_LOG -     raise ValueError(
2022-10-11T23:48:48,302 [INFO ] W-9000-metrics_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 21 seconds.
2022-10-11T23:48:48,303 [INFO ] W-9000-metrics_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-metrics_model_1.0-stdout
2022-10-11T23:48:48,320 [INFO ] W-9000-metrics_model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-metrics_model_1.0-stderr

Collaborator:
According to the log line "Expect handle method in class <class 'metrics_model.CustomHandlerExample'>", the handle method seems to be missing from the example.
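For reference, a class-style handler needs a `handle` method as its entry point, since that is what the loader's `_get_class_entry_point` check in the log above is looking for. A minimal sketch (the method bodies are illustrative, not the example's actual logic):

```python
# Minimal sketch of a class-style handler with the `handle` entry point.
# Without `handle`, the worker dies with
# "Expect handle method in class ..." as shown in the log above.

class CustomHandlerExample:
    def __init__(self):
        self.initialized = False
        self.context = None

    def initialize(self, context):
        # one-time setup at model load
        self.initialized = True
        self.context = context

    def handle(self, data, context):
        # entry point the model loader expects
        if not self.initialized:
            self.initialize(context)
        return [str(d) for d in data]

handler = CustomHandlerExample()
assert handler.handle([1, 2], context=None) == ["1", "2"]
```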

# Emitting the metrics that have been updated to the frontend and
# then resetting the Metrics' values afterwards
emit_metrics(metrics.cache)
```
Member:
The doc is missing a section on Prometheus export.



### Updating Metrics parsed from the yaml file
Collaborator:
Why would a customer need to change a metric defined in metrics.yaml?

Resolved review threads: docs/metrics.md, ts/model_service_worker.py, ts/metrics/metric_cache_abstract.py, ts/metrics/metric.py
@maaquib maaquib mentioned this pull request Nov 8, 2022
maaquib (Collaborator) commented Nov 9, 2022:

Closing in favour of #1954

@maaquib maaquib closed this Nov 9, 2022