
Metrics Docs Updates #2560

Merged
merged 13 commits into pytorch:master on Sep 15, 2023
Conversation

sekyondaMeta
Contributor

Updates to the metrics docs to make them easier to follow. Information has been re-organized and key points emphasized.

Fixes #2495

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Pages built locally to test them.

[Screenshot: metrics docs pages built locally — 2023-08-31, 11:49 AM]

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

Updates to metrics docs to make them easier to follow
Something weird with this link. It is possible the site itself rejects the lint check
@sekyondaMeta
Contributor Author

@namannandan & @agunapal Could you take a look at this? I want to make sure I did not change the meaning of the docs. I made some changes to the structure of the metrics docs to try to make them clearer, per issue #2495.

For details refer [Torchserve config](configuration.md) docs.

**Note** This is not to be confused with the [custom metrics API](metrics.md) which is the API used in the backend handler to emit metrics.
Member

Could make this clearer for folks. It doesn't have to fit into a short note either
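For readers following along, here is a rough sketch of what the backend custom metrics API referenced in that note looks like in practice. The handler class and metric name below are hypothetical, used only for illustration; only `context.metrics` and `add_counter` come from TorchServe:

```python
from ts.torch_handler.base_handler import BaseHandler


class ExampleHandler(BaseHandler):  # hypothetical handler, for illustration only
    def preprocess(self, data):
        # The backend custom metrics API is reached through the request context,
        # which BaseHandler stores on the instance during initialize().
        metrics = self.context.metrics
        # Emit a counter metric from inside the handler; the metric name is illustrative.
        metrics.add_counter("ExamplePreprocessCalls", 1)
        return data
```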

@codecov

codecov bot commented Aug 31, 2023

Codecov Report

Merging #2560 (aed33ec) into master (b04f6de) will not change coverage.
The diff coverage is n/a.

❗ Current head aed33ec differs from pull request most recent head 63fdd8e. Consider uploading reports for the commit 63fdd8e to get more accurate results

@@           Coverage Diff           @@
##           master    #2560   +/-   ##
=======================================
  Coverage   70.87%   70.87%           
=======================================
  Files          83       83           
  Lines        3839     3839           
  Branches       58       58           
=======================================
  Hits         2721     2721           
  Misses       1114     1114           
  Partials        4        4           


docs/metrics.md Outdated
Comment on lines 47 to 48
Dynamic updates between the frontend and backend are _not_ currently being handled.

Collaborator

Can we reword a little bit to express "For performance reasons, currently only metrics defined in the metrics configuration file can be published to Prometheus" (i.e. available via the metrics API endpoint)?

docs/metrics.md Outdated

## Frontend Metrics

Note that **only** the metrics defined in the **metrics configuration file** can be emitted to logs or made available via the [metrics API endpoint](metrics_api.md). This is done to ensure that the metrics configuration file serves as a central inventory of all the metrics that Torchserve can emit.
Collaborator

"can be emitted to logs" => "can be emitted to model_metrics.log", otherwise dumps to model_log.log

@sekyondaMeta
Contributor Author

@msaroufim @lxning My GitHub is acting up, but I made some updates based on your comments. Let me know if any other changes are needed.

@msaroufim
Member

@namannandan could you please review and merge this?

docs/metrics.md Outdated

TorchServe defines metrics in a [yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) file, including both frontend metrics (i.e. `ts_metrics`) and backend metrics (i.e. `model_metrics`).
When TorchServe is started, the metrics definition is loaded in the frontend and backend cache separately.
The backend flushes the metrics cache once a load model or inference request is completed.
Collaborator

This line can be reworded as follows:

The backend emits metrics logs as and when they are updated. The frontend parses these logs and makes the corresponding metrics available either as logs or via the metrics API endpoint based on the metrics_mode configuration.

docs/metrics.md Outdated
When TorchServe is started, the metrics definition is loaded in the frontend and backend cache separately.
The backend flushes the metrics cache once a load model or inference request is completed.

Dynamic updates between the frontend and backend are _not_ currently being handled.
Collaborator

This line can be reworded as follows:

Dynamic updates to the metrics configuration file are currently not supported. In order to take updates made to the metrics configuration file into account, TorchServe will need to be restarted.

docs/metrics.md Outdated

Note that **only** the metrics defined in the **metrics configuration file** can be emitted to model_metrics.log or made available via the [metrics API endpoint](metrics_api.md). This is done to ensure that the metrics configuration file serves as a central inventory of all the metrics that Torchserve can emit.

Default metrics are provided in the [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) file, but the user can either delete them to their liking / ignore them altogether, because these metrics will not be emitted unless they are edited.\
Collaborator

Nit:
..... because these metrics will not be emitted unless they are updated.
instead of
..... because these metrics will not be emitted unless they are edited.

docs/metrics.md Outdated

### Starting TorchServe Metrics

Whenever torchserve starts, the [backend worker](https://github.com/pytorch/serve/blob/master/ts/model_service_worker.py) initializes `service.context.metrics` with the [MetricsCache](https://github.com/pytorch/serve/blob/master/ts/metrics/metric_cache_yaml_impl.py) object. The `model_metrics` (backend metrics) section within the specified yaml file will be parsed, and Metric objects will be created based on the parsed section and added that are added to the cache.
Collaborator

Nit: ..... will be created based on the parsed section and added to the cache.

docs/metrics.md Outdated
Comment on lines 273 to 274
metric1 = metrics.add_metric("GenericMetric", unit=unit, dimension_names=["name1", "name2", ...], metric_type=MetricTypes.GAUGE)
metric.add_or_update(value, dimension_values=["value1", "value2", ...])
Collaborator

Replace the above two lines with the following:

metrics.add_metric("GenericMetric", value, unit=unit, dimension_names=["name1", "name2", ...], metric_type=MetricTypes.GAUGE)
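If it helps reviewers, here is a usage sketch based on the call shape suggested above. The metric name, value, unit, and dimension names are placeholders, and the argument names simply mirror the suggestion in this comment rather than a confirmed final API:

```python
from ts.metrics.metric_type_enum import MetricTypes


def emit_generic_metric(metrics, value):
    # `metrics` is the handler's `self.context.metrics` cache.
    # Single-call form suggested in the review comment above: define and update
    # the gauge metric in one step instead of add_metric + add_or_update.
    metrics.add_metric(
        "GenericMetric",                          # placeholder metric name
        value,
        unit="count",                             # placeholder unit
        dimension_names=["model_name", "level"],  # placeholder dimension names
        metric_type=MetricTypes.GAUGE,
    )
```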

docs/metrics.md Outdated
@@ -317,35 +338,31 @@ dimN= Dimension(name_n, value_n)

One can add metrics with generic units using the following function.

#### Function API to add generic metrics without default dimensions
Function API
Collaborator

Please retain this section about add_metric_to_cache method since it is part of the API.

docs/metrics.md Outdated
@@ -370,52 +387,10 @@ One can add metrics with generic units using the following function.
# Add Distance as a metric
# dimensions = [dim1, dim2, dim3, ..., dimN]
# Assuming batch size is 1 for example
metric = metrics.add_metric_to_cache('DistanceInKM', unit='km', dimension_names=[...])
metric = metrics.add_metric('DistanceInKM', unit='km', dimension_names=[...])
metric.add_or_update(distance, dimension_values=[...])
```

Collaborator

Please retain the following section on add_metric.

docs/metrics.md Outdated
@@ -425,15 +400,15 @@ Add time-based by invoking the following method:
Function API

```python
def add_time(self, name: str, value: int or float, idx=None, unit: str = 'ms', dimensions: list = None,
def add_time(self, metric_name: str, value: int or float, idx=None, unit: str = 'ms', dimensions: list = None,
Collaborator

The argument name defined in the source code is `name` and not `metric_name`.
The same applies to other functions of the API below.
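For completeness, a small usage sketch of `add_time` consistent with the signature quoted above; the function, model, batch, and metric name here are placeholders:

```python
import time


def timed_inference(model, batch, metrics):
    # `metrics` is the handler's `self.context.metrics`; `model` and `batch`
    # stand in for whatever the handler is serving.
    start = time.time()
    output = model(batch)
    elapsed_ms = (time.time() - start) * 1000
    # The first positional argument is `name`, matching the signature above.
    metrics.add_time("ExampleInferenceTime", elapsed_ms, unit="ms")
    return output
```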

docs/metrics.md Outdated
metric_type: MetricTypes = MetricTypes.GAUGE):
"""
Add a time based metric like latency, default unit is 'ms'
Default metric type is gauge

Parameters
----------
name : str
metric_name : str
Collaborator

Same as above

Contributor Author

It looks like for these changes, I might have been using a slightly older metrics.md before you made your changes. I have updated this.

@@ -1,10 +1,13 @@
# Metrics API

Metrics API is listening on port 8082 and only accessible from localhost by default. To change the default setting, see [TorchServe Configuration](configuration.md). The default metrics endpoint returns Prometheus formatted metrics when [metrics_mode](https://github.com/pytorch/serve/blob/master/docs/metrics.md) configuration is set to `prometheus`. You can query metrics using curl requests or point a [Prometheus Server](#prometheus-server) to the endpoint and use [Grafana](#grafana) for dashboards.
Metrics API is a http API that is used to fetch metrics in the prometheus format. It is listening on port 8082 and only accessible from localhost by default. To change the default setting, see [TorchServe Configuration](configuration.md). The metrics endpoint is on by default and returns Prometheus formatted metrics when [metrics_mode](https://github.com/pytorch/serve/blob/master/docs/metrics.md) configuration is set to `prometheus`. You can query metrics using curl requests or point a [Prometheus Server](#prometheus-server) to the endpoint and use [Grafana](#grafana) for dashboards.
Collaborator

Nit: ..... The metrics endpoint is enabled by default and .....
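As a quick sanity check of the endpoint described above, one could fetch the Prometheus-formatted metrics from a locally running TorchServe instance like this (assuming the default metrics address and port, and `metrics_mode` set to `prometheus`):

```python
import requests

# Query the TorchServe metrics API endpoint (default: localhost:8082).
response = requests.get("http://127.0.0.1:8082/metrics")
response.raise_for_status()
print(response.text)  # Prometheus-formatted metrics
```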

@sekyondaMeta
Contributor Author

@namannandan Thanks for the feedback. I have made the updates. Let me know if any other updates are needed.

@msaroufim msaroufim self-requested a review September 15, 2023 19:06
@namannandan namannandan added this pull request to the merge queue Sep 15, 2023
Merged via the queue into pytorch:master with commit b3eced5 Sep 15, 2023
10 of 12 checks passed