Metrics and Monitoring

benjchristensen edited this page Dec 11, 2012 · 15 revisions

Metrics are captured in rolling windows by the HystrixRollingNumber and HystrixRollingPercentile classes. The rolling windows provide low-latency moving views of the metrics, which are used for circuit breaker health checks and for operations.
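To make the rolling-window idea concrete, here is a minimal sketch of a bucketed rolling counter. It is not the HystrixRollingNumber implementation; the class name `RollingCounterSketch`, its methods, and the choice to pass timestamps in explicitly (so the example is deterministic) are all assumptions made for illustration.

```java
import java.util.Arrays;

/**
 * Illustrative only: a rolling counter split into fixed-size time buckets.
 * NOT Hystrix code. Timestamps are non-negative milliseconds supplied by the caller.
 */
class RollingCounterSketch {
    private final long bucketMillis;
    private final long[] counts;      // one slot per bucket, reused as a ring
    private final long[] bucketStart; // start time of the bucket currently held in each slot
    private long cumulative;          // total since creation, never reset

    RollingCounterSketch(long windowMillis, int numBuckets) {
        if (numBuckets <= 0 || windowMillis <= 0 || windowMillis % numBuckets != 0) {
            throw new IllegalArgumentException("window must divide evenly into buckets");
        }
        this.bucketMillis = windowMillis / numBuckets;
        this.counts = new long[numBuckets];
        this.bucketStart = new long[numBuckets];
        Arrays.fill(bucketStart, Long.MIN_VALUE); // marks slots that were never used
    }

    synchronized void increment(long nowMillis) {
        long start = nowMillis - (nowMillis % bucketMillis);
        int slot = (int) ((nowMillis / bucketMillis) % counts.length);
        if (bucketStart[slot] != start) { // slot still holds an expired bucket: recycle it
            bucketStart[slot] = start;
            counts[slot] = 0;
        }
        counts[slot]++;
        cumulative++;
    }

    /** Sum of all buckets that still fall inside the window ending at nowMillis. */
    synchronized long rollingSum(long nowMillis) {
        long currentStart = nowMillis - (nowMillis % bucketMillis);
        long oldestStart = currentStart - bucketMillis * (counts.length - 1);
        long sum = 0;
        for (int i = 0; i < counts.length; i++) {
            if (bucketStart[i] >= oldestStart && bucketStart[i] <= currentStart) {
                sum += counts[i];
            }
        }
        return sum;
    }

    synchronized long cumulativeSum() {
        return cumulative;
    }

    public static void main(String[] args) {
        RollingCounterSketch c = new RollingCounterSketch(10_000, 10); // 10 s window, 1 s buckets
        c.increment(0);
        c.increment(500);
        c.increment(1_500);
        System.out.println("rolling at t=1.5s: " + c.rollingSum(1_500));
        System.out.println("rolling at t=10s:  " + c.rollingSum(10_000));
        System.out.println("cumulative:        " + c.cumulativeSum());
    }
}
```

Reading a rolling value only sums a handful of buckets, which is why this kind of structure suits frequent health checks.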

Direct Access

You can get direct programmatic access to metrics like this:

HystrixCommandMetrics.getInstances()
HystrixThreadPoolMetrics.getInstances()

Metrics Event Stream

The hystrix-metrics-event-stream can be used to power the dashboard, real-time alerting, and similar use cases.

Metrics Publisher

Metrics can be published by an implementation of HystrixMetricsPublisher.

Implementations can be registered using [HystrixPlugins.registerMetricsPublisher(HystrixMetricsPublisher impl)](http://netflix.github.com/Hystrix/javadoc/index.html?com/netflix/hystrix/strategy/HystrixPlugins.html#registerMetricsPublisher(com.netflix.hystrix.strategy.metrics.HystrixMetricsPublisher\)).

Implementations included with the project are:

The following sections describe the metrics published by these implementations:

Command Metrics

Each HystrixCommand publishes metrics with the following tags:

  • Servo Tag: "instance" Value: HystrixCommandKey.name()
  • Servo Tag: "type" Value: "HystrixCommand"

Informational and Status

  • Boolean isCircuitBreakerOpen
  • Number errorPercentage
  • Number executionSemaphorePermitsInUse
  • String commandGroup
  • Number currentTime

Cumulative Counts (Counter)

The following are cumulative counts since the start of the application.

  • Long countCollapsedRequests
  • Long countExceptionsThrown
  • Long countFailure
  • Long countFallbackFailure
  • Long countFallbackRejection
  • Long countFallbackSuccess
  • Long countResponsesFromCache
  • Long countSemaphoreRejected
  • Long countShortCircuited
  • Long countSuccess
  • Long countThreadPoolRejected
  • Long countTimeout

Rolling Counts (Gauge)

The following are rolling counts as configured by [[metrics.rollingStats.* properties|Configuration]].

These are "point in time" counts representing the last X seconds (for example 10 seconds).

  • Number rollingCountCollapsedRequests
  • Number rollingCountExceptionsThrown
  • Number rollingCountFailure
  • Number rollingCountFallbackFailure
  • Number rollingCountFallbackRejection
  • Number rollingCountFallbackSuccess
  • Number rollingCountResponsesFromCache
  • Number rollingCountSemaphoreRejected
  • Number rollingCountShortCircuited
  • Number rollingCountSuccess
  • Number rollingCountThreadPoolRejected
  • Number rollingCountTimeout
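As an illustration of how rolling counts like these could feed the errorPercentage status metric, here is a minimal sketch. Which counts are treated as errors, the integer truncation, and the class and method names are assumptions for the example, not a statement of how Hystrix computes the value.

```java
/**
 * Illustrative only: derives an error percentage from rolling counts.
 * ASSUMPTION: failures, timeouts, rejections and short-circuits count as errors;
 * successes are the only non-error outcome. NOT Hystrix code.
 */
class ErrorPercentageSketch {

    /** Returns the share of requests in the window that were errors, as a whole number 0-100. */
    static int errorPercentage(long success, long failure, long timeout,
                               long threadPoolRejected, long semaphoreRejected,
                               long shortCircuited) {
        long errors = failure + timeout + threadPoolRejected + semaphoreRejected + shortCircuited;
        long total = success + errors;
        if (total == 0) {
            return 0; // no traffic in the window: report 0 instead of dividing by zero
        }
        return (int) (errors * 100 / total); // integer division truncates toward zero
    }

    public static void main(String[] args) {
        // Hypothetical window: 90 successes, 5 failures, 3 timeouts, 1 rejection, 1 short-circuit.
        System.out.println(errorPercentage(90, 5, 3, 1, 0, 1) + "%");
    }
}
```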

Latency Percentiles: HystrixCommand.run() Execution (Gauge)

Percentiles of execution times for the [HystrixCommand.run()](http://netflix.github.com/Hystrix/javadoc/index.html?com/netflix/hystrix/HystrixCommand.html#run(\)) method (on the child thread if using thread isolation).

These are rolling percentiles as configured by [[metrics.rollingPercentile.* properties|Configuration]].

  • Number latencyExecute_mean
  • Number latencyExecute_percentile_5
  • Number latencyExecute_percentile_25
  • Number latencyExecute_percentile_50
  • Number latencyExecute_percentile_75
  • Number latencyExecute_percentile_90
  • Number latencyExecute_percentile_99
  • Number latencyExecute_percentile_995
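The suffix in each metric name is the percentile, with `_995` read as the 99.5th percentile. The sketch below shows one way such values can be computed from a window of latency samples. It uses the simple nearest-rank method, which is an assumption for illustration and may differ from what HystrixRollingPercentile does; the class and method names are hypothetical.

```java
import java.util.Arrays;

/**
 * Illustrative only: nearest-rank percentiles over a set of latency samples.
 * NOT the HystrixRollingPercentile algorithm.
 */
class PercentileSketch {

    /** Returns the smallest sample such that at least `percentile` percent of samples are <= it. */
    static int percentile(int[] samplesMillis, double percentile) {
        if (percentile <= 0 || percentile > 100) {
            throw new IllegalArgumentException("percentile must be in (0, 100]");
        }
        if (samplesMillis.length == 0) {
            return 0; // nothing recorded in the window
        }
        int[] sorted = samplesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile * sorted.length / 100.0); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    static double mean(int[] samplesMillis) {
        if (samplesMillis.length == 0) {
            return 0;
        }
        long sum = 0;
        for (int s : samplesMillis) {
            sum += s;
        }
        return (double) sum / samplesMillis.length;
    }

    public static void main(String[] args) {
        int[] latencies = {12, 15, 11, 240, 14, 13, 18, 16, 12, 90}; // made-up samples in ms
        System.out.println("mean = " + mean(latencies));
        System.out.println("p50  = " + percentile(latencies, 50));
        System.out.println("p90  = " + percentile(latencies, 90));
        System.out.println("p995 = " + percentile(latencies, 99.5));
    }
}
```

The high percentiles are usually the interesting ones: a few slow calls barely move the mean but dominate the 99th and 99.5th percentiles.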

Latency Percentiles: End-to-End Execution (Gauge)

Percentiles of execution times for the end-to-end execution of [HystrixCommand.execute()](http://netflix.github.com/Hystrix/javadoc/index.html?com/netflix/hystrix/HystrixCommand.html#execute(\)) or [HystrixCommand.queue()](http://netflix.github.com/Hystrix/javadoc/index.html?com/netflix/hystrix/HystrixCommand.html#queue(\)) until a response is returned (or is ready to return, in the case of queue()).

Comparing these with the latencyExecute* percentiles shows the cost of thread queuing, scheduling, and execution, of semaphores and circuit breaker logic, and of other overhead (including metrics capture itself).

These are rolling percentiles as configured by [[metrics.rollingPercentile.* properties|Configuration]].

  • Number latencyTotal_mean
  • Number latencyTotal_percentile_5
  • Number latencyTotal_percentile_25
  • Number latencyTotal_percentile_50
  • Number latencyTotal_percentile_75
  • Number latencyTotal_percentile_90
  • Number latencyTotal_percentile_99
  • Number latencyTotal_percentile_995
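The relationship between the two latency families can be sketched with a plain executor: time the work itself on the worker thread, time the whole call from the caller's side, and treat the difference as overhead. This uses only java.util.concurrent and is not Hystrix code; the class and method names are hypothetical.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Illustrative only: separates "execute" latency (the work itself) from
 * "total" latency (submit, queue, schedule, run, hand back). NOT Hystrix code.
 */
class LatencyOverheadSketch {

    /** Returns {executeNanos, totalNanos} for one task run on the given pool. */
    static long[] measure(ExecutorService pool, Runnable work) {
        long totalStart = System.nanoTime();
        Future<Long> result = pool.submit(() -> {
            long runStart = System.nanoTime(); // taken on the worker thread
            work.run();
            return System.nanoTime() - runStart;
        });
        try {
            long executeNanos = result.get();
            long totalNanos = System.nanoTime() - totalStart; // taken on the calling thread
            return new long[] {executeNanos, totalNanos};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            long[] r = measure(pool, () -> {
                long sum = 0;
                for (int i = 0; i < 1_000_000; i++) sum += i; // stand-in for real work
                if (sum < 0) System.out.println("unreachable");
            });
            System.out.println("execute ns:  " + r[0]);
            System.out.println("total ns:    " + r[1]);
            System.out.println("overhead ns: " + (r[1] - r[0]));
        } finally {
            pool.shutdown();
        }
    }
}
```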

Property Values (Informational)

These informational metrics report the actual property values being used by the HystrixCommand. This is useful for seeing when a dynamic property change takes effect and for confirming that a property is set as expected.

  • Number propertyValue_rollingStatisticalWindowInMilliseconds
  • Number propertyValue_circuitBreakerRequestVolumeThreshold
  • Number propertyValue_circuitBreakerSleepWindowInMilliseconds
  • Number propertyValue_circuitBreakerErrorThresholdPercentage
  • Boolean propertyValue_circuitBreakerForceOpen
  • Boolean propertyValue_circuitBreakerForceClosed
  • Number propertyValue_executionIsolationThreadTimeoutInMilliseconds
  • String propertyValue_executionIsolationStrategy
  • Boolean propertyValue_metricsRollingPercentileEnabled
  • Boolean propertyValue_requestCacheEnabled
  • Boolean propertyValue_requestLogEnabled
  • Number propertyValue_executionIsolationSemaphoreMaxConcurrentRequests
  • Number propertyValue_fallbackIsolationSemaphoreMaxConcurrentRequests

ThreadPool Metrics

Each HystrixThreadPool publishes metrics with the following tags:

  • Servo Tag: "instance" Value: HystrixThreadPoolKey.name()
  • Servo Tag: "type" Value: "HystrixThreadPool"

Informational and Status

  • String name
  • Number currentTime

Rolling Counts (Gauge)

  • Number rollingMaxActiveThreads
  • Number rollingCountThreadsExecuted

Cumulative Counts (Counter)

  • Long countThreadsExecuted

ThreadPool State (Gauge)

  • Number threadActiveCount
  • Number completedTaskCount
  • Number largestPoolSize
  • Number totalTaskCount
  • Number queueSize
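Gauges like these line up naturally with the accessors on java.util.concurrent.ThreadPoolExecutor. The sketch below takes such a snapshot from a plain executor. The mapping from metric name to accessor is an assumption made for illustration, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * Illustrative only: reads thread pool state from a standard ThreadPoolExecutor.
 * ASSUMPTION: each metric name maps to the accessor shown. NOT Hystrix code.
 */
class ThreadPoolStateSketch {

    static Map<String, Number> snapshot(ThreadPoolExecutor pool) {
        Map<String, Number> state = new LinkedHashMap<>();
        state.put("threadActiveCount", pool.getActiveCount());
        state.put("completedTaskCount", pool.getCompletedTaskCount());
        state.put("largestPoolSize", pool.getLargestPoolSize());
        state.put("totalTaskCount", pool.getTaskCount());
        state.put("queueSize", pool.getQueue().size());
        return state;
    }

    /** Runs three no-op tasks on a two-thread pool, waits for them, and returns the snapshot. */
    static Map<String, Number> demo() {
        ThreadPoolExecutor pool =
                new ThreadPoolExecutor(2, 2, 1, TimeUnit.MINUTES, new LinkedBlockingQueue<>());
        for (int i = 0; i < 3; i++) {
            pool.execute(() -> { });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("pool did not terminate in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return snapshot(pool);
    }

    public static void main(String[] args) {
        demo().forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

On a live pool these values change while they are being read, so a snapshot is best treated as approximate.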

Property Values (Informational)

  • Number propertyValue_corePoolSize
  • Number propertyValue_keepAliveTimeInMinutes
  • Number propertyValue_queueSizeRejectionThreshold
  • Number propertyValue_maxQueueSize