If Builtin Metric and metric.TemplateRef are not used, will the tests always incorrectly report as passing? #1581

LiZhenCheng9527 · 2024-01-17T06:24:38Z

When using Flagger, I understood that it actually determines the service health status based on data collected by Prometheus. I set the metric.Name to istio_requests_total and istio_request_duration_seconds_bucket according to the documentation. However, during the automated rollback test, even injecting a faulty command, the test was still incorrectly reported as passing.

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rollback (default 600s)
  progressDeadlineSeconds: 60
  service:
    # service port number
    port: 9898
    # container port number or name (optional)
    targetPort: 9898
    # Istio gateways (optional)
    gateways:
    - istio-system/public-gateway
    # Istio virtual service host names (optional)
    # Istio traffic policy (optional)
    trafficPolicy:
      tls:
        # use ISTIO_MUTUAL when mTLS is enabled
        mode: DISABLE
    # Istio retry policy (optional)
    retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: "gateway-error,connect-failure,refused-stream"
  analysis:
    # schedule interval (default 60s)
    interval: 90s
    # max number of failed metric checks before rollback
    threshold: 5
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 10
    metrics:
    - name: istio_requests_total
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      thresholdRange:
        min: 99
      interval: 90s
    - name: istio_request_duration_seconds_bucket
      # maximum req duration P99
      # milliseconds
      thresholdRange:
        max: 500
      interval: 90s
    - name: abc
      thresholdRange:
        min: 10
      interval: 90s
    # testing (optional)
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary:9898/token | grep token"
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test:9898/"

Here's how I did it.

During the canary analysis I generate HTTP 500 errors and high latency to test Flagger's rollback.
Trigger a canary deployment by updating the container image.

The results displayed in the logs are as follows:

k logs -n istio-system flagger-ff76bfdff-nf58z --tail=20

{"level":"info","ts":"2024-01-17T04:32:44.701Z","caller":"controller/controller.go:307","msg":"Synced test/podinfo"}
{"level":"info","ts":"2024-01-17T04:34:36.925Z","caller":"controller/events.go:33","msg":"New revision detected! Scaling up podinfo.test","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:36:06.914Z","caller":"controller/events.go:33","msg":"Starting canary analysis for podinfo.test","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:36:06.930Z","caller":"controller/events.go:33","msg":"Pre-rollout check acceptance-test passed","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:36:06.947Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 10","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:37:36.933Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 20","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:39:06.935Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 30","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:40:36.931Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 40","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:42:06.932Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 50","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:43:36.917Z","caller":"controller/events.go:33","msg":"Copying podinfo.test template spec to podinfo-primary.test","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:45:06.932Z","caller":"controller/events.go:33","msg":"Routing all traffic to primary","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:46:36.930Z","caller":"controller/events.go:33","msg":"Promotion completed! Scaling down podinfo.test","canary":"podinfo.test"}

As show in scheduler_metrics.go, if metric.name is not set to request-success-rate or request-duration, and metric.TemplateRef is empty, any random metric name will always return true. Shouldn't Flagger implement some validation checks and warnings? When not using a BuiltinMetric, it should try to get the metric by name and namespace, and present an error to the user if not found.

The text was updated successfully, but these errors were encountered:

stefanprodan · 2024-01-17T07:21:38Z

This looks like a regression bug, indeed Flagger should return an error for missing metric templates and count that towards the failure threshold.

stefanprodan added the kind/bug Something isn't working label Jan 17, 2024

LiZhenCheng9527 mentioned this issue Jan 17, 2024

return an error for missing metric templates #1582

Merged

stefanprodan closed this as completed in #1582 Jan 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

If Builtin Metric and metric.TemplateRef are not used, will the tests always incorrectly report as passing? #1581

If Builtin Metric and metric.TemplateRef are not used, will the tests always incorrectly report as passing? #1581

LiZhenCheng9527 commented Jan 17, 2024

stefanprodan commented Jan 17, 2024

If Builtin Metric and metric.TemplateRef are not used, will the tests always incorrectly report as passing? #1581

If Builtin Metric and metric.TemplateRef are not used, will the tests always incorrectly report as passing? #1581

Comments

LiZhenCheng9527 commented Jan 17, 2024

stefanprodan commented Jan 17, 2024