Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

If Builtin Metric and metric.TemplateRef are not used, will the tests always incorrectly report as passing? #1581

Closed
LiZhenCheng9527 opened this issue Jan 17, 2024 · 1 comment · Fixed by #1582
Labels
kind/bug Something isn't working

Comments

@LiZhenCheng9527
Copy link
Contributor

When using Flagger, I understood that it actually determines the service health status based on data collected by Prometheus. I set the metric.Name to istio_requests_total and istio_request_duration_seconds_bucket according to the documentation. However, during the automated rollback test, even injecting a faulty command, the test was still incorrectly reported as passing.

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rollback (default 600s)
  progressDeadlineSeconds: 60
  service:
    # service port number
    port: 9898
    # container port number or name (optional)
    targetPort: 9898
    # Istio gateways (optional)
    gateways:
    - istio-system/public-gateway
    # Istio virtual service host names (optional)
    # Istio traffic policy (optional)
    trafficPolicy:
      tls:
        # use ISTIO_MUTUAL when mTLS is enabled
        mode: DISABLE
    # Istio retry policy (optional)
    retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: "gateway-error,connect-failure,refused-stream"
  analysis:
    # schedule interval (default 60s)
    interval: 90s
    # max number of failed metric checks before rollback
    threshold: 5
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 10
    metrics:
    - name: istio_requests_total
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      thresholdRange:
        min: 99
      interval: 90s
    - name: istio_request_duration_seconds_bucket
      # maximum req duration P99
      # milliseconds
      thresholdRange:
        max: 500
      interval: 90s
    - name: abc
      thresholdRange:
        min: 10
      interval: 90s
    # testing (optional)
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary:9898/token | grep token"
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test:9898/"

Here's how I did it.

  • During the canary analysis I generate HTTP 500 errors and high latency to test Flagger's rollback.
  • Trigger a canary deployment by updating the container image.

The results displayed in the logs are as follows:

k logs -n istio-system flagger-ff76bfdff-nf58z --tail=20

{"level":"info","ts":"2024-01-17T04:32:44.701Z","caller":"controller/controller.go:307","msg":"Synced test/podinfo"}
{"level":"info","ts":"2024-01-17T04:34:36.925Z","caller":"controller/events.go:33","msg":"New revision detected! Scaling up podinfo.test","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:36:06.914Z","caller":"controller/events.go:33","msg":"Starting canary analysis for podinfo.test","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:36:06.930Z","caller":"controller/events.go:33","msg":"Pre-rollout check acceptance-test passed","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:36:06.947Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 10","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:37:36.933Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 20","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:39:06.935Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 30","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:40:36.931Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 40","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:42:06.932Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 50","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:43:36.917Z","caller":"controller/events.go:33","msg":"Copying podinfo.test template spec to podinfo-primary.test","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:45:06.932Z","caller":"controller/events.go:33","msg":"Routing all traffic to primary","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:46:36.930Z","caller":"controller/events.go:33","msg":"Promotion completed! Scaling down podinfo.test","canary":"podinfo.test"}

As show in scheduler_metrics.go, if metric.name is not set to request-success-rate or request-duration, and metric.TemplateRef is empty, any random metric name will always return true. Shouldn't Flagger implement some validation checks and warnings? When not using a BuiltinMetric, it should try to get the metric by name and namespace, and present an error to the user if not found.

@stefanprodan
Copy link
Member

This looks like a regression bug, indeed Flagger should return an error for missing metric templates and count that towards the failure threshold.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants