When using Flagger, my understanding is that it determines service health from metrics collected by Prometheus. Following the documentation, I set `metric.name` to `istio_requests_total` and `istio_request_duration_seconds_bucket`. However, during the automated rollback test, the canary analysis was still incorrectly reported as passing even after I injected faults.
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 60
  service:
    # service port number
    port: 9898
    # container port number or name (optional)
    targetPort: 9898
    # Istio gateways (optional)
    gateways:
    - istio-system/public-gateway
    # Istio virtual service host names (optional)
    # Istio traffic policy (optional)
    trafficPolicy:
      tls:
        # use ISTIO_MUTUAL when mTLS is enabled
        mode: DISABLE
    # Istio retry policy (optional)
    retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: "gateway-error,connect-failure,refused-stream"
  analysis:
    # schedule interval (default 60s)
    interval: 90s
    # max number of failed metric checks before rollback
    threshold: 5
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 10
    metrics:
    - name: istio_requests_total
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      thresholdRange:
        min: 99
      interval: 90s
    - name: istio_request_duration_seconds_bucket
      # maximum req duration P99
      # milliseconds
      thresholdRange:
        max: 500
      interval: 90s
    - name: abc
      thresholdRange:
        min: 10
      interval: 90s
    # testing (optional)
    webhooks:
    - name: acceptance-test
      type: pre-rollout
      url: http://flagger-loadtester.test/
      timeout: 30s
      metadata:
        type: bash
        cmd: "curl -sd 'test' http://podinfo-canary:9898/token | grep token"
    - name: load-test
      url: http://flagger-loadtester.test/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test:9898/"
```
Here's how I did it.
During the canary analysis, I generated HTTP 500 errors and high latency to test Flagger's rollback.
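Roughly as in the Flagger Istio tutorial, along these lines (podinfo exposes `/status/{code}` and `/delay/{seconds}` endpoints):

```sh
# exec into the load tester pod
kubectl -n test exec -it deploy/flagger-loadtester -- sh

# from inside the pod: generate HTTP 500 errors against the canary
watch curl http://podinfo-canary:9898/status/500

# and high latency
watch curl http://podinfo-canary:9898/delay/1
```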
The canary deployment itself was triggered by updating the container image.
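For example (the image tag here is only illustrative; any new tag kicks off the analysis):

```sh
kubectl -n test set image deployment/podinfo \
  podinfod=ghcr.io/stefanprodan/podinfo:6.0.1
```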
The results displayed in the logs are as follows:
```
k logs -n istio-system flagger-ff76bfdff-nf58z --tail=20
{"level":"info","ts":"2024-01-17T04:32:44.701Z","caller":"controller/controller.go:307","msg":"Synced test/podinfo"}
{"level":"info","ts":"2024-01-17T04:34:36.925Z","caller":"controller/events.go:33","msg":"New revision detected! Scaling up podinfo.test","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:36:06.914Z","caller":"controller/events.go:33","msg":"Starting canary analysis for podinfo.test","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:36:06.930Z","caller":"controller/events.go:33","msg":"Pre-rollout check acceptance-test passed","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:36:06.947Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 10","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:37:36.933Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 20","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:39:06.935Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 30","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:40:36.931Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 40","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:42:06.932Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 50","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:43:36.917Z","caller":"controller/events.go:33","msg":"Copying podinfo.test template spec to podinfo-primary.test","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:45:06.932Z","caller":"controller/events.go:33","msg":"Routing all traffic to primary","canary":"podinfo.test"}
{"level":"info","ts":"2024-01-17T04:46:36.930Z","caller":"controller/events.go:33","msg":"Promotion completed! Scaling down podinfo.test","canary":"podinfo.test"}
```
As shown in scheduler_metrics.go, if `metric.name` is not set to `request-success-rate` or `request-duration` and `metric.TemplateRef` is empty, any arbitrary metric name will always pass. Shouldn't Flagger implement some validation checks and warnings? When a metric is not a builtin, it should try to look up the metric template by name and namespace and report an error to the user if none is found.
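To illustrate, here is a hypothetical sketch of the kind of guard I have in mind; `CanaryMetric`, `TemplateRef`, and `validateMetric` are illustrative stand-ins, not Flagger's actual internals:

```go
package main

import "fmt"

// TemplateRef and CanaryMetric are minimal stand-ins for Flagger's
// metric spec; the real types live in the Flagger API packages.
type TemplateRef struct {
	Name      string
	Namespace string
}

type CanaryMetric struct {
	Name        string
	TemplateRef *TemplateRef
}

// validateMetric rejects a metric that is neither a builtin nor backed
// by a template, instead of silently treating its check as passed.
func validateMetric(m CanaryMetric) error {
	switch m.Name {
	case "request-success-rate", "request-duration":
		return nil // builtin metrics handled by Flagger itself
	}
	if m.TemplateRef == nil {
		return fmt.Errorf(
			"metric %q is not a builtin and has no templateRef; failing the analysis instead of skipping the check",
			m.Name)
	}
	// a real implementation would also look up the referenced
	// MetricTemplate by name and namespace and error if it is missing
	return nil
}

func main() {
	// the bogus "abc" metric from the Canary spec above would be rejected
	if err := validateMetric(CanaryMetric{Name: "abc"}); err != nil {
		fmt.Println(err)
	}
}
```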