Results from re-run of longevity test #1401

Merged 2 commits on Dec 19, 2023
119 changes: 68 additions & 51 deletions tests/longevity/results/1.1.0/1.1.0.md
@@ -24,8 +24,8 @@
NGF version:

```text
commit: "21a2507d3d25ac0428384dce2c042799ed28b856"
date: "2023-12-06T23:47:17Z"
commit: "7e075c636f1a40e6dc849c6bb9ba288c9801bd4a"
date: "2023-12-13T17:45:06Z"
version: "edge"
```

@@ -53,12 +53,12 @@ wrk -t2 -c100 -d96h http://cafe.example.com/coffee
Running 5760m test @ http://cafe.example.com/coffee
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 182.30ms 146.48ms 2.00s 82.86%
Req/Sec 306.26 204.19 2.18k 65.75%
207104807 requests in 5760.00m, 72.17GB read
Socket errors: connect 0, read 362418, write 218736, timeout 19693
Requests/sec: 599.26
Transfer/sec: 218.97KB
Latency 177.87ms 143.16ms 2.00s 82.78%
Req/Sec 314.21 211.96 2.32k 65.78%
212474192 requests in 5760.00m, 74.08GB read
Socket errors: connect 0, read 346218, write 508394, timeout 10574
Requests/sec: 614.80
Transfer/sec: 224.78KB
```

HTTPS:
@@ -69,12 +69,12 @@ wrk -t2 -c100 -d96h https://cafe.example.com/tea
Running 5760m test @ https://cafe.example.com/tea
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 172.15ms 118.59ms 2.00s 68.16%
Req/Sec 305.26 203.43 2.33k 65.34%
206387831 requests in 5760.00m, 70.81GB read
Socket errors: connect 44059, read 356656, write 0, timeout 126
Requests/sec: 597.19
Transfer/sec: 214.84KB
Latency 168.63ms 118.44ms 2.00s 69.72%
Req/Sec 313.02 210.77 2.31k 65.50%
211667319 requests in 5760.00m, 72.65GB read
Socket errors: connect 80661, read 338673, write 0, timeout 147
Requests/sec: 612.46
Transfer/sec: 220.42KB
```
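
As a quick sanity check, the reported requests/sec figures can be reproduced from the raw totals above by dividing by
the 96-hour (345,600 s) test duration:

```console
# HTTP: 212474192 requests / (96 * 3600) s ~= 614.8 req/s, matching wrk's Requests/sec
python3 -c "print(212474192 / (96 * 3600))"
# HTTPS: 211667319 requests / (96 * 3600) s ~= 612.5 req/s
python3 -c "print(211667319 / (96 * 3600))"
```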

While there are socket errors in the output, there are no connection-related errors in NGINX logs.
@@ -86,28 +86,31 @@ Further investigation is out of scope of this test.

```text
resource.type="k8s_container"
resource.labels.cluster_name="ciara-1"
resource.labels.cluster_name="<CLUSTER_NAME>"
resource.labels.namespace_name="nginx-gateway"
resource.labels.container_name="nginx-gateway"
severity=ERROR
SEARCH("error")
```
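
For reference, roughly the same query can be run from the command line with the gcloud CLI. This is only a sketch: the
project ID and freshness window are placeholders, and the `SEARCH` function is swapped for a plain substring
restriction:

```console
gcloud logging read \
  'resource.type="k8s_container" AND
   resource.labels.cluster_name="<CLUSTER_NAME>" AND
   resource.labels.namespace_name="nginx-gateway" AND
   resource.labels.container_name="nginx-gateway" AND
   severity=ERROR AND "error"' \
  --project=<PROJECT_ID> --freshness=5d --format=json
```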

There were 36 error logs across 2 pod instances. They came in 2 almost identical batches, one on the first day of
running the test, after approximately 6 hours, and the second 2.5 days later. They were both relating to leader election
loss, and subsequent restart (see https://github.com/nginxinc/nginx-gateway-fabric/issues/1100).

Both error batches caused the pod to restart, but not terminate. However, the first pod was terminated about 10 minutes
after the first error batch and subsequent restart occurred. Exactly why this pod was terminated is not currently clear,
but it looks to be a cluster event (perhaps an upgrade) as the coffee and tea pods were terminated at that time also.
Strangely, there were 6 pod restarts in total of the second pod, but no other errors were observed over the test period
other than what was seen above, and grepping the logs for start-up logs only produced the 2 known restarts relating to
the leader election loss, plus initial start-up of both pods (4 in total).

```console
kubectl get pods -n nginx-gateway
NAME READY STATUS RESTARTS AGE
my-release-nginx-gateway-fabric-78d4b84447-4hss5 2/2 Running 6 (31h ago) 3d22h
```
There were 53 error logs, and 6 restarts, across 3 pod instances. The error logs came in 3 almost identical batches,
starting just over 24 hours after the initial deployment, with each subsequent batch arriving just over 24 hours after
the last. They all related to leader election loss and the subsequent restart
(see https://github.com/nginxinc/nginx-gateway-fabric/issues/1100). There were also 2 termination events, both of which
occurred approximately 5 minutes after a leader election loss and successful restart.

Each error batch caused the pod to restart, but not terminate. The first pod was terminated about 5 minutes after the
first error batch and subsequent restart occurred, and a similar occurrence happened after the third error batch; there
was no termination event after the second. Exactly why these pods were terminated is not currently clear, but it looks
to be a cluster event (perhaps an upgrade), as the coffee and tea pods were terminated at the same time. All the
restarts happened at roughly the same time each day.

```text
{"level":"info", "msg":"Starting manager", "ts":"2023-12-13T17:45:10Z"} -> Start-up
{"level":"info", "msg":"Starting manager", "ts":"2023-12-14T17:43:11Z"} -> First error batch
{"level":"info", "msg":"Starting manager", "ts":"2023-12-14T17:47:36Z"} -> First pod terminated, second pod start
{"level":"info", "msg":"Starting manager", "ts":"2023-12-15T17:47:22Z"} -> Second error batch
{"level":"info", "msg":"Starting manager", "ts":"2023-12-16T17:52:21Z"} -> Third error batch
{"level":"info", "msg":"Starting manager", "ts":"2023-12-16T17:58:01Z"} -> Second pod terminated, third pod start
```
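
The restart counts and the current leader can also be checked directly in the cluster; a sketch, assuming the
leader-election lease lives in the same `nginx-gateway` namespace as the Deployment:

```console
# RESTARTS column shows per-pod restart counts
kubectl get pods -n nginx-gateway
# HOLDER column shows which pod currently holds the leader-election lease
kubectl get leases -n nginx-gateway
```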

### NGINX
@@ -118,7 +121,7 @@ Errors:

```text
resource.type=k8s_container AND
resource.labels.pod_name="my-release-nginx-gateway-fabric-78d4b84447-4hss5" AND
resource.labels.cluster_name="<CLUSTER_NAME>" AND
resource.labels.container_name="nginx" AND
severity=ERROR AND
SEARCH("`[warn]`") OR SEARCH("`[error]`")
@@ -132,7 +135,7 @@ Non-200 response codes in NGINX access logs:

```text
resource.type=k8s_container AND
resource.labels.pod_name="my-release-nginx-gateway-fabric-78d4b84447-4hss5" AND
resource.labels.cluster_name="<CLUSTER_NAME>" AND
resource.labels.container_name="nginx"
"GET" "HTTP/1.1" -"200"
```
@@ -145,47 +148,61 @@ No such responses.

![memory.png](memory.png)

Memory usage dropped twice which appears to correspond with the restarts seen above relating to leader election.
Interestingly, before the first restart and after the second restart, memory usage sat at about 8.5MiB, but for the
majority of the test run, memory usage was about 9.5-10MiB. The previous release test run also had memory usage at
about 9-10MiB, but more stable usage across the duration of the test. However, there were no restarts observed in the
v1.0.0 test run. I don't think there is anything to investigate here.
Memory usage dropped once, which appears to correspond with a restart event relating to leader election seen above.
Memory did seem to creep up before this restart, and then usage remained consistent for the rest of the test. Memory
usage was slightly higher than what was observed in the v1.0.0 longevity test, but unless we see this trending upwards
in future longevity tests, I don't think there is anything to investigate here.

### Containers CPU

![cpu.png](cpu.png)

No unexpected spikes or drops.
There was one drop in NGINX CPU, which corresponds with a Pod termination event.

### NGINX metrics

In this test, NGINX metrics were not correctly exported so no dashboards are available for these.
![stub-status.png](stub-status.png)

The drop in _requests_ on Dec 16 corresponds to a Pod termination event. There are no correlated errors in the NGINX logs.

### Reloads

In this test, NGINX metrics were not correctly exported so no dashboards are available for these.
Rate of reloads - successful and errors:

![reloads.png](reloads.png)

Reload spikes correspond to 1-hour periods of backend re-rollouts. The smaller spikes correspond to periodic
reconciliation of Secrets, which (incorrectly) triggers a reload -- https://github.com/nginxinc/nginx-gateway-fabric/issues/1112

No reloads finished with an error.

Reload time distribution with 50th, 95th and 99th percentiles and the threshold:

![reload-time.png](reload-time.png)

Note - 60s is the threshold for waiting for NGINX to be reloaded.
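
The reload-rate and percentile panels above come from the NGF Prometheus metrics; queries along these lines reproduce
them (a sketch -- the Prometheus address and the 5m rate window are assumptions):

```console
# rate of successful NGINX reloads per second
promtool query instant http://<PROMETHEUS>:9090 \
  'rate(nginx_gateway_fabric_nginx_reloads_total[5m])'
# 95th percentile reload duration (milliseconds), computed from the histogram buckets
promtool query instant http://<PROMETHEUS>:9090 \
  'histogram_quantile(0.95, rate(nginx_gateway_fabric_nginx_reloads_milliseconds_bucket[5m]))'
```
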
Reload-related metrics at the end of the test:

```text
# HELP nginx_gateway_fabric_nginx_reloads_milliseconds Duration in milliseconds of NGINX reloads
# TYPE nginx_gateway_fabric_nginx_reloads_milliseconds histogram
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="500"} 1647
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="1000"} 4043
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="5000"} 4409
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="10000"} 4409
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="30000"} 4409
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="+Inf"} 4409
nginx_gateway_fabric_nginx_reloads_milliseconds_sum{class="nginx"} 2.701667e+06
nginx_gateway_fabric_nginx_reloads_milliseconds_count{class="nginx"} 4409
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="500"} 4626
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="1000"} 6831
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="5000"} 7317
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="10000"} 7317
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="30000"} 7317
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="+Inf"} 7317
nginx_gateway_fabric_nginx_reloads_milliseconds_sum{class="nginx"} 3.060948e+06
nginx_gateway_fabric_nginx_reloads_milliseconds_count{class="nginx"} 7317
# HELP nginx_gateway_fabric_nginx_reloads_total Number of successful NGINX reloads
# TYPE nginx_gateway_fabric_nginx_reloads_total counter
nginx_gateway_fabric_nginx_reloads_total{class="nginx"} 4409
nginx_gateway_fabric_nginx_reloads_total{class="nginx"} 7317
```

All successful reloads took less than 5 seconds, with most (>90%) under 1 second.
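
These figures follow directly from the histogram above -- a quick check of the fraction completing within 1 second and
the mean reload time:

```console
# fraction of reloads completing in <=1s (le="1000" bucket / total) ~= 93.4%
python3 -c "print(6831 / 7317)"
# mean reload time: sum / count ~= 418 ms
python3 -c "print(3060948 / 7317)"
```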

## Existing Issues still relevant

- NGF unnecessary reloads NGINX when it reconciles
Secrets - https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
- NGF unnecessarily reloads NGINX when it reconciles Secrets - https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
- Use NGF Logger in Client-Go Library - https://github.com/nginxinc/nginx-gateway-fabric/issues/1101
Binary file modified tests/longevity/results/1.1.0/cpu.png
Binary file modified tests/longevity/results/1.1.0/memory.png
Binary file added tests/longevity/results/1.1.0/reload-time.png
Binary file added tests/longevity/results/1.1.0/reloads.png
Binary file added tests/longevity/results/1.1.0/stub-status.png