Results from re-run of longevity test #1401

Merged 2 commits on Dec 19, 2023
119 changes: 68 additions & 51 deletions tests/longevity/results/1.1.0/1.1.0.md
@@ -24,8 +24,8 @@
NGF version:

```text
commit: "21a2507d3d25ac0428384dce2c042799ed28b856"
date: "2023-12-06T23:47:17Z"
commit: "7e075c636f1a40e6dc849c6bb9ba288c9801bd4a"
date: "2023-12-13T17:45:06Z"
version: "edge"
```

@@ -53,12 +53,12 @@ wrk -t2 -c100 -d96h http://cafe.example.com/coffee
Running 5760m test @ http://cafe.example.com/coffee
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 182.30ms 146.48ms 2.00s 82.86%
Req/Sec 306.26 204.19 2.18k 65.75%
207104807 requests in 5760.00m, 72.17GB read
Socket errors: connect 0, read 362418, write 218736, timeout 19693
Requests/sec: 599.26
Transfer/sec: 218.97KB
Latency 177.87ms 143.16ms 2.00s 82.78%
Req/Sec 314.21 211.96 2.32k 65.78%
212474192 requests in 5760.00m, 74.08GB read
Socket errors: connect 0, read 346218, write 508394, timeout 10574
Requests/sec: 614.80
Transfer/sec: 224.78KB
```

HTTPS:
@@ -69,12 +69,12 @@ wrk -t2 -c100 -d96h https://cafe.example.com/tea
Running 5760m test @ https://cafe.example.com/tea
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 172.15ms 118.59ms 2.00s 68.16%
Req/Sec 305.26 203.43 2.33k 65.34%
206387831 requests in 5760.00m, 70.81GB read
Socket errors: connect 44059, read 356656, write 0, timeout 126
Requests/sec: 597.19
Transfer/sec: 214.84KB
Latency 168.63ms 118.44ms 2.00s 69.72%
Req/Sec 313.02 210.77 2.31k 65.50%
211667319 requests in 5760.00m, 72.65GB read
Socket errors: connect 80661, read 338673, write 0, timeout 147
Requests/sec: 612.46
Transfer/sec: 220.42KB
```
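
As a quick sanity check, the reported requests/sec figures can be reproduced from the raw totals above by dividing by
the 96-hour (345,600 s) test duration:

```console
# HTTP: 212474192 requests / (96 * 3600) s ~= 614.8 req/s, matching wrk's Requests/sec
python3 -c "print(212474192 / (96 * 3600))"
# HTTPS: 211667319 requests / (96 * 3600) s ~= 612.5 req/s
python3 -c "print(211667319 / (96 * 3600))"
```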

While there are socket errors in the output, there are no connection-related errors in NGINX logs.
@@ -86,28 +86,31 @@ Further investigation is out of scope of this test.

```text
resource.type="k8s_container"
resource.labels.cluster_name="ciara-1"
resource.labels.cluster_name="<CLUSTER_NAME>"
resource.labels.namespace_name="nginx-gateway"
resource.labels.container_name="nginx-gateway"
severity=ERROR
SEARCH("error")
```
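
For reference, roughly the same query can be run from the command line with the gcloud CLI. This is only a sketch: the
project ID and freshness window are placeholders, and the `SEARCH` function is swapped for a plain substring
restriction:

```console
gcloud logging read \
  'resource.type="k8s_container" AND
   resource.labels.cluster_name="<CLUSTER_NAME>" AND
   resource.labels.namespace_name="nginx-gateway" AND
   resource.labels.container_name="nginx-gateway" AND
   severity=ERROR AND "error"' \
  --project=<PROJECT_ID> --freshness=5d --format=json
```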

There were 36 error logs across 2 pod instances. They came in 2 almost identical batches, one on the first day of
running the test, after approximately 6 hours, and the second 2.5 days later. They were both relating to leader election
loss, and subsequent restart (see https://github.com/nginxinc/nginx-gateway-fabric/issues/1100).

Both error batches caused the pod to restart, but not terminate. However, the first pod was terminated about 10 minutes
after the first error batch and subsequent restart occurred. Exactly why this pod was terminated is not currently clear,
but it looks to be a cluster event (perhaps an upgrade) as the coffee and tea pods were terminated at that time also.
Strangely, there were 6 pod restarts in total of the second pod, but no other errors were observed over the test period
other than what was seen above, and grepping the logs for start-up logs only produced the 2 known restarts relating to
the leader election loss, plus initial start-up of both pods (4 in total).

```console
kubectl get pods -n nginx-gateway
NAME READY STATUS RESTARTS AGE
my-release-nginx-gateway-fabric-78d4b84447-4hss5 2/2 Running 6 (31h ago) 3d22h
```
There were 53 error logs, and 6 restarts, across 3 pod instances. The error logs came in 3 almost identical batches,
starting just over 24 hours after the initial deployment, with each subsequent batch arriving just over 24 hours after
the last. They all related to leader election loss and the subsequent restart
(see https://github.com/nginxinc/nginx-gateway-fabric/issues/1100). There were also 2 termination events, both of which
occurred approximately 5 minutes after a leader election loss and successful restart.

Each error batch caused the pod to restart, but not terminate. The first pod was terminated about 5 minutes after the
first error batch and subsequent restart occurred, and a similar occurrence happened after the third error batch; there
was no termination event after the second. Exactly why these pods were terminated is not currently clear, but it looks
to be a cluster event (perhaps an upgrade), as the coffee and tea pods were terminated at the same time. All the
restarts happened at roughly the same time each day.

```text
{"level":"info", "msg":"Starting manager", "ts":"2023-12-13T17:45:10Z"} -> Start-up
{"level":"info", "msg":"Starting manager", "ts":"2023-12-14T17:43:11Z"} -> First error batch
{"level":"info", "msg":"Starting manager", "ts":"2023-12-14T17:47:36Z"} -> First pod terminated, second pod start
{"level":"info", "msg":"Starting manager", "ts":"2023-12-15T17:47:22Z"} -> Second error batch
{"level":"info", "msg":"Starting manager", "ts":"2023-12-16T17:52:21Z"} -> Third error batch
{"level":"info", "msg":"Starting manager", "ts":"2023-12-16T17:58:01Z"} -> Second pod terminated, third pod start
```
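
The restart counts and the current leader can also be checked directly in the cluster; a sketch, assuming the
leader-election lease lives in the same `nginx-gateway` namespace as the Deployment:

```console
# RESTARTS column shows per-pod restart counts
kubectl get pods -n nginx-gateway
# HOLDER column shows which pod currently holds the leader-election lease
kubectl get leases -n nginx-gateway
```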

### NGINX
@@ -118,7 +121,7 @@ Errors:

```text
resource.type=k8s_container AND
resource.labels.pod_name="my-release-nginx-gateway-fabric-78d4b84447-4hss5" AND
resource.labels.cluster_name="<CLUSTER_NAME>" AND
resource.labels.container_name="nginx" AND
severity=ERROR AND
SEARCH("`[warn]`") OR SEARCH("`[error]`")
@@ -132,7 +135,7 @@ Non-200 response codes in NGINX access logs:

```text
resource.type=k8s_container AND
resource.labels.pod_name="my-release-nginx-gateway-fabric-78d4b84447-4hss5" AND
resource.labels.cluster_name="<CLUSTER_NAME>" AND
resource.labels.container_name="nginx"
"GET" "HTTP/1.1" -"200"
```
@@ -145,47 +148,61 @@ No such responses.

![memory.png](memory.png)

Memory usage dropped twice which appears to correspond with the restarts seen above relating to leader election.
Interestingly, before the first restart and after the second restart, memory usage sat at about 8.5MiB, but for the
majority of the test run, memory usage was about 9.5-10MiB. The previous release test run also had memory usage at
about 9-10MiB, but more stable usage across the duration of the test. However, there were no restarts observed in the
v1.0.0 test run. I don't think there is anything to investigate here.
Memory usage dropped once, which appears to correspond with a restart event relating to leader election seen above.
Memory did seem to creep up before this restart, and then usage remained consistent for the rest of the test. Memory
usage was slightly higher than what was observed in the v1.0.0 longevity test, but unless we see this trending upwards
in future longevity tests, I don't think there is anything to investigate here.

### Containers CPU

![cpu.png](cpu.png)

No unexpected spikes or drops.
There was one drop in NGINX CPU, which corresponds with a Pod termination event.

### NGINX metrics

In this test, NGINX metrics were not correctly exported so no dashboards are available for these.
![stub-status.png](stub-status.png)

The drop in _requests_ on Dec 16 corresponds to a Pod termination event. There are no correlated errors in the NGINX logs.

### Reloads

In this test, NGINX metrics were not correctly exported so no dashboards are available for these.
Rate of reloads - successful and errors:

![reloads.png](reloads.png)

Reload spikes correspond to 1-hour periods of backend re-rollouts. The smaller spikes correspond to periodic
reconciliation of Secrets, which (incorrectly) triggers a reload -- https://github.com/nginxinc/nginx-gateway-fabric/issues/1112

No reloads finished with an error.

Reload time distribution with 50th, 95th and 99th percentiles and the threshold:

![reload-time.png](reload-time.png)

Note - 60s is the threshold for waiting for NGINX to be reloaded.
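
The reload-rate and percentile panels above come from the NGF Prometheus metrics; queries along these lines reproduce
them (a sketch -- the Prometheus address and the 5m rate window are assumptions):

```console
# rate of successful NGINX reloads per second
promtool query instant http://<PROMETHEUS>:9090 \
  'rate(nginx_gateway_fabric_nginx_reloads_total[5m])'
# 95th percentile reload duration (milliseconds), computed from the histogram buckets
promtool query instant http://<PROMETHEUS>:9090 \
  'histogram_quantile(0.95, rate(nginx_gateway_fabric_nginx_reloads_milliseconds_bucket[5m]))'
```
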
Reload-related metrics at the end of the test:

```text
# HELP nginx_gateway_fabric_nginx_reloads_milliseconds Duration in milliseconds of NGINX reloads
# TYPE nginx_gateway_fabric_nginx_reloads_milliseconds histogram
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="500"} 1647
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="1000"} 4043
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="5000"} 4409
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="10000"} 4409
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="30000"} 4409
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="+Inf"} 4409
nginx_gateway_fabric_nginx_reloads_milliseconds_sum{class="nginx"} 2.701667e+06
nginx_gateway_fabric_nginx_reloads_milliseconds_count{class="nginx"} 4409
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="500"} 4626
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="1000"} 6831
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="5000"} 7317
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="10000"} 7317
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="30000"} 7317
nginx_gateway_fabric_nginx_reloads_milliseconds_bucket{class="nginx",le="+Inf"} 7317
nginx_gateway_fabric_nginx_reloads_milliseconds_sum{class="nginx"} 3.060948e+06
nginx_gateway_fabric_nginx_reloads_milliseconds_count{class="nginx"} 7317
# HELP nginx_gateway_fabric_nginx_reloads_total Number of successful NGINX reloads
# TYPE nginx_gateway_fabric_nginx_reloads_total counter
nginx_gateway_fabric_nginx_reloads_total{class="nginx"} 4409
nginx_gateway_fabric_nginx_reloads_total{class="nginx"} 7317
```

All successful reloads took less than 5 seconds, with most (>90%) under 1 second.
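
These figures follow directly from the histogram above -- a quick check of the fraction completing within 1 second and
the mean reload time:

```console
# fraction of reloads completing in <=1s (le="1000" bucket / total) ~= 93.4%
python3 -c "print(6831 / 7317)"
# mean reload time: sum / count ~= 418 ms
python3 -c "print(3060948 / 7317)"
```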

## Existing Issues still relevant

- NGF unnecessary reloads NGINX when it reconciles
Secrets - https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
- NGF unnecessarily reloads NGINX when it reconciles Secrets - https://github.com/nginxinc/nginx-gateway-fabric/issues/1112
- Use NGF Logger in Client-Go Library - https://github.com/nginxinc/nginx-gateway-fabric/issues/1101
Binary file modified tests/longevity/results/1.1.0/cpu.png
Binary file modified tests/longevity/results/1.1.0/memory.png
Binary file added tests/longevity/results/1.1.0/reload-time.png
Binary file added tests/longevity/results/1.1.0/reloads.png
Binary file added tests/longevity/results/1.1.0/stub-status.png