
Don't send 429 when max concurrency is reached #19

Merged · 1 commit merged into gadget_dev_1_19 from sc/dont-429 on Jun 26, 2023

Conversation


@scott-rc commented on May 19, 2023

Now that max concurrency is being enforced, we're beginning to see the following logs in production:

{
  "error": " - function 'a228d02a-6d62-4ee8-91e5-0278682794c0_870499720' concurrency '1' limit reached.",
  "fields": {
    "ts": "2023-05-21T16:19:02.276Z",
    "caller": "executor/api.go:88",
    "function_name": "appenv-fn--34810--68301",
    "logger": "executor"
  },
  "message": "error getting service for function",
  "_time": "2023-05-21T16:19:02.276436375Z"
}

The executor returns that error ☝️ when the router asks it for a function's endpoint and the function meets all of the following (see the sketch after this list):

  • has n specialized pods + n specializing pods >= concurrency
  • every specialized pod is serving requestsPerPod requests
  • every specializing pod has requestsPerPod requests queued for it
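A minimal, self-contained Go sketch of that check. The type and field names (funcSvc, funcSvcGroup, specializing, queuedForSpecializing) are stand-ins for the real poolcache state, not the actual Fission code:

package main

import "fmt"

// Stand-ins for the executor's per-function poolcache state; field names are illustrative.
type funcSvc struct {
	activeRequests int
}

type funcSvcGroup struct {
	svcs                  []*funcSvc // specialized pods
	specializing          int        // pods still specializing
	queuedForSpecializing int        // requests already queued for specializing pods
}

// limitReached mirrors the three conditions above: when it returns true, the
// executor responds with the "concurrency limit reached" error, which becomes a 429.
func limitReached(g *funcSvcGroup, concurrency, requestsPerPod int) bool {
	if len(g.svcs)+g.specializing < concurrency {
		return false // room to specialize another pod
	}
	for _, s := range g.svcs {
		if s.activeRequests < requestsPerPod {
			return false // a specialized pod still has spare capacity
		}
	}
	// every specializing pod already has requestsPerPod requests queued for it
	return g.queuedForSpecializing >= g.specializing*requestsPerPod
}

func main() {
	g := &funcSvcGroup{svcs: []*funcSvc{{activeRequests: 1}}}
	fmt.Println(limitReached(g, 1, 1)) // true: the one allowed pod is saturated
}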

The router communicates with the executor here via HTTP and that error ☝️ turns into a 429 Too Many Requests HTTP response.

Since the executor returns 429 Too Many Requests, and Fission uses hashicorp's retryable http client which automatically retries 429 errors, the request eventually succeeds and the router gets a pod's endpoint.
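For illustration, here's a minimal sketch of that retry behaviour using hashicorp's go-retryablehttp directly; the router's real wiring goes through Fission's internals, and the URL and attempt count simply mirror the router logs further down:

package main

import (
	"log"

	"github.com/hashicorp/go-retryablehttp"
)

func main() {
	client := retryablehttp.NewClient()
	client.RetryMax = 4 // five attempts total, matching "giving up after 5 attempt(s)" in the logs below

	// The default retry policy treats 429 responses as retryable and backs off
	// exponentially (1s, 2s, 4s, 8s) before giving up.
	resp, err := client.Post(
		"http://executor.fission-development/v2/getServiceForFunction",
		"application/json", []byte(`{}`))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
}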

To prove that the request for a function endpoint eventually succeeds, here's how often we're seeing "concurrency limit reached":

[Screenshot: frequency of "concurrency limit reached" log entries]

And here's how often we're seeing "error posting to getting service for function":

[Screenshot: frequency of "error posting to getting service for function" log entries]

As you can see, we aren't seeing any "error posting to getting service for function" logs, but rate limiting our sandbox pods in Fission, rather than in our API, feels unnecessary.


Fission uses a function's requestsPerPod and concurrency to both scale (specialize) and rate limit (return 429). This PR removes the rate-limiting aspect and returns a random pod's endpoint instead of a 429.
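As a rough sketch of the new fallback (randomSvc appears in the diff excerpt quoted later; its exact shape here is an assumption):

package main

import (
	"fmt"
	"math/rand"
)

// funcSvc stands in for the executor's cached service record; fields assumed.
type funcSvc struct {
	address string
}

// randomSvc picks one of the already-specialized services at random instead of
// returning an error once the concurrency limit is hit.
func randomSvc(svcs []*funcSvc) *funcSvc {
	return svcs[rand.Intn(len(svcs))]
}

func main() {
	svcs := []*funcSvc{{"10.0.0.1:8888"}, {"10.0.0.2:8888"}, {"10.0.0.3:8888"}}
	fmt.Println(randomSvc(svcs).address)
}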

To 🎩 (manually test) this, I set our dev sandbox's requestsPerPod to 1 and ran our steady scenario:

k6 run tests/k6/simple-action.js

steady({ rps: 10, duration: "1m" })

Using Fission 1.19.2

     ✗ is status 200
      ↳  95% — ✓ 526 / ✗ 27
     ✗ has data
      ↳  95% — ✓ 526 / ✗ 27
     ✗ was successful
      ↳  95% — ✓ 526 / ✗ 27

     checks.........................: 95.11% ✓ 1578     ✗ 81  
     data_received..................: 708 kB 12 kB/s
     data_sent......................: 301 kB 5.0 kB/s
     dropped_iterations.............: 47     0.783292/s
     http_req_blocked...............: avg=700.35µs min=2µs     med=4µs     max=149.07ms p(90)=20.8µs  p(95)=3.66ms  
     http_req_connecting............: avg=20.06µs  min=0s      med=0s      max=342µs    p(90)=0s      p(95)=212.79µs
   ✗ http_req_duration..............: avg=1.33s    min=35.42ms med=44.85ms max=17.55s   p(90)=3.21s   p(95)=15.04s  
       { expected_response:true }...: avg=595.19ms min=35.42ms med=44.36ms max=16.63s   p(90)=87.95ms p(95)=3.24s   
   ✗ http_req_failed................: 4.88%  ✓ 27       ✗ 526 
     http_req_receiving.............: avg=66.87µs  min=30µs    med=56µs    max=563µs    p(90)=99µs    p(95)=122.59µs
     http_req_sending...............: avg=30.05µs  min=14µs    med=22µs    max=1.09ms   p(90)=43µs    p(95)=51µs    
     http_req_tls_handshaking.......: avg=665.11µs min=0s      med=0s      max=145.76ms p(90)=0s      p(95)=3.36ms  
     http_req_waiting...............: avg=1.33s    min=35.33ms med=44.74ms max=17.55s   p(90)=3.21s   p(95)=15.04s  
     http_reqs......................: 553    9.216177/s
     iteration_duration.............: avg=1.33s    min=35.65ms med=45.26ms max=17.61s   p(90)=3.22s   p(95)=15.04s  
     iterations.....................: 553    9.216177/s
     vus............................: 0      min=0      max=48
     vus_max........................: 49     min=6      max=49

Router logs

2023/05/23 18:10:13 [DEBUG] POST http://executor.fission-development/v2/getServiceForFunction (status: 429): retrying in 1s (4 left)
2023/05/23 18:10:13 [DEBUG] POST http://executor.fission-development/v2/getServiceForFunction (status: 429): retrying in 2s (3 left)
2023/05/23 18:10:15 [DEBUG] POST http://executor.fission-development/v2/getServiceForFunction (status: 429): retrying in 4s (2 left)
2023/05/23 18:10:15 [DEBUG] POST http://executor.fission-development/v2/getServiceForFunction (status: 429): retrying in 8s (1 left)
{
    "level": "error",
    "ts": "2023-05-23T18:10:31.263Z",
    "logger": "triggerset.http_trigger_set.appenv-fn-trigger--1--1",
    "caller": "router/functionHandler.go:653",
    "msg": "error from GetServiceForFunction",
    "trace_id": "1eeb4fb1d5de551f81ceec95fd772352",
    "error": "error posting to getting service for function: POST http://executor.fission-development/v2/getServiceForFunction giving up after 5 attempt(s)",
    "error_message": "error posting to getting service for function: POST http://executor.fission-development/v2/getServiceForFunction giving up after 5 attempt(s)",
    "function": {
        "apiVersion": "fission.io/v1",
        "kind": "Function",
        "namespace": "gadget-app-sandbox-development",
        "name": "appenv-fn--1--1"
    },
    "status_code": 500
}

Using This PR

     ✓ is status 200
     ✓ has data
     ✓ was successful

     checks.........................: 100.00% ✓ 1713     ✗ 0   
     data_received..................: 646 kB  11 kB/s
     data_sent......................: 304 kB  5.1 kB/s
     dropped_iterations.............: 30      0.499573/s
     http_req_blocked...............: avg=410.47µs min=2µs     med=3µs     max=40.29ms p(90)=10µs    p(95)=3.33ms  
     http_req_connecting............: avg=19.85µs  min=0s      med=0s      max=4.23ms  p(90)=0s      p(95)=174.49µs
   ✓ http_req_duration..............: avg=240.6ms  min=35.63ms med=42.86ms max=6.18s   p(90)=89.68ms p(95)=892.07ms
       { expected_response:true }...: avg=240.6ms  min=35.63ms med=42.86ms max=6.18s   p(90)=89.68ms p(95)=892.07ms
   ✓ http_req_failed................: 0.00%   ✓ 0        ✗ 571 
     http_req_receiving.............: avg=76.23µs  min=13µs    med=53µs    max=2.86ms  p(90)=94µs    p(95)=130.49µs
     http_req_sending...............: avg=28.78µs  min=14µs    med=23µs    max=963µs   p(90)=40µs    p(95)=52µs    
     http_req_tls_handshaking.......: avg=341.96µs min=0s      med=0s      max=39.99ms p(90)=0s      p(95)=3.07ms  
     http_req_waiting...............: avg=240.5ms  min=35.58ms med=42.77ms max=6.18s   p(90)=89.6ms  p(95)=891.96ms
     http_reqs......................: 571     9.50853/s
     iteration_duration.............: avg=241.4ms  min=35.9ms  med=43.23ms max=6.23s   p(90)=89.95ms p(95)=899.55ms
     iterations.....................: 571     9.50853/s
     vus............................: 0       min=0      max=31
     vus_max........................: 32      min=6      max=32

@scott-rc scott-rc force-pushed the sc/dont-429 branch 5 times, most recently from 624e13c to 5c5919e Compare May 23, 2023 15:30
@scott-rc scott-rc marked this pull request as ready for review May 23, 2023 19:41
pkg/executor/fscache/poolcache.go
}

// we can specialize another pod
// increment the number of requests waiting for a pod to finish specializing and return not found so that the executor specializes one


<3 this comment for clarifying this causes specialization to happen!

Comment on lines +181 to +188
otelUtils.LoggerWithTraceID(req.ctx, c.logger).
	Info("max concurrency reached, sending request to random pod",
		zap.String("function", req.function),
		zap.String("address", svc.val.Address),
		zap.Int("active_requests", svc.activeRequests),
		zap.Int("svc_waiting", funcSvcGroup.svcWaiting),
		zap.Int("queue_len", funcSvcGroup.queue.Len()),
	)


This log feels like it could be fairly noisy for our big apps. Is it intended to be temporary? Should we make it a debug log?

// check if we have any pods that are already specialized
if numPodsSpecialized > 0 {
	// we have specialized pod(s), just pick one at random and use it for this request
	svc := randomSvc(funcSvcGroup.svcs)


Curious what you all think about this scenario:

  1. One specialized pod
  2. A massive burst of requests, going well beyond the max requests
  3. All requests that spill over the max requests go to that one specialized pod because the others are specializing

Could we do more harm than good by having one pod take those requests? Should we consider an upper bound where we start putting these requests on the queue until a specializing pod becomes available?

scott-rc (Author)


That's a good point, how about this instead:

if numPodsSpecializing > 0 {
	// we still have pods that are specializing
	// increment the number of requests waiting for a pod to finish specializing and push this request onto the queue
} else {
	// we have the maximum amount of specialized pods, just pick one at random and use it for this request
}

☝️ This queues the requests for the specializing pods instead of the specialized one(s). We already have logic below to randomly distribute the queue among all the pods once we've specialized the last one.
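A self-contained sketch of that branching, under assumed types (the real poolcache fields and queue implementation differ):

package main

import (
	"container/list"
	"fmt"
	"math/rand"
)

type funcSvc struct {
	address string
}

type svcGroup struct {
	svcs         []*funcSvc // specialized pods
	specializing int        // pods still specializing
	svcWaiting   int        // requests waiting for a pod to finish specializing
	queue        *list.List // queued requests
}

// pick decides what to do with a request once every specialized pod is at its
// requestsPerPod limit: queue it while pods are still specializing, otherwise
// fall back to a random specialized pod.
func pick(g *svcGroup, reqID int) *funcSvc {
	if g.specializing > 0 {
		// we still have pods that are specializing: queue the request for them
		g.svcWaiting++
		g.queue.PushBack(reqID)
		return nil
	}
	// we already have the maximum number of specialized pods: pick one at random
	return g.svcs[rand.Intn(len(g.svcs))]
}

func main() {
	g := &svcGroup{
		svcs:         []*funcSvc{{address: "10.0.0.1:8888"}},
		specializing: 1,
		queue:        list.New(),
	}
	fmt.Println(pick(g, 1)) // nil: queued, a pod is still specializing
	g.specializing = 0
	fmt.Println(pick(g, 2).address) // served by a random specialized pod
}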


Yeah, I think in the absolute worst-case scenario outlined above, I'd rather have those requests (or at least some of them) distributed across the specializing pods so we don't overwhelm the one specialized pod. Some questions that come to my mind:

  • Do we have any sense of what would be overwhelming? I know node could definitely take on a ton of requests, but eventually it would topple over.
  • Latency-wise, anything we can measure? Would it be better to wait for specialization, or have the specialized pod slowly work through the requests?
  • If specialization fails, we keep things in the queue. Does that mean, for now, it would be best to send requests to the specialized pod instead, and follow up so that we pop things off the queue when an existing specialized pod can serve more requests, i.e. in markAvailable?

I'm not sure what's best, but I'll let you make that call!


@thegedge left a comment


Logic looks good! Really appreciate all the additional comments, which really help with understanding <3

Once @angelini is also on board, 🚢 away 🚀

@angelini

This LGTM, I'll merge this into our 1.19 branch and rebase that on the final 1.19 version that the Fission team shipped.

@angelini angelini merged commit 5f1f05b into gadget_dev_1_19 Jun 26, 2023
scott-rc added a commit that referenced this pull request Aug 10, 2023
This reverts commit 5f1f05b, reversing
changes made to fae2052.
scott-rc added a commit that referenced this pull request Aug 11, 2023
This reverts commit 5f1f05b, reversing
changes made to fae2052.
scott-rc added a commit that referenced this pull request Aug 11, 2023
This reverts commit 5f1f05b, reversing
changes made to fae2052.