
Don't send 429 when max concurrency is reached #19

Merged · 1 commit merged into gadget_dev_1_19 from sc/dont-429 on Jun 26, 2023

Conversation


@scott-rc commented on May 19, 2023

Now that max concurrency is being enforced, we're beginning to see the following logs in production:

{
  "error": " - function 'a228d02a-6d62-4ee8-91e5-0278682794c0_870499720' concurrency '1' limit reached.",
  "fields": {
    "ts": "2023-05-21T16:19:02.276Z",
    "caller": "executor/api.go:88",
    "function_name": "appenv-fn--34810--68301",
    "logger": "executor"
  },
  "message": "error getting service for function",
  "_time": "2023-05-21T16:19:02.276436375Z"
}

The executor returns that error ☝️ when the router asks it for a function's endpoint and the function meets all of the following (see the sketch after this list):

  • has n specialized pods + n specializing pods >= concurrency
  • every specialized pod is serving requestsPerPod requests
  • every specializing pod has requestsPerPod requests queued for it
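A minimal, self-contained Go sketch of that check. The type and field names (funcSvc, funcSvcGroup, specializing, queuedForSpecializing) are stand-ins for the real poolcache state, not the actual Fission code:

package main

import "fmt"

// Stand-ins for the executor's per-function poolcache state; field names are illustrative.
type funcSvc struct {
	activeRequests int
}

type funcSvcGroup struct {
	svcs                  []*funcSvc // specialized pods
	specializing          int        // pods still specializing
	queuedForSpecializing int        // requests already queued for specializing pods
}

// limitReached mirrors the three conditions above: when it returns true, the
// executor responds with the "concurrency limit reached" error, which becomes a 429.
func limitReached(g *funcSvcGroup, concurrency, requestsPerPod int) bool {
	if len(g.svcs)+g.specializing < concurrency {
		return false // room to specialize another pod
	}
	for _, s := range g.svcs {
		if s.activeRequests < requestsPerPod {
			return false // a specialized pod still has spare capacity
		}
	}
	// every specializing pod already has requestsPerPod requests queued for it
	return g.queuedForSpecializing >= g.specializing*requestsPerPod
}

func main() {
	g := &funcSvcGroup{svcs: []*funcSvc{{activeRequests: 1}}}
	fmt.Println(limitReached(g, 1, 1)) // true: the one allowed pod is saturated
}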

The router communicates with the executor here via HTTP and that error ☝️ turns into a 429 Too Many Requests HTTP response.

Since the executor returns 429 Too Many Requests, and Fission uses hashicorp's retryable http client which automatically retries 429 errors, the request eventually succeeds and the router gets a pod's endpoint.
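For illustration, here's a minimal sketch of that retry behaviour using hashicorp's go-retryablehttp directly; the router's real wiring goes through Fission's internals, and the URL and attempt count simply mirror the router logs further down:

package main

import (
	"log"

	"github.com/hashicorp/go-retryablehttp"
)

func main() {
	client := retryablehttp.NewClient()
	client.RetryMax = 4 // five attempts total, matching "giving up after 5 attempt(s)" in the logs below

	// The default retry policy treats 429 responses as retryable and backs off
	// exponentially (1s, 2s, 4s, 8s) before giving up.
	resp, err := client.Post(
		"http://executor.fission-development/v2/getServiceForFunction",
		"application/json", []byte(`{}`))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
}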

To prove that the request for a function endpoint eventually succeeds, here's how often we're seeing "concurrency limit reached":

[Screenshot: frequency of "concurrency limit reached" log entries]

And here's how often we're seeing "error posting to getting service for function":

[Screenshot: frequency of "error posting to getting service for function" log entries]

As you can see, we aren't seeing any "error posting to getting service for function" logs, but rate limiting our sandbox pods in Fission, rather than in our API, feels unnecessary.


Fission uses a function's requestsPerPod and concurrency to both scale (specialize) and rate limit (return 429). This PR removes the rate-limiting aspect and returns a random pod's endpoint instead of a 429.
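As a rough sketch of the new fallback (randomSvc appears in the diff excerpt quoted later; its exact shape here is an assumption):

package main

import (
	"fmt"
	"math/rand"
)

// funcSvc stands in for the executor's cached service record; fields assumed.
type funcSvc struct {
	address string
}

// randomSvc picks one of the already-specialized services at random instead of
// returning an error once the concurrency limit is hit.
func randomSvc(svcs []*funcSvc) *funcSvc {
	return svcs[rand.Intn(len(svcs))]
}

func main() {
	svcs := []*funcSvc{{"10.0.0.1:8888"}, {"10.0.0.2:8888"}, {"10.0.0.3:8888"}}
	fmt.Println(randomSvc(svcs).address)
}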

To 🎩 (manually test) this, I set our dev sandbox's requestsPerPod to 1 and ran our steady scenario:

k6 run tests/k6/simple-action.js

steady({ rps: 10, duration: "1m" })

Using Fission 1.19.2

     ✗ is status 200
      ↳  95% — ✓ 526 / ✗ 27
     ✗ has data
      ↳  95% — ✓ 526 / ✗ 27
     ✗ was successful
      ↳  95% — ✓ 526 / ✗ 27

     checks.........................: 95.11% ✓ 1578     ✗ 81  
     data_received..................: 708 kB 12 kB/s
     data_sent......................: 301 kB 5.0 kB/s
     dropped_iterations.............: 47     0.783292/s
     http_req_blocked...............: avg=700.35µs min=2µs     med=4µs     max=149.07ms p(90)=20.8µs  p(95)=3.66ms  
     http_req_connecting............: avg=20.06µs  min=0s      med=0s      max=342µs    p(90)=0s      p(95)=212.79µs
   ✗ http_req_duration..............: avg=1.33s    min=35.42ms med=44.85ms max=17.55s   p(90)=3.21s   p(95)=15.04s  
       { expected_response:true }...: avg=595.19ms min=35.42ms med=44.36ms max=16.63s   p(90)=87.95ms p(95)=3.24s   
   ✗ http_req_failed................: 4.88%  ✓ 27       ✗ 526 
     http_req_receiving.............: avg=66.87µs  min=30µs    med=56µs    max=563µs    p(90)=99µs    p(95)=122.59µs
     http_req_sending...............: avg=30.05µs  min=14µs    med=22µs    max=1.09ms   p(90)=43µs    p(95)=51µs    
     http_req_tls_handshaking.......: avg=665.11µs min=0s      med=0s      max=145.76ms p(90)=0s      p(95)=3.36ms  
     http_req_waiting...............: avg=1.33s    min=35.33ms med=44.74ms max=17.55s   p(90)=3.21s   p(95)=15.04s  
     http_reqs......................: 553    9.216177/s
     iteration_duration.............: avg=1.33s    min=35.65ms med=45.26ms max=17.61s   p(90)=3.22s   p(95)=15.04s  
     iterations.....................: 553    9.216177/s
     vus............................: 0      min=0      max=48
     vus_max........................: 49     min=6      max=49

Router logs

2023/05/23 18:10:13 [DEBUG] POST http://executor.fission-development/v2/getServiceForFunction (status: 429): retrying in 1s (4 left)
2023/05/23 18:10:13 [DEBUG] POST http://executor.fission-development/v2/getServiceForFunction (status: 429): retrying in 2s (3 left)
2023/05/23 18:10:15 [DEBUG] POST http://executor.fission-development/v2/getServiceForFunction (status: 429): retrying in 4s (2 left)
2023/05/23 18:10:15 [DEBUG] POST http://executor.fission-development/v2/getServiceForFunction (status: 429): retrying in 8s (1 left)
{
    "level": "error",
    "ts": "2023-05-23T18:10:31.263Z",
    "logger": "triggerset.http_trigger_set.appenv-fn-trigger--1--1",
    "caller": "router/functionHandler.go:653",
    "msg": "error from GetServiceForFunction",
    "trace_id": "1eeb4fb1d5de551f81ceec95fd772352",
    "error": "error posting to getting service for function: POST http://executor.fission-development/v2/getServiceForFunction giving up after 5 attempt(s)",
    "error_message": "error posting to getting service for function: POST http://executor.fission-development/v2/getServiceForFunction giving up after 5 attempt(s)",
    "function": {
        "apiVersion": "fission.io/v1",
        "kind": "Function",
        "namespace": "gadget-app-sandbox-development",
        "name": "appenv-fn--1--1"
    },
    "status_code": 500
}

Using This PR

     ✓ is status 200
     ✓ has data
     ✓ was successful

     checks.........................: 100.00% ✓ 1713     ✗ 0   
     data_received..................: 646 kB  11 kB/s
     data_sent......................: 304 kB  5.1 kB/s
     dropped_iterations.............: 30      0.499573/s
     http_req_blocked...............: avg=410.47µs min=2µs     med=3µs     max=40.29ms p(90)=10µs    p(95)=3.33ms  
     http_req_connecting............: avg=19.85µs  min=0s      med=0s      max=4.23ms  p(90)=0s      p(95)=174.49µs
   ✓ http_req_duration..............: avg=240.6ms  min=35.63ms med=42.86ms max=6.18s   p(90)=89.68ms p(95)=892.07ms
       { expected_response:true }...: avg=240.6ms  min=35.63ms med=42.86ms max=6.18s   p(90)=89.68ms p(95)=892.07ms
   ✓ http_req_failed................: 0.00%   ✓ 0        ✗ 571 
     http_req_receiving.............: avg=76.23µs  min=13µs    med=53µs    max=2.86ms  p(90)=94µs    p(95)=130.49µs
     http_req_sending...............: avg=28.78µs  min=14µs    med=23µs    max=963µs   p(90)=40µs    p(95)=52µs    
     http_req_tls_handshaking.......: avg=341.96µs min=0s      med=0s      max=39.99ms p(90)=0s      p(95)=3.07ms  
     http_req_waiting...............: avg=240.5ms  min=35.58ms med=42.77ms max=6.18s   p(90)=89.6ms  p(95)=891.96ms
     http_reqs......................: 571     9.50853/s
     iteration_duration.............: avg=241.4ms  min=35.9ms  med=43.23ms max=6.23s   p(90)=89.95ms p(95)=899.55ms
     iterations.....................: 571     9.50853/s
     vus............................: 0       min=0      max=31
     vus_max........................: 32      min=6      max=32

@scott-rc scott-rc force-pushed the sc/dont-429 branch 5 times, most recently from 624e13c to 5c5919e Compare May 23, 2023 15:30
@scott-rc scott-rc marked this pull request as ready for review May 23, 2023 19:41
pkg/executor/fscache/poolcache.go
}

// we can specialize another pod
// increment the number of requests waiting for a pod to finish specializing and return not found so that the executor specializes one


<3 this comment for clarifying this causes specialization to happen!

Comment on lines +181 to +188
otelUtils.LoggerWithTraceID(req.ctx, c.logger).
	Info("max concurrency reached, sending request to random pod",
		zap.String("function", req.function),
		zap.String("address", svc.val.Address),
		zap.Int("active_requests", svc.activeRequests),
		zap.Int("svc_waiting", funcSvcGroup.svcWaiting),
		zap.Int("queue_len", funcSvcGroup.queue.Len()),
	)


This log feels like it could be fairly noisy for our big apps. Is it intended to be temporary? Should we make it a debug log?

// check if we have any pods that are already specialized
if numPodsSpecialized > 0 {
	// we have specialized pod(s), just pick one at random and use it for this request
	svc := randomSvc(funcSvcGroup.svcs)


Curious what you all think about this scenario:

  1. One specialized pod
  2. A massive burst of requests, going well beyond the max requests
  3. All requests that spill over the max requests go to that one specialized pod because the others are specializing

Could we do more harm than good by having one pod take those requests? Should we consider an upper bound where we start putting these requests on the queue until a specializing pod becomes available?

scott-rc (Author)


That's a good point, how about this instead:

if numPodsSpecializing > 0 {
	// we still have pods that are specializing
	// increment the number of requests waiting for a pod to finish specializing and push this request onto the queue
} else {
	// we have the maximum amount of specialized pods, just pick one at random and use it for this request
}

☝️ This queues the requests for the specializing pods instead of the specialized one(s). We already have logic below to randomly distribute the queue among all the pods once we've specialized the last one.
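A self-contained sketch of that branching, under assumed types (the real poolcache fields and queue implementation differ):

package main

import (
	"container/list"
	"fmt"
	"math/rand"
)

type funcSvc struct {
	address string
}

type svcGroup struct {
	svcs         []*funcSvc // specialized pods
	specializing int        // pods still specializing
	svcWaiting   int        // requests waiting for a pod to finish specializing
	queue        *list.List // queued requests
}

// pick decides what to do with a request once every specialized pod is at its
// requestsPerPod limit: queue it while pods are still specializing, otherwise
// fall back to a random specialized pod.
func pick(g *svcGroup, reqID int) *funcSvc {
	if g.specializing > 0 {
		// we still have pods that are specializing: queue the request for them
		g.svcWaiting++
		g.queue.PushBack(reqID)
		return nil
	}
	// we already have the maximum number of specialized pods: pick one at random
	return g.svcs[rand.Intn(len(g.svcs))]
}

func main() {
	g := &svcGroup{
		svcs:         []*funcSvc{{address: "10.0.0.1:8888"}},
		specializing: 1,
		queue:        list.New(),
	}
	fmt.Println(pick(g, 1)) // nil: queued, a pod is still specializing
	g.specializing = 0
	fmt.Println(pick(g, 2).address) // served by a random specialized pod
}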


Yeah, I think in the absolute worst-case scenario outlined above, I'd rather have those requests (or at least some of them) distributed across the specializing pods so we don't overwhelm the one specialized pod. Some questions that come to my mind:

  • Do we have any sense of what would be overwhelming? I know node could definitely take on a ton of requests, but eventually it would topple over.
  • Latency-wise, anything we can measure? Would it be better to wait for specialization, or have the specialized pod slowly work through the requests?
  • If specialization fails, we keep things in the queue. Does that mean, for now, it would be best to send requests to the specialized pod instead, and follow up so that we pop things off the queue when an existing specialized pod can serve more requests, i.e. in markAvailable?

I'm not sure what's best, but I'll let you make that call!


@thegedge left a comment


Logic looks good! Really appreciate all the additional comments, which really help with understanding <3

Once @angelini is also on board, 🚢 away 🚀

@angelini

This LGTM, I'll merge this into our 1.19 branch and rebase that on the final 1.19 version that the Fission team shipped.

@angelini angelini merged commit 5f1f05b into gadget_dev_1_19 Jun 26, 2023
scott-rc added a commit that referenced this pull request Aug 10, 2023
This reverts commit 5f1f05b, reversing
changes made to fae2052.
scott-rc added a commit that referenced this pull request Aug 11, 2023
This reverts commit 5f1f05b, reversing
changes made to fae2052.
scott-rc added a commit that referenced this pull request Aug 11, 2023
This reverts commit 5f1f05b, reversing
changes made to fae2052.