Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bursts of webhooks cause bots to fail #2114

Closed
bcoe opened this issue Jun 18, 2021 · 8 comments
Closed

bursts of webhooks cause bots to fail #2114

bcoe opened this issue Jun 18, 2021 · 8 comments
Assignees
Labels
priority: p3 Desirable enhancement or fix. May not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@bcoe
Copy link
Contributor

bcoe commented Jun 18, 2021

I noticed this debugging #1714, and I believe it's the root cause of many of the bug reports (CC: @parthea, @Neenu1995 , @SurferJeffAtGoogle).

  • Cloud Functions only serve a single request at a time (Node.js can handle multiple requests at once, so this is inefficient.).
  • When OwlBot (or another bot) performs an action that creates many concurrent webhook events on GitHub, these WebHook events cause the number of Cloud Functions we're running to scale.

Screen Shot 2021-06-18 at 1 16 00 PM

  • During this scaling phase, response times are slow from cloud functions, resulting in 5 10s timeouts in webhooks (I believe 10 seconds is what GitHub allocates).

Screen Shot 2021-06-18 at 12 58 55 PM

Potential Solutions

  • Cloud functions recently added support for "min instances" which would allow us to run a larger pool of warmed up functions -- this could be very expensive, but might be smart for us to use in the short term.
  • Cloud Run allows multiple concurrent requests, it would be a better model for us.
@bcoe bcoe added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Jun 18, 2021
@chingor13 chingor13 assigned tmatsuo and unassigned chingor13 and bcoe Jul 2, 2021
@tmatsuo
Copy link
Contributor

tmatsuo commented Jul 2, 2021

We're running an experiment with Cloud Run frontend.

@bcoe
Copy link
Contributor Author

bcoe commented Jul 2, 2021

@SurferJeffAtGoogle @parthea we just stress tested @tmatsuo's experiment with 80 concurrent OwlBot builds, everything worked perfectly, no failed Webhooks.

I think we're in a much healthier place, I suggest we wait to close this until after a week of experimenting though.

@tmatsuo
Copy link
Contributor

tmatsuo commented Jul 13, 2021

We still see occasional webhook delivery failure, but I'd close this for owl-bot.
For other bots, it should still happen.

For a long term solution, I think we should have a shared frontend service for receiving webhook requests.
I also propose that we use the same frontend code for receiving Pub/Sub and Cron requests. We can use the same code, but we'll need a separate deployment because we only want to receive legitimate Pub/Sub and Cron requests from Cloud Scheduler and Pub/Sub.

  • Design shared frontend
  • Deploy the shared frontend and test it with canary bot
  • Capacity planning and load test
  • Migrate bots to the new frontend
  • Migrate cron and scheduler request to the new frontend
  • Ditch serverless scheduler proxy

@bcoe
Copy link
Contributor Author

bcoe commented Jul 14, 2021

@tmatsuo lets keep this open to track work on other bots?

@yoshi-automation yoshi-automation added 🚨 This issue needs some love. and removed 🚨 This issue needs some love. labels Nov 8, 2021
@sofisl
Copy link
Contributor

sofisl commented Dec 13, 2021

This issue is going OOSLO soon. Perhaps this would be better labeled as a process, or we demote it to a P3, given that the holidays are coming up.

@sofisl sofisl added priority: p3 Desirable enhancement or fix. May not be included in next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. labels Dec 14, 2021
@bcoe
Copy link
Contributor Author

bcoe commented Dec 15, 2021

@sofisl I think you made the right call downgrading this to p3, we could honestly probably call it a feature request alternatively.

@tmatsuo is working on a new design.

@bcoe
Copy link
Contributor Author

bcoe commented Jan 29, 2022

cross linking #2227.

@chingor13
Copy link
Contributor

We think this is resolved now. Opened #3313 to track progress in rolling out frontend cloud run services.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p3 Desirable enhancement or fix. May not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

6 participants
@chingor13 @tmatsuo @bcoe @yoshi-automation @sofisl and others