Split apart web server and build server #795

jyn514 · 2020-05-31T16:30:12Z

Currently, all web requests are served from the same server that performs the crate builds. This has given us trouble in the past, since builds consume a lot of resources, driving up response times for web requests. It would be useful to instead have different servers for the website and builds.

Possible implementation:

- The web server continues having primary access to the database. Builders don't have access at all, but instead receive webhooks (or similar) telling them to start a build. When the build finishes, they report back to the main server.
- We have a 'heartbeat' mechanism similar to crater: if a build server doesn't send a message to the web server in 5 minutes, we mark it as crashed.
- We add a new state for crates in the queue, so the states are queued, assigned, and finished, where assigned means there is a build server currently building that crate (see "Add crates to the database before building them", #1011).
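To make the queue states and the heartbeat rule concrete, here is a minimal Python sketch of the bookkeeping the web server might do. It is an illustration only, not docs.rs code: the `Coordinator` class and all of its method names are hypothetical, and requeueing the crates of a crashed builder is an assumption (the proposal only says the builder is marked as crashed).

```python
import enum
import time


class CrateState(enum.Enum):
    QUEUED = "queued"
    ASSIGNED = "assigned"   # a build server is currently building the crate
    FINISHED = "finished"


HEARTBEAT_TIMEOUT = 5 * 60  # seconds; the "5 minutes" from the proposal


class Coordinator:
    """Web-server-side bookkeeping (hypothetical names throughout)."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.states = {}       # crate name -> CrateState
        self.assigned_to = {}  # crate name -> builder id
        self.last_seen = {}    # builder id -> time of last message

    def enqueue(self, crate):
        self.states[crate] = CrateState.QUEUED

    def heartbeat(self, builder):
        self.last_seen[builder] = self.clock()

    def assign_next(self, builder):
        """Hand the first queued crate to `builder`, or return None."""
        self.heartbeat(builder)
        for crate, state in self.states.items():
            if state is CrateState.QUEUED:
                self.states[crate] = CrateState.ASSIGNED
                self.assigned_to[crate] = builder
                return crate
        return None

    def report_finished(self, builder, crate):
        self.heartbeat(builder)
        if self.assigned_to.get(crate) == builder:
            self.states[crate] = CrateState.FINISHED
            del self.assigned_to[crate]

    def reap_crashed(self):
        """Forget silent builders and requeue their crates (assumption)."""
        now = self.clock()
        crashed = {b for b, seen in self.last_seen.items()
                   if now - seen > HEARTBEAT_TIMEOUT}
        for crate, builder in list(self.assigned_to.items()):
            if builder in crashed:
                self.states[crate] = CrateState.QUEUED
                del self.assigned_to[crate]
        for builder in crashed:
            del self.last_seen[builder]
        return crashed
```

Any message from a builder counts as a heartbeat here, so a builder that is actively requesting or reporting work never needs a separate ping.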

Benefits:

- We could build multiple crates at a time much more easily (alleviating some of the trouble from "stop building crates on all targets", #343) just by having more build servers. This requires "Add crates to the database before building them" (#1011) first.
- We could potentially target more architectures without cross-compiling, by having build servers that are e.g. hosted on ARM.
- We could eventually share some build resources with crater.

@pietroalbini has experience with implementing this for crater and is helping a lot with the design aspect.


syphar · 2021-04-01T09:00:44Z

Just a thought: wouldn't it be a possible intermediate step to just add a second server that uses the same database and bucket?
Also, the local index repo could move to the build-server.

Then the webserver could benefit from having less load, and we can slowly work on the webhook logic.

Nemo157 · 2021-04-01T09:52:09Z

IIRC the blocker last time I touched this was the metrics: currently the builder metrics are merged into the web server metrics to get to prometheus. We would either need the metrics from the builder to be stored in the database and then served by the web server, or pushed directly from the builder to prometheus somehow.

Once those are sorted, running two processes for this should be trivial.

pietroalbini · 2021-04-01T10:55:15Z

A possible approach for the metrics would be to run a really small webserver out of the builder that just exposes a /metrics endpoint.
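As a rough illustration of how small such a server can be, here is a Python sketch using only the standard library. It is not the docs.rs implementation; the metric names and the `start_metrics_server` helper are hypothetical, and the output is the plain Prometheus text exposition format.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical metric names; a real builder would export its own set.
METRICS = {"builder_total_builds": 0, "builder_failed_builds": 0}
METRICS_LOCK = threading.Lock()


def render_metrics():
    """Render the counters in the Prometheus text exposition format."""
    with METRICS_LOCK:
        lines = [f"{name} {value}" for name, value in sorted(METRICS.items())]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep scrape requests out of the builder's own logs


def start_metrics_server(port=0):
    """Serve /metrics on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The build loop would update `METRICS` under `METRICS_LOCK`, and prometheus would scrape the builder directly instead of going through the web server.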

syphar · 2022-05-10T17:57:01Z

Just a thought, not sure if we should do it in the first version already:

When the goal is to be able to just rebuild instances via ansible, then we perhaps should start:

- moving state from local files to the database (for example the last seen commit of the index?)
- moving scripts from the server to the repo (or any repo)
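For the first point, a minimal sketch of keeping the last seen index commit in a database table instead of a local file. It uses Python with SQLite purely so the example is self-contained; the `config` table, the key name, and the function names are all assumptions, not the docs.rs schema.

```python
import sqlite3

KEY = "last_seen_index_commit"  # hypothetical key name


def open_state_db(path=":memory:"):
    """Open the database and make sure the key/value table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS config ("
        "name TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    return conn


def set_last_seen_commit(conn, commit):
    """Store the commit, overwriting any previous value for the key."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO config (name, value) VALUES (?, ?)",
            (KEY, commit),
        )


def get_last_seen_commit(conn):
    """Return the stored commit, or None on a freshly created database."""
    row = conn.execute(
        "SELECT value FROM config WHERE name = ?", (KEY,)
    ).fetchone()
    return row[0] if row else None
```

With the state stored this way, a recreated machine can resume from the stored commit rather than depending on a file that lived on the old instance.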

IMO as long as we only have one build server (that also manages the queue & crates-index-diff) we don't need #1011 and we can just reboot / recreate the machine.

There would be an initial performance hit on every new build-server (downloading the toolchain, docker image etc), and also for every new webserver (local archive indexes come to mind, not sure if there is more).

For webservers it would be theoretically possible to run multiple machines with an ELB, while I actually prefer better caching on the CDN (#1552 and/or #1560).

syphar · 2022-07-16T16:35:45Z

Some additional thoughts on this, steps to be able to horizontally scale the build-server:

1. separate registry watcher

   We move the registry-watcher into a separate container / process that only exists once, saving the last seen ref additionally in the database to survive restarts, and to be able to just recreate the machine any time we want.

2. allow multiple build servers

   - move the queue-lock into the database
   - allow multiple builders by using `SELECT FOR UPDATE SKIP LOCKED`, and wrapping the build in a transaction

Step 2 (allowing multiple build servers) should scale well enough for a low number of build servers (<50) and wouldn't need much change to the current logic.
This would even allow autoscaling build-servers; of course, the initial setup before the first build on a new server would take some time.
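To show what `SKIP LOCKED` buys for multiple builders, here is a small Python emulation of the semantics, with a per-row lock standing in for the database row lock. This is a sketch under stated assumptions, not the docs.rs queue: the `queue` table and its columns in `CLAIM_SQL` are hypothetical, and the in-memory `Queue` class only imitates what PostgreSQL does inside a transaction.

```python
import threading

# Roughly what the claim could look like in PostgreSQL
# (hypothetical table and column names):
CLAIM_SQL = """
SELECT id, name FROM queue
ORDER BY priority, id
LIMIT 1
FOR UPDATE SKIP LOCKED
"""


class QueueRow:
    def __init__(self, row_id, name):
        self.id = row_id
        self.name = name
        self.lock = threading.Lock()  # stands in for the row lock


class Queue:
    def __init__(self, names):
        self.rows = [QueueRow(i, n) for i, n in enumerate(names, start=1)]
        self.mutex = threading.Lock()  # protects the row list itself

    def claim(self):
        """Return the first row nobody holds, without waiting on held rows."""
        with self.mutex:
            for row in self.rows:
                # Non-blocking acquire: a held row is skipped, not waited on.
                if row.lock.acquire(blocking=False):
                    return row
        return None

    def commit(self, row):
        """Build done: delete the row, then release its lock."""
        with self.mutex:
            self.rows.remove(row)
        row.lock.release()

    def rollback(self, row):
        """Build aborted: release the lock so another builder can claim it."""
        row.lock.release()
```

Because the row stays locked for the whole build, a builder that dies mid-build behaves like a rolled-back transaction: the row becomes claimable again without any extra "assigned" state.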

pietroalbini · 2022-07-17T10:43:19Z

> We move the registry-watcher into a separate container / process that only exists once, saving the last seen ref additionally in the database to survive restarts, and to be able to just recreate the machine any time we want.

That's a great idea, that could be deployed to ECS.

syphar · 2022-07-24T10:45:43Z

~~In addition to the separate registry watcher we also have a separate job for the repository stats updater.~~ strike that, we already have cratesfyi database update-repository-fields which does exactly what the daemon triggers once an hour

I assume ECS has something that can run a command every hour? @pietroalbini

pietroalbini · 2022-07-27T21:38:04Z

I would create a single binary that includes the registry watcher and update-repository-fields all together tbh. ECS has something to run periodic tasks but if we can avoid more pieces of infra I'd be happier.
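A minimal sketch of what such a combined process could look like: one loop that polls the registry on every iteration and runs the repository stats update once an hour. The function names, the 60-second poll interval, and the injectable `clock`/`sleep` parameters are assumptions made for illustration (Python is used only to keep the sketch short).

```python
import time

STATS_INTERVAL = 60 * 60  # seconds; "once an hour" as mentioned in the thread


def run_watcher(poll_registry, update_repository_fields,
                clock=time.monotonic, sleep=time.sleep,
                poll_interval=60, max_iterations=None):
    """Poll the registry every iteration; refresh repository stats hourly.

    `poll_registry` and `update_repository_fields` are caller-supplied
    callables (hypothetical stand-ins for the real watcher and updater).
    `max_iterations=None` means run forever.
    """
    last_stats_run = None
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        poll_registry()
        now = clock()
        # Run the stats update on startup and then whenever an hour has passed.
        if last_stats_run is None or now - last_stats_run >= STATS_INTERVAL:
            update_repository_fields()
            last_stats_run = now
        iterations += 1
        sleep(poll_interval)
```

Keeping the schedule inside the process avoids a separate periodic-task definition in the infrastructure, at the cost of the stats update sharing a process (and its failures) with the watcher.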

syphar · 2022-07-28T06:07:52Z

> I would create a single binary that includes the registry watcher and update-repository-fields all together tbh. ECS has something to run periodic tasks but if we can avoid more pieces of infra I'd be happier.

that's possible, thank you for the input!

rylev · 2023-02-08T13:09:54Z

@syphar does #1785 close this issue?

Nemo157 · 2023-02-08T15:58:30Z

One additional thing that might make sense is to do a complete redesign of the web/build/watcher entrypoints now, before we move to the new infra. The current CLI interface doesn't entirely make sense with the separation. Probably by making sure we have good new entrypoints used by the new infra, then deleting the daemon entrypoint used in current production. (And maybe even separating into multiple binaries containing just the web/build/watcher/utility parts of the codebase instead of one combined binary.)

rylev · 2023-02-08T16:09:43Z

@Nemo157 there is already the start-build-server command for running the background server independent of the web server or registry watcher. While I think separating to different binaries could be useful I don't see how waiting on that to set up this infrastructure actually makes our lives easier.

Nemo157 · 2023-02-08T16:38:15Z

I don't think we should wait on it for setting up the infrastructure, it'd make more sense as a back-and-forth of "here's the independent services we are running on the new infrastructure, how can we improve the interface for them" and try to get that done before we switch over production. (I guess we're most of the way there with start-{{build,web}-server,registry-watcher}, I forgot that #1785 had pulled the registry-watcher+stats-updater out already too).

rylev · 2023-02-08T16:40:39Z

It probably makes sense to close this issue then as the split has successfully been done. Perhaps we can then create new issues for improving the interface further.

syphar · 2023-02-08T20:32:31Z

> @syphar does #1785 close this issue?

While the current design is not what was envisioned in the initial issue description, I would say we're nearly there.

The only piece missing IMO is the metrics endpoints.

jyn514 added the E-hard (Effort: This will require a lot of work) and S-needs-design (Status: There's a problem here, but no obvious solution; or the solution raises other questions) labels on Jun 27, 2020

syphar added the S-blocked label (Status: marked as blocked ❌ on something else such as an RFC or other implementation work) on Jan 9, 2022

syphar mentioned this issue Apr 21, 2022

pcan-basic-sys - documentation build failure #1728

Closed

syphar mentioned this issue Jul 3, 2022

Host --output-format json files in addition to generated HTML #1285

Open

syphar mentioned this issue Jul 30, 2022

move remaining state to db, allow multiple build servers #1785

Merged

syphar mentioned this issue Feb 11, 2023

split service&instance metrics, serve metrics for build-server & watcher #2038

Merged

syphar closed this as completed in #2038 Jun 4, 2023

syphar mentioned this issue Aug 3, 2023

some crates take long enough to make docs.rs look stuck #335

Closed
