Add readiness endpoints for all services #371

Anton-Kalpakchiev · 2024-11-04T18:10:24Z

Each service has a "health" endpoint that is used to check if the service is healthy, i.e. running. While this provides some information about the service's state, observability can be improved by implementing a "readiness" endpoint, which checks if the service is able to serve traffic.

In addition to a simple health check, the readiness endpoint verifies that all service dependencies are ready. For instance, the origin would check if the remote storage backends (e.g. S3) are reachable.

The endpoint has several uses:

- Provide a binary signal on a service's health during new version deployments - a failure from the endpoint could signal that a regression has been made, which could trigger a rollback.
- The agent's readiness endpoint signals whether Kraken is ready to serve images on a host. The endpoint can be called before scheduling a workload on the host.

As described in #371, all services should implement a "readiness" endpoint, which checks whether the service is ready to serve traffic, i.e. whether it can reach all of its dependencies. Thus, the readiness endpoints of the origin and the build index should check whether they can reach their remote backends. This diff adds a function to the backend manager that does exactly that: it iterates through the storage backends and runs a Stat call on each of them. I suggest making it configurable which backends the readiness endpoint checks (through the must_ready flag in the .yaml config), as some backends are more important for Kraken than others.
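The backend check described above could look roughly like the following. This is a minimal sketch, not the code from the diff: the `Backend` interface, the `Manager` type, `Register`, `CheckReadiness`, and the probe name are all hypothetical stand-ins, and the real Stat call in Kraken takes different arguments. Only the idea (Stat every backend flagged must_ready, skip the rest) comes from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// Backend is a hypothetical, trimmed-down storage backend interface.
// Only a Stat-like call is modeled, since that is all the check needs.
type Backend interface {
	Stat(name string) error
}

// managedBackend pairs a backend with its must_ready setting.
type managedBackend struct {
	name      string
	backend   Backend
	mustReady bool
}

// Manager is a hypothetical stand-in for the backend manager.
type Manager struct {
	backends []managedBackend
}

// Register adds a backend together with its must_ready flag.
func (m *Manager) Register(name string, b Backend, mustReady bool) {
	m.backends = append(m.backends, managedBackend{name, b, mustReady})
}

// CheckReadiness runs Stat against every backend marked must_ready and
// returns the first failure. Backends without the flag are skipped.
func (m *Manager) CheckReadiness() error {
	for _, mb := range m.backends {
		if !mb.mustReady {
			continue
		}
		if err := mb.backend.Stat("readiness-probe"); err != nil {
			return fmt.Errorf("backend %q not ready: %w", mb.name, err)
		}
	}
	return nil
}

// fakeBackend returns a fixed error (or nil) from Stat.
type fakeBackend struct{ err error }

func (f fakeBackend) Stat(string) error { return f.err }

var errUnreachable = errors.New("unreachable")

func main() {
	m := &Manager{}
	m.Register("s3", fakeBackend{}, true)
	// An unreachable backend that is not must_ready does not fail the check.
	m.Register("optional", fakeBackend{err: errUnreachable}, false)
	fmt.Println("ready:", m.CheckReadiness() == nil)
}
```

Skipping backends without the flag keeps a flaky, non-critical backend from marking the whole service as not ready.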

As described in #371, all services should implement a "readiness" endpoint, which checks whether the service is ready to serve traffic, i.e. whether it can reach all of its dependencies. Add readiness endpoint for origin, which checks whether all its backends are reachable.
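As an illustration of such an endpoint, the sketch below wraps a readiness check in an HTTP handler. The `/readiness` path, the `readinessChecker` interface, the `probe` helper, and the choice of 503 for failures are assumptions made for this example, not details confirmed by the PR.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// readinessChecker is a hypothetical interface over whatever verifies
// that the backends are reachable.
type readinessChecker interface {
	CheckReadiness() error
}

// checkerFunc adapts a plain function to readinessChecker.
type checkerFunc func() error

func (f checkerFunc) CheckReadiness() error { return f() }

// readinessHandler replies 200 when the check passes and 503 otherwise.
func readinessHandler(c readinessChecker) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if err := c.CheckReadiness(); err != nil {
			http.Error(w, fmt.Sprintf("not ready: %v", err), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "OK")
	}
}

// probe sends one in-memory request to the handler and returns the status code.
func probe(c readinessChecker) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/readiness", nil)
	readinessHandler(c).ServeHTTP(rec, req)
	return rec.Code
}

var errBackendDown = errors.New("backend unreachable")

func main() {
	fmt.Println(probe(checkerFunc(func() error { return nil })))
	fmt.Println(probe(checkerFunc(func() error { return errBackendDown })))
}
```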

As described in #371, all services should implement a "readiness" endpoint, which checks whether the service is ready to serve traffic, i.e. whether it can reach all of its dependencies.

- Add a client for the origin's readiness endpoint.
- Additionally, add a ClusterClient (a client which abstracts which origin from the cluster is called).
- Rework the readiness test to use the client instead of directly calling the endpoint. Also, make each test case run separately (by using t.Run instead of just looping) to follow idiomatic Go.
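One way a cluster-level client can hide which origin is called is sketched below. The `Client` interface, the `ClusterClient` fields, and in particular the policy of trying origins in order until one reports ready are assumptions for illustration; the PR does not state how the real ClusterClient selects an origin.

```go
package main

import (
	"errors"
	"fmt"
)

// Client is a hypothetical single-origin readiness client.
type Client interface {
	Addr() string
	CheckReadiness() error
}

// ClusterClient hides which origin in the cluster is contacted.
type ClusterClient struct {
	origins []Client
}

// CheckReadiness tries the origins in order and succeeds as soon as one of
// them reports ready. If none does, the last error is returned.
func (c *ClusterClient) CheckReadiness() error {
	if len(c.origins) == 0 {
		return errors.New("no origins configured")
	}
	var lastErr error
	for _, o := range c.origins {
		err := o.CheckReadiness()
		if err == nil {
			return nil
		}
		lastErr = fmt.Errorf("origin %s: %w", o.Addr(), err)
	}
	return lastErr
}

// fakeOrigin is a test double with a fixed readiness result.
type fakeOrigin struct {
	addr string
	err  error
}

func (f fakeOrigin) Addr() string          { return f.addr }
func (f fakeOrigin) CheckReadiness() error { return f.err }

var errOriginDown = errors.New("not ready")

func main() {
	c := &ClusterClient{origins: []Client{
		fakeOrigin{"origin-1", errOriginDown},
		fakeOrigin{"origin-2", nil},
	}}
	fmt.Println("cluster ready:", c.CheckReadiness() == nil)
}
```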

As described in #371, all services should implement a "readiness" endpoint, which checks whether the service is ready to serve traffic, i.e. whether it can reach all of its dependencies.

- Add a readiness endpoint for the build index, which checks that all of its backends are reachable and that the origin is ready.
- Add a client for the build index's readiness endpoint. The client will be used by the agent: the agent's readiness endpoint will check the build index's readiness, as the build index is a dependency of the agent.

As described in #371, all services should implement a "readiness" endpoint, which checks whether the service is ready to serve traffic, i.e. whether it can reach all of its dependencies. The tracker makes requests to the origin cluster to get metainfo data for torrents. Therefore, its readiness endpoint should check the origin's readiness. Additionally, the tracker's readiness endpoint must be queried by the agent, thus a client for the endpoint must be added.

- Add readiness endpoint for tracker, which checks origin's readiness.
- Add client for the endpoint.

As described in #371, all services should implement a "readiness" endpoint, which checks whether the service is ready to serve traffic, i.e. whether it can reach all of its dependencies.

- Add a readiness endpoint for the agent. It calls the readiness endpoints of both the build index and the tracker, which in turn call the origin's readiness endpoint. A successful response is a strong signal that the agent on a host is ready to serve images.
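The resulting dependency chain (agent → build index and tracker → origin) can be modeled with plain functions. The sketch below is illustrative only: the `check` type and the `all` and `buildAgentCheck` helpers are hypothetical, and in the real services each arrow is an HTTP call through the readiness clients rather than a direct function call.

```go
package main

import (
	"errors"
	"fmt"
)

// check is a hypothetical readiness probe: nil means ready.
type check func() error

// all builds a probe that passes only if every dependency passes,
// prefixing any failure with the name of the service that observed it.
func all(name string, deps ...check) check {
	return func() error {
		for _, d := range deps {
			if err := d(); err != nil {
				return fmt.Errorf("%s: %w", name, err)
			}
		}
		return nil
	}
}

// buildAgentCheck wires the chain described above: the build index checks
// its backends and the origin, the tracker checks the origin, and the
// agent checks the build index and the tracker.
func buildAgentCheck(backends, origin check) check {
	buildIndex := all("build-index", backends, origin)
	tracker := all("tracker", origin)
	return all("agent", buildIndex, tracker)
}

var errOriginDown = errors.New("origin unreachable")

func main() {
	originUp := false
	origin := func() error {
		if originUp {
			return nil
		}
		return errOriginDown
	}
	backends := func() error { return nil }

	agent := buildAgentCheck(backends, origin)
	fmt.Println(agent()) // origin failure surfaces through the agent
	originUp = true
	fmt.Println(agent())
}
```

Because every layer wraps the error it receives, a failed agent check names the dependency path that broke, which helps when the endpoint is used as a deployment signal.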

As described in #371, the agent has a readiness endpoint, which checks its dependencies' readiness. This endpoint is already implemented. However, depending on usage, the endpoint might be called very frequently (for example, once per second), which would result in many redundant checks being performed. To address this, we are adding the ability to cache readiness success for a certain amount of time. If the endpoint is queried and succeeds, it will continue reporting success without making any readiness checks until the readiness cache TTL expires. The TTL is configurable through the .yaml config. Failures are not cached.
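The caching behavior can be sketched as follows. This is a minimal illustration under stated assumptions, not the merged implementation: `cachedReadiness`, `manualClock`, and `demoTTL` are hypothetical names, and the clock is injected only so the example is deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// demoTTL stands in for the TTL that would come from the .yaml config.
const demoTTL = 5 * time.Second

// manualClock is a hand-advanced clock that keeps the example deterministic.
type manualClock struct{ t time.Time }

func newManualClock() *manualClock              { return &manualClock{t: time.Unix(0, 0)} }
func (m *manualClock) Now() time.Time          { return m.t }
func (m *manualClock) Advance(d time.Duration) { m.t = m.t.Add(d) }

// cachedReadiness remembers the time of the last successful check.
type cachedReadiness struct {
	mu          sync.Mutex
	ttl         time.Duration
	now         func() time.Time
	check       func() error
	hasSuccess  bool
	lastSuccess time.Time
}

func newCachedReadiness(ttl time.Duration, now func() time.Time, check func() error) *cachedReadiness {
	return &cachedReadiness{ttl: ttl, now: now, check: check}
}

// CheckReadiness reports success without running the underlying check while
// the last success is younger than the TTL. Failures are never cached, so
// the next call after a failure always runs the check again.
// The lock is held during the check, so concurrent callers are serialized.
func (c *cachedReadiness) CheckReadiness() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.hasSuccess && c.now().Sub(c.lastSuccess) < c.ttl {
		return nil
	}
	if err := c.check(); err != nil {
		return err
	}
	c.hasSuccess = true
	c.lastSuccess = c.now()
	return nil
}

var errDepDown = errors.New("dependency not ready")

func main() {
	clock := newManualClock()
	calls := 0
	c := newCachedReadiness(demoTTL, clock.Now, func() error {
		calls++
		return nil
	})
	c.CheckReadiness()     // runs the underlying check
	c.CheckReadiness()     // served from the cache
	clock.Advance(demoTTL) // the cached success expires
	c.CheckReadiness()     // runs the underlying check again
	fmt.Println("underlying checks:", calls)
}
```

One consequence of caching only successes is that a dependency failing inside the TTL window goes unnoticed until the cached success expires, so the TTL bounds how stale a "ready" answer can be.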

Anton-Kalpakchiev self-assigned this Nov 4, 2024

This was referenced Nov 5, 2024

Add ability to check backend readiness #372

Merged

Add origin readiness endpoint #373

Merged

Add origin readiness client #374

Merged

This was referenced Nov 13, 2024

Add build index readiness endpoint and client #377

Merged

Add tracker health check client #378

Closed

This was referenced Nov 14, 2024

Add tracker readiness endpoint and client #380

Merged

Add agent readiness endpoint #381

Merged

Anton-Kalpakchiev mentioned this issue Nov 21, 2024

Enable caching of agent readiness success #384

Merged
