fix: docs (#667)
MasterPtato committed Apr 17, 2024
1 parent 3bf5a1f commit c5b33fa
Showing 3 changed files with 86 additions and 163 deletions.
138 changes: 0 additions & 138 deletions docs/packages/cluster/AUTOSCALING.md

This file was deleted.

76 changes: 68 additions & 8 deletions docs/packages/cluster/SERVER_PROVISIONING.md
# Automatic Server Provisioning

Server provisioning handles everything required to get servers from supported cloud providers running and installed with all of the software needed to run Rivet edge functionality. Server provisioning occurs in the `cluster` package, and server counts are automatically brought up and down to desired levels via `cluster-datacenter-scale`.

Server provisioning is declarative, meaning it is configured based on the state you want the [cluster](#cluster) (all [datacenters](#datacenter) and servers within datacenters) to be in. The operations required to get the current cluster state to match the desired state are handled automatically.
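As a rough sketch (these types and names are hypothetical, not the actual `cluster` package definitions), the desired state might be modeled like this:

```rust
use uuid::Uuid;

// Hypothetical shapes for the declarative desired state; the real
// definitions live in the `cluster` package and will differ.
pub enum PoolType {
    Ats, // Apache Traffic Server cache
    Job, // Nomad clients that run game lobbies
    Gg,  // Game Guard proxies
}

pub struct Pool {
    pub pool_type: PoolType,
    /// `cluster-datacenter-scale` converges the live server count toward this.
    pub desired_count: u32,
}

pub struct Datacenter {
    pub datacenter_id: Uuid,
    pub pools: Vec<Pool>,
}

/// The cluster: all datacenters, each with its pools of servers.
pub struct Cluster {
    pub datacenters: Vec<Datacenter>,
}
```

The provisioning system's job is then to diff this desired state against what actually exists at the cloud provider and close the gap.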

## Motivation

Server provisioning was created to allow for quick and stateful configuration of the game server topology on Rivet. This system was also written with the intention to allow clients to choose their own hardware options and server providers.

On Rivet Enterprise, an autoscaling system is hooked into the provisioning system so that the cluster scales up to meet spikes in demand and scales down when load decreases to save on costs.

## Basic structure

There are currently three types of servers, known as [pools](#pool), that work together to host game lobbies:

- ### ATS

ATS servers host game images via Apache Traffic Server. The caching provided by ATS, along with the ATS node being in the same datacenter as the [Job](#job) node, allows for very quick lobby start times.

- ### Job

Job servers run Nomad which handles the orchestration of the game lobbies themselves.

- ### GG

Game Guard nodes serve as a proxy for all incoming game connections and provide DDoS protection.

## Provisioning process (upscaling)

If `cluster-datacenter-scale` determines that there are fewer servers in a [pool](#pool) than the desired count, it will provision new servers or [undrain](#drainundrain) currently draining servers (see the sketch after this list).

- ### Creating a new server

1. Before a new server is provisioned, the system checks whether a [prebaked](#prebaking) image for the given [pool](#pool) already exists. If it does, the prebake image is copied to the newly created disk and no install procedure is required. If it does not, one is created on a separate prebake server; meanwhile, the server being created is SSH'd into and runs install scripts customized for the [pool](#pool) it is assigned to.

- ### [Prebaking](#prebaking)

The process for prebaking a server image is the same as installing, but without initialization. A new server is created and installed with the software required by the [pool](#pool) type, but none of the software is turned on. The server is then shut down, an image of the disk is created, the image ID is written to the database, and the server is deleted.

- ### [Undraining](#drainundrain)

A server that is currently draining (usually from [downscaling](#drainingdestroying-process-downscaling)) can be undrained to get it back to its normal state. This is preferred over creating a new server because it is much faster.
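A rough sketch of the upscaling decision above (the function and type names are made up; the real logic lives in `cluster-datacenter-scale` and the provisioning workers):

```rust
#[derive(Debug)]
enum UpscaleAction {
    Undrain { server_id: u32 },
    ProvisionFromPrebake,
    ProvisionAndInstall,
}

// Hypothetical planner: prefer undraining over provisioning, and prefer a
// prebaked image over a fresh install over SSH.
fn plan_upscale(active: usize, draining: &[u32], desired: usize, has_prebake: bool) -> Vec<UpscaleAction> {
    let mut plan = Vec::new();
    let mut missing = desired.saturating_sub(active);

    // 1. Undrain draining servers first; much faster than creating new ones.
    for &server_id in draining.iter().take(missing) {
        plan.push(UpscaleAction::Undrain { server_id });
        missing -= 1;
    }

    // 2. Provision the remainder, copying a prebaked image when one exists.
    for _ in 0..missing {
        plan.push(if has_prebake {
            UpscaleAction::ProvisionFromPrebake
        } else {
            UpscaleAction::ProvisionAndInstall
        });
    }
    plan
}
```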

## Draining/destroying process (downscaling)

If `cluster-datacenter-scale` determines that there are more servers in a [pool](#pool) than the desired count, it will delete or [drain](#drainundrain) servers (see the sketch after this list).

- ### Deleting servers

Servers are deleted by destroying all related resources such as DNS records, SSH keys, and firewalls before finally deleting the server itself via the cloud provider's API.

- ### [Draining](#drainundrain)

A server is drained to allow it to finish pending operations or allow game lobbies to close gracefully before it is destroyed. In this state, it can be undrained.
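A corresponding sketch of the downscaling choice (again with hypothetical names):

```rust
#[derive(Debug)]
enum DownscaleAction {
    Drain { server_id: u32 },
    Delete { server_id: u32 },
}

// Hypothetical planner: excess servers are drained rather than deleted
// outright so pending operations and lobbies can finish; servers that have
// finished draining are torn down (DNS records, SSH keys, firewalls, then
// the server itself via the provider's API).
fn plan_downscale(active: &[u32], fully_drained: &[u32], desired: usize) -> Vec<DownscaleAction> {
    let mut plan: Vec<DownscaleAction> = fully_drained
        .iter()
        .map(|&server_id| DownscaleAction::Delete { server_id })
        .collect();

    let excess = active.len().saturating_sub(desired);
    plan.extend(
        active
            .iter()
            .take(excess)
            .map(|&server_id| DownscaleAction::Drain { server_id }),
    );
    plan
}
```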

## Tainting

A [datacenter](#datacenter) can be tainted to allow for a rolling deploy of changes to the underlying software configuration in the install scripts. When a datacenter is tainted, all of its servers are marked as tainted and the same number of new servers is deployed. Tainted servers do not differ in functionality from normal servers; however, as the new servers come online, the tainted servers are [drained](#drainundrain) until all of them are drained or deleted.
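A minimal sketch of the rolling-replace step (a hypothetical helper; the real coordination lives in the `cluster` workers). Capacity stays level because a tainted server is only drained once a replacement has come online:

```rust
// Drain at most one tainted server per replacement that has come online.
// Returns the IDs of the tainted servers to start draining now.
fn drain_ready_tainted(tainted: &mut Vec<u32>, replacements_online: usize) -> Vec<u32> {
    let n = replacements_online.min(tainted.len());
    tainted.drain(..n).collect()
}
```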

## Why are servers in the same [availability zone](#availability-zone)

Servers are placed in the same [AZ](#availability-zone) for two reasons:

1. ### VLAN + Network Constraints

Servers rely on VLAN to communicate with each other.

2. ### Latency

Having all of the required components to run a [Job](#job) server on the edge (i.e. in the same [datacenter](#datacenter)) allows for very quick lobby start times.

## Prior art

- https://console.aiven.io/project/rivet-3143/new-service?serviceType=pg
- https://karpenter.sh/docs/concepts/nodepools/
- Nomad autoscaler

## Terminology

- #### Cluster

A collection of datacenters.

- #### Datacenter

A collection of servers in the same availability zone of a cloud server provider.

- #### Pool

A pool is a collection of servers with the same purpose. Read more [here](#basic-structure).

- #### Availability zone

Also known as region or datacenter.

- #### Drain/Undrain

When a server is drained, it is put in a state in which it can complete all remaining operations before being deleted. When a draining server is undrained, it is set back to a state of normal function.

- #### Prebaking

Prebaking refers to the process of installing a variation of the required software for the given [pool](#pool) on a prebake server to create a prebake image. It must be a variation because the prebake image cannot know in advance which server it will be copied to. It can be thought of as a template.
35 changes: 18 additions & 17 deletions docs/packages/cluster/TLS_AND_DNS.md
# DNS & TLS Configuration

## Moving parts

### TLS Cert

- Can only have 1 wildcard
- i.e. `*.lobby.{dc_id}.rivet.run`
- Takes a long time to issue
- Prone to Let's Encrypt downtime and [rate limits](https://letsencrypt.org/docs/rate-limits/)
- Nathan requested a rate limit increase for when this is needed

### DNS record

- Must point to the IP of the datacenter we need
- i.e. `*.lobby.{dc_id}.rivet.run` goes to the GG Node for the given datacenter
- `*.rivet.run` will not work as a static DNS record because you can’t point it at a single datacenter

### GG host resolution

- When a request hits the GG server for HTTP(S) or TCP+TLS requests, we need to be able to resolve the lobby to send it to
- This is why the lobby ID needs to be in the DNS name
- Uses hostname to route to a specific lobby: `{lobby_id}-{port}.lobby.{dc_id}.rivet.run`
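A hand-rolled illustration of the hostname parsing this implies (not the actual GG implementation; error handling is simplified to `Option`):

```rust
/// Parses `{lobby_id}-{port}.lobby.{dc_id}.rivet.run` into
/// `(lobby_id, port, dc_id)`.
fn parse_lobby_host(host: &str) -> Option<(String, u16, String)> {
    let mut labels = host.split('.');
    let lobby_label = labels.next()?; // "{lobby_id}-{port}"
    if labels.next()? != "lobby" {
        return None;
    }
    let dc_id = labels.next()?.to_string();
    if labels.next()? != "rivet" || labels.next()? != "run" || labels.next().is_some() {
        return None;
    }

    // Lobby IDs are UUIDs and themselves contain `-`, so split on the
    // last hyphen to separate the port.
    let (lobby_id, port) = lobby_label.rsplit_once('-')?;
    Some((lobby_id.to_string(), port.parse().ok()?, dc_id))
}
```

GG can then look up the lobby by ID and forward the connection to the right Job server and port.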

### GG autoscaling

- The IPs that the DNS records point to change frequently as GG nodes scale up and down

## Design

### DNS records

[Source](../../../svc/pkg/cluster/worker/src/workers/server_dns_create.rs)

Dynamically create a DNS record for each GG node formatted like `*.lobby.{dc_id}.rivet.run`. Example:

```
A *.lobby.51f3d45e-693f-4470-b86d-66980edd87ec.rivet.run 9.10.11.12 # DC bar, GG node 2
```

The IPs of these records change as the GG nodes scale up and down, but the origin stays the same.
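As a small illustration, the record name (which is also the domain the datacenter's TLS cert must cover) is derived from the datacenter ID alone. The helper name here is made up; the actual record creation lives in `server_dns_create.rs`.

```rust
// Hypothetical helper: builds the wildcard domain for a datacenter's
// GG nodes, e.g. `*.lobby.51f3d45e-693f-4470-b86d-66980edd87ec.rivet.run`.
fn lobby_wildcard_domain(dc_id: &str) -> String {
    format!("*.lobby.{}.rivet.run", dc_id)
}
```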

### TLS certs

[Source](../../../svc/pkg/cluster/worker/src/workers/datacenter_tls_issue.rs)

Each datacenter needs its own TLS cert. For the example above, we need a TLS cert for `*.lobby.51f3d45e-693f-4470-b86d-66980edd87ec.rivet.run`, and likewise one for each other datacenter's wildcard domain.

## TLS

### TLS cert provider

Currently, we use Let's Encrypt as our TLS certificate provider.

Alternatives:

- ZeroSSL
  - Higher rate limits, better cert issuing

### TLS cert refreshing

Right now, the TLS certs are issued in the Terraform plan. Eventually, TLS certs should renew on their own automatically.

## TLS Alternative

### Use `*.rivet.run` TLS cert with custom DNS server

Create an `NS` record for `*.rivet.run` pointed at our custom DNS server and use a single static TLS cert. We did not go through with this because running your own DNS server adds security risk and complexity.
