Upgrade to Ubuntu 22.04, and custom 6.1.0-rc5 kernel
Change-Id: I7f49901478ed0085710e7df574d3b24132cd4b42
benjamin-maynard committed Nov 18, 2022
1 parent 2e9a362 commit f39409b
Showing 29 changed files with 128 additions and 1,206 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -19,6 +19,9 @@
*.pkrvars.json
*.auto.pkrvars.json

# Exclude Specific Packer Variables File
!image/internal-build/image.pkrvars.hcl

# Service Account
service-account-key.json

69 changes: 2 additions & 67 deletions deployment/README.md
@@ -5,7 +5,7 @@ This directory contains a [Terraform Module](https://www.terraform.io/docs/modul
The `main` branch may be updated at any time with the latest changes, which could be breaking. You should always configure your module to use a release. This can be configured in the module's Terraform configuration block.

```
source = "github.com/GoogleCloudPlatform/knfsd-cache-utils//deployment/terraform-module-knfsd?ref=v0.10.0"
source = "github.com/GoogleCloudPlatform/knfsd-cache-utils//deployment/terraform-module-knfsd?ref=v1.0.0-beta1"
```

## Prerequisites
@@ -28,7 +28,7 @@ Basic usage of this module is as follows:
```terraform
module "nfs_proxy" {
source = "github.com/GoogleCloudPlatform/knfsd-cache-utils//deployment/terraform-module-knfsd?ref=v0.10.0"
source = "github.com/GoogleCloudPlatform/knfsd-cache-utils//deployment/terraform-module-knfsd?ref=v1.0.0-beta1"
# Google Cloud Project Configuration
PROJECT = "my-gcp-project"
@@ -182,26 +182,6 @@ The service account will need the following project level IAM permissions:
| LOCAL_SSDS | (Only used if `CACHEFILESD_DISK_TYPE` = `local-ssd`) The number of Local SSDs to assign to each cache instance. This can be 0 to 8, 16, or 24 Local SSDs, for up to 9 TB of capacity ([see here](https://cloud.google.com/compute/docs/disks/local-ssd#choosing_a_valid_number_of_local_ssds)). If you are setting this to 24 Local SSDs you should also change the `MACHINE_TYPE` variable to an instance with 32 CPUs, for example `n1-highmem-32`. | False | `4` |
| CACHEFILESD_PERSISTENT_DISK_SIZE_GB | (Only used if `CACHEFILESD_DISK_TYPE` = `pd-standard`, `pd-balanced` or `pd-ssd`), what size should the persistent disk be in GB? Can be set between `100` and `64000`. For large volumes, consider larger instance types (see [here](https://cloud.google.com/compute/docs/disks/performance)). | False | `1500` |

### Culling Options

| Variable | Description | Required | Default |
| -------------------- | ------------------------------------------------------------------------------------------------------------ | -------- | ------------------------------------------------------------------------------------------ |
| CULLING | Culling method to use (`default`, `custom`, or `none`). | False | `custom` |
| CULLING_LAST_ACCESS | (custom only) Remove files from the cache that were last accessed over `CULLING_LAST_ACCESS` ago. | False | If using Local SSD: 1 hour for each `LOCAL_SSD`.<br><br>If using Persistent Disk, 6 hours. |
| CULLING_THRESHOLD | (custom only) Cull when the percentage of remaining disk space (or inodes) is less than `CULLING_THRESHOLD`. | False | `20` |
| CULLING_INTERVAL | (custom only) How often to check if the remaining disk space is less than the `CULLING_THRESHOLD` | False | `1m` |
| CULLING_QUIET_PERIOD | (custom only) After culling, how long to wait before resuming checks. | False | `CULLING_LAST_ACCESS / 4` |

The `CULLING` variable supports the following options:

* `default` - This uses the standard cachefilesd to perform culling.
* `custom` - This uses the custom knfsd-cull service to perform culling.
* `none` - Disables culling completely.

The purpose of the custom culling agent is to work around a known issue where cachefilesd may stop culling files in the cache. See [culling](../docs/culling.md) for more information on how to configure cachefilesd, and the known issue where cachefilesd may stop culling.

The `none` option supports special cases where the cache rarely fills up.

### Mount Options

These mount options are for the proxy to the source server.
@@ -232,51 +212,6 @@ These mount options are for the proxy to the source server.
| NOHIDE | When `true`, adds the `nohide` option to all the exports. | False | `true` |
| EXPORT_OPTIONS | Any custom NFS exports options. These options will be applied to all NFS exports. | False | `""` |

#### Custom culling options

The durations `CULLING_LAST_ACCESS`, `CULLING_INTERVAL`, and
`CULLING_QUIET_PERIOD` support `h`, `m`, and `s` (hours, minutes, seconds),
for example `5m`, `2.5h`, or `1h30m`.

To avoid deleting files unnecessarily the culling process will wait until the
remaining percentage of free space is less than `CULLING_THRESHOLD`. The
remaining free space will be checked every `CULLING_INTERVAL`.

Any file with a last access time older than `CULLING_LAST_ACCESS` will be
deleted. Because files are deleted based on their last access, this might not
remove enough files (or any files) to bring the free space above the threshold
if most of the files in the cache have been used within the last access period.
It can also remove more files than required (possibly even all the files).

Once a culling attempt has been completed (even if no files were culled),
culling will wait for `CULLING_QUIET_PERIOD` before resuming culling checks.
This avoids repeatedly scanning the full file tree (costing IOPS) while most
files are in use.

**IMPORTANT:** `CULLING_THRESHOLD` *MUST* be greater than `bstop` and `fstop` in
`/etc/cachefilesd.conf`. Otherwise cachefilesd will stop caching data before the
custom culling threshold is reached so culling will never run.

#### Example

```terraform
CULLING = "custom"
CULLING_LAST_ACCESS = "4h"
CULLING_THRESHOLD = 20
CULLING_INTERVAL = "1m"
CULLING_QUIET_PERIOD = "1h"
```

The culling agent will check every minute (`CULLING_INTERVAL`) to see if the
remaining disk space (or inodes) is less than 20% (`CULLING_THRESHOLD`) of the
total disk space.

When the remaining disk space is below the 20% threshold the culling agent will
then remove any files that were last accessed over four hours ago (`CULLING_LAST_ACCESS`)
from the cache.

The culling agent will then wait for at least one hour (`CULLING_QUIET_PERIOD`)
before resuming culling checks.
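
As a rough illustration of the behaviour described above, the following Bash sketch models the decisions the culling agent makes. It is not taken from the knfsd-cull source: the function names (`duration_to_seconds`, `should_cull`, `is_expired`) are hypothetical, and the duration parser is an assumption that only accepts whole-number `h`/`m`/`s` components (the same shape as the pattern used to validate the Terraform variables), so a fractional value such as `2.5h` is rejected here.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the custom culling decision; not the knfsd-cull source.

# Convert a duration such as "4h", "1m" or "1h30m" to seconds.
# Assumption: only whole-number h/m/s components are handled.
duration_to_seconds() {
  local d="$1"
  local re='^([0-9]+h)?([0-9]+m)?([0-9]+s)?$'
  [[ -n "$d" ]] || return 1
  [[ "$d" =~ $re ]] || return 1
  local h="${BASH_REMATCH[1]%h}" m="${BASH_REMATCH[2]%m}" s="${BASH_REMATCH[3]%s}"
  # 10# forces base 10 so values like "08" are not read as octal.
  echo $(( 10#${h:-0} * 3600 + 10#${m:-0} * 60 + 10#${s:-0} ))
}

# Succeeds when the remaining free percentage is below the threshold.
should_cull() {
  local free_percent="$1" threshold="$2"
  (( free_percent < threshold ))
}

# Succeeds when a file last accessed at "atime" (epoch seconds) is older
# than "last_access" seconds at time "now".
is_expired() {
  local atime="$1" now="$2" last_access="$3"
  (( now - atime > last_access ))
}
```

With the example values, `duration_to_seconds 4h` prints `14400`, `should_cull 15 20` succeeds because 15% free is below the 20% threshold, and a file would be removed once `is_expired` succeeds for its last access time.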

### Autoscaling Configuration

4 changes: 2 additions & 2 deletions deployment/fanout.md
@@ -37,7 +37,7 @@ There is no special logic in the Knfsd Terraform Module to handle the fanout arc
```terraform
module "nfs_proxy_fanout" {
source = "github.com/GoogleCloudPlatform/knfsd-cache-utils//deployment/terraform-module-knfsd?ref=v0.10.0"
source = "github.com/GoogleCloudPlatform/knfsd-cache-utils//deployment/terraform-module-knfsd?ref=v1.0.0-beta1"
# Google Cloud Project Configuration
PROJECT = "my-gcp-project"
@@ -67,7 +67,7 @@ module "nfs_proxy_fanout" {
module "nfs_proxy_cluster" {
source = "github.com/GoogleCloudPlatform/knfsd-cache-utils//deployment/terraform-module-knfsd?ref=v0.10.0"
source = "github.com/GoogleCloudPlatform/knfsd-cache-utils//deployment/terraform-module-knfsd?ref=v1.0.0-beta1"
# Google Cloud Project Configuration
PROJECT = "my-gcp-project"
4 changes: 2 additions & 2 deletions deployment/metrics.md
@@ -59,7 +59,7 @@ Providing the metrics config from a file:

```terraform
module "nfs_proxy" {
source = "github.com/GoogleCloudPlatform/knfsd-cache-utils//deployment/terraform-module-knfsd?ref=v0.10.0"
source = "github.com/GoogleCloudPlatform/knfsd-cache-utils//deployment/terraform-module-knfsd?ref=v1.0.0-beta1"
METRICS_AGENT_CONFIG = file("metrics-config.yaml")
}
@@ -69,7 +69,7 @@ Providing the metrics config inline using heredoc syntax:

```terraform
module "nfs_proxy" {
source = "github.com/GoogleCloudPlatform/knfsd-cache-utils//deployment/terraform-module-knfsd?ref=v0.10.0"
source = "github.com/GoogleCloudPlatform/knfsd-cache-utils//deployment/terraform-module-knfsd?ref=v1.0.0-beta1"
METRICS_AGENT_CONFIG = <<-EOT
receivers:
7 changes: 0 additions & 7 deletions deployment/terraform-module-knfsd/compute.tf
@@ -108,13 +108,6 @@ resource "google_compute_instance_template" "nfsproxy-template" {
EXPORT_OPTIONS = var.EXPORT_OPTIONS
NFS_MOUNT_VERSION = var.NFS_MOUNT_VERSION

CULLING = var.CULLING

CULLING_LAST_ACCESS = coalesce(var.CULLING_LAST_ACCESS, local.CULLING_LAST_ACCESS_DEFAULT)
CULLING_THRESHOLD = var.CULLING_THRESHOLD
CULLING_INTERVAL = var.CULLING_INTERVAL
CULLING_QUIET_PERIOD = var.CULLING_QUIET_PERIOD

# system
NFS_KERNEL_SERVER_CONF = file("${path.module}/resources/nfs-kernel-server.conf")
NUM_NFS_THREADS = var.NUM_NFS_THREADS
35 changes: 0 additions & 35 deletions deployment/terraform-module-knfsd/resources/proxy-startup.sh
@@ -216,12 +216,6 @@ function init() {
DISABLED_NFS_VERSIONS=$(get_attribute DISABLED_NFS_VERSIONS)
READ_AHEAD_KB=$(get_attribute READ_AHEAD_KB)

CULLING="$(get_attribute CULLING)"
CULLING_LAST_ACCESS="$(get_attribute CULLING_LAST_ACCESS)"
CULLING_THRESHOLD="$(get_attribute CULLING_THRESHOLD)"
CULLING_INTERVAL="$(get_attribute CULLING_INTERVAL)"
CULLING_QUIET_PERIOD="$(get_attribute CULLING_QUIET_PERIOD)"

ENABLE_STACKDRIVER_METRICS=$(get_attribute ENABLE_STACKDRIVER_METRICS)
METRICS_AGENT_CONFIG=$(get_attribute METRICS_AGENT_CONFIG)
ENABLE_KNFSD_AGENT=$(get_attribute ENABLE_KNFSD_AGENT)
@@ -445,34 +439,6 @@ function configure-nfs() {

}

function configure-culling() (
function fmt() {
if [[ -n "$2" ]]; then
printf '%s %s\n' "$1" "$2"
fi
}

sed -i '/^nocull/d' /etc/cachefilesd.conf

if [[ "$CULLING" == "none" ]] || [[ "$CULLING" == "custom" ]]; then
echo "nocull" >>/etc/cachefilesd.conf
fi

if [[ "$CULLING" == "custom" ]]; then
: >/etc/knfsd-cull.conf
fmt last-access "$CULLING_LAST_ACCESS" >>/etc/knfsd-cull.conf
fmt threshold "$CULLING_THRESHOLD" >>/etc/knfsd-cull.conf
fmt interval "$CULLING_INTERVAL" >>/etc/knfsd-cull.conf
fmt quiet-period "$CULLING_QUIET_PERIOD" >>/etc/knfsd-cull.conf

echo "Starting Custom Culling Agent..."
start-services knfsd-cull
echo "Finished starting Custom Culling Agent."
else
echo "Custom Culling Agent disabled. Skipping..."
fi
)

function configure-metrics() {

# If needed, override the Monitoring API to use an IP address from private.googleapis.com
@@ -552,7 +518,6 @@ function main() {

configure-read-ahead
configure-nfs
configure-culling
configure-metrics

start-nfs

This file was deleted.

This file was deleted.

This file was deleted.

53 changes: 0 additions & 53 deletions deployment/terraform-module-knfsd/variables.tf
@@ -331,59 +331,6 @@ variable "NETAPP_ALLOW_COMMON_NAME" {
default = false
}

variable "CULLING" {
type = string
default = "default"

validation {
condition = contains(["default", "custom", "none"], var.CULLING)
error_message = "CULLING must be one of [default, custom, none]."
}
}

variable "CULLING_LAST_ACCESS" {
type = string
default = "" # calculate default based on LOCAL_SSDS

validation {
# This will also match the empty string, but in this case the empty string is allowed.
condition = can(regex("^(\\d+h)?(\\d+m)?(\\d+s)?$", var.CULLING_LAST_ACCESS))
error_message = "CULLING_LAST_ACCESS must be a positive duration, e.g. '1h'. Valid time units are 'h' (hours), 'm' (minutes), 's' (seconds)."
}
}

variable "CULLING_THRESHOLD" {
type = number
default = 20

validation {
condition = var.CULLING_THRESHOLD == null || (var.CULLING_THRESHOLD >= 0 && var.CULLING_THRESHOLD <= 100)
error_message = "CULLING_THRESHOLD must be between 0% and 100%."
}
}

variable "CULLING_INTERVAL" {
type = string
default = "1m"

validation {
# This will also match the empty string, but in this case the empty string is allowed.
condition = can(regex("^(\\d+h)?(\\d+m)?(\\d+s)?$", var.CULLING_INTERVAL))
error_message = "CULLING_INTERVAL must be a positive duration, e.g. '1h'. Valid time units are 'h' (hours), 'm' (minutes), 's' (seconds)."
}
}

variable "CULLING_QUIET_PERIOD" {
type = string
default = ""

validation {
# This will also match the empty string, but in this case the empty string is allowed.
condition = can(regex("^(\\d+h)?(\\d+m)?(\\d+s)?$", var.CULLING_QUIET_PERIOD))
error_message = "CULLING_QUIET_PERIOD must be a positive duration, e.g. '1h'. Valid time units are 'h' (hours), 'm' (minutes), 's' (seconds)."
}
}

variable "CACHEFILESD_DISK_TYPE" {
type = string
default = "local-ssd"
17 changes: 16 additions & 1 deletion docs/changes/changelog.md
@@ -1,6 +1,21 @@
# Next
# v1.0.0-beta1

* Update Monitoring Dashboard to support new Persistent Disk FS-Cache Volumes
* Upgrade to Ubuntu 22.04 LTS
* Build and use custom kernel with FS-Cache performance patches
* Remove custom culling agent and custom cachefilesd package

## Remove custom culling agent and custom cachefilesd package

Kernel versions `5.17` and later do not contain the FS-Cache culling bug, so the custom culling agent and the custom cachefilesd package are no longer required.
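
To check whether a given host is running a kernel new enough for this to apply, something like the following Bash sketch could be used. It is only an illustration and not part of this repository: the helper names `version_at_least` and `kernel_has_culling_fix` are hypothetical, and the comparison assumes GNU `sort` with the `-V` (version sort) option is available.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; not part of knfsd-cache-utils.

# Succeeds when version $1 is greater than or equal to version $2.
# Assumption: GNU sort with -V (version sort) is installed.
version_at_least() {
  local have="$1" want="$2"
  [[ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" == "$want" ]]
}

# Succeeds when the kernel release (default: the running kernel) is 5.17 or later.
# Any suffix such as "-rc5" or "-generic" is stripped before comparing.
kernel_has_culling_fix() {
  local release="${1:-$(uname -r)}"
  version_at_least "${release%%-*}" "5.17"
}
```

For example, `kernel_has_culling_fix 6.1.0-rc5` succeeds, while `kernel_has_culling_fix 5.15.0-52-generic` fails; calling it with no argument checks the running kernel.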

## Build and use custom kernel with FS-Cache performance patches

Builds a custom version of the kernel based on `6.1.0-rc5`. This custom version contains additional patches that resolve the FS-Cache single page caching performance issue. See [here](https://github.com/benjamin-maynard/kernel/commits/nfs-fscache-netfs) for more details.

## Upgrade to Ubuntu 22.04 LTS

Upgrades the Ubuntu image to 22.04.1 LTS (Jammy Jellyfish).

## Update Monitoring Dashboard to support new Persistent Disk FS-Cache Volumes
