Every business is taking backups of its data (hopefully). Backups can be challenging to manage because they often require substantial disk space and/or cloud storage, which can lead to significant costs. If you need a way to handle your backups and retain only the ones your retention policies call for, Backup Warden can help: it supervises and maintains your backups, simplifying your overall data life cycle and making smarter use of your resources.
Thanks to xolox for his work on rotate-backups that gave me a lot of inspiration for this project!
Option | Description | Default Value |
---|---|---|
--minutely | Number of minutely backups to preserve | 0 |
--hourly | Number of hourly backups to preserve | 72 |
--daily | Number of daily backups to preserve | 7 |
--weekly | Number of weekly backups to preserve | 6 |
--monthly | Number of monthly backups to preserve | 12 |
--yearly | Number of yearly backups to preserve | always |
-c, --config | Location of the config file | /etc/backup_warden.ini |
-s, --source | Source of where the backups are stored. Select from: local, ssh, s3 | local |
-b, --bucket | Name of the AWS S3 bucket | |
-p, --path | Specify a path to traverse all directories it contains for granular retention policies | |
-e, --environment | Environment the backups are rotated in (used for Slack alert only) | |
-t, --timestamp-pattern | The timestamp pattern using a regex expression to parse out of filenames | |
-l, --log-file | Enable logging to this file path | |
-I, --include-list | Include backups based on their directory path and/or filename (separated by comma) | |
-E, --exclude-list | Exclude backups based on their directory path and/or filename (separated by comma) | |
-H, --ssh-host | SSH host/alias to use | |
--ssh-sudo | Wrap SSH commands with sudo for escalated privileges | False |
--filestat | Use the file's last modified date instead of parsing timestamp from filename | False |
--prefer-recent | Keep the most recent backup in each time slot instead of oldest | False |
--s3-only-prefixes | When used with an S3 bucket, only prefixes will be considered (not individual objects) | False |
--relaxed | Time windows are not enforced | False |
--utc | Use UTC timezone instead of local machine's timezone for timestamps | False |
--syslog | Use syslog | False |
--debug | Log debug messages that can help troubleshoot | False |
--delete | Commit to deleting backups (DANGER ZONE) | False |
-V, --version | Display version and exit | |
-h, --help | Show this help message and exit | |
Note: Boolean options such as --filestat can be specified as yes/no, true/false, or 1/0 in the config.
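These spellings are the same ones Python's standard configparser module accepts through getboolean. The sketch below only illustrates that behavior; whether Backup Warden reads its config this way is an assumption, and the section name and values are examples.

```python
import configparser

# Hypothetical config text; the section name and values are examples only.
CONFIG_TEXT = """
[/path/backups/logical]
filestat = yes
relaxed = 0
utc = true
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)
section = parser["/path/backups/logical"]

# getboolean() accepts yes/no, true/false, on/off and 1/0 (case-insensitive).
flags = {name: section.getboolean(name) for name in ("filestat", "relaxed", "utc")}
print(flags)
```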
- The --minutely through --yearly options determine the number of backups to retain for each respective frequency
- A value can be an expression that is evaluated to calculate the number. For example, --hourly=5+2 results in 7
- Alternatively, you can specify "always" as the value to preserve all backups for that particular frequency
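One way such a value could be interpreted is sketched below. This is not Backup Warden's actual parser: the function name parse_retention is hypothetical, and the set of supported operators is an assumption. It walks the parsed expression instead of calling eval, so only integer arithmetic is accepted.

```python
import ast
import operator

# Operators this sketch allows; the real tool may support a different set.
_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
}


def parse_retention(value):
    """Turn a retention value such as "5+2", "72" or "always" into a result."""
    value = value.strip()
    if value.lower() == "always":
        return "always"

    def evaluate(node):
        # Plain integer literal (bool is excluded because it subclasses int).
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        # Binary operation between two allowed sub-expressions.
        if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
            return _OPERATORS[type(node.op)](evaluate(node.left), evaluate(node.right))
        raise ValueError(f"unsupported retention expression: {value!r}")

    return evaluate(ast.parse(value, mode="eval").body)


print(parse_retention("5+2"))
print(parse_retention("always"))
```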
There are currently three available sources, each functioning differently when scanning directories to find backups.
local
This option is straightforward and doesn't really require any additional explanation. It is a simple method for locating backups
ssh
To use this source, you need to configure the SSH config file (~/.ssh/config) with the relevant host information. It also supports aliases defined in the SSH config, as well as jump hosts.
s3
One thing to note is the s3_endpoint_url option. It lets you specify an endpoint other than the default so that you can use an S3-compatible alternative such as DigitalOcean Spaces (e.g. https://nyc3.digitaloceanspaces.com).
Using Config File
When path is used under the [main] section in config, it significantly alters Backup Warden's functionality. In this case, Backup Warden will traverse through every directory and file under the given path until it locates a backup. Once a backup is found, Backup Warden associates it with a config section, using fnmatch for pattern matching, and that section defines its retention policy. If no config section matches a path that was found, that path is ignored. If path is not specified, Backup Warden will only scan the path defined in each config section.
Using the path option provides granular control over retention policies and allows for flexible path name conventions. It enables you to define custom retention rules based on very specific paths.
Using Parameters
--path is the same as using path under the [main] section in config.
Note: Specifying a path may impact performance if there are a lot of non-backup directories/files within the specified path. This shouldn't be an issue unless your setup is very unusual. You can use exclude-list to help in this scenario, along with s3-only-prefixes if you are working with S3 storage.
The timestamp-pattern option provides the flexibility to customize the regular expression used for extracting timestamps from filenames. The value for this option should be a Python-compatible regular expression that includes the named capture groups 'year', 'month', and 'day'. Additionally, it can optionally include the groups 'hour', 'minute', and 'second'. 'unixtime' is also supported (see below for how to use it).
Here is an example of the default regular expression:
# Required components
(?P<year>\d{4} ) \D?
(?P<month>\d{2}) \D?
(?P<day>\d{2} ) \D?
(?:
# Optional components
(?P<hour>\d{2} ) \D?
(?P<minute>\d{2}) \D?
(?P<second>\d{2})?
)?
Regular expressions are compiled using the re.VERBOSE flag which ignores whitespace, including newlines.
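As a quick way to check a pattern against your own filenames, the default expression can be compiled and tried in Python. In this sketch the helper parse_timestamp and the sample filenames are illustrative, and using search (rather than match) to locate the timestamp is an assumption about how the tool applies the pattern.

```python
import re

# The default pattern quoted above, compiled with re.VERBOSE so that the
# whitespace and comments inside it are ignored.
TIMESTAMP_PATTERN = re.compile(
    r"""
    # Required components
    (?P<year>\d{4} ) \D?
    (?P<month>\d{2}) \D?
    (?P<day>\d{2}  ) \D?
    (?:
        # Optional components
        (?P<hour>\d{2}  ) \D?
        (?P<minute>\d{2}) \D?
        (?P<second>\d{2})?
    )?
    """,
    re.VERBOSE,
)


def parse_timestamp(name):
    """Return the matched timestamp groups as integers, or None if no match."""
    match = TIMESTAMP_PATTERN.search(name)
    if match is None:
        return None
    # Groups that did not participate in the match are None and are skipped.
    return {key: int(value) for key, value in match.groupdict().items() if value is not None}


print(parse_timestamp("backup-2023-06-15_12-30-45.tar.gz"))
print(parse_timestamp("db-20230615.sql"))
```

A name without at least eight digits in a year, month, day layout, such as notes.txt, does not match and yields None.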
If your backups utilize Unix timestamps instead of standard timestamps, you can specify a pattern like:
(?P<unixtime>\d+)
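To illustrate what this pattern captures, the sketch below extracts the 'unixtime' group and converts it to a datetime. The helper name parse_unixtime and the sample filename are hypothetical, and interpreting the value in UTC is an assumption made to keep the example deterministic.

```python
import re
from datetime import datetime, timezone

# Same pattern as above; re.VERBOSE mirrors how the document says patterns are compiled.
UNIXTIME_PATTERN = re.compile(r"(?P<unixtime>\d+)", re.VERBOSE)


def parse_unixtime(name):
    """Return a timezone-aware datetime for the first run of digits in name."""
    match = UNIXTIME_PATTERN.search(name)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group("unixtime")), tz=timezone.utc)


print(parse_unixtime("dump-1686832245.sql.gz"))
```

Because \d+ matches the first run of digits it finds, filenames that contain other numbers before the timestamp may need a stricter pattern, for example one that requires a fixed number of digits.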
In cases where your backup files do not contain a timestamp, you can use the --filestat option to rely on the last modified time of the backup instead. Note that when using this option, you also need to change the timestamp-pattern so that it identifies which directories/files are considered backups. For example, if all of your backups have filenames starting with "backup-", you would change the timestamp-pattern to backup-\S+.
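The idea behind --filestat can be sketched as follows: the pattern only decides whether a name counts as a backup, and the timestamp comes from the file's modification time. The helper name backup_time_from_mtime is hypothetical and this is not the tool's actual code; UTC is used here only to keep the result unambiguous.

```python
import os
import re
from datetime import datetime, timezone

# With filestat, the pattern identifies backups; it does not carry a timestamp.
BACKUP_NAME = re.compile(r"backup-\S+", re.VERBOSE)


def backup_time_from_mtime(path):
    """Return the last modified time of path if its name looks like a backup."""
    if not BACKUP_NAME.search(os.path.basename(path)):
        return None  # Not considered a backup, so it is ignored.
    return datetime.fromtimestamp(os.stat(path).st_mtime, tz=timezone.utc)
```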
If your backup file names are not standardized and do not follow a specific pattern, this feature is currently not supported.
Backup Warden offers the --relaxed option to modify its default rotation behavior. By default, Backup Warden enforces strict time windows for each rotation scheme. With the --relaxed option, this enforcement is lifted. The difference between strict and relaxed rotation is as follows:
- Strict Rotation: When the number of hourly backups to preserve is set to three, only backups created within the relevant time window (the hour of the most recent backup and the two hours leading up to it) will match the hourly frequency. Choose this option if your backups are created at regular intervals without any missed intervals
- Relaxed Rotation: With the --relaxed option enabled, the three most recent backups will all match the hourly frequency and be preserved, regardless of the calculated time window. Choose this option if your backups are created at irregular intervals, as it allows for the preservation of more backups
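The difference can be made concrete with a simplified sketch of hourly matching. This is an illustration under stated assumptions, not Backup Warden's implementation: the function hourly_matches is hypothetical, backups are grouped into one-hour slots, and one backup is kept per slot (the oldest by default, or the newest when prefer_recent is set, as the --prefer-recent option describes).

```python
from datetime import datetime, timedelta


def hourly_matches(timestamps, keep, relaxed=False, prefer_recent=False):
    """Return the backups that match an hourly policy preserving `keep` slots."""
    if keep <= 0 or not timestamps:
        return []

    # Group backups into hour slots, keeping one representative per slot.
    slots = {}
    for ts in sorted(timestamps):
        slot = ts.replace(minute=0, second=0, microsecond=0)
        if prefer_recent:
            slots[slot] = ts            # later backups overwrite earlier ones
        else:
            slots.setdefault(slot, ts)  # the first (oldest) backup wins

    occupied = sorted(slots)
    if relaxed:
        # Relaxed: the most recent occupied slots, whatever their age.
        chosen = occupied[-keep:]
    else:
        # Strict: only slots inside the window ending at the newest backup's hour.
        cutoff = occupied[-1] - timedelta(hours=keep - 1)
        chosen = [slot for slot in occupied if slot >= cutoff]
    return [slots[slot] for slot in chosen]


backups = [
    datetime(2023, 6, 15, 3, 0),
    datetime(2023, 6, 15, 9, 10),
    datetime(2023, 6, 15, 12, 5),
    datetime(2023, 6, 15, 12, 40),
]
strict = hourly_matches(backups, keep=3)
relaxed = hourly_matches(backups, keep=3, relaxed=True)
print(len(strict), len(relaxed))
```

With these irregular sample backups, strict matching keeps only the backup from the 12:00 slot, because the 03:00 and 09:00 slots fall outside the three-hour window, while relaxed matching keeps one backup from each of the three occupied slots.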
With the --s3-only-prefixes option, only prefixes (not individual objects) will be considered for rotation. This drastically improves performance when the bucket contains a large number of objects, but only works if your backups are nested under timestamped prefixes.
The include-list and exclude-list options use fnmatch, allowing the use of asterisks as wildcards. This enables precise definition of which backups should be excluded from deletion. Include and exclude can be used together for fine-grained control.
For example, to exclude cluster1 from Backup Warden's operations, you can use the --exclude-list="*cluster1*" argument. This ensures that any directories/files containing cluster1 in their names will be excluded.
To further expand the exclusion criteria, you can also exclude backups from the year 2022 by using --exclude-list="*cluster1*, *2022*".
The same concept applies to the exclude_list option under each section in the config file:
[/path/backups/*/logical]
hourly = 72
daily = 7
weekly = 6
monthly = 12
yearly = always
include_list =
exclude_list = *cluster1*, *2022*
Include works in the opposite manner: use it when you want only specific backups to be considered. It can be passed as a command-line argument using --include-list, or set as the include_list option in a config path section.
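The effect of these lists can be previewed with Python's fnmatch module. The sketch below is an approximation: the helper names are hypothetical, and the way include and exclude are combined here (a path must match an include pattern when any are given, and must match no exclude pattern) is an assumption about the tool's behavior.

```python
from fnmatch import fnmatch


def split_patterns(value):
    """Split a comma-separated list such as "*cluster1*, *2022*" into patterns."""
    return [pattern.strip() for pattern in value.split(",") if pattern.strip()]


def is_selected(path, include_list="", exclude_list=""):
    """Return True if path passes the include patterns and no exclude pattern."""
    includes = split_patterns(include_list)
    excludes = split_patterns(exclude_list)
    if includes and not any(fnmatch(path, pattern) for pattern in includes):
        return False
    return not any(fnmatch(path, pattern) for pattern in excludes)


print(is_selected("/path/backups/cluster1/logical/2023-06-15", exclude_list="*cluster1*, *2022*"))
print(is_selected("/path/backups/cluster2/logical/2023-06-15", exclude_list="*cluster1*, *2022*"))
```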
Requires Python 3.8+
Using PyPI:
pip install backup-warden
backup-warden --config config/example.ini
Using Poetry:
curl -sSL https://install.python-poetry.org | python3 -
poetry install
poetry run backup-warden --config config/example.ini
Using Docker:
docker build -t backup-warden .
docker run --volume=/my/backup/dir:/my/backup/dir --volume=$PWD/config:/config backup-warden --config /config/example.ini
Backup Warden offers two methods for setting it up: parameters and a config file. The recommended approach is to use a config file, which allows customization of directory paths and their respective retention policies. You can find examples of the config file here
With a config file, each section represents a specific path containing backups to be rotated. Within each section, you can define the rotation scheme and other options. Refer to the information above for details on how pattern matching works when using the path option.
Note: If you specify a config file along with config path(s), command-line parameters will have no effect. The methods are not interchangeable.
Under the [main] section in the config file, you can set the following options:
- bucket
- path
- source
- environment
- ssh_host
- ssh_sudo
- syslog
- log_file
- s3_endpoint_url
- s3_access_key_id
- s3_secret_access_key
- s3_only_prefixes
- slack_webhook
For each config [path] section, you can set the following options:
- minutely, hourly, daily, weekly, monthly, yearly
- timestamp_pattern
- include_list / exclude_list
- filestat
- relaxed
- prefer_recent
- utc
You can also set the following options using environment variables, which will override the corresponding config values:
- S3_ENDPOINT_URL
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AWS_SESSION_TOKEN
- SLACK_WEBHOOK
When the path parameter is omitted under the [main] section, Backup Warden does not accept wildcarding for config sections. In this scenario, Backup Warden will solely scan the specified config sections without traversing through additional directories.
This means that instead of scanning the entire file system or applying wildcards to match multiple paths, Backup Warden will focus solely on the directories specified within the config sections. It will not explore subdirectories or perform any recursive scanning.
By adhering to this behavior, Backup Warden provides a targeted approach, limiting the scope of its backup scanning and rotation operations to the explicitly defined directories within the config sections.
Directory structure
/path/backups/logical/$backups
/path/backups/physical/$backups
Config
[/path/backups/logical]
hourly = 15
daily = 3
weekly = 5
monthly = 12
yearly = always
[/path/backups/physical]
hourly = 10
daily = 7
weekly = 4
monthly = 12
yearly = always
This is a basic setup and works as you would expect.
When the path option is used under the [main] section, Backup Warden allows wildcarding to be applied flexibly within the config sections.
By specifying the path option, you gain the ability to use wildcards to define the directories or files that Backup Warden should scan for backups. This enables a more dynamic and versatile approach. Backup Warden will effectively traverse through the specified paths, including subdirectories if necessary, to locate the backups based on the wildcard patterns provided.
Directory structure
/path/backups/cluster1/logical/$backups
/path/backups/cluster2/logical/$backups
/path/backups/cluster1/physical/$backups
/path/backups/cluster2/physical/$backups
Config
[main]
path=/path/backups
[/path/backups/*/logical]
hourly = 15
daily = 3
weekly = 5
monthly = 12
yearly = always
[/path/backups/*/physical]
hourly = 10
daily = 7
weekly = 4
monthly = 12
yearly = always
Backup Warden's design incorporates hierarchical directory structure awareness, allowing for precise configuration and retention policies.
By defining /path/backups/*/logical as a config section, Backup Warden treats the wildcard (*) as a placeholder that matches any subdirectory under /path/backups/ and assigns the backups found there to the logical config section.
When a retention policy is set for a broader path, such as /path/backups, it will not override or take precedence over a more specific path like /path/backups/cluster1/logical. Backup Warden's scanning and rotation operations respect the defined hierarchy, ensuring that retention policies are accurately applied to the corresponding backup directories without unintentionally affecting others.
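The section matching described above can be approximated with Python's fnmatch module. In this sketch the function find_section is hypothetical, matching the directory that contains the backups against each section name is an assumption, and the first matching section wins, which does not model the precedence rules of the real tool.

```python
from fnmatch import fnmatch

# Section names taken from the example config above.
SECTIONS = ["/path/backups/*/logical", "/path/backups/*/physical"]


def find_section(directory, sections=SECTIONS):
    """Return the first section pattern that matches directory, or None."""
    for pattern in sections:
        if fnmatch(directory, pattern):
            return pattern
    return None  # No section matches, so the directory would be ignored.


print(find_section("/path/backups/cluster1/logical"))
print(find_section("/path/backups/cluster2/physical"))
print(find_section("/path/backups/cluster1/other"))
```

Note that in fnmatch the asterisk also matches path separators, so a pattern like /path/backups/*/logical matches more deeply nested directories that end in /logical as well.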
Backup Warden offers a convenient Slack integration feature that allows you to stay informed about your backups if you specify a Slack Webhook URL. Benefit from the following alerts:
- Non-Backup Alert: Get notified if a path doesn't have a backup in the past 24 hours
- Success Alert: Receive notification after a successful execution, along with detailed statistics about what it did
- Failure Alert: In case of a failed execution, be promptly notified to address any potential issues
Backup Warden employs the following steps to carry out the backup rotation process:
- Specify Paths: You need to provide a path and/or use config sections for paths to inform Backup Warden about the locations where the backups are stored
- Scan for Backups: Backup Warden scans each specified path to locate backups. These backups can be in the form of either directories or files. Backup Warden identifies backups by searching for timestamps in their names. If you're using filestat, it does not look for a timestamp; instead, it matches names against whatever you specify as the timestamp-pattern
- Apply Rotation Scheme: Backup Warden applies the defined rotation scheme to the identified backups. If the outcome doesn't align with your expectations, you can experiment with the relaxed and/or prefer-recent options to achieve the desired behavior
- Backup Deletion: Backups that the rotation scheme marks for rotation will be deleted only if the delete option is passed. If it isn't, Backup Warden will skip the deletion step and preserve the rotated backups