Windows 32/64 binaries, statically linked FreeBSD binary
First (public) release with zfsbackup-zfsreceive-zfsrestore
NOTICE: there are virtually no checks on input parameters, so proceed with caution. If you need help, just ask. The next release will include more stringent checks
And now... THE SPIEGONE (the big explanation)!
This release contains numerous features, both general-purpose and zfs-specific
The first function is versum, something similar to a "smarter" hashdeep
Basically, it verifies the hashes of the files on the filesystem against a list stored in a text file
We want to verify that a zfs backup-restore works well, without blindly trusting it
It can be fed with two types of files: those created by zpaqfranz itself, and those created by hashdeep.
The former are written by the sum command with the appropriate switches (e.g. zpaqfranz sum *.txt -xxh3). BTW, zpaqfranz can both write and read the hashdeep file format (-hashdeep switch).
In the following examples we will operate on the tank/d dataset on SSD/NVMe storage, working on the fc snapshot (yep, francocorbelli)
You can use any of zpaqfranz's hash types; in this example it will be xxhash64
Incidentally, -forcezfs is used so that the example folder (which lives inside .zfs, being a snapshot) is examined at all; otherwise zpaqfranz would ignore it
zpaqfranz sum /tank/d/.zfs/snapshot/fc -forcezfs -ssd -xxhash -noeta -silent -out /tmp/hash_xx64.txt
A possible alternative, for third-party verification (i.e. with software other than zpaqfranz), is to use hashdeep, usually shipped in the md5deep package
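For instance, on FreeBSD it can be installed like this (assuming the usual package name, md5deep):
pkg install md5deep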
The essential difference between hashdeep and md5deep is multithreading: hashdeep reads files from disk in parallel, so the examples assume solid-state disks (or... be prepared to wait longer :) )
Various hashes can be selected, but since they are used here as checksums and not as cryptographic signatures, md5 is more than fine (it is the fastest), at least for me
hashdeep -c md5 -r /tank/d/.zfs/snapshot/fc >/tmp/hashdeep.txt
BTW hashdeep does not have a find/replace function; awk or sed is commonly used to rewrite the paths. Uncomfortable to say the least (see the sed sketch below)
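A minimal sketch of the usual sed workaround, using the paths of this example (the output file name is hypothetical):
sed 's|/tank/d/.zfs/snapshot/fc|/tank/d|g' /tmp/hashdeep.txt > /tmp/hashdeep_fixed.txt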
To check the contents of the filesystem we have three options
- the zpaqfranz hash list
In this example a multithreaded run (-ssd) is used, with a renaming (-find/-replace) to convert the paths stored in the source text file into the target ones
zpaqfranz versum z:\uno\_tmp\hash_xx64.txt -ssd -find /tank/d/.zfs/snapshot/fc -replace z:\uno\_tank\d
- the hashdeep list
zpaqfranz is able to 'understand' hashdeep's native format (note the -hashdeep switch)
zpaqfranz versum z:\uno\_tmp\hashdeep.txt -hashdeep -ssd -find /tank/d/.zfs/snapshot/fc -replace z:\uno\_tank\d
- a small-scale test, without reading from the filesystem
If the hash function used to create the .zpaq file is the same as that of the .txt control file, you can operate as follows
zpaqfranz versum z:\uno\_tmp\hash_xx64.txt -to thebak.zpaq -find /tank/d/.zfs/snapshot/fc -replace /tank/d
It should be remembered that the default hash of zpaqfranz is xxhash64, so if you want to use other hashes (e.g. xxh3, sha256, sha3, etc.) you must add the relevant switch (e.g. -xxh3, -sha3, -blake3, etc.) when creating the .zpaq file (the a command). A hedged sketch follows.
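For example (switch placement is my assumption, based on the switches listed above; paths are those of this example), creating and checking with sha3 instead of the default:
zpaqfranz sum /tank/d/.zfs/snapshot/fc -forcezfs -ssd -sha3 -noeta -silent -out /tmp/hash_sha3.txt
zpaqfranz a /tmp/thebak.zpaq /tank/d/.zfs/snapshot/fc -sha3 -to /tank/d
zpaqfranz versum /tmp/hash_sha3.txt -to /tmp/thebak.zpaq -find /tank/d/.zfs/snapshot/fc -replace /tank/d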
Recap
Complete example of creating an archive (on FreeBSD with zfs) to be then extracted on Windows with independent verification
The source will be tank/d, using -ssd for multithreading
Take the snapshot fc of tank/d
zfs snapshot tank/d@fc
Get the hash list with xxhash64 into the file /tmp/hash_xx64.txt
zpaqfranz sum /tank/d/.zfs/snapshot/fc -forcezfs -ssd -xxhash -noeta -silent -out /tmp/hash_xx64.txt
Create a hashdeep list with md5 into /tmp/hashdeep.txt (md5 because it is very fast)
WARNING: /sbin/zfs set snapdir=hidden tank/d may be required to "hide" the .zfs folder from hashdeep. There is no easy way to exclude folders in hashdeep
hashdeep -c md5 -r /tank/d/.zfs/snapshot/fc >/tmp/hashdeep.txt
Now make the backup (fixing the paths with -to)
In this case the default hash function (xxhash64) is used, matching hash_xx64.txt
We "inject" the two hash lists, /tmp/hash_xx64.txt and /tmp/hashdeep.txt, so they are kept inside the archive
zpaqfranz a /tmp/thebak.zpaq /tank/d/.zfs/snapshot/fc /tmp/hash_xx64.txt /tmp/hashdeep.txt -to /tank/d
Destroy the snapshot
zfs destroy tank/d@fc
Now transfer thebak.zpaq to Windows somehow (usually with rsync; a hedged sketch follows)
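A minimal sketch, assuming an SSH-reachable Windows box running a cygwin-style rsync (hostname and destination path are hypothetical):
rsync -av --partial --progress /tmp/thebak.zpaq user@winbox:/cygdrive/z/uno/_tmp/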
Extract everything to z:\uno (note the -longpath switch)
zpaqfranz x thebak.zpaq -to z:\uno -longpath
Verify the files against zpaqfranz's hash list
Note the -find and -replace to translate the source (FreeBSD) paths into the destination (Windows) ones
zpaqfranz versum z:\uno\_tmp\hash_xx64.txt -ssd -find /tank/d/.zfs/snapshot/fc -replace z:\uno\_tank\d
Now a paranoid double-check with hashdeep.
Please note the -hashdeep switch
zpaqfranz versum z:\uno\_tmp\hashdeep.txt -hashdeep -ssd -find /tank/d/.zfs/snapshot/fc -replace z:\uno\_tank\d
Finally, compare the hashes in the txt file directly against the .zpaq archive
zpaqfranz versum z:\uno\_tmp\hash_xx64.txt -to thebak.zpaq -find /tank/d/.zfs/snapshot/fc -replace /tank/d
Short version: this is an example of how to verify, on a completely different system (Windows), a copy made with zpaqfranz from a .zfs snapshot. We will now see how, in reality, all this is designed for "real" zfs backup-restore
New advanced option: -stdout
If the files are stored in order inside the .zpaq, it is possible to extract them with -stdout
WHAT?
Files stored within a .zpaq are divided into fragments (let's say 'chunks') which, in general, are not sorted.
This happens for various reasons (I will not elaborate), and it prevents extracting files in stream form, i.e. as an ordered sequence of bytes, as required by -stdout
This is not normally a serious problem for zpaq 7.15, since it simply does not support mixing streamed and journaled files in the same archive
Translation of the translation (!)
zpaq started out as a stream compressor (actually no, there would be a further very long explanation here that I will spare you)
It processes an arbitrarily long sequence of bytes, one byte at a time, and writes a sequence of bytes as output: this is the so-called streamed format
It was used in older versions of zpaq; something analogous to gz, just to give a known example.
Subsequently, the developer of zpaq (Matt Mahoney) implemented the so-called 'journaled' storage format, where the archive holds the various versions of each file.
This is the 'normal' format today, while the 'streamed' one has practically disappeared (vestiges remain in the source).
For a whole series of technical problems that I won't go into here, Mahoney decided not to allow the mixing of the two types:
- archives WITH VERSIONS (aka: modern)
XOR
- with streamed files (aka: OK for stdout)
The ability to write to stdout does not have much reason to exist, unless coupled with the ability to read from stdin, and zpaq 7.15 does not allow this, essentially operating by reading files from the filesystem "the usual way".
As you may have noticed (?) for some time now, I have instead evolved zpaqfranz to allow the processing of input streams (with -stdin)
The concrete reason is twofold
The first is to archive MySQL dumps: the relevant tools (mysqldump and similar) output exactly that, a text stream.
This way, you can use zpaqfranz to archive them with versioning (which, as far as I know, is superior to practically any other system, by a WIDE margin); a hedged sketch follows.
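A minimal sketch, following the same pipe-into-'a'-with-'-stdin' pattern used later in this document (database selection, archive path and internal file name are hypothetical):
mysqldump --all-databases | zpaqfranz a /backup/mysqldump.zpaq dump.sql -stdin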
The second is to make sector-level copies of Windows drives, in particular the C: drive.
As you may have noticed (?) zpaqfranz is now able to back up (from within Windows) an entire system, either 'file-based' (with a VSS) or 'dd-style'
Obviously the 'dd' method will take up more space (it is good to use the f command to fill the free space with zeros) and will also be slower
BUT
it allows you to mount (with other software) / extract (e.g. with 7z) almost everything
If you are really paranoid (like me), what could be better than a backup of every sector of the disk?
Let us now return to why it is so important (and actually not trivial) to obtain archives in journaled format but with the possibility of ordered (= streamed) extraction
It is about speed
Streamed .zpaq archives exist, BUT listing (i.e. enumerating the list of files therein) is very slow, requiring a scan of the entire file (which can be hundreds of GB = minutes)
They are also extremely slow to create (with the zpaqd 7.15 utility), which is essentially single-threaded (~10 MB/s)
Instead, by having journaled (i.e. 'normal' zpaq format) but ORDERED archives, I can obtain all the benefits:
various versions of the same file, fast listing, and even fast creation (keeping multithreading), at the cost of a (not excessive) slowdown in output, due to the use of -stdout instead of the more efficient filesystem
Why all this?
For zfs backups, of course, and especially restorations
There are now three new commands (actually very crude, to be developed, but that's the "main thing")
- zfsbackup
- zfsrestore
- zfsreceive
One normally uses .zpaq to make differential zfs backups, i.e. one base file and N differential files, stored as different versions. This is good, it works well, and it is not fragile (differential means that two files are enough to restore). This is the "normal" method for "older" zpaqfranz.
BUT
it takes up space: the differential snapshots get bigger and bigger, a normal problem for any differential system
On the other hand, using incremental zfs snapshots has always been very risky and fragile, because even small failures are enough to make the recovery fail.
The zfsbackup command hides the complexity (of backing up a single dataset) while being extremely space efficient and taking minimal time
How?
Let's see an example
Just 3 commands to add another level of verification...
touch /tank/d/momentoincrementale
/sbin/zfs set snapdir=hidden tank/d
/usr/local/bin/hashdeep -c md5 -r /tank/d |/usr/local/bin/zpaqfranz a /tank/d/hasho.zpaq elenco.txt -stdin
and now the "real" one (warning: NO input checks!)
/usr/local/bin/zpaqfranz zfsbackup /temporaneo/incrementale.zpaq tank/d
That's all
The tank/d dataset will be "snapshotted" and archived into /temporaneo/incrementale.zpaq
Now look at an l (list) with the new -stdout
H:\backup\rambo_incrementale>zpaqfranz l incrementale.zpaq -stdout
incrementale.zpaq:
23 versions, 23 files, 157.051 fragments, 4.968 blocks, 66.872.360.363 bytes (62.28 GB)
AVAILABLE -stdout 23
- 2022-12-17 15:58:42 71.790.160.936 0644 [STDOUT] 00000001.zfs
- 2022-12-17 16:22:59 49.744 0644 [STDOUT] 00000002.zfs
- 2022-12-17 16:30:19 45.016 0644 [STDOUT] 00000003.zfs
- (...)
- 2022-12-22 09:00:00 4.579.140.520 0644 [STDOUT] 00000015.zfs
- 2022-12-24 20:01:41 193.543.496 0644 [STDOUT] 00000016.zfs
(...)
81.605.788.800 (76.00 GB) of 81.605.788.800 (76.00 GB) in 23 files shown
66.872.360.363 compressed
The files are compatible with -stdout (because they were created by this new zpaqfranz with -stdin)
In general this is not possible: in the listing below the first two files can be streamed to stdout, the last two... maybe
(...)
- 2019-11-13 13:54:03 21.326 0666 [STDOUT] /tank/d/ALESSANDRA/00086140 Verbale spl.docx
- 2019-12-13 15:26:50 934.440 0666 [STDOUT] /tank/d/ALESSANDRA/00086216 COMPUTO LAVORI.doc
- 2020-01-02 09:51:57 22.114.816 0666 /tank/d/ALESSANDRA/00086216 perizia.doc
- 2019-12-13 15:27:02 943.598 0666 /tank/d/ALESSANDRA/00086216 verb.pdf
(...)
Recap: at every zpaqfranz zfsbackup run a new snapshot is taken, as in this example
root@aserver:~ # zpaqfranz zfslist "*" "zpaqfranz"
zpaqfranz v56.4h-JIT-L archiver, (26 Dec 2022)
tank/d@zpaqfranz00000001
tank/d@zpaqfranz00000002
tank/d@zpaqfranz00000003
(...)
tank/d@zpaqfranz00000023
0.048 seconds (00:00:00) (all OK)
(in fact, using -kill will destroy the snapshots, leaving only the last one, to save space; see the example below)
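For example (hedged: I am assuming the switch is simply appended to the same command shown above), the same backup with automatic pruning of old snapshots:
/usr/local/bin/zpaqfranz zfsbackup /temporaneo/incrementale.zpaq tank/d -kill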
Now the problem moves to data recovery
There are basically two ways
The first is to extract all .zfs files from the zpaq archive, and then restore them in sequence
"Old" style (aka: zpaq 7.15)
- Extract the .zfs into some folder, like
zpaqfranz x incrementale.zpaq -to estratti
- Then zfsrestore
In this example a 1.sh script will be prepared, to restore data from the ./estratti folder into the
(new) dataset rpool/saveme [of course rpool must exist, and saveme must NOT!]
zpaqfranz zfsrestore ./estratti rpool/saveme -out 1.sh
The script looks something like this (it automagically uses pv, if found)
cat ./estratti/00000001.zfs |/usr/bin/pv| zfs receive rpool/saveme@franco00000001
cat ./estratti/00000002.zfs |/usr/bin/pv| zfs receive rpool/saveme@franco00000002
(...)
cat ./estratti/00000023.zfs |/usr/bin/pv| zfs receive rpool/saveme@franco00000023
So far, so good
BUT
it takes up double the space.
If the data originally occupied 100GB, let's say 100GB will be needed (in which to extract the .zfs files)
AND
another 100GB in which to restore them (actually more, it's just an example)
The "new style" (to be tested!) is zfsreceive
zpaqfranz zfsreceive incrementale.zpaq tank/ripri1 -out doreceive.sh
You will get a script to be run
./doreceive.sh
With something like...
/usr/local/bin/zpaqfranz x incrementale.zpaq 00000001.zfs -stdout |/usr/bin/pv| zfs receive tank/ripri1@franco00000001
/usr/local/bin/zpaqfranz x incrementale.zpaq 00000002.zfs -stdout |/usr/bin/pv| zfs receive tank/ripri1@franco00000002
(...)
/usr/local/bin/zpaqfranz x incrementale.zpaq 00000023.zfs -stdout |/usr/bin/pv| zfs receive tank/ripri1@franco00000023
In this case the individual files are extracted and streamed ("piped") by zpaqfranz directly into zfs receive, halving the space needed to restore (and reducing the time as well)
versum - paranoid test for zfs backups
And (turning back to the example) if you are really, really, really paranoid, you can test
the hashdeep file against the restored zfs dataset, just to be really sure
cd /tank/ripri1
zpaqfranz x hasho.zpaq
zpaqfranz versum elenco.txt -hashdeep -ssd -find /tank/d -replace /tank/ripri1
zpaqfranz v56.4h-JIT-L archiver, (26 Dec 2022)
franz:-ssd -hashdeep
franz:find <</tank/d>>
franz:replace <</tank/ripri1>>
60188: Loading text files
60134: Working on hashdeep format <<elenco.txt>>
60451: hasher selected: md5
60188: Loading finished
Creating 4 hashing thread(s) [2]
----------------------------------------------------------------------------
Files 100.090 vs exp 100.090 this is good
Bytes 69.533.786.054 vs exp 69.529.738.801 *** DIFFER 4.047.253 ***
----------------------------------------------------------------------------
60402: ERROR md5 |file=771CD0E8408C16FA596146A6E31ED4BC|txt=228A410DD737F57A623D41979898AD18| /tank/ripri1/hasho.zpaq
----------------------------------------------------------------------------
Time 87.436 s, average speed 795.253.511/s (758.41 MB/s)
This is good: everything is fine EXCEPT hasho.zpaq
And it must be different! (hasho.zpaq was still being written into the dataset while hashdeep was computing the list)
Using the -hashdeep switch (during zfsbackup), if the hashdeep executable is reachable, a check list will be "magically" added to the archive; a hedged sketch follows
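Something like this (hedged: I am assuming -hashdeep is simply appended to the zfsbackup command shown earlier):
/usr/local/bin/zpaqfranz zfsbackup /temporaneo/incrementale.zpaq tank/d -hashdeep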
Short (!) version
By using the zfs replication tools and zpaqfranz together, "everything" (attributes and all) can be preserved with essentially minimal space occupation AND very short times (the filesystem scan performed by any program is always slow, while zfs computes the incremental delta in a few hundred milliseconds)
If you do not trust even zfs (yep, I do NOT) you can use hashdeep (solid-state drives highly recommended, of course) to be REALLY sure that everything is fine after the recovery
Recovery (especially for large amounts of data) takes time and space
That's why, generally, it's a strategy that I ADD to the classic zpaq one
In this release I have included a statically linked binary for FreeBSD, to make it easier to test on systems such as TrueNAS