add cutmark concept #199

Open
wants to merge 18 commits into
base: main
2 changes: 0 additions & 2 deletions TODO
@@ -37,7 +37,6 @@ LATER:
* send progress information via sd_notify(), so that people can wrap casync nicely in UIs
* maybe turn "recursive" mode into a numeric value specifying how far to descend?
* make "casync stat" work on a directory with a subpath
* tweak chunker: shift cut to last "marker".
* define sane errors we can show user messages about
* introduce a --best-effort mode when replaying, which means we'll ignore what we can't apply
* when building the cache, also build a seed
@@ -49,6 +48,5 @@ LATER:
* make sure "casync list /etc/fstab" does something useful
* rework CaSeed logic to use CaCache as backend, and then add a new command "casync cache" or so, to explicitly generate a cache/seed
* support blake2 as hashes
* parallelize image generation: when storing chunks in the store do so in a thread
* in "casync stat" output show which flags enable what
* save/restore xfs/ext4 projid
78 changes: 78 additions & 0 deletions doc/casync.rst
@@ -157,6 +157,8 @@ General options:
--store=PATH The primary chunk store to use
--extra-store=<PATH> Additional chunk store to look for chunks in
--chunk-size=<[MIN:]AVG[:MAX]> The minimal/average/maximum number of bytes in a chunk
--cutmark=CUTMARK Specify a cutmark
--cutmark-delta-bytes=BYTES Maximum bytes to shift cut due to cutmark
--digest=<DIGEST> Pick digest algorithm (sha512-256 or sha256)
--compression=<COMPRESSION> Pick compression algorithm (zstd, xz or gzip)
--seed=<PATH> Additional file or directory to use as seed
@@ -291,3 +293,79 @@ excluded:
unconditionally take precedence over lines not marked like this. Moreover,
lines prefixed with ``!`` also cancel the effect of patterns in
``.caexclude`` files placed in directories further up the tree.

Cutmarks
--------

``casync`` cuts the stream to serialize into chunks of an average size (as
specified with ``--chunk-size=``), determining cut points using the ``buzhash``
rolling hash function and a modulo test. Frequently, cut points determined that
way are at slightly inconvenient locations: in the middle of objects serialized
in the stream rather than before or after them, thus needlessly spreading
changes to individual objects across more than one chunk. To optimize this,
**cutmarks** may be configured. These are byte sequences (up to 8 bytes in
length) that ``casync`` automatically detects in the data stream and that
should be considered particularly good cut points. When cutmarks are defined,
the chunking algorithm will slightly move the cut point between two chunks to
match a cutmark if one has recently been seen in the serialization stream.
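
The rolling hash and modulo test described above can be sketched as follows.
This is only an illustration, not the casync chunker: the 48 byte window, the
pseudo-random substitution table and the names ``init_table()``,
``hash_window()``, ``hash_roll()`` and ``find_cut()`` are all assumptions made
for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW 48u /* assumed window size, for illustration only */

static uint32_t table[256];

/* Fill the substitution table with pseudo-random values (simple LCG,
 * an assumption of this sketch; a real chunker ships a fixed table). */
void init_table(void) {
        uint32_t x = 0x12345678u;

        for (size_t i = 0; i < 256; i++) {
                x = x * 1664525u + 1013904223u;
                table[i] = x;
        }
}

/* Rotate a 32 bit value left by n bits (n is reduced modulo 32). */
static uint32_t rol32(uint32_t v, unsigned n) {
        n &= 31u;
        return n ? (v << n) | (v >> (32u - n)) : v;
}

/* Hash of exactly WINDOW bytes, computed from scratch. */
static uint32_t hash_window(const uint8_t *p) {
        uint32_t h = 0;

        for (size_t i = 0; i < WINDOW; i++)
                h = rol32(h, 1) ^ table[p[i]];
        return h;
}

/* Slide the window by one byte: drop 'out' at the front, append 'in'. */
static uint32_t hash_roll(uint32_t h, uint8_t out, uint8_t in) {
        return rol32(h, 1) ^ rol32(table[out], WINDOW) ^ table[in];
}

/* Return the offset just past the first window for which the modulo test
 * fires, or n if it never fires. 'avg' stands in for the average chunk
 * size; minimum and maximum chunk sizes are not modelled here. */
size_t find_cut(const uint8_t *data, size_t n, uint32_t avg) {
        assert(avg > 0);

        if (n < WINDOW)
                return n;

        uint32_t h = hash_window(data);
        size_t end = WINDOW; /* window covers data[end-WINDOW .. end-1] */

        while (h % avg != avg - 1 && end < n) {
                h = hash_roll(h, data[end - WINDOW], data[end]);
                end++;
        }
        return end;
}
```

Because the hash depends only on the last ``WINDOW`` bytes, the cut lands
wherever the content happens to satisfy the test, which is why it may fall in
the middle of a serialized object.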

Cutmarks may be specified with the ``--cutmark=`` option. It takes a cutmark
specification in the format ``VALUE:MASK+OFFSET`` or ``VALUE:MASK-OFFSET``. The
first part, the value, indicates the byte sequence to detect in hexadecimal
digits, up to 8 bytes (thus 16 characters) in length. Following the colon, a
bitmask (also in hexadecimal) of the same size may be specified. Every 8 byte
sequence at every 1 byte granularity stream position is tested against the
value. If all bits indicated in the mask match, a cutmark is found. The third
part of the specification indicates where to place the cut relative to the end
of the 8 byte sequence. Specify ``-8`` to cut immediately before the cutmark
sequence, and ``+0`` to cut right after it. The offset (along with its ``+`` or
``-`` character) may be omitted, in which case the offset is assumed to be
zero, i.e. the cut is done right after the sequence. The mask (along with its
``:`` character) may also be omitted, in which case it is assumed to be
``FFFFFFFFFFFFFFFF``, i.e. all bits on, matching the full specified byte
sequence. In order to match a shorter byte sequence (for example to adapt the
tool to some specific file format using shorter object or section markers)
simply specify a shorter mask value and correct the offset value.

Examples:

--cutmark=123456789ABCDEF0


This defines a cutmark to be the 8 byte sequence 0x12, 0x34, 0x56, 0x78, 0x9A,
0xBC, 0xDE, 0xF0, and the cut is placed right after the last byte, i.e. after the
0xF0.


--cutmark=C0FFEE:FFFFFF-5


This defines a cutmark to be the 3 byte sequence 0xC0, 0xFF, 0xEE and the cut is
placed right after the last byte, i.e. after the 0xEE.

--cutmark=C0DECAFE:FFFFFFFF-8


This defines a cutmark to be the 4 byte sequence 0xC0, 0xDE, 0xCA, 0xFE and the
cut is placed right before the first byte, i.e. before the 0xC0.
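
The specification format and the three examples above can be sketched as
follows. This is not the casync parser: ``struct cutmark``, ``parse_cutmark()``
and ``first_cut()`` are hypothetical names, and the sketch assumes that a short
``VALUE`` or ``MASK`` occupies the leading bytes of the 8 byte window, which is
what the offsets ``-5`` and ``-8`` in the examples suggest.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct cutmark {
        uint64_t value; /* bytes to look for, left-aligned in the window (assumed) */
        uint64_t mask;  /* bits of the window that take part in the comparison */
        int64_t delta;  /* cut position relative to the end of the 8 byte window */
};

/* Parse 1..16 hex digits at *s, left-aligning the result in 64 bits. */
static int parse_hex(const char **s, uint64_t *ret) {
        uint64_t v = 0;
        unsigned n = 0;

        for (; **s; (*s)++, n++) {
                char c = **s;
                int d;

                if (c >= '0' && c <= '9')
                        d = c - '0';
                else if (c >= 'a' && c <= 'f')
                        d = c - 'a' + 10;
                else if (c >= 'A' && c <= 'F')
                        d = c - 'A' + 10;
                else
                        break;

                if (n >= 16)
                        return -EINVAL;
                v = (v << 4) | (uint64_t) d;
        }

        if (n == 0)
                return -EINVAL;

        *ret = v << (4 * (16 - n));
        return 0;
}

/* Parse VALUE[:MASK][+OFFSET|-OFFSET]; returns 0 or a negative errno. */
int parse_cutmark(const char *s, struct cutmark *ret) {
        struct cutmark c = { .mask = UINT64_MAX }; /* default: all bits on */
        int r;

        r = parse_hex(&s, &c.value);
        if (r < 0)
                return r;

        if (*s == ':') {
                s++;
                r = parse_hex(&s, &c.mask);
                if (r < 0)
                        return r;
        }

        if (*s == '+' || *s == '-') {
                int negative = *s == '-';
                char *end;

                s++;
                if (*s < '0' || *s > '9')
                        return -EINVAL;

                errno = 0;
                unsigned long long u = strtoull(s, &end, 10);
                if (errno != 0 || u > INT64_MAX)
                        return -EINVAL;

                c.delta = negative ? -(int64_t) u : (int64_t) u;
                s = end;
        }

        if (*s != 0)
                return -EINVAL;

        *ret = c;
        return 0;
}

/* Test every 8 byte window at 1 byte granularity. Returns the cut offset
 * for the first match in data[0..n), or -1 if nothing matches. */
int64_t first_cut(const struct cutmark *c, const uint8_t *data, size_t n) {
        uint64_t window = 0;

        for (size_t i = 0; i < n; i++) {
                window = (window << 8) | data[i]; /* oldest byte ends up on top */

                if (i + 1 >= 8 && ((window ^ c->value) & c->mask) == 0)
                        return (int64_t) (i + 1) + c->delta;
        }
        return -1;
}
```

Under these assumptions, ``C0FFEE:FFFFFF-5`` yields a cut right after the
``0xEE`` byte and ``C0DECAFE:FFFFFFFF-8`` yields a cut right before the
``0xC0`` byte, as the examples describe. A match is only reported once the
full 8 byte window has been read.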

When operating on the file system layer (i.e. when creating ``.caidx`` files),
the implicit cutmark of ``--cutmark=51bb5beabcfa9613+8`` is used, to increase
the chance that cut points are placed right before each serialized file.

Multiple cutmarks may be defined for the same operation: simply specify
``--cutmark=`` multiple times. The parameter also takes the special values
``yes`` and ``no``. If the latter is specified, any implicit cutmarks are
turned off, in particular the implicit cutmark used when generating ``.caidx``
files described above.

``casync`` will honour cutmarks only within the immediate vicinity of the cut
point the modulo test suggested. By default this is a 16K window before the
calculated cut point. This value may be altered using the
``--cutmark-delta-max=`` setting.
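
The windowing rule can be sketched as a small decision function. This is a
sketch under stated assumptions rather than the casync logic:
``apply_cutmark()``, ``NO_MARK`` and ``CUTMARK_DELTA_DEFAULT`` are hypothetical
names, the window boundary is treated as inclusive, and interaction with the
minimum chunk size is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define CUTMARK_DELTA_DEFAULT (16u * 1024u) /* the 16K default described above */
#define NO_MARK UINT64_MAX                  /* sentinel: no cutmark seen yet */

/* Given the stream offset 'cut' suggested by the modulo test and the offset
 * 'last_mark' of the most recently seen cutmark, return the offset to cut at. */
uint64_t apply_cutmark(uint64_t cut, uint64_t last_mark, uint64_t delta_max) {
        assert(cut != NO_MARK); /* NO_MARK is reserved as a sentinel */

        if (last_mark == NO_MARK || last_mark > cut)
                return cut; /* nothing usable before the cut point */

        if (cut - last_mark > delta_max)
                return cut; /* cutmark is too far back, keep the original cut */

        return last_mark;   /* shift the cut back to the cutmark */
}
```

For example, with the default window a cutmark 1000 bytes before the suggested
cut point is honoured, while one 50000 bytes before it is ignored.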

Any configured cutmark (and the selected ``--cutmark-delta-max=`` value) is
also stored in the ``.caidx`` or ``.caibx`` file to ensure that such an index
file contains sufficient data for an extracting client to properly use an
existing file system tree (or block device) as seed while applying the same
chunking logic as the original image.
10 changes: 10 additions & 0 deletions meson.build
@@ -88,6 +88,7 @@ foreach ident : [
['copy_file_range', '''#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>'''],
['reallocarray', '''#include <malloc.h>'''],
]
have = cc.has_function(ident[0], args : '-D_GNU_SOURCE', prefix : ident[1])
conf.set10('HAVE_' + ident[0].to_upper(), have)
@@ -312,6 +313,14 @@ test_cache = find_program(test_cache_sh)
test('test-cache.sh', test_cache,
timeout : 30 * 60)

test_seed_sh = configure_file(
        output : 'test-seed.sh',
        input : 'test/test-seed.sh.in',
        configuration : substs)
test_seed = find_program(test_seed_sh)
test('test-seed.sh', test_seed,
     timeout : 30 * 60)

udev_rule = configure_file(
output : '75-casync.rules',
input : 'src/75-casync.rules.in',
@@ -325,6 +334,7 @@ test_sources = '''
test-cachunk
test-cachunker
test-cachunker-histogram
test-cacutmark
test-cadigest
test-caencoder
test-calocation
39 changes: 39 additions & 0 deletions src/affinity-count.c
@@ -0,0 +1,39 @@
#include "affinity-count.h"

#include <errno.h>
#include <sched.h>
#include <sys/types.h>

int cpus_in_affinity_mask(void) {
        size_t n = 16;
        int r;

        for (;;) {
                cpu_set_t *c;

                c = CPU_ALLOC(n);
                if (!c)
                        return -ENOMEM;

                if (sched_getaffinity(0, CPU_ALLOC_SIZE(n), c) >= 0) {
                        int k;

                        k = CPU_COUNT_S(CPU_ALLOC_SIZE(n), c);
                        CPU_FREE(c);

                        if (k <= 0)
                                return -EINVAL;

                        return k;
                }

                r = -errno;
                CPU_FREE(c);

                if (r != -EINVAL)
                        return r;
                if (n*2 < n)
                        return -ENOMEM;
                n *= 2;
        }
}
8 changes: 8 additions & 0 deletions src/affinity-count.h
@@ -0,0 +1,8 @@
/* SPDX-License-Identifier: LGPL-2.1+ */

#ifndef fooaffinitycounthfoo
#define fooaffinitycounthfoo

int cpus_in_affinity_mask(void);

#endif