Add test/bench runners, benchmarks, additional scripts #752

Merged
geky merged 82 commits into devel from test-and-bench-runners on Apr 26, 2023

@geky (Member) commented on Dec 2, 2022

This PR brings in a number of changes to how littlefs is tested and measured.

Originally, the motivation was to add a method for benchmarking the filesystem, to lay the groundwork for future performance improvements, but the scope ended up growing to include a number of fixes/improvements to general littlefs testing.

  1. Reworked test framework, no. 3

    The test framework gets a rework again, taking what worked well in the current test framework and throwing out what doesn't.

    The main goals behind this rework were to 1. simplify the framework, even if it means more boilerplate, as this should make it easier to extend with new features, and 2. run the tests as fast as possible.

    Previously I've disregarded test performance, worried that a focus on test performance risks complexity and makes it harder to understand the system being debugged, but my perspective is changing: faster tests => more tests => more confidence => the dark side, er, a safer filesystem. If you've told me previously to parallelize the tests, etc, this is the part where you can say you told me so.

    • Tests incrementally compile, and we don't rebuild lfs.c for every suite

      Previously the tests' build system and runner were entirely self-contained in test.py. On one hand this meant you only needed test.py to build/run the tests, but on the other hand this design was confusing, limiting, and just all around problematic. One big issue was that, being outside of the build system, tests couldn't be built incrementally, and every test suite needed a custom-built version of lfs.c. This led to a slow debugging experience, as each change to lfs.c needed at least 16 recompilations.

      Now the test framework is integrated into the Makefile with separate build steps for applying prettyasserts.py and other scripts, all of which can be built incrementally, significantly reducing the time spent waiting for tests to recompile.

    • runners/test_runner is now its own standalone application

      Previously any extra features/configuration had to be built into the test binaries during compilation. Now there is an explicit test_runner.c which can contain high-level test features that can be engaged at runtime through additional flags.

      This makes it easier to add new test features, but also makes it easier to debug the test_runner itself, as it's no longer hidden inside test.py.

      The actual tests are provided at link-time using a custom linker section, and are still generated by ./scripts/test.py -c

      $ make test-runner
      ...
      $ ./runners/test_runner -l
      suite                      flags   cases       perms
      test_alloc                     -      12       62/70
      test_attrs                     -       4       20/20
      ...
      $ ./runners/test_runner
      running test_alloc_parallel:1g12gg2f3g1ghsj5
      ...
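
      To illustrate the custom linker section approach, here is a minimal, hypothetical sketch (the names and macros are made up for this example and are not littlefs's actual definitions; it relies on GNU ld's automatic __start_/__stop_ symbols for custom sections):

      #include <stdio.h>

      struct test_case {
          const char *name;
          void (*run)(void);
      };

      // GNU ld provides these for the custom "test_cases" section
      extern const struct test_case __start_test_cases[];
      extern const struct test_case __stop_test_cases[];

      #define TEST_CASE(fn) \
          static void fn(void); \
          __attribute__((section("test_cases"), used)) \
          static const struct test_case fn##_entry = {#fn, fn}; \
          static void fn(void)

      TEST_CASE(test_example) {
          printf("hello from test_example\n");
      }

      int main(void) {
          // the runner walks the section, no central registration list needed
          for (const struct test_case *t = __start_test_cases;
                  t != __stop_test_cases;
                  t++) {
              printf("running %s\n", t->name);
              t->run();
          }
          return 0;
      }
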
    • Tests now avoid spawning processes as much as possible

      When you find a bug in C, it often leads to undefined behavior, memory corruption, etc, making the current test process no longer sound. But you also often want to keep running tests to see if there are any trends among the test failures. To accomplish this, the previous test framework ran each test in its own process.

      Unfortunately, process spawning is not really a cheap operation. And with most tests not failing (hopefully), this ends up wasting a significant amount of time just spawning processes.

      Now, with a more powerful test_runner, the test framework tries to run as many tests as possible in a single process, only spawning a new process when a test fails. This is all handled by scripts/test.py, which interacts with runners/test_runner, telling it which tests to run via the low-level --step flag.

    • Powerloss is now simulated with setjmp/longjmp

      As a part of reducing process spawning, powerloss is directly simulated in the test_runner using setjmp/longjmp. Previously powerloss was simulated by killing and restarting the process, which is a simple, heavy-handed solution that works. Slowly.

      Since there can be thousands of powerlosses in a single test, this needed to be moved into the test_runner, especially since powerloss testing is arguably the most important feature of littlefs's test framework.

      As an added plus, the simulated block-device no longer needs to be persisted in the host's filesystem when powerloss testing, and can stay comfortably in the test_runner's RAM. The cost of persisting the block-device could be mitigated by using a RAM-backed tmpfs disk, but this still incurred a cost as all block-device operations would need to go through the OS.

      Using setjmp/longjmp can lead to memory leaks when reentrant tests call malloc, but since littlefs uses malloc in only a handful of convenience functions (littlefs's whole goal is minimal RAM after all), this doesn't seem to have been a problem so far.
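
      For illustration, a hedged sketch of the setjmp/longjmp approach (the names here are invented for the example, not the test_runner's actual interface): the simulated block-device counts down prog/erase operations and, when power is "lost", longjmps back to the harness, which remounts and keeps going in the same process:

      #include <setjmp.h>
      #include <stdio.h>

      static jmp_buf powerloss_jmp;
      static long power_cycles; // operations until simulated powerloss, 0 = never

      // called by the simulated block-device on every prog/erase
      void bd_maybe_powerloss(void) {
          if (power_cycles > 0 && --power_cycles == 0) {
              // "lose power" without leaving the process
              longjmp(powerloss_jmp, 1);
          }
      }

      void run_reentrant_test(void (*test)(void), long cycles) {
          power_cycles = cycles;
          if (setjmp(powerloss_jmp)) {
              // power was lost; the in-RAM block-device keeps its state,
              // so the reentrant test simply remounts and runs again
              printf("powerloss injected, rerunning test\n");
          }
          test();
      }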

    • Tests now run in parallel

      Perhaps the lowest-hanging fruit, tests now run in parallel.

      The exact implementation here is a bit naive/suboptimal, giving each process n/m tests to run for n tests and m cores, but this keeps the process/thread management in the high-level test.py python layer, simplifying thread management and avoiding a multi-threaded test_runner.

      $ ./scripts/test.py ./runners/test_runner -j -v
      using runner: ./runners/test_runner
      ./runners/test_runner --list-cases
      ./runners/test_runner --list-case-paths
      found 17 suites, 130 cases, 4202/4315 permutations
      
      ./runners/test_runner --list-cases
      ./runners/test_runner --list-case-paths
      ./runners/test_runner -s0,,12
      ./runners/test_runner -s1,,12
      ./runners/test_runner -s2,,12
      ./runners/test_runner -s3,,12
      ./runners/test_runner -s4,,12
      ./runners/test_runner -s5,,12
      ./runners/test_runner -s6,,12
      ./runners/test_runner -s7,,12
      ./runners/test_runner -s8,,12
      ./runners/test_runner -s9,,12
      ./runners/test_runner -s10,,12
      ./runners/test_runner -s11,,12
      ...
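
      For reference, a rough sketch of the striping that the -sstart,stop,step invocations above suggest (an assumption based on the flags, not test.py's actual scheduling code): process i of m runs permutations i, i+m, i+2m, and so on:

      #include <stdio.h>

      void run_perm(size_t t) {
          printf("running permutation %zu\n", t);
      }

      // e.g. "-s3,,12" would be start=3, stop unset (open-ended), step=12
      void run_stripe(size_t start, size_t stop, size_t step, size_t total) {
          if (stop == 0) {
              stop = total; // an empty stop means run through to the end
          }
          for (size_t t = start; t < stop; t += step) {
              run_perm(t);
          }
      }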

    The combination of the above improvements allows us to run the tests a lot faster, and/or cram in a lot more tests:

              Test permutations  Runtime (single core)  Runtime (6 cores/12 threads)  Tests per second (single core)  Tests per second (6 cores/12 threads)
    Before    897                51.67 s                31.73 s                       17.36 t/s                       28.27 t/s
    After     4202               1 m 22.26 s            26.19 s                       51.08 t/s (+194.23%)            160.44 t/s (+467.53%)

    (Most of the new permutations come from moving the different test geometries out of CI and into the test_runner. Note that the previous test framework did parallelize builds, which are included in these timings.)

  2. Exhaustive powerloss testing

    In addition to the heuristic-based powerloss testing, the new test_runner can also exhaustively search all possible powerloss scenarios for a given reentrant test.

    To speed this up, the test_runner uses a simulated, copy-on-write block-device (reintroducing emubd), such that all possible code-paths in all possible powerloss scenarios are executed at most once. And, because most of the block-device's state can be shared via copy-on-write operations, each powerloss branch needs at most one additional block of memory in RAM.
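
    To give a feel for the copy-on-write idea, here is a hedged sketch (not emubd's actual API; the names are invented for the example): forking a powerloss branch copies only a table of block pointers, and a block's contents are duplicated only when a branch writes to a shared block:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct cow_block {
        uint32_t refs;   // how many powerloss branches share this block
        uint8_t data[];  // block contents
    } cow_block_t;

    typedef struct cow_bd {
        size_t block_size;
        size_t block_count;
        cow_block_t **blocks; // per-block pointers, shared across branches
    } cow_bd_t;

    // forking a powerloss branch copies only the pointer table
    int cow_bd_fork(cow_bd_t *dst, const cow_bd_t *src) {
        dst->block_size = src->block_size;
        dst->block_count = src->block_count;
        dst->blocks = malloc(src->block_count * sizeof(cow_block_t *));
        if (!dst->blocks) {
            return -1;
        }
        for (size_t i = 0; i < src->block_count; i++) {
            dst->blocks[i] = src->blocks[i];
            if (dst->blocks[i]) {
                dst->blocks[i]->refs += 1;
            }
        }
        return 0;
    }

    // writes copy a block only if it is shared, so a branch costs at most
    // one extra block of RAM per block it actually touches
    int cow_bd_prog(cow_bd_t *bd, size_t block, size_t off,
            const void *buf, size_t size) {
        cow_block_t *b = bd->blocks[block];
        if (!b || b->refs > 1) {
            cow_block_t *nb = malloc(sizeof(cow_block_t) + bd->block_size);
            if (!nb) {
                return -1;
            }
            nb->refs = 1;
            memset(nb->data, 0xff, bd->block_size); // erased state
            if (b) {
                memcpy(nb->data, b->data, bd->block_size);
                b->refs -= 1;
            }
            bd->blocks[block] = nb;
            b = nb;
        }
        memcpy(&b->data[off], buf, size);
        return 0;
    }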

    The runtime still grows exponentially, and we each have a finite lifetime, so it will be more useful to exhaustively search a bounded number of powerlosses. Here's a run of all possible 5-deep powerlosses in the test_move test suite:

    $ ./scripts/test.py ./runners/test_runner test_move -P5 -j
    using runner: ./runners/test_runner -P5
    found 1 suites, 17 cases, 10/10 permutations
    
    running tests: 1/1 suites, 17/17 cases, 10/10 perms, 7981852pls!
    
    done: 10/10 passed, 0/10 failed, 7981852pls!, in 951.48s

    Since it can be a bit annoying to wait 15 minutes to reproduce a test failure, each powerloss scenario is encoded in a leb16 suffix appended to the current test identifier. This, combined with a leb16-encoding of the test's configuration and the test's name, can uniquely identify and reproduce any test run in the test_runner:

    test_dirs_many_reentrant:2gg2cb:k4o6
    ^                        ^  ^^^ ^ ^
    '------------------------|--|||-|-|-- test_dirs_many_reentrant
                             '--|||-|-|--   2 =   0x2 = BLOCK_SIZE
                                '||-|-|-- gg2 = 0x200 = 512
                                 '|-|-|--   c =   0xc = N
                                  '-|-|--   b =   0xb = 11
                                    '-|--  k4 =  0x44 = powerloss after 68 writes
                                      '--  o6 =  0x68 = powerloss after 104 writes
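
    For the curious, here is a decoder for these leb16 suffixes, inferred from the examples above ('0'-'9' and 'a'-'f' encode a final hex nibble, 'g'-'v' encode a nibble with a continuation marker, and nibbles are little-endian); treat this as an assumption rather than the scripts' exact implementation:

    #include <stdint.h>
    #include <stdio.h>

    // returns characters consumed, decoded value in *value
    size_t leb16_decode(const char *s, uintmax_t *value) {
        uintmax_t v = 0;
        unsigned shift = 0;
        size_t i = 0;
        while (s[i]) {
            char c = s[i++];
            unsigned nibble, more;
            if (c >= '0' && c <= '9')      { nibble = c - '0';      more = 0; }
            else if (c >= 'a' && c <= 'f') { nibble = c - 'a' + 10; more = 0; }
            else if (c >= 'g' && c <= 'v') { nibble = c - 'g';      more = 1; }
            else { i--; break; }
            v |= (uintmax_t)nibble << shift;
            shift += 4;
            if (!more) {
                break;
            }
        }
        *value = v;
        return i;
    }

    int main(void) {
        uintmax_t v;
        leb16_decode("gg2", &v); printf("gg2 -> 0x%jx\n", v); // 0x200
        leb16_decode("k4", &v);  printf("k4  -> 0x%jx\n", v); // 0x44
        return 0;
    }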
    

    So once a failing test scenario is found, the exact state of the failure can be quickly reproduced for debugging:

    $ ./scripts/test.py ./runners/test_runner -P2 -b -j
    using runner: ./runners/test_runner -P2
    found 17 suites, 130 cases, 390/400 permutations
    
    running test_alloc: 12/12 cases, 0/0 perms
    running test_attrs: 4/4 cases, 0/0 perms
    running test_badblocks: 4/4 cases, 0/0 perms
    running test_bd: 5/5 cases, 0/0 perms
    running test_dirs: 12/14 cases, 12/18 perms, 39477pls!, 1/18 failures
    
    done: 12/18 passed, 1/18 failed, 39477pls!, in 23.46s
    
    tests/test_dirs.toml:180:failure: test_dirs_many_reentrant:2gg2cb:k4o6 (BLOCK_SIZE=512, N=11) failed
    powerloss test_dirs_many_reentrant:2gg2cb:k4k6
    powerloss test_dirs_many_reentrant:2gg2cb:k4l6
    powerloss test_dirs_many_reentrant:2gg2cb:k4m6
    powerloss test_dirs_many_reentrant:2gg2cb:k4n6
    powerloss test_dirs_many_reentrant:2gg2cb:k4o6
    tests/test_dirs.toml:196:assert: assert failed with false, expected eq true
            assert(err == 0 || err == LFS_ERR_EXIST);
    
    $ ./scripts/test.py ./runners/test_runner test_dirs_many_reentrant:2gg2cb:k4o6 --gdb
    ...
    196             assert(err == 0 || err == LFS_ERR_EXIST);
    (gdb)

    Unfortunately, the current tests are not the best designed for exhaustive powerloss testing. Some of them, test_files and test_interspersed specifically, write large files a byte at a time. Under exhaustive powerloss testing these result in, well, a lot of powerlosses, but outside of the writes that change data-structures, they don't reveal anything interesting. This is something that can probably be improved over time.

    Exhaustively testing all powerlosses at a depth of 1 takes 12.79 minutes with 84,484 total powerlosses.

    Exhaustively testing all powerlosses at a depth of 2 takes at least 4 days, and is still running... I'll let you know when it finishes...

  3. scripts/bench.py and runners/bench_runner

    This PR introduces scripts/bench.py and runners/bench_runner, siblings to scripts/test.py and runners/test_runner, for measuring the performance of littlefs. Instead of reporting pass/fail, the bench_runner reports the total number of bytes read, programmed, and erased during a bench case. This can be useful for comparing different littlefs implementations, as these numbers map directly to hardware-dependent performance in IO-bound applications.

    One feature that makes this useful, added to both the bench_runner and test_runner, is a flexible configuration system evaluated at runtime. This has the downside of limiting configurable bench/test defines to uintmax_t integers, but makes it easy to quickly test/compare/reproduce different configurations:

    $ ./scripts/bench.py ./runners/bench_runner bench_file_write -Gnor --list-defines
    READ_SIZE=1
    PROG_SIZE=1
    BLOCK_SIZE=4096
    BLOCK_COUNT=256
    CACHE_SIZE=64
    LOOKAHEAD_SIZE=16
    BLOCK_CYCLES=-1
    ERASE_VALUE=255
    ERASE_CYCLES=0
    BADBLOCK_BEHAVIOR=0
    POWERLOSS_BEHAVIOR=0
    CHUNK_SIZE=64
    ORDER=0,1,2
    SIZE=131072
    $ ./scripts/bench.py ./runners/bench_runner bench_file_write -Gnor -DORDER=0 -DSIZE="range(0,24576,64)"
    using runner: ./runners/bench_runner -Gnor -DORDER=0 '-DSIZE=range(0,24576,64)'
    found 1 suites, 1 cases, 384/384 permutations
    
    running benches: 1/1 suites, 1/1 cases, 384/384 perms
    
    done: 4801420 readed, 4729056 proged, 5492736 erased, in 0.08s
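
    As a toy illustration of runtime-evaluated defines (this is not the runner's actual machinery, which is described in the commits below and uses layered lookup tables with callbacks): each define is a callback returning a uintmax_t, so one define can be derived from another without recompiling anything:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct define;
    typedef uintmax_t (*define_cb_t)(const struct define *defines, size_t count);

    struct define {
        const char *name;
        define_cb_t cb; // evaluated lazily at runtime
    };

    uintmax_t define_get(const struct define *defines, size_t count,
            const char *name) {
        for (size_t i = 0; i < count; i++) {
            if (strcmp(defines[i].name, name) == 0) {
                return defines[i].cb(defines, count);
            }
        }
        return 0;
    }

    static uintmax_t block_size_cb(const struct define *d, size_t n) {
        (void)d; (void)n;
        return 4096;
    }

    // CACHE_SIZE is derived from BLOCK_SIZE, resolved at runtime
    static uintmax_t cache_size_cb(const struct define *d, size_t n) {
        return define_get(d, n, "BLOCK_SIZE") / 64;
    }

    int main(void) {
        struct define defines[] = {
            {"BLOCK_SIZE", block_size_cb},
            {"CACHE_SIZE", cache_size_cb},
        };
        printf("CACHE_SIZE=%ju\n", define_get(defines, 2, "CACHE_SIZE"));
        return 0;
    }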

    At the moment I've only added a handful of benchmarks, though the number may increase in the future. The goal isn't to maintain a fully cohesive benchmark suite, as much as it is to have a set of tools for analyzing specific performance bottlenecks.

  4. Reworked scripts/summary.py and other scripts to be a bit more flexible

    This mainly means scripts/summary.py is no longer hard-wired to work with the compile-time measurements, allowing it to be used with other results such as benchmarks, though this comes with the cost of a large number of flags for controlling the output.

    Each measurement script also now comes with a *-diff Makefile rule for quick comparisons.

    $ make summary
    ./scripts/code.py lfs.o lfs_util.o -q  -o lfs.code.csv
    ./scripts/data.py lfs.o lfs_util.o -q  -o lfs.data.csv
    ./scripts/stack.py lfs.ci lfs_util.ci -q  -o lfs.stack.csv
    ./scripts/structs.py lfs.o lfs_util.o -q  -o lfs.structs.csv
    ./scripts/summary.py lfs.code.csv lfs.data.csv lfs.stack.csv lfs.structs.csv -fcode=code_size -fdata=data_size -fstack=stack_limit --max=stack -fstructs=struct_size -Y
                                code    data   stack structs
    TOTAL                      25614       -    2176     908
    $ make summary-diff
    ./scripts/summary.py <(./scripts/code.py ./lfs.o ./lfs_util.o -q -o-) <(./scripts/data.py ./lfs.o ./lfs_util.o -q -o-) <(./scripts/stack.py ./lfs.ci ./lfs_util.ci -q -o-) <(./scripts/structs.py ./lfs.o ./lfs_util.o -q -o-) -fcode=code_size -fdata=data_size -fstack=stack_limit --max=stack -fstructs=struct_size -Y -d <(./scripts/summary.py ./lfs.code.csv ./lfs.data.csv ./lfs.stack.csv ./lfs.structs.csv -q -o-)
                               ocode   odata  ostack    ostructs   ncode   ndata  nstack    nstructs   dcode   ddata  dstack    dstructs
    TOTAL                      25614       -    2176         908   25614       -    2176         908      +0      +0      +0          +0
  5. Reworked scripts/cov.py to take advantage of the --json-format introduced in GCC 9

    It's a bit concerning that this was a breaking change in gcov's API, albeit on a major version, but the new --json-format is much easier to work with.

    It's also worth noting this PR includes a change in ideology around coverage measurement. Instead of collecting coverage from as many sources as possible in CI, coverage is only collected from the central make test run. This will result in lower coverage numbers than previously, but these are the coverage numbers we actually care about: test coverage via easy-to-reproduce and easy-to-isolate tests.

    This also simplifies coverage collection in CI, which is a plus.

  6. scripts/perf.py and scripts/perfbd.py

    perf.py was added as an experiment with Linux's perf tool, which uses an interesting method of sampling performance counters to build an understanding of the performance of a system. Unfortunately this isn't the most useful measurement for littlefs, as we should expect littlefs's performance to be dominated by IO overhead. But it may still be useful for tracking down CPU bottlenecks.

    perfbd.py takes the ideas in Linux's perf tool and applies them to the bench_runner. Instead of sampling performance counters, we can sample littlefs's trace output to find low-level block-device operations. Combining this with stack traces provided by the backtrace function, we can propagate IO costs up to their callers, building a useful map of the source of IO operations in a given benchmark run:

    $ ./scripts/bench.py ./runners/bench_runner bench_file_write -Gnor -DORDER=0 -DSIZE="range(0,24576,64)" -t lfs.bench.trace --trace-backtrace --trace-freq=10000
    using runner: ./runners/bench_runner -Gnor -tlfs.bench.trace --trace-backtrace --trace-freq=10000 -DORDER=0 '-DSIZE=range(0,24576,64)'
    found 1 suites, 1 cases, 384/384 permutations
    
    running benches: 1/1 suites, 1/1 cases, 384/384 perms
    
    done: 4801420 readed, 4729056 proged, 5492736 erased, in 0.10s
    $ ./scripts/perfbd.py ./runners/bench_runner lfs.bench.trace -Flfs.c -s
    function                      readed  proged  erased                                           
    lfs_bd_erase                       0       0   36864                                           
    lfs_format                       372      52   28672                                           
    lfs_dir_orphaningcommit           64      52   28672                                           
    lfs_dir_relocatingcommit          64      52   28672                                           
    lfs_dir_compact                   28      52   28672                                           
    lfs_fs_deorphan                    0       0   28672                                           
    lfs_file_rawwrite               3268    3200    8192                                           
    lfs_file_write                  3268    3200    8192                                           
    lfs_file_flushedwrite           3268    3200    4096                                           
    lfs_file_relocate                516     512    4096                                           
    lfs_bd_flush                    2752    3252       0                                           
    lfs_bd_prog                     2752    3200       0                                           
    lfs_dir_commit                    64      52       0                                           
    lfs_dir_commitcrc                 64      52       0                                           
    lfs_bd_read                     3688       0       0                                           
    lfs_bd_cmp.constprop.0          2752       0       0                                           
    lfs_dir_fetchmatch               868       0       0                                           
    lfs_dir_fetch                    856       0       0                                           
    lfs_alloc                        516       0       0                                           
    lfs_fs_rawtraverse               516       0       0                                           
    lfs_file_open                     36       0       0                                           
    lfs_file_rawopencfg               36       0       0                                           
    lfs_mount                          8       0       0                                           
    lfs_dir_find                       4       0       0                                           
    lfs_dir_get                        4       0       0                                           
    lfs_dir_getslice                   4       0       0                                           
    lfs_file_close                     4       0       0                                           
    lfs_file_rawclose                  4       0       0                                           
    lfs_file_rawsync                   4       0       0                                           
    TOTAL                          25780   16876  204800                                           

    It's worth noting that these numbers are samples. They are a subset and don't add up to the total IO cost of the benchmark. But they are still useful as a metric for understanding benchmark performance.

    You could parse the entire trace output without sampling, but this would be quite slow and not really show you any more info.
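
    As a rough illustration of the trace+backtrace sampling described above (assumed for the example, not perfbd.py's or the bench_runner's actual code), emitting a glibc backtrace alongside each block-device trace line is enough for an offline script to charge the IO bytes to every frame on the stack:

    #include <execinfo.h>
    #include <stdio.h>
    #include <stdlib.h>

    void trace_bd_read(int block, int off, int size) {
        // the trace line itself: which block was read and how much
        printf("trace: lfs_bd_read(block=%d, off=%d, size=%d)\n",
                block, off, size);

        // capture the current call stack; a sampling script can then
        // propagate the read cost to each caller on the stack
        void *frames[32];
        int depth = backtrace(frames, 32);
        char **symbols = backtrace_symbols(frames, depth);
        if (symbols) {
            for (int i = 0; i < depth; i++) {
                printf("trace:     at %s\n", symbols[i]);
            }
            free(symbols);
        }
    }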

  7. scripts/plot.py and scripts/plotmpl.py

    Added plot.py and plotmpl.py for quick plotting of littlefs measurements in the terminal and with Matplotlib. I think these will mostly be useful for looking for growth rates in benchmark results. And also future documentation.

    $ ./scripts/bench.py ./runners/bench_runner bench_file_write -Gnor -DORDER=0 -DSIZE="range(0,24576,64)" -o lfs.bench.csv
    using runner: ./runners/bench_runner -Gnor -DORDER=0 '-DSIZE=range(0,24576,64)'
    found 1 suites, 1 cases, 384/384 permutations
    
    running benches: 1/1 suites, 1/1 cases, 384/384 perms
    
    done: 4801420 readed, 4729056 proged, 5492736 erased, in 0.25s
    $ ./scripts/plotmpl.py lfs.bench.csv -o lfs.bench.svg -tbench_file_write -l -xSIZE -ybench_readed -ybench_proged -ybench_erased --x2 --xunits=B --y2 --yunits=B --github
    updated lfs.bench.svg, 3 datasets, 1152 points

    (example plots rendered for GitHub's light and dark themes)

  8. scripts/tracebd.py, scripts/tailpipe.py, scripts/teepipe.py

    These are some extra scripts for interacting with/viewing littlefs's trace output.

    tailpipe.py and teepipe.py behave similarly to Unix's tail and tee programs, but work a bit better with Unix pipes, with resumability and fast paging.

    The most interesting script is tracebd.py, which parses littlefs's trace output for block-device operations and renders it as ascii art. I've used this sort of block-device operation rendering previously for a quick demo and it can be surprisingly useful for understanding how filesystem operations interact with the block-device.

    $ mkfifo trace
    $ ./scripts/bench.py ./runners/bench_runner bench_file_write -Gnor -DORDER=0 -DSIZE="range(0,24576,64)" -t trace
    ...
    $ ./scripts/tracebd.py trace -c10000 -z                 
    e.........................................e.....................................
    e.........................................e.....................................
    e.........................................ee....................................
    e.........................................ee....................................
    e.........................................ee....................................
    e.........................................ee....................................
    e.........................................ee....................................
    e.........................................ee....................................
    e.........................................ee....................................
    e.........................................ee....................................
    e.........................................ee....................................
    e.........................................eee...................................
    e.........................................eee...................................
    e.........................................eee...................................
    e.........................................eee...................................
    e.........................................eee...................................
  9. Changed lfs.a -> liblfs.a in default build rule

    The lib* prefix is usually required by the linker, so I suspect this won't break anything. But it's worth mentioning this change in case someone relies on the current build target.

  10. Added a make help rule

    I think I first saw this here; this self-documenting Makefile rule gives some of the useful Makefile rules a bit more discoverability.

  11. Adopted script changes in CI, added a bot comment on PRs

    Thanks to GitHub Actions, we have a lot of info about builds in CI. Unfortunately, statuses on GitHub have become harder to find with each UI change. To help keep this info discoverable I've added an automatically generated comment that @geky-bot should post after a successful CI run. Hopefully this will contribute to PRs without being too annoying.

    You can see some example comments on the PR I created on my test fork:
    WIP NULL test pr geky/littlefs-test-repo#4


The increased testing did find a couple of bugs: eba5553 and 0b11ce0. Their commit messages have more details on the bugs and their fixes. And with the new test identifiers I can tell you the exact state that will trigger the failures:

  • test_relocations_reentrant_renames:112gg261dk1e3f3:123456789abcdefg1h1i1j1k1l1m1n1o1p1q1r1s1t1u1v1g2h2i2j2k2l2m2n2o2p2q2r2s2t2 - eba5553 - found with linear heuristic powerlosses
  • test_dirs_many_reentrant:2gg2cb:k4o6 - 0b11ce0 - found with 2-deep exhaustive powerlosses

geky and others added 30 commits April 16, 2022 13:50
This is to try a different design for testing. The goals are to make the
test infrastructure a bit simpler, with clear stages for building and
running, and faster, by avoiding rebuilding lfs.c n times.
This moves defines entirely into the runtime of the test_runner,
simplifying things and reducing the amount of generated code that needs
to be built, at the cost of limiting test-defines to uintmax_t types.

This is implemented using a set of index-based scopes (created by
test.py) that allow different layers to override defines from other
layers, accessible through the global `test_define` function.

layers:
1. command-line overrides
2. per-case defines
3. per-geometry defines
- Indirect index map instead of bitmap+sparse array
- test_define_t and test_type_t
- Added back conditional filtering
- Added suite-level defines and filtering
- Added filtering based on suite, case, perm, type, geometry
- Added --skip, --count, and --every (will be used for parallelism)
- Implemented --list-defines
- Better helptext for flags with arguments
- Other minor tweaks
In the test-runner, defines are parameterized constants (limited
to integers) that are generated from the test suite tomls resulting
in many permutations of each test.

In order to make this efficient, these defines are implemented as
multi-layered lookup tables, using per-layer/per-scope indirect
mappings. This lets the test-runner and test suites define their
own defines with compile-time indexes independently. It also makes
building of the lookup tables very efficient, since they can be
incrementally populated as we expand the test permutations.

The four current define layers and when we need to build them:

layer                           defines         predefine_map   define_map
user-provided overrides         per-run         per-run         per-suite
per-permutation defines         per-perm        per-case        per-perm
per-geometry defines            per-perm        compile-time    -
default defines                 compile-time    compile-time    -
- Added --disk/--trace/--output options for information-heavy debugging

- Renamed --skip/--count/--every to --start/--stop/--step.

  This matches common terms for ranges, and frees --skip for being used
  to skip test cases in the future.

- Better handling of SIGTERM, now all tests are killed, reported as
  failures, and testing is halted regardless of -k.

  This is a compromise: you throw away the rest of the tests, which
  is normally what -k is for, but it prevents annoying-to-terminate
  processes when debugging, which is a very interactive process.
- Expanded test defines to allow for lists of configurations

  These are useful for changing multi-dimensional test configurations
  without leading to extremely large and less useful configuration
  combinations.

- Made warnings more visible during test parsing

- Add lfs_testbd.h to implicit test includes

- Fixed issue with not closing files in ./scripts/explode_asserts.py

- Add `make test_runner` and `make test_list` build rules for
  convenience
- Added internal tests, which can run tests inside other source files,
  allowing access to "private" functions and data

  Note this required a special bit of handling: we define and later
  undefine test configurations so as to not pollute the namespace of the
  source file, since it can end up with test cases from different
  suites/configuration namespaces.

- Removed unnecessary/unused permutation argument to generated test
  functions.

- Some cleanup to progress output of test.py.
Previously test defines were implemented using layers of index-mapped
uintmax_t arrays. This worked well for lookup, but limited defines to
constants computed at compile-time. Since test defines themselves are
actually calculated at _run-time_ (yeah, they have deviated quite
a bit from the original, compile-time evaluated defines, which makes
the name make less sense), this means defines can't depend on other
defines, which was limiting since a lot of test defines relied on
defines generated from the geometry being tested.

This new implementation uses callbacks for the per-case defines. This
means they can easily contain full C statements, which can depend on
other test defines. This does mean you can create infinitely-recursive
defines, but the test-runner will just break at run-time, so don't do that.

One concern is that there might be a performance hit for evaluating all
defines through callbacks, but if there is it is well below the noise
floor:

- constants: 43.55s
- callbacks: 42.05s
- Added --exec for wrapping the test-runner with external commands, such as
  Qemu or Valgrind.

- Added --valgrind, which just aliases --exec=valgrind with a few extra
  flags useful during testing.

- Dropped the "valgrind" type for tests. These aren't separate tests
  that run in the test-runner, and I don't see a need for disabling
  Valgrind for any tests. This can be added back later if needed.

- Readded support for dropping directly into gdb after a test failure,
  either at the assert failure, entry point of test case, or entry point
  of the test runner with --gdb, --gdb-case, or --gdb-main.

- Added --isolate for running each test permutation in its own process,
  this is required for associating Valgrind errors with the right test
  case.

- Fixed an issue where an explicit test identifier conflicted with
  per-stage test identifiers generated as a part of --by-suite and
  --by-case.
This mostly required names for each test case, declarations of
previously-implicit variables since the new test framework is more
conservative with what it declares (the small extra effort to add
declarations is well worth the simplicity and improved readability),
and tweaks to work with not-really-constant defines.

Also renamed test_ -> test, replacing the old ./scripts/test.py;
unfortunately git seems to have had a hard time with this.
This simplifies the interaction between code generation and the
test-runner.

In theory it also reduces compilation dependencies, but internal tests
make this difficult.
A small mistake in test.py's control flow meant the failing test job
would successfully kill all other test jobs, but then humorously start
up a new process to continue testing.
GCC is a bit annoying here: it can't generate .ci files without
generating the related .o files, though I suppose the alternative risks
duplicating a large amount of compilation work (littlefs is really
a small project).

Previously we rebuilt the .o files anytime we needed .ci files
(callgraph info used for stack.py). This changes it so we always
build .ci files as a side-effect of compilation. This is similar
to the .d file generation, though it may be annoying if the system
cc doesn't support -fcallgraph-info.
This also adds coverage support to the new test framework, which due to
reduction in scope, no longer needs aggregation and can be much
simpler. Really all we need to do is pass --coverage to GCC, which
builds its .gcda files during testing in a multi-process-safe manner.

The addition of branch coverage leverages information that was available
in both lcov and gcov.

This was made easier with the addition of the --json-format to gcov
in GCC 9.0; however, the lax backwards compatibility of gcov's
intermediary options is a bit concerning. Hopefully --json-format
sticks around for a while.
These scripts can't easily share the common logic, but separating
field details from the print/merge/csv logic should make the common
part of these scripts much easier to create/modify going forward.

This also tweaked the behavior of summary.py slightly.
On one hand this isn't very different from the source annotation in
gcov; on the other hand, I find it a bit more readable after a bit of
experimentation.
Also renamed GCI -> CI, this holds .ci files, though there is a risk
of confusion with continuous integration.

Also added unused but generated .ci files to clean rule.
- Renamed explode_asserts.py -> pretty_asserts.py, this name is
  hopefully a bit more descriptive
- Small cleanup of the parser rules
- Added recognition of memcmp/strcmp => 0 statements, generating
  the relevant memory-inspecting assert messages

I attempted to fix the incorrect column numbers for the generated
asserts, but unfortunately this didn't go anywhere and I don't think
it's actually possible.

There is no column control analogous to the #line directive. I thought
you might be able to intermix #line directives to put arguments at the
right column like so:

    assert(a == b);

    __PRETTY_ASSERT_INT_EQ(
    #line 1
           a,
    #line 1
                b);

But this doesn't work as preprocessor directives are not allowed in
macro arguments in standard C. Unfortunately this is probably not
possible to fix without better support in the language.
Yes this is more expensive, since small programs need to rewrite the
whole block in order to conform to the block device API. However, it
reduces code duplication and keeps all of the test-related block device
emulation in lfs_testbd.

Some people have used lfs_filebd/lfs_rambd as a starting point for new block
devices and I think it should be clear that erase does not need to have side
effects. Though to be fair this also just means we should have more
examples of block devices...
On one hand this seems like the wrong place for these tests, on the
other hand, it's good to know that the block device is behaving as
expected when debugging the filesystem.

Maybe this should be moved to an external program for users to test
their block devices in the future?
The main change here from the previous test framework design is:

1. Powerloss testing remains in-process, speeding up testing.

2. The state of a test, including all powerlosses, is encoded in the
   test id + leb16 encoded powerloss string. This means exhaustive
   testing can be run in CI, but then easily reproduced locally with
   full debugger support.

   For example:

   ./scripts/test.py test_dirs#reentrant_many_dir#10#1248g1g2 --gdb

   Will run the test test_dir, case reentrant_many_dir, permutation #10,
   with powerlosses at 1, 2, 4, 8, 16, and 32 cycles. Dropping into gdb
   if an assert fails.

The changes to the block-device are a work-in-progress for a
lazily-allocated/copy-on-write block device that I'm hoping will keep
exhaustive testing relatively low-cost.
With more features being added to test.py, the one-line status is
starting to get quite long and exceed the ~80-column readability
heuristic. To make this worse, it clobbers the terminal output
when the terminal is not wide enough.

The simple solution is to disable line-wrapping, potentially printing
some garbage if line-wrapping-disable is not supported, but also
printing a final status update to fix any garbage and avoid a race
condition where the script would show a non-final status.

Also added --color which disables any of this attempting-to-be-clever
stuff.
Before this was available implicitly by supporting both rambd and filebd
as backends, but now that testbd is a bit more complicated and no longer
maps directly to a block-device, this needs to be explicitly supported.
These have no real purpose other than slowing down the simulation
for inspection/fun.

Note this did reveal an issue in pretty_asserts.py which was clobbering
feature macros. Added explicit, and maybe a bit hacky, #undef _FEATURE_H
to avoid this.
As expected this takes a significant amount of time (~10 minutes for all
1 powerlosses, >10 hours for all 2 powerlosses) but this may be reducible in
the future by optimizing tests for powerloss testing. Currently
test_files does a lot of work that doesn't really have testing value.
… fifos

This mostly involved futzing around with some of the less intuitive
parts of Unix's named-pipes behavior.

This is a bit important since the tests can quickly generate several
gigabytes of trace output.
Based on a handful of local hacky variations, this sort of trace
rendering is surprisingly useful for getting an understanding of how
different filesystem operations interact with the underlying
block-device.

At some point it would probably be good to reimplement this in a
compiled language. Parsing and tracking the trace output quickly
becomes a bottleneck with the amount of trace output the tests
generate.

Note also that since tracebd.py runs on trace output, it can also be
used to debug logged block-device operations post-run.
@geky added the "needs minor version" (new functionality only allowed in minor versions) and "tooling" labels on Dec 2, 2022
@geky (Member, Author) commented on Dec 5, 2022

After only 4 days, 20 hours, with 144,437,889 powerlosses, the exhaustive powerloss testing with all 2-deep powerlosses finished successfully:

$ ./scripts/test.py runners/test_runner -b -j -P2
using runner: runners/test_runner -P2
found 17 suites, 130 cases, 390/400 permutations

running test_alloc: 12/12 cases, 0/0 perms
running test_attrs: 4/4 cases, 0/0 perms
running test_badblocks: 4/4 cases, 0/0 perms
running test_bd: 5/5 cases, 0/0 perms
running test_dirs: 14/14 cases, 18/18 perms, 87700pls!
running test_entries: 8/8 cases, 0/0 perms
running test_evil: 8/8 cases, 0/0 perms
running test_exhaustion: 5/5 cases, 0/0 perms
running test_files: 10/10 cases, 245/245 perms, 7559438pls!
running test_interspersed: 4/4 cases, 30/30 perms, 123954557pls!
running test_move: 17/17 cases, 10/10 perms, 6120pls!
running test_orphans: 2/2 cases, 13/13 perms, 73319pls!
running test_paths: 13/13 cases, 0/0 perms
running test_relocations: 4/4 cases, 24/24 perms, 142460pls!
running test_seek: 6/6 cases, 15/15 perms, 12501750pls!
running test_superblocks: 14/14 cases, 15/15 perms, 64018pls!
running test_truncate: 7/7 cases, 20/20 perms, 48527pls!

done: 390/390 passed, 0/390 failed, 144437889pls!, in 418702.72s

- Moved to Ubuntu 22.04

  This notably means we no longer have to bend over backwards to
  install GCC 10!

- Changed shell in gha to include the verbose/undefined flags, making
  debugging gha a bit less painful

- Adopted the new test.py/test_runners framework, which means no more
  heavy recompilation for different configurations. This reduces the test job
  runtime from >1 hour to ~15 minutes, while increasing the number of
  geometries we are testing.

- Added exhaustive powerloss testing, because of time constraints this
  is at most 1pls for general tests, 2pls for a subset of useful tests.

- Limited coverage measurements to `make test`

  Originally I tried to maximize coverage numbers by including coverage
  from every possible source, including the more elaborate CI jobs which
  provide an extra level of fuzzing.

  But this missed the purpose of coverage measurements, which is to find
  areas where test cases can be improved. We don't want to improve coverage
  by just shoving more fuzz tests into CI; we want to improve coverage by
  adding specific, intentional test cases that, if they fail, highlight
  the reason for the failure.

  With this perspective, maximizing coverage measurement in CI is
  counter-productive. This change makes it so the reported coverage is
  always less than actual CI coverage, but acts as a more useful metric.

  This also simplifies coverage collection, so that's an extra plus.

- Added benchmarks to CI

  Note this doesn't suffer from inconsistent CPU performance because our
  benchmarks are based on purely simulated read/prog/erase measurements.

- Updated the generated markdown table to include line+branch coverage
  info and benchmark results.
For long-running processes (testing with >1pls) these logs can grow to
multiple gigabytes; humorously, we never access more than the last n lines
as requested by --context. Piping the stdout with --stdout does not use
additional RAM.
The littlefs CI is actually in a nice state that generates a lot of
information about PRs (code/stack/struct changes, line/branch coverage
changes, benchmark changes), but GitHub's UI has changed over time to
make CI statuses harder to find for some reason.

This bot comment should hopefully make this information easy to find
without creating too much noise in the discussion. If not, this can
always be changed later.
changeprefix.py only works on prefixes, which is a bit of a problem for
flags in the workflow scripts, requiring extra handling to not hide the prefix
from changeprefix.py
Two introduced flags, -fcallgraph-info=su for stack analysis and
-ftrack-macro-expansion=0 for cleaner prettyasserts.py warnings, are
unfortunately not supported in Clang.

The override vars in the Makefile meant it wasn't actually possible to
remove these flags for Clang testing, so this commit changes those vars
to normal, non-overriding vars. This means `make CFLAGS=-Werror` and
`CFLAGS=-Werror make` behave _very_ differently, but this is just an
unfortunate quirk of make that needs to be worked around.
- Renamed struct_.py -> structs.py again.

- Removed lfs.csv, instead preferring script-specific csv files.

- Added *-diff make rules for quick comparison against a previous
  result, results are now implicitly written on each run.

  For example, `make code` creates lfs.code.csv and prints the summary, which
  can be followed by `make code-diff` to compare changes against the saved
  lfs.code.csv without overwriting.

- Added nargs=? support for -s and -S, which now use a per-result _sort
  attribute to decide the sort if fields are unspecified.
Mostly for benchmarking, this makes it easy to view and compare runner
results similarly to other csv results.
The linear powerloss heuristic provides very good powerloss coverage
without a significant runtime hit, so there's really no reason to run
the tests without -Plinear.

Previous behavior can be accomplished with an explicit -Pnone.
lfs_emubd_getreaded      -> lfs_emubd_readed
lfs_emubd_getproged      -> lfs_emubd_proged
lfs_emubd_geterased      -> lfs_emubd_erased
lfs_emubd_getwear        -> lfs_emubd_wear
lfs_emubd_getpowercycles -> lfs_emubd_powercycles
When you add a function to every benchmark suite, you know it should
probably be provided by the benchmark runner itself. That being said,
randomness in tests/benchmarks is a bit tricky because it needs to be
strictly controlled and reproducible.

No global state is used, allowing tests/benches to maintain multiple
randomness streams, which can be useful for checking results during a run.

There's an argument for having global prng state in that the prng could
be preserved across power-loss, but I have yet to see a use for this,
and it would add a significant requirement to any future test/bench runner.
…ground

The difference between ggplot's gray and GitHub's gray was a bit jarring.

This also adds --foreground and --font-color for this sort of additional
color control without needing to add a new flag for every color scheme
out there.
@geky force-pushed the test-and-bench-runners branch 2 times, most recently from 076f871 to 17c9665, on December 16, 2022
Driven primarily by a desire to compare measurements of different runtime
complexities (it's difficult to fit O(n) and O(log n) on the same plot),
this adds the ability to nest subplots in the same .svg, which try to align
as much as possible. This turned out to be surprisingly complicated.

As a part of this, adopted matplotlib's relatively recent
constrained_layout, which behaves much more consistently.

Also dropped --legend-left, no one should really be using that.
As well as --legend* and --*ticklabels. Mostly for close feature parity, making
it easier to move plots between plot.py and plotmpl.py.
- Added support for negative numbers in the leb16 encoding with an
  optional 'w' prefix.

- Changed prettyasserts.py rule to .a.c => .c, allowing other .a.c files
  in the future.

- Updated .gitignore with missing generated files (tags, .csv).

- Removed suite-namespacing of test symbols, these are no longer needed.

- Changed test define overrides to have higher priority than explicit
  defines encoded in test ids. So:

    ./runners/bench_runner bench_dir_open:0f1g12gg2b8c8dgg4e0 -DREAD_SIZE=16

  Behaves as expected.

  Otherwise it's not easy to experiment with known failing test cases.

- Fixed issue where the -b flag ignored explicit test/bench ids.
This allows debugging strategies such as binary searching for the point
of "failure", which may be more complex than simply failing an assert.
@geky added this to the v2.6 milestone on Apr 17, 2023
@geky removed the "needs minor version" (new functionality only allowed in minor versions) label on Apr 26, 2023
@geky changed the base branch from master to devel on April 26, 2023
@geky merged commit 0a7eca0 into devel on Apr 26, 2023
@geky mentioned this pull request on May 1, 2023