Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quality improvements and best current practices #3

Merged
merged 30 commits into from
Oct 10, 2024

Conversation

SmithChart
Copy link
Member

This PR adds lots of suggestions done by @Bastian-Krause aiming at improving the overall quality of the test suite.

Our RAUC install will not change the partition layout during install.
Thus there is no need to check the size of the same partitions in both
slots.
Using `lsblk --json` we can directly parse the tools output and must not
rely on parsing shell output.
Using `findmnt --json` we can directly parse the tools output.
This is considered better practice than parsing the shell output of
`df`.
The previous test very bluntly only tested if the I2C-Subsystem in the
kernel has been loaded at all. But this does not tell us anything about
the actual devices on the bus we want to use.

The new tests instead try to use the I2C devices the way userspace
would.
Actually none of the I2C devices are directly used by userspace:

* The EEPROMs are read by the bootloader and their information
  are added to the devicetree's chosen_node.
  But for the EEPROMs we can at least do a read, so we can check
  if this works.
* The PMIC is used inside the kernel.
  So we will only check the hardware has been probed.
* The USB-Hub is also used inside the kernel
  So we will only check the hardware has been probed.
The previous test very bluntly only tested if the SPI-subsystem in the
kernel has been loaded at all. But this does not tell us anything about
the actual devices on the bus we want to use.

The new tests instead try to use the SPI devices the way userspace
would.

* The Ethernet switch is managed by DSA.
  By successfully reading the statistics we can assume that the SPI
  communication to the Ethernet Switch works.
* The ADC is managed by IIO.
  By successfully reading a non-zero ADC reading we can assume that
  we are reading actual values from the ADC.
* The LCD is managed by drm.
  We only check if the correct driver has been probed.
Without the mmc the system would not boot in the first place.
Additionally we have tests in `test_filesystems.py` that check if the
partitions on the mmc have been probed correctly.
If these succeed we can assume that mmc-subsystem was loaded correctly.
This is just another representation of the configuration EEPROMs present
on i2c-0 and i2c2. Both are not used by userspace directly.

So there is no need to check if these nodes in `/sys` exist.
SmithChart and others added 7 commits September 11, 2024 17:14
Until now we only tested, if the endpoint provided a somewhat useful
value.
By also comparing the value to the output of `sensors` we can check the
complete `tacd` stack.
The original test case contained a lot of code to handle different
format strings. But it's unclear why that all that complexity was
actually needed.

This change removes all the complex format handling and reduces it to a
bare minimum for this DUT.
In some places we have been using `stdout,_ rc = shell.run()` followed
by `assert rc == 0`.
The better way to do this is by using `stdout = shell.run_check()` since
this removes a bit of boilerplate in this places.
By disconnecting the USB drive during the test we can be sure that Linux
has dropped all buffers for the USB storage.
And thus that this test not only tests buffers somewhere in the kernel
without actually writing data to the USB drive.

Since we already have the _eet_ connected to the TAC we can simply use
it to disconnect and connect the USB-stick during the test.
This test makes sure the third state (On, Off, OffFloating) of the TACs
power switch works electrically.

This is especially important since this state is not exposed on the web
interface (or rather only in the API).

With this test we make sure that the load is actually engaged, when the
power switch is "Off" - and that we can also switch it to "OffFloating".
If we do not overwrite the parent's __attrs_post_init__(), we do not
need to call it.
Copy link
Member

@Bastian-Krause Bastian-Krause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. Here are some minor suggestions:

tests/test_filesystems.py Outdated Show resolved Hide resolved
tests/test_filesystems.py Outdated Show resolved Hide resolved
tests/test_filesystems.py Outdated Show resolved Hide resolved
tests/test_filesystems.py Outdated Show resolved Hide resolved
tests/test_filesystems.py Outdated Show resolved Hide resolved
tests/test_network.py Outdated Show resolved Hide resolved
tests/test_userspace.py Outdated Show resolved Hide resolved
tests/test_linux_hardware.py Outdated Show resolved Hide resolved
tests/test_tacd.py Outdated Show resolved Hide resolved
tests/helper.py Show resolved Hide resolved
Running commands in the shell in the background is merely a hack to have
commands run in parallel:
* You have to take care of `stdout` and `stderr` not messing up output of
  other programs.
* You have to track the `pid` yourself.

Using `systemd-run` solves these problems.

This commit also introduces a context manager that wraps around
`systemd-run` and makes test cases very readable.
Inferring the block device from the USB path at runtime is not needed,
since the path in `/dev/disk/by-path/` will be stable for a given USB
storage device in a given USB port.

This way we get rid of a few extra commands.
But more important: We get rid of pipes in shell commands, that may fail
without failing the whole command.
The fixture `prepare_network` tears down the whole network setup of the
TAC and replaces it with it's own setup. At the end of the test the
process is reversed.

But to make sure that the TAC is actually online we should at least wait
until we have reached the same conditions as in the `network` state of
the strategy.
This way, if following tests rely on the TAC being online we can still
fulfill that.
Using `try .. finally` we can make sure that we try to clean up after
our test cases run - even if they fail.
Otherwise this line would only be a no-op.
Until now we have tested the size of the file systems in `slot 0` and
`slot 1`. This seems natural, since these are two independent systems.

But both slots basically contain the same system - only installed via
two different methods.
We currently have no reason to believe that the system installed via
RAUC may have another file system layout that the system installed via
fastboot.

Thus we rewrite the tests to only test the file system sizes in `slot 0`
and remove a good amount of complexity and testing time.
If we ever encounter problems in this realm we can still write tests
that focus on the actual problems fixed.
The strategy already checks, whether the system is running.
Thus, it makes total sense to also collect debug information if this
state could not be reached.

This is a preparation for the removal of RAUC handling from the
strategy.
After support for RAUC  has been removed from the strategy, the strategy
will always assume, that we are in `system0`.
Thus it does not make sense to have tests for `system1` anymore.
Until now we handling of RAUC installs and handling of state transitions
between these two systems was done by the strategy.
This led to a fair amount of added complexity in the strategy.
For example did some states pretend to be reached (shell, network), even
if actually a we were in one of the RAUC-states (`system0`, `system1`).

Reducing the complexity in the strategy makes it also clear that other
special cases, like reconfiguring network from `online` to `loopback`
should not be part of the strategy.

This commit removes handling of RAUC installation and tracking of the
active slot from the strategy.
As a replacement it adds pytest `fixtures` that allow test cases
to manipulate which slot should be booted.

The fixture `set_bootstate_in_bootloader` should be used in tests that
want to boot into `system1` or change retries, `mark-good` etc.
This fixture makes sure (using the `default_bootstate` fixture) that the
system will always be in `system0` and has the default bootste after the
test has finished.

The only requirements for tests is, that they should not install a new
system into `system0`, nor make `system0` unusable.

While doing all these changes the actual RAUC tests have been adapted to
more closely follow the best current practices for these tests.
SmithChart and others added 2 commits September 16, 2024 10:49
Both states `shell` and `network` are basically the same:
* `shell`: The system is up in userspace and all services have started.
* `network`: Like `shell`, but we have checked that we are online.

To ease the transitions between `network` and `shell` there is even code
in place that acts like both states are the same.

This commit now removes the additional `network` state and adds the
online check to the `shell` state.
Unpacking a single element list using the following notation has the
benefit, that it will raise an exception, if the list to unpack is not
exactly one element long:

> [stdout] = shell.run_check("some_command")

This is especially useful for commands that are expected to only return
a single line of output.

Co-authored-by: Bastian Krause <bst@pengutronix.de>
@SmithChart SmithChart marked this pull request as ready for review September 16, 2024 09:57
@SmithChart
Copy link
Member Author

@Bastian-Krause From my side this is ready to merge. Feel free to have another look.

tests/test_tacd.py Outdated Show resolved Hide resolved
Without this change all tests left the *eet* in what ever state happend
to be the last. This does not work well with the notion that tests
should leave the system in it's initial state.

With this change the new `eet`-fixture will make sure that the *eet* is
returned to "no connection" at tear down.
With just `0.2 s` of delay the measurement of `v1/dut/feedback/current`
did not reliably succeed.
It seems the test-setup simply needs a short moment until there is
actually current flowing that can be measured.
Copy link
Member

@Bastian-Krause Bastian-Krause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, thanks for sticking with it! The test suite is now in a much better and more reliable shape.

@SmithChart SmithChart merged commit 54aba7d into linux-automation:master Oct 10, 2024
2 checks passed
@SmithChart SmithChart deleted the cfi/review-feedback branch October 10, 2024 12:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants