A strategy for evolving the `pkgs/by-name` CI checks #256788

infinisil · 2023-09-23T00:10:53Z

Context

The pkgs/by-name CI check gets the nixpkgs-check-by-name tooling, which lives in Nixpkgs itself, from the latest NixOS channel. This makes CI very fast and predictable for all PRs, because it's able to re-use the pre-built tooling from Hydra.

Problem

However, we have a problem if we want to change the tool:
Say we increase the strictness of the tool with a PR, while fixing all the problems it newly detects in the same PR. But now we have to wait for perhaps days until the NixOS channel updates until the new tool is used in CI. In that timespan, new problems could've been introduced without being detected.

Proposed solution

To solve this I propose to temporarily adjust CI for every strictness increase in the tool as follows:

In addition to the latest NixOS channel version of the tool, also use a version that is pinned to the then-latest NixOS channel revision at the time of the tool change
The two versions of the tool are then used to determine whether the check should succeed or not as follows:
- If the latest tool succeeds on the base branch of the PR, it must also succeed for the PR.
  
  This indicates that the pinned tooling isn't necessary anymore, a message is traced that the pin can get removed
- Otherwise, if the pinned tool succeeds on the base branch, it must also succeed for the PR.
  
  This indicates that the base branch needs to be fixed for the new tooling. The logs will contain the failures of the latest tool.
- Otherwise, either the pinned or the latest tool must succeed for the PR
  
  This indicates that the base branch is broken, either due to checks of a PR being ignored, or the PR being merged after the checks have changed.
  
  In this state we don't know whether the base branch already succeeded with the latest tool, so a PR can pass if it fixes the breakage using either version.

Once the channel updates the next time after the tooling update is merged, another PR can be made to fix any remaining problems. Repeat for some time until no new problems are introduced from PR's that were merged without running CI again.

This could also benefit from some automation to rerun PR checks if it's been say 1 week, which could then be used as the time window to be sure of no PR's still having old checks.

And in a final PR, once the base branch is definitely working with the new check, the temporary pinned version can be disabled again, only using the latest NixOS channel version of the tool once more.

Having thought through this, I think it's about as smooth as it can get, and it sounds generally useful for all CI changes.

I'd love to hear if there are other proposals to handle this though. In the end I think we need something like this for RFC 140, because we'll have a lot of PR's affected.

The text was updated successfully, but these errors were encountered:

infinisil · 2023-09-27T19:33:35Z

It was pointed out that the above text is really hard to digest. So here's my attempt at a more visual explanation.

With only a single check

When there's only one check, Nixpkgs can only be in two states, pretty simple:

Here are the Nixpkgs state transitions from the (latest) base commit to the merge commit that should pass in CI:

stateDiagram-v2
    invalid: Check fails
    old: Check succeeds

    invalid --> old
    old --> old

We can only get back to the "Check fails" state if a PR gets merged even if CI fails.
And once there, only a PR that fixes the check will pass CI.

With a new stricter check

With a new stricter check, Nixpkgs can now be in three states:

Here are the Nixpkgs state transitions from the (latest) base commit to the merge commit that should pass in CI:

stateDiagram-v2
    invalid: Both checks fail
    old: Old check succeeds
    new: New check succeeds

    invalid --> old
    invalid --> new
    old --> old
    new --> new
    old --> new

Noteworthy here is that:

If a CI check runs for a PR where the base branch already passes the stricter check, the PR must also pass that check.
It's only possible to go from "New check succeeds" to "Old check succeeds" from merging a PR that wasn't updated since the new stricter check was introduced
To go from "Old check succeeds" to "New check succeeds", manual work is required to regularly check master for some time, fixing it with the new checks as old PR's get merged. The goal is for the base branch to succeeding with the new checks 100% of the time, so that any PR's based on it will also be using the new check.

Expand for a part that's not necessary

However, writing down that last point, it's a bit too manual imo. Maybe that can be automated by: - Running this CI check not only for PRs, but also for the master branch. If the new stricter check fails (while the old one succeeds), automatically create an issue that pings me. I'll see this and create a PR that fixes it. - After such PR's are merged, and once we've gone say a week without any new automatic issues created, it means that no PR's breaking the new checks have been merged for a week. - At this point we can get rid of the old check and go back to the simpler single check scenario

infinisil · 2023-09-27T19:45:36Z

I'm conflating two related issues here though:

CI not triggering again as stricter tooling is introduced on the base branch, therefore breaking master if they're merged
- This is ideally fixed by just re-triggering that CI check for all old PRs.
- The above approach may be more suitable if it's too expensive to do that however.
The nixpkgs-check-by-name tooling being used from a channel, so updates to it are delayed
- Only this seems to require an approach as above, but only with a single PR to fix the new checks for the base branch, and only once the channel updates.
- At that point, a CI re-trigger should also be done.

I think I'll have to write this down again..

infinisil · 2023-09-27T23:26:35Z

An independent initial step towards this: #257735

Previously, even if the check also failed on the base branch, it looked like the PR introduced the failure. We can easily have a better error message for such cases. Meanwhile this also paves the road for something like NixOS#256788

infinisil · 2023-10-12T00:34:16Z

Alternate approach is to add --version <VERSION> to the tool. This way there's no need to have a pinned version of the tool, we can always use the latest one, but just run it twice (once with the previous version, once with the new one) to achieve the same effect

I'm adding such a --version in #256792 now

infinisil · 2023-10-13T01:02:10Z

This doesn't really work how I imagined it after all, because after testing this in a separate repo I discovered that it's impossible to re-trigger workflows for older PR's with an updated workflow file, GitHub always uses the same workflow file. So all of these "change the workflow file to run the tool twice" don't really work.

Instead my new plan is to:

Pin the CI tooling to a channel release in the base branch
Update the workflow to call the pinned tooling from the base branch (this workflow hopefully never needs to be updated again..)
Whenever it's time to use a new stricter tool, update the pin and fix the base branch in a single PR
When the pinned version in the base branch updates, automatically re-trigger the pkgs/by-name check for all open PRs that change pkgs/top-level/all-packages.nix or pkgs/by-name.

This means that updates to nixpkgs-check-by-name do not automatically end up being used by CI. Instead the pinned version in the base branch determines the version, and that can only update once the tooling is built by a channel.

infinisil · 2023-10-13T21:00:20Z

Okay change of plans again, the above is way too messy and doesn't catch everything. Instead I'm thinking now of going for this:

Pass both the base branch version and the merged PR version of Nixpkgs to nixpkgs-check-by-name
The tool fails only if the PR violates any checks that passed on the base branch.

This way, there will be a gradual migration for every new check, because all new (and updated) PRs won't be able to make it worse.

thufschmitt · 2023-10-26T14:49:18Z

I don't know whether it's a viable solution, but I would advocate for considering using some kind of merge train (either GitHub's native merge queues or one of the many bots that do that). It looks like it's the “obvious” solution to the problem of having PRs pass the CI but break master once they are merged.

nixos-discourse · 2023-10-31T18:01:18Z

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2023-10-31-nixpkgs-architecture-team-meeting-45/34846/1

infinisil · 2023-12-06T02:31:34Z

I now opened #272395 where I use the approach from #256788 (comment). The changes from #261939 made this fairly straightforward :)

infinisil · 2023-12-15T18:19:27Z

I now consider this done with #272395!

infinisil added this to the RFC 140 milestone Sep 23, 2023

This was referenced Sep 23, 2023

When updating nixpkgs-check-by-name, it should be run on Nixpkgs #256789

Closed

[RFC 140 2b] pkgs/by-name: Automated migration #211832

Draft

infinisil mentioned this issue Sep 27, 2023

workflows/check-by-name: Better error when base branch also fails #257735

Merged

6 tasks

infinisil mentioned this issue Oct 2, 2023

RFC 140 migration plan #258650

Open

infinisil mentioned this issue Oct 12, 2023

Implement check to disallow pkgs/by-name package definitions in all-packages.nix with an empty argument #256792

Merged

2 tasks

infinisil mentioned this issue Oct 19, 2023

nixpkgs-check-by-name: Intermediate error representation refactor #261939

Merged

3 tasks

Artturin mentioned this issue Nov 11, 2023

workflows/check-by-name: print failed command output #261693

Merged

12 tasks

infinisil mentioned this issue Dec 6, 2023

tests.nixpkgs-check-by-name: Implement gradual empty arg check migration #272395

Merged

3 tasks

infinisil closed this as completed Dec 15, 2023

tomodachi94 added the 6.topic: continuous integration Affects continuous integration (CI) in Nixpkgs, including Ofborg and GitHub Actions label Nov 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A strategy for evolving the `pkgs/by-name` CI checks #256788

A strategy for evolving the `pkgs/by-name` CI checks #256788

infinisil commented Sep 23, 2023 •

edited

Loading

infinisil commented Sep 27, 2023 •

edited

Loading

infinisil commented Sep 27, 2023 •

edited

Loading

infinisil commented Sep 27, 2023

infinisil commented Oct 12, 2023 •

edited

Loading

infinisil commented Oct 13, 2023 •

edited

Loading

infinisil commented Oct 13, 2023

thufschmitt commented Oct 26, 2023

nixos-discourse commented Oct 31, 2023

infinisil commented Dec 6, 2023

infinisil commented Dec 15, 2023

A strategy for evolving the pkgs/by-name CI checks #256788

A strategy for evolving the pkgs/by-name CI checks #256788

Comments

infinisil commented Sep 23, 2023 • edited Loading

Context

Problem

Proposed solution

infinisil commented Sep 27, 2023 • edited Loading

With only a single check

With a new stricter check

infinisil commented Sep 27, 2023 • edited Loading

infinisil commented Sep 27, 2023

infinisil commented Oct 12, 2023 • edited Loading

infinisil commented Oct 13, 2023 • edited Loading

infinisil commented Oct 13, 2023

thufschmitt commented Oct 26, 2023

nixos-discourse commented Oct 31, 2023

infinisil commented Dec 6, 2023

infinisil commented Dec 15, 2023

A strategy for evolving the `pkgs/by-name` CI checks #256788

A strategy for evolving the `pkgs/by-name` CI checks #256788

infinisil commented Sep 23, 2023 •

edited

Loading

infinisil commented Sep 27, 2023 •

edited

Loading

infinisil commented Sep 27, 2023 •

edited

Loading

infinisil commented Oct 12, 2023 •

edited

Loading

infinisil commented Oct 13, 2023 •

edited

Loading