A strategy for evolving the pkgs/by-name CI checks #256788
Comments
It was pointed out that the above text is really hard to digest. So here's my attempt at a more visual explanation.
With only a single check
When there's only one check, Nixpkgs can only be in two states, pretty simple:
Here are the Nixpkgs state transitions from the (latest) base commit to the merge commit that should pass in CI:
stateDiagram-v2
invalid: Check fails
old: Check succeeds
invalid --> old
old --> old
We can only get back to the "Check fails" state if a PR gets merged even if CI fails.
With a new stricter check
With a new stricter check, Nixpkgs can now be in three states:
Here are the Nixpkgs state transitions from the (latest) base commit to the merge commit that should pass in CI:
stateDiagram-v2
invalid: Both checks fail
old: Old check succeeds
new: New check succeeds
invalid --> old
invalid --> new
old --> old
new --> new
old --> new
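The transitions in the diagram above can also be written down as a small lookup, which makes it easy to see which base-to-merge combinations CI is meant to accept. This is only a sketch: the names `State`, `ALLOWED` and `ci_should_pass` are made up for illustration and are not part of the actual tooling.

```python
from enum import Enum


class State(Enum):
    """Hypothetical names for the three Nixpkgs states in the diagram."""
    INVALID = "Both checks fail"
    OLD = "Old check succeeds"
    NEW = "New check succeeds"


# (base commit state, merge commit state) pairs that CI should let through,
# copied one-to-one from the arrows of the state diagram.
ALLOWED = {
    (State.INVALID, State.OLD),
    (State.INVALID, State.NEW),
    (State.OLD, State.OLD),
    (State.NEW, State.NEW),
    (State.OLD, State.NEW),
}


def ci_should_pass(base: State, merge: State) -> bool:
    """Return True if the diagram contains an arrow from base to merge."""
    return (base, merge) in ALLOWED
```

Every pair that is missing from the set, such as going from "New check succeeds" back to "Old check succeeds", is a regression that CI should reject.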
Noteworthy here is that:
Expand for a part that's not necessary
However, writing down that last point, it's a bit too manual imo. Maybe that can be automated by:
- Running this CI check not only for PRs, but also for the master branch. If the new stricter check fails (while the old one succeeds), automatically create an issue that pings me. I'll see this and create a PR that fixes it.
- After such PRs are merged, and once we've gone say a week without any new automatic issues created, it means that no PRs breaking the new checks have been merged for a week.
- At this point we can get rid of the old check and go back to the simpler single check scenario.
I'm conflating two related issues here though:
I think I'll have to write this down again.
An independent initial step towards this: #257735
Previously, even if the check also failed on the base branch, it looked like the PR introduced the failure. We can easily have a better error message for such cases. Meanwhile this also paves the road for something like NixOS#256788
Alternate approach is to add I'm adding such a
This doesn't really work how I imagined it after all: after testing this in a separate repo, I discovered that it's impossible to re-trigger workflows for older PRs with an updated workflow file, because GitHub always uses the same workflow file. So all of these "change the workflow file to run the tool twice" approaches don't really work. Instead my new plan is to:
This means that updates to
Okay change of plans again, the above is way too messy and doesn't catch everything. Instead I'm thinking now of going for this:
This way, there will be a gradual migration for every new check, because all new (and updated) PRs won't be able to make it worse.
I don't know whether it's a viable solution, but I would advocate for considering using some kind of merge train (either GitHub's native merge queues or one of the many bots that do that). It looks like it's the “obvious” solution to the problem of having PRs pass the CI but break master once they are merged.
This issue has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/2023-10-31-nixpkgs-architecture-team-meeting-45/34846/1
I now opened #272395 where I use the approach from #256788 (comment). The changes from #261939 made this fairly straightforward :)
I now consider this done with #272395!
Context
The pkgs/by-name CI check gets the nixpkgs-check-by-name tooling, which lives in Nixpkgs itself, from the latest NixOS channel. This makes CI very fast and predictable for all PRs, because it's able to re-use the pre-built tooling from Hydra.
Problem
However, we have a problem if we want to change the tool:
Say we increase the strictness of the tool with a PR, while fixing all the problems it newly detects in the same PR. But now we have to wait, perhaps for days, until the NixOS channel updates before the new tool is used in CI. In that timespan, new problems could've been introduced without being detected.
Proposed solution
To solve this I propose to temporarily adjust CI for every strictness increase in the tool as follows:
- If the latest tool succeeds on the base branch of the PR, it must also succeed for the PR.
  This indicates that the pinned tooling isn't necessary anymore; a message is traced that the pin can be removed.
- Otherwise, if the pinned tool succeeds on the base branch, it must also succeed for the PR.
  This indicates that the base branch needs to be fixed for the new tooling. The logs will contain the failures of the latest tool.
- Otherwise, either the pinned or the latest tool must succeed for the PR.
  This indicates that the base branch is broken, either due to checks of a PR being ignored, or the PR being merged after the checks have changed.
  In this state we don't know whether the base branch already succeeded with the latest tool, so a PR can pass if it fixes the breakage using either version.
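The three cases above boil down to a short decision function. The following is a minimal sketch of that logic only, assuming the four tool runs have already happened and their results are available as booleans; `ToolResults`, `Verdict` and `evaluate` are hypothetical names and not part of nixpkgs-check-by-name.

```python
from typing import NamedTuple


class ToolResults(NamedTuple):
    """Whether each tool version succeeded on each commit (hypothetical shape)."""
    latest_on_base: bool
    pinned_on_base: bool
    latest_on_pr: bool
    pinned_on_pr: bool


class Verdict(NamedTuple):
    passes: bool  # whether the PR's CI check should pass
    note: str     # what the situation indicates about the base branch


def evaluate(r: ToolResults) -> Verdict:
    if r.latest_on_base:
        # Case 1: the base branch is already fine with the latest tool,
        # so the PR must be fine with it too.
        return Verdict(r.latest_on_pr, "pin can be removed")
    if r.pinned_on_base:
        # Case 2: the base branch only satisfies the pinned tool,
        # so the PR must at least not break that.
        return Verdict(r.pinned_on_pr, "base branch needs fixing for the latest tool")
    # Case 3: the base branch is broken for both versions,
    # so the PR passes if it fixes the breakage with either one.
    return Verdict(r.latest_on_pr or r.pinned_on_pr, "base branch is broken")
```

For example, `evaluate(ToolResults(False, True, False, True))` passes with the note that the base branch still needs fixing, which is the expected situation right after a stricter tool version has been merged.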
Once the channel updates the next time after the tooling update is merged, another PR can be made to fix any remaining problems. Repeat for some time until no new problems are introduced from PRs that were merged without running CI again.
This could also benefit from some automation to rerun PR checks if it's been say 1 week, which could then be used as the time window to be sure of no PRs still having old checks.
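As a rough illustration of the rerun idea, the selection of PRs whose checks are too old could look like the sketch below. It assumes the time of each PR's last check run is already known (how to obtain it is left open here), and the function name, the input mapping and the one-week constant are all assumptions made for this example.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Assumed time window, taken from the "say 1 week" suggestion above.
RERUN_AFTER = timedelta(weeks=1)


def prs_needing_rerun(last_check: Dict[int, datetime], now: datetime) -> List[int]:
    """Return the PR numbers whose last check run is at least RERUN_AFTER old.

    last_check maps a PR number to the time its checks last ran.
    """
    return sorted(
        pr for pr, checked_at in last_check.items()
        if now - checked_at >= RERUN_AFTER
    )
```

If such a job ran regularly, no open PR would carry check results older than the window, which is what makes the window usable as the waiting period before removing the pin.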
And in a final PR, once the base branch is definitely working with the new check, the temporary pinned version can be disabled again, only using the latest NixOS channel version of the tool once more.
Having thought through this, I think it's about as smooth as it can get, and it sounds generally useful for all CI changes.
I'd love to hear if there are other proposals to handle this though. In the end I think we need something like this for RFC 140, because we'll have a lot of PRs affected.