[RFC 0087] Promote aarch64-linux to Tier 1 support #87

Closed
137 changes: 137 additions & 0 deletions rfcs/0087-aarch64-tier1.md
@@ -0,0 +1,137 @@
---
feature: aarch64-tier1
start-date: 2021-03-09
author: Vika Shleina
co-authors: Graham Christensen
shepherd-team: @samueldr, @Kloenk, @dhess, @grahamc
shepherd-leader: @samueldr
related-issues: TBD
---

# Summary
[summary]: #summary

Promote `aarch64-linux` from a Tier 2 platform to Tier 1, as described in [RFC 0046](/rfcs/0046-platform-support-tiers.md).

# Motivation
[motivation]: #motivation

`aarch64-linux` support in Nixpkgs and NixOS has matured and become more
stable over time, and more devices now support NixOS on ARM.
Promoting it to a Tier 1 platform will allow us to block release channels on
aarch64-related build failures, making it easier and safer for ARM users
to upgrade their systems. It will also help keep software versions in
sync across architectures, because `x86_64-linux` and `aarch64-linux`
builds will share a channel.

As a side effect, `aarch64-linux` will see better perceived binary cache
coverage, because channel bumps will wait for aarch64 builds to finish,
saving build time for end users.

## Prior art
There were prior attempts at the same feat, but they failed due to technical
limitations of Hydra:
- NixOS/nixpkgs@74c4e30 - disabled in 2017 because of memory issues
- NixOS/nixpkgs#52534, NixOS/nixpkgs@36a0c13 - re-enabled in 2018 to pre-build important outputs
- NixOS/nixpkgs@1bfe8f1 - demoted to partial support due to hydra-evaluator issues

As a result, since NixOS/nixpkgs@1bfe8f1 (late 2018), NixOS has already been
blocking channel releases on the base system closure required to produce the
installer image.

Since then, hydra-evaluator has been rewritten, which will probably make
these concerns obsolete.

# Detailed design
[design]: #detailed-design

If this RFC is accepted, `aarch64-linux` builds will be added to the stable
and unstable channels' `tested` aggregate jobs on Hydra, giving them the ability
to block channel advances. Hydra will start building aarch64 packages and running
aarch64-based tests as part of the stable and unstable channels, including them in
the binary cache and increasing its coverage as a result.
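
As a rough illustration of what an aggregate job looks like, the sketch below uses `releaseTools.aggregate` from Nixpkgs to define one aggregate whose constituents cover both systems. It is not the real `tested` job: the aggregate name and the choice of `hello` as the only constituent are placeholders for illustration, and the actual job lists many NixOS tests and images.

```nix
# Illustrative sketch only; not the actual `tested` job from Nixpkgs.
{ nixpkgs ? <nixpkgs> }:
let
  pkgs = import nixpkgs { };
  systems = [ "x86_64-linux" "aarch64-linux" ];
  # Placeholder constituent: the `hello` package built for a given system.
  helloFor = system: (import nixpkgs { inherit system; }).hello;
in
pkgs.releaseTools.aggregate {
  name = "tested-sketch"; # hypothetical name
  # The aggregate can only succeed if every constituent builds,
  # so a failing aarch64-linux constituent fails the whole job.
  constituents = map helloFor systems;
}
```

Because a channel only advances when its aggregate succeeds, adding aarch64 constituents is what turns an aarch64 failure into a channel blocker.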

A team for aarch64-specific build failures should be established or, more
precisely, revived: per [RFC 0046](https://github.com/NixOS/rfcs/blob/master/rfcs/0046-platform-support-tiers.md) there should already be a team for ARM platforms called @NixOS/aarch64-maintainers.
This team would help track down and fix breaking changes specific to ARM platforms and be
at the front line of the battle for ARM stability.

Additionally, maintainers of critical packages (e.g. binutils) should be given
advance notice before the change is implemented, so that channel advances are
not delayed and no surprises occur, such as a suddenly increased workload. One
option would be to ping them on the PR implementing this RFC and discuss it there
(potentially delaying the merge if there is any pushback).

To help maintainers who don't have access to an aarch64 machine test their
packages, the [community build box](https://github.com/nix-community/aarch64-build-box) by @grahamc should provide a sufficient environment for running basic tests (and for quick rebuilds via remote building).

## Dealing with Capacity Issues
[design-capacity]: #dealing-with-capacity-issues

It is possible that the availability of aarch64 builders from Equinix Metal will
at times be reduced, lowering aarch64 build capacity and delaying builds. We will
extend the nixos-org-configurations implementation of hydra-provisioner to dynamically
allocate aarch64 builders on AWS during these capacity shortfalls.

# Examples and Interactions
[examples-and-interactions]: #examples-and-interactions

<!-- This section illustrates the detailed design. This section should clarify all
confusion the reader has from the previous sections. It is especially important
to counterbalance the desired terseness of the detailed design; if you feel
your detailed design is rudely short, consider making this section longer
instead. -->

In [nixos/release-combined.nix](https://github.com/NixOS/nixpkgs/blob/master/nixos/release-combined.nix)
`aarch64-linux` will be moved to `supportedSystems`, enabling NixOS tests
to block channel advances in case of failures.
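
A minimal sketch of what that change could look like is given here. It shows only a simplified argument list with a trivial body; the `limitedSupportedSystems` argument name and the default values are assumptions about the file's layout, not a copy of the real `release-combined.nix`.

```nix
# Simplified sketch; the real file takes more arguments and defines the jobs.
{ # aarch64-linux is listed here after the change, next to x86_64-linux.
  supportedSystems ? [ "x86_64-linux" "aarch64-linux" ]
  # Assumed name of the list that aarch64-linux is moved out of.
, limitedSupportedSystems ? [ ]
}:
{
  # Trivial body so the sketch evaluates; it just returns both lists.
  inherit supportedSystems limitedSupportedSystems;
}
```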

This RFC should be merged together with documentation on configuring
qemu-binfmt as a fallback method for building aarch64 packages on
x86_64 machines. Additionally, a sub-project that is out of scope for this RFC may be
established to catch build failures under emulation, some of which have already
been reported.
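
For reference, a minimal NixOS configuration fragment for this fallback might look like the following. It relies on the `boot.binfmt.emulatedSystems` option named under Future work; treating this single option as sufficient is an assumption, and the documentation to be merged may recommend additional settings.

```nix
# configuration.nix fragment for an x86_64-linux NixOS machine.
{
  # Registers QEMU user-mode emulation through binfmt_misc so that
  # aarch64-linux binaries, and therefore builds, can run on this host.
  boot.binfmt.emulatedSystems = [ "aarch64-linux" ];
}
```

After rebuilding the system, a command such as `nix-build '<nixpkgs>' -A hello --argstr system aarch64-linux` should then run under emulation, noticeably slower than on native hardware.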

The list of NixOS AMIs on NixOS.org will also be extended to include aarch64 images.

# Drawbacks
[drawbacks]: #drawbacks

- Some build failures could unnecessarily delay channel advances, delaying critical updates
  - Already an issue on `x86_64-linux` from time to time
  - We already run aarch64 tests on Hydra and they're mostly green - what's not green could be fixed if more attention is paid to those failures
- Maintainers of critical packages might not be ready for additional load
  - Will there be additional load on them? Sounds more like already existing problems will float up to the surface and need to be fixed ASAP
  - Potentially alleviated by reviving the aarch64-specific maintainer team and pinging it on all aarch64-specific issues not reproducible on x86_64-linux
    - That makes `aarch64-linux` still sound inferior to `x86_64-linux` though...

# Alternatives
[alternatives]: #alternatives

Member

I am of the opinion that a separate channel should be the first step. Show there are enough resources and commitment to keep it green for half a year to a year. If that is the case, we can upgrade it to Tier 1. I don't think there is generally enough understanding of how much day-to-day effort goes into actually keeping the channels green. At the same time, I have no idea how little breakage there is with aarch64 nowadays.

Member Author

I added your suggestion to the manuscript.

Member

The channel has already been blocking on aarch64-linux since late 2018 for the limited support set. So I guess we're ready to upgrade it to Tier 1, since things were kept green for half a year to a year.

This is about getting the jobs built at all, not about adding new blockers.

Member

Might be worth specifying which channel we are talking about here. If you mean the unstable channel: it mostly works. If you mean the stable channel: it had been blocked for a few days when I opened the channel PR, and probably nobody has noticed yet.

## Create a separate channel
Create an aarch64-focused channel that would build the same things the current `unstable`
channel does, but for aarch64 only. This has a significant drawback: it is possible for the
x86_64 channel and the aarch64 channel to never pass on the same commit, making deployment
to a heterogeneous cluster of x86_64 and aarch64 machines very challenging.

### Use a separate channel as a stepping stone
Elaborating on the previous alternative, create an aarch64-focused channel. Show
there are enough resources and commitment to keep it green for half a year to a year.
Carry on with the RFC topic once this is the case.

## "Just use your own CI"
Everyone needing `aarch64-linux` would then just track master on their own and build commits,
which would result in a lot of wasted work that could be saved by centralizing builds (which
is one of the reasons Hydra exists) and a lot of complexity for end-users.

# Unresolved questions
[unresolved]: #unresolved-questions

~~Do we have enough machines to handle aarch64 builds without delaying `x86_64-linux` builds?~~ (see [Dealing with Capacity Issues](#dealing-with-capacity-issues))

# Future work
[future]: #future-work

Track down build failures when using `boot.binfmt.emulatedSystems` and qemu-binfmt to build
aarch64 packages on `x86_64-linux` machines (e.g. by building a minimal closure entirely
under emulation, without binary caches).