Switch over to managed Arm64 hosts #16801

alexellis · 2023-10-20T14:00:35Z

Description

Switch over to managed Arm64 hosts

The service is provided by actuated.dev, and sponsored by both Ampere and the CNCF.

Details

This change switches over from 2x self-managed runners where side effects are possible between builds, to a pool of servers where each build runs in an isolated VM.

Benefits to the etcd maintainers and community:

Securely isolated, immutable environments for every build
No more server maintenance, reduced documentation and process
Support from an externally contracted team, much more efficient use of CNCF credits.
Potential to enable builds for each PR instead of just nightly

@dims and other etcd maintainers have been briefed on the discussion / plans via Slack.

Please feel free to ask questions here too.

ahrtr · 2023-10-20T14:13:44Z

> This change switches over from 2x self-managed runners where side effects are possible between builds, to a pool of servers where each build runs in an isolated VM.

thx for raising the PR. What's the scale of the pool? In this case, we can probably run the workflow on each PR instead of nightly?

alexellis · 2023-10-20T14:17:47Z

There's a total of 20 build slots available to be shared between all CNCF projects in the pool.

On average, how long does the Arm build take?

In order for the PR to work, we need a maintainer to install the Actuated app onto the etcd-io GitHub organisation. This enables runner management and event subscriptions.

ahrtr · 2023-10-20T14:27:01Z

> There's a total of 20 build slots available to be shared between all CNCF projects in the pool.

Based on the scale, and given that it's shared by all CNCF projects, it seems that we still need to run it nightly instead of triggering on each PR.

> we need a maintainer to install the Actuated app onto the etcd-io GitHub organisation

Any objection? @etcd-io/maintainers-etcd let's do this after we get consensus. thx for the link.

alexellis · 2023-10-20T14:42:47Z

> Based on the scale, and given that it's shared by all CNCF projects, it seems that we still need to run it nightly instead of triggering on each PR.

It depends on how quick the PRs are to run. Have you got a timing for the nightly job? I'm seeing about 30 minutes.

Depending on how many PRs you tend to get, it may well be fine to run on every PR, until we have the other projects fully onboarded.

> let's do this after we get consensus. thx for the link.

Sure, this has been discussed with @dims for a while already, I'm sure he can answer questions if anyone has them.

dims · 2023-10-20T17:06:01Z

@alexellis I am not an etcd maintainer per se, so the consensus that @ahrtr mentions among etcd maintainers is important here.

jmhbnz

Many thanks for raising this proposal @alexellis. This sounds like a great initiative.

Two questions:

Is the scale of the pool fixed at 20, or is there the ability to expand it in future if required? (Given there are ~160 CNCF projects who could be wanting to use the service.)
We require 8cpus for lazyfs robustness ci nightlies in order to run tests at a decent level of QPS. Currently we use large github amd64 runners. Is there an option for a large arm64 runner i.e. actuated-arm64-8cpu-32gb?

alexellis · 2023-10-21T07:20:59Z

The 20 slots are what Chris A has agreed to pilot. It's up to projects to show demand in order for that to potentially be increased.

32GB RAM is fine if that's what you know you need. There is a free telemetry plugin we can recommend later on which will confirm if 16 or 24GB for instance is being used.

And as you can see from this commit, it's a trivial change to revert if you should need to do so for any reason.

I'd encourage the team to get this moving forward and we can cover further questions and suggestions via Slack.

jmhbnz · 2023-10-21T07:32:17Z

> 32GB RAM is fine if that's what you know you need. There is a free telemetry plugin we can recommend later on which will confirm if 16 or 24GB for instance is being used.

Yes please. We need to set the spec to 8 CPUs and 32GB RAM to match the GitHub large runners that we found to be a requirement for our robustness nightly consistency stress tests with lazyfs enabled and a high enough QPS to create confidence.

If we can update that spec, then this pr will look good to me.

alexellis · 2023-10-21T09:55:04Z

Ack

You can customise the sizes as you require by editing the label in runs-on - just put whatever you need. Since the PR only runs overnight, I'd say get this merged, then send your own one after that with what you want.

I'm just here to help get the ball rolling.
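
As a rough illustration of the label-based sizing described above, a job could request its VM size as sketched here. The workflow name, job name and steps are placeholders; only the actuated-arm64-8cpu-32gb label comes from this thread, and the smaller label mentioned in the comment is an assumed variant of the same naming pattern.

```yaml
# Hypothetical workflow: names and steps are illustrative only.
name: arm64-example
on:
  workflow_dispatch:
jobs:
  build:
    # The label carries the requested VM size: 8 vCPUs and 32GB of RAM.
    # A lighter job might use something like actuated-arm64-4cpu-16gb
    # (assumed label, following the same pattern).
    runs-on: actuated-arm64-8cpu-32gb
    steps:
      - uses: actions/checkout@v4
      # Print the machine architecture to confirm the job landed on an Arm64 host.
      - run: uname -m
```

Changing the size is then a one-line edit to runs-on, which is also why the switch is easy to revert.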

jmhbnz

LGTM - Thanks @alexellis. I believe managed on demand arm64 ci hosts will definitely be a big win for the project. Keen to trial this.

I'm happy to follow up with tweaking the sizing for the machines though it would certainly be appreciated if you could make that small tweak to save a second pr and extra commit being required. Will defer to maintainers on if that is deemed a blocking issue or not.

fuweid · 2023-10-22T07:59:04Z

nice！ maybe we can remove the container on GitHub action in the follow-up? It seems we don't need to worry about leaky resources.

tao12345666333 · 2023-10-22T08:46:47Z

Looks great! Maybe we can run it for a while and see if there are any other parts that need to be modified/tweaked

ahrtr

LGTM

thx @alexellis

The followups:

Install the Actuated app onto the etcd-io GitHub organisation per #16801 (comment)
Customise the size per #16801 (comment)

cc @serathius @wenjiaswe

wenjiaswe · 2023-10-23T15:58:04Z

lgtm thanks!

alexellis · 2023-10-23T17:11:48Z

@fuweid

> nice！ maybe we can remove the container on GitHub action in the follow-up? It seems we don't need to worry about leaky resources.

Could you link me to that?

Every build will run in an isolated VM with its own short-lived Docker daemon and library.

This change switches over from 2x self-managed runners where side effects are possible between builds, to a pool of servers where each build runs in an isolated VM. The service is provided by actuated.dev, and sponsored by both Ampere and the CNCF. Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>

alexellis · 2023-10-23T17:15:40Z

> LGTM - Thanks @alexellis. I believe managed on demand arm64 ci hosts will definitely be a big win for the project. Keen to trial this.

Awesome, that's why we're doing this 👍

> I'm happy to follow up with tweaking the sizing for the machines though it would certainly be appreciated if you could make that small tweak to save a second pr and extra commit being required. Will defer to maintainers on if that is deemed a blocking issue or not.

The idea is to give you access, then for you to manage this and take ownership, but I've force pushed my commit with the size you asked for.

There may be other jobs in etcd-io repositories which do not require as many resources, or that may even require more. The labels are customisable as per the docs - https://docs.actuated.dev/troubleshooting/#a-job-is-running-out-of-ram-or-needs-more-cores

We've not had anyone perform the GitHub App installation yet. What's needed for this to happen?

serathius · 2023-10-24T06:55:58Z

All arm workflows are periodic currently, so we cannot see if this change is incompatible with them. For example, our fromJSON charade when making runs-on a parameter in the template. I think there is a high risk it will not work with your change. Only after merging the PR will we see the results.

Is this something that was missed, or are we ok with merging PR first and fixing later? @jmhbnz @ahrtr

I would prefer that we test it like we did in previous cases. @alexellis Can you temporarily push a commit that also makes the arm workflows execute on PR? We just need to have one run succeed and then we can remove the commit from PR.

jmhbnz · 2023-10-24T08:40:59Z

.github/workflows/robustness-nightly.yaml

@@ -23,7 +23,7 @@ jobs:
      count: 80
      testTimeout: 200m
      artifactName: main-arm64
-      runs-on: "['self-hosted', 'Linux', 'ARM64']"
+      runs-on: actuated-arm64-8cpu-32gb


Good catch from @serathius. I believe for robustness this needs to be "['actuated-arm64-8cpu-32gb']"; refer to line 18 above for an example.

We can probably simplify the fromJson that the robustness nightly template does after this pr merges because the array will no longer be present.

serathius

I don't think "['actuated-arm64-8cpu-32gb']" is the correct solution, as GitHub Actions would treat actuated-arm64-8cpu-32gb as a label and not a machine type. I think the proposed value makes sense assuming we will remove fromJson, but I'm not sure. It's a matter of trial and error.
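
To make the two options in this review thread concrete, here is a minimal sketch of a reusable workflow that receives runs-on as a string input and decodes it with fromJSON, followed by a caller. File names, job names and steps are assumptions for illustration and are not copied from etcd's template; the quoted array mirrors the style of the line removed in the diff above.

```yaml
# --- Reusable template (hypothetical file: .github/workflows/template.yaml) ---
on:
  workflow_call:
    inputs:
      runs-on:
        required: true
        type: string
jobs:
  test:
    # fromJSON decodes the string input into an array of runner labels,
    # so the caller has to pass an array encoded as a string.
    runs-on: ${{ fromJSON(inputs.runs-on) }}
    steps:
      - run: uname -m
---
# --- Caller (hypothetical nightly workflow) ---
on:
  workflow_dispatch:
jobs:
  main-arm64:
    uses: ./.github/workflows/template.yaml
    with:
      # Array-in-a-string form, matching what fromJSON in the template expects.
      runs-on: "['actuated-arm64-8cpu-32gb']"
```

Since runs-on accepts either a single string or an array of labels, dropping fromJSON from the template would let the caller pass the bare label actuated-arm64-8cpu-32gb instead. While fromJSON is still in place, a bare label is not an encoded array and would likely fail to parse.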

jmhbnz · 2023-10-24T08:43:07Z

> I would prefer that we test it like we did in previous cases. @alexellis Can you temporarily push a commit that also makes the arm workflows execute on PR? We just need to have one run succeed and then we can remove the commit from PR.

We will need a maintainer to add the Github App first before the test can be successful, refer: https://docs.actuated.dev/register/#install-the-github-app

ahrtr · 2023-10-24T09:42:07Z

> We will need a maintainer to add the Github App first before the test can be successful, refer: https://docs.actuated.dev/register/#install-the-github-app

Done.

> Only after merging the PR we will see the results.

It's a good point. It'd be better to verify the change in this PR first (see below). Once it's confirmed, we can change it back. thx

$ git diff
diff --git a/.github/workflows/robustness-nightly.yaml b/.github/workflows/robustness-nightly.yaml
index e3e1d51f3..663fd4720 100644
--- a/.github/workflows/robustness-nightly.yaml
+++ b/.github/workflows/robustness-nightly.yaml
@@ -1,11 +1,7 @@
 ---
 name: Robustness Nightly
 permissions: read-all
-on:
-  # schedules always run against the main branch, hence we have to create separate jobs
-  # with individual checkout actions for each of the active release branches
-  schedule:
-    - cron: '25 9 * * *' # runs every day at 09:25 UTC
+on: [push, pull_request]
 jobs:
   main:
     # GHA has a maximum amount of 6h execution time, we try to get done within 3h

alexellis · 2023-10-24T09:45:11Z

@ahrtr I was going to suggest adding a workflow_dispatch to the job, and getting this merged.

I don't think you can test it in a PR until the initial merge of this PR has been performed.

See my follow-up commit, to enable the job to be tested without waiting for the cron.

Once you've seen it working, I'd suggest removing the container elements, which force the steps to run inside Docker, from the file, as it'll run much quicker using the native runner that way.
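
For readers unfamiliar with the trigger, a minimal sketch of what adding workflow_dispatch next to the existing schedule could look like. The name, permissions and cron line are taken from the diff earlier in this thread; the exact layout of the follow-up commit is an assumption.

```yaml
name: Robustness Nightly
permissions: read-all
on:
  schedule:
    - cron: '25 9 * * *' # runs every day at 09:25 UTC
  # Lets maintainers start a run by hand from the Actions tab or the API,
  # without waiting for the nightly cron.
  workflow_dispatch:
```

GitHub only offers workflow_dispatch for workflow files that exist on the default branch, which fits the suggestion to merge first and then trigger the job manually.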

alexellis · 2023-10-24T09:52:05Z

As per your request: #16801 (comment)

I've added the requested extra triggers.

ahrtr · 2023-10-24T09:54:29Z

@alexellis could you rebase this PR to get rid of the commits coming from main branch?

alexellis · 2023-10-24T11:00:49Z

Looks like the nightly test ran as expected, in about the same time as without any management:

@ahrtr

Adding workflow_dispatch as an "on" trigger enables manual testing by maintainers, without having to wait for the nightly cron schedule. @ahrtr requested this temporary change in order to trigger the arm64 jobs via CI. Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>

alexellis · 2023-10-24T12:12:09Z

@jmhbnz that's been done and the Arm64 jobs have completed as far as I can see.

Recently in etcd-io#16801 we introduced on demand github actions runners for the arm64 platform. Having on demand runner infrastructure in place means we should now have enough capacity to begin running arm64 tests for every pull request.

Currently we have:

.github/workflows/e2e-arm64-template.yaml - Shared template
.github/workflows/e2e-arm64-nightly.yaml - Runs template against both release-3.5 and main branches nightly.

Moving forward we can just rename .github/workflows/e2e-arm64-template.yaml to .github/workflows/e2e-arm64.yaml and delete the other file. We can then just make the template file a standard workflow that will run on pull request.

Signed-off-by: Ming Li <mli103hawk@gmail.com>
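
A rough sketch of what the renamed file could look like once it is a standard workflow. The file name comes from the commit message and the trigger style from the diff earlier in this thread; the workflow name, job name, runner size and step contents are placeholders, not the actual etcd workflow.

```yaml
# .github/workflows/e2e-arm64.yaml (body is illustrative)
name: E2E-Arm64
permissions: read-all
# Run on pushes and on every pull request instead of on a nightly schedule.
on: [push, pull_request]
jobs:
  test:
    # Runner size is an assumption; pick the label the job actually needs.
    runs-on: actuated-arm64-8cpu-32gb
    steps:
      - uses: actions/checkout@v4
      # Placeholder for the real e2e test command.
      - run: echo "run the arm64 e2e tests here"
```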

…tcd-io#16912 KubeconNA 2023 Contribfest issue etcd-io#16893 . Recently in etcd-io#16801 we introduced on demand github actions runners for the arm64 platform. Having on demand runner infrastructure in place means we should now have enough capacity to begin running arm64 tests for every pull request. Currently we have: .github/workflows/e2e-arm64-template.yaml - Shared template .github/workflows/e2e-arm64-nightly.yaml - Runs template against both release-3.5 and main branches nightly. Moving forward we can just rename .github/workflows/e2e-arm64-template.yaml to .github/workflows/e2e-arm64.yaml and delete the other file. We can then just make the template file a standard workflow that will run on pull request. Signed-off-by: Ming Li <mli103hawk@gmail.com>

alexellis · 2023-11-16T22:14:51Z

Hi folks 👋

We've seen around 200 jobs scheduled for the org today, with bbolt having also been added on. That's not a very big number, and shouldn't be hitting any rate limits, however for etcd-io and no other customers, we're seeing a 403 / rate-limit error whilst trying to obtain a registration token (GitHub support said the actuated app is within its limit for etcd-io).

I don't know if GitHub rolled out a bug or is having a partial outage, but I'm pinging them. It looks like there are 18 jobs queued which can be retried on our end in the morning by @welteki on the actuated team.

Just gathering info.. do you have other different GitHub apps, bots, or integrations installed/operating on this organisation?

Alex

jmhbnz · 2023-11-16T22:20:36Z

> Just gathering info.. do you have other different GitHub apps, bots, or integrations installed/operating on this organisation?

As of recently we have kubernetes/test-infra#31218 in progress to finalise kubernetes/org#4498 k8s ci robot / k8s prow test infra integration.

You can see this start to kick in under a recent pr like #16950 but it is early days.

We have an arm64 runner issue under #16948 we would like actuated assistance on.

Lastly we are planning for k8s ci robot to be able to trigger actions workflows for new contributors via #16956.

alexellis · 2023-11-17T09:28:20Z

@ahrtr seems like someone on your org has disabled self hosted runners 🙈

Forbidden","errors":"Repository level self-hosted runners are disabled on this repository.

You’ll need to look into this and change the setting so that actuated can add runners for CI.

alexellis · 2023-11-17T09:30:09Z

Please send me your email for a Slack invite to alex@actuated.dev for ongoing support for etcd

alexellis · 2023-11-17T09:31:14Z

https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/monitoring-and-troubleshooting-self-hosted-runners#using-repository-level-self-hosted-runners

serathius · 2023-11-17T09:47:46Z

> @ahrtr seems like someone on your org has disabled self hosted runners 🙈
>
> Forbidden","errors":"Repository level self-hosted runners are disabled on this repository.
>
> You’ll need to look into this and change the setting so that actuated can add runners for CI.

I think it might be the result of

alexellis · 2023-11-17T10:58:01Z

The enterprise has its own set of configurations so will need to be configured appropriately to enable the use of repository-level self hosted runners.

The GitHub team told me that the default is to allow for them.

ahrtr · 2023-11-17T11:02:55Z

> @ahrtr seems like someone on your org has disabled self hosted runners 🙈
>
> Forbidden","errors":"Repository level self-hosted runners are disabled on this repository.
>
> You’ll need to look into this and change the setting so that actuated can add runners for CI.

Yes, it might be caused by the recent transfer of the etcd-io org. See screenshot below as well.

But it seems that actuated still has the required permission?

> Please send me your email for a Slack invite to alex@actuated.dev

sent.

alexellis · 2023-11-17T13:10:44Z

The issue is not whether the actuated GitHub app has permissions, it's whether your enterprise and organisation allow self-hosted runners, and they do not at present.

You'll need to go through the settings at both levels and toggle it over. Here are two screenshots GitHub's actions PM sent to me earlier today:

ahrtr · 2023-11-17T13:20:47Z

Pls see the screenshot below. I see the message below under the "Runner" section; it seems that I have no permission to do that? @mrbobbytables @palnabarun could you help on this?

> Choose which repositories are allowed to create repository-level self-hosted runners.
>
> This setting has been set by enterprise administrators.

ahrtr · 2023-11-17T19:18:04Z

@jmhbnz you have also been invited by @alexellis to the actuated Slack space to discuss & coordinate any arm64 related workflow issues. If you do not receive the mail, please feel free to send an email to alex@actuated.dev to request to join. Thanks.

alexellis · 2023-11-18T08:47:24Z

@mrbobbytables could you set the following at the enterprise level please? All of the etcd jobs are backed up since this was changed / transferred. There's a total of 85 that cannot run due to this setting.

mrbobbytables · 2023-11-18T16:09:07Z

Sorry about that - I've flipped it so etcd-io should be able to add them. If someone could verify I'd appreciate it 🙏

@ahrtr @serathius @wenjiaswe @jmhbnz - as we integrate etcd-io into k8s more, little bumps like this are bound to happen. Is there an existing issue here or in k8s that we could use to surface these issues as they arise?

If one does not exist, I'd lean towards creating one in https://github.com/kubernetes/org as the GitHub admins all watch that repo.

ahrtr · 2023-11-18T16:24:51Z

Thanks @mrbobbytables.

@alexellis The configuration should be correct now, please see screen shot below. But lots of workflows on arm64 are still pending to run (queued). Should we manually re-trigger all the already queued workflows or wait for more time for them to automatically be triggered?

> If one does not exist, I'd lean towards creating one in https://github.com/kubernetes/org as the GitHub admins all watch that repo.

Leave it to @jmhbnz to follow this. Thanks.

mrbobbytables · 2023-11-18T16:49:14Z

I'm not super familiar with the self hosted runners^^;; If they don't retrigger automatically soon though it may need to be done manually

ahrtr · 2023-11-18T16:51:40Z

Yes, manually re-triggering works. FYI. etcd-io/bbolt#614

ahrtr · 2023-11-18T17:01:28Z

Just manually re-triggered all workflows.

jmhbnz · 2023-11-18T18:17:49Z

Thanks @ahrtr, @mrbobbytables. I've opened kubernetes/org#4590 to get our large amd64 runners re-enabled. Another side-effect of moving the org.

alexellis · 2023-11-19T09:00:45Z

actuated customers (and members of enrolled GitHub organisations) can retrigger queued jobs as per the docs:

https://docs.actuated.dev/tasks/cli/#schedule-a-repair-to-re-queue-jobs

I can see jobs are running again now.

jmhbnz reviewed Oct 20, 2023

jmhbnz approved these changes Oct 22, 2023

ahrtr mentioned this pull request Oct 22, 2023

Setup workflow running on arm64 etcd-io/bbolt#583

Closed

ahrtr approved these changes Oct 23, 2023

jmhbnz requested a review from serathius October 23, 2023 21:02

jmhbnz requested changes Oct 24, 2023

serathius approved these changes Oct 24, 2023

mingli103 mentioned this pull request Nov 10, 2023

etcd-e2d-test:rename e2e-arm64 file and runs it on every pull request #16911

Closed

mingli103 mentioned this pull request Nov 10, 2023

etcd-e2d-test:rename e2e-arm64 file and runs it on every pull request #16912

Closed

mingli103 mentioned this pull request Nov 15, 2023

etcd-e2d-test:rename e2e-arm64 file and runs it on every pull request… #16950

Merged

upodroid mentioned this pull request Nov 19, 2023

etcd-io Infra and CI Migration kubernetes/k8s.io#6102

Open

This was referenced Nov 30, 2023

add workflow telemetry to collect action metrics #17046

Merged

Power down & de-provision old equinix metal arm64 ci runners #17082

Closed

ivanvc mentioned this pull request Dec 29, 2023

Enable ARM64 GitHub workflows etcd-io/raft#122

Closed
