Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch self-hosted runners to ubicloud github runners #1877

Merged
merged 37 commits into from
May 1, 2024
Merged

Conversation

saraswatpuneet
Copy link
Collaborator

@saraswatpuneet saraswatpuneet commented Feb 13, 2024

Goal

The goal of this PR is to switch the self hosted runners in github actions (except for benchmarks as they would run on self-hosted) to ubicloud-standard-4, code coverage is an exception as it requires higher verticals hence using ubicloud-standard-8 only for code coverage jobs

Tests

  • Run benchmarks
  • Check e2e tests
  • Check rust tests

Checklist

  • Chain spec updated
  • Custom RPC OR Runtime API added/changed? Updated js/api-augment.
  • Design doc(s) updated
  • Tests added
  • Benchmarks added
  • Weights updated

@saraswatpuneet saraswatpuneet changed the title runs-on: ubicloud Switch self-hosted runners to ubicloud github runners Feb 14, 2024
Copy link

codecov bot commented Feb 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.13%. Comparing base (10c144c) to head (6d2915e).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1877   +/-   ##
=======================================
  Coverage   83.13%   83.13%           
=======================================
  Files          56       56           
  Lines        4530     4530           
=======================================
  Hits         3766     3766           
  Misses        764      764           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@saraswatpuneet saraswatpuneet marked this pull request as ready for review February 15, 2024 17:56
Copy link

Running benchmarks and calculating weights. DO NOT MERGE! A new commit will be added upon completion...

Copy link

Finished running benchmarks. Updated weights have been committed to this PR branch in commit bab3a68.

demisx pushed a commit that referenced this pull request Feb 15, 2024
@github-actions github-actions bot added metadata-changed Metadata has changed since the latest full release metadata-version-not-incremented Metadata has changed since the latest full release, but the version has not been incremented labels Feb 15, 2024
saraswatpuneet added a commit that referenced this pull request Feb 15, 2024
@github-actions github-actions bot removed metadata-changed Metadata has changed since the latest full release metadata-version-not-incremented Metadata has changed since the latest full release, but the version has not been incremented labels Feb 15, 2024
@@ -41,3 +41,6 @@ rustflags = [
"-Dclippy::unwrap_used",
"-Dclippy::expect_used",
]

[net]
git-fetch-with-cli = true
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the random network issues on ubicloud warrants this check

https://doc.rust-lang.org/cargo/reference/config.html#netgit-fetch-with-cli

@@ -371,7 +371,8 @@ jobs:
RUST_LOG=debug SRTOOL_TAG="1.70.0" srtool build \
--build-opts="'--features on-chain-release-build,no-metadata-docs,${{matrix.features}}'" \
--profile=${{matrix.build-profile}} \
--package=${{matrix.package}}
--package=${{matrix.package}} \
--root
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needed for permissions

@furkansahin
Copy link

Hey @saraswatpuneet ,
I am a Ubicloud developer. It's exciting to see that you consider us for your runners.

I can see that you had a memory issue with our standard-8 runners. I would like to inform you that we found a misalignment in between our runners and github's in terms of swap space configurations. We have updated our runners to be in line with Github's as you can see here ubicloud/ubicloud#1244.

I can also see that you had a networking related issue. We found out that the memory related issues can manifest themselves as network drop as well. You may want to give a try to our latest runners and see if the issues persist :)

Is there any other specific problem I can help you and we can fix in ubicloud? Please just let me know if we can help anyway.
Best!

@saraswatpuneet
Copy link
Collaborator Author

saraswatpuneet commented Feb 26, 2024

@furkansahin we still get some intermittent issues (seems to be network but not sure), please check this job

https://github.com/LibertyDSNP/frequency/actions/runs/8051209010/job/21988632422

@saraswatpuneet saraswatpuneet marked this pull request as draft February 26, 2024 16:19
@furkansahin
Copy link

Hey @saraswatpuneet ,
I have cloned your repository and will debug this further. I should have more to share with you in the next 2 business days. A couple of questions that would help me to pinpoint the issues would be;

  1. Did you have this kind of issue before?
  2. I can see that you are switching from self hosted, is your self hosted instance located in a cloud provider? Also, what would be the location (US, europe?)

Best

@saraswatpuneet
Copy link
Collaborator Author

saraswatpuneet commented Feb 27, 2024

hey @furkansahin , thank you for taking a look, we havent faced this issue before, either on self hosted or github provider runner. this is new to us too. Following a re-run it passed, so its not reproducible everytime.

Regarding where we are hosting our runner, @wilwade can add details to it below.

@wilwade
Copy link
Collaborator

wilwade commented Feb 28, 2024

@furkansahin

It currently uses US based AWS runners. They are c7i.4xlarge and c7a.4xlarge.

@furkansahin
Copy link

Hey @saraswatpuneet ,
I can see what the problem is and reproduce it. However, since it fails randomly, it is difficult to catch the bug on the fly and reason about it. I have tried bunch of changes here in my fork https://github.com/furkansahin/frequency/actions/workflows/verify-pr-commit.yml. The problem seems like the srtool build is pulling bunch of repositories, the biggest being [https://github.com/paritytech/polkadot-sdk](https://github.com/paritytech/polkadot-sdk). During fetch, the connection is dropped, sometimes. This is not expected and I believe it has something to do with the location of our runners being in Europe.
In your self hosted runners, do you use caching? I believe the amount of pull in your self hosted runner may not be as high as it is in ubicloud, is that assumption correct?
I can see that you have added git-fetch-with-cli = true to overcome the networking issue but it did not help. Is that correct? Or, do you think it actually solved something but this issue is separate?
I am at the Devworld Conference in Amsterdam at the moment, so, unfortunately this is all I could find so far.
Best,
Furkan

@furkansahin
Copy link

Hey @saraswatpuneet , @wilwade
I have looked into this more and here is my analysis and a solution proposal for you.
Analysis: Looking into the networking logs in detail (tcpdump + wireshark), it is correct that when cargo fetches the repositories, there are some packet drops in between. While some packet drops are totally acceptable and harmless, in this instance, they cause random connection disruption. This mostly happens when the global traffic is high (US daytime). You would not see that in your self hosted instances or regular github runners because the repositories sit closer to those. In Ubicloud's scenario, our runners are provisioned either in Germany of Finland.
We are continuously improving our networking stack and I will look further into this problem how we can setup our networking in a way these issues would not happen. In the meantime, if you would give us a try, I found a way that makes the Verify Build Runtime jobs to pass.

Solution proposal: Unfortunately, this is a band aid proposal. I realised git-fetch-with-cli = true is not really necessary, but instead, you can use;

[net]
retry = 5

This way, cargo will retry if the fetch fails up to 5 times. In fact, in my last 27 Verify Build Runtime for X jobs, it had to try 3 times for only 1 of the jobs. You can see the latest one here https://github.com/furkansahin/frequency/actions/runs/8140985585.

I understand the inconvenience this issue may cause and would completely understand any decisions you make regarding this PR. However, I really hope you will consider giving us the opportunity to demonstrate our commitment to the overall customer experience.

Best,
Furkan

This reverts commit 0b3ce58.
@wilwade wilwade marked this pull request as ready for review April 30, 2024 22:23
@@ -41,3 +41,6 @@ rustflags = [
"-Dclippy::unwrap_used",
"-Dclippy::expect_used",
]

[net]
retry = 5
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed due to the location of the ubicloud runners

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused

Comment on lines +20 to +21
permissions:
contents: write
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows us to get rid of the special GHA_GIT_COMMIT token

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think behind the scenes it generates a token with those permissions, unless we want to control the lifetime of a token

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct. There is an option to control it such that the permission is only on specific steps instead the entire workflow, but I figured this is the best 1:1 swap for now.

@@ -12,7 +12,6 @@ jobs:
changes:
name: Determine Changed Files
runs-on: ubuntu-22.04
container: ghcr.io/libertydsnp/frequency/ci-base-image:latest
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll see a lot of places where I removed the use of the ci-base-image. It really is only needed for things that use rust.

@@ -17,7 +17,7 @@ env:
jobs:
run-e2e:
name: Run E2E Tests
runs-on: [self-hosted, Linux, X64, build, v2]
runs-on: ubuntu-22.04
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some things didn't need to be on ubicloud as they don't need the speed.

@@ -490,7 +490,7 @@ fn create_schema_v3_requires_valid_schema_size() {
expected: (Error::<Test>::LessThanMinSchemaModelBytes, 3),
},
TestCase {
input: r#"{"id": "long", "title": "I am a very very very long schema", "properties": "just way too long to live a long life", "description": "Just a never ending stream of bytes that goes on for a minute too long"}"#,
input: r#"{"id": "long", "title": "I am a very very very very long schema", "properties": "just way too long to live a long life", "description": "Just a never ending stream of bytes that goes on for a minute too long"}"#,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Triggering rust code changes

- name: Install Subwasm
run: |
cargo install --locked --git https://github.com/chevdor/subwasm --tag v0.19.1 --force
cargo install --locked --git https://github.com/chevdor/subwasm --tag v0.20.0 --force
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Highlighting version bump in subwasm

package: ${{matrix.package}}
chain: ${{matrix.chain}}
runtime_dir: ${{ matrix.runtime-dir }}
tag: "1.77.0"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Highlighting version bump in srtool

Copy link
Collaborator

@mattheworris mattheworris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Read through code changes
  • Read through comment thread

🚢 it!

Copy link
Collaborator

@JoeCap08055 JoeCap08055 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read through changes and comments... if we're satisfied with the solutions from ubicloud for the network drops and memory issues, :shipit:

Copy link
Collaborator Author

@saraswatpuneet saraswatpuneet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wilwade thank you for taking this ahead, great work and looks good to go
🚢

@wilwade wilwade merged commit d984c8b into main May 1, 2024
31 checks passed
@wilwade wilwade deleted the ubicloud branch May 1, 2024 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants