Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: ci: create gh workflow that runs go tests #11762

Merged
merged 11 commits into from
Mar 26, 2024
Merged

ci: ci: create gh workflow that runs go tests #11762

merged 11 commits into from
Mar 26, 2024

Conversation

galargh
Copy link
Contributor

@galargh galargh commented Mar 20, 2024

Related Issues

#11734

Proposed Changes

This PR creates a new GitHub Actions workflow (test.yml) that performs the following jobs previously handled exclusively by CircleCI:

  • test
  • test-conformance
  • test-itest-*
  • test-unit-*

Additional Info

You can find an example run of the workflow at: https://github.com/filecoin-project/lotus/actions/runs/8376141826?pr=11762

Unlike the CircleCI test jobs, the ones from the newly added workflow do not have to wait for the build (make lotus deps) to finish. Instead, the test jobs call make deps as needed (we didn't find an instance where make lotus was needed). This results in some work duplication but reduces the overall workflow runtime by ~3 minutes.

We did use a configuration matrix in the newly added workflow because the number of jobs executed by the workflow is huge. We were able to use a single job "template" since all the test jobs are executed in almost exactly the same manner. In particular, we didn't split out test-conformance definition like it was done in CircleCI.

The newly added workflow makes use of 3 helper actions:

  • install-ubuntu-deps: which installs ocl-icd-opencl-dev libhwloc-dev pkg-config on the runner
  • install-go: which installs go on the runner (it uses the version it finds in the go.mod file)
  • start-yugabytedb: which starts yugabytedb docker container and waits for the DB to start running

The workflow's jobs run on a combination of self-hosted and hosted runners. We use self-hosted runners for two reasons: for resource-intensive jobs (2xlarge and 4xlarge) and to increase runner availability (large and xlarge). We can only use a limited number of hosted runners concurrently (60).

How did we decide what jobs to run on which runners? We used the largest runners - 4xlarge (16 CPU, 32 RAM) - only for some of the jobs that used to run on 2xlarge (16 CPU, 32 RAM) in CircleCI; namely - itest-deals_concurrent, itest-sector_pledge, and itest-worker. We used 2xlarge (8 CPU, 16 RAM) for jobs that, for whatever reason, we saw failing on hosted runners. These include itest-gateway, itest-sector_import_full, itest-sector_import_simple, itest-wdpost, unit-storage. Our assumption here is that more resources could help reduce the flakiness but it is to be verified in practice. Finally, we used xlarge (4 CPU, 8 RAM) for 42 jobs (half of overall 84) that were the quickest in one of the test runs where we run almost everything on xlarge runners. In other words, those jobs scheduled to run on xlarge now should be fine running on hosted runners too (or large (2 CPU, 4 RAM) to reduce the cost; we're sticking with xlarge not to accidentally introduce more flakiness during the evaluation period). Everything else runs on the default GitHub hosted runners (4 CPU, 16 RAM). This entire assignment is a subject to change but we do have to start somewhere and it seems to us like a sensible spot.

We decided to cache Proof Parameters because trying to download them from many jobs at the same time resulted in a number of connection closed errors. The job that ensures the proof parameters are cached properly adds only seconds of overhead to the workflow.

We decided to generate the test job matrix on the fly (instead of pre-generating it like it was for CircleCI). The job matrix generation is executed on every workflow run. It uses bash tools to combine JSON inputs into a final matrix. We evaluated keeping matrix generation as a Go script, but, in our opinion, it made it harder to reason about what the generation does and how the final matrix is configured. It also added a little bit of extra overhead required for Go setup (~40s vs ~10s).

The newly added workflow is intended to run alongside its' CircleCI counterpart for at least 1-2 weeks. After that period, we want to evaluate its success rate and execution time. Based on this information, we'll either remove the applicable CircleCI jobs or apply necessary fixes to the new workflow and repeat the evaluation.

During testing, we have seen one instance of self-hosted runners failing to be scheduled (https://github.com/filecoin-project/lotus/actions/runs/8362652752/attempts/2). This was caused by the self-hosted runners' setup being rate-limited by the AWS Parameter Store. We'll apply for a higher quota (https://aws.amazon.com/about-aws/whats-new/2023/07/aws-systems-manager-parameter-store-api-limit/) and evaluate whether a patch that introduces retries for that error is in order.

We have also noticed one instance where make deps failed due to rate limiting applied by GitHub. Since then we provisioned the step with a proper GitHub Token which should make it less prone to failure, but we'll also watch if this issue reappears during normal usage. If so, we'd propose restoring caching for the make dep result. This would add only <30 seconds of overhead to the workflow.

We have seen some test flakiness around tests like itest-sector_import_simple, itest-deals_concurrent, itest-sector_import_full, itest-gateway, itest-deals_anycid. The flakiness happened at different stages of the workflow development, and it might or might not have been related to the new setup. At this stage, we think it would be the most beneficial to start running the CircleCI and GitHub Actions workflows alongside each other as this will provide better data on which tests and how often flake and whether it's platform-specific or not.

We did not install statediff because we couldn't install it successfully. It used to be used only in the conformance test job, which was the only one that gathered coverage data, but it seemed to be unused.

Questions

  1. Is statediff still needed for conformance?
  2. Is coverage still needed for conformance?

Checklist

Before you mark the PR ready for review, please make sure that:

  • Commits have a clear commit message.
  • PR title is in the form of of <PR type>: <area>: <change being made>
    • example: fix: mempool: Introduce a cache for valid signatures
    • PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
    • area, e.g. api, chain, state, market, mempool, multisig, networking, paych, proving, sealing, wallet, deps
  • If the PR affects users (e.g., new feature, bug fix, system requirements change), update the CHANGELOG.md and add details to the UNRELEASED section.
  • New features have usage guidelines and / or documentation updates in
  • Tests exist for new functionality or change in behavior
  • CI is green

@galargh galargh requested a review from laurentsenta March 20, 2024 18:40
@galargh galargh force-pushed the ipdx-gha-test branch 3 times, most recently from c71a9e1 to dbdb080 Compare March 21, 2024 08:47
@galargh galargh marked this pull request as ready for review March 21, 2024 17:19
@galargh galargh requested review from masih and Stebalien March 21, 2024 17:19
@Stebalien
Copy link
Member

This was caused by the self-hosted runners' setup being rate-limited by the AWS Parameter Store.

The only issue I'm seeing is failure to download the proof parameters, which should have been cached, right? See https://github.com/filecoin-project/lotus/actions/runs/8362652752/job/22895146629. Looking at the other jobs in that run, I'm seeing us download the parameters in every single job as well.

We have also noticed one instance where make deps failed due to rate limiting applied by GitHub. Since then we provisioned the step with a proper GitHub Token which should make it less prone to failure, but we'll also watch if this issue reappears during normal usage. If so, we'd propose restoring caching for the make dep result. This would add only <30 seconds of overhead to the workflow.

Personally, I would do this. Otherwise, can we pass a more restricted token? I think we can just disable all permissions (https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#permissions).

@galargh
Copy link
Contributor Author

galargh commented Mar 22, 2024

This was caused by the self-hosted runners' setup being rate-limited by the AWS Parameter Store.

The only issue I'm seeing is failure to download the proof parameters, which should have been cached, right? See https://github.com/filecoin-project/lotus/actions/runs/8362652752/job/22895146629. Looking at the other jobs in that run, I'm seeing us download the parameters in every single job as well.

The problem with AWS Parameter Store shows up as jobs just hanging and never getting scheduled. We do have monitoring for that set up and, as mentioned, we will try to make sure that address any issues with runner scheduling that we know of.

We have also noticed one instance where make deps failed due to rate limiting applied by GitHub. Since then we provisioned the step with a proper GitHub Token which should make it less prone to failure, but we'll also watch if this issue reappears during normal usage. If so, we'd propose restoring caching for the make dep result. This would add only <30 seconds of overhead to the workflow.

Personally, I would do this. Otherwise, can we pass a more restricted token? I think we can just disable all permissions (https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#permissions).

Sure, no problem. This makes sense to us too. We added caching for make deps outputs in #11773. We're still making GITHUB_TOKEN available in the env for that step. Is that OK? It is only used to make an authorised request to filecoin-ffi releases as opposed to an unauthorised which is more likely to get rate limited (which is still quite unlikely now that we're caching the whole thing). On PRs that token gets limited permissions by default.

@galargh
Copy link
Contributor Author

galargh commented Mar 22, 2024

The test failures seem consistent between CircleCI, GitHub Actions, and the default branch.

@Stebalien
Copy link
Member

The problem with AWS Parameter Store shows up as jobs just hanging and never getting scheduled. We do have monitoring for that set up and, as mentioned, we will try to make sure that address any issues with runner scheduling that we know of.

Ah, I see. So this wouldn't show up in the GitHub Actions output?

Sure, no problem. This makes sense to us too. We added caching for make deps outputs in #11773. We're still making GITHUB_TOKEN available in the env for that step. Is that OK? It is only used to make an authorised request to filecoin-ffi releases as opposed to an unauthorised which is more likely to get rate limited (which is still quite unlikely now that we're caching the whole thing). On PRs that token gets limited permissions by default.

Yes, but can we remove all permissions? I think our GitHub tokens get read/write access by default.

Alternatively, we can change either the org default or the repo default to be read-only. Do you think that will break anything?

@galargh
Copy link
Contributor Author

galargh commented Mar 25, 2024

The problem with AWS Parameter Store shows up as jobs just hanging and never getting scheduled. We do have monitoring for that set up and, as mentioned, we will try to make sure that address any issues with runner scheduling that we know of.

Ah, I see. So this wouldn't show up in the GitHub Actions output?

Not really - it would only show up as "The job is waiting for the runner startup" or something along those lines. The quick fix for that is to cancel the job and rerun failed, but hopefully, with all the alerting we have now, even that won't be necessary much.

Sure, no problem. This makes sense to us too. We added caching for make deps outputs in #11773. We're still making GITHUB_TOKEN available in the env for that step. Is that OK? It is only used to make an authorised request to filecoin-ffi releases as opposed to an unauthorised which is more likely to get rate limited (which is still quite unlikely now that we're caching the whole thing). On PRs that token gets limited permissions by default.

Yes, but can we remove all permissions? I think our GitHub tokens get read/write access by default.

Alternatively, we can change either the org default or the repo default to be read-only. Do you think that will break anything?

Oh yeah! Sure! I don't think it should break anything with the workflows we have now, but I'm not 100% sure. I've just changed the default to read-only for the Lotus repository, and I'm going to monitor if everything's behaving as expected. If you don't mind, I'd put off changing the organization default until the work setting up the new workflows here is done. Then, we're going to change the org-wide default and monitor all the repositories for unexpected failures.

.github/workflows/test.yml Outdated Show resolved Hide resolved
.github/workflows/test.yml Show resolved Hide resolved
@galargh galargh merged commit 3f4eaf0 into master Mar 26, 2024
183 of 184 checks passed
@galargh galargh deleted the ipdx-gha-test branch March 26, 2024 15:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants