ci: ci: create gh workflow that runs go tests #11762
This reverts commit d1da190.
The only issue I'm seeing is failure to download the proof parameters, which should have been cached, right? See https://github.com/filecoin-project/lotus/actions/runs/8362652752/job/22895146629. Looking at the other jobs in that run, I'm seeing us download the parameters in every single job as well.
Personally, I would do this. Otherwise, can we pass a more restricted token? I think we can just disable all permissions (https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#permissions).
The problem with AWS Parameter Store shows up as jobs just hanging and never getting scheduled. We do have monitoring set up for that and, as mentioned, we will try to make sure that we address any issues with runner scheduling that we know of.
Sure, no problem. This makes sense to us too. We added caching for
The test failures seem consistent between CircleCI, GitHub Actions, and the default branch.
Ah, I see. So this wouldn't show up in the GitHub Actions output?
Yes, but can we remove all permissions? I think our GitHub tokens get read/write access by default. Alternatively, we can change either the org default or the repo default to be read-only. Do you think that will break anything?
Not really - it would only show up as "The job is waiting for the runner startup" or something along those lines. The quick fix for that is to cancel the job and rerun failed, but hopefully, with all the alerting we have now, even that won't be necessary much.
Oh yeah! Sure! I don't think it should break anything with the workflows we have now, but I'm not 100% sure. I've just changed the default to read-only for the Lotus repository, and I'm going to monitor if everything's behaving as expected. If you don't mind, I'd put off changing the organization default until the work setting up the new workflows here is done. Then, we're going to change the org-wide default and monitor all the repositories for unexpected failures.
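The token restriction discussed in this thread can also be pinned in the workflow file itself, independently of the repository or organization default. The following is a minimal sketch, assuming the test jobs need no write access; the workflow name, job name, and steps are hypothetical and not taken from the PR:

```yaml
# Sketch only: remove all default GITHUB_TOKEN permissions at the workflow
# level, then grant each job only what it needs.
name: Test

on:
  pull_request:

permissions: {} # no permissions unless a job asks for them

jobs:
  test: # hypothetical job name
    runs-on: ubuntu-latest
    permissions:
      contents: read # read-only access to the repository contents
    steps:
      - uses: actions/checkout@v4
      - run: make deps
```

A workflow-level `permissions` block takes precedence over the repository default, so a job written this way would behave the same before and after the org-wide default is changed.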
Related Issues
#11734
Proposed Changes
This PR creates a new GitHub Actions workflow (test.yml) that performs the following jobs previously handled exclusively by CircleCI:
Additional Info
You can find an example run of the workflow at: https://github.com/filecoin-project/lotus/actions/runs/8376141826?pr=11762
Unlike the CircleCI test jobs, the ones from the newly added workflow do not have to wait for the build (make lotus deps) to finish. Instead, the test jobs call make deps as needed (we didn't find an instance where make lotus was needed). This results in some work duplication but reduces the overall workflow runtime by ~3 minutes.
We did use a configuration matrix in the newly added workflow because the number of jobs executed by the workflow is huge. We were able to use a single job "template" since all the test jobs are executed in almost exactly the same manner. In particular, we didn't split out the test-conformance definition as was done in CircleCI.
The newly added workflow makes use of 3 helper actions:
one that installs ocl-icd-opencl-dev libhwloc-dev pkg-config on the runner
one that installs go on the runner (it uses the version it finds in the go.mod file)
one that starts a yugabytedb docker container and waits for the DB to start running
The workflow's jobs run on a combination of self-hosted and hosted runners. We use self-hosted runners for two reasons: to run resource-intensive jobs (2xlarge and 4xlarge) and to increase runner availability (large and xlarge). We can only use a limited number of hosted runners concurrently (60).
How did we decide what jobs to run on which runners? We used the largest runners, 4xlarge (16 CPU, 32 GB RAM), only for some of the jobs that used to run on 2xlarge (16 CPU, 32 GB RAM) in CircleCI, namely itest-deals_concurrent, itest-sector_pledge, and itest-worker. We used 2xlarge (8 CPU, 16 GB RAM) for jobs that, for whatever reason, we saw failing on hosted runners. These include itest-gateway, itest-sector_import_full, itest-sector_import_simple, itest-wdpost, and unit-storage. Our assumption here is that more resources could help reduce the flakiness, but this remains to be verified in practice. Finally, we used xlarge (4 CPU, 8 GB RAM) for 42 jobs (half of the overall 84) that were the quickest in one of the test runs where we ran almost everything on xlarge runners. In other words, the jobs now scheduled on xlarge should be fine running on hosted runners too (or on large (2 CPU, 4 GB RAM) to reduce the cost; we're sticking with xlarge so as not to accidentally introduce more flakiness during the evaluation period). Everything else runs on the default GitHub hosted runners (4 CPU, 16 GB RAM). This entire assignment is subject to change, but we do have to start somewhere, and this seems to us like a sensible spot.
We decided to cache Proof Parameters because trying to download them from many jobs at the same time resulted in a number of connection closed errors. The job that ensures the proof parameters are cached properly adds only seconds of overhead to the workflow.
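A cache-then-restore arrangement along these lines would match the description above. It is a sketch only: the cache path, the manifest used for the cache key, and the download script are assumptions, and the job names are hypothetical.

```yaml
# Sketch only: one job populates the cache, test jobs restore it.
jobs:
  cache-params: # hypothetical job name
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - id: params
        uses: actions/cache@v4
        with:
          path: /var/tmp/filecoin-proof-parameters # assumed location
          # Assumed manifest path; any file that changes with the parameters works.
          key: proof-params-${{ hashFiles('build/proof-params/parameters.json') }}
      - if: steps.params.outputs.cache-hit != 'true'
        run: ./scripts/fetch-proof-params.sh # hypothetical download script

  test:
    needs: cache-params
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/cache/restore@v4
        with:
          path: /var/tmp/filecoin-proof-parameters
          key: proof-params-${{ hashFiles('build/proof-params/parameters.json') }}
          fail-on-cache-miss: true # fail fast instead of downloading per job
```

On a cache hit the first job does nothing beyond the lookup, which is consistent with the "only seconds of overhead" noted above.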
We decided to generate the test job matrix on the fly (instead of pre-generating it, as was done for CircleCI). The job matrix generation is executed on every workflow run. It uses bash tools to combine JSON inputs into a final matrix. We evaluated keeping matrix generation as a Go script, but, in our opinion, it made it harder to reason about what the generation does and how the final matrix is configured. It also added a little bit of extra overhead required for Go setup (~40s vs ~10s).
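To illustrate the on-the-fly generation, here is a minimal sketch that combines two JSON inputs with jq and feeds the result into the test job. The inputs, job names, and output name are hypothetical; the actual workflow may structure its inputs differently.

```yaml
# Sketch only: build the matrix with jq in one job, consume it in the next.
jobs:
  matrix: # hypothetical job name
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.generate.outputs.matrix }}
    steps:
      - id: generate
        env:
          # Hypothetical inputs: the test names and per-test runner overrides.
          TESTS: '["itest-gateway", "itest-worker", "unit-storage"]'
          RUNNERS: '{"itest-worker": "4xlarge", "itest-gateway": "2xlarge"}'
        run: |
          # Emit one matrix entry per test; tests without an override
          # fall back to the default hosted runner.
          matrix="$(jq -nc --argjson tests "$TESTS" --argjson runners "$RUNNERS" \
            '{include: [$tests[] | {name: ., runner: ($runners[.] // "ubuntu-latest")}]}')"
          echo "matrix=$matrix" >> "$GITHUB_OUTPUT"

  test:
    needs: matrix
    strategy:
      fail-fast: false
      matrix: ${{ fromJSON(needs.matrix.outputs.matrix) }}
    name: ${{ matrix.name }}
    runs-on: ${{ matrix.runner }}
    steps:
      - run: echo "testing ${{ matrix.name }}" # placeholder
```

Because jq is preinstalled on GitHub hosted Ubuntu runners, a step like this needs no language setup, which is where the time difference against a Go script would come from.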
The newly added workflow is intended to run alongside its CircleCI counterpart for at least 1-2 weeks. After that period, we want to evaluate its success rate and execution time. Based on this information, we'll either remove the applicable CircleCI jobs or apply necessary fixes to the new workflow and repeat the evaluation.
During testing, we have seen one instance of self-hosted runners failing to be scheduled (https://github.com/filecoin-project/lotus/actions/runs/8362652752/attempts/2). This was caused by the self-hosted runners' setup being rate-limited by the AWS Parameter Store. We'll apply for a higher quota (https://aws.amazon.com/about-aws/whats-new/2023/07/aws-systems-manager-parameter-store-api-limit/) and evaluate whether a patch that introduces retries for that error is in order.
We have also noticed one instance where make deps failed due to rate limiting applied by GitHub. Since then we provisioned the step with a proper GitHub Token which should make it less prone to failure, but we'll also watch if this issue reappears during normal usage. If so, we'd propose restoring caching for the make deps result. This would add only <30 seconds of overhead to the workflow.
We have seen some test flakiness around tests like itest-sector_import_simple, itest-deals_concurrent, itest-sector_import_full, itest-gateway, itest-deals_anycid. The flakiness happened at different stages of the workflow development, and it might or might not have been related to the new setup. At this stage, we think it would be most beneficial to start running the CircleCI and GitHub Actions workflows alongside each other, as this will provide better data on which tests flake, how often, and whether the flakiness is platform-specific.
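For the make deps rate limiting mentioned above, passing the workflow token to the step might look like the sketch below. The environment variable name is an assumption: it only helps if the scripts run by make deps actually read it when calling the GitHub API.

```yaml
# Sketch only: expose the workflow token to the dependency build step so that
# GitHub API requests are authenticated rather than anonymous.
steps:
  - uses: actions/checkout@v4
    with:
      submodules: recursive
  - run: make deps
    env:
      GITHUB_TOKEN: ${{ github.token }} # assumed variable name
```

Authenticated requests get a higher GitHub API rate limit than anonymous ones, which is why this should make the step less prone to failure.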
We did not install statediff because we couldn't install it successfully. It used to be used only in the conformance test job, which was the only one that gathered coverage data, but it seemed to be unused.
Questions
Checklist
Before you mark the PR ready for review, please make sure that:
<PR type>: <area>: <change being made>
fix: mempool: Introduce a cache for valid signatures
PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
area, e.g. api, chain, state, market, mempool, multisig, networking, paych, proving, sealing, wallet, deps