CI timeouts #30851
Yea, seems like our ccache efficiency has tanked post-CMake. Is it possible this is somehow a stagnant cache or a related issue (although given it's been 2 weeks since the switch, I'd think caches would have cycled)? ccache hits for some CI jobs of the last merge into master (https://github.com/bitcoin/bitcoin/runs/29789973072): 32-bit CentOS:
Right, and given that
Can you clear all
I believe that https://cirrus-ci.com/build/5697885973512192 is more representative, as the change is docs-only. As for Win64 Cross, the statistics show
The CI machines may work on pull requests, which may (in)validate the cache, so ccache stats may generally not be 100% reliable. It is possible to trigger a run with a clean cache volume locally (by deleting all volumes), or on CI, by creating a new namespace for the volumes:
diff --git a/ci/test/02_run_container.sh b/ci/test/02_run_container.sh
index afd447c347..53c0b4d8bc 100755
--- a/ci/test/02_run_container.sh
+++ b/ci/test/02_run_container.sh
@@ -25,15 +25,15 @@ if [ -z "$DANGER_RUN_CI_ON_HOST" ]; then
--tag="${CONTAINER_NAME}" \
"${BASE_READ_ONLY_DIR}"
- docker volume create "${CONTAINER_NAME}_ccache" || true
- docker volume create "${CONTAINER_NAME}_depends" || true
- docker volume create "${CONTAINER_NAME}_depends_sources" || true
- docker volume create "${CONTAINER_NAME}_previous_releases" || true
-
- CI_CCACHE_MOUNT="type=volume,src=${CONTAINER_NAME}_ccache,dst=$CCACHE_DIR"
- CI_DEPENDS_MOUNT="type=volume,src=${CONTAINER_NAME}_depends,dst=$DEPENDS_DIR/built"
- CI_DEPENDS_SOURCES_MOUNT="type=volume,src=${CONTAINER_NAME}_depends_sources,dst=$DEPENDS_DIR/sources"
- CI_PREVIOUS_RELEASES_MOUNT="type=volume,src=${CONTAINER_NAME}_previous_releases,dst=$PREVIOUS_RELEASES_DIR"
+ docker volume create "new_${CONTAINER_NAME}_ccache" || true
+ docker volume create "new_${CONTAINER_NAME}_depends" || true
+ docker volume create "new_${CONTAINER_NAME}_depends_sources" || true
+ docker volume create "new_${CONTAINER_NAME}_previous_releases" || true
+
+ CI_CCACHE_MOUNT="type=volume,src=new_${CONTAINER_NAME}_ccache,dst=$CCACHE_DIR"
+ CI_DEPENDS_MOUNT="type=volume,src=new_${CONTAINER_NAME}_depends,dst=$DEPENDS_DIR/built"
+ CI_DEPENDS_SOURCES_MOUNT="type=volume,src=new_${CONTAINER_NAME}_depends_sources,dst=$DEPENDS_DIR/sources"
+ CI_PREVIOUS_RELEASES_MOUNT="type=volume,src=new_${CONTAINER_NAME}_previous_releases,dst=$PREVIOUS_RELEASES_DIR"
if [ "$DANGER_CI_ON_HOST_CACHE_FOLDERS" ]; then
# ensure the directories exist
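For the local route mentioned above (deleting all volumes), something like the following should do it. This is a minimal sketch, assuming the volumes follow the `${CONTAINER_NAME}_*` naming used in the script; it is not a helper that exists in the repo, and `ci_native_asan` is only an example task name:

```bash
#!/usr/bin/env bash
# Remove the cached CI volumes for one task so the next run starts clean.
# CONTAINER_NAME must match the name used by ci/test/02_run_container.sh.
CONTAINER_NAME="ci_native_asan"

for suffix in ccache depends depends_sources previous_releases; do
  docker volume rm "${CONTAINER_NAME}_${suffix}" || true  # ignore volumes that were never created
done

# Or wipe every CI cache volume at once (assumes they all share the "ci_" prefix):
docker volume ls --quiet --filter "name=ci_" | xargs -r docker volume rm
```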
Possibly, but this still looks all over the place: 32-bit CentOS
Do we have any scope to increase the ccache cache size, to reduce the likelihood of this?
Can we not bump the ccache size on the (Cirrus) workflows for which we use self-hosted runners? Currently these values are pretty low IMO, 100-400MB. These ccaches IIUC are mounted volumes from the runner host itself, and don't need to be e.g. restored from a GitHub Actions cache, so they don't need to be constrained in the same way as those workflows. Even at 1GB per cache, this would be 12GB total, which should be fine on an 80 or 160GB disk? edit: ah I see fanquake got here first...
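For reference, the ccache limit is just a per-cache-directory setting, so raising it for the self-hosted tasks is a small change. A minimal sketch using ccache's own CLI; the 1G figure and the mount path are placeholders, not the repo's current configuration:

```bash
# Raise the cap for the ccache directory that the task's volume mounts,
# then verify the new limit took effect.
export CCACHE_DIR=/path/to/mounted/ccache   # wherever the volume is mounted
ccache --max-size=1G                        # persisted in $CCACHE_DIR/ccache.conf
ccache --show-stats                         # shows the configured limit and current hit rate

# The same limit can also be set per run through the environment:
export CCACHE_MAXSIZE=1G
```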
About the cache issue, a better solution may be to set
However, the CI timeout issue seems to have a different cause, because the runtime for the unit tests shouldn't fluctuate by 100%: https://cirrus-ci.com/task/4803595281891328?logs=ci#L8618 vs https://cirrus-ci.com/task/4981166594326528?logs=ci#L9169
Thought this might just be constrained to that job, but:
ASan is running on GHA infra, so this highlights that the issue may not be infra-related, but related to the code, config, or build files in this repo?
Yea. Seems like it is also not a difference between PRs and merge commits, given that now the current merge run is slow, and a simultaneously running PR is fast:
It also doesn't look CMake-related, because the latest 28.x build also times out: https://github.com/bitcoin/bitcoin/runs/29755145359
One contributing factor to the timeouts may be a recently introduced bug in the exclude parsing #30244 (comment), which affects one specific CI task. However, this still does not explain why suddenly all(?) CI tasks in all branches (at least master, 28.x) are affected.
macOS also seems to time out on GHA: https://github.com/bitcoin/bitcoin/actions/runs/10828462406/job/30043878794
I opened #30879 to re-bucket these tests, which are taking a few minutes longer on CI than they are bucketed for. It's not much of a difference, but may help a little.
Yeah, I think we keep finding small mistakes, but there still must be a large underlying problem. In all the years, I've never seen the CI go from working fine to a large chunk of tasks timing out after 2(!) hours with no apparent reason, even across completely separate infrastructures (GHA, Cirrus workers).
72b46f2 test: fix exclude parsing for functional runner (Max Edwards)

Pull request description: This restores the previous behaviour of being able to exclude a test by name without having to specify the .py extension. It was noticed in #30851 that tests were no longer being excluded. PR #30244 introduced being able to exclude specific tests based on args (such as `--exclude "rpc_bind.py --ipv6"`), but it made the wrong assumption that test names intended to be excluded would include the .py extension. The following #30244 (comment) shows that this is not how the `--exclude` flag was used in CI. #30244 (comment) gave three examples of `--exclude` being used in CI, so I compared the number of tests the runner would run for these three examples in three situations: before #30244 was introduced, on master today, and with this PR applied.

Example: `--previous-releases --coverage --extended --exclude feature_dbcrash`
Test count: before #30244: 314; master: 315; with this PR: 314

Example: `--exclude feature_init,rpc_bind,feature_bind_extra`
Test count: before #30244: 306; master: 311; with this PR: 306

Example: `--exclude rpc_bind,feature_bind_extra`
Test count: before #30244: 307; master: 311; with this PR: 307

I've also tested that the functionality introduced with #30244 remains and we can still exclude specific tests by argument.

ACKs for top commit: maflcko: review ACK 72b46f2; willcl-ark: ACK 72b46f2

Tree-SHA512: 37c0e3115f4e3efdf9705f4ff8cd86a5cc906aacc1ab26b0f767f5fb6a953034332b29b0667073f8382a48a2fe9d649b7e60493daf04061260adaa421419d8c8
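To make the two exclusion styles concrete, these are the kinds of invocations the runner has to handle again after the fix (illustrative commands built from the flags quoted in the description above):

```bash
# Exclude by bare test name (no .py extension), as the CI scripts do:
test/functional/test_runner.py --exclude feature_init,rpc_bind,feature_bind_extra

# Exclude one specific test-plus-arguments combination (the feature added in #30244):
test/functional/test_runner.py --exclude "rpc_bind.py --ipv6"
```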
I recreated a CI timeout and the machine's CPU was idle, flat at 0%, for most of the 2 hours, but I didn't look at where it spent so much time "doing nothing".
I have noticed this in the Cirrus GUI before; it seems to have always been like this, though, IIRC: e.g. https://cirrus-ci.com/task/6481793086390272 or a random pre-CMake run: https://cirrus-ci.com/task/6010695337115648 Perhaps these shared machines are hitting other IO limits? There doesn't seem to be much other explanation for it. Out of interest, what configuration did you use to re-create CI resources? I'd like to be able to do the same.
For comparison (with the 0%), this is what I see when running unit and functional tests locally (not a clean system, but it gives the rough idea). 20-75s is waiting for these to complete:
133/134 Test #5: noverify_tests ....................... Passed 36.41 sec
And the drop-off towards the end is the longer-running functional tests completing:
Remaining jobs: [feature_assumeutxo.py, p2p_node_network_limited.py --v1transport, p2p_node_network_limited.py --v2transport, p2p_permissions.py, feature_config_args.py, p2p_handshake.py, p2p_handshake.py --v2transport, p2p_ibd_txrelay.py, p2p_seednode.py]
I will try a local CI run soon and see what that looks like, and if there's any difference.
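One way to get a comparable picture for a local CI run is to sample CPU utilisation in the background while the task runs. A rough sketch; the `FILE_ENV`/`ci/test_run_all.sh` invocation follows the pattern described in ci/README.md, and the chosen env file is only an example:

```bash
#!/usr/bin/env bash
# Sample CPU usage every 10 seconds in the background while one CI task runs
# locally, so idle stretches (like the 0% flatline above) stand out in the log.
vmstat -t 10 > cpu_samples.log 2>&1 &
SAMPLER_PID=$!

# Run one CI task locally; adjust FILE_ENV to the task being investigated.
MAKEJOBS="-j$(nproc)" FILE_ENV="./ci/test/00_setup_env_native_asan.sh" ./ci/test_run_all.sh

kill "$SAMPLER_PID"
# The "us"/"sy"/"id" columns in cpu_samples.log give user/system/idle CPU over time.
```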
This part is addressed in bitcoin-core/secp256k1#1581.
Yeah, I think the graphs there are not accurate and a bit misleading. It may be best to ignore them. I used a
It is certainly possible that this could be an IO limit, but the other graphs were also at 0%, so the most plausible explanation would be that this is a network IO limit, not something local. I haven't been able to reproduce it since then, but I keep trying. I could reproduce the macOS 2h timeout issue on GHA yesterday; however, today I also failed to reproduce that.
Reminds me that Cirrus logs are randomly disappearing. So maybe there is an issue in the Cirrus backend that streams logs, and the streaming process dies or turns into a zombie process?
Some examples:
Edit: https://www.github.com/cirruslabs/cirrus-ci-docs/issues/1294
Actually, it looks like the GHA macOS CI timeouts remain today, for example: https://github.com/bitcoin/bitcoin/actions/runs/10833854937/job/30163805645?pr=30866 or https://github.com/bitcoin/bitcoin/actions/runs/10829942238/job/30163952659?pr=30856 I am not really familiar with GHA, nor with macOS, so it would be good if someone else checked:
https://github.com/hebasto/bitcoin/actions/runs/10880757045 (caches have been cleaned for all jobs)
https://github.com/hebasto/bitcoin/actions/runs/10881199478/job/30189639531
Thanks. So it looks like even with an empty cache, the macOS 13 GHA can complete in less than 50 minutes, way less than the 2h timeout. Also, the macOS 14 GHA may complete even faster, in less than 40 minutes. Not sure if this helps to debug this issue. Maybe #30866 and #30856 are based on an old(er) commit, which may be used by GitHub to run the CI. So I guess rerunning them via the web interface won't help, and a rebase could be considered. (Unrelated to this issue, one could consider moving to the M1 runners, given that they may speed up the build. Is the removal of
I did a macOS 13 GHA run (https://github.com/maflcko/bitcoin-core-with-ci/actions/runs/10886978505/job/30208236022) and it failed with:
So possibly GitHub could be facing hardware issues on macOS, leading to other issues down the line?
Proposed alternative in #30913
I am seeing a 2h timeout on running
So I fail to see how this could be anything other than a GitHub hardware, software, or data corruption bug. I don't think this can be fixed on our side, apart from removing the task.
Closing for now. Possibly commit a95e742 (x86_64-apple-darwin22.6.0 -> arm64-apple-darwin23.6.0) may have switched to using more recent hardware, which may be less likely to be corrupt. In any case, if someone is still seeing a CI timeout (that is, a CI failure caused by the 2 hour timeout), or a CI task that takes longer than half of the timeout (1h), it would be good to mention it either in this issue or in a new one. The 2 hour CI timeout is chosen so that even with an empty ccache the CI is able to easily pass. It only exists to catch real timeouts (network timeout, zombie process, infinite sleep, infinite loop, ...). Any task taking longer than half the overall task timeout should be tracked down and fixed.
Looks like this is still a problem. The macOS 14 task was able to finish in less than 30 minutes, even without a ccache at all. Now it is stuck again for more than 90 minutes: https://github.com/bitcoin/bitcoin/actions/runs/10956545011/job/30422825062?pr=30935#step:7:3249
The full tail of the log is below. It shows that 202 fuzz targets have been selected, but only 201 completed. Not sure which one is missing, but it seems odd that one target would take more than 1.5 hours, when previously everything passed in less than 0.5 hours. (Full log tail attached here.)
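If someone wants to pin down the missing target from such a log, diffing the list of selected targets against the ones that reached completion is enough. A generic sketch; how `selected.txt` and `completed.txt` are extracted from the log is left open here, since the log format isn't reproduced above:

```bash
# Print the target(s) that were selected but never reported completion.
# selected.txt: one target name per line, as selected at the start of the run.
# completed.txt: one target name per line, as reported finished in the log.
comm -23 <(sort selected.txt) <(sort completed.txt)
```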
Sorry, false alarm. The timeout may be real. It may be
Looks like some CI tasks are timing out, starting a few days ago.
It would be good to investigate and fix the problem.
A quick glance at the ccache stats may indicate that ccache stopped working for some tasks.
However, CI should also pass with a clean ccache, so the issue may be a different one, or there may be more than one issue.