Artifact caching proxy is unreliable #3481
Comments
Quick report of the actions performed on Friday, the 30th of March, by the Jenkins Infra team:
=> there is no "easy and obvious" cause
I heard @lemeurherve mention that the builds were back to normal since jenkinsci/bom#1921, but I'm not sure: I'll let him report here on Monday (the current main build of the BOM seems to succeed, so we can wait for the weekend to pass). Next steps:
|
Did we consider the possibility of file descriptor exhaustion? Not sure what
I did not see any report today, but I am assuming the reasoning is that the Maven upgrade restored retries and therefore stability. That is possible; it could be confirmed by reverting jenkinsci/bom#1921 and seeing if the problem returns. I have never seen a failure with the old version of Maven and
A theory of file descriptor exhaustion is consistent with the pathology being only observable in BOM, as BOM is a pretty good stress test for the web server. |
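One way to test the file descriptor theory is to compare the number of descriptors a process has open with its soft limit while a BOM build is running. The following is a minimal sketch, assuming a Linux host where `/proc` is readable for the process in question; the helper names `parse_max_open_files` and `fd_usage` are hypothetical, and nothing here comes from the actual ACP deployment.

```python
import os


def parse_max_open_files(limits_text):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None when the limit is 'unlimited' or the line is missing.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files" <soft> <hard> <units>
            soft = line.split()[3]
            return None if soft == "unlimited" else int(soft)
    return None


def fd_usage(pid="self"):
    """Return (open_descriptors, soft_limit) for a Linux process.

    Assumption: /proc is mounted and the caller may read the target
    process's entries (same user or root).
    """
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as f:
        soft_limit = parse_max_open_files(f.read())
    return open_fds, soft_limit
```

Calling `fd_usage(pid)` with the PID of a web server worker process, repeatedly during a heavy build, would show whether the open count approaches the limit. If it stays far below, descriptor exhaustion is an unlikely cause.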
I'm not sure about the However, I realize we did NOT check the configuration of the ingress (which happens to be Nginx), which might be the culprit (we do not hit ACP directly). Anyway, it could be worth it to:
|
Also, there are fundamental questions at stake:
|
Please note that with #3487, investing too much time in DigitalOcean might be a waste as we are not sure if we'll get more credits. |
Metrics on the ACP side confirm that on the 30th of March, the cumulative number of active connections across the whole ACP for DigitalOcean never exceeded 400 at any one time. However, there seem to be a lot of connections in the "waiting" state, which points in the direction of jenkins-infra/helm-charts#471 |
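For readers unfamiliar with the active/waiting distinction: in nginx's `stub_status` output, "Active connections" includes idle keep-alive connections, which are reported separately as "Waiting". The thread does not say how the ACP metrics were collected, so the sketch below only assumes the plain-text `stub_status` format; the function name `parse_stub_status` and the sample numbers are illustrative, not taken from the ACP.

```python
import re


def parse_stub_status(text):
    """Parse nginx stub_status plain-text output into a dict of integers."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    # Third line holds three counters: accepts, handled, requests.
    counters = re.search(
        r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE
    )
    states = re.search(
        r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text
    )
    if not (active and counters and states):
        raise ValueError("unrecognised stub_status output")
    accepts, handled, requests = map(int, counters.groups())
    reading, writing, waiting = map(int, states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }


# Illustrative sample in the stub_status format (made-up numbers).
SAMPLE = (
    "Active connections: 291 \n"
    "server accepts handled requests\n"
    " 16630948 16630948 31070465 \n"
    "Reading: 6 Writing: 179 Waiting: 106 \n"
)
```

With output like this, a high `waiting` share of `active` means many connections are idle keep-alives rather than in-flight transfers, which is the situation the comment above describes.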
Is it worth trying an artifactory instance running on each cloud provider as a proxy instead of nginx? I assume nginx was used as that's what was used previously? |
Yes, but not only:
I'm really bothered by this issue: there is nothing pointing the finger at ACP specifically (otherwise the issue would have been seen in AWS as well for quite some time), so it really looks like a weird issue. |
We've run it in a pod on Kubernetes for about 4 years and can count on one hand the number of times we've had to touch it. |
These are the maintenance tasks I'm thinking of. (edit)
|
I tried enabling the artifact caching proxy in ATH and got the same error, even with Maven 3.9.1. So Maven 3.9.1 is not the solution.
|
Seeing the same error in ATH even in Azure with Maven 3.9.1 in https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/PR-1097/10/ -- giving up on enabling artifact caching proxy for ATH.
|
I am not sure why scare quotes were used. When trying to enable the artifact caching proxy in ATH, I got failures on both DigitalOcean and Azure, but not when I removed the artifact caching proxy (using Maven 3.9.1 in all three cases). The artifact caching proxy is clearly unreliable, and this is clearly not BOM-only. |
About the ACP Azure problem encountered with ATH:
Nothing obvious at first sight. Currently checking the metrics of the VMs used as agent by the ATH to see if there has been an error on these machines |
Extracted the list of agent names from https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/PR-1097/10/ with the command => it appears that only one of these machines reported metrics during the past 24h, and only for a few minutes: while there is a build of the ATH (https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/master/722/) currently being built and its VM agents are reporting metrics: Azure status does not report any incident though: https://azure.status.microsoft/en-us/status/ and https://azure.status.microsoft/en-us/status/history/. I'm not sure where to go from here: again, there is nothing pointing at the ACP in particular, but it is hard to diagnose, so any idea to help is welcome. It feels like a network issue, which might be caused by the IP overlap in the legacy networks (ref. #3351). |
I can edit my comment if the quotes bother you. There is nothing to read into these quotes. They were written by reflex by someone who is not a native English speaker. Yes, you have issues when enabling the ACP. Now that you have reported the issue with the ATH, we are looking into it. |
Scare quotes are used to elicit doubt, so I wanted to be clear that there is a problem. I am happy to step back and let the infrastructure team investigate. I do not have the context and access to dig deeper anyway, and I do not debug problems like this with dashboards and log files — I am more old-school and debug problems like this by doing experiments and running system-level commands to observe the effects. Nevertheless I have never met a debugging problem that I could not eventually solve. |
FWIW, since the merge of your PR introducing ACP on ATH, there have been 3 builds on master and 3 builds on PRs without any ACP error. |
That is expected, because I removed the lines introducing the artifact caching proxy from that PR:
I would have kept them, if it were not for the proxy being unreliable as described in this ticket. |
Hello 👋 It's worth retrying the artifact caching proxy on ATH now that #3535 (comment) has been implemented: that should remove a lot of the network errors seen in the past months |
I've opened a PR at jenkinsci/acceptance-test-harness#1279 |
Since the PR merge two weeks ago I didn't notice particular ATH failures directly related to the artifact caching proxy. It seems to me we can close this issue, WDYT @basil ? |
No objection, closing this issue. |
Sorry, I missed this notification last week. In any case I have not seen any ACP failures. Thank you very much! |
Service(s)
Artifact-caching-proxy
Summary
Despite #2752 (comment), https://ci.jenkins.io/job/Tools/job/bom/job/master/1568/ failed after approximately 90 minutes with
Reproduction steps
Run a lot of jenkinsci/bom builds without the skip-artifact-caching-proxy label and you are bound to see it within a few attempts.