Azure DevOps: Source controller getting stuck #402
Comments
We are facing the same behavior with our AKS clusters running in Azure West Europe and China East 2. The problem only occurs in the China region.
Hey @o-otte, are you also using the Azure DevOps git source (libgit2), or another Git provider such as GitHub?
Hi @mfamador, yes, we also use Azure DevOps.
Hello, same issue here with the Git source-controller v0.15.3, as above.
The latest source-controller is v0.21.2, from Flux CLI v0.27.2. There have been substantial changes in all parts of Flux since June, when source-controller v0.15.3 was released. Some of the most recent changes targeted issues like these, and those updates might have already resolved this issue for the original poster. Are you able to reproduce it consistently @natarajmb, or does it go away when you restart? (Are you able to try an upgrade?) If you can reproduce it with a current release, then we can dedicate some resources to trying to reproduce it again. I have a feeling this issue is either solved now or will be solved soon, but it seems tricky to reproduce. It will not be possible to investigate effectively based on reports for an older version if we do not have a report confirming whether this issue remains with the current version of Flux.
@kingdonb I now restarted source-controller and it started to pull the latest version of the code. FYI, we run Flux from a single repo, based on path separation for multiple environments/clusters. I think this occurred because we had a git revert on the path specific to this cluster, after which it never pulled again. Restarting source-controller fixed it. Thanks for the heads up, I will upgrade to the latest Flux.
Re-opening until we get confirmation that the issue has been fixed. @mfamador @natarajmb do you mind trying the fixes included in the new experimental Managed Transport and letting us know whether that fixes your issue, please?
@pjbgf, will do that, thanks. I removed the cronjobs that were restarting the source-controller every 30 mins and will monitor and let you know if the issue is gone.
I forgot to apply the patch with the new env var for the experimental managed transport. Until adding it, the version
The source-controller deployment:
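For reference, the env var in question is typically added through a patch on the source-controller Deployment; a minimal sketch, assuming the EXPERIMENTAL_GIT_TRANSPORT variable is set to "true" on the stock manager container:

```yaml
# Minimal sketch of a strategic-merge patch enabling the experimental
# managed transport. The container name ("manager") and the "true"
# value are assumptions, not taken from the actual deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: source-controller
  namespace: flux-system
spec:
  template:
    spec:
      containers:
        - name: manager
          env:
            - name: EXPERIMENTAL_GIT_TRANSPORT
              value: "true"
```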
I have most of my GitRepositories with
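For context, a libgit2-backed GitRepository of the kind discussed in this thread looks roughly like the sketch below; the URL, interval, branch and secret name are placeholders, not the actual resources in use here.

```yaml
# Sketch of a GitRepository pointing at Azure DevOps over SSH with the
# libgit2 implementation. All names and the URL are placeholders.
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: my-repo
  namespace: flux-system
spec:
  interval: 5m
  url: ssh://git@ssh.dev.azure.com/v3/my-org/my-project/my-repo
  ref:
    branch: main
  secretRef:
    name: azure-devops-ssh
  gitImplementation: libgit2
```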
@mfamador Thank you so much for providing this information. I am investigating this further and can already see a few improvements to be made. Would you be able to change your Are you experiencing the error at every reconcile, or is this happening intermittently? In other words, does the
@pjbgf, I've set It seems that
Some logs when reconciling:
I guess it crashes only when reconciling for the second time. (Sorry, it's hard to be deterministic, I have a lot of GitRepos on this cluster.)
@mfamador that's absolutely fine, thank you for all the information. We will be releasing a minor patch today including a potential fix for this under version
@pjbgf I've just tried
After removing the EXPERIMENTAL_GIT_TRANSPORT env var, it reconciles just fine.
@mfamador thank you for reporting back. I have found a few scenarios in which this happens that are linked to some upstream issues (links below). In most of the cases I reproduced (with managed transport), it was related to concurrency: in my tests, depending on the point at which a second SSH connection is created, it gets stuck before completing its handshake. Would you mind reducing your controller's concurrency to 1 and checking again, please? That should decrease the likelihood of the controller hanging; a sketch of such a patch is below. Do you mind telling me how many repositories you have running in your setup? Are all of them on a 5min interval?
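A rough sketch of such a patch, applied through the flux-system kustomization.yaml (the path and container index assume the stock layout generated by flux bootstrap):

```yaml
# Sketch: appending --concurrent=1 to the source-controller arguments.
# Assumes the stock flux-system layout where the manager is container 0.
patches:
  - patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --concurrent=1
    target:
      kind: Deployment
      name: source-controller
```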
Sure, I'll try to reduce the concurrency and let it run over the weekend. On each cluster, we have around 36 Azure DevOps repos. Most of them have
My current concurrency is
And always in the same region,
@mfamador would you be able to check what the latency is like between the clusters that have the issue vs the ones that don't?
@pjbgf I didn't run any formal test, but I'm pretty sure the latency is higher on the ones failing, yes. I've created a bastion container on every cluster and cloned the git repos a few times from there; although the latency is usually higher in Europe, I wouldn't consider it significant, it clones the repos pretty fast. US (not getting stuck):
Europe (getting stuck):
@mfamador thank you for sharing this. I think the latency may be causing the issue: the crypto library has a known behaviour whereby, if the stdout of an SSH session is not serviced fast enough, the SSH connection may block. Here's a release candidate version that I would be keen to see whether it resolves the issues you are observing:
Quick summary of the changes vs the last published image:
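For anyone wanting to try a candidate build, pinning source-controller to a specific image can be done from the flux-system kustomization.yaml; a sketch with placeholder values (not the actual RC tag):

```yaml
# Sketch: overriding the source-controller image tag via the kustomize
# images transformer. The tag is a placeholder for the candidate build.
images:
  - name: ghcr.io/fluxcd/source-controller
    newTag: <candidate-tag>
```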
@pjbgf, thanks, I'll use this version on the problematic regions for a while.
I'm using "--concurrent=1" as you recommended a while ago. Do you think I can change it back to 4 or 6?
@mfamador yes, with the release candidate version you can revert to 4-6 as appropriate.
@pjbgf, unfortunately, it seems to get stuck just the same. I've been running the release candidate in both North European clusters since yesterday.
While most of them are still reconciling:
one of them is stuck and never reconciles:
Nothing shows up in the logs when I issue the reconcile cmd on the stuck git repo, but when reconciling the other ones the logs do show up:
As usual, restarting the
After a few days I got a different error:
@mfamador Thanks again for always following through with the tests. 🙇 We are releasing a new version soon with a bunch of improvements for
@pjbgf 2 days running with this new
@pjbgf 5 days running with this new RC and it hasn't got stuck yet.
We're only using the The
Hi @mfamador, would it be possible for you to post the logs of the controller before it restarts? It would go a long way towards helping us understand what's happening. Thanks!
@mfamador thanks again for helping us debug this. Would you mind running the command below against the clusters that are misbehaving, please?
Here it goes:
and in another cluster in the same problematic region:
@mfamador after some additional changes I think we have an RC that may also mitigate the restarting issue.
@pjbgf after two days we've had 0 restarts and it's still reconciling with no issues.
@mfamador that's great news, thank you for helping us through this. 🙇 We will release a new patch with this fix later this week.
@pjbgf after 4 days we had a few restarts, adding the logs:
0 restarts since yesterday, and in the other cluster, same region, 0 restarts over the last 5 days, pretty stable and working fine:
@mfamador have the restarts kept at
@pjbgf there are a few restarts in both problematic North European clusters:
Not sure if it helps but here are some logs:
Anyway, there's no stuck controller now, which mitigates our initial issue. Thanks for your help!
@mfamador as you mentioned, the restarts are orthogonal to the initially reported issue, so I have created a new issue for that one and will be closing this one. Thank you so much for all the help getting this resolved.
Hello.
We have 3 AKS clusters, all running the exact same version of Flux (0.16.1), in two different Azure regions (North Europe and East US). The source-controller version is 0.15.3. All clusters are syncing with the same Azure DevOps git repositories (gitImplementation: libgit2).
Everything is working great on the East US clusters, but in North Europe source-controller gets stuck multiple times a day, and only killing it seems to make the sources reconcile again (we've created a cronjob to restart source-controller every half an hour; see the sketch below). Even restarting every half an hour, we're still getting a lot of gaps where there's no source reconciliation.
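A minimal sketch of such a restart cronjob (the namespace, service account name and schedule are assumptions, and the required RBAC is omitted):

```yaml
# Sketch: restart source-controller every 30 minutes by triggering a
# rollout restart. Requires a service account allowed to patch
# deployments in flux-system (RBAC not shown).
apiVersion: batch/v1
kind: CronJob
metadata:
  name: source-controller-restart
  namespace: flux-system
spec:
  schedule: "*/30 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: source-controller-restart
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:1.23
              command:
                - kubectl
                - -n
                - flux-system
                - rollout
                - restart
                - deployment/source-controller
```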
In this state, any manual reconciliation also gets stuck and never finishes:
There are no logs from source-controller when it's in this locked state... I'm pretty sure it's a connectivity problem to Azure DevOps, or something not directly related to source-controller, but maybe it should recover or time out from whatever it's trying to do (?)
I've also increased concurrent from the default 2 to 6, but it doesn't seem to make any difference.
Thanks!