Experimental Managed Transport - Known Issues #636

Closed
pjbgf opened this issue Mar 24, 2022 · 4 comments · Fixed by fluxcd/image-automation-controller#337

@pjbgf
Member

pjbgf commented Mar 24, 2022

Version v0.22 introduced an experimental managed transport to move towards fixing some stability issues when executing git network operations.

This issue catalogues all known issues with the new transport and their respective statuses. Please note that some of these issues could also be experienced with the go-git and non-managed libgit2 implementations.

1) ssh.Dial hangs indefinitely ✔️

SSH connections hang indefinitely during an ssh.Dial call. Behind the scenes, the transport handshake seems to get stuck during key exchange (at kexLoop). More information can be found in the upstream issue.

Fixed from:
source-controller -> ghcr.io/fluxcd/source-controller:v0.22.4
image-automation-controller -> image-automation-controller:v0.21.2

2) HTTP leaked connections ✔️

The controllers showed an ever-increasing number of established HTTP connections (as seen with netstat).

Upon investigation, some requests were not completely processed and closed, which made it less likely that the underlying connections would be reused. In addition, the transport instances were created per request and never shared.

Fixed from:
source-controller -> ghcr.io/fluxcd/source-controller:v0.22.4
image-automation-controller -> image-automation-controller:v0.21.2
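
To illustrate the two causes described above, here is a small sketch using only Go's standard library, not the controllers' actual transport. It drains and closes each response body, then counts how many TCP connections a local test server sees with a shared transport versus a new transport per request. All function names are made up for the example.

```go
package main

import (
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// fetch performs a GET, reads the body to the end, and closes it. Both steps
// are needed before net/http can return the connection to its idle pool.
func fetch(client *http.Client, url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

// countServerConns sends n sequential requests, asking newClient for a client
// each time, and returns how many TCP connections the test server accepted.
func countServerConns(n int, newClient func() *http.Client) (int64, error) {
	var opened int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		io.WriteString(w, "ok")
	}))
	srv.Config.ConnState = func(_ net.Conn, s http.ConnState) {
		if s == http.StateNew {
			atomic.AddInt64(&opened, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	for i := 0; i < n; i++ {
		if err := fetch(newClient(), srv.URL); err != nil {
			return 0, err
		}
	}
	return atomic.LoadInt64(&opened), nil
}

// perRequestClient builds a fresh transport on every call, so no connection
// pool is ever shared between requests.
func perRequestClient() *http.Client {
	return &http.Client{Transport: &http.Transport{}}
}

// sharedClientFactory returns a factory that always hands out the same
// client, and therefore the same connection pool.
func sharedClientFactory() func() *http.Client {
	c := &http.Client{Transport: &http.Transport{}}
	return func() *http.Client { return c }
}
```

With five sequential requests, the shared factory should need a single connection, while perRequestClient opens one per request.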

3) SSH leaked connections ✔️

The controllers showed an ever-increasing number of established SSH connections (as seen with netstat).

SSH connections are now cached based on the remote target, meaning that all operations take place over the same connection instead of the previous one connection per command (clone/push).

Fixed from:
source-controller -> ghcr.io/fluxcd/source-controller:v0.22.4
image-automation-controller -> image-automation-controller:v0.21.2

4) Intermittent SSH errors ✔️

The upstream git and crypto libraries do not support multiple concurrent SSH connections very well (e.g. golang/go#27140).

An initial attempt to cache SSH connections and reuse them across SSH commands completely eliminated intermittent errors (e.g. #439) during long-running tests.

Fixed from:
source-controller -> ghcr.io/fluxcd/source-controller:v0.22.4
image-automation-controller -> image-automation-controller:v0.21.2

5) Panic when closing SSH connections ✔️

The upstream git2go implementation was trying to call .Wait() and .Close() on session or stdin objects that could be nil, leading to a panic.

Fixed from:
source-controller -> ghcr.io/fluxcd/source-controller:v0.22.5
image-automation-controller -> image-automation-controller:v0.21.3

6) multi-ack protocol over SSH ✔️

Connecting to SSH servers that require Git's multi-ack feature (e.g. Azure DevOps) consistently results in errors:

  • EOF
  • transport closed

This seems to occur because the remote server closes the connection mid-flight.

Connections to Azure DevOps will fall back to the unmanaged transport, and users will also be able to opt in or out based on #662.

7) BitBucket ✔️

Multiple concurrent Git connections (one per key type, for example) lead to errors such as "ssh.Dial: dial tcp xxx.xxx.xxx.xxx:22: i/o timeout" or "ssh: rejected: administratively prohibited (cannot open additional channels)".

Removing the cached connections and servicing the PipeStdOut fast enough fixed this.

8) git2go/libgit2 may panic and force the controller to crash ✔️

  • git2go internal state may cause panics. This has been replaced with TransportOptions.

9) Stale connections leading to continuous errors ✔️

Cached connections may go stale over time. With some Git providers (e.g. GitLab) this may happen sooner than with others.
Once the connections become stale, reconciliation errors become common.

Fixed from:
source-controller -> ghcr.io/fluxcd/source-controller:v0.23.0
image-automation-controller -> pending

@pjbgf pjbgf added the area/git Git related issues and pull requests label Mar 24, 2022
@pjbgf pjbgf self-assigned this Mar 24, 2022
@pjbgf
Member Author

pjbgf commented Mar 24, 2022

Some of the SSH issues may be connected to a concurrency issue when calling ssh.Dial, as reported upstream: golang/go#27140

@pjbgf
Member Author

pjbgf commented Mar 28, 2022

The first release with the fixes is now out. To test them, you must first opt in to the managed transport by setting the environment variable EXPERIMENTAL_GIT_TRANSPORT to true in the source-controller's Deployment.

You can do that directly in your kustomization.yaml with:

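# Adds EXPERIMENTAL_GIT_TRANSPORT=true as the first env entry of the first
# container in both controller Deployments.
patches:
- patch: |
    - op: add
      path: /spec/template/spec/containers/0/env/0
      value:
        name: EXPERIMENTAL_GIT_TRANSPORT
        value: "true"
  target:
    kind: Deployment
    name: "(image-automation-controller|source-controller)"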

Note that managed transport only works with the libgit2 implementation, and therefore your GitRepository objects must be set accordingly.
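
For readers wondering how such an opt-in flag is typically consumed, here is a hedged sketch in Go. It is not the controllers' code: managedTransportEnabled is a hypothetical function, and the use of strconv.ParseBool (which accepts values such as "true", "1" and "T") is an assumption about the parsing rules.

```go
package main

import "strconv"

// managedTransportEnabled reports whether the experimental transport was
// opted in to. The lookup function has the same shape as os.LookupEnv, which
// makes the check easy to exercise without touching the real environment.
func managedTransportEnabled(lookup func(string) (string, bool)) bool {
	v, ok := lookup("EXPERIMENTAL_GIT_TRANSPORT")
	if !ok {
		return false
	}
	enabled, err := strconv.ParseBool(v)
	// Unparseable values are treated as "not enabled" in this sketch.
	return err == nil && enabled
}
```

In a real program the call would be managedTransportEnabled(os.LookupEnv), so an unset or invalid variable leaves the default transport in place.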

The official images with the fixes:

source-controller -> ghcr.io/fluxcd/source-controller:v0.22.5
image-automation-controller -> image-automation-controller:v0.21.3

@pjbgf
Member Author

pjbgf commented Apr 5, 2022

Re-opening due to new issues reported (items 6 to 8) - issue description updated accordingly.

@pjbgf
Member Author

pjbgf commented May 27, 2022

Fixed by the changes introduced in #689 and #727.

@pjbgf pjbgf closed this as completed May 27, 2022
Repository owner moved this from In Progress to Done in Maintainers' Focus May 27, 2022