This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

fleetctl: "askToTrustHost" doesn't work correctly with cluster and empty "known_hosts" #1499

Open
kayrus opened this issue Mar 11, 2016 · 16 comments

Comments

@kayrus
Contributor

kayrus commented Mar 11, 2016

Steps to reproduce:

  • Spawn a CoreOS cluster with 3 nodes.
  • Create a template unit hello@.service:
[Service]
ExecStart=/bin/bash -c "while true; do echo Hello, World %i!; sleep 1; done"
  • Clear ~/.fleetctl/known_hosts.
  • Start the units: fleetctl start hello@{1..10}.service
  • Run fleetctl status hello@{1..10}.service and type "yes" at each prompt. The second prompt does not accept the answer until you type it twice.
@kayrus
Contributor Author

kayrus commented Mar 11, 2016

Another example: with four nodes that all run units, approve the SSH fingerprints for three nodes only and skip the last one. Then run fleetctl status hello@{1..10}.service again, and you'll notice that you have to press Enter three times to get the result. It looks like each SSH connection adds something to stdin, so an extra Enter is required before it starts working.

@kayrus
Contributor Author

kayrus commented Mar 11, 2016

@jonboulle I have no clue what is wrong. But when I removed the session.Stdin = os.Stdin line from the Execute function, fleetctl started to work fine. I don't know whether this fix is correct, because I still haven't found the root cause.

I also played a lot with the constants inside gossh.TerminalModes, and it didn't help at all.

@kayrus
Contributor Author

kayrus commented Mar 11, 2016

UPD:
Nope, this fix is not correct. fleetctl ssh hello@5 "bash" doesn't work, because stdin is not forwarded.

@kayrus
Contributor Author

kayrus commented Mar 11, 2016

A few things to check:

  • check whether we can redirect stdin/out/err after we detect that the host fingerprints have changed
  • check whether sending \n to stdin after closing the session helps
  • check what happens when we run fleetctl ssh unit1 unit2 unit3

@kayrus
Contributor Author

kayrus commented Mar 11, 2016

Answer to 1st one:

  • askToTrustHost in ssh/known_hosts.go is referenced in NewHostKeyChecker in ssh/known_hosts.go
  • NewHostKeyChecker is referenced in getChecker in fleetctl/fleetctl.go
  • getChecker is passed into NewSSHClient or NewTunnelledSSHClient in ssh/ssh.go
  • NewSSHClient or NewTunnelledSSHClient pass the checker into sshClientConfig, which stores it in cfg.HostKeyCallback, in ssh/ssh.go
  • HostKeyCallback is used in newClientTransport in golang.org/x/crypto/ssh/handshake.go and assigned to t.hostKeyCallback
  • t.hostKeyCallback (which is askToTrustHost in ssh/known_hosts.go) is called inside client function in golang.org/x/crypto/ssh/handshake.go
  • client is called by enterKeyExchange in golang.org/x/crypto/ssh/handshake.go
  • enterKeyExchange is called by readOnePacket in golang.org/x/crypto/ssh/handshake.go
  • readOnePacket is called in readLoop in golang.org/x/crypto/ssh/handshake.go
  • readLoop is started as a goroutine by newClientTransport (and newServerTransport, but we don't need it) in golang.org/x/crypto/ssh/handshake.go
  • newClientTransport is called by clientHandshake in golang.org/x/crypto/ssh/client.go
  • clientHandshake is called by NewClientConn in golang.org/x/crypto/ssh/client.go
  • NewClientConn is called by Dial in golang.org/x/crypto/ssh/client.go and NewTunnelledSSHClient in ssh/ssh.go
  • Dial is called by NewSSHClient and NewTunnelledSSHClient in ssh/ssh.go. It would be good to figure out why we call both Dial and NewClientConn inside NewTunnelledSSHClient; probably NewClientConn ends up being called twice (directly and through Dial)
  • NewClientConn is called by NewTunnelledSSHClient in ssh/ssh.go
  • NewTunnelledSSHClient is called by runRemoteCommand and runSSH in fleetctl/ssh.go
  • We need runRemoteCommand which is called by runCommand in fleetctl/ssh.go
  • runCommand is called by runStatusUnits in fleetctl/status.go
  • runRemoteCommand creates sshClient (*ssh.SSHForwardingClient) and passes it to the ssh.Execute in fleetctl/ssh.go
  • ssh.Execute is defined in ssh/ssh.go and it calls makeSession, which is where session.Stdin, session.Stdout and session.Stderr are actually set.

@kayrus
Contributor Author

kayrus commented Mar 14, 2016

Answer to 3rd one:

bash expands fleetctl ssh hello@{1..3}.service into fleetctl ssh hello@1.service hello@2.service hello@3.service, so fleetctl tries to execute "hello@2.service hello@3.service" as a command on the server that runs hello@1.service.

@kayrus
Contributor Author

kayrus commented Mar 14, 2016

When you use a plain bash without any prompt string (ssh coreos1 bash), fleetctl status hello@{1..3}.service works fine.

@kayrus
Contributor Author

kayrus commented Mar 15, 2016

Added debug of the stdin:

diff --git a/ssh/ssh.go b/ssh/ssh.go
index ca02dd9..32ae5df 100644
--- a/ssh/ssh.go
+++ b/ssh/ssh.go
@@ -18,6 +18,8 @@ import (
        "errors"
        "net"
        "os"
+       "io"
+       "fmt"
        "strconv"
        "strings"
        "time"
@@ -68,7 +70,12 @@ func makeSession(client *SSHForwardingClient) (session *gossh.Session, finalize

        session.Stdout = os.Stdout
        session.Stderr = os.Stderr
-       session.Stdin = os.Stdin
+       //session.Stdin = os.Stdin
+       stdin_file, err := os.OpenFile("/home/core/stdin.log", os.O_WRONLY|os.O_APPEND|os.O_CREATE, 0644)
+       if err != nil {
+               fmt.Fprintf(os.Stderr,"%v", err)
+       }
+       session.Stdin = io.TeeReader(os.Stdin, stdin_file)

        modes := gossh.TerminalModes{
                gossh.ECHO:          1,     // enable echoing
@@ -89,6 +96,8 @@ func makeSession(client *SSHForwardingClient) (session *gossh.Session, finalize

                finalize = func() {
                        session.Close()
+                       stdin_file.Close()
+                       fmt.Fprintf(os.Stderr,"state: %#v\n",oldState)
                        terminal.Restore(fd, oldState)
                }

With regular ssh, stdin.log contains a CR (0x0D) after the SSH session is closed. With ssh coreos1 bash (which works fine), stdin.log contains an LF (0x0A).
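
For anyone who wants to reproduce the logging trick outside fleetctl, here is a small stdlib-only sketch of the same io.TeeReader idea. It uses an in-memory buffer instead of /home/core/stdin.log, and teeHex is a hypothetical helper name. It only shows how the tee records the bytes that pass through and what CR and LF look like in hex.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// teeHex forwards input through io.TeeReader into an in-memory log and
// returns the logged bytes as space-separated hex.
func teeHex(input string) string {
	var log bytes.Buffer
	tee := io.TeeReader(strings.NewReader(input), &log)
	// Drain the reader the way a consumer of session.Stdin would;
	// every byte read is also written to the log.
	if _, err := io.Copy(io.Discard, tee); err != nil {
		return "error: " + err.Error()
	}
	return fmt.Sprintf("% x", log.Bytes())
}

func main() {
	fmt.Println(teeHex("\r")) // carriage return, as sent by a raw terminal
	fmt.Println(teeHex("\n")) // line feed
}
```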

@kayrus
Contributor Author

kayrus commented Mar 15, 2016

Here is a simpler way to see the difference: fleetctl status hello@{1..10} > output_term1 2>&1

With ssh coreos1 it produces this output:

^[[1;32m●^[[0m hello@1.service^M
   Loaded: loaded (/run/fleet/units/hello@1.service; linked-runtime; vendor preset: disabled)^M
   Active: ^[[1;32mactive (running)^[[0m since Tue 2016-03-15 09:07:05 UTC; 16min ago^M
 Main PID: 12496 (bash)^M
   CGroup: /system.slice/system-hello.slice/hello@1.service^M
           ├─12496 /bin/bash -c while true; do echo Hello, World 1!; sleep 1; done^M
           └─16661 sleep 1^M
^M
Mar 15 09:23:20 coreos2 bash[12496]: Hello, World 1!^M
Mar 15 09:23:21 coreos2 bash[12496]: Hello, World 1!^M
Mar 15 09:23:22 coreos2 bash[12496]: Hello, World 1!^M
Mar 15 09:23:23 coreos2 bash[12496]: Hello, World 1!^M
Mar 15 09:23:24 coreos2 bash[12496]: Hello, World 1!^M
Mar 15 09:23:25 coreos2 bash[12496]: Hello, World 1!^M
Mar 15 09:23:26 coreos2 bash[12496]: Hello, World 1!^M
Mar 15 09:23:27 coreos2 bash[12496]: Hello, World 1!^M
Mar 15 09:23:28 coreos2 bash[12496]: Hello, World 1!^M
Mar 15 09:23:29 coreos2 bash[12496]: Hello, World 1!^M

ssh coreos1 bash provides this result:

● hello@1.service
   Loaded: loaded (/run/fleet/units/hello@1.service; linked-runtime; vendor preset: disabled)
   Active: active (running) since Tue 2016-03-15 09:07:05 UTC; 17min ago
 Main PID: 12496 (bash)
   CGroup: /system.slice/system-hello.slice/hello@1.service
           ├─12496 /bin/bash -c while true; do echo Hello, World 1!; sleep 1; done
           └─16970 sleep 1

Mar 15 09:24:23 coreos2 bash[12496]: Hello, World 1!
Mar 15 09:24:24 coreos2 bash[12496]: Hello, World 1!
Mar 15 09:24:25 coreos2 bash[12496]: Hello, World 1!
Mar 15 09:24:26 coreos2 bash[12496]: Hello, World 1!
Mar 15 09:24:27 coreos2 bash[12496]: Hello, World 1!
Mar 15 09:24:28 coreos2 bash[12496]: Hello, World 1!
Mar 15 09:24:29 coreos2 bash[12496]: Hello, World 1!
Mar 15 09:24:30 coreos2 bash[12496]: Hello, World 1!
Mar 15 09:24:31 coreos2 bash[12496]: Hello, World 1!
Mar 15 09:24:32 coreos2 bash[12496]: Hello, World 1!

@tixxdz
Contributor

tixxdz commented Mar 15, 2016

@kayrus I'm not sure I follow all the notes, but you should also try to see if a PTY was allocated or not and the interactive mess here.. ?! not sure how the ssh Go implementation works either...

@kayrus
Contributor Author

kayrus commented Mar 15, 2016

This setting removes the ^M, but the terminal is still broken and requires an extra keypress to start working:

gossh.ONLCR:         0,

@kayrus
Contributor Author

kayrus commented Mar 15, 2016

https://github.com/madebymany/moltar/blob/master/sshclient.go - yet another ssh terminal implementation

@kayrus
Contributor Author

kayrus commented Apr 13, 2016

The issue is caused by the inability to terminate the io.Copy goroutine when the session is closed. The goroutine terminates only when the user presses the Enter key.
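
A minimal sketch of that behaviour, using only the standard library. An os.Pipe plays the role of stdin and closedSession is a hypothetical stand-in for the write side of an already closed SSH session; none of this is fleet code. The leftover io.Copy stays blocked in Read until some input arrives, consumes it, fails on the write and only then exits, so the next reader sees the second line, not the first.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"os"
)

// closedSession stands in for the write side of an SSH session that has
// already been closed: it records what it is handed and then fails.
type closedSession struct{ got []byte }

func (c *closedSession) Write(p []byte) (int, error) {
	c.got = append(c.got, p...)
	return 0, errors.New("session closed")
}

// staleCopyDemo feeds two lines into a pipe that plays the role of stdin.
// It returns what the leftover io.Copy swallowed and what the next reader got.
func staleCopyDemo(first, second string) (swallowed, next string, err error) {
	stdinR, stdinW, err := os.Pipe()
	if err != nil {
		return "", "", err
	}
	defer stdinR.Close()
	defer stdinW.Close()

	sess := &closedSession{}
	done := make(chan struct{})
	go func() {
		defer close(done)
		// Blocks in Read until input arrives; only the failed Write
		// that follows makes io.Copy return.
		io.Copy(sess, stdinR)
	}()

	if _, err := io.WriteString(stdinW, first); err != nil {
		return "", "", err
	}
	<-done // the goroutine ends only after consuming the first line

	if _, err := io.WriteString(stdinW, second); err != nil {
		return "", "", err
	}
	next, err = bufio.NewReader(stdinR).ReadString('\n')
	return string(sess.got), next, err
}

func main() {
	swallowed, next, err := staleCopyDemo("yes\n", "yes again\n")
	fmt.Printf("swallowed=%q next=%q err=%v\n", swallowed, next, err)
}
```

In this model the first "yes" is lost to the stale goroutine, which matches the symptom of having to type the answer twice.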

@kayrus
Contributor Author

kayrus commented Apr 13, 2016

@jonboulle do you know whether it is possible to interrupt io.Copy on demand?

kayrus added a commit to endocode/fleet that referenced this issue Apr 14, 2016
@kayrus
Contributor Author

kayrus commented Apr 18, 2016

I was able to interrupt the io.Reader.Read call by closing the stdin descriptor (fd 0), but only in a small Go test program. This solution doesn't work inside the fleetctl code. Trying to figure out why.
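
A small test along those lines might look like the sketch below. It uses an os.Pipe, not the real fd 0, and readInterruptedByClose is a hypothetical helper name. On current Go releases, closing the read end of a pipe wakes up a pending Read with an error. I would not assume the same holds for a terminal-backed os.Stdin or for the Go version fleetctl was built with, so treat this as an illustration of the idea only.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// readInterruptedByClose starts a blocking Read on a pipe, closes the read
// end from the calling goroutine, and reports the error Read came back with.
// It returns nil if Read was still blocked after the timeout.
func readInterruptedByClose() error {
	r, w, err := os.Pipe()
	if err != nil {
		return err
	}
	defer w.Close() // kept open so Read cannot finish with io.EOF

	errc := make(chan error, 1)
	go func() {
		buf := make([]byte, 16)
		_, err := r.Read(buf) // blocks: nothing is ever written
		errc <- err
	}()

	time.Sleep(50 * time.Millisecond) // give Read a chance to block
	r.Close()

	select {
	case err := <-errc:
		return err
	case <-time.After(2 * time.Second):
		return nil
	}
}

func main() {
	fmt.Println("Read returned:", readInterruptedByClose())
}
```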

@kayrus
Copy link
Contributor Author

kayrus commented Jun 24, 2016

I guess it's worth a try to test this solution

dongsupark pushed a commit to endocode/fleet that referenced this issue Jun 24, 2016
Assign session.stdin using StdinPipe() to avoid cases of ssh connections
being blocked on user input.

WIP.

Reported-by: kayrus <kay.diam@gmail.com>
maybe partly resolves coreos#1499
3 participants