miniccc: Improve reconnect logic for serial clients (sandia-minimega#…
…1459)

* miniccc: Improve reconnect logic for serial clients

The miniccc client and ron server have been updated to better support
reconnection capabilities when using the virtual serial port in QEMU
virtual machines. Past merge 850c445 added support for reconnecting
over serial after a virtual machine restart, but didn't address
connection issues that arise after a VM has been paused or restored from
snapshot.

When a VM is paused, the server side of the serial connection eventually
disconnects and resets. When the VM is resumed, the client is still
connected to the virtual serial port in the VM but messages are no
longer making it to the server because of the server-side reset. Since
the virtual serial port in the client never changed (magic of QEMU
serial ports that are beyond my understanding), the client never sees an
EOF and is still able to write to the port without error.

The same thing happens when a VM is restored from a snapshot: the server
side makes a new connection to the Unix socket that's mapped to the VM's
virtual serial port, while the client is still connected to the virtual
serial port in the VM, just as it was before the snapshot.

To allow the client to detect the disconnect, a HEARTBEAT message type
was added and the server was updated to send a HEARTBEAT message to the
client periodically (every 5s by default). The client does nothing with
this message, but it can expect to receive it consistently, so it can
now time out and reset if no messages are received within a set window
(13s by default).

The Linux miniccc client is able to reset by simply closing its
connection to the virtual serial port and reconnecting. This approach
fails on Windows, however, and the only way to reconnect to the virtual
serial port on Windows is to restart the miniccc client process. The
easiest way to do this is to run the miniccc client process as a Windows
service that's configured to restart on failure, and exit the process
when the client detects the need to reset the connection. To support
this, the Windows version of the miniccc client has been updated to
include a `-install` flag that can be used to install it as a Windows
service that will restart on failure.

* fixup! miniccc: Improve reconnect logic for serial clients
activeshadow authored Sep 30, 2021
1 parent 9e4670e commit dd04c33
Showing 68 changed files with 11,528 additions and 393 deletions.
112 changes: 112 additions & 0 deletions src/miniccc/dial.go
@@ -0,0 +1,112 @@
package main

import (
	"encoding/gob"
	"fmt"
	"io"
	"net"
	"time"

	log "minilog"
	"ron"
)

// Retry connecting for up to 120 minutes (480 attempts, 15s apart), then fail.
const Retries = 480
const RetryInterval = 15 * time.Second

var errTimeout = fmt.Errorf("timeout waiting for function")

func dial() error {
	client.Lock()
	defer client.Unlock()

	var err error

	for i := Retries; i > 0; i-- {
		if *f_serial == "" {
			log.Debug("dial: %v:%v:%v", *f_family, *f_parent, *f_port)

			var addr string
			switch *f_family {
			case "tcp":
				addr = fmt.Sprintf("%v:%v", *f_parent, *f_port)
			case "unix":
				addr = *f_parent
			default:
				log.Fatal("invalid ron dial network family: %v", *f_family)
			}

			client.conn, err = net.Dial(*f_family, addr)
		} else {
			err = timeout(ron.CLIENT_RECONNECT_RATE*time.Second, func() (err error) {
				client.conn, err = dialSerial(*f_serial)
				if err != nil {
					err = fmt.Errorf("dialing serial port: %v", err)
				}

				return
			})
		}

		// write magic bytes
		if err == nil {
			_, err = io.WriteString(client.conn, "RON")
		}

		// only wait for the echo if the dial and write succeeded; otherwise
		// client.conn may be nil and the earlier error would be overwritten
		if err == nil {
			err = timeout(ron.CLIENT_RECONNECT_RATE*time.Second, func() (err error) {
				// read until we see the magic bytes back
				var buf [3]byte
				for err == nil && string(buf[:]) != "RON" {
					// shift the buffer and read the next byte
					buf[0], buf[1] = buf[1], buf[2]
					_, err = client.conn.Read(buf[2:])
				}
				if err != nil {
					err = fmt.Errorf("reading magic bytes from ron: %v", err)
				}
				return
			})
		}

		if err == nil {
			client.enc = gob.NewEncoder(client.conn)
			client.dec = gob.NewDecoder(client.conn)
			return nil
		}

		log.Error("%v, retries = %v", err, i)

		// It's possible that we could have an error after the client connection has
		// been created. For example, when using the serial port, writing the magic
		// `RON` bytes can result in an EOF if the host has been rebooted and the
		// minimega server hasn't cleaned up and reconnected to the virtual serial
		// port yet. In such a case, the connection needs to be closed to avoid a
		// "device busy" error when trying to dial it again.
		if client.conn != nil {
			client.conn.Close()
		}

		time.Sleep(RetryInterval)
	}

	return err
}

func timeout(d time.Duration, f func() error) error {
	c := make(chan error, 1) // buffered so f's goroutine can exit after a timeout

	go func() {
		c <- f()
	}()

	select {
	case err := <-c:
		return err
	case <-time.After(d):
		return errTimeout
	}
}
32 changes: 23 additions & 9 deletions src/miniccc/heartbeat.go
@@ -14,18 +14,32 @@ import (
 const HeartbeatRate = 5 * time.Second

 // periodically send the client heartbeat.
-func periodic() {
+func periodic(done chan struct{}) {
 	for {
-		log.Debug("periodic")
+		t := time.NewTimer(HeartbeatRate)

-		now := time.Now()
-		if now.Sub(client.lastHeartbeat) > HeartbeatRate {
-			// issue a heartbeat
-			heartbeat()
-		}
+		select {
+		case <-t.C:
+			log.Debug("periodic")
+
+			now := time.Now()
+			if now.Sub(client.lastHeartbeat) > HeartbeatRate {
+				// issue a heartbeat
+				heartbeat()
+			}
+
+			sleep := HeartbeatRate - now.Sub(client.lastHeartbeat)
+			// time.Sleep(sleep)
+			t.Reset(sleep)
+		case <-done:
+			if !t.Stop() {
+				<-t.C
+			}

-		sleep := HeartbeatRate - now.Sub(client.lastHeartbeat)
-		time.Sleep(sleep)
+			log.Debug("stopping periodic heartbeat")
+
+			return
+		}
 	}
 }
