Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TCP proxy broken since v1.5.8 #524

Closed
sugdyzhekov opened this issue Jul 30, 2018 · 26 comments · Fixed by #581
Closed

TCP proxy broken since v1.5.8 #524

sugdyzhekov opened this issue Jul 30, 2018 · 26 comments · Fixed by #581
Milestone

Comments

@sugdyzhekov
Copy link

Impact
It's not possible to connect to MySQL since v1.5.8

STR

  • Setup fabio as TCP proxy that listen 3306 port and has rule to route traffic to MySQL container (fabio version v1.5.7)
  • Upgrade fabio to v1.5.8

Expected result
mysql client can establish connection to MySQL service

Actual result
mysql client hangs during connection establishment.

Additional info
v1.5.7

$ nc 10.0.0.10 3306
J
5.6.40G!
B\'J}[0z���9KM"^612FK%umysql_native_password

v1.5.8

$ nc 10.0.0.10 3306
<empty output, process hangs until enter press>
J
5.6.40G!
B\'J}[0z���9KM"^612FK%umysql_native_password

Configuration

proxy_addr

:3306;proto=tcp,:80,:22;proto=tcp,:23;proto=tcp,:8080,:1025;proto=tcp,:5672;proto=tcp,:5673;proto=tcp;cs=certstore,:443;cs=certstore,:8125;proto=tcp,:9000;cs=certstore,:27017;proto=tcp

mysql service register at Consul with following tag: urlprefix-:3306 proto=tcp

@aaronhurt
Copy link
Member

@sugdyzhekov Interesting, could you attach a tcpdump in addition to the nc output?

@magiconair
Copy link
Contributor

magiconair commented Oct 29, 2018

Could be this line: https://github.com/fabiolb/fabio/pull/423/files#diff-7c73f6a0cf2ff2671c80a43d0aad92a3R46

this is the sni proxy. sorry

@magiconair magiconair added the bug label Oct 29, 2018
@magiconair magiconair added this to the 1.5.11 milestone Oct 29, 2018
@magiconair magiconair added pinned and removed bug labels Oct 29, 2018
@pschultz
Copy link
Member

pschultz commented Dec 6, 2018

Just discovered this issue too (I assume it's the same). We noticed because our SMTP server seemed to be unresponsive.

If the server starts the conversation after a client connects (as opposed to the client sending the first message, which works fine), fabio doesn't send data to the client.

Both SMTP and MySQL servers start a connection by sending a banner first.

Here is my reproducer:

// smtp_mock.go
package main

import (
    "fmt"
    "log"
    "net"
)

func main() {
    log.Println("Listening on :8025")
    l, err := net.Listen("tcp", ":8025")
    if err != nil {
        log.Fatal(err)
    }

    for {
        conn, err := l.Accept()
        if err != nil {
            log.Fatal(err)
        }

        log.Println("serving", conn.RemoteAddr())
        fmt.Fprintf(conn, "220 localhost ESMTP Go\r\n")
        conn.Close()
    }
}
$ go run smtp_mock.go
2018/12/06 14:48:42 Listening on :8025

Next shell:

$ curl -XPUT localhost:8500/v1/agent/service/register -d@-<<<'
{
    "Address": "127.0.0.1",
    "Check": {
        "Interval": "5s",
        "TCP": "localhost:8025"
    },
    "ID": "09bf6151-243b-487e-a062-ea9741839818",
    "Name": "mock-smtpd",
    "Port": 8025,
    "Tags": [
        "dev-fabio-:8026"
    ]
}'
$ fabio \
    -registry.consul.tagprefix=dev-fabio- \
    -proxy.addr=':8026;proto=tcp' \
    -log.level=TRACE

Next shell:

# Connect directly
$ telnet localhost 8025
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
220 localhost ESMTP Go go1.11.2
Connection closed by foreign host.

# Connect via fabio
$ telnet localhost 8026
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
<hangs indefinitely>
asdf<enter>
220 localhost ESMTP Go go1.11.2
Connection closed by foreign host.

There's nothing interesting in fabio's logs, not even at TRACE level. Git bisect points fingers at a28d244:

a28d24440a78c441e189e90d8671332122fea90f is the first bad commit
commit a28d24440a78c441e189e90d8671332122fea90f
Author: Aaron Hurt
Date:   Mon Feb 12 21:24:00 2018 -0600

    add basic ip centric access control on routes

:040000 040000 8c2e3c4fd614bac972c3e5cf7f3c5d4403e4b462 3cdbf0b8f4e80bf5a4a0a6567227c936cb0cc77a M      docs
:040000 040000 e046f4421d2a958f63712cc59a843509ae486d86 5681ab0cef7610c12c29a35a6fb262d856f7a952 M      proxy
:040000 040000 34db75362cd6f9ab3a9125f8802e32f892bdc678 491cb45fa2072e457fec37e32cdb49cb1ae383b5 M      route

@pschultz
Copy link
Member

pschultz commented Dec 6, 2018

Here's a tcpdump. Note the two second gap after the third packet. That's where telnet was connected to fabio, but I didn't do anything. Then I hit enter, and then fabio established a connection to the backend.

15:15:26.826545 IP 127.0.0.1.53618 > 127.0.0.1.8026: Flags [S], seq 2552979650, win 43690, options [mss 65495,sackOK,TS val 3127993891 ecr 0,nop,wscale 7], length 0
15:15:26.826565 IP 127.0.0.1.8026 > 127.0.0.1.53618: Flags [S.], seq 193147615, ack 2552979651, win 43690, options [mss 65495,sackOK,TS val 3127993891 ecr 3127993891,nop,wscale 7], length 0
15:15:26.826586 IP 127.0.0.1.53618 > 127.0.0.1.8026: Flags [.], ack 1, win 342, options [nop,nop,TS val 3127993891 ecr 3127993891], length 0
15:15:28.529959 IP 127.0.0.1.53618 > 127.0.0.1.8026: Flags [P.], seq 1:3, ack 1, win 342, options [nop,nop,TS val 3127995594 ecr 3127993891], length 2
15:15:28.529988 IP 127.0.0.1.8026 > 127.0.0.1.53618: Flags [.], ack 3, win 342, options [nop,nop,TS val 3127995594 ecr 3127995594], length 0
15:15:28.530519 IP 127.0.0.1.54084 > 127.0.0.1.8025: Flags [S], seq 2746406685, win 43690, options [mss 65495,sackOK,TS val 3127995595 ecr 0,nop,wscale 7], length 0
15:15:28.530559 IP 127.0.0.1.8025 > 127.0.0.1.54084: Flags [S.], seq 2444991286, ack 2746406686, win 43690, options [mss 65495,sackOK,TS val 3127995595 ecr 3127995595,nop,wscale 7], length 0
15:15:28.530587 IP 127.0.0.1.54084 > 127.0.0.1.8025: Flags [.], ack 1, win 342, options [nop,nop,TS val 3127995595 ecr 3127995595], length 0
15:15:28.530775 IP 127.0.0.1.8025 > 127.0.0.1.54084: Flags [P.], seq 1:25, ack 1, win 342, options [nop,nop,TS val 3127995595 ecr 3127995595], length 24
15:15:28.530805 IP 127.0.0.1.54084 > 127.0.0.1.8025: Flags [.], ack 25, win 342, options [nop,nop,TS val 3127995595 ecr 3127995595], length 0
15:15:28.530844 IP 127.0.0.1.8025 > 127.0.0.1.54084: Flags [F.], seq 25, ack 1, win 342, options [nop,nop,TS val 3127995595 ecr 3127995595], length 0
15:15:28.530846 IP 127.0.0.1.54084 > 127.0.0.1.8025: Flags [P.], seq 1:3, ack 25, win 342, options [nop,nop,TS val 3127995595 ecr 3127995595], length 2
15:15:28.530898 IP 127.0.0.1.8025 > 127.0.0.1.54084: Flags [R], seq 2444991311, win 0, length 0
15:15:28.531008 IP 127.0.0.1.8026 > 127.0.0.1.53618: Flags [P.], seq 1:25, ack 3, win 342, options [nop,nop,TS val 3127995595 ecr 3127995594], length 24
15:15:28.531030 IP 127.0.0.1.53618 > 127.0.0.1.8026: Flags [.], ack 25, win 342, options [nop,nop,TS val 3127995595 ecr 3127995595], length 0
15:15:28.531097 IP 127.0.0.1.8026 > 127.0.0.1.53618: Flags [F.], seq 25, ack 3, win 342, options [nop,nop,TS val 3127995595 ecr 3127995595], length 0
15:15:28.531151 IP 127.0.0.1.53618 > 127.0.0.1.8026: Flags [F.], seq 3, ack 26, win 342, options [nop,nop,TS val 3127995595 ecr 3127995595], length 0
15:15:28.531178 IP 127.0.0.1.8026 > 127.0.0.1.53618: Flags [.], ack 4, win 342, options [nop,nop,TS val 3127995595 ecr 3127995595], length 0

@aaronhurt
Copy link
Member

Interesting and thank you for detailed troubleshooting. That makes a lot of sense. We use TCP proxy but haven't noticed any breakage but in our case it is all client initiated.

@aaronhurt
Copy link
Member

aaronhurt commented Dec 6, 2018

... and it seems I may be to blame :)

@pschultz could you try reproducing this with master? I just made some changes to the ip access control that might be the solution. In short that section wasn't properly getting the remote connection address previously.

@aaronhurt
Copy link
Member

digging into this now ... using your steps I can reproduce it on master

@aaronhurt
Copy link
Member

It is indeed the AccessDenied check but I'm not yet sure why. Will continue to investigate but commenting out the relevant lines in proxy/tcp/tcp_proxy.go does make both test scenarios work.

@aaronhurt
Copy link
Member

aaronhurt commented Dec 6, 2018

I found the issue but I'm going to need to do a bit of research ... steps so far:

Using the same test scenario as above with these code changes in route/access_rules.go:

// AccessDeniedTCP checks rules on the target for TCP proxy routes.
func (t *Target) AccessDeniedTCP(c net.Conn) bool {
	log.Printf("[DEBUG] in AccessDeniedTCP 0")
	var addr *net.TCPAddr
	var ok bool
	// validate remote address assertion
	if addr, ok = c.RemoteAddr().(*net.TCPAddr); !ok {
		log.Printf("[ERROR] failed to assert remote connection address for %s", t.Service)
		return false
	}
	log.Printf("[DEBUG] in AccessDeniedTCP 1 addr=%+v", addr)
	// check remote connection address
	if t.denyByIP(addr.IP) {
		return true
	}
	log.Printf("[DEBUG] in AccessDeniedTCP 2")
	// default allow
	return false
}

Watching the fabio logs when telnetting to port 8026 from a remote host on the LAN ...

2018/12/06 14:12:49 [DEBUG] in AccessDeniedTCP 0
2018/12/06 14:12:53 [DEBUG] in AccessDeniedTCP 1 addr=100.127.255.1:22900
2018/12/06 14:12:53 [DEBUG] in AccessDeniedTCP 2

The 0 logging line is printed right after telnet collects. The 1 logging line is printed as soon as I press <enter> on the remote machine. It appears that golang is blocking and not returning from the RemoteAddr() function until input is received from the client.

@aaronhurt
Copy link
Member

This looks potentially related: golang/go#12943

@aaronhurt
Copy link
Member

This as well: armon/go-proxyproto#1

@aaronhurt
Copy link
Member

aaronhurt commented Dec 6, 2018

The issue seems to be in the proxy protocol implementation ...

// RemoteAddr returns the address of the client if the proxy
// protocol is being used, otherwise just returns the address of
// the socket peer. If there is an error parsing the header, the
// address of the client is not returned, and the socket is closed.
// Once implication of this is that the call could block if the
// client is slow. Using a Deadline is recommended if this is called
// before Read()
func (p *Conn) RemoteAddr() net.Addr {
	p.once.Do(func() {
		if err := p.checkPrefix(); err != nil && err != io.EOF {
			log.Printf("[ERR] Failed to read proxy prefix: %v", err)
		}
	})
	if p.srcAddr != nil {
		return p.srcAddr
	}
	return p.conn.RemoteAddr()
}

We would either have to disable proxy protocol or get the remote IP without checking for proxy protocol for server initiated requests.

@aaronhurt
Copy link
Member

aaronhurt commented Dec 6, 2018

Commenting out the proxy protocol wrapper allows the test to proceed.

With ln = &proxyproto.Listener{Listener: ln} commented out in proxy/listen.go the tests work with the ip access checks.

Logging output from a telnet check against the mock server from a remote host on the LAN:

2018/12/06 15:45:13 [DEBUG] in AccessDeniedTCP 0
2018/12/06 15:45:13 [DEBUG] in AccessDeniedTCP 1 addr=100.127.255.1:22982
2018/12/06 15:45:13 [DEBUG] in AccessDeniedTCP 2

How do we want to proceed?

@aaronhurt
Copy link
Member

aaronhurt commented Dec 6, 2018

My initial thoughts:

  1. Make proxy protocol selectable per listener (default to false)
proxy.addr = 1.2.3.4:443;proxyproto=tue, \
             1.2.3.5:3306;proto=tcp
  1. Skip access rule checking before requesting the remote address if there are not rules defined
  2. Update documentation to indicate that enabling proxyproto on a tcp listener with access or deny rules defined will break with server initiated services

I think making people explicitly enable proxy protocol is also better from a security perspective.

@aaronhurt aaronhurt added the bug label Dec 7, 2018
@magiconair
Copy link
Contributor

So this is broken for TCP proxies which do not use the PROXY protocol and where the client is waiting for an initial handshake from the server which never comes b/c fabio does not forward the connection (i.e. MySQL)? And this started to become an issue when we added access control since we're looking at the RemoteAddr which is a blocking call in the ProxyListener? HTTP is not affected since this was fixed in the stdlib, AFAICT.

Making the PROXY protocol configurable is the better solution, IMO. (I should have done this when I added that.) People know when they are using it, one would assume. The question is what to do with the default. We can have on and off but both settings have drawbacks. Switching to off by default may break existing setups. Leaving it on by default means to keep it broken for users.

Who is affected if we switch the default to off? Users on AWS with access control, for example. If we disable PROXY and they upgrade then the protocol breaks since the server is getting the PROXY header. Or it may silently work since the server can handle the PROXY protocol.

The 1.5.8 behavior was that TCP just worked and I'd prefer if we can get back to that state.

Maybe we also add an auto mode as the new default, where we establish the connection, buffer the first 100 bytes from the client and check for the PROXY header in the TCP proxy directly to handle access control (if there is any).

However, keeping it simple also has advantages. If we decided to switch to off by default we should mitigate the impact:

  • Big warning sign in the CHANGELOG and on top of the README
  • Notice in the Release Notes and the Twitter release notification
  • Can we detect the PROXY header on a TCP connection where PROXY is disabled and issue a warning in the logs?

@pschultz
Copy link
Member

pschultz commented Dec 7, 2018

I'll see if I can put together a PoC for automatic detection.

That being said, if there is a toggle for PROXY Protocol this enables another use-case: checking that a connection is from a particular proxy (as opposed to a particular client downstream). Not sure how common that is, or if it's ever been asked for, though.

@shantanugadgil
Copy link
Contributor

It would be good if the backend could decide if it wants to be enable PROXY or not.

I dunno if thats doable (single listener on Fabio sends PROXY to one backend but not PROXY to another backend)

@pschultz
Copy link
Member

pschultz commented Dec 7, 2018

armon/go-proxyproto#4 adds a ProxyHeaderTimeout option for the listener. If there is no proxy header within that timeout, it falls back to plain TCP proxying.

After some superficial testing, updating goproxy to at least 49fdb5c and exposing that option seems to be a good compromise. We can have it default to something sensible (say, 5ms), and we can interpret zero or a negative value as "disable PROXY protocol support".

In addition we shouldn't call RemoteAddr() unless there are access rules defined.

This will make fabio work for MySQL in the default configuration, costing a slight increase in connect latency, for exactly those users that use access rules and no PROXY protocol.

There is also a chance of subtle breakage by choosing too small a timeout.

@pschultz
Copy link
Member

pschultz commented Dec 7, 2018

If you want to play around, I just pushed to https://github.com/classmarkets/fabio/tree/fix-tcp (classmarkets/fabio@7246b92)

@aaronhurt
Copy link
Member

I made a few commits to a new branch referenced above that address the core points mentioned thus far. The current proxy-protocol toggle defaults to false but that may need to change per @magiconair comment's above.

Regardless of the default setting I believe additional documentation and release notes will be warranted.

@magiconair
Copy link
Contributor

@pschultz I've invited you to be a collaborator so that we can work on the same branches.

magiconair pushed a commit that referenced this issue Dec 7, 2018
Co-authored-by: Peter Schultz <peter.schultz@classmarkets.com>
magiconair added a commit that referenced this issue Dec 7, 2018
address concerns raised while troubleshooting #524
@magiconair
Copy link
Contributor

Please note that this change disabled PROXY protocol by default and you have to enable it if you need it. If you feel this is a grave mistake, the please let us know. (I'll mull over this a bit more before creating a new release).

aaronhurt added a commit that referenced this issue Dec 7, 2018
This fixes a mistake made in #524 that put the deny check
inside the assertion error block.  This effectively disabled
checking of access rules for TCP proxy requests.
aaronhurt added a commit that referenced this issue Dec 7, 2018
* update documentation around the changes to PROXY protocol
* be consistent with accessRules check
* fix mistake made in #524 that put the deny check inside the assertion error block
@pschultz
Copy link
Member

@pschultz I've invited you to be a collaborator so that we can work on the same branches.

@magiconair, thank you very much. Would you also add my SSH key, please?

ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBFNuN8ltMS1Eg034DHD0fekzKSkdlzlWeSMbJy/LJxOYEUQMgNdtQ7GzDZUsArTObenJRKMnpyP6I2mcwesw0xM= pschultz

@magiconair
Copy link
Contributor

@pschultz I've enabled your account to be a collaborator with write permissions. That should be sufficient. If you want to use a deploy key instead then it has to be unique for this repo since you cannot use the same key for multiple repos. Trying to add this key returns "key already in use".

@pashinin
Copy link

When will this change be released? Can you make 1.5.11?

@magiconair
Copy link
Contributor

I'll try to release 1.5.11 tonight CET time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants