OS.EnableFirewall=y breaks load balanced sets probing... #879

Closed
rkirkpat opened this issue Sep 8, 2017 · 6 comments

@rkirkpat

rkirkpat commented Sep 8, 2017

We have several Ubuntu 14.04 LTS (classic) VMs in the Azure cloud running HTTPS web services on port 443. These web services are exposed to the Internet using load balanced sets, with the probe port also set to 443. Yesterday we upgraded these VMs with security updates, including an update of walinuxagent from v2.0.14 to v2.0.16, after which these web services were no longer accessible.

After much troubleshooting we discovered that the probes sent from the Azure fabric IP, 168.63.129.16, were never getting a reply from our servers, as shown in this tcpdump output:

01:25:06.517671 IP 168.63.129.16.55780 > 10.0.0.6.https: Flags [SEW], seq 2458085120, win 8192, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0
01:25:09.532881 IP 168.63.129.16.55780 > 10.0.0.6.https: Flags [SEW], seq 2458085120, win 8192, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0
01:25:15.532769 IP 168.63.129.16.55780 > 10.0.0.6.https: Flags [S], seq 2458085120, win 8192, options [mss 1440,nop,nop,sackOK], length 0

We then reverted the updated packages one by one and eventually found that the updated walinuxagent package was the cause of the failure. Reviewing /etc/waagent.conf, we found a new config option, OS.EnableFirewall, and that it was enabled. Once we disabled that option and rebooted the server (one that had not been downgraded), the web services were accessible again, as the probe requests were now getting responses:

20:57:50.482060 IP 168.63.129.16.60021 > 10.0.0.6.https: Flags [SEW], seq 2427470624, win 8192, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0
20:57:50.482113 IP 10.0.0.6.https > 168.63.129.16.60021: Flags [S.], seq 2514945281, ack 2427470625, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
20:57:50.482157 IP 168.63.129.16.59962 > 10.0.0.6.https: Flags [.], ack 2, win 513, length 0
20:57:50.482276 IP 168.63.129.16.60021 > 10.0.0.6.https: Flags [.], ack 1, win 513, length 0
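
The difference between the two captures (a SYN that is retransmitted with no reply, versus a SYN answered by a SYN-ACK) can be checked mechanically. The following is a minimal sketch, assuming tcpdump lines in the same textual shape as the captures above; the function name `unanswered_syns` is hypothetical and not part of any tool mentioned in this thread.

```python
import re

# Matches the tcpdump line shape seen in the captures above:
#   <time> IP <src> > <dst>: Flags [<flags>], ...
LINE_RE = re.compile(r"^\S+ IP (\S+) > (\S+): Flags \[([^\]]+)\]")


def unanswered_syns(lines):
    """Return the set of (src, dst) flows that sent a SYN but saw no SYN-ACK."""
    pending = set()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # ignore anything that is not a TCP summary line
        src, dst, flags = match.groups()
        if "S" not in flags:
            continue  # plain ACKs and other segments are not relevant here
        if "." in flags:
            # "S." is a SYN-ACK: it answers the SYN sent in the other direction.
            pending.discard((dst, src))
        else:
            # "S" or "SEW" is an initial SYN (possibly a retransmission).
            pending.add((src, dst))
    return pending
```

Run against the first capture, this reports the flow from 168.63.129.16 as unanswered; run against the second, it reports nothing.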

We reviewed the commits to the waagent.conf file on GitHub and found that a recent commit, e247e7b, had added this option along with firewall rules that block any non-root process from communicating with the fabric server 168.63.129.16. Our web services on port 443 do not run as root (they are a custom Twisted Python service running as a service user), so they are not allowed to answer the probe from the fabric.

There was no warning about this change in any release notes, and the option was enabled by default, contradicting the comment directly above it in the config file, which says it is disabled by default. This issue cost us quite a bit of engineering time to find the solution and restore our web services. I would recommend that this option be disabled by default, or at least that the user be warned when it is enabled!
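
Because the shipped value and the comment in the config file disagree, it can help to read the effective setting directly from the file. Below is a minimal sketch, assuming the simple `Key=value` layout with `#` comments that /etc/waagent.conf uses; the helper names `read_option` and `firewall_setting` are hypothetical. It returns `None` when the option is absent rather than guessing a default, since the default is exactly what is in dispute here.

```python
def read_option(conf_text, key):
    """Return the last value assigned to `key` in the config text, or None."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        name, _, val = line.partition("=")
        if name.strip() == key:
            value = val.strip()
    return value


def firewall_setting(conf_text):
    """True if OS.EnableFirewall=y, False for any other value, None if unset."""
    value = read_option(conf_text, "OS.EnableFirewall")
    if value is None:
        return None
    return value.lower() == "y"
```

For example, `firewall_setting(open("/etc/waagent.conf").read())` reports the value the agent will see on that VM.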

@brendandixon
Contributor

@rkirkpat We apologize that this caused you time and pain. The Azure fabric IP address is not meant for general consumption. Restricting access to the Linux Agent process (which is, effectively, what the iptables rules do) is the right thing.

What probe are your services dependent upon? I'm unaware of any probe from the Azure fabric into a service.

@fjjimenez01

Hi, we experienced the same issue. After the WALinuxAgent update our web services and the web app were no longer available through the load balancer.

2017/09/09 05:08:19.253230 INFO WALinuxAgent-2.2.16 running as process 1129
2017/09/09 05:08:19.267546 INFO Wire server endpoint:168.63.129.16
2017/09/09 05:08:19.294151 INFO Wire server endpoint:168.63.129.16
2017/09/09 05:08:19.496409 INFO Wire server endpoint:168.63.129.16
2017/09/09 05:08:19.560777 INFO Successfully added Azure fabric firewall rules

We are not using Azure Fabric; the WALinuxAgent was installed by default.
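
To confirm whether an agent has applied the rules on a given VM, the log can be searched for the message shown above. This is a minimal sketch, assuming log lines in the same `date time LEVEL message` shape as the excerpt; the function name `firewall_rule_events` is hypothetical, and the log location (commonly /var/log/waagent.log) is an assumption not stated in this thread.

```python
import re

# Matches lines such as:
#   2017/09/09 05:08:19.560777 INFO Successfully added Azure fabric firewall rules
LOG_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) (\w+) (.*)$")


def firewall_rule_events(log_lines):
    """Return the timestamps of log lines that mention firewall rules."""
    events = []
    for line in log_lines:
        match = LOG_RE.match(line.strip())
        if match and "firewall rules" in match.group(3):
            events.append(match.group(1))
    return events
```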

@rkirkpat
Author

rkirkpat commented Sep 9, 2017

We are using load balanced sets, as shown in this screenshot from the Azure portal after selecting one of the (classic) VMs:
[Image: azureportal-loadbalancedset]

The probes are configured (highlighted with the red box) to verify that our HTTPS web service is responding to connections before connections from the Internet are routed to it. This way, if we take a node down for maintenance, connections are routed to the remaining nodes that are up. We have been using this functionality without issue for a couple of years now.
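
The kind of check a TCP probe performs can be approximated with a plain connection attempt. The sketch below is an illustration only, with a hypothetical function name `tcp_probe`; it is not how the Azure load balancer is implemented. It also verifies only that the service accepts connections from wherever the script runs, so it cannot reproduce a failure that is specific to traffic exchanged with 168.63.129.16.

```python
import socket


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) completes in time."""
    try:
        # create_connection performs the full three-way handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `tcp_probe("10.0.0.6", 443)` run from another host in the same virtual network would show whether the service is listening at all, which helps separate a service outage from a probe path problem.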

brendandixon added a commit that referenced this issue Sep 13, 2017
Signed-off-by: Brendan Dixon <brendandixon@me.com>
hglkrijger pushed a commit that referenced this issue Sep 14, 2017
* [#879] -- OS.EnableFirewall=y breaks load balanced sets probing

Signed-off-by: Brendan Dixon <brendandixon@me.com>

* Set version to v2.2.17

Signed-off-by: Brendan Dixon <brendandixon@me.com>
brendandixon added this to the 2.2.17 milestone Sep 14, 2017
@brendandixon
Contributor

Addressed by #883

hglkrijger added a commit that referenced this issue Sep 19, 2017
* [#879] -- OS.EnableFirewall=y breaks load balanced sets probing

Signed-off-by: Brendan Dixon <brendandixon@me.com>

* Set version to v2.2.17

Signed-off-by: Brendan Dixon <brendandixon@me.com>
@Suvitruf

Suvitruf commented Jul 15, 2018

Not sure why this was closed. I've just deployed a VM, and OS.EnableFirewall=y was set in /etc/waagent.conf.

And still: https://github.com/Azure/WALinuxAgent/blob/master/config/ubuntu/waagent.conf#L107

The comment says that by default it should be false, but in fact it is true.

@hglkrijger
Member

@Suvitruf - the initial firewall rule was disabled because it was too restrictive, and hence this issue was closed. Since then we have started rolling out essentially the same functional change but with a less restrictive rule, which should not affect load balancer probes. Thanks for pointing out that the comment in the config needs to be updated; I have opened #1260 for that.
