machine-config-server should not listen in the local port range #166

Closed
squeed opened this issue Nov 12, 2018 · 19 comments

@squeed
Contributor

squeed commented Nov 12, 2018

The machine-config-operator seems to listen on port 49500 (with hostNetwork: true). This is in the default ip_local_port_range, which means the kernel can assign the same port to an outgoing connection, so the listener can collide with active TCP sessions:

[root@test1-master-0 core]# sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768    60999

It should serve on a port lower than 32768.

For example, I managed to collide with a persistent connection from the apiserver to etcd:

[root@test1-master-0 core]# nc -l -t -p 49500
Ncat: bind to 0.0.0.0:49500: Address already in use. QUITTING.
[root@test1-master-0 core]# ss -np | grep 49500
tcp    ESTAB      0      0      192.168.126.11:49500              192.168.126.11:2379                users:(("hypershift",pid=10044,fd=60))
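
The range check described above can be sketched in Go. This is only an illustration, not code from the machine-config-operator: `parsePortRange` and `inRange` are hypothetical helper names, and the input string is the sysctl value printed above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePortRange parses a value such as "32768    60999", the format printed
// for net.ipv4.ip_local_port_range, into its lower and upper bounds.
func parsePortRange(s string) (lo, hi int, err error) {
	fields := strings.Fields(s)
	if len(fields) != 2 {
		return 0, 0, fmt.Errorf("expected two fields, got %d", len(fields))
	}
	if lo, err = strconv.Atoi(fields[0]); err != nil {
		return 0, 0, err
	}
	if hi, err = strconv.Atoi(fields[1]); err != nil {
		return 0, 0, err
	}
	return lo, hi, nil
}

// inRange reports whether port lies within [lo, hi], inclusive.
func inRange(port, lo, hi int) bool {
	return port >= lo && port <= hi
}

func main() {
	lo, hi, err := parsePortRange("32768    60999")
	if err != nil {
		panic(err)
	}
	// 49500 falls inside the range, so the kernel may hand it out as a
	// source port for outgoing connections.
	fmt.Println(inRange(49500, lo, hi)) // prints "true"
}
```
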
@abhinavdahiya
Contributor

/cc @crawford

@crawford
Contributor

@squeed Do you have a specific range that we should use? Does OpenShift define a particular range that we can use for internal services? If not, should we define one?

@cgwalters
Member

To clarify, this port is required to serve Ignition configs, and Ignition runs in the initramfs before a node has joined the cluster and can use cluster networking, etc.

That said, is there any reason we couldn't just pick a free port dynamically on startup?

@crawford
Contributor

crawford commented Dec 8, 2018

> That said, is there any reason we couldn't just pick a free port dynamically on startup?

All of the machines in the cluster would have to know what port number they should connect to. If it were chosen dynamically when the MCS started, how would new machines know where to connect?
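
To make the trade-off concrete: Go's `net.Listen` picks a free port when asked for port 0, but the number is only known once the listener exists, so a machine booting elsewhere cannot learn it in advance. A minimal sketch, where `listenOnFreePort` is a hypothetical helper and not part of the MCS:

```go
package main

import (
	"fmt"
	"net"
)

// listenOnFreePort asks the kernel for any free TCP port on loopback and
// returns the listener together with the port that was assigned.
func listenOnFreePort() (net.Listener, int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, 0, err
	}
	return ln, ln.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	ln, port, err := listenOnFreePort()
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	// The value differs from run to run; clients would need some discovery
	// mechanism to find it.
	fmt.Printf("listening on port %d\n", port)
}
```
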

@ashcrow
Member

ashcrow commented Dec 11, 2018

Service discovery through etcd might be an option, but it would be more complicated than a static, agreed-upon port.

@squeed
Contributor Author

squeed commented Dec 11, 2018

You just need to change the port. It cannot be in the local port range. Just pick a new number below 32768.

@ashcrow
Member

ashcrow commented Dec 11, 2018

32623 doesn't seem to be in use officially or unofficially AFAICT.

@cgwalters
Member

I was glancing at this for my own edification; it seems that when we change this, we need to make a coordinated change to the installer:

https://github.com/openshift/installer/blob/ac006ae671a645553d58c8a29c676968dfa3d85f/pkg/asset/ignition/machine/node.go#L24

@wking
Member

wking commented Jan 22, 2019

For folks blindly searching issues, the current behavior results in logs like:

F0122 18:58:33.952823       1 api.go:59] Machine Config Server exited with error: listen tcp :49500: bind: address already in use

leading to e2e errors like:

fail [github.com/openshift/origin/test/extended/operators/cluster.go:109]: Expected
    <[]string | len:2, cap:2>: [
        "Pod openshift-machine-config-operator/machine-config-server-7mhkb is not healthy: container machine-config-server has restarted more than 5 times",
        "Pod openshift-machine-config-operator/machine-config-server-ntrdk is not healthy: container machine-config-server has restarted more than 5 times",
    ]
to be empty

...

failed: (2m3s) 2019-01-22T19:11:29 "[Feature:Platform] Managed cluster should have no crashlooping pods in core namespaces over two minutes [Suite:openshift/conformance/parallel]"

Out of band, @crawford said:

> That error is usually the result of the process dying and the kernel not releasing those resources fast enough. You can get around that with SO_REUSEPORT.

@squeed
Contributor Author

squeed commented Jan 23, 2019

That can indeed happen, but that's not what happened here. When I filed this bug, there was a clear port conflict with an outgoing connection from the apiserver process to etcd. No amount of waiting would fix the issue.

The port needs to be moved, or this random failure will continue to happen.
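
The failure mode can be reproduced on a Linux host with a short Go sketch. It is an illustration under assumptions, not code from this project: `bindOverOutgoingPort` is a hypothetical name, a throwaway loopback listener stands in for etcd, and the exact bind behavior depends on the operating system.

```go
package main

import (
	"fmt"
	"net"
)

// bindOverOutgoingPort opens an outgoing TCP connection on loopback, then
// tries to listen on the source port the kernel picked for that connection.
// It returns the port, the error from the listen attempt, and any setup error.
func bindOverOutgoingPort() (int, error, error) {
	// Stand-in for the remote service (etcd in the report above).
	server, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, nil, err
	}
	defer server.Close()

	// The outgoing connection gets its source port from the local port range.
	conn, err := net.Dial("tcp", server.Addr().String())
	if err != nil {
		return 0, nil, err
	}
	defer conn.Close()
	port := conn.LocalAddr().(*net.TCPAddr).Port

	// Try to listen on that same port while the connection is still open.
	ln, listenErr := net.Listen("tcp", fmt.Sprintf(":%d", port))
	if listenErr == nil {
		ln.Close()
	}
	return port, listenErr, nil
}

func main() {
	port, listenErr, setupErr := bindOverOutgoingPort()
	if setupErr != nil {
		panic(setupErr)
	}
	fmt.Printf("port %d is held by an outgoing connection; listen error: %v\n", port, listenErr)
}
```

On Linux the listen attempt is expected to fail with `bind: address already in use` for as long as the connection stays open, which is why waiting does not help when the peer is a persistent connection.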

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Feb 1, 2019

This issue seems to have come up again; I'm seeing it in the MCS logs in the payload promotion gate:

I0131 23:33:57.210794       1 start.go:37] Version: 3.11.0-530-g71ace53d-dirty
I0131 23:33:57.211871       1 api.go:51] launching server
I0131 23:33:57.212117       1 api.go:51] launching server
F0131 23:33:57.212096       1 api.go:59] Machine Config Server exited with error: listen tcp :49500: bind: address already in use

https://storage.cloud.google.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/3736/artifacts/release-e2e-aws/pods/openshift-machine-config-operator_machine-config-server-khq9r_machine-config-server_previous.log.gz?_ga=2.58549930.-1062251045.1532122709

From the other logs:

Jan 31 23:25:24.675: INFO: Some pods in error: openshift-machine-config-operator/machine-config-server-khq9r
Jan 31 23:25:29.688: INFO: Some pods in error: openshift-machine-config-operator/machine-config-server-khq9r
Jan 31 23:25:29.942: INFO: Some pods in error: openshift-machine-config-operator/machine-config-server-khq9r

https://gubernator.k8s.io/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/3736

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Feb 1, 2019

Happy to make the changes here and in the installer, if someone can let me know which port was settled on.
cc: @cgwalters @ashcrow

@ashcrow
Member

ashcrow commented Feb 1, 2019

There wasn't any disagreement on 32623. Unless someone has a reason to avoid that port, it's a fair change.

@abhinavdahiya
Contributor

The default node port range for Kubernetes NodePort services is 30000-32767.
Ref: https://kubernetes.io/docs/concepts/services-networking/service/#nodeport

Not sure if that will cause any problems? @squeed

@jlebon
Member

jlebon commented Feb 1, 2019

Hmm yeah, staying outside the default range makes sense to me given that client apps could hardcode a nodePort that matches whatever we choose there. (And it doesn't seem like the installer has a knob to change the range easily, so that's good.)
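
A candidate port can be checked against both ranges mentioned so far with a small Go sketch. The type and function names (`portRange`, `conflicts`) are hypothetical, and the bounds are simply the defaults quoted in this thread, which a given cluster may override.

```go
package main

import "fmt"

// portRange is an inclusive range of ports reserved for some purpose.
type portRange struct {
	name   string
	lo, hi int
}

// reserved lists the default ranges quoted in this thread.
var reserved = []portRange{
	{"Kubernetes NodePort default", 30000, 32767},
	{"ip_local_port_range default", 32768, 60999},
}

// conflicts returns the names of all reserved ranges that contain port.
func conflicts(port int) []string {
	var hits []string
	for _, r := range reserved {
		if port >= r.lo && port <= r.hi {
			hits = append(hits, r.name)
		}
	}
	return hits
}

func main() {
	// The original port and the first proposed replacement.
	for _, p := range []int{49500, 32623} {
		fmt.Println(p, conflicts(p))
	}
}
```

Under these defaults, 49500 lands in the local port range and 32623 lands in the NodePort range, so neither is a safe choice.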

@ashcrow
Member

ashcrow commented Feb 1, 2019

22623?

@kikisdeliveryservice
Contributor

Any objections to 22623?

/assign

@ashcrow
Member

ashcrow commented Feb 1, 2019

Seems like none 😸

@crawford
Contributor

crawford commented Feb 1, 2019

22623 is fine.

kikisdeliveryservice added a commit to kikisdeliveryservice/machine-config-operator that referenced this issue Feb 5, 2019
Transition machine-config-server ports from 49500/49501 -> 22623/22624
to avoid conflict with local port and node port ranges. Listeners added
for legacy ports until installer transitions to using the new ports.

Closes: openshift#166
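
The transition approach in the commit message, serving the same content on the new and legacy ports at once, could look roughly like the Go sketch below. It is not the code from the referenced commit: `serveOnAll`, `placeholderHandler`, and `fetchStatus` are hypothetical names, plain HTTP is a simplification, and the demo binds loopback ports chosen by the kernel. A real server would pass addresses such as ":22623" and ":49500" instead.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
)

// placeholderHandler stands in for whatever handler serves the configs.
func placeholderHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "config placeholder")
	})
}

// serveOnAll starts one HTTP server per address, all sharing the same
// handler, and returns the listeners so the caller can inspect or close them.
func serveOnAll(handler http.Handler, addrs ...string) ([]net.Listener, error) {
	var listeners []net.Listener
	for _, addr := range addrs {
		ln, err := net.Listen("tcp", addr)
		if err != nil {
			// Undo the listeners opened so far before reporting the failure.
			for _, l := range listeners {
				l.Close()
			}
			return nil, fmt.Errorf("listen on %s: %w", addr, err)
		}
		listeners = append(listeners, ln)
		go http.Serve(ln, handler) // returns once ln is closed
	}
	return listeners, nil
}

// fetchStatus performs a GET request and returns the HTTP status code.
func fetchStatus(url string) (int, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	// Two loopback listeners stand in for the new and legacy ports.
	listeners, err := serveOnAll(placeholderHandler(), "127.0.0.1:0", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	for _, ln := range listeners {
		code, err := fetchStatus("http://" + ln.Addr().String())
		fmt.Println(ln.Addr(), code, err)
		ln.Close()
	}
}
```

Keeping the legacy listener alive lets existing installer builds keep fetching from the old port until they switch, at which point the extra address can be dropped.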