[Frontend] Fix tcp port reservation for api server #10012

russellb · 2024-11-05T00:18:01Z

PR #8537 changed this code to bind the TCP port prior to the engine
starting to ensure that port isn't used unexpectedly by something
launched during the engine startup. (ray is mentioned in the
discussino history and the comment in the code)

This change introduced some new unexpected behavior as discussed in
issue #9737. Restarting vllm within a short time after handling an API
request where the client hasn't closed its end of the connection would
cause vllm to fail to start with a "port in use" error.

The primary issue was the use of the fd option to the uvicorn
config. This option does not actually do what we want. The relevant
code can be found here:

https://github.com/encode/uvicorn/blob/fe3910083e3990695bc19c2ef671dd447262ae18/uvicorn/config.py#L496-L501

The important line to note is this one:

sock = socket.fromfd(self.fd, socket.AF_UNIX, socket.SOCK_STREAM)

Note that uvicorn is expecting the fd to be for an AF_UNIX socket.
We are passing in a TCP (AF_INET) socket. We seem to be lucky that
this mostly works anyway, though it's surprising it works at all!

Instead of using this fd option, set the SO_REUSEADDR option on
the socket, which will allow uvicorn to bind to the same address and
port.

When reserving the port, we previously always specified "" for the
host. I fixed that too so that we bind to the host that we pass to
uvicorn for it to bind to, as well.

Finally, explicitly close() the socket as a final step of cleanup.

Closes #9737

Signed-off-by: Russell Bryant rbryant@redhat.com

PR vllm-project#8537 changed this code to bind the TCP port prior to the engine starting to ensure that port isn't used unexpectedly by something launched during the engine startup. (ray is mentioned in the discussino history and the comment in the code) This change introduced some new unexpected behavior as discussed in issue vllm-project#9737. Restarting vllm within a short time after handling an API request where the client hasn't closed its end of the connection would cause vllm to fail to start with a "port in use" error. The primary issue was the use of the `fd` option to the uvicorn config. This option does not actually do what we want. The relevant code can be found here: https://github.com/encode/uvicorn/blob/fe3910083e3990695bc19c2ef671dd447262ae18/uvicorn/config.py#L496-L501 The important line to note is this one: sock = socket.fromfd(self.fd, socket.AF_UNIX, socket.SOCK_STREAM) Note that `uvicorn` is expecting the fd to be for an `AF_UNIX` socket. We are passing in a TCP (`AF_INET`) socket. We seem to be lucky that this mostly works anyway, though it's surprising it works at all! Instead of using this `fd` option, set the `SO_REUSEADDR` option on the socket, which will allow `uvicorn` to bind to the same address and port. When reserving the port, we previously always specified `""` for the host. I fixed that too so that we bind to the host that we pass to `uvicorn` for it to bind to, as well. Finally, explicitly `close()` the socket as a final step of cleanup. Closes vllm-project#9737 Signed-off-by: Russell Bryant <rbryant@redhat.com>

github-actions · 2024-11-05T00:18:15Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

njhill · 2024-11-05T01:06:35Z

Thanks @russellb. Just fyi there was another recent change to pass the host in #9798 which was reverted due to causing issues with multi-step, I'm not sure what those were or whether that might still apply here.

russellb · 2024-11-05T01:24:20Z

Thanks @russellb. Just fyi there was another recent change to pass the host in #9798 which was reverted due to causing issues with multi-step, I'm not sure what those were or whether that might still apply here.

Thanks for the pointer! The revert was #9852.

Hopefully that test can run on this PR so we can check? @khluu

khluu · 2024-11-05T01:37:08Z

Thanks @russellb. Just fyi there was another recent change to pass the host in #9798 which was reverted due to causing issues with multi-step, I'm not sure what those were or whether that might still apply here.

Thanks for the pointer! The revert was #9852.

Hopefully that test can run on this PR so we can check? @khluu

I marked ready so CI can run (also unblocked multistep test)

russellb · 2024-11-05T13:57:03Z

Everything seems to be passing 🎉

njhill

Thanks @russellb!

Signed-off-by: Russell Bryant <rbryant@redhat.com> Signed-off-by: Loc Huynh <jc1da.3011@gmail.com>

Signed-off-by: Russell Bryant <rbryant@redhat.com> Signed-off-by: Sumit Dubey <sumit.dubey2@ibm.com>

Signed-off-by: Russell Bryant <rbryant@redhat.com>

Signed-off-by: Russell Bryant <rbryant@redhat.com> Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>

Signed-off-by: Russell Bryant <rbryant@redhat.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

Signed-off-by: Russell Bryant <rbryant@redhat.com>

mergify bot added the frontend label Nov 5, 2024

khluu added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 5, 2024

njhill approved these changes Nov 5, 2024

View reviewed changes

njhill merged commit 5952d81 into vllm-project:main Nov 5, 2024
70 checks passed

JC1DA pushed a commit to JC1DA/vllm that referenced this pull request Nov 11, 2024

[Frontend] Fix tcp port reservation for api server (vllm-project#10012)

c8ef276

Signed-off-by: Russell Bryant <rbryant@redhat.com> Signed-off-by: Loc Huynh <jc1da.3011@gmail.com>

sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024

[Frontend] Fix tcp port reservation for api server (vllm-project#10012)

4a4db65

Signed-off-by: Russell Bryant <rbryant@redhat.com> Signed-off-by: Sumit Dubey <sumit.dubey2@ibm.com>

KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024

[Frontend] Fix tcp port reservation for api server (vllm-project#10012)

3bad025

Signed-off-by: Russell Bryant <rbryant@redhat.com>

tlrmchlsmth pushed a commit to neuralmagic/vllm that referenced this pull request Nov 23, 2024

[Frontend] Fix tcp port reservation for api server (vllm-project#10012)

84c8b51

Signed-off-by: Russell Bryant <rbryant@redhat.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024

[Frontend] Fix tcp port reservation for api server (vllm-project#10012)

73ae504

Signed-off-by: Russell Bryant <rbryant@redhat.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Frontend] Fix tcp port reservation for api server #10012

[Frontend] Fix tcp port reservation for api server #10012

russellb commented Nov 5, 2024

github-actions bot commented Nov 5, 2024

njhill commented Nov 5, 2024

russellb commented Nov 5, 2024

khluu commented Nov 5, 2024

russellb commented Nov 5, 2024

njhill left a comment

[Frontend] Fix tcp port reservation for api server #10012

[Frontend] Fix tcp port reservation for api server #10012

Conversation

russellb commented Nov 5, 2024

github-actions bot commented Nov 5, 2024

njhill commented Nov 5, 2024

russellb commented Nov 5, 2024

khluu commented Nov 5, 2024

russellb commented Nov 5, 2024

njhill left a comment

Choose a reason for hiding this comment