Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Frontend] Fix tcp port reservation for api server #10012

Merged
merged 1 commit into from
Nov 5, 2024

Conversation

russellb
Copy link
Collaborator

@russellb russellb commented Nov 5, 2024

PR #8537 changed this code to bind the TCP port prior to the engine
starting to ensure that port isn't used unexpectedly by something
launched during the engine startup. (ray is mentioned in the
discussino history and the comment in the code)

This change introduced some new unexpected behavior as discussed in
issue #9737. Restarting vllm within a short time after handling an API
request where the client hasn't closed its end of the connection would
cause vllm to fail to start with a "port in use" error.

The primary issue was the use of the fd option to the uvicorn
config. This option does not actually do what we want. The relevant
code can be found here:

https://github.com/encode/uvicorn/blob/fe3910083e3990695bc19c2ef671dd447262ae18/uvicorn/config.py#L496-L501

The important line to note is this one:

sock = socket.fromfd(self.fd, socket.AF_UNIX, socket.SOCK_STREAM)

Note that uvicorn is expecting the fd to be for an AF_UNIX socket.
We are passing in a TCP (AF_INET) socket. We seem to be lucky that
this mostly works anyway, though it's surprising it works at all!

Instead of using this fd option, set the SO_REUSEADDR option on
the socket, which will allow uvicorn to bind to the same address and
port.

When reserving the port, we previously always specified "" for the
host. I fixed that too so that we bind to the host that we pass to
uvicorn for it to bind to, as well.

Finally, explicitly close() the socket as a final step of cleanup.

Closes #9737

Signed-off-by: Russell Bryant rbryant@redhat.com

PR vllm-project#8537 changed this code to bind the TCP port prior to the engine
starting to ensure that port isn't used unexpectedly by something
launched during the engine startup. (ray is mentioned in the
discussino history and the comment in the code)

This change introduced some new unexpected behavior as discussed in
issue vllm-project#9737. Restarting vllm within a short time after handling an API
request where the client hasn't closed its end of the connection would
cause vllm to fail to start with a "port in use" error.

The primary issue was the use of the `fd` option to the uvicorn
config. This option does not actually do what we want. The relevant
code can be found here:

https://github.com/encode/uvicorn/blob/fe3910083e3990695bc19c2ef671dd447262ae18/uvicorn/config.py#L496-L501

The important line to note is this one:

    sock = socket.fromfd(self.fd, socket.AF_UNIX, socket.SOCK_STREAM)

Note that `uvicorn` is expecting the fd to be for an `AF_UNIX` socket.
We are passing in a TCP (`AF_INET`) socket. We seem to be lucky that
this mostly works anyway, though it's surprising it works at all!

Instead of using this `fd` option, set the `SO_REUSEADDR` option on
the socket, which will allow `uvicorn` to bind to the same address and
port.

When reserving the port, we previously always specified `""` for the
host. I fixed that too so that we bind to the host that we pass to
`uvicorn` for it to bind to, as well.

Finally, explicitly `close()` the socket as a final step of cleanup.

Closes vllm-project#9737

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Copy link

github-actions bot commented Nov 5, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@njhill
Copy link
Member

njhill commented Nov 5, 2024

Thanks @russellb. Just fyi there was another recent change to pass the host in #9798 which was reverted due to causing issues with multi-step, I'm not sure what those were or whether that might still apply here.

@russellb
Copy link
Collaborator Author

russellb commented Nov 5, 2024

Thanks @russellb. Just fyi there was another recent change to pass the host in #9798 which was reverted due to causing issues with multi-step, I'm not sure what those were or whether that might still apply here.

Thanks for the pointer! The revert was #9852.

Hopefully that test can run on this PR so we can check? @khluu

@khluu khluu added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 5, 2024
@khluu
Copy link
Collaborator

khluu commented Nov 5, 2024

Thanks @russellb. Just fyi there was another recent change to pass the host in #9798 which was reverted due to causing issues with multi-step, I'm not sure what those were or whether that might still apply here.

Thanks for the pointer! The revert was #9852.

Hopefully that test can run on this PR so we can check? @khluu

I marked ready so CI can run (also unblocked multistep test)

@russellb
Copy link
Collaborator Author

russellb commented Nov 5, 2024

Everything seems to be passing 🎉

Copy link
Member

@njhill njhill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @russellb!

@njhill njhill merged commit 5952d81 into vllm-project:main Nov 5, 2024
70 checks passed
JC1DA pushed a commit to JC1DA/vllm that referenced this pull request Nov 11, 2024
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Loc Huynh <jc1da.3011@gmail.com>
sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Sumit Dubey <sumit.dubey2@ibm.com>
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
Signed-off-by: Russell Bryant <rbryant@redhat.com>
mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 20, 2024
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
tlrmchlsmth pushed a commit to neuralmagic/vllm that referenced this pull request Nov 23, 2024
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
frontend ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: "Address already in use" for 1 minute after crash (since 0.6.2)
3 participants