[Bug] 如何提前终止流式推理pipe.stream_infer #3106

Open
3 tasks done
youyc22 opened this issue Jan 31, 2025 · 3 comments

Comments

@youyc22 commented Jan 31, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

Hello, I have run into the following problem: I want to terminate the first streaming inference early once a condition is met (e.g., tokens > 500 in the screenshot), and then immediately run streaming inference for a second question. However, I found that even though break exits the first loop, inference for the first question keeps running in the background (taking compute resources away from the second question).

Is there an interface to cancel the first request without affecting inference for the second question?
[screenshot of the reproduction code]

Reproduction

See the code in the screenshot above.
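The failure mode described in this issue can be sketched without lmdeploy: a producer thread keeps generating "tokens" after the consumer breaks out of its loop, until it is explicitly signaled to stop. All names here (stream_infer, stop_event, the 500-token threshold) are illustrative stand-ins, not lmdeploy APIs.

```python
import queue
import threading
import time

# Producer thread standing in for a streaming-inference backend: it keeps
# pushing tokens until either it finishes or a cancel signal is set.
def stream_infer(stop_event, out_q, total=1000):
    for i in range(total):
        if stop_event.is_set():      # cooperative cancellation point
            return
        out_q.put(i)
        time.sleep(0.001)

stop = threading.Event()
q = queue.Queue()
t = threading.Thread(target=stream_infer, args=(stop, q))
t.start()

tokens = 0
while True:
    q.get()
    tokens += 1
    if tokens > 500:
        break                        # break alone does NOT stop the producer...

stop.set()                           # ...an explicit cancel signal is needed
t.join()
print(tokens)  # → 501
```

Breaking out of the consumer loop only stops the caller from reading; without a cancellation signal reaching the backend, the producer keeps burning compute, which is exactly the symptom reported here.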

Environment

lmdeploy: 0.7.0.post1

Error traceback

@youyc22 (Author) commented Jan 31, 2025

I tried using stop_session, but it did not help.
Re-initializing the pipe solves the problem, but it still wastes some time.

@lzhangzz (Collaborator) commented Feb 1, 2025

I tested this: with the turbomind engine, generation can be terminated by breaking out of the loop, but the pytorch engine does not yet seem to support breaking out of the generator loop.

Currently there is one place in async_engine.py that does not handle CancelledError; as a workaround, you can comment out the add_done_callback part at this location:

asyncio.run_coroutine_threadsafe(_infer(), loop).add_done_callback(lambda x: x.result())

Also, since the inference thread runs asynchronously with respect to the interface, a few more tokens may still be generated after cancellation before it actually stops.
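The cancellation path described above can be illustrated with plain asyncio (a self-contained sketch, not lmdeploy code; _infer here is a dummy coroutine): a done-callback that blindly calls result() re-raises CancelledError when the request is canceled, while a callback that catches it lets cancellation complete cleanly.

```python
import asyncio
import concurrent.futures
import threading
import time

# Background event loop, mimicking an inference loop running in its own thread.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def _infer():
    # Dummy stand-in for token generation; cancellation interrupts the sleep.
    await asyncio.sleep(10)

fut = asyncio.run_coroutine_threadsafe(_infer(), loop)

# Problematic pattern: x.result() re-raises CancelledError inside the callback.
# fut.add_done_callback(lambda x: x.result())

# Tolerant callback: treat cancellation as expected early termination.
def _done(f):
    try:
        f.result()
    except concurrent.futures.CancelledError:
        pass

fut.add_done_callback(_done)
time.sleep(0.1)   # let the coroutine start on the background loop
fut.cancel()      # request early termination from the caller's thread
time.sleep(0.2)   # give the loop time to actually cancel the task
print(fut.cancelled())  # → True
```

Because the cancel request crosses a thread boundary before reaching the event loop, the coroutine may still make a little progress before it stops, which matches the "a few more tokens" behavior noted above.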

@youyc22 (Author) commented Feb 1, 2025

> I tested this: with the turbomind engine, generation can be terminated by breaking out of the loop, but the pytorch engine does not yet seem to support breaking out of the generator loop.
>
> Currently there is one place in async_engine.py that does not handle CancelledError; as a workaround, you can comment out the add_done_callback part at this location:
>
> lmdeploy/lmdeploy/serve/async_engine.py, line 430 in 637435f:
> asyncio.run_coroutine_threadsafe(_infer(), loop).add_done_callback(lambda x: x.result())
>
> Also, since the inference thread runs asynchronously with respect to the interface, a few more tokens may still be generated after cancellation before it actually stops.

Thanks!
