Will AOT compilation still be supported after JIT compilation is added? #510
Comments
Hi @danieldk, thanks for bringing this up!
This is a reasonable concern. I think we can keep both JIT and AOT (for a set of "core" kernels, ~200 MB). We should use the "core" kernels whenever possible, and use JIT for the remaining kernels (new head dimensions, some attention variants, etc.). WDYT?
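A minimal sketch of what that split could look like on the Python side, assuming a hypothetical prebuilt "core" extension and using torch.utils.cpp_extension.load as the JIT fallback; the module name, the core head-dimension set, and the source path are placeholders, not FlashInfer's actual layout:

```python
import logging

from torch.utils.cpp_extension import load  # JIT-compiles and caches a CUDA extension

logger = logging.getLogger(__name__)

# Hypothetical set of head dimensions covered by the AOT "core" kernels.
CORE_HEAD_DIMS = {64, 128, 256}


def get_attention_module(head_dim: int):
    """Return an attention kernel module: prebuilt (AOT) if covered, JIT otherwise."""
    if head_dim in CORE_HEAD_DIMS:
        try:
            import flashinfer_core_kernels  # placeholder name for a prebuilt extension
            return flashinfer_core_kernels
        except ImportError:
            logger.warning("Prebuilt core kernels not found; falling back to JIT")

    # JIT-compile on first use; torch caches the build so later calls are fast.
    return load(
        name=f"attn_hd{head_dim}",
        sources=["csrc/attention.cu"],                 # placeholder source path
        extra_cuda_cflags=[f"-DHEAD_DIM={head_dim}"],
        verbose=False,
    )
```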
I agree with @danieldk that AOT is important. For production use, we typically build a Docker image, and Kubernetes then spawns a pod running a container of that image. The container is ephemeral, so if AOT is missing, every time the pod restarts we would have to JIT compile, which would slow down the start time significantly.

For PyPI, we can ship an sdist and do JIT only. This keeps the PyPI size small. For our hosted wheels, I agree with @yzh119 that AOT "core" kernels is a good idea. I think the "core" kernels should include the kernels that popular pretrained models use (e.g., Llama, Qwen, DeepSeek).

I have a few additional suggestions.

First, for a better user experience, output a log line when JITing a kernel (maybe also include the elapsed time). This way, if we experience an unexpectedly long start time, we can tell that it comes from JIT-compiling FlashInfer kernels. Logging the JIT kernel names can also help us decide what should be included in the "core" kernels.

Second, wheels shouldn't be pinned to PyTorch versions. We can compile kernels that link against a particular CUDA version and expose a C ABI, with a separate thin binding layer on top. Shipping wheels tied to a PyTorch version takes time and storage, and it might even be wrong: I don't think PyTorch explicitly guarantees ABI compatibility across versions.

Third, it would also be good to provide a customizable AOT script, in case some users want AOT beyond the "core" kernels.

Fourth, as for the wheel size, I think even 2 GB is acceptable, because CUDA + PyTorch already take up maybe 10 GB of container size. It's already huge even without FlashInfer, so we shouldn't worry about FlashInfer taking up additional space.
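On the first suggestion, a small sketch of the kind of JIT logging meant here; `jit_compile` is a hypothetical stand-in for whatever function actually builds a kernel:

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("flashinfer.jit")


@contextmanager
def log_jit(kernel_name: str):
    """Log which kernel is being JIT-compiled and how long the build took."""
    logger.info("JIT compiling %s ...", kernel_name)
    start = time.perf_counter()
    try:
        yield
    finally:
        logger.info("JIT compiled %s in %.1f s", kernel_name, time.perf_counter() - start)


# Usage, with jit_compile standing in for the real build entry point:
#   with log_jit("batch_prefill_hd192"):
#       module = jit_compile("batch_prefill_hd192")
```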
Thanks @abcdabcd987 for your thoughts. Here is my tentative plan: maintain two packages.
The two packages would share version numbers. @danieldk, how does this plan sound to you?
Sounds good to me. It would be even better if we could also allow AOT compilation beyond the core set of kernels.
I'd push against a separate package.
This sounds great! I don't think we mind compiling flashinfer ourselves to get all the kernels AOT. For development we are caching builds anyway through Nix, and for production Docker containers we are also looking to improve build caching. I think even outside applications like TGI, AOT-compiling the most-used kernels and JIT-compiling the rest sounds like a good strategy.
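One way to combine the two modes in a container build is to run a warm-up script during `docker build`, so JIT artifacts are baked into an image layer instead of being rebuilt on every pod start. A sketch under the assumption that the JIT path caches builds in a directory we control (the kernel specs and the `build_kernel` hook are placeholders to be wired to the real JIT entry point):

```python
"""warmup.py: run at image build time (e.g. `RUN python warmup.py`) so that
JIT-compiled kernels end up in an image layer instead of being rebuilt at
pod start. All names below are placeholders, not FlashInfer's actual API."""

import os
import time

# Assumption: the JIT path honors this cache directory (torch's extension
# builder uses TORCH_EXTENSIONS_DIR); keep it inside the image and reuse it
# at runtime.
os.environ.setdefault("TORCH_EXTENSIONS_DIR", "/opt/jit-cache")

# Hypothetical list of kernel variants this deployment actually serves.
KERNEL_SPECS = [
    {"kind": "batch_prefill", "head_dim": 128, "dtype": "float16"},
    {"kind": "batch_decode", "head_dim": 128, "dtype": "float16"},
]


def build_kernel(spec: dict) -> None:
    """Placeholder: call the library's JIT compile function for `spec` here."""


if __name__ == "__main__":
    for spec in KERNEL_SPECS:
        start = time.perf_counter()
        build_kernel(spec)
        print(f"warmed {spec} in {time.perf_counter() - start:.1f}s")
```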
After offline discussion, I think the updated plan is as follows (@yzh119 please confirm):
That sounds awesome. Thank you for taking our use case into account!
Both JIT mode and AOT mode are supported in #507.
We saw that support for JIT compilation will be added in #507, and we were wondering what the plans are for ahead-of-time compilation. We are happily using flashinfer in Text Generation Inference; the support for KV caches with block_size=1 has really been helpful for supporting fine-grained prefix caching.

For many of our users it's pretty important that compilation is done ahead of time. When infrastructure is scaled up, we want to avoid delaying or slowing down the processing of user requests due to JIT compilation, and since infrastructure is often heterogeneous (both in the models served and in the GPUs used), we would have to compile most kernels anyway. So for us it would be really useful if AOT continues to be supported going forward.
Thank you for your awesome work on flashinfer 🤗.