[Performance] Support xgrammar for faster constrained decoding #1680

Closed
wants to merge 17 commits into from


DarkSharpness
Contributor

@DarkSharpness DarkSharpness commented Oct 16, 2024

Motivation

We ran experiments to compare the end-to-end performance of the outlines and xgrammar libraries for constrained decoding.

Experiment Setup

  • CPU: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
  • GPU: NVIDIA A100 80GB
  • Python: 3.9.20
  • outlines: Branch
  • xgrammar: Branch
  • Model: Llama-3.1-8B

We ran the experiment with the following command:

python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B --host 0.0.0.0 --port 55555 --mem-fraction-static 0.8 --disable-disk-cache

For the dataset, we selected 389 out of 400 questions from bfcl_v3_simple.

Settings

  • Single: Requests are made sequentially.
  • Batch: All requests are made almost simultaneously.
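
The two settings can be illustrated with a minimal timing harness. This is only a sketch, not the script used for the measurements: `fake_request` is a hypothetical stand-in for a real HTTP call to the server, and the thread pool is one assumed way of issuing requests almost simultaneously.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fake_request(prompt: str) -> str:
    """Hypothetical stand-in for one request to the server."""
    time.sleep(0.01)  # pretend the server takes 10 ms to answer
    return prompt.upper()


def timed(send, prompt: str) -> float:
    """Return the end-to-end latency of a single request in seconds."""
    start = time.perf_counter()
    send(prompt)
    return time.perf_counter() - start


def run_single(send, prompts):
    """Single setting: requests are made one after another."""
    return [timed(send, p) for p in prompts]


def run_batch(send, prompts):
    """Batch setting: all requests are started at almost the same time."""
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        return list(pool.map(lambda p: timed(send, p), prompts))


prompts = [f"question {i}" for i in range(8)]
single_latencies = run_single(fake_request, prompts)
batch_latencies = run_batch(fake_request, prompts)
print(f"single avg latency: {mean(single_latencies):.3f}s")
print(f"batch avg latency:  {mean(batch_latencies):.3f}s")
```

In both settings the reported number is an average over per-request latencies; only the way the requests are issued differs.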

Experiment Results

Latency refers to the end-to-end time for single requests and the average time for batch requests. Output tokens refer to the average number of tokens in the output.

| Settings | Average Latency (s) | Average Output Tokens |
| --- | --- | --- |
| outlines + jump forward + batch | 3.054 | 27.89 |
| outlines + no jump forward + batch | 3.072 | 25.59 |
| outlines + jump forward + single | 4.564 | 27.08 |
| outlines + no jump forward + single | 6.492 | 24.85 |
| xgrammar + jump forward + batch | 0.708 | 24.02 |
| xgrammar + no jump forward + batch | 0.904 | 25.48 |
| xgrammar + jump forward + single | 0.799 | 23.23 |
| xgrammar + no jump forward + single | 1.006 | 24.96 |
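
For a quick read of the numbers, the ratio of outlines latency to xgrammar latency can be computed per setting. The snippet below only does arithmetic on the latencies reported above; the `speedup` helper is a name introduced here for illustration.

```python
# Average latency in seconds, copied from the results above.
latency = {
    ("outlines", "jump forward", "batch"): 3.054,
    ("outlines", "no jump forward", "batch"): 3.072,
    ("outlines", "jump forward", "single"): 4.564,
    ("outlines", "no jump forward", "single"): 6.492,
    ("xgrammar", "jump forward", "batch"): 0.708,
    ("xgrammar", "no jump forward", "batch"): 0.904,
    ("xgrammar", "jump forward", "single"): 0.799,
    ("xgrammar", "no jump forward", "single"): 1.006,
}


def speedup(jump: str, mode: str) -> float:
    """Ratio of outlines latency to xgrammar latency for one setting."""
    return latency[("outlines", jump, mode)] / latency[("xgrammar", jump, mode)]


for jump in ("jump forward", "no jump forward"):
    for mode in ("batch", "single"):
        print(f"{jump} + {mode}: {speedup(jump, mode):.2f}x")
# jump forward + batch: 4.31x
# jump forward + single: 5.71x
# no jump forward + batch: 3.40x
# no jump forward + single: 6.45x
```

The ratios range from about 3.4x to about 6.5x. The average output length differs slightly between runs, so these are end-to-end ratios rather than per-token ones.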

Modifications

We plan to support both xgrammar and outlines as the backend for constrained decoding in the future.

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs
Member

zhyncs commented Oct 16, 2024

Nice work! May u resolve the conflicts? Thanks!

@Ying1123 Ying1123 changed the title from "[Performance] Replace outlines with xgrammar in constrained decoding" to "[Performance] Support xgrammar for faster constrained decoding" Oct 16, 2024
@DarkSharpness
Contributor Author

Nice work! May u resolve the conflicts? Thanks!

Thank you for your feedback! We’ve resolved the conflicts in the latest commits.

By the way, we also have plans to implement a new version that will support both xgrammar and outlines as backends. We may introduce a command-line argument, like --grammar-backend outlines, to facilitate this.

@havetc
Contributor

havetc commented Oct 16, 2024

Isn't removing regex support a problem? Not that I really mind for my own use cases, but JSON support was only added at the end of August with my PR #1125. Any idea whether some users still need regex-constrained decoding?

@havetc
Contributor

havetc commented Oct 16, 2024

Wait, is there a link to the xgrammar library somewhere? I can't find any reference to it on PyPI, or even on GitHub.
How can it be imported?

@DarkSharpness
Contributor Author

DarkSharpness commented Oct 16, 2024

Wait, is there a link to the xgrammar library somewhere? I can't find any reference to it on PyPI, or even on GitHub. How can it be imported?

Hi! The xgrammar library is currently part of a private repository and hasn't been released on PyPI or GitHub yet. It will be made public soon, so stay tuned for updates! Possible future link here.

@binarycrayon
Contributor

nice, looking forward to the release and to trying it out!

@merrymercy
Contributor

merrymercy commented Oct 17, 2024

  1. Can you fix the unit tests?
  2. Add a grammar-backend to support both outlines and xgrammar, similar to
    parser.add_argument(
        "--attention-backend",
        type=str,
        choices=["flashinfer", "triton"],
        default=ServerArgs.attention_backend,
        help="Choose the kernels for attention layers.",
    )
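
A `--grammar-backend` option following the same pattern might look like the sketch below. It is an illustration only: the `ServerArgs` class here is a trimmed-down hypothetical stand-in, and the `grammar_backend` field, its default of `"outlines"`, and the help text are assumptions rather than the project's actual code.

```python
import argparse
from dataclasses import dataclass


@dataclass
class ServerArgs:
    # Hypothetical, trimmed-down stand-in for the real ServerArgs class.
    attention_backend: str = "flashinfer"
    grammar_backend: str = "outlines"  # assumed default, not confirmed


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--grammar-backend",
        type=str,
        choices=["outlines", "xgrammar"],
        default=ServerArgs.grammar_backend,
        help="Choose the backend library for constrained decoding.",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for the demo.
args = build_parser().parse_args(["--grammar-backend", "xgrammar"])
server_args = ServerArgs(grammar_backend=args.grammar_backend)
print(server_args.grammar_backend)  # xgrammar
```

With `choices` set, argparse rejects any other backend name at startup, so a typo fails early instead of at the first constrained request.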

@DarkSharpness DarkSharpness marked this pull request as draft October 17, 2024 05:12
@DarkSharpness
Contributor Author

DarkSharpness commented Oct 17, 2024

  1. Can you fix the unit tests?
  2. Add a grammar-backend to support both outlines and xgrammar, similar to
    parser.add_argument(
        "--attention-backend",
        type=str,
        choices=["flashinfer", "triton"],
        default=ServerArgs.attention_backend,
        help="Choose the kernels for attention layers.",
    )

Hi! Thanks for your feedback.

  1. Regarding the unit tests: The failures are due to the missing xgrammar module, which unfortunately hasn't been made public yet.
  2. As for adding grammar-backend support for both outlines and xgrammar: since we've removed all outlines code from this branch, we are considering a future PR in which both xgrammar and outlines are supported together.

@merrymercy
Contributor

You can make the import of outlines and xgrammar optional when they are not used.
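
One way to do this is to import the selected library lazily, inside the function that needs it, so that a missing package only matters when its backend is chosen. The sketch below is an assumption about how that could look; `BACKEND_MODULES` and `load_grammar_backend` are hypothetical names, not existing functions in the codebase.

```python
import importlib

# Hypothetical mapping from backend name to the module that provides it.
BACKEND_MODULES = {"outlines": "outlines", "xgrammar": "xgrammar"}


def load_grammar_backend(name: str):
    """Import the module for the selected backend only when it is requested."""
    if name not in BACKEND_MODULES:
        raise ValueError(f"Unknown grammar backend: {name!r}")
    try:
        return importlib.import_module(BACKEND_MODULES[name])
    except ImportError as exc:
        # Turn the bare import failure into a message that names the backend.
        raise ImportError(
            f"Grammar backend '{name}' was selected, but its package is not installed."
        ) from exc


# The demo works whether or not xgrammar is installed.
try:
    backend = load_grammar_backend("xgrammar")
    print("loaded", backend.__name__)
except ImportError as exc:
    print(exc)
```

Nothing is imported at module load time, so unit tests that never select xgrammar can run without the package being present.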

@merrymercy
Contributor

moved to #1752

@merrymercy merrymercy closed this Oct 23, 2024