[Performance] Support `xgrammar` for faster constrained decoding #1680

DarkSharpness · 2024-10-16T03:55:56Z

Motivation

We conducted experiments to compare the end-to-end performance of outlines and xgrammar libraries in constrained decoding.

Experiment Setup

CPU: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
GPU: NVIDIA A100 80GB
Python: 3.9.20
outlines: Branch
xgrammar: Branch
Model: Llama-3.1-8B

We ran the experiment with the following command:

python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B --host 0.0.0.0 --port 55555 --mem-fraction-static 0.8 --disable-disk-cache

For the dataset, we selected 389 out of 400 questions from bfcl_v3_simple.

Settings

Single: Requests are made sequentially.
Batch: All requests are made almost simultaneously.
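The two settings can be sketched as a small timing harness. This is only an illustration and not the benchmark script used for this PR: `fake_request` is a hypothetical stand-in for an HTTP call to the server, and the helper names are made up for the example.

```python
import asyncio
import time


async def fake_request(prompt: str, delay: float = 0.01) -> str:
    """Hypothetical stand-in for one request to the inference server."""
    await asyncio.sleep(delay)
    return f"response to {prompt}"


async def timed_request(prompt: str) -> float:
    """Return the end-to-end latency of one request in seconds."""
    start = time.perf_counter()
    await fake_request(prompt)
    return time.perf_counter() - start


async def run_single(prompts):
    """Single: send requests one at a time, each after the previous finishes."""
    return [await timed_request(p) for p in prompts]


async def run_batch(prompts):
    """Batch: start all requests at (almost) the same moment."""
    return await asyncio.gather(*(timed_request(p) for p in prompts))


def average(values):
    return sum(values) / len(values)


prompts = [f"question {i}" for i in range(20)]
single_latencies = asyncio.run(run_single(prompts))
batch_latencies = asyncio.run(run_batch(prompts))
print(f"single: average latency {average(single_latencies):.4f}s")
print(f"batch:  average latency {average(batch_latencies):.4f}s")
```

With a real server, per-request latency in the batch setting also includes time spent queued behind other requests, which the stand-in above does not model.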

Experiment Results

Latency refers to the end-to-end time for single requests and the average time for batch requests. Output tokens refer to the average number of tokens in the output.

Settings	Average Latency (s)	Average Output Tokens
outlines + jump forward + batch	3.054	27.89
outlines + no jump forward + batch	3.072	25.59
outlines + jump forward + single	4.564	27.08
outlines + no jump forward + single	6.492	24.85
xgrammar + jump forward + batch	0.708	24.02
xgrammar + no jump forward + batch	0.904	25.48
xgrammar + jump forward + single	0.799	23.23
xgrammar + no jump forward + single	1.006	24.96
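
To make the comparison easier to read, the latency ratios between the two libraries can be derived directly from the table. The snippet below is a small sketch that only reuses the numbers reported above; the dictionary layout and the `speedup` helper are illustrative.

```python
# Average latency in seconds, copied from the results table.
latency = {
    ("outlines", "jump forward", "batch"): 3.054,
    ("outlines", "no jump forward", "batch"): 3.072,
    ("outlines", "jump forward", "single"): 4.564,
    ("outlines", "no jump forward", "single"): 6.492,
    ("xgrammar", "jump forward", "batch"): 0.708,
    ("xgrammar", "no jump forward", "batch"): 0.904,
    ("xgrammar", "jump forward", "single"): 0.799,
    ("xgrammar", "no jump forward", "single"): 1.006,
}


def speedup(jump: str, mode: str) -> float:
    """Ratio of outlines latency to xgrammar latency for one setting."""
    return latency[("outlines", jump, mode)] / latency[("xgrammar", jump, mode)]


for jump in ("jump forward", "no jump forward"):
    for mode in ("batch", "single"):
        print(f"{jump} + {mode}: {speedup(jump, mode):.2f}x")
```

In these measurements xgrammar is roughly 3.4x to 6.5x faster than outlines, with the largest gap in the single, no-jump-forward setting.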

Modifications

We plan to support both xgrammar and outlines as the backend for constrained decoding in the future.

Checklist

Format your code according to the Contributor Guide.
Add unit tests as outlined in the Contributor Guide.
Update documentation as needed, including docstrings or example tutorials.

zhyncs · 2024-10-16T04:58:38Z

Nice work! May u resolve the conflicts? Thanks!

DarkSharpness · 2024-10-16T06:50:13Z

> Nice work! May u resolve the conflicts? Thanks!

Thank you for your feedback! We’ve resolved the conflicts in the latest commits.

By the way, we also have plans to implement a new version that will support both xgrammar and outlines as backends. We may introduce a command-line argument, like --grammar-backend outlines, to facilitate this.
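
A minimal sketch of what such an argument could look like with `argparse`. Only the flag name `--grammar-backend` comes from the comment above; the choices list, the default, and the help text are assumptions for illustration, not the final implementation.

```python
import argparse

# Assumed set of backends; the real list may differ.
GRAMMAR_BACKENDS = ["outlines", "xgrammar"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Illustrative server arguments")
    parser.add_argument(
        "--grammar-backend",
        type=str,
        choices=GRAMMAR_BACKENDS,
        default="outlines",  # assumed default, chosen only for this sketch
        help="Choose the backend library for constrained decoding.",
    )
    return parser


# argparse exposes the flag as `grammar_backend` (dashes become underscores).
args = build_parser().parse_args(["--grammar-backend", "xgrammar"])
print(args.grammar_backend)
```

Restricting the value with `choices` makes an unknown backend name fail at startup instead of at the first constrained request.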

havetc · 2024-10-16T09:53:49Z

Isn't it a problem to remove regex support? Not that I really mind for my use cases, but JSON support has only existed since the end of August, when it was added in my PR #1125. Any idea whether some users still need regex-constrained decoding?

havetc · 2024-10-16T13:17:10Z

Wait, is there a link to the xgrammar library somewhere? I can't find any reference to it on PyPI, or even on GitHub.
How can it be imported?

DarkSharpness · 2024-10-16T17:23:57Z

> Wait, is there a link to the xgrammar library somewhere? I can't find any reference to it on PyPI, or even on GitHub. How can it be imported?

Hi! The xgrammar library is currently part of a private repository and hasn't been released on PyPI or GitHub yet. It will be made public soon, so stay tuned for updates! Possible future link here.

binarycrayon · 2024-10-16T17:27:34Z

nice, looking forward to the release and to trying it out!

merrymercy · 2024-10-17T01:22:57Z

Can you fix the unit tests?

Add a grammar-backend to support both outlines and xgrammar, similar to

sglang/python/sglang/srt/server_args.py

Lines 515 to 521 in b0facb3

    
    parser.add_argument(
        "--attention-backend",
        type=str,
        choices=["flashinfer", "triton"],
        default=ServerArgs.attention_backend,
        help="Choose the kernels for attention layers.",
    )

DarkSharpness · 2024-10-17T06:05:16Z

> Can you fix the unit tests?
>
> Add a grammar-backend to support both outlines and xgrammar, similar to
>
> sglang/python/sglang/srt/server_args.py
>
> Lines 515 to 521 in b0facb3
>
>     parser.add_argument(
>         "--attention-backend",
>         type=str,
>         choices=["flashinfer", "triton"],
>         default=ServerArgs.attention_backend,
>         help="Choose the kernels for attention layers.",
>     )

Hi! Thanks for your feedback.

Regarding the unit tests: The failures are due to the missing xgrammar module, which unfortunately hasn't been made public yet.
As for adding grammar-backend support for both outlines and xgrammar: since we've removed all outlines code from this branch, we are considering a future PR in which both xgrammar and outlines are supported together.

merrymercy · 2024-10-17T14:59:29Z

You can make the import of outlines and xgrammar optional when they are not used.
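
One way to read this suggestion is to import the selected backend lazily, so the unused library does not need to be installed. The following is a rough sketch under that assumption; `load_grammar_backend` is a hypothetical helper, not a function from sglang.

```python
import importlib


def load_grammar_backend(name: str):
    """Import only the requested backend module, at the time it is needed."""
    if name not in ("outlines", "xgrammar"):
        raise ValueError(f"Unknown grammar backend: {name!r}")
    try:
        # The import happens here, not at module load time, so a missing
        # backend only matters if the user actually selects it.
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"Grammar backend {name!r} was requested but is not installed."
        ) from exc
```

With this pattern, unit tests that never select xgrammar would not fail just because the module is unavailable.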

merrymercy · 2024-10-23T06:29:05Z

moved to #1752

DarkSharpness added 14 commits October 9, 2024 03:07

feat(xgrammar): trying to replace outlines with xgrammar

7acd4f8

fix(xgrammar): rollback when retokenization happens

3a05b1a

minor(xgrammar): remove all dependencies on outlines

f188235

test(xgrammar): customize some testcases for xgrammar | yet buggy...

0adf268

minor(xgrammar): fix the testcase

66ce4c6

fix(xgrammar): fix the rollback of jump_forward

44906b7

Merge branch 'main' into xgrammar

0246ed3

minor(xgrammar): fix some merge errors

5582355

minor(xgrammar): fix some merge errors in testcases

435525f

minor(xgrammar): format the code

0e05a19

Merge branch 'main' into xgrammar

594a70b

feat(xgrammar): adapt to a newer version of xgrammar

0bda5ea

minor(xgrammar): disable bnf cache reset temporarily (need support from xgrammar)

8cf411b

minor(xgrammar): adapt to newer api of xgrammar

262f567

Ying1123 changed the title ~~[Performance] Replace outlines with xgrammar in constrained decoding~~ [Performance] Support xgrammar for faster constrained decoding Oct 16, 2024

DarkSharpness added 3 commits October 16, 2024 06:32

fix(xgrammar): pass model.vocab_size to xgrammar to generate correct state_matcher

dba81d3

Merge branch 'main' into xgrammar

be3a4f6

minor(xgrammar): run pre-commit to format the code

6ad4cce

DarkSharpness marked this pull request as draft October 17, 2024 05:12

merrymercy mentioned this pull request Oct 18, 2024

Development Roadmap (2024 Q4) #1487

Open

33 tasks

DarkSharpness mentioned this pull request Oct 22, 2024

[Performance] Support both xgrammar and outlines for constrained decoding #1752

Merged

3 tasks

merrymercy closed this Oct 23, 2024
