[Neo] Fix Neo Quantization properties output. Add some additional configuration. #2077

a-ys · 2024-06-17T22:29:00Z

Description

Neo serving.properties output

Currently, the Neo quantization script always quantizes with tensor_parallel_degree=8 and writes tensor_parallel_degree=8 to the output serving.properties. That value is often incompatible with serving the quantized model, so the script will stop writing it unconditionally.

Specifically, small AWQ-quantized models such as Llama-2-7b cannot be served with tp=8, because intermediate_size / tp_degree must be divisible by the quantization group size (128). Here, intermediate_size after quantization is 5632, so the valid tp_degrees are 1, 2, and 4; with tp=8, 5632 / 8 = 704, which is not a multiple of 128.
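The divisibility rule above can be checked with a short Python sketch. The helper name `valid_tp_degrees` and the candidate list are illustrative only and are not part of the Neo quantization script; the code simply encodes the stated constraint that each tensor-parallel shard of intermediate_size must hold a whole number of quantization groups.

```python
def valid_tp_degrees(intermediate_size: int, group_size: int,
                     candidates: list[int]) -> list[int]:
    """Return the candidate tp degrees whose per-rank shard of the
    intermediate dimension is a multiple of the quantization group size."""
    valid = []
    for tp in candidates:
        # The intermediate dimension must first split evenly across tp ranks.
        if intermediate_size % tp != 0:
            continue
        shard = intermediate_size // tp
        # Each rank's shard must contain a whole number of quantization groups.
        if shard % group_size == 0:
            valid.append(tp)
    return valid


if __name__ == "__main__":
    print(valid_tp_degrees(5632, 128, [1, 2, 4, 8]))  # [1, 2, 4]
```

Running it with the values from the description reproduces the valid degrees 1, 2, and 4 and rejects 8.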

New behavior: Neo still quantizes with tensor_parallel_degree=8, but whether tensor_parallel_degree appears in the output depends on the customer's input to Neo:

- If the customer passes tensor_parallel_degree in serving.properties or through the environment variable (but not both), the provided value is passed through to the output.
- If the customer passes tensor_parallel_degree in both serving.properties and the environment variable, the environment variable's value is passed through to the output.
- If the customer passes neither, tensor_parallel_degree is omitted from the output serving.properties. The customer can then add it to serving.properties manually or pass it as an environment variable at serving time.
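The precedence rules above can be summarized in a small Python sketch. The function names are hypothetical, and the key names `option.tensor_parallel_degree` (serving.properties) and `OPTION_TENSOR_PARALLEL_DEGREE` (environment variable) are assumptions based on DJL Serving's usual `option.*` / `OPTION_*` naming; the PR text does not name the environment variable.

```python
from typing import Mapping, Optional

# Assumed key names (not stated in the PR): DJL Serving conventionally reads
# "option.*" keys from serving.properties and matching OPTION_* env variables.
PROPERTY_KEY = "option.tensor_parallel_degree"
ENV_KEY = "OPTION_TENSOR_PARALLEL_DEGREE"


def resolve_output_tp(properties: Mapping[str, str],
                      env: Mapping[str, str]) -> Optional[str]:
    """Return the tensor_parallel_degree to write to the output, or None to omit it."""
    if ENV_KEY in env:
        # Covers "env var only" and "both set": the env var takes precedence.
        return env[ENV_KEY]
    # Covers "serving.properties only" (its value) and "neither" (None).
    return properties.get(PROPERTY_KEY)


def build_output_properties(properties: Mapping[str, str],
                            env: Mapping[str, str]) -> dict[str, str]:
    """Copy the customer's properties; the tp degree is written only if the
    customer supplied one, so the internal quantization tp=8 never leaks out."""
    output = {k: v for k, v in properties.items() if k != PROPERTY_KEY}
    tp = resolve_output_tp(properties, env)
    if tp is not None:
        output[PROPERTY_KEY] = tp
    return output


if __name__ == "__main__":
    props = {"option.tensor_parallel_degree": "2"}
    print(build_output_properties(props, {}))            # {'option.tensor_parallel_degree': '2'}
    print(build_output_properties(props, {ENV_KEY: "4"}))  # {'option.tensor_parallel_degree': '4'}
    print(build_output_properties({}, {}))               # {}
```

In a container, `env` would typically be `os.environ`; passing it as a parameter keeps the precedence logic easy to test in isolation.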

Neo environment variable updates

Neo will also accept SM_NEO_HF_CACHE_DIR as the cache directory for the quantization dataset. This is for forward compatibility, in case future containers provide separate compilation and HF/datasets cache directories.
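A minimal sketch of how the new variable might be honored, assuming the script otherwise falls back to whatever cache directory it already uses. Only the name `SM_NEO_HF_CACHE_DIR` comes from the description; the function name and the `fallback` parameter are hypothetical.

```python
from typing import Mapping, Optional

# SM_NEO_HF_CACHE_DIR is named in the PR; everything else here is illustrative.
HF_CACHE_ENV = "SM_NEO_HF_CACHE_DIR"


def resolve_dataset_cache_dir(env: Mapping[str, str],
                              fallback: Optional[str]) -> Optional[str]:
    """Return the directory in which to cache the quantization dataset."""
    cache_dir = env.get(HF_CACHE_ENV)
    if cache_dir:
        # A non-empty dedicated HF/datasets cache dir was provided; prefer it.
        return cache_dir
    # Otherwise keep using the existing (assumed) default cache location.
    return fallback
```

The resolved path could, for example, be passed as the `cache_dir` argument of Hugging Face `datasets.load_dataset`; the PR does not say exactly how the Neo script consumes it.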


a-ys added 3 commits June 17, 2024 22:27

- [Neo] Add SM_NEO_HF_CACHE and SM_NEO_DEPLOYMENT_INSTANCE envvars · b5e75c7
- [Neo] Fix outputted tp_degree in Neo quantization · b9dfd25
- [Neo] Remove deployment_instance_type envvar · a8d17b6

a-ys requested review from zachgk, frankfliu and a team as code owners June 17, 2024 22:29

sindhuvahinis approved these changes Jun 17, 2024


sindhuvahinis merged commit 8045ad3 into deepjavalibrary:master Jun 17, 2024
8 checks passed

sindhuvahinis pushed a commit to sindhuvahinis/djl-serving that referenced this pull request Jun 17, 2024

[Neo] Fix Neo Quantization properties output. Add some additional configuration. (deepjavalibrary#2077) · 6637067

sindhuvahinis mentioned this pull request Jun 17, 2024

[cherry-pick][0.28.0-dlc][Neo] Fix Neo Quantization properties output. Add some additional con… #2078

Merged

sindhuvahinis pushed a commit to sindhuvahinis/djl-serving that referenced this pull request Jun 18, 2024

[Neo] Fix Neo Quantization properties output. Add some additional configuration. (deepjavalibrary#2077) · 2156960

a-ys deleted the neo_vllm_fixes branch June 18, 2024 20:50
