
[Neo] Fix Neo Quantization properties output. Add some additional configuration. #2077

Merged · 3 commits merged into deepjavalibrary:master from neo_vllm_fixes on Jun 17, 2024

Conversation

a-ys (Contributor) commented Jun 17, 2024

Description

Neo serving.properties output

Currently, the Neo Quantization script always quantizes at tensor_parallel_degree=8 and writes tensor_parallel_degree=8 to serving.properties. This value is often incompatible with serving, so we will avoid outputting it.

Specifically, AWQ-quantized small models like Llama-2-7b cannot be served with tp=8, because intermediate_size / tp_degree must be divisible by the quantization group size (128). In this case, intermediate_size after quantization is 5632, so the valid tp_degrees are 1, 2, and 4.
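
As a quick illustration of the constraint (a minimal sketch; the function name and bounds are ours, not from this PR):

```python
def valid_tp_degrees(intermediate_size: int, group_size: int = 128, max_tp: int = 8):
    """Return the tensor-parallel degrees for which each shard's slice of
    intermediate_size is still a whole multiple of the quantization group size."""
    return [tp for tp in range(1, max_tp + 1)
            if intermediate_size % tp == 0
            and (intermediate_size // tp) % group_size == 0]

print(valid_tp_degrees(5632))  # [1, 2, 4] -- matches the Llama-2-7b AWQ case above
```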

New behavior: Neo still quantizes with tensor_parallel_degree=8, but the value written to the output depends on the customer's input to Neo (see the sketch after this list):

  • If a customer passes tensor_parallel_degree in serving.properties or through the environment variable (but not both):
    • The provided tensor_parallel_degree is passed through to the output.
  • If a customer passes tensor_parallel_degree in serving.properties AND through the environment variable:
    • The environment variable's tensor_parallel_degree is passed through to the output.
  • If a customer passes neither:
    • tensor_parallel_degree is not included in the output serving.properties. The customer can update serving.properties manually, or pass an environment variable during serving.
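
A minimal sketch of this precedence logic (names are illustrative; in particular, OPTION_TENSOR_PARALLEL_DEGREE is our assumption for the environment-variable spelling, not confirmed by this PR):

```python
import os

def resolve_output_tp_degree(serving_properties: dict) -> str | None:
    """Pick the tensor_parallel_degree to write to the output serving.properties.

    Precedence: environment variable > serving.properties > omit entirely.
    """
    env_tp = os.environ.get("OPTION_TENSOR_PARALLEL_DEGREE")  # assumed env var name
    prop_tp = serving_properties.get("option.tensor_parallel_degree")
    return env_tp or prop_tp  # None means: leave the key out of the output

def write_output_properties(props: dict, out_path: str) -> None:
    out = {k: v for k, v in props.items() if k != "option.tensor_parallel_degree"}
    tp = resolve_output_tp_degree(props)
    if tp is not None:
        out["option.tensor_parallel_degree"] = tp  # only written when the customer set it
    with open(out_path, "w") as f:
        for k, v in out.items():
            f.write(f"{k}={v}\n")
```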

Neo environment variable updates

We will accept SM_NEO_HF_CACHE_DIR as the quantization dataset cache directory, for forward compatibility in case future containers have both a compilation cache directory and an HF/datasets cache directory.
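
A minimal sketch of how such a variable might be consumed (the wiring is ours; `cache_dir` is the standard Hugging Face datasets mechanism for redirecting the cache):

```python
import os
from datasets import load_dataset  # Hugging Face datasets library

def load_calibration_dataset(name: str, **kwargs):
    # When SM_NEO_HF_CACHE_DIR is set, use it as the datasets cache directory;
    # otherwise fall back to the library's default cache location.
    cache_dir = os.environ.get("SM_NEO_HF_CACHE_DIR")
    return load_dataset(name, cache_dir=cache_dir, **kwargs)
```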

@a-ys a-ys requested review from zachgk, frankfliu and a team as code owners June 17, 2024 22:29
@sindhuvahinis sindhuvahinis merged commit 8045ad3 into deepjavalibrary:master Jun 17, 2024
8 checks passed
sindhuvahinis pushed a commit to sindhuvahinis/djl-serving that referenced this pull request Jun 17, 2024
sindhuvahinis pushed a commit to sindhuvahinis/djl-serving that referenced this pull request Jun 18, 2024
@a-ys a-ys deleted the neo_vllm_fixes branch June 18, 2024 20:50