[Neo] Fix Neo Quantization properties output. Add some additional configuration. #2077
Description
Neo serving.properties output
Currently, the Neo Quantization script always quantizes at `tensor_parallel_degree=8` and outputs `tensor_parallel_degree=8` in serving.properties. This is often not compatible with serving, so we will avoid outputting this value.

Specifically, small AWQ-quantized models like Llama-2-7b cannot be served with tp=8, because intermediate_size / tp_degree must be divisible by the quantization group size (128). In this case, intermediate_size after quantization is 5632, so the valid tp_degrees are 1, 2, and 4.
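The divisibility constraint above can be checked with a few lines of Python. This is a minimal sketch, not code from this PR: the function name `valid_tp_degrees` and the candidate list `(1, 2, 4, 8)` are assumptions chosen for illustration, while the group size of 128 and the intermediate_size of 5632 come from the description.

```python
def valid_tp_degrees(intermediate_size, group_size=128, candidates=(1, 2, 4, 8)):
    """Return the candidate TP degrees whose per-shard width is a whole
    number of quantization groups (hypothetical helper for illustration)."""
    valid = []
    for tp in candidates:
        # The dimension must split evenly across the shards.
        if intermediate_size % tp != 0:
            continue
        # Each shard must contain a whole number of quantization groups.
        if (intermediate_size // tp) % group_size == 0:
            valid.append(tp)
    return valid


print(valid_tp_degrees(5632))  # [1, 2, 4]
```

With tp=8 each shard would be 704 wide, which is 5.5 groups of 128, so that degree is rejected.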
New behavior: Neo still quantizes with `tensor_parallel_degree=8`, but the output will depend on customer input to Neo.
- `tensor_parallel_degree` in serving.properties or through the environment variable (but not both): `tensor_parallel_degree` will be passed through to the output.
- `tensor_parallel_degree` in serving.properties AND the environment variable: `tensor_parallel_degree` will be passed through to the output.
- Otherwise: `tensor_parallel_degree` will not be included in the output serving.properties. The customer can update serving.properties manually, or pass an environment variable during serving.

Neo environment variables updates
We will accept `SM_NEO_HF_CACHE_DIR` as the quantization dataset cache directory for forward compatibility, in case future containers have both a compilation cache dir and an HF/datasets cache dir.
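One way such a lookup could work is sketched below. This is an assumption-laden illustration, not the PR's implementation: the helper name `resolve_dataset_cache_dir` is hypothetical, and the fallback variable name `SM_NEO_CACHE_DIR` is assumed here to stand in for the existing compilation cache directory; only `SM_NEO_HF_CACHE_DIR` is named in the description.

```python
import os


def resolve_dataset_cache_dir(env=None):
    """Pick the quantization dataset cache directory (hypothetical helper).

    Prefers SM_NEO_HF_CACHE_DIR; falls back to SM_NEO_CACHE_DIR, which is
    an ASSUMED name for the compilation cache directory. Returns None if
    neither variable is set.
    """
    env = os.environ if env is None else env
    return env.get("SM_NEO_HF_CACHE_DIR") or env.get("SM_NEO_CACHE_DIR")


# Example with an explicit mapping instead of the real environment.
print(resolve_dataset_cache_dir({"SM_NEO_HF_CACHE_DIR": "/opt/ml/hf_cache"}))
```

Passing the environment as a parameter keeps the helper easy to test without mutating `os.environ`.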