TGI: optimize continuous batching and improve export #506
Conversation
Force-pushed from 04d0948 to c987db5, then from 7c4ee46 to 121f197.
LGTM, just left a question.
```python
import time

from optimum.neuron import NeuronModelForCausalLM

start = time.time()
model = NeuronModelForCausalLM.from_pretrained(model_id, export=True, **export_kwargs)
end = time.time()
logger.info(f"Model successfully exported in {end - start:.2f} s.")
logger.info(f"Saving exported model to local storage under {export_path}.")
log_cache_size()
```
Will there be anything cached in `HF_HUB_CACHE`? I thought there would only be a cache in `neuron_cache_path` (`/var/tmp/neuron-compile-cache`) after a fresh export.
We need some space to expand the checkpoints and store the cached artifacts when we export a hub model.
Whenever we use the hub client library to fetch something (when calling `from_pretrained`), it ends up under `HF_HUB_CACHE`, so that location is supposed to be writable. I therefore use the same destination when exporting.
When we deploy our own containers, we mount a writable `/data` volume. With SageMaker, we only have a writable `/tmp` volume of 30 GB that can be expanded to 512 GB.
This is why I added those logs: to identify disk-space issues when exporting models.
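For context, here is a minimal sketch of what a disk-usage helper like the `log_cache_size` call quoted above could look like; this implementation is an assumption, not necessarily what the PR ships.

```python
import logging
import os
import shutil

logger = logging.getLogger(__name__)

def log_cache_size():
    # Assumption: fall back to the same default location huggingface_hub
    # uses when HF_HUB_CACHE is unset.
    path = os.environ.get("HF_HUB_CACHE", os.path.expanduser("~/.cache/huggingface/hub"))
    if os.path.exists(path):
        # Filesystem-level usage for the volume holding the cache.
        usage = shutil.disk_usage(path)
        gib = 1024 ** 3
        logger.info(
            f"Disk usage for {path}: {usage.used / gib:.2f} GiB used, "
            f"{usage.free / gib:.2f} GiB free of {usage.total / gib:.2f} GiB."
        )
    else:
        logger.info(f"Cache path {path} does not exist yet.")
```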
What does this PR do?
- Optimizes continuous batching in the `transformers-neuronx` implementation: instead of dropping the KV cache when adding new requests and rebuilding it from cached texts, we simply omit the pending requests when calling `model.forward`, specifying only the indices of the new requests to prefill (see the first sketch below).
- Adds a llama TGI unit test that specifically verifies the results are still correct after that change (for Llama and Mistral, `transformers-neuronx` continuous batching is always on).
- For SageMaker deployment, adds some disk usage logs when fetching/exporting a model.
- During export, fetches the model generation config to provide default values (see the second sketch below).
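To make the batching change concrete, here is a schematic sketch of the new prefill path. Everything below (`Slot`, `Batch`, the `forward(input_ids=..., seq_ids=...)` signature) is an illustrative assumption, not the actual optimum-neuron or transformers-neuronx API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Slot:
    input_ids: List[int]      # prompt tokens for this request
    prefilled: bool = False   # whether its KV cache rows are filled

@dataclass
class Batch:
    slots: List[Slot] = field(default_factory=list)

def prefill(model, batch: Batch):
    # Old behaviour: drop the whole KV cache and re-prefill every slot from
    # its cached text. New behaviour: pass only the indices of the slots that
    # were just added, leaving the KV cache rows of running requests intact.
    new_indices = [i for i, slot in enumerate(batch.slots) if not slot.prefilled]
    input_ids = [batch.slots[i].input_ids for i in new_indices]
    # Hypothetical call: seq_ids tells the model which KV cache rows to fill.
    model.forward(input_ids=input_ids, seq_ids=new_indices)
    for i in new_indices:
        batch.slots[i].prefilled = True
```

The point is that prefill becomes incremental: running requests keep their KV cache entries, so adding a request no longer costs a full re-prefill of the batch.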
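And a minimal sketch of the generation-config fetch from the last point, reusing the `model_id` from the export snippet above; the exact integration point in the PR may differ.

```python
from transformers import GenerationConfig

# Fetch the hub generation config so its values (e.g. eos_token_id, top_k,
# temperature) can serve as defaults when a request does not override them.
generation_config = GenerationConfig.from_pretrained(model_id)
```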