
TGI: optimize continuous batching and improve export #506

Merged: 5 commits merged into main from optimize_tgi_continuous_batching on Mar 6, 2024

Conversation

@dacorvo (Collaborator) commented on Mar 6, 2024

What does this PR do?

  1. This PR first modifies the TGI continuous batching implementation to take advantage of the transformers-neuronx implementation.

Instead of dropping the KV cache when adding new requests and rebuilding it from the cached texts, we simply omit the pending requests when calling model.forward, specifying only the indices of the new requests to prefill (see the sketch after this list).

A Llama TGI unit test is specifically added to verify that the results are still correct after this change (for Llama and Mistral, transformers-neuronx continuous batching is always on).

  2. For SageMaker deployment, disk usage logs are added when fetching/exporting a model.

  3. During export, the model generation config is fetched to provide default generation values.
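
To make the scheduling change concrete, here is a minimal, hypothetical sketch of the idea: pending requests keep their KV cache slots and only the newly added requests are prefilled, by passing their slot indices to the forward call. The ContinuousBatcher class and the model.prefill/model.decode methods are illustrative assumptions, not the actual optimum-neuron or transformers-neuronx API.

class ContinuousBatcher:
    def __init__(self, model, batch_size):
        self.model = model
        self.slots = [None] * batch_size  # one KV cache slot per batch position

    def add_requests(self, new_requests):
        # Assign each new request to a free slot; the slots of pending (decoding)
        # requests are left untouched, so their KV cache is preserved.
        new_indices = []
        for request in new_requests:
            index = self.slots.index(None)  # raises ValueError if the batch is full
            self.slots[index] = request
            new_indices.append(index)
        # Prefill only the new requests instead of rebuilding the whole cache
        # from the previously cached texts (hypothetical call signature).
        self.model.prefill(
            input_ids=[r.input_ids for r in new_requests],
            slot_indices=new_indices,
        )

    def step(self):
        # Decode one token for every active slot, new and pending alike.
        active = [i for i, slot in enumerate(self.slots) if slot is not None]
        return self.model.decode(slot_indices=active)

For point 3, fetching the generation config could rely on the standard transformers API shown below; how the PR actually wires it into the export is an assumption here.

from transformers import GenerationConfig

try:
    # Fetch the model's generation config from the hub so that values such as
    # eos_token_id or max_length can be used as export-time defaults.
    generation_config = GenerationConfig.from_pretrained(model_id)
except OSError:
    # The model may not ship a generation_config.json.
    generation_config = None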

@dacorvo dacorvo marked this pull request as ready for review March 6, 2024 10:33
@dacorvo dacorvo force-pushed the optimize_tgi_continuous_batching branch from 04d0948 to c987db5 Compare March 6, 2024 10:35
@dacorvo dacorvo force-pushed the optimize_tgi_continuous_batching branch from 7c4ee46 to 121f197 Compare March 6, 2024 13:05
@dacorvo dacorvo changed the title Optimize TGI continuous batching TGI: optimize continuous batching and improve export Mar 6, 2024
@JingyaHuang (Collaborator) left a comment:
LGTM, just left a question.

start = time.time()
# Export the model to Neuron (compilation happens here, so this step can take a while).
model = NeuronModelForCausalLM.from_pretrained(model_id, export=True, **export_kwargs)
end = time.time()
logger.info(f"Model successfully exported in {end - start:.2f} s.")
logger.info(f"Saving exported model to local storage under {export_path}.")
# Log the disk usage of the cache directory to help diagnose space issues.
log_cache_size()
@JingyaHuang (Collaborator) commented:
Will there be anything cached in HF_HUB_CACHE? I thought there would only be a cache in neuron_cache_path (/var/tmp/neuron-compile-cache) after a fresh export.

@dacorvo (Collaborator, Author) replied:

We need some space to expand the checkpoints and store the cached artifacts when we export a hub model.
Whenever we use the hub client library to fetch something (when calling from_pretrained), it ends up under HF_HUB_CACHE, so that location is supposed to be writable.
I therefore use the same destination when exporting.
When we deploy our own containers, we mount a writable /data volume.
With SageMaker, we only have a writable /tmp volume of 30G that can be expanded to 512G.
This is why I added those logs: to identify issues when exporting the models.
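
For reference, a disk-usage log along the lines of the log_cache_size() call in the snippet above could look like the sketch below; this is a minimal, hypothetical version and may differ from the helper actually added in the PR.

import logging
import os
import shutil

logger = logging.getLogger(__name__)

def log_cache_size():
    # HF_HUB_CACHE defaults to ~/.cache/huggingface/hub when the variable is not set.
    cache_path = os.environ.get(
        "HF_HUB_CACHE", os.path.expanduser("~/.cache/huggingface/hub")
    )
    if os.path.isdir(cache_path):
        usage = shutil.disk_usage(cache_path)
        gib = 1024**3
        logger.info(
            f"Cache disk [{cache_path}]: total = {usage.total / gib:.2f} G, "
            f"free = {usage.free / gib:.2f} G"
        )
    else:
        logger.warning(f"The cache directory ({cache_path}) does not exist.")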

@dacorvo dacorvo merged commit 8f84127 into main Mar 6, 2024
1 check passed
@dacorvo dacorvo deleted the optimize_tgi_continuous_batching branch March 6, 2024 14:44