Fix excessive CPU memory consumption on TGI startup #595

dacorvo · 2024-05-15T14:37:35Z

What does this PR do?

When a TGI instance is launched with a non-neuron model as its parameter, the model must be exported from cached neuron artifacts during container startup.

Before this change, the export made no attempt to minimize CPU memory usage, which made this kind of "on-the-fly" export impossible on the smaller ml.inf2.xlarge instances.

michaelbenayoun · 2024-05-16T14:55:07Z

text-generation-inference/server/text_generation_server/model.py

@@ -91,7 +91,7 @@ def fetch_model(
        # Prefetch the neuron model from the Hub
        logger.info(f"Fetching revision [{revision}] for neuron model {model_id} under {HF_HUB_CACHE}")
        log_cache_size()
-        return snapshot_download(model_id, revision=revision)
+        return snapshot_download(model_id, revision=revision, ignore_patterns="*.bin")
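
For readers unfamiliar with `ignore_patterns`: it takes one or more glob-style patterns, and `snapshot_download` skips any repository file whose name matches. The sketch below only imitates that filtering with the standard library's `fnmatch` so the effect of `"*.bin"` can be checked offline. The helper `files_to_download` and the file names are hypothetical, chosen for illustration, and this is not the code `huggingface_hub` actually runs.

```python
from fnmatch import fnmatch


def files_to_download(repo_files, ignore_patterns):
    """Return the files that survive glob-style ignore patterns.

    Hypothetical helper: an approximation of the filtering that
    ignore_patterns requests, not the library's implementation.
    """
    # Accept a single pattern string or a list of patterns.
    if isinstance(ignore_patterns, str):
        ignore_patterns = [ignore_patterns]
    return [
        name
        for name in repo_files
        if not any(fnmatch(name, pattern) for pattern in ignore_patterns)
    ]


# Hypothetical repository listing, for illustration only.
repo_files = [
    "config.json",
    "tokenizer.json",
    "model.safetensors",
    "pytorch_model.bin",
]

kept = files_to_download(repo_files, "*.bin")
print(kept)
```

With the pattern from the diff, only the `.bin` file is dropped from this listing; the other files are still fetched.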


fix(tgi): reduce CPU memory when loading model

168ca00

dacorvo requested review from philschmid and michaelbenayoun May 15, 2024 14:40

dacorvo marked this pull request as ready for review May 15, 2024 14:40

fix(tgi): avoid loading model when fetching weights

fe71226

dacorvo force-pushed the launch_tgi_xlarge branch from ac88ffb to fe71226 Compare May 15, 2024 16:01

michaelbenayoun approved these changes May 16, 2024


dacorvo merged commit 7ad464d into main May 16, 2024
1 check passed

dacorvo deleted the launch_tgi_xlarge branch May 16, 2024 16:07
