Running with OpenLlama takes forever #98
Comments
Thanks for reporting this. Can you experiment with different chunksize values?
Thanks for responding. I experimented with different chunksize values (n=1, 40, 1000), but it doesn't make any substantial difference.
This seems wrong; I will need to investigate a bit. Does it work correctly, apart from the slow performance? OpenLlama should be affected by #95.
I found that the tokenizer implementation used by the openlm/llama-X models seems to be outdated/faulty (see huggingface/transformers#23671 (comment)). When loaded via AutoTokenizer.from_pretrained, it seems to take forever. The HF folks advise using the tokenizer of …
I can't test it myself right now, but this may fix it. I will try to test this soon.
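For reference, the slow path can be reproduced with plain transformers along these lines (the checkpoint name below is an assumption, not necessarily the one used here):

```python
from transformers import AutoTokenizer

# Loading the auto-converted fast tokenizer for an OpenLlama checkpoint triggers
# a slow SentencePiece -> fast-tokenizer conversion and can hang for a very long
# time on the affected checkpoints.
tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b")
```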
I checked this in the meantime. Unfortunately, it looks like this does not work yet with the current implementation. This means that to address it, we first need to make progress with #95 to add compatibility with the LlamaTokenizer(Fast) implementation in HF.
On branch …, for OpenLlama you can use …. I will continue to monitor this. Hopefully, OpenLlama will soon get a working LlamaTokenizerFast implementation, so we can put this issue to rest.
Okay, on the updated branch the following seems to work great with OpenLlama:
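The gist (the checkpoint name below is an assumption, and the exact call on the branch may differ) is to avoid the auto-converted fast tokenizer and use the slow SentencePiece-based one instead:

```python
from transformers import AutoTokenizer, LlamaTokenizer

# Either load the slow LlamaTokenizer directly ...
tokenizer = LlamaTokenizer.from_pretrained("openlm-research/open_llama_7b")

# ... or keep AutoTokenizer but skip the fast-tokenizer conversion.
tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b", use_fast=False)
```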
Found this in openlm-research/open_llama#40.
This fix now also works on the latest 0.0.6.5 version of LMQL. Closing this here, since the remaining fix will need to happen on the OpenLlama side of things. As of now, it does not seem to be merged. At the same time, from reading their bug tracker, a couple of fixes on the HF side will also soon ship as part of an upcoming transformers release.
I have the OpenLlama weights locally, and I am serving the model with:
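Roughly the following (the local path and the --cuda flag are placeholders, not my exact command):

```bash
lmql serve-model /path/to/open_llama_7b --cuda
```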
and I am running:
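Roughly a query of this shape (the prompt, constraint, and model reference are placeholders):

```lmql
argmax
    "Q: What is the capital of France?\n"
    "A: [ANSWER]"
from
    "openlm-research/open_llama_7b"
where
    STOPS_AT(ANSWER, "\n")
```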
I am running inference on 3 GPUs; however, it takes ages to get the expected output.
(When I use transformers, it takes only a few seconds.)
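For comparison, the transformers-only baseline I time is roughly this (the path and prompt are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, LlamaTokenizer

path = "/path/to/open_llama_7b"  # placeholder local path
tokenizer = LlamaTokenizer.from_pretrained(path)  # slow tokenizer, loads quickly
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Q: What is the capital of France?\nA:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```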
Any idea what is causing all this delay?