Running with OpenLlama takes forever #98

Closed
1hachem opened this issue Jun 18, 2023 · 8 comments

Comments

1hachem commented Jun 18, 2023

I have the OpenLlama weights locally. I am serving the model with:

lmql serve-model /mnt/nvme/openllama/7B --cuda --port 9999

and I am running:

import asyncio
import lmql


@lmql.query
async def greet(term):
    '''
    argmax
        """Greet {term}:
        Hello [WHO]
        """
    from
        lmql.model("/mnt/nvme/openllama/7B", endpoint="localhost:9999")
    where
        len(TOKENS(WHO)) < 1
    '''


output = asyncio.run(greet("Earth"))
print(output)

I am running inference on 3 GPUs; however, it takes ages to get the expected output.
(When I use transformers directly, it takes only a few seconds.)

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer


class Llama:
    def __init__(self, model_path: str = "/mnt/nvme/openllama/7B") -> None:
        super().__init__()
        self.model_path = model_path
        self.tokenizer = LlamaTokenizer.from_pretrained(model_path)
        self.model = LlamaForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map="auto",
        )

    def __call__(self, prompt: str) -> str:
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        generation_output = self.model.generate(
            input_ids=input_ids,
            max_new_tokens=15,
            temperature=0.0,
        )
        output = self.tokenizer.decode(generation_output[0], skip_special_tokens=True)
        output = output.replace(prompt, "")  # strip the prompt, i.e. return only the completion

        return output
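
For reference, calling this baseline finishes within seconds, e.g. (the prompt here is just illustrative):

llama = Llama()
print(llama("Greet Earth:\nHello"))  # illustrative prompt, mirrors the LMQL query above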

Any idea what is causing all this delay?

lbeurerkellner (Collaborator) commented Jun 18, 2023

Thanks for reporting this. Can you experiment with argmax(chunksize=<n>)? This may be what is causing the slowdowns.
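
For example, something like this (untested; it is just your query from above with only the decoder clause changed, and chunksize=30 is an illustrative value):

import lmql

# chunksize=30 is only an illustrative value; try a few different settings
@lmql.query
async def greet(term):
    '''
    argmax(chunksize=30)
        """Greet {term}:
        Hello [WHO]
        """
    from
        lmql.model("/mnt/nvme/openllama/7B", endpoint="localhost:9999")
    where
        len(TOKENS(WHO)) < 1
    '''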

1hachem (Author) commented Jun 18, 2023

Thanks for responding. I experimented with different chunksize values (n=1, 40, 1000); however, it doesn't make any substantial difference.

lbeurerkellner (Collaborator) commented Jun 18, 2023

This seems wrong; I will need to investigate a bit. Does it work correctly, apart from the performance issue? OpenLlama should be affected by #95.

lbeurerkellner (Collaborator)

I found that the tokenizer implementation used by the openlm/llama-X models seems to be outdated/faulty (see huggingface/transformers#23671 (comment)). When loaded via AutoTokenizer.from_pretrained, it seems to take forever.

The HF folks advise using the tokenizer of huggyllama/llama-7b instead. If the OpenLM models use the same tokenization (which I am not sure of), you can switch to a different tokenizer implementation via, e.g.:

lmql.model("local:openlm-research/open_llama_7b", tokenizer="huggyllama/llama-7b")

I can't test it myself right now, but this may fix it. I will try to test it soon.

lbeurerkellner (Collaborator)

I checked this in the meantime. Unfortunately, it looks like huggyllama/llama-7b and the openlm-research models do not use the same tokenization, i.e. lmql.model("openlm-research/open_llama_3b", tokenizer="huggyllama/llama-7b") is not a valid combination.

This means that, to address this, we first need to make progress on #95 to add compatibility with the LlamaTokenizer(Fast) implementation in HF.

lbeurerkellner (Collaborator) commented Jun 28, 2023

On the llama-tokenizer branch (install via pip install git+https://github.com/eth-sri/lmql@llama-tokenizer), there is now a working integration for LlamaTokenizerFast.

For OpenLlama, you can use lmql.model('openlm-research/open_llama_7b', use_fast=False) on that branch. However, constraining may not work properly yet, as it seems some pending changes in OpenLlama and HF are still required for this (e.g. openlm-research/open_llama#40).
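
As an untested sketch, the query from the original report, adapted to that branch, would look roughly like this (query body and constraint copied verbatim from above):

import lmql

# untested sketch: requires the llama-tokenizer branch; loads the slow (non-fast) tokenizer
@lmql.query
async def greet(term):
    '''
    argmax
        """Greet {term}:
        Hello [WHO]
        """
    from
        lmql.model("openlm-research/open_llama_7b", use_fast=False)
    where
        len(TOKENS(WHO)) < 1
    '''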

I will continue to monitor this. Hopefully, OpenLlama will soon get a working LlamaTokenizerFast implementation, so we can put this issue to rest.

lbeurerkellner (Collaborator)

Okay, on the updated branch the following seems to work great with OpenLlama:

lmql.model("openlm-research/open_llama_3b", tokenizer="danielhanchen/open_llama_3b")

I found this in openlm-research/open_llama#40. danielhanchen/open_llama_3b includes a fixed version of the 'fast' OpenLlama tokenizer. Hopefully, this will be merged into openlm-research/open_llama_3b on the Hub soon.
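
For completeness, an untested end-to-end sketch with that combination, adapted from the query in the original report (only the model spec is changed; the constraint is copied verbatim):

import asyncio
import lmql

# query body and constraint taken from the original report; only the model spec differs
@lmql.query
async def greet(term):
    '''
    argmax
        """Greet {term}:
        Hello [WHO]
        """
    from
        lmql.model("openlm-research/open_llama_3b", tokenizer="danielhanchen/open_llama_3b")
    where
        len(TOKENS(WHO)) < 1
    '''


output = asyncio.run(greet("Earth"))
print(output)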

lbeurerkellner (Collaborator)

This fix now also works on the latest 0.0.6.5 version of LMQL. Closing this here, since the remaining fix needs to happen on the OpenLlama side of things; as of now, it does not seem to be merged. At the same time, from reading their bug tracker, a couple of fixes on the HF side will also ship soon as part of transformers, which will benefit LMQL as well.
