Running with OpenLlama takes forever #98

Closed
1hachem opened this issue Jun 18, 2023 · 8 comments

Comments

1hachem commented Jun 18, 2023

I have the OpenLlama weights locally. I am serving the model with:

lmql serve-model /mnt/nvme/openllama/7B --cuda --port 9999

and I am running:

import asyncio
import lmql


@lmql.query
async def greet(term):
    '''
    argmax
        """Greet {term}:
        Hello [WHO]
        """
    from
        lmql.model("/mnt/nvme/openllama/7B", endpoint="localhost:9999")
    where
        len(TOKENS(WHO)) < 1
    '''


output = asyncio.run(greet("Earth"))
print(output)

I am running inference on 3 GPUs; however, it takes ages to get the expected output.
(When I use transformers directly, it takes only a few seconds.)

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer


class Llama:
    def __init__(self, model_path: str = "/mnt/nvme/openllama/7B") -> None:
        super().__init__()
        self.model_path = model_path
        self.tokenizer = LlamaTokenizer.from_pretrained(model_path)
        self.model = LlamaForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map="auto",
        )

    def __call__(self, prompt: str) -> str:
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        generation_output = self.model.generate(
            input_ids=input_ids,
            max_new_tokens=15,
            temperature=0.0,
        )
        output = self.tokenizer.decode(generation_output[0], skip_special_tokens=True)
        output = output.replace(prompt, "")  # strip the prompt, i.e. return only the completion

        return output
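
For reference, calling this baseline finishes within seconds, e.g. (the prompt here is just illustrative):

llama = Llama()
print(llama("Greet Earth:\nHello"))  # illustrative prompt, mirrors the LMQL query above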

Any idea what is causing all this delay?

lbeurerkellner (Collaborator) commented Jun 18, 2023

Thanks for reporting this. Can you experiment with argmax(chunksize=<n>)? This may be what is causing the slowdowns.
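
For example, something like this (untested; it is just your query from above with only the decoder clause changed, and chunksize=30 is an illustrative value):

import lmql

# chunksize=30 is only an illustrative value; try a few different settings
@lmql.query
async def greet(term):
    '''
    argmax(chunksize=30)
        """Greet {term}:
        Hello [WHO]
        """
    from
        lmql.model("/mnt/nvme/openllama/7B", endpoint="localhost:9999")
    where
        len(TOKENS(WHO)) < 1
    '''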

1hachem (Author) commented Jun 18, 2023

Thanks for responding. I experimented with different chunksize values (n=1, 40, 1000); however, it doesn't make any substantial difference.

lbeurerkellner (Collaborator) commented Jun 18, 2023

This seems wrong; I will need to investigate a bit. Does it work correctly, apart from the performance issue? OpenLlama should be affected by #95.

lbeurerkellner (Collaborator)

I found that the tokenizer implementation used by the openlm/llama-X models seems to be outdated/faulty (see huggingface/transformers#23671 (comment)). When loaded via AutoTokenizer.from_pretrained, it seems to take forever.

The HF folks advise using the tokenizer of huggyllama/llama-7b instead. If the OpenLM models use the same tokenization (which I am not sure of), you can switch to a different tokenizer implementation via, e.g.:

lmql.model("local:openlm-research/open_llama_7b", tokenizer="huggyllama/llama-7b")

I can't test it myself right now, but this may fix it. I will try to test it soon.

lbeurerkellner (Collaborator)

I checked this in the meantime. Unfortunately, it looks like huggyllama/llama-7b and the openlm-research models do not use the same tokenization, i.e. lmql.model("openlm-research/open_llama_3b", tokenizer="huggyllama/llama-7b") is not a valid combination.

This means that, to address this, we first need to make progress on #95 to add compatibility with the LlamaTokenizer(Fast) implementation in HF.

lbeurerkellner (Collaborator) commented Jun 28, 2023

On the llama-tokenizer branch (install via pip install git+https://github.com/eth-sri/lmql@llama-tokenizer), there is now a working integration for LlamaTokenizerFast.

For OpenLlama, you can use lmql.model('openlm-research/open_llama_7b', use_fast=False) on that branch. However, constraining may not work properly yet, as it seems some pending changes in OpenLlama and HF are still required for this (e.g. openlm-research/open_llama#40).
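
As an untested sketch, the query from the original report, adapted to that branch, would look roughly like this (query body and constraint copied verbatim from above):

import lmql

# untested sketch: requires the llama-tokenizer branch; loads the slow (non-fast) tokenizer
@lmql.query
async def greet(term):
    '''
    argmax
        """Greet {term}:
        Hello [WHO]
        """
    from
        lmql.model("openlm-research/open_llama_7b", use_fast=False)
    where
        len(TOKENS(WHO)) < 1
    '''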

I will continue to monitor this. Hopefully, OpenLlama will soon get a working LlamaTokenizerFast implementation, so we can put this issue to rest.

lbeurerkellner (Collaborator)

Okay, on the updated branch the following seems to work great with OpenLlama:

lmql.model("openlm-research/open_llama_3b", tokenizer="danielhanchen/open_llama_3b")

I found this in openlm-research/open_llama#40. danielhanchen/open_llama_3b includes a fixed version of the 'fast' OpenLlama tokenizer. Hopefully, this will be merged into openlm-research/open_llama_3b on the Hub soon.
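
For completeness, an untested end-to-end sketch with that combination, adapted from the query in the original report (only the model spec is changed; the constraint is copied verbatim):

import asyncio
import lmql

# query body and constraint taken from the original report; only the model spec differs
@lmql.query
async def greet(term):
    '''
    argmax
        """Greet {term}:
        Hello [WHO]
        """
    from
        lmql.model("openlm-research/open_llama_3b", tokenizer="danielhanchen/open_llama_3b")
    where
        len(TOKENS(WHO)) < 1
    '''


output = asyncio.run(greet("Earth"))
print(output)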

lbeurerkellner (Collaborator)

This fix now also works on the latest 0.0.6.5 version of LMQL. Closing this here, since the remaining fix needs to happen on the OpenLlama side of things; as of now, it does not seem to be merged. At the same time, from reading their bug tracker, a couple of fixes on the HF side will also ship soon as part of transformers, which will benefit LMQL as well.
