Support llama_encode (WIP) #91

Merged · 3 commits merged into master on Jul 10, 2024

Conversation

@ngxson (Owner) commented on Jul 8, 2024

await wllama.loadModelFromUrl("https://huggingface.co/Felladrin/gguf-flan-t5-large/resolve/main/flan-t5-large.Q2_K.gguf", {
  n_ctx: 1024,
});

output = await wllama.createCompletion("translate English to French: How old are you?", {
  nPredict: 20,
  sampling: { temp: 0 },
});

// output:  Les âges de vous êtes-vous?
// expected: Vous avez quel âge ?

ngxson linked an issue on Jul 8, 2024 that may be closed by this pull request
@ngxson (Owner, Author) commented on Jul 9, 2024

The quantized model is not usable (it seems Flan-T5 requires a lot of precision).

FP16 (the answer is a bit more correct, but in French we never use "être" when asking someone's age):

[screenshot: FP16 output]

INT8 (the answer is wrong):

[screenshot: INT8 output]

@ngxson (Owner, Author) commented on Jul 10, 2024

@felladrin I still can't get reliable results, but it seems the problem comes from llama.cpp rather than from wllama.

This PR will be merged now.

ngxson marked this pull request as ready for review on Jul 10, 2024, 09:01.
ngxson merged commit 97db6f5 into master on Jul 10, 2024 (2 checks passed).
@felladrin (Contributor) commented
Thank you for implementing it, @ngxson!
I tested it with https://huggingface.co/Felladrin/gguf-MaxMini-Instruct-248M and it worked great!
Inference was considerably slower than with a 248M decoder-only model, but encoder-decoder models still have their uses!
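
For context, a minimal sketch of such a test, reusing the loadModelFromUrl / createCompletion API from the snippet in the PR description. The GGUF filename, n_ctx, prompt, and nPredict values here are placeholders and assumptions, not the exact settings used in the test above:

// Hypothetical quick check of an encoder-decoder GGUF with wllama.
// The exact .gguf filename inside the repo is not given in this thread, so it is a placeholder.
await wllama.loadModelFromUrl(
  "https://huggingface.co/Felladrin/gguf-MaxMini-Instruct-248M/resolve/main/<model-file>.gguf",
  { n_ctx: 1024 }, // assumed context size, mirroring the Flan-T5 example above
);

const output = await wllama.createCompletion("Summarize: wllama now supports encoder-decoder models via llama_encode.", {
  nPredict: 64,          // assumed output length
  sampling: { temp: 0 }, // greedy sampling, as in the PR description
});
console.log(output);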

@ngxson (Owner, Author) commented on Jul 15, 2024

@felladrin Thanks for the info. I'm not sure why it's significantly slower; it's probably something to be optimized upstream.

And yeah, I agree that encoder-decoder models are still useful. Personally, I've found that for more deterministic tasks like translation, they hallucinate less than decoder-only models.

Successfully merging this pull request may close these issues.

T5 and Flan-T5 models support (llama_encode)