Support DeepSeek Coder Model #278
The reason this happens is that their repository doesn't have a special tokens map file. I added it here: https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-base/discussions/1. You should be able to clone that branch and then load the tokenizer from {:local, "/path/to/local/model"}, and the generation will work. The special tokens (padding, eos, bos, etc.) are defined in that special tokens map file.
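A minimal sketch of that flow, with placeholder paths (the git commands are just one way to check out the PR branch):

# Assuming the repo has been cloned and the PR branch checked out, e.g.:
#   git clone https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-base /path/to/local/model
#   cd /path/to/local/model && git fetch origin refs/pr/1 && git checkout FETCH_HEAD
{:ok, tokenizer} = Bumblebee.load_tokenizer({:local, "/path/to/local/model"})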
@seanmor5 apparently they want to migrate off of these files.
I will revisit the loading logic later. @jonastemplestein, meanwhile you can load the tokenizer from the PR commit directly.
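Roughly like this; the revision hash here is the one that appears in the Livebook snippet further down the thread, so treat it as illustrative:

{:ok, tokenizer} =
  Bumblebee.load_tokenizer(
    {:hf, "deepseek-ai/deepseek-coder-1.3b-base",
     revision: "7abe797c62ede1b47c4d00a17ed006be9659d657"}
  )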
Thank you so much, you two! <3
Hey folks, I've just had a chance to play with this and keep getting garbage output from the model compared to what I get using the HF transformers library. For example, running this in Livebook:
# setup cell
Mix.install(
[
{:kino_bumblebee, "~> 0.4.0"},
{:exla, ">= 0.0.0"}
],
config: [nx: [default_backend: EXLA.Backend]]
)
# subsequent code cell
repo = {:hf, "deepseek-ai/deepseek-coder-1.3b-base", revision: "7abe797c62ede1b47c4d00a17ed006be9659d657"}
{:ok, model_info} = Bumblebee.load_model(repo, backend: {EXLA.Backend, client: :host})
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)
serving =
Bumblebee.Text.generation(model_info, tokenizer, generation_config,
compile: [batch_size: 1, sequence_length: 1028],
stream: false,
defn_options: [compiler: EXLA, lazy_transfers: :never]
)
# Should be supervised
Kino.start_child({Nx.Serving, name: Deepseek, serving: serving})
prompt = "<|fim▁begin|>def quick_sort(arr):\n <|fim▁hole|> \n <|fim▁end|> "
Nx.Serving.batched_run(Deepseek, prompt)

results in this output:
In comparison, running the same model in HF transformers in Python gives me something completely different and sensible:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base", trust_remote_code=True).cuda()
input_text = "<|fim▁begin|>def quick_sort(arr):\n <|fim▁hole|> \n <|fim▁end|> "
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=328)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):])

Output:
Any ideas how to debug this further? 🤔 Any help is much appreciated. I'm on an M2 Pro in case that is relevant, but I also saw the same behaviour when I briefly ran this on an A100 with CUDA.
Also, a slightly related question: do you think I can use the fine-tuning process described here to fine-tune this model directly from Livebook? Can you think of any reasons why that might not work? I think it'd be a very cool demo to train an Elixir code completion model for Livebook, in Livebook :)
@jonastemplestein I tracked down the output difference and it turns out this checkpoint uses scaling for rotary embedding, which we don't have yet. I inlined the scaling just to check and the output looks the same, except there are a bunch of newlines at the end (which makes the generation run for way too long). I will send a PR once I figure it out and get everything in sync :)
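For anyone curious, a rough sketch of the idea behind linear rotary-embedding scaling; this is not Bumblebee's implementation, just an Nx illustration where position indices are divided by a scaling factor before the sin/cos tables are built:

defmodule RotaryScalingSketch do
  # Illustrative only: builds cos/sin angle tables with linear position scaling.
  def angle_tables(sequence_length, head_dim, scaling_factor \\ 4.0, base \\ 10_000) do
    # One inverse frequency per pair of rotary dimensions
    inv_freq =
      Nx.divide(1.0, Nx.pow(base, Nx.divide(Nx.multiply(Nx.iota({div(head_dim, 2)}), 2), head_dim)))

    # Linear scaling: stretch the position ids by the scaling factor
    positions = Nx.divide(Nx.iota({sequence_length}), scaling_factor)

    # Outer product gives the angle for every (position, frequency) pair
    angles = Nx.outer(positions, inv_freq)
    {Nx.cos(angles), Nx.sin(angles)}
  end
end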
Amazing, thank you so much! Regarding the excessive newlines, I've found that one way to reduce that in Python land is to set a higher "repeat penalty" for token generation. Is there such a parameter in Bumblebee? Along similar lines, are there any plans to support a temperature parameter? Or can I somehow simulate the way temperature works using the sampling strategy?
@jonastemplestein with #285 it should match the Python implementation. You also need to load the tokenizer from this revision until the upstream PR is merged:

{:ok, tokenizer} =
  Bumblebee.load_tokenizer(
    {:hf, "deepseek-ai/deepseek-coder-1.3b-base",
     revision: "e94f2b11bc28abbd67ecadfaad058c30b24a589f"}
  )

We don't have a repetition penalty. We don't have temperature either, but supporting that is trivial; I will open up an issue and implement it soon.
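In the meantime, sampling can be turned on through the generation config; a sketch, assuming the option names from the GenerationConfig docs of that Bumblebee version (double-check them against your installed release):

# Cap generation length and switch from greedy decoding to multinomial
# sampling, which usually tames degenerate repetition somewhat.
generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 128,
    strategy: %{type: :multinomial_sampling}
  )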
Amazing, thank you so much! If you'd like, you can see the model in action in this Livebook instance, which has rudimentary code completion: https://livebookjonas.fly.dev/ (password: elixircopilot). The Livebook instance powers down after some inactivity and then takes 30 seconds or so to come back up. If it has just booted, you can open the starred notebook called Livebook Copilot Playground to play around. Two questions:
Thanks again for your help with this, super amazing <3
Done. Note that the tokenizer should be the same, so you can load the tokenizer as above and the model from any other checkpoint :)
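Concretely, something like this, where the instruct checkpoint is just an example of "any other checkpoint":

# Tokenizer pinned to the fixed revision, model weights from a different
# DeepSeek Coder checkpoint (the repo name here is only an example).
{:ok, tokenizer} =
  Bumblebee.load_tokenizer(
    {:hf, "deepseek-ai/deepseek-coder-1.3b-base",
     revision: "e94f2b11bc28abbd67ecadfaad058c30b24a589f"}
  )

{:ok, model_info} =
  Bumblebee.load_model({:hf, "deepseek-ai/deepseek-coder-1.3b-instruct"},
    backend: {EXLA.Backend, client: :host}
  )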
That's how text completion models usually work: we pass the input sequence and they generate tokens one by one, which we effectively keep appending to the input sequence for subsequent inferences. There are also encoder-decoder models, where the input first goes through an encoder and is treated as the context for generation, separate from the generation sequence (e.g. BART). That said, we should have an option to strip the tokens, tracked by #247 :)
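Until #247 lands, one workaround is to strip the prompt on the caller side; a sketch, assuming the result shape of the text generation serving (which may differ across Bumblebee versions):

# The generated text currently includes the prompt, so drop it manually.
%{results: [%{text: text}]} = Nx.Serving.batched_run(Deepseek, prompt)
completion = String.replace_prefix(text, prompt, "")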
Hey folks, I'm trying to use the deepseek-coder-1.3b-base model with bumblebee. I was delighted to find that the model, tokenizer and generation_config all load. But when trying to run inference I get the following error that's a bit hard for me to debug:
I'm using bumblebee 0.4.2
Here's the model spec
And here's the tokenizer
It looks like the vocab size is not correct in the model spec, for example.
I think the tokenizer uses the correct vocabulary, because I can run this:
and it correctly returns <|fim▁hole|>, which is a DeepSeek-specific token.
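(Something along these lines; a hypothetical reconstruction rather than the exact snippet used:)

# Hypothetical check: tokenize the FIM marker and decode the ids back
# to confirm the tokenizer knows the special token.
inputs = Bumblebee.apply_tokenizer(tokenizer, "<|fim▁hole|>")
Bumblebee.Tokenizer.decode(tokenizer, inputs["input_ids"])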
It would be amazing if this model were supported, as deepseek-coder actually seems to be pretty good at Elixir out of the box 🙇
Thank you so much for your help!