Granite code support #1336

gabe-l-hart · 2024-10-31T18:48:37Z

Dependencies

This PR is part of a sequence in support of adding Granite Code. It depends on merging the following PRs:

Safetensors: Safetensors #1255
Bias tensors: Bias tensors #1259
Tied word embeddings: Tied word embeddings #1260
Tokenizers tokenizer: Tokenizers tokenizer #1261

Issues

Description

This PR adds support for Granite Code in 3B and 8B sizes. Given current limitations with the export of tokenizers, they will only work in the python environment with this PR.

Discussion

Usage

To test using these models, I did it both by running with the aliases and by running pointing directly at the checkpoint/tokenizer:

# Run with alias
python torchchat.py generate granite-code \
  --prompt "Write a python function to sort numbers and strings with numeric prefixes"

# Run with direct reference to artifacts
python torchchat.py generate \
  --prompt "Write a python function to sort numbers and strings with numeric prefixes" \
  --checkpoint-path $HOME/models/ibm-granite/granite-3b-code-instruct-128k/model.pth \
  --tokenizer-path $HOME/models/ibm-granite/granite-3b-code-instruct-128k/tokenizer.json \
  --params-path torchchat/model_params/Granite-3B-Code.json

Open Questions

There are several outstanding issues, beyond the upstream tokenizers PR, that need to be solved before this PR is ready for full review:

It seems that in chat mode, the models produce very unreliable results, sometimes generating a single token while other times generating a reasonable result but stopping mid-sentence before reaching the max token limit. My current hypothesis is that the chat template is not being used anywhere and we're therefore using the llama chat template automatically.
The 8B model currently produces garbage after a few tokens. The main difference between the 3B and 8B models, besides common parameter differences like number of layers and hidden size, is that the 8B uses grouped query attention. I've seen similar behavior in other frameworks where the model starts on a good track, then devolves into garbage and in those cases GQA was also at play, so I suspect it's something along these lines here as well.

pytorch-bot · 2024-10-31T18:48:41Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1336

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 6fb3b98 with merge base fd1857a ():

NEW FAILURE - The following job has failed:

Run the aoti runner with CUDA using stories / test-runner-aot-cuda / linux-job (gh)
RuntimeError: Command docker exec -t da8cb52302fb66401f5bcd7fd441d20ab1854ad51eda0b970fa4f11cd02f3df4 /exec failed with exit code 1

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / runner-aoti (16-core-ubuntu) (gh) (trunk failure)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

gabe-l-hart · 2024-10-31T19:01:01Z

Also, I used the following script to perform conversion of a pre-existing HF snapshot. It's similar to the if __name__ == "__main__" block in convert_hf_checkpoint.py:

convert_existing_checkpoint.py

#!/usr/bin/env python
"""
Simple script to convert an existing HF snapshot into torchchat format
"""

# Standard
import argparse
from pathlib import Path

# Local
from torchchat.cli.convert_hf_checkpoint import convert_hf_checkpoint, convert_hf_checkpoint_to_tune

def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("checkpoint_dir", help="Directory containing HF checkpoint")
    parser.add_argument("--name", "-n", default=None, help="Name to use for the model")
    parser.add_argument("--torchtune", "-t", action="store_true", default=False, help="Convert to torchtune format")
    args = parser.parse_args()
    if args.torchtune:
        convert_hf_checkpoint_to_tune(model_dir=Path(args.checkpoint_dir), model_name=args.name)
    else:
        convert_hf_checkpoint(model_dir=Path(args.checkpoint_dir), model_name=args.name)

if __name__ == "__main__":
    main()

gabe-l-hart · 2024-11-07T00:30:13Z

I confirmed that it was falling back to the llama2 chat formatter because it wasn't using tiktoken. I've added basic jinja2 chat template support when using the HF tokenizer.

mikekgfb · 2024-11-08T02:57:57Z

A pointer to this PR and the example commands from the PR description would make a good starting point for docs/new_model.md to (at least partially?) address #1038 / #1041 in conjunction with some explanatory text

# wget artifacts here
# Run with direct reference to artifacts
python torchchat.py generate \
  --prompt "Write a python function to sort numbers and strings with numeric prefixes" \
  --checkpoint-path $HOME/models/ibm-granite/granite-3b-code-instruct-128k/model.pth \
  --tokenizer-path $HOME/models/ibm-granite/granite-3b-code-instruct-128k/tokenizer.json \
  --params-path torchchat/model_params/Granite-3B-Code.json

Explain how to add to model list....

# Run with alias
python torchchat.py generate granite-code \
  --prompt "Write a python function to sort numbers and strings with numeric prefixes"

if added to .ci/scripts/run-docs new_model it might also make a testcase for the features used in granite.

gabe-l-hart · 2024-11-08T20:26:55Z

@Jack-Khuu I'm a bit stumped trying to get the 8B model working. I'm trying to mentally diff the Attention implementation in torchchat vs transformers to see if I can find anything that would indicate something behaving differently with Grouped Query Attention.

I'm not really following the different way that the torchchat version is manipulating the tensors for tensor parallel inference (need to do some background reading there), but this feels like it's got to be close to the root of the issue. The only other place that I could imagine things going wrong is in the unpacking of the unified wqkv here. Any insight you can offer would be much appreciated!

Results with 3B

?> python torchchat.py generate granite-code-3b --prompt "Write a python hello world function"
NumExpr defaulting to 16 threads.
PyTorch version 2.6.0.dev20241002 available.
lm_eval is not installed, GPTQ may not be usable
W1108 13:18:36.747000 52813 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
Using device=mps 
Loading model...
Time to load model: 3.86 seconds
-----------------------------------------------------------
Write a python hello world function

```python
def say_hello():
    print("hello world")
```

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                
Generated 19 tokens                 
Time for inference 1: 1.6639 sec total                 
Time to first token: 0.4289 sec with parallel prefill.                

      Total throughput: 12.0199 tokens/sec, 0.0832 s/token                 
First token throughput: 2.3316 tokens/sec, 0.4289 s/token                 
 Next token throughput: 15.3844 tokens/sec, 0.0650 s/token                     

Bandwidth achieved: 86.74 GB/s
*** This first iteration will include cold start effects for dynamic import, hardware caches. ***

========================================


      Average tokens/sec (total): 12.02                 
Average tokens/sec (first token): 2.33                 
Average tokens/sec (next tokens): 15.38

NOTE (because I feel compelled): The above snippet uses zero-width-spaces to escape the triple backticks inside the code blocks, so copy-paste at your own peril!

Results with 8B

?> python torchchat.py generate granite-code-8b -p "Write a python hello world function"
usage: torchchat [-h] {chat,generate,browser,export,download,list,remove,where,server,eval} ...
torchchat: error: unrecognized arguments: -p Write a python hello world function
(torchchat2) ghart@Mac [torchchat GraniteCodeSupport ?]$ python torchchat.py generate granite-code-8b --prompt "Write a python hello world function"
NumExpr defaulting to 16 threads.
PyTorch version 2.6.0.dev20241002 available.
lm_eval is not installed, GPTQ may not be usable
W1108 13:13:21.744000 51816 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
Using device=mps 
Loading model...
Time to load model: 11.67 seconds
-----------------------------------------------------------
Write a python hello world function function function function function function function function function function function

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                
Generated 11 tokens                 
Time for inference 1: 7.5729 sec total                 
Time to first token: 4.8976 sec with parallel prefill.                

      Total throughput: 1.5846 tokens/sec, 0.6311 s/token                 
First token throughput: 0.2042 tokens/sec, 4.8976 s/token                 
 Next token throughput: 4.1117 tokens/sec, 0.2432 s/token                     

Bandwidth achieved: 26.17 GB/s
*** This first iteration will include cold start effects for dynamic import, hardware caches. ***

========================================


      Average tokens/sec (total): 1.58                 
Average tokens/sec (first token): 0.20                 
Average tokens/sec (next tokens): 4.11

Jack-Khuu · 2024-11-09T01:50:54Z

Thanks for the details @gabe-l-hart

I'll try to give it a gander this weekend. It's weird that 3B works, but 8B doesn't. I assume they use the same template so that at least clears that part

byjlw · 2024-11-19T16:34:50Z

Looks like this has been open for a several weeks now.
Yeah, the template thing is super hacky right now and I knew it was going to hang up our ability to add new models.
In general we need to make a smoother path for new adding new models with different architectures, templates and storage locations.

It's been on @varunfb and @Jack-Khuu 's plate for a while but they've been swamped with other work.
Fortunately it's planning season and the design for this is on the list.
@gabe-l-hart would love to get your feedback on how best to support folks like yourself.

gabe-l-hart · 2024-11-19T16:52:06Z

Thanks @byjlw! I definitely understand juggling priorities. The path to adding new models in the model_params and model_config is relatively straightforward (could use a doc, but TBH I never read docs anyway, so easy-to-read code is always best). The real challenge has come up around places where the models differ from the llama series models. In particular, Granite Code uses the llama architecture, but uses several optional bits that the Meta Llama models don't (e.g. HF tokenizers, tied embeddings). Getting these pieces to work has been a decently steep learning curve (fun though!). I think the thing that would be most helpful would be some kind of compatibility matrix doc that shows architectures that have support, sub-features within architectures, and which "layers" they're supported in (e.g. python, c++, executorch). This would help a lot in figuring out where to dig in to add new model support.

For the specific issues for Granite Code, the place I'm a bit stuck is trying to figure out why the 8B model is flopping while the 3B model is working just fine. My gut is that it has something to do with the alternate attention mechanism in TC, but I'm not deeply versed in attention enough to spot it quickly. The only architectural difference between 3B and 8B is the use of grouped query attention, so it's either something there or there's some incompatibility between the attention implementations in transformers and TC that's only being exercised by the specific weights of the 8B. Any help and/or expert bug spotting would be much appreciated!

gabe-l-hart · 2024-11-20T17:21:58Z

I just rebased on main and it now looks like even the 3b model is producing only a single token as output in chat mode. Will try to get to the bottom of it.

mikekgfb · 2024-11-21T09:20:15Z

I just rebased on main and it now looks like even the 3b model is producing only a single token as output in chat mode. Will try to get to the bottom of it.

Have you tried bisecting the 3B fail? Even if the change was legit and necessary, the type of change that would break the 3B model might give insight in how to "fix" both the 3B and 8B models?
.

mikekgfb · 2024-11-21T09:25:08Z

The real challenge has come up around places where the models differ from the llama series models. In particular, Granite Code uses the llama architecture, but uses several optional bits that the Meta Llama models don't (e.g. HF tokenizers, tied embeddings). Getting these pieces to work has been a decently steep learning curve (fun though!).

I'm a bit surprised by this because chatgpt had this to say (understanding that I'm quoting chatgppt about an IBM model to an IBMer, so skating on seriously thin ice!!!):

what tokenization scheme does the ibm granite model use

Searched 4 sites
The IBM Granite models, including its base and instruction-tuned variants, utilize the Llama2 tokenizer for tokenization. This choice aligns with the models’ architectural similarity to Meta's Llama2 series, such as the Granite-7b model, which follows the Llama2-7B architecture and employs similar tokenization strategies. These tokenizers are designed to handle diverse data sources, including programming languages and natural language, ensuring compatibility and efficiency in tasks like code synthesis and language understanding

So in theory, SentencePiece should do the trick? Is it the pre and post processing with regexps? (I think I saw some discussion about regexps in one of your PRs or issues?)

In any event, it's cool that we have HF tokenizers because they are a proper superset of SentencePiece+TikToken. (I think @lessw2020 and @kwen2501 had also added some HF tokenizer support for distributed if I remember correctly?)

gabe-l-hart · 2024-11-21T17:10:17Z

Have you tried bisecting the 3B fail? Even if the change was legit and necessary, the type of change that would break the 3B model might give insight in how to "fix" both the 3B and 8B models?

That's on my todo list for my next chunk of uninterrupted dev time! I'm hoping that will be today.

I'm a bit surprised by this because chatgpt had this to say (understanding that I'm quoting chatgppt about an IBM model to an IBMer, so skating on seriously thin ice!!!):

Heh, as you know I'm sure, IBM is a big place, so I'm definitely doing a lot of learning myself in this space. My info from the models team is that we've been using the starcoder tokenizer up until now (including Granite Code and the Granite 3.0 series). When first trying to understand how best to support that in torchchat, I was missing a lot of knowledge about sentencepiece, so was working off of the tokenizer_config.json in HF. I suspect it would be possible to reverse-convert from tokenizers back to sentencepiece for this config, but I haven't done that work yet since I was already halfway down the rabbit hole of tokenizers support. We can certainly look into that as an alternative approach if the preference is to avoid the complexity of the c++ tokenizer buildout.

gabe-l-hart · 2024-11-21T23:07:12Z

@Jack-Khuu @mikekgfb @byjlw I figured out where the issues were coming from. It was two things:

The logic was always inserting a bos token at the beginning of the sequence which the 3b model was sometimes ok with, but the 8b never was
- To solve this, I added tokenizer_prepend_bos as a parameter in TransformerArgs and ModelArgs. It seemed a little klunky to plumb it through multiple blobs, but this got things working for both models with raw generation
The chat template logic was not robust beyond llama2 and llama3 templating. Solving this resulted in a fair bit of refactoring:
- Refactor the Llama2ChatFormatter and Llama3ChatFormatter to encapsulate all logic in a single abstract method encode_dialog_prompt
- Remove all formatter-specific logic from the primary generation loop in def chat
- Add the HFTokenizerChatFormatter
- Plumb the ability to use the chat template with jinja2 through HFTokenizer
  - NOTE: jinja2 was already a transitive dependency, so I just formalized it

To get to the bottom of all of this, I also tweaked the logging a bit. There was already a hard-coded logging config call in cli, so I just added the ability to parse the LOG_LEVEL env var to set it. I also added a fair number of additional log lines and uncommented some that were there but commented out.

NOTE: Many of the existing log lines were using f-strings which will cause the string to be interpolated regardless of whether the logger/level are enabled. I switched all of these to use lazy interpolation with percent-encoding so that it's safe to have them uncommented without a performance hit.

Finally, I was getting lost trying to make sure I didn't break anything in the chat templating, so I bit the bullet and added some basic unit tests. They only cover the chat formatting, but they're a place to start. I did not go any further with unit testing, including not adding pytest as a dependency or adding any CI steps to invoke the tests. If you're interested, I'd be happy to push on unit testing, but I didn't want to lump that conversation into this PR.

gabe-l-hart · 2024-11-21T23:09:20Z

install/requirements.txt

@@ -9,6 +9,10 @@ gguf
 # Tiktoken tokenizer for Llama 3 and other advanced models
 tiktoken

+# Tokenizers and jinja2 for other non-llama models that use HF tokenizers


I added these here, but did not add pytest (yet). I think there's a pending conversation about introducing optional dependency sets, so it would make sense to add a test or dev set at that point, but I didn't want to accidentally carry pytest along as a runtime dependency.

gabe-l-hart · 2024-11-21T23:10:46Z

tests/conftest.py

+import os
+import sys
+
+# Make sure tests can import torchchat


This would be a lot cleaner if we move to having a pyproject.toml or setup.py to bundle torchchat as a package that could be installed with pip install -e.

on the list

gabe-l-hart · 2024-11-21T23:12:04Z

torchchat/generate.py

        return tokens


-B_INST, E_INST = "[INST]", "[/INST]"


I moved these into the class as members to enforce the encapsulation that they should only be used in the context of this formatter

byjlw · 2024-11-23T00:03:01Z

Thanks @gabe-l-hart
Yeah a lot of this feedback resonates really well with me, and resolving it has already made it on our H1 roadmap such as making it easy to have model specific templates, adding test infra and guidelines around tests, abstracting and making the code more modular so that there is a specific module for core with well defined APIs that the CLI and API can use. We will also figure out the release strategy and publish two or three specific pip packages.

Will be able to share the details soon and will have them as RFCs on GH so everyone can comment and contribute.

byjlw

Longer term we want to implement the tokenizer the drop the dependency on HF tokenizer but this will get things going for now.

byjlw · 2024-11-25T17:12:29Z

tests/conftest.py

+import os
+import sys
+
+# Make sure tests can import torchchat


on the list

Jack-Khuu · 2024-12-09T23:36:38Z

Thanks for all the tests and logging!!!

Looks like the hf_tokenizer got caught in fff956c, so we'll need a tad of a rebase (we can keep the python code here for ease)

Jack-Khuu · 2024-12-10T18:33:16Z

Good(?) news, the rebase might be a no-op after backing out of the tokenizer migration: #1414

gabe-l-hart · 2024-12-10T18:34:56Z

Ah! I had a half-written comment asking for some guidance on that. Glad I waited. Will try the rebase now.

Jack-Khuu

This is looking amazing

Can you give python3 torchchat.py chat <> a quick check looks like there a little typo? (commented inline)

Jack-Khuu · 2024-12-11T00:30:41Z

torchchat/generate.py

+            if role == "system":
+                tokens += self.tokenizer.encode(f"{self.B_SYS}{content}{self.E_SYS}")
+                new_turn = False
+            elif role == "user":
+                tokens += self.tokenizer.encode(f"{content}{self.E_INST}")
+                new_turn = False
+            elif role == "assistant":
+                tokens += self.tokenizer.encode(f" {content} {self.tokenizer.eos}\n")
+                new_turn = True


Jack-Khuu · 2024-12-11T00:56:51Z

torchchat/generate.py

@@ -82,49 +121,83 @@ def encode_message(self, message) -> List[int]:
                        self.tokenizer.encode(content["text"], bos=False, eos=False)
                    )

-        tokens.append(self.tokenizer.special_tokens["<|eot_id|>"])
+        tokens.append(self.tokenizer.special_tokens["<|eot_id|>\n"])


New line looks fishy (will fail when we play with chat python3 torchchat.py chat llama3.2-1b)

Why do we need \n?

Interesting. It looks like there's some inconsistency in the prompt template documentation on whether or not there should be a \n after <|eot_id|>:

Llama 3: https://www.llama.com/docs/model-cards-and-prompt-formats/meta-llama-3/

It looks like there is a \n after <|eot_id|> when it's between assistant and user, but NOT always when it's between assistant and user (see the sample under System prompt and multiple turn...)

Llama 3.1: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/#-special-tokens-

In almost all samples, there is a \n after <|eot_id|>, but similar to llama 3, it looks like some times it is omitted for the final header tag when adding the final assistant header tag (See sample under Built in Python based tool calling)

Llama 3.2: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/

This one is also inconsistent on whether or not there should be a \n. In the example under With the above function definition,..., it's exactly the opposite pattern of the others where there isn't a \n between the system and user, but then there is between user and the generation prompt tag for assistant 🤷

All of this is to say, it seems that the documentation is inconsistent. I'm happy to yank it out to keep it consistent with how you had it.

Jack-Khuu · 2024-12-14T00:35:08Z

Kicked off a rebase; fyi has a few broken tests (awaiting pin bump) so those are expected

Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Use the right tokenizer_file name * Use the right transformer_params_key based on the file name in model_params * Use the updated name to indicate HF tokenizers Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Something isn't quite working with this model yet, but the config should be accurate at this point. Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

It was implicitly being pulled in via lm_eval -> transformers, but it's better to have it explicit since we use it directly Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

…HF tokenizers This is a much simplified version of the corresponding logic in transformers. I opted for this so that the full transformers dependency is not added here. CITE: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L1522 Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

This will allow the jinja2 templates for HF tokenizers to be applied without needing to hard-code the formatter logic. This will likely need to be duplicated in the embedded code version of chat. Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

It was getting pulled in implicitly via flask and lm_eval -> transformers, but better to have it explicit. Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

In generate, there were a number of commented-out log lines. These are safe to leave in as long as lazy string interpolation is used. Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

And disable it for Granite Code models Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

… in classes The formatted strings may not be perfectly 1:1 with the previous impl, but they should be in line with the official model guidelines: * https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3 * https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-2 Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

There's no formal execution framework for pytest yet, but these were helpful in ensuring that the formatting was working correctly! To run them, install pytest and run `pytest tests/` Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

There is an incompatibility with logging and torch._dynamo, so this disables it unless the developer asks for it explicitly. NOTE: The TC team has stated that they have holistic logging on the roadmap so this is a short-term solution pending a more robust approach. REF: https://github.com/pytorch/torchchat/actions/runs/11963066986/job/33493237302#step:14:3599 Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

There's inconsistency in the documentation on whether or not there should be a n after <|eot_id|>, but this maintains consistency with previous formatting Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Jack-Khuu · 2024-12-19T10:13:43Z

Fantastic, thanks for the patience and add!!

* feat(models): Add models.json blocks for Granite Code 3b and 8b Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Initial model params for granite code 3b Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(model config): Fix model configs for Granite Code * Use the right tokenizer_file name * Use the right transformer_params_key based on the file name in model_params * Use the updated name to indicate HF tokenizers Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(granite): Add model params for granite-code-8b Something isn't quite working with this model yet, but the config should be accurate at this point. Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(deps): Add tokenizers to the deps explicitly It was implicitly being pulled in via lm_eval -> transformers, but it's better to have it explicit since we use it directly Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(tokenizer): Add basic support for jinja2 template rendering for HF tokenizers This is a much simplified version of the corresponding logic in transformers. I opted for this so that the full transformers dependency is not added here. CITE: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L1522 Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(chat): Add HFTokenizerChatFormatter and use it for HF tokenizers This will allow the jinja2 templates for HF tokenizers to be applied without needing to hard-code the formatter logic. This will likely need to be duplicated in the embedded code version of chat. Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(deps): Add jinja2 as an explicit dep It was getting pulled in implicitly via flask and lm_eval -> transformers, but better to have it explicit. Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(log): Add env-based LOG_LEVEL config to CLI Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(log): Add better logging in model and generate In generate, there were a number of commented-out log lines. These are safe to leave in as long as lazy string interpolation is used. Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(generate): Make prepending BOS model-conigurable And disable it for Granite Code models Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(chat): Refactor chat template logic to encapsulate all formatting in classes The formatted strings may not be perfectly 1:1 with the previous impl, but they should be in line with the official model guidelines: * https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3 * https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-2 Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(chat): Fix small formatting bugs in llama3 chat formatter Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * test: Add initial unit tests for chat formatters There's no formal execution framework for pytest yet, but these were helpful in ensuring that the formatting was working correctly! To run them, install pytest and run `pytest tests/` Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(logging): Disable logging in generate unless set in the env There is an incompatibility with logging and torch._dynamo, so this disables it unless the developer asks for it explicitly. NOTE: The TC team has stated that they have holistic logging on the roadmap so this is a short-term solution pending a more robust approach. REF: https://github.com/pytorch/torchchat/actions/runs/11963066986/job/33493237302#step:14:3599 Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove trailing n from llama3 <|eot_id|> There's inconsistency in the documentation on whether or not there should be a n after <|eot_id|>, but this maintains consistency with previous formatting Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Jack-Khuu <jack.khuu.7@gmail.com>

facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 31, 2024

gabe-l-hart marked this pull request as ready for review November 5, 2024 16:28

mikekgfb mentioned this pull request Nov 5, 2024

Tokenizers tokenizer #1261

Merged

3 tasks

gabe-l-hart force-pushed the GraniteCodeSupport branch 2 times, most recently from daeeb79 to 19aa6c7 Compare November 7, 2024 00:29

gabe-l-hart force-pushed the GraniteCodeSupport branch from 02a1b47 to b59e840 Compare November 8, 2024 19:54

gabe-l-hart force-pushed the GraniteCodeSupport branch from 11102e7 to 1b6d63e Compare November 15, 2024 18:02

gabe-l-hart mentioned this pull request Nov 15, 2024

Tokenizers cpp 1251 #1379

Closed

1 task

gabe-l-hart force-pushed the GraniteCodeSupport branch from 1b6d63e to 5607fff Compare November 20, 2024 17:16

gabe-l-hart commented Nov 21, 2024

View reviewed changes

byjlw approved these changes Nov 25, 2024

View reviewed changes

tests/conftest.py

import os

import sys

# Make sure tests can import torchchat

Copy link

Contributor

byjlw Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the list

gabe-l-hart force-pushed the GraniteCodeSupport branch from 2041515 to 10918a1 Compare November 25, 2024 18:41

gabe-l-hart force-pushed the GraniteCodeSupport branch from 10918a1 to 4882899 Compare December 10, 2024 20:37

Jack-Khuu approved these changes Dec 11, 2024

View reviewed changes

Jack-Khuu mentioned this pull request Dec 11, 2024

Integrate distributed inference with chat/server #1381

Merged

gabe-l-hart force-pushed the GraniteCodeSupport branch from d624ed3 to 43c7408 Compare December 11, 2024 20:56

gabe-l-hart added 16 commits December 18, 2024 12:04

feat(models): Add models.json blocks for Granite Code 3b and 8b

e3c2849

Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

feat: Initial model params for granite code 3b

9d80e52

Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

fix(model config): Fix model configs for Granite Code

85057bc

* Use the right tokenizer_file name * Use the right transformer_params_key based on the file name in model_params * Use the updated name to indicate HF tokenizers Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

feat(granite): Add model params for granite-code-8b

1e3addc

Something isn't quite working with this model yet, but the config should be accurate at this point. Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

fix(deps): Add tokenizers to the deps explicitly

5d342fa

It was implicitly being pulled in via lm_eval -> transformers, but it's better to have it explicit since we use it directly Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

fix(deps): Add jinja2 as an explicit dep

38a649a

It was getting pulled in implicitly via flask and lm_eval -> transformers, but better to have it explicit. Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

feat(log): Add env-based LOG_LEVEL config to CLI

0b4f159

Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

feat(log): Add better logging in model and generate

526ce15

In generate, there were a number of commented-out log lines. These are safe to leave in as long as lazy string interpolation is used. Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

feat(generate): Make prepending BOS model-conigurable

c9f8a71

And disable it for Granite Code models Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

fix(chat): Fix small formatting bugs in llama3 chat formatter

8d26923

Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

fix: Remove trailing n from llama3 <|eot_id|>

78a3637

There's inconsistency in the documentation on whether or not there should be a n after <|eot_id|>, but this maintains consistency with previous formatting Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

gabe-l-hart force-pushed the GraniteCodeSupport branch from f25869d to 78a3637 Compare December 18, 2024 19:04

Merge branch 'main' into GraniteCodeSupport

6fb3b98

Jack-Khuu merged commit 90749d2 into pytorch:main Dec 19, 2024
51 of 53 checks passed

mikekgfb mentioned this pull request Dec 29, 2024

Supply Local Weights to an LLM instead of Downloading Weights from HuggingFace #1446

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Granite code support #1336

Granite code support #1336

gabe-l-hart commented Oct 31, 2024 •

edited

Loading

pytorch-bot bot commented Oct 31, 2024 •

edited

Loading

gabe-l-hart commented Oct 31, 2024

gabe-l-hart commented Nov 7, 2024

mikekgfb commented Nov 8, 2024 •

edited

Loading

gabe-l-hart commented Nov 8, 2024 •

edited

Loading

Jack-Khuu commented Nov 9, 2024

byjlw commented Nov 19, 2024

gabe-l-hart commented Nov 19, 2024

gabe-l-hart commented Nov 20, 2024 •

edited

Loading

mikekgfb commented Nov 21, 2024

mikekgfb commented Nov 21, 2024 •

edited

Loading

gabe-l-hart commented Nov 21, 2024

gabe-l-hart commented Nov 21, 2024

gabe-l-hart Nov 21, 2024

gabe-l-hart Nov 21, 2024

byjlw Nov 25, 2024

gabe-l-hart Nov 21, 2024

byjlw commented Nov 23, 2024

byjlw left a comment

byjlw Nov 25, 2024

Jack-Khuu commented Dec 9, 2024

Jack-Khuu commented Dec 10, 2024

gabe-l-hart commented Dec 10, 2024

Jack-Khuu left a comment

Jack-Khuu Dec 11, 2024

Jack-Khuu Dec 11, 2024

gabe-l-hart Dec 11, 2024

Jack-Khuu commented Dec 14, 2024

Jack-Khuu commented Dec 19, 2024

Granite code support #1336

Granite code support #1336

Conversation

gabe-l-hart commented Oct 31, 2024 • edited Loading

Dependencies

Issues

Description

Discussion

Usage

Open Questions

pytorch-bot bot commented Oct 31, 2024 • edited Loading

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1336

❌ 1 New Failure, 1 Unrelated Failure

gabe-l-hart commented Oct 31, 2024

gabe-l-hart commented Nov 7, 2024

mikekgfb commented Nov 8, 2024 • edited Loading

gabe-l-hart commented Nov 8, 2024 • edited Loading

Jack-Khuu commented Nov 9, 2024

byjlw commented Nov 19, 2024

gabe-l-hart commented Nov 19, 2024

gabe-l-hart commented Nov 20, 2024 • edited Loading

mikekgfb commented Nov 21, 2024

mikekgfb commented Nov 21, 2024 • edited Loading

gabe-l-hart commented Nov 21, 2024

gabe-l-hart commented Nov 21, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

byjlw commented Nov 23, 2024

byjlw left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Jack-Khuu commented Dec 9, 2024

Jack-Khuu commented Dec 10, 2024

gabe-l-hart commented Dec 10, 2024

Jack-Khuu left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Jack-Khuu commented Dec 14, 2024

Jack-Khuu commented Dec 19, 2024

gabe-l-hart commented Oct 31, 2024 •

edited

Loading

pytorch-bot bot commented Oct 31, 2024 •

edited

Loading

mikekgfb commented Nov 8, 2024 •

edited

Loading

gabe-l-hart commented Nov 8, 2024 •

edited

Loading

gabe-l-hart commented Nov 20, 2024 •

edited

Loading

mikekgfb commented Nov 21, 2024 •

edited

Loading