
Enable Tokenizer Loading from downloaded Teacher Model #343

Closed
khaledsulayman opened this issue Nov 7, 2024 · 5 comments · Fixed by #364
Labels: bug (Something isn't working), jira

Comments

@khaledsulayman
Member

Currently, the AutoTokenizer in the chunker tries to pull the tokenizer from the teacher model, but this hasn't been tested thoroughly outside of Mixtral.

By default, AutoTokenizer.from_pretrained() will go to Hugging Face for the model, and in the case of Mixtral the repo is gated, so you'd need to set $HF_TOKEN.

What should happen instead is that we pull the tokenizer from the downloaded teacher model and raise an error if it cannot be found.
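A minimal sketch of what that could look like (load_teacher_tokenizer and the model_path argument are hypothetical names for illustration, not existing InstructLab code):

```python
from pathlib import Path

from transformers import AutoTokenizer


def load_teacher_tokenizer(model_path: str):
    """Load the tokenizer from an on-disk teacher model checkout."""
    if not Path(model_path).exists():
        raise FileNotFoundError(
            f"Teacher model not found at {model_path}; cannot load its tokenizer."
        )
    # local_files_only=True prevents the implicit fallback to huggingface.co,
    # which would require $HF_TOKEN for gated repos such as Mixtral.
    return AutoTokenizer.from_pretrained(model_path, local_files_only=True)
```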

@khaledsulayman added the bug (Something isn't working) label on Nov 7, 2024
@bbrowning
Contributor

Something that came up is that multiple downstream implementations will not have the teacher model on disk where the ilab CLI is run; instead, they run the teacher model in separate server processes that are not local to the container/machine running the CLI.

@bbrowning
Contributor

@relyt0925 I'm tagging you in for visibility: as part of the new chunking implementation in InstructLab, we'll attempt to use the actual tokenizer that the teacher model uses when chunking up documents. An implication of that is that we expect to be able to access that tokenizer from the ilab CLI process, which may not be possible in your case, since I believe you're running the teacher models separately rather than on the same machine running the ilab data generate CLI.

This issue is to figure out how to handle loading this tokenizer from the teacher model. If you have input as a downstream user on how we should fall back to something else when we can't find the teacher model's tokenizer, or on how users running teacher models outside of the CLI (via --endpoint-url) should get a tokenizer on disk for the ilab CLI (or whether they should at all), this is probably a good place to surface those concerns to the community.

@bbrowning
Contributor

Digging a bit into vllm, it has /tokenize and /detokenize endpoints added in vllm-project/vllm#5054, and llama-cpp-python has /extras/tokenize added in abetlen/llama-cpp-python#1136. So both of our default servers can give us a token count by calling an endpoint on the server, as opposed to us needing to load the tokenizer and tokenize ourselves on the client side. That seems cleaner to me: we're guaranteed to already have the proper tokenizer loaded on the server, and we don't have to fuss with finding the teacher model's tokenizer ourselves at all. The only reason this won't work is if we need to support arbitrary servers that are not vllm or llama-cpp-python. Do we?
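For reference, a rough sketch of counting tokens via the server's tokenize endpoint (count_tokens is a hypothetical helper; the payload follows vLLM's /tokenize API, and the exact response fields may differ between servers):

```python
import requests


def count_tokens(endpoint_url: str, model: str, text: str) -> int:
    """Ask the serving backend to tokenize `text` and return the token count."""
    resp = requests.post(
        f"{endpoint_url.rstrip('/')}/tokenize",
        json={"model": model, "prompt": text},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    # vLLM returns a "count" field; fall back to the length of the token list
    # if a server only returns the tokens themselves.
    return data.get("count", len(data.get("tokens", [])))
```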

@khaledsulayman
Member Author

Something I just realized: we may want to decouple the tokenizer fetching from the actual model serving.

Looking forward to a potential refactor, the doc ingestion work may have to happen before the teacher model gets served at all. Our code right now just happens to serve the model before reading the taxonomy and preparing datasets, but it likely won't stay that way.

@relyt0925
Contributor

@bbrowning sorry for missing this! We have a workaround for this now but I agree with everything you said.
