
Enable Tokenizer Loading from downloaded Teacher Model #343

Closed
khaledsulayman opened this issue Nov 7, 2024 · 5 comments · Fixed by #364
Labels: bug (Something isn't working), jira

Comments

@khaledsulayman
Member

Currently, the AutoTokenizer in the chunker tries to pull the tokenizer from the teacher model, but this hasn't been tested thoroughly outside of Mixtral.

By default, AutoTokenizer.from_pretrained() will go to Hugging Face for the model, and in the case of Mixtral the repo is gated, so you'd need to set $HF_TOKEN.

What should happen instead is that we pull the tokenizer from the downloaded teacher model and raise an error if it cannot be found.
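A minimal sketch of what that could look like (load_teacher_tokenizer and the model_path argument are hypothetical names for illustration, not existing InstructLab code):

```python
from pathlib import Path

from transformers import AutoTokenizer


def load_teacher_tokenizer(model_path: str):
    """Load the tokenizer from an on-disk teacher model checkout."""
    if not Path(model_path).exists():
        raise FileNotFoundError(
            f"Teacher model not found at {model_path}; cannot load its tokenizer."
        )
    # local_files_only=True prevents the implicit fallback to huggingface.co,
    # which would require $HF_TOKEN for gated repos such as Mixtral.
    return AutoTokenizer.from_pretrained(model_path, local_files_only=True)
```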

@khaledsulayman added the bug (Something isn't working) label on Nov 7, 2024
@bbrowning
Contributor

Something that came up is that multiple downstream implementations will not have the teacher model on disk where the ilab CLI is run; instead, they run the teacher model in separate server processes that are not local to the container/machine running the CLI.

@bbrowning
Contributor

@relyt0925 I'm tagging you in for visibility: as part of the new chunking implementation in InstructLab, we'll attempt to use the actual tokenizer that the teacher model uses when chunking up documents. An implication of that is that we expect to be able to access that tokenizer from the ilab CLI process, which may not be possible in your case, since I believe you're running the teacher models separately rather than on the same machine running the ilab data generate CLI.

This issue is to figure out how to handle loading this tokenizer from the teacher model. If you have input as a downstream user on how we should fall back to something else when we can't find the teacher model's tokenizer, or on how users running teacher models outside of the CLI (via --endpoint-url) should get a tokenizer on disk for the ilab CLI (or whether they should at all), this is probably a good place to surface those concerns to the community.

@bbrowning
Contributor

Digging a bit into vllm, it has /tokenize and /detokenize endpoints added in vllm-project/vllm#5054, and llama-cpp-python has /extras/tokenize added in abetlen/llama-cpp-python#1136. So both of our default servers can give us a token count by calling an endpoint on the server, as opposed to us needing to load the tokenizer and tokenize ourselves on the client side. That seems cleaner to me: we're guaranteed to already have the proper tokenizer loaded on the server, and we don't have to fuss with finding the teacher model's tokenizer ourselves at all. The only reason this won't work is if we need to support arbitrary servers that are not vllm or llama-cpp-python. Do we?
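For reference, a rough sketch of counting tokens via the server's tokenize endpoint (count_tokens is a hypothetical helper; the payload follows vLLM's /tokenize API, and the exact response fields may differ between servers):

```python
import requests


def count_tokens(endpoint_url: str, model: str, text: str) -> int:
    """Ask the serving backend to tokenize `text` and return the token count."""
    resp = requests.post(
        f"{endpoint_url.rstrip('/')}/tokenize",
        json={"model": model, "prompt": text},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    # vLLM returns a "count" field; fall back to the length of the token list
    # if a server only returns the tokens themselves.
    return data.get("count", len(data.get("tokens", [])))
```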

@khaledsulayman
Member Author

Something I just realized: we may want to decouple the tokenizer fetching from the actual model serving.

Looking forward to a potential refactor, the doc ingestion work may have to happen before the teacher model gets served at all. Our code right now just happens to serve the model before reading the taxonomy and preparing datasets, but it likely won't stay that way.

@relyt0925
Contributor

@bbrowning sorry for missing this! We have a workaround for this now but I agree with everything you said.
