Enable Tokenizer Loading from downloaded Teacher Model #343
Something that came up: multiple downstream implementations will not have the teacher model on disk where the …
@relyt0925 I'm tagging you in for visibility. As part of the new chunking implementation in InstructLab, we'll attempt to use the actual tokenizer that the teacher model uses when chunking up documents. An implication of that is that we'll expect to be able to access that tokenizer from the … This issue is to figure out how to handle loading this tokenizer from the teacher model. If you have any input as a downstream user on how we should fall back to something else when we can't find the teacher model's tokenizer, or on how users running teacher models outside of the CLI (via …
Digging a bit into vLLM, it has …
Something I just realized: we may want to decouple the tokenizer fetching from the actual model serving. Looking forward to a potential refactor, the doc ingestion may have to happen before the teacher model gets served at all. Our code right now just happens to serve the model before reading the taxonomy and preparing datasets, but it likely won't stay that way.
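A minimal sketch of what that decoupling could look like. The `resolve_tokenizer` step, `run_pipeline`, and the `serve_fn` callable are all hypothetical names, not current InstructLab code; the only point illustrated is the ordering, where the tokenizer is resolved and documents are ingested before the teacher model is ever served:

```python
class _StubTokenizer:
    """Stand-in for a real tokenizer; shows only the interface the chunker needs."""
    def encode(self, text):
        return text.split()

def resolve_tokenizer(model_path):
    # Placeholder: in practice this would load the tokenizer files found
    # under model_path (e.g. with AutoTokenizer), without starting a server.
    return _StubTokenizer()

def ingest_documents(docs, tokenizer):
    """Chunk documents with an already-resolved tokenizer."""
    return [tokenizer.encode(d) for d in docs]

def run_pipeline(model_path, docs, serve_fn):
    tokenizer = resolve_tokenizer(model_path)    # step 1: tokenizer only
    chunks = ingest_documents(docs, tokenizer)   # step 2: doc ingestion
    server = serve_fn(model_path)                # step 3: serve teacher last
    return chunks, server
```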
@bbrowning sorry for missing this! We have a workaround for this now, but I agree with everything you said.
Currently, the AutoTokenizer in the chunker tries to pull the tokenizer from the teacher model, but this hasn't been tested thoroughly outside of Mixtral. By default, `AutoTokenizer.from_pretrained()` will go to Hugging Face for the model, and in the case of Mixtral the repo is gated, so you'd need to set `$HF_TOKEN`. What should happen is that we pull the tokenizer from the downloaded teacher model and raise an error if it cannot be found.
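One way that suggested behavior could look, as a sketch (the `load_teacher_tokenizer` helper and its error messages are hypothetical, not existing code): pass `local_files_only=True` so `AutoTokenizer.from_pretrained()` never falls back to the Hugging Face Hub, and surface a clear error when the tokenizer files aren't alongside the downloaded model:

```python
from pathlib import Path

from transformers import AutoTokenizer

def load_teacher_tokenizer(model_path: str):
    """Load the tokenizer shipped alongside a locally downloaded teacher model.

    Hypothetical helper: model_path is the directory the teacher model was
    downloaded to. Never reaches out to the Hugging Face Hub, so gated repos
    (e.g. Mixtral) and $HF_TOKEN are not involved.
    """
    path = Path(model_path)
    if not path.is_dir():
        raise ValueError(f"Teacher model directory not found: {model_path}")
    try:
        # local_files_only=True disables the fallback to the Hub.
        return AutoTokenizer.from_pretrained(model_path, local_files_only=True)
    except OSError as err:
        raise RuntimeError(
            f"Could not load tokenizer from {model_path}; ensure the "
            "tokenizer files were downloaded along with the teacher model."
        ) from err
```

A caller would then decide whether to abort generation or fall back to a default tokenizer when this raises.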