Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing fsep token from GPTRefactForCausalLM #8237

Merged
merged 1 commit into from
Jul 12, 2024

Conversation

jpodivin
Copy link
Contributor

@jpodivin jpodivin commented Jul 1, 2024

The <filename> token used by Refact doesn't serve the same purpose as the <file_separator> from CodeGemma.

I've been going through docs for both Refact and CodeGemma, and came to conclusion that <filename> isn't a good fit for the file separator token. The biggest issue is the lack of documentation on the matter.
Only mention of the <filename> token on Huggingface and other sources comes from description of Starcoder training[0] where it is used to mark start of filename for prompt. This is substantially different from the way it is used by CodeGemma[1].

Repos of Small Magellanic Cloud AI on github and HF don't provide more information.

With that in mind, I believe it's better not to set it as fsep token.

[0] https://huggingface.co/blog/starcoder#evaluation
[1] https://ai.google.dev/gemma/docs/formatting

@github-actions github-actions bot added the python python script changes label Jul 1, 2024
@mofosyne mofosyne added medium severity Used to report medium severity bugs in llama.cpp (e.g. Malfunctioning Features but still useable) Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level and removed medium severity Used to report medium severity bugs in llama.cpp (e.g. Malfunctioning Features but still useable) labels Jul 3, 2024
special_vocab._set_special_token("prefix", 1)
special_vocab._set_special_token("suffix", 3)
special_vocab._set_special_token("middle", 2)
special_vocab._set_special_token("fsep", 4) # is this correct?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Token id 4 in Refact's tokenizer seems to be <fim_pad> anyway, which doesn't look like a separator, so it was not correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I've rebased the PR and resolved conflicts.

The <filename> token used by Refact doesn't serve
the same purpose as the <file_separator> from CodeGemma.

Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
@ggerganov ggerganov merged commit 5aefbce into ggml-org:master Jul 12, 2024
9 checks passed
@jpodivin jpodivin deleted the filetoken branch July 13, 2024 08:31
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 13, 2024
The <filename> token used by Refact doesn't serve
the same purpose as the <file_separator> from CodeGemma.

Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 13, 2024
The <filename> token used by Refact doesn't serve
the same purpose as the <file_separator> from CodeGemma.

Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
python python script changes Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants