This repository has been archived by the owner on Jun 24, 2024. It is now read-only.

Pull upstream changes into GptNeox #270

Merged (3 commits, May 24, 2023)

Conversation

LLukas22 (Contributor)

Fixes #246.

The tokenization behaviour has to be changed to make this work with every GPT-NeoX-based model: RedPajama, for example, doesn't like the added padding token at the start of a sequence.
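
Roughly, the change boils down to making the leading token opt-in per model instead of unconditional. A minimal sketch of that idea (hypothetical helper and arbitrary token IDs, not this crate's actual API):

```rust
/// Hypothetical helper: prepend the BOS/padding token only when the model
/// expects one, since some GPT-NeoX variants (e.g. RedPajama) reject it.
fn build_input(bos_token: u32, mut tokens: Vec<u32>, prepend_bos: bool) -> Vec<u32> {
    if prepend_bos {
        tokens.insert(0, bos_token);
    }
    tokens
}

fn main() {
    // Arbitrary token IDs, purely for illustration.
    assert_eq!(build_input(0, vec![12092, 1533], true), vec![0, 12092, 1533]);
    assert_eq!(build_input(0, vec![12092, 1533], false), vec![12092, 1533]);
}
```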

@philpax (Collaborator) left a comment

Great work! Minor style nits but apart from that this is good to go from my end.

Regarding the BOS: does upstream have the same issue?

Review threads on crates/models/gptneox/src/lib.rs (resolved)
@LLukas22 (Contributor, Author)

Nope, upstream is not affected by the BOS token shenanigans. Seems like they don't prepend anything at the start of a sequence.

@philpax (Collaborator) commented May 23, 2023

Maybe we should just set bos=false for GPT-NeoX? That feels like a weird hack, but I can't see anything in the HF config that controls this.

@Narsil sorry about summoning you here, but for a given HF model, how do we know if the BOS token should be added / what I presume is add_special_tokens here (we're switching to HF in #271) should be turned on? Does it have to be bounced up to the user?
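
For context, with the Rust `tokenizers` crate the switch lives on `encode` itself. A minimal sketch, assuming a `tokenizer.json` is available locally (not this repo's actual code):

```rust
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumption: a tokenizer.json downloaded from the hub sits next to the
    // binary; `Tokenizer::from_pretrained` would need the crate's `http` feature.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // The second argument is `add_special_tokens`: passing `false` stops the
    // post-processor from inserting BOS/EOS-style special tokens.
    let with_specials = tokenizer.encode("Hello world", true)?;
    let without_specials = tokenizer.encode("Hello world", false)?;

    println!("with:    {:?}", with_specials.get_ids());
    println!("without: {:?}", without_specials.get_ids());
    Ok(())
}
```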

philpax marked this pull request as ready for review May 23, 2023 19:22
@Narsil commented May 24, 2023

@philpax No worries for the summon, happy to help.

add_special_tokens should be True by default, yes.

But I think what you are also looking for is post_process.

The post_processor in tokenizers is the method that takes a vec of encodings (usually a single tokenized input or a pair, like question and context for question answering) and merges them into a single sequence to be consumed by the model.

Here is an example:
https://github.com/Narsil/smelte-rs/blob/main/examples/gpt2.rs#L261-L262

Unfortunately, some tokenizers are not necessarily correctly configured on the Hub, because some special tokens are sent directly within the string itself (especially useful for multilingual models like Whisper, or for translation models). So some knowledge might remain outside the tokenizer itself.

encode + post_process should be the go-to, I think.
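
A rough illustration of that encode + post_process flow with the Rust `tokenizers` crate, assuming a local `tokenizer.json` whose post-processor is configured with a pair template (e.g. BERT-style `[CLS] ... [SEP] ... [SEP]`):

```rust
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumption: tokenizer.json has a pair-aware post-processor configured.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // Encode the two inputs separately, without special tokens yet.
    let question = tokenizer.encode("Who wrote the book?", false)?;
    let context = tokenizer.encode("The book was written by Jane Doe.", false)?;

    // post_process merges the pair into one sequence and lets the configured
    // post-processor insert its special tokens.
    let merged = tokenizer.post_process(question, Some(context), true)?;
    println!("{:?}", merged.get_tokens());
    Ok(())
}
```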

philpax merged commit 0d36ab7 into rustformers:main May 24, 2023
hhamud mentioned this pull request Aug 7, 2023
Successfully merging this pull request may close these issues:

Update to latest upstream GPT-NeoX implementation / Fix GPT-NeoX inference
3 participants