fix(tokenize_tool): Fix logic for decoding if encoded is empty
Not a terribly realistic use case, but this avoids a corner case (that I
just might be hitting while tokenizers is stubbed out!)

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
gabe-l-hart committed Oct 4, 2024
1 parent 9aedfdf commit 63f4096
Showing 1 changed file with 5 additions and 3 deletions.
8 changes: 5 additions & 3 deletions tokenizer/tokenize_tool.cpp
@@ -77,7 +77,7 @@ int main(int argc, char* argv[]) {
   // Encode
   std::cout << "PROMPT:" << std::endl << prompt << std::endl << std::endl;
   std::cout << "Encoding..." << std::endl;
-  const auto encoded = tok_ptr->encode(prompt, 1, 1);
+  const auto encoded = tok_ptr->encode(prompt, 0, 0);
   std::cout << "[";
   for (const auto tok_id : encoded) {
     std::cout << " " << tok_id;
@@ -86,8 +86,10 @@ int main(int argc, char* argv[]) {

   // Decode
   std::cout << "Decoding..." << std::endl;
-  for (auto i = 1; i < encoded.size() - 1; ++i) {
-    std::cout << tok_ptr->decode(encoded[i-1], encoded[i]);
+  uint64_t prev = tok_ptr->bos_tok();
+  for (const auto& current : encoded) {
+    std::cout << tok_ptr->decode(prev, current);
+    prev = current;
   }
   std::cout << std::endl;
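
For reviewers, here is a minimal sketch of the corner case this change appears to address. It assumes `encoded` is a `std::vector<uint64_t>` (the diff does not show its type) and uses a hypothetical `StubTokenizer` in place of the real tokenizer class; `old_loop_bound` and `decode_all` are illustrative names, not part of the repository. The removed loop compared `i` against `encoded.size() - 1`, and because `size()` is unsigned, that bound wraps to a very large value when `encoded` is empty, so the loop would index past the end. The replacement seeds `prev` with the BOS token and iterates over the container directly, which is a no-op on empty input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the real tokenizer: decode() ignores `prev` and
// maps each id to a bracketed placeholder so the loop behavior can be checked.
struct StubTokenizer {
  uint64_t bos_tok() const { return 1; }
  std::string decode(uint64_t /*prev*/, uint64_t current) const {
    return "<" + std::to_string(current) + ">";
  }
};

// Upper bound used by the removed loop. size() is unsigned, so for an empty
// vector size() - 1 wraps to the largest std::size_t value instead of -1.
std::size_t old_loop_bound(const std::vector<uint64_t>& encoded) {
  return encoded.size() - 1;
}

// Pattern from the new code: seed `prev` with BOS and visit every token once.
// An empty input simply skips the loop body.
std::string decode_all(const StubTokenizer& tok,
                       const std::vector<uint64_t>& encoded) {
  std::string out;
  uint64_t prev = tok.bos_tok();
  for (const auto& current : encoded) {
    out += tok.decode(prev, current);
    prev = current;
  }
  return out;
}
```

With this stub, `decode_all` returns an empty string for an empty vector, while `old_loop_bound` on the same input yields the maximum `std::size_t`, which is why the old loop condition could not protect against empty input.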
