fix(tokenizers): Change the return type of pre_tokenize to allow ownership

Some pre-tokenizers mutate the data as they split it (in particular, the
byte-level pre-tokenizer), so the returned pieces must own their data.
This may carry a performance cost, because pre-tokenizers that do not
need ownership will now make copies.
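
To illustrate why views into the input are not enough, here is a minimal
sketch. It is not the byte-level pre-tokenizer from this repository. It
assumes std::string_view as a stand-in for re2::StringPiece, and
split_and_mark_spaces is a hypothetical helper that rewrites each space as
"Ġ" (UTF-8 bytes 0xC4 0xA0), loosely modelled on byte-level pre-tokenization.
Because the output contains bytes that never occur in the input, each piece
has to be an owned std::string.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a mutating pre-tokenizer (not the commit's code):
// starts a new piece at every space and rewrites that space as "Ġ".
std::vector<std::string> split_and_mark_spaces(std::string_view input) {
  std::vector<std::string> pieces;
  std::string current;
  for (char c : input) {
    if (c == ' ') {
      // A space starts a new piece; flush the previous one if it is non-empty.
      if (!current.empty()) {
        pieces.push_back(current);
        current.clear();
      }
      current += "\xC4\xA0";  // bytes that never appear in `input`
    } else {
      current += c;
    }
  }
  if (!current.empty()) {
    pieces.push_back(current);
  }
  return pieces;
}
```

For the input "hello world" this yields two pieces, "hello" and "Ġworld".
The second piece cannot be expressed as a view into the original buffer.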

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
gabe-l-hart committed Nov 15, 2024
1 parent 0f1ba98 commit fd1a7cb
Showing 2 changed files with 12 additions and 7 deletions.
8 changes: 4 additions & 4 deletions tokenizer/pre_tokenizer.cpp
@@ -17,11 +17,11 @@ RegexPreTokenizer::create_regex_(const std::string& pattern) {
   return std::make_unique<re2::RE2>("(" + pattern + ")");
 }

-std::vector<re2::StringPiece> RegexPreTokenizer::pre_tokenize(re2::StringPiece& input) const {
-  std::vector<re2::StringPiece> result;
-  re2::StringPiece piece;
+std::vector<std::string> RegexPreTokenizer::pre_tokenize(re2::StringPiece& input) const {
+  std::vector<std::string> result;
+  std::string piece;
   while (RE2::FindAndConsume(&input, *regex_, &piece)) {
-    result.emplace_back(std::move(piece));
+    result.emplace_back(piece);
   }
   return result;
 }
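
The same loop can be sketched without RE2. This is an illustration under
assumptions, not the code from this commit: std::regex stands in for
re2::RE2, std::string_view for re2::StringPiece, and regex_pre_tokenize is a
hypothetical free function. Each match is copied into an owned std::string,
so the result does not depend on the lifetime of the input buffer.

```cpp
#include <regex>
#include <string>
#include <string_view>
#include <vector>

// Sketch only: returns every non-overlapping match of `pattern` in `input`
// as an owned string, skipping any text between matches.
std::vector<std::string> regex_pre_tokenize(std::string_view input,
                                            const std::regex& pattern) {
  std::vector<std::string> result;
  std::cregex_iterator it(input.data(), input.data() + input.size(), pattern);
  const std::cregex_iterator end;
  for (; it != end; ++it) {
    result.emplace_back(it->str());  // copies the matched bytes into the vector
  }
  return result;
}
```

With the pattern "[A-Za-z]+|[0-9]+" and the input "abc 123 de", this returns
the three pieces "abc", "123" and "de".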
11 changes: 8 additions & 3 deletions tokenizer/pre_tokenizer.h
@@ -25,8 +25,13 @@
 class PreTokenizer {
  public:

-  /** Split the input string piece into sub-pieces */
-  virtual std::vector<re2::StringPiece> pre_tokenize(re2::StringPiece& input) const = 0;
+  /** Split the input string piece into sub-pieces
+   *
+   * This pre-tokenization may result in sub-pieces that are not contained
+   * within the original input, therefore the resulting pieces will be owned by
+   * the caller.
+   */
+  virtual std::vector<std::string> pre_tokenize(re2::StringPiece& input) const = 0;
 }; // end class PreTokenizer


@@ -58,7 +63,7 @@ class RegexPreTokenizer : public PreTokenizer {
   {}

   /** Pre-tokenize with the stored regex */
-  std::vector<re2::StringPiece> pre_tokenize(re2::StringPiece& input) const;
+  std::vector<std::string> pre_tokenize(re2::StringPiece& input) const;

  protected:
   static Re2UPtr create_regex_(const std::string& pattern);