Fix garbage outputs of Chinese analyzer (infiniflow#1306)
### What problem does this PR solve?

The copy constructor of ChineseAnalyzer does not copy the stopwords. As a
result, analyzers created by copying emit stopword tokens in their output.
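
As a rough illustration of this failure mode and the fix, here is a self-contained sketch that models only the stopword handling. `ToyAnalyzer`, `LoadStopwords`, `AcceptToken`, `Filter`, and `FilterThroughCopy` are hypothetical names, and standard-library types stand in for the project's `SharedPtr` and `FlatHashSet` aliases. The null check in `AcceptToken` is an assumption of this sketch and is not part of the change itself.

```cpp
#include <istream>
#include <memory>
#include <sstream>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for ChineseAnalyzer; only stopword handling is modeled.
class ToyAnalyzer {
public:
    ToyAnalyzer() = default;

    // Copies share the stopword set through the shared pointer.
    // Leaving this member out of the copy is the bug described above.
    ToyAnalyzer(const ToyAnalyzer &other) : stopwords_(other.stopwords_) {}

    // Reads one stopword per line into a freshly allocated shared set.
    void LoadStopwords(std::istream &in) {
        auto words = std::make_shared<std::unordered_set<std::string>>();
        std::string line;
        while (std::getline(in, line)) {
            words->insert(line);
        }
        stopwords_ = std::move(words);
    }

    // Assumption of this sketch: with no set loaded, every token is accepted.
    bool AcceptToken(const std::string &term) const {
        return !stopwords_ || stopwords_->count(term) == 0;
    }

    // Returns the tokens that are not stopwords, preserving order.
    std::vector<std::string> Filter(const std::vector<std::string> &tokens) const {
        std::vector<std::string> kept;
        for (const auto &token : tokens) {
            if (AcceptToken(token)) {
                kept.push_back(token);
            }
        }
        return kept;
    }

private:
    std::shared_ptr<std::unordered_set<std::string>> stopwords_;
};

// Loads stopwords into a prototype, copies it, and filters through the copy.
std::vector<std::string> FilterThroughCopy(const std::string &stopword_lines,
                                           const std::vector<std::string> &tokens) {
    ToyAnalyzer prototype;
    std::istringstream in(stopword_lines);
    prototype.LoadStopwords(in);
    ToyAnalyzer copy(prototype);
    return copy.Filter(tokens);
}
```

Calling `FilterThroughCopy("the\nof\n", {"the", "cat", "of", "war"})` keeps only `cat` and `war`, because the copy sees the same set as the prototype. Sharing the set through a pointer also means a copy does not duplicate the whole dictionary.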

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
yingfeng authored and wuxiaobai24 committed Jun 8, 2024
1 parent cdd1f56 commit 33028a5
Showing 2 changed files with 7 additions and 4 deletions.
7 changes: 5 additions & 2 deletions src/common/analyzer/chinese_analyzer.cpp
@@ -50,10 +50,12 @@ ChineseAnalyzer::ChineseAnalyzer(const String &path) : dict_path_(path) {}
 ChineseAnalyzer::ChineseAnalyzer(const ChineseAnalyzer &other) {
     own_jieba_ = false;
     jieba_ = other.jieba_;
+    stopwords_ = other.stopwords_;
 }
 ChineseAnalyzer::~ChineseAnalyzer() {
-    if (own_jieba_ && jieba_)
+    if (own_jieba_ && jieba_) {
         delete jieba_;
+    }
 }
 
 Status ChineseAnalyzer::Load() {
@@ -98,8 +100,9 @@ Status ChineseAnalyzer::Load() {
 void ChineseAnalyzer::LoadStopwordsDict(const String &stopwords_path) {
     std::ifstream ifs(stopwords_path);
     String line;
+    stopwords_ = MakeShared<FlatHashSet<String>>();
     while (getline(ifs, line)) {
-        stopwords_.insert(line);
+        stopwords_->insert(line);
     }
 }

4 changes: 2 additions & 2 deletions src/common/analyzer/chinese_analyzer.cppm
@@ -42,13 +42,13 @@ protected:
 
 private:
     void LoadStopwordsDict(const String &stopwords_path);
-    bool Accept_token(const String &term) { return !stopwords_.contains(term); }
+    bool Accept_token(const String &term) { return !stopwords_->contains(term); }
 
 private:
     cppjieba::Jieba *jieba_{nullptr};
     String dict_path_;
     bool own_jieba_{};
     Vector<cppjieba::Word> cut_words_;
-    FlatHashSet<String> stopwords_;
+    SharedPtr<FlatHashSet<String>> stopwords_{};
 };
 } // namespace infinity
