[Feature Request]: Improve Chinese analyzer #1308

Open
1 task done
yingfeng opened this issue Jun 8, 2024 · 0 comments
Labels
feature request New feature or request

Comments

@yingfeng (Member) commented Jun 8, 2024

Is there an existing issue for the same feature request?

  • I have checked the existing issues.

Describe the feature you'd like

The current Jieba-based analyzer for Chinese has several problems:

  1. Stopwords are removed via external dictionaries, so the final output does not have continuous offsets, which breaks phrase queries (see the sketch after this list).
  2. No stemmer is applied to English tokens.
  3. Query segmentation uses a finer granularity but lacks a smart policy, which hurts ranking for Chinese text.
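
Problem 1 can be illustrated outside the project. Below is a minimal sketch, assuming the Python jieba package rather than this project's C++ analyzer; the sentence and stopword list are made up for the example.

```python
# Minimal sketch (Python jieba, not this project's C++ analyzer): dropping
# stopwords while keeping the original token positions leaves a gap, so a
# phrase query that expects adjacent positions no longer matches.
import jieba

STOPWORDS = {"的", "了"}  # illustrative stopword list

text = "我爱北京的天安门"
tokens = list(enumerate(jieba.cut(text)))
print(tokens)
# e.g. [(0, '我'), (1, '爱'), (2, '北京'), (3, '的'), (4, '天安门')]

filtered = [(pos, w) for pos, w in tokens if w not in STOPWORDS]
print(filtered)
# [(0, '我'), (1, '爱'), (2, '北京'), (4, '天安门')]
# Position 3 is gone, so a phrase query for "北京 天安门" (which expects
# adjacent positions) fails even though the words are adjacent after filtering.
```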
yingfeng added the feature request label Jun 8, 2024
yingfeng pushed a commit that referenced this issue Jun 11, 2024
Introduced CutGrain for the Chinese analyzer
Issue link: #1308

- [x] New Feature (non-breaking change which adds functionality)
- [x] Test cases
yingfeng mentioned this issue Jun 11, 2024
yingfeng added a commit that referenced this issue Jun 11, 2024
### What problem does this PR solve?

1. Inherit from CommonLanguageAnalyzer instead of Analyzer.
2. Return logical offsets through CommonLanguageAnalyzer.
3. A stemmer can now be applied to Latin tokens (see the sketch below).

Issue link: #1308

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
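
For context, here is an illustrative Python sketch of the two ideas in this PR; it is not the CommonLanguageAnalyzer code, and the stopword set and use of NLTK's PorterStemmer are assumptions made for the example.

```python
# Illustrative sketch (Python, not the CommonLanguageAnalyzer code) of:
# (1) gap-free logical positions after stopword filtering, and
# (2) stemming applied only to Latin-script tokens.
import re
from nltk.stem import PorterStemmer  # assumes nltk is installed

STOPWORDS = {"的", "the"}  # illustrative
stemmer = PorterStemmer()

def analyze(tokens):
    out = []
    pos = 0  # logical position: increments only for kept tokens, so no gaps
    for tok in tokens:
        if tok in STOPWORDS:
            continue
        if re.fullmatch(r"[A-Za-z]+", tok):  # Latin token -> stem it
            tok = stemmer.stem(tok)
        out.append((pos, tok))
        pos += 1
    return out

print(analyze(["running", "the", "北京", "的", "tests"]))
# [(0, 'run'), (1, '北京'), (2, 'test')]
```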
yuzhichang added a commit that referenced this issue Jun 12, 2024
Fix a bug in Chinese phrase queries
Issue link: #1308

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Test cases
yingfeng mentioned this issue Jun 20, 2024
yuzhichang pushed a commit that referenced this issue Jun 21, 2024
### What problem does this PR solve?

1. The Chinese jieba analyzer outputs " " (whitespace) tokens for Latin text (see the sketch after this PR description).
2. The standard analyzer outputs discontinuous offsets when a delimiter exists between tokens.

Issue link: #1308

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
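
The whitespace behavior in point 1 can be seen with the Python jieba package itself, which emits the delimiter as a token; the sketch below is illustrative only and does not reflect the project's fix.

```python
# Illustrative sketch (Python jieba): jieba emits the delimiter itself as a
# token for Latin text, so the analyzer must drop whitespace tokens without
# introducing gaps in the positions it reports.
import jieba

print(list(jieba.cut("hello world 你好")))
# e.g. ['hello', ' ', 'world', ' ', '你好']  <- ' ' comes back as a token

tokens = [t for t in jieba.cut("hello world 你好") if not t.isspace()]
print(list(enumerate(tokens)))  # re-numbered, gap-free positions
# [(0, 'hello'), (1, 'world'), (2, '你好')]
```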
yingfeng added a commit that referenced this issue Jun 21, 2024
### What problem does this PR solve?

Use a modified jieba query segmentation for the fine-grained Chinese analyzer (sketched below).

Issue: #1308

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
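
For intuition, the sketch below contrasts jieba's default (coarse) cut with its search-mode (fine-grained) cut, using the Python jieba package; this PR uses a modified segmentation, so its actual output will differ.

```python
# Illustrative contrast between jieba's default (coarse) cut and its
# search-mode (fine-grained) cut; the PR's modified segmentation differs.
import jieba

query = "中华人民共和国国歌"
print(list(jieba.cut(query)))             # default, coarse-grained
# e.g. ['中华人民共和国', '国歌']
print(list(jieba.cut_for_search(query)))  # fine-grained, search-oriented
# e.g. ['中华', '华人', '人民', '共和', '共和国', '中华人民共和国', '国歌']
```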