[Feature Request]: Improve Chinese analyzer #1308

Open
1 task done
yingfeng opened this issue Jun 8, 2024 · 0 comments
Labels
feature request New feature or request

Comments

@yingfeng (Member) commented Jun 8, 2024

Is there an existing issue for the same feature request?

  • I have checked the existing issues.

Describe the feature you'd like

The current Jieba-based analyzer for Chinese has several problems:

  1. Stopwords are removed via external dictionaries, so the final output does not have continuous offsets, which breaks phrase queries (see the sketch after this list).
  2. No stemmer is applied to English tokens.
  3. Query segmentation uses a finer granularity but lacks a smart policy, which hurts ranking for Chinese text.
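
Problem 1 can be illustrated outside the project. Below is a minimal sketch, assuming the Python jieba package rather than this project's C++ analyzer; the sentence and stopword list are made up for the example.

```python
# Minimal sketch (Python jieba, not this project's C++ analyzer): dropping
# stopwords while keeping the original token positions leaves a gap, so a
# phrase query that expects adjacent positions no longer matches.
import jieba

STOPWORDS = {"的", "了"}  # illustrative stopword list

text = "我爱北京的天安门"
tokens = list(enumerate(jieba.cut(text)))
print(tokens)
# e.g. [(0, '我'), (1, '爱'), (2, '北京'), (3, '的'), (4, '天安门')]

filtered = [(pos, w) for pos, w in tokens if w not in STOPWORDS]
print(filtered)
# [(0, '我'), (1, '爱'), (2, '北京'), (4, '天安门')]
# Position 3 is gone, so a phrase query for "北京 天安门" (which expects
# adjacent positions) fails even though the words are adjacent after filtering.
```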
yingfeng added the feature request label Jun 8, 2024
yingfeng pushed a commit that referenced this issue Jun 11, 2024
Introduced CutGrain for the Chinese analyzer
Issue link: #1308

- [x] New Feature (non-breaking change which adds functionality)
- [x] Test cases
yingfeng mentioned this issue Jun 11, 2024
yingfeng added a commit that referenced this issue Jun 11, 2024
### What problem does this PR solve?

1. Inherit from CommonLanguageAnalyzer instead of Analyzer.
2. Return logical offsets through CommonLanguageAnalyzer.
3. A stemmer can now be applied to Latin tokens (see the sketch below).

Issue link: #1308

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
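
For context, here is an illustrative Python sketch of the two ideas in this PR; it is not the CommonLanguageAnalyzer code, and the stopword set and use of NLTK's PorterStemmer are assumptions made for the example.

```python
# Illustrative sketch (Python, not the CommonLanguageAnalyzer code) of:
# (1) gap-free logical positions after stopword filtering, and
# (2) stemming applied only to Latin-script tokens.
import re
from nltk.stem import PorterStemmer  # assumes nltk is installed

STOPWORDS = {"的", "the"}  # illustrative
stemmer = PorterStemmer()

def analyze(tokens):
    out = []
    pos = 0  # logical position: increments only for kept tokens, so no gaps
    for tok in tokens:
        if tok in STOPWORDS:
            continue
        if re.fullmatch(r"[A-Za-z]+", tok):  # Latin token -> stem it
            tok = stemmer.stem(tok)
        out.append((pos, tok))
        pos += 1
    return out

print(analyze(["running", "the", "北京", "的", "tests"]))
# [(0, 'run'), (1, '北京'), (2, 'test')]
```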
yuzhichang added a commit that referenced this issue Jun 12, 2024
Fix a bug in Chinese phrase queries
Issue link: #1308

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Test cases
yingfeng mentioned this issue Jun 20, 2024
yuzhichang pushed a commit that referenced this issue Jun 21, 2024
### What problem does this PR solve?

1. The Chinese jieba analyzer outputs " " (whitespace) tokens for Latin text (see the sketch after this PR description).
2. The standard analyzer outputs discontinuous offsets when a delimiter exists between tokens.

Issue link: #1308

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
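
The whitespace behavior in point 1 can be seen with the Python jieba package itself, which emits the delimiter as a token; the sketch below is illustrative only and does not reflect the project's fix.

```python
# Illustrative sketch (Python jieba): jieba emits the delimiter itself as a
# token for Latin text, so the analyzer must drop whitespace tokens without
# introducing gaps in the positions it reports.
import jieba

print(list(jieba.cut("hello world 你好")))
# e.g. ['hello', ' ', 'world', ' ', '你好']  <- ' ' comes back as a token

tokens = [t for t in jieba.cut("hello world 你好") if not t.isspace()]
print(list(enumerate(tokens)))  # re-numbered, gap-free positions
# [(0, 'hello'), (1, 'world'), (2, '你好')]
```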
yingfeng added a commit that referenced this issue Jun 21, 2024
### What problem does this PR solve?

Use a modified jieba query segmentation for the fine-grained Chinese analyzer (sketched below).

Issue: #1308

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
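
For intuition, the sketch below contrasts jieba's default (coarse) cut with its search-mode (fine-grained) cut, using the Python jieba package; this PR uses a modified segmentation, so its actual output will differ.

```python
# Illustrative contrast between jieba's default (coarse) cut and its
# search-mode (fine-grained) cut; the PR's modified segmentation differs.
import jieba

query = "中华人民共和国国歌"
print(list(jieba.cut(query)))             # default, coarse-grained
# e.g. ['中华人民共和国', '国歌']
print(list(jieba.cut_for_search(query)))  # fine-grained, search-oriented
# e.g. ['中华', '华人', '人民', '共和', '共和国', '中华人民共和国', '国歌']
```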