Indexing and searching use the same analyzer, but the query returns no hits #1851

SxunS · 2023-10-17T06:20:54Z

The following is the official Lucene 9.7 example; only the stored value was changed.

    @org.junit.jupiter.api.Test
    public void test3() throws IOException, ParseException {
        Analyzer analyzer = new HanLPAnalyzer();

        Path indexPath = Files.createTempDirectory("tempIndex");
        Directory directory = FSDirectory.open(indexPath);
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter iwriter = new IndexWriter(directory, config);
        Document doc = new Document();
        String text = "中国人";
        doc.add(new TextField("fieldname", text, Field.Store.YES));
        iwriter.addDocument(doc);
        iwriter.close();

        // Now search the index:
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);
        // Parse a simple query that searches for "text":
        QueryParser parser = new QueryParser("fieldname", analyzer);
        Query query = parser.parse(text);
        ScoreDoc[] hits = isearcher.search(query, 10).scoreDocs;
        assertEquals(1, hits.length);
        // Iterate through the results:
        StoredFields storedFields = isearcher.storedFields();
        for (int i = 0; i < hits.length; i++) {
            Document hitDoc = storedFields.document(hits[i].doc);
            assertEquals("中国人", hitDoc.get("fieldname"));
        }
        ireader.close();
        directory.close();
        IOUtils.rm(indexPath);
    }

Result:

org.opentest4j.AssertionFailedError: 
Expected :1
Actual   :0

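While debugging I found that, at query time, the analyzer splits 中国人 into 中国 and 人, so the query finds nothing.
Yet indexing and searching use the same analyzer.

Changing the search text to "A 中国人" does produce a hit; in that case the query is segmented as expected, into A and 中国人.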
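
Why a different segmentation at query time leads to zero hits can be sketched with a toy inverted index. This is a minimal illustration using only the Java standard library, not Lucene or HanLP code: the class `SegmentationMismatchDemo` and its methods are hypothetical, and the index-time token `中国人` is an assumption inferred from the fact that the query "A 中国人" hits. The search mimics an OR query, which is the default operator of Lucene's classic `QueryParser`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SegmentationMismatchDemo {
    // Toy inverted index: term -> ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index a document from the tokens the analyzer produced at index time.
    void addDocument(int docId, List<String> indexTokens) {
        for (String term : indexTokens) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // OR query: a document matches if it contains at least one query term.
    Set<Integer> search(List<String> queryTokens) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : queryTokens) {
            hits.addAll(postings.getOrDefault(term, Set.of()));
        }
        return hits;
    }

    public static void main(String[] args) {
        SegmentationMismatchDemo index = new SegmentationMismatchDemo();
        // Assumption: at index time the text was kept as the single term 中国人.
        index.addDocument(0, List.of("中国人"));

        // Query segmented as 中国 + 人: neither term exists in the index.
        System.out.println(index.search(List.of("中国", "人")));   // prints []
        // Query segmented as A + 中国人: the second term matches document 0.
        System.out.println(index.search(List.of("A", "中国人")));  // prints [0]
    }
}
```

Term matching is exact, so the two segmentations of the same text share no term and the document cannot be found, even though one analyzer instance is used throughout.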

Is this a bug or a feature?

System information

WIN11
HanLP-portable:1.8.4
hanlp-lucene-plugin:1.1.7

I've completed this form and searched the web for solutions.

hankcs · 2023-10-18T00:53:16Z

hanlp-lucene-plugin currently supports Lucene 7.2.0, not Lucene 9.7. Lucene 9.7 has no org.apache.lucene.analysis.util.TokenizerFactory class, so your code cannot possibly have compiled. Either you are not running the code you listed and are using some other tokenizer, or you are not running the official release.
Lucene 7.2.0 does not have the no-hit problem: https://github.com/hankcs/hanlp-lucene-plugin/blob/c6be0de363022a38436490cd19761881ebad41e8/src/test/java/com/hankcs/lucene/HanLPAnalyzerTest.java#L87

    public void testIndexAndSearch() throws Exception
    {
        Analyzer analyzer = new HanLPAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        Directory directory = new RAMDirectory();
        IndexWriter indexWriter = new IndexWriter(directory, config);

        Document document = new Document();
        document.add(new TextField("content", "中国人", Field.Store.YES));
        indexWriter.addDocument(document);

        indexWriter.commit();
        indexWriter.close();

        IndexReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);
        QueryParser parser = new QueryParser("content", analyzer);
        Query query = parser.parse("中国人");
        ScoreDoc[] hits = isearcher.search(query, 300000).scoreDocs;
        assertEquals(1, hits.length);
        for (ScoreDoc scoreDoc : hits)
        {
            Document targetDoc = isearcher.doc(scoreDoc.doc);
            System.out.println(targetDoc.getField("content").stringValue());
        }
    }

SxunS · 2023-10-18T03:58:04Z

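Sorry, you are right. Because the project is built with Maven, I had not noticed that the org.apache.lucene.analysis.util.TokenizerFactory class actually in use does come from Lucene 7.2.0, which is why compilation did not fail (I am running the official release).
For the test case above, I set up a clean environment from scratch. The Maven dependency coordinates are: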

<dependencies>
    <dependency>
      <groupId>com.hankcs.nlp</groupId>
      <artifactId>hanlp-lucene-plugin</artifactId>
      <version>1.1.7</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.13.2</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>com.hankcs</groupId>
      <artifactId>hanlp</artifactId>
      <version>portable-1.8.4</version>
    </dependency>
  </dependencies>

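The result is still the same as described in the issue.
3. After removing the portable-1.8.4 dependency, the document is retrieved as expected, so I suspect the problem is related to com.hankcs:hanlp:portable-1.8.4.
4. Test result with the portable-1.8.4 dependency included:

5. Test result with the portable-1.8.4 dependency removed: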

SxunS · 2023-10-18T07:10:14Z

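Addendum:

portable-1.7.6: search works
portable-1.8.4: search fails (Figure 1)
Using option 2 (release jar + data + properties): search works

What is the difference between portable and the release jar? Is it only that the data dictionaries and models differ?
With portable I also use a custom dictionary (downloaded from the official source).
properties configuration: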

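# Root directory for the paths in this configuration file; root + other path = full path (relative paths are supported, see: https://github.com/hankcs/HanLP/pull/254)
# Windows users: always use / as the path separator
root=E:/xx/demo/document-search/document-search/document-search-server/src/main/resources

# The above is the only part that must be changed; uncomment and edit the options below as needed.
Normalization=true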

hankcs · 2023-10-19T01:45:20Z

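3a99bc6 most likely introduced an initialization bug.
The portable version loads the small (mini) model by default.
The bug only affects the mini model's result for the first segmentation after the JRE starts.
If you use the mini model, please use https://github.com/hankcs/HanLP/releases/tag/v1.8.1 or an earlier version. Otherwise, portable or not, you are not affected as long as your hanlp.properties does not load the mini model.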
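
A bug that only affects the first segmentation after startup can be caught by segmenting the same text twice and comparing the results. The sketch below is a hypothetical check using only the Java standard library. `FlakyFirstCallSegmenter` is an invented stand-in, not HanLP code, and the direction of the difference (whole text first, split afterwards) is an arbitrary choice for illustration. In a real project the `Function<String, List<String>>` would wrap whatever segmenter is under test.

```java
import java.util.List;
import java.util.function.Function;

class FirstCallConsistencyCheck {
    // Hypothetical segmenter whose first call differs from all later calls.
    static class FlakyFirstCallSegmenter implements Function<String, List<String>> {
        private boolean warmedUp = false;

        @Override
        public List<String> apply(String text) {
            if (!warmedUp) {
                warmedUp = true;
                // First call: the whole text comes back as one token.
                return List.of(text);
            }
            // Later calls: split after the second character (needs length >= 2).
            return List.of(text.substring(0, 2), text.substring(2));
        }
    }

    // Returns true when two consecutive calls on the same text agree.
    static boolean isConsistent(Function<String, List<String>> segmenter, String text) {
        List<String> first = segmenter.apply(text);
        List<String> second = segmenter.apply(text);
        return first.equals(second);
    }

    public static void main(String[] args) {
        // The flaky stand-in disagrees with itself between call one and call two.
        System.out.println(isConsistent(new FlakyFirstCallSegmenter(), "中国人")); // prints false
        // A stateless segmenter passes the same check.
        System.out.println(isConsistent(t -> List.of(t), "中国人"));               // prints true
    }
}
```

Such a check has to run in a fresh JVM with a newly created segmenter, because any earlier segmentation call would already have consumed the "first call".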

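Thanks for the feedback. This has been fixed; please check whether the commits above resolve the problem.
If anything is still wrong, feel free to reopen the issue.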

SxunS added the bug label Oct 17, 2023

SxunS assigned hankcs Oct 17, 2023

hankcs closed this as completed Oct 18, 2023

hankcs added invalid and removed bug labels Oct 18, 2023

hankcs added bug and removed invalid labels Oct 19, 2023

hankcs added a commit that referenced this issue Oct 19, 2023

Fix an inconsistency that could occur in the mini bigram model's first segmentation fix: #1851 (comment)

676b266

hankcs added a commit that referenced this issue Oct 19, 2023

Fix an inconsistency that could occur in the mini bigram model's first segmentation after JRE initialization fix: #1851 (comment)

4b2686c
