ViterbiSegment does not correctly replace the DoubleArrayTrie when loading a custom dictionary #1834

Closed
1 task done
wxy929629 opened this issue Aug 11, 2023 · 2 comments


wxy929629 commented Aug 11, 2023

Describe the bug
ViterbiSegment does not correctly replace its DoubleArrayTrie when it loads a custom dictionary.

Code to reproduce the issue
com/hankcs/hanlp/seg/Viterbi/ViterbiSegment.java

    private void loadCustomDic(String customPath, boolean isCache)
    {
        if (TextUtility.isBlank(customPath))
        {
            return;
        }
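        // Log message: "Start loading custom dictionary: <customPath>"
        logger.info("开始加载自定义词典:" + customPath);
        // Local trie that receives the entries of every dictionary file.
        DoubleArrayTrie<CoreDictionary.Attribute> dat = new DoubleArrayTrie<CoreDictionary.Attribute>();
        // customPath is a ';'-separated list of dictionary files.
        String path[] = customPath.split(";");
        String mainPath = path[0];
        StringBuilder combinePath = new StringBuilder();
        for (String aPath : path)
        {
            combinePath.append(aPath.trim());
        }
        // Build mainPath from the first file's directory plus a hash of all the paths.
        File file = new File(mainPath);
        mainPath = file.getParent() + "/" + Math.abs(combinePath.toString().hashCode());
        mainPath = mainPath.replace("\\", "/");
        DynamicCustomDictionary.loadMainDictionary(mainPath, path, dat, isCache, config.normalization);
        // As reported in this issue: dat is filled above but never stored on the segmenter,
        // so the loaded entries are dropped when this method returns.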
    }

com/hankcs/hanlp/seg/SegmentTest.java

    public void testExtendViterbi() throws Exception
    {
        HanLP.Config.enableDebug(false);
        String path = System.getProperty("user.dir") + "/" + "data/dictionary/custom/CustomDictionary.txt;" +
            System.getProperty("user.dir") + "/" + "data/dictionary/custom/全国地名大全.txt";
        path = path.replace("\\", "/");
        String text = "一半天帕克斯曼是走不出丁字桥镇的";
        Segment segment = HanLP.newSegment().enableCustomDictionary(false);
        Segment seg = new ViterbiSegment(path);
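        // "Segmentation result with the dictionary disabled:"
        System.out.println("不启用字典的分词结果:" + segment.seg(text));
        // "Default segmentation result:"
        System.out.println("默认分词结果:" + HanLP.segment(text));
        seg.enableCustomDictionaryForcing(true).enableCustomDictionary(true);
        List<Term> termList = seg.seg(text);
        // "Segmentation result with the custom dictionary:"
        System.out.println("自定义字典的分词结果:" + termList);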
    }

Describe the current behavior
CustomDictionary.txt and 全国地名大全.txt are loaded and should contain the entry '丁字桥镇', but the actual segmentation output does not split it out as a word.

Expected behavior
The entry '丁字桥镇' should be segmented out as a single word.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS 13.3.1 (a) (22E772610a)
  • Python version: n/a
  • HanLP version: 1.8.4

Other info / logs
In com/hankcs/hanlp/seg/Viterbi/ViterbiSegment.java, loadCustomDic(String customPath, boolean isCache) should replace the segmenter's corresponding dictionary once the DoubleArrayTrie has finished loading.
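The pattern behind this report can be reproduced outside HanLP. The sketch below is only an illustration under stated assumptions: `CustomDictDemo`, `loadBuggy`, `loadFixed`, and `contains` are hypothetical names, and a plain `HashMap` stands in for the `DoubleArrayTrie`. It is not the code from the pull request and does not use any HanLP API. It shows how a loader that fills a local structure without assigning it to the instance leaves the dictionary empty, and how a single assignment changes that.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical stand-in for a segmenter that owns a custom dictionary.
 * None of these names are HanLP APIs; a HashMap replaces the DoubleArrayTrie.
 */
public class CustomDictDemo {
    // Dictionary consulted during lookup; starts out empty.
    private Map<String, String> customDictionary = new HashMap<>();

    /** Buggy variant: fills a local map and drops it when the method returns. */
    public void loadBuggy(List<String> words) {
        Map<String, String> dat = new HashMap<>();
        for (String w : words) {
            dat.put(w, "ns"); // "ns" is just a placeholder attribute value
        }
        // Missing step: dat is never stored on the instance.
    }

    /** Fixed variant: same loading, then the instance dictionary is replaced. */
    public void loadFixed(List<String> words) {
        Map<String, String> dat = new HashMap<>();
        for (String w : words) {
            dat.put(w, "ns");
        }
        this.customDictionary = dat; // replace the dictionary used for lookup
    }

    /** Returns true if the instance dictionary holds the given word. */
    public boolean contains(String word) {
        return customDictionary.containsKey(word);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("丁字桥镇");

        CustomDictDemo buggy = new CustomDictDemo();
        buggy.loadBuggy(words);
        System.out.println("buggy loader keeps entry: " + buggy.contains("丁字桥镇")); // false

        CustomDictDemo fixed = new CustomDictDemo();
        fixed.loadFixed(words);
        System.out.println("fixed loader keeps entry: " + fixed.contains("丁字桥镇")); // true
    }
}
```

In the snippet quoted above, `dat` plays the role of the local map: it is passed to `DynamicCustomDictionary.loadMainDictionary` and then goes out of scope, which matches the symptom that '丁字桥镇' is never segmented out.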

  • I've completed this form and searched the web for solutions.
Author

wxy929629 commented Aug 11, 2023

See pull request #1835 for details. If anything in it falls short, please let me know. Thanks.

Owner

hankcs commented Aug 13, 2023

Merged. Thanks for the PR!

@hankcs hankcs closed this as completed Aug 13, 2023