Training a new model through the word2vec API ends with java.lang.ArrayIndexOutOfBoundsException: 200 #821

muzier · 2018-05-11T08:43:36Z

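Checklist

Please confirm the following:

I have read the following documents carefully and found no answer:
I have searched for my problem on Google and in the issue tracker and found no answer.
I understand that the open-source community is a free community formed out of shared interest and bears no responsibility or obligation. I will be polite and thank everyone who helps me.
[x] I have typed x in these brackets to confirm the items above.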

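Version

The latest version is: 1.6.3
The version I am using is: 1.6.3

When I train my own model with Word2VecTrainer, training runs to completion, but an array-index-out-of-bounds error is thrown at the very end. I went through the related out-of-bounds threads in the issue tracker and still found no fix. Loading the new model for text classification throws the same kind of error. Any guidance would be appreciated.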

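Reproducing the problem

Steps

Following the wiki, I wrote a class named Word2VecTrain.
It registers a callback, callbackLP, then builds a trainer, trainerBuilder, on which only that callback and a thread count of 4 are set (this is a single-machine test, and I hoped to speed training up).
The full code:

Triggering code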

// Package declaration and imports are omitted, as in the original post.
public class Word2VecTrain {

    public static void main(String[] args) {

        final long timeStart = System.currentTimeMillis();

        // Progress callback: reports corpus loading, corpus statistics, and training progress.
        TrainingCallback callbackLP = new TrainingCallback() {
            @Override
            public void corpusLoading(float percent) {
                System.out.printf("\rLoading training corpus: %.2f%%", percent);
            }

            @Override
            public void corpusLoaded(int vocWords, int trainWords, int totalWords) {
                System.out.println();
                System.out.printf("Vocabulary size: %d\n", vocWords);
                System.out.printf("Training words: %d\n", trainWords);
                System.out.printf("Corpus words: %d\n", totalWords);
            }

            @Override
            public void training(float alpha, float progress) {
                System.out.printf("\rLearning rate: %.6f  Progress: %.2f%%", alpha, progress);
                long timeNow = System.currentTimeMillis();
                long costTime = timeNow - timeStart + 1;
                // Estimated time remaining = elapsed / fraction done * fraction left.
                progress /= 100;
                String etd = Utility.humanTime((long) (costTime / progress * (1.f - progress)));
                if (etd.length() > 0) System.out.printf("  Time remaining: %s", etd);
                System.out.flush();
            }
        };

        // Build the trainer; only the callback and the thread count are set.
        Word2VecTrainer trainerBuilder = new Word2VecTrainer();

        trainerBuilder.setCallback(callbackLP);
        trainerBuilder.useNumThreads(4);

        // Arguments: segmented training corpus, then the output path for the vectors.
        WordVectorModel wordVectorModel = trainerBuilder.train
                ("D://DESKTOP//20180427//data-for-1.6.2//data//test//koubei_classify_seg.txt",
                "D://DESKTOP//20180427//data-for-1.6.2//data//test//msr_koubei_vectors_default.txt");
    }
}

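Expected output

Loading training corpus: 100.00%
Vocabulary size: 24845
Training words: 2269300
Corpus words: 2368390
Learning rate: 0.000163  Progress: 99.67%  Time remaining: 01 s
Training finished, total time: 1 m 54 s
Saving the model to disk...
Model saved to: msr_koubei_vectors_default.txt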

Actual output

Loading training corpus: 100.00%
Vocabulary size: 176977
Training words: 32511717
Corpus words: 32511717

Learning rate: 0.000005  Progress: 100.00%

Training finished, total time: 1 h 48 m 15 s
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 200
	at com.hankcs.hanlp.mining.word2vec.VectorsReader.readVectorFile(VectorsReader.java:50)
	at com.hankcs.hanlp.mining.word2vec.WordVectorModel.loadVectorMap(WordVectorModel.java:38)
	at com.hankcs.hanlp.mining.word2vec.WordVectorModel.<init>(WordVectorModel.java:32)
	at com.hankcs.hanlp.mining.word2vec.Word2VecTrainer.train(Word2VecTrainer.java:221)
	at com.autohome.Word2VecTrain.main(Word2VecTrain.java:53)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

The actual output is shown in the screenshot below:

http://attachbak.dataguru.cn/attachments/album/201805/11/163829w8bczfibvazzcbfu.png

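Other information

a) My guess: could this be cache-related? On my first training run I set the dimension to 200 and got this error. I then switched back to the defaults (the default dimension should be 100) and still got the out-of-bounds exception with 200, so I wonder whether some computation cache is involved.
b) Loading the new model also throws an out-of-bounds exception. I did not open a separate issue, because I assume the model itself was not trained successfully. The file on disk looked normal, so I tried loading it anyway, and got the same exception.
c) One more question: the wiki says trained models are saved as .bin, but DemoWord2Vec loads a .txt file. Is there a way to convert between the two formats, or to generate either one?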
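On point b), whether the saved file is well-formed can be checked mechanically instead of by eye. The sketch below assumes the common word2vec text layout: a header line `<word count> <dimension>`, then one line per word holding the word followed by its components. That layout is an assumption, not something this thread confirms, and `ModelFileCheck` and `findSuspectLines` are hypothetical names rather than HanLP API.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class ModelFileCheck {

    /**
     * Returns the 1-based numbers of lines that do not hold exactly one word
     * plus `size` components, where `size` is read from the assumed header.
     */
    static List<Integer> findSuspectLines(Iterator<String> lines) {
        List<Integer> suspects = new ArrayList<>();
        if (!lines.hasNext()) {
            return suspects;
        }
        // Assumed header format: "<word count> <dimension>"
        String header = lines.next();
        String[] fields = header.trim().split("\\s+");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Unexpected header: " + header);
        }
        int size = Integer.parseInt(fields[1]);
        int lineNo = 1;
        while (lines.hasNext()) {
            lineNo++;
            // Same trim-and-split tokenization a whitespace-delimited reader would use.
            int tokens = lines.next().trim().split("\\s+").length;
            if (tokens != size + 1) {
                suspects.add(lineNo);
            }
        }
        return suspects;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ModelFileCheck <vector file>");
            return;
        }
        // InputStreamReader replaces undecodable bytes with U+FFFD instead of failing.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[0]), StandardCharsets.UTF_8))) {
            for (int lineNo : findSuspectLines(reader.lines().iterator())) {
                System.out.println("Suspect line: " + lineNo);
            }
        }
    }
}
```

Any reported line is a candidate for the kind of entry that makes loading fail; a blank line is reported too, since it holds no word at all.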


hankcs · 2018-05-12T03:38:53Z

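I think I know the cause. The model file uses \s as its separator, so a word that contains a space breaks parsing.
Thanks for the report. This is fixed; see the commit referenced in this issue.
If problems remain, feel free to reopen the issue.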
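To make the failure mode concrete, here is a minimal sketch of a whitespace-delimited reader. It is not HanLP's code and not the actual fix: `VectorLineDemo`, `parseNaive`, and `parseTolerant` are hypothetical names, and the tolerant variant assumes components are separated by single spaces. The naive version expects one word plus `size` components per line. When the word is itself whitespace-like (here a control character, which `String.trim()` strips because its code point is below U+0020), one token goes missing and the last component read runs off the end of the array.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.Map;

final class VectorLineDemo {

    /** Naive reader: token 0 is taken as the word, tokens 1..size as the vector. */
    static float[] parseNaive(String line, int size) {
        String[] tokens = line.trim().split("\\s+");
        float[] vector = new float[size];
        for (int j = 0; j < size; j++) {
            // Throws ArrayIndexOutOfBoundsException if the word token was lost.
            vector[j] = Float.parseFloat(tokens[j + 1]);
        }
        return vector;
    }

    /** Tolerant reader: the last `size` tokens are the vector, whatever precedes them is the word. */
    static Map.Entry<String, float[]> parseTolerant(String line, int size) {
        String[] tokens = line.split(" ");
        if (tokens.length < size) {
            throw new IllegalArgumentException(
                    "Expected " + size + " components, found " + tokens.length + " tokens");
        }
        int first = tokens.length - size;
        float[] vector = new float[size];
        for (int j = 0; j < size; j++) {
            vector[j] = Float.parseFloat(tokens[first + j]);
        }
        // Rejoin the leading tokens so a word containing spaces survives intact.
        String word = String.join(" ", Arrays.copyOfRange(tokens, 0, first));
        return new AbstractMap.SimpleEntry<>(word, vector);
    }

    public static void main(String[] args) {
        String line = "\u0001 0.1 0.2 0.3"; // the "word" is a control character
        try {
            parseNaive(line, 3);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("naive parse failed: " + e.getMessage());
        }
        Map.Entry<String, float[]> entry = parseTolerant(line, 3);
        System.out.println("components recovered: " + entry.getValue().length);
    }
}
```

With `size` set to 200, the naive loop would fail on index 200, the number shown in the exception reported above. Reading the components from the right-hand end is only one possible design; it trades strict validation for the ability to keep unusual words.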

hankcs · 2018-05-12T03:39:30Z

Also, this module has no .bin conversion.

muzier · 2018-05-12T08:33:04Z

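Thanks for the reply. I also investigated and found that the generated model file contains a non-printable character, rendered as �. This is how it looks in notepad: http://attachbak.dataguru.cn/attachments/album/201805/12/162942b7u1kg1mpuuj13cp.png
In effect, the generated word is itself a non-printable character, which is treated as \s in readvector, hence the out-of-bounds access. For reference.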
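One way to keep such entries out of the vocabulary is to clean the corpus before training. The sketch below is a workaround under stated assumptions, not part of HanLP: it assumes the segmented corpus holds whitespace-separated tokens, one sentence per line, and `CorpusCleaner`, `isVisible`, and `cleanLine` are hypothetical names. It drops every token that contains no visible character.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

final class CorpusCleaner {

    /** True if the token contains at least one character that is actually visible. */
    static boolean isVisible(String token) {
        return token.codePoints().anyMatch(cp ->
                !Character.isWhitespace(cp)
                        && !Character.isSpaceChar(cp)                  // e.g. U+00A0 no-break space
                        && !Character.isISOControl(cp)                 // e.g. U+0001
                        && Character.getType(cp) != Character.FORMAT   // e.g. U+200B zero-width space
                        && cp != 0xFFFD);                              // marks bytes that failed to decode
    }

    /** Removes invisible tokens from one segmented line and rejoins the rest with single spaces. */
    static String cleanLine(String line) {
        return Arrays.stream(line.split("\\s+"))
                .filter(CorpusCleaner::isVisible)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String dirty = "good \u0001 car \u00A0 \u200B ok";
        System.out.println(cleanLine(dirty)); // prints "good car ok"
    }
}
```

Applying `cleanLine` to each line of the corpus file and training on the result should leave only tokens that survive a whitespace split; whether that removes every problematic entry in a given corpus is not something this thread establishes.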

hankcs added a commit that referenced this issue May 12, 2018

Make word-vector reading more robust: #821

b83b2d3

hankcs closed this as completed May 12, 2018

hankcs added the bug label May 12, 2018

hankcs added a commit that referenced this issue Jan 10, 2020

Make word-vector reading more robust: #821

be5f872
