Training for a large dataset, Java heap space memory out #20

pzhanggithub · 2015-05-20T17:13:23Z

I trained on Text8 from Eclipse and it worked fine. But when I loaded a 2.1 GB text file, an OutOfMemoryError occurred, even after I increased the Eclipse memory to the maximum. Any ideas about this?

Thanks.

Pengchu

guerda · 2015-05-21T09:23:44Z

Hi @pzhanggithub !
Could you tell us which version of Word2VecJava you used? Did you use the stable 0.9.0 or a development version from this repository?
Additionally, how did you "load" the text file? Can you provide some source code? The training call would be interesting, too.

pzhanggithub · 2015-05-21T13:08:20Z

Thanks for this quick response.

The version I used is the one I downloaded from the port a couple of days ago; it is the only version. I imported it into Eclipse with Maven on my Mac. It passed all the tests you provided, and the reading, training, and interacting tasks all completed successfully.

For testing, I used the Word2VecExamples class without any change except the input file name:

    File f = new File("/Users/pzhang/Documents/SandReportData/sandreportall/sand25years.txt");
    if (!f.exists())
        throw new IllegalStateException("Please download and unzip the text8 example from http://mattmahoney.net/dc/text8.zip");
    List<String> read = Common.readToList(f);
    List<List<String>> partitioned = Lists.transform(read, new Function<String, List<String>>() {
        @Override
        public List<String> apply(String input) {
            return Arrays.asList(input.split(" "));
        }
    });

    Word2VecModel model = Word2VecModel.trainer()
            .setMinVocabFrequency(5)
            .useNumThreads(20)
            .setWindowSize(8)
            .type(NeuralNetworkType.CBOW)
            .setLayerSize(200)
            .useNegativeSamples(25)
            .setDownSamplingRate(1e-4)
            .setNumIterations(5)
            .setListener(new TrainingProgressListener() {
                @Override public void update(Stage stage, double progress) {
                    System.out.println(String.format("%s is %.2f%% complete", Format.formatEnum(stage), progress * 100));
                }
            })
            .train(partitioned);

The error occurs at:

    List<String> read = Common.readToList(f);

Thanks.

Pengchu

pzhanggithub · 2015-05-21T13:13:51Z

The file was extracted from report repository and converted into a text file. The same file was used with word2vec.c (by Google) successfully.

dirkgr · 2015-07-15T17:37:46Z

Can you try running this with the changes in #23?

Hronom · 2015-07-29T19:08:23Z

I tested with the fix from #23 and it looks like it works for me. Thank you!

dirkgr mentioned this issue Jul 15, 2015

Makes sure we don't pull the whole corpus into memory when training #23

Open
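
For readers who hit the same error: the failing call reads the entire file into a `List` before training starts, so a 2.1 GB corpus has to fit in the heap all at once. The sketch below illustrates the general idea named in the title of #23, which is to hand out one tokenized line at a time instead of materializing the whole corpus. It is not the code from #23 and not part of the Word2VecJava API. The class name `LazyCorpus` is hypothetical, it uses only the JDK, and it assumes the corpus is UTF-8 and contains line breaks (a file that is one enormous line would still be read in one piece). Whether the trainer accepts such an `Iterable` depends on the library version and is not confirmed here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical helper: exposes a text file as a lazy sequence of token lists.
// Only the line currently being processed is held in memory.
final class LazyCorpus implements Iterable<List<String>> {
    private final Path path;

    LazyCorpus(Path path) {
        this.path = path;
    }

    // Opens a fresh reader so the corpus can be iterated more than once.
    private BufferedReader open() {
        try {
            return Files.newBufferedReader(path, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public Iterator<List<String>> iterator() {
        final BufferedReader reader = open();
        return new Iterator<List<String>>() {
            // Look-ahead buffer: the next unread line, or null at end of file.
            private String nextLine = advance();

            private String advance() {
                try {
                    String line = reader.readLine();
                    if (line == null) {
                        reader.close(); // release the file handle at end of file
                    }
                    return line;
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            @Override
            public boolean hasNext() {
                return nextLine != null;
            }

            @Override
            public List<String> next() {
                if (nextLine == null) {
                    throw new NoSuchElementException();
                }
                // Same tokenization as the example above: split on single spaces.
                List<String> tokens = Arrays.asList(nextLine.split(" "));
                nextLine = advance();
                return tokens;
            }
        };
    }
}
```

A caller would loop with `for (List<String> sentence : new LazyCorpus(path)) { ... }`. Note that the reader is only closed once iteration reaches the end of the file, so a loop that exits early leaves the file handle open until it is garbage collected.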
