Training for a large dataset, Java heap space memory out #20

Open
pzhanggithub opened this issue May 20, 2015 · 5 comments

@pzhanggithub

I trained on Text8 from Eclipse and it worked fine. But when I loaded a 2.1 GB text file, an OutOfMemoryError (Java heap space) occurred, even after I increased the Eclipse memory to the maximum. Any ideas about this?

Thanks.

Pengchu
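
A note on the memory setting: Eclipse's own memory options (in eclipse.ini) apply to the IDE process, while a program launched from Eclipse runs in a separate JVM whose heap is set with -Xmx in the run configuration's VM arguments. The post does not say which of the two was raised. A minimal sketch for checking what the training JVM actually gets (the class name HeapCheck is made up for illustration):

```java
// Hypothetical helper, not part of Word2VecJava: reports the heap limit
// of the JVM it runs in, so an -Xmx change can be verified.
class HeapCheck {
    /** Returns the maximum heap this JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMiB() + " MiB");
    }
}
```

Running this from the same run configuration as the training code shows whether the larger heap took effect. Even then, holding a 2.1 GB text file as Java strings can need well over 2.1 GB of heap, since strings carry per-object overhead.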

@guerda
Contributor

guerda commented May 21, 2015

Hi @pzhanggithub !
Could you tell us which version of Word2VecJava you used? Is it the stable 0.9.0 or a development version from this repository?
Additionally, how did you "load" the text file? Can you provide some source code? The training call would be interesting, too.

@pzhanggithub
Author

Thanks for this quick response.

The version I used is the one I downloaded from the port a couple of days ago; it is the only version there. I imported it into Eclipse on my Mac as a Maven project. It passed all the tests you provided, and the reading, training and interacting tasks all completed successfully.

For the test, I used the word2vecExamples class without any change except the input file name:

    File f = new File("/Users/pzhang/Documents/SandReportData/sandreportall/sand25years.txt");
    if (!f.exists())
        throw new IllegalStateException("Please download and unzip the text8 example from http://mattmahoney.net/dc/text8.zip");
    // Reads the whole file into memory as a list of lines
    List<String> read = Common.readToList(f);
    List<List<String>> partitioned = Lists.transform(read, new Function<String, List<String>>() {
        @Override
        public List<String> apply(String input) {
            return Arrays.asList(input.split(" "));
        }
    });

    Word2VecModel model = Word2VecModel.trainer()
            .setMinVocabFrequency(5)
            .useNumThreads(20)
            .setWindowSize(8)
            .type(NeuralNetworkType.CBOW)
            .setLayerSize(200)
            .useNegativeSamples(25)
            .setDownSamplingRate(1e-4)
            .setNumIterations(5)
            .setListener(new TrainingProgressListener() {
                @Override public void update(Stage stage, double progress) {
                    System.out.println(String.format("%s is %.2f%% complete", Format.formatEnum(stage), progress * 100));
                }
            })
            .train(partitioned);

The error occurs at:

    List<String> read = Common.readToList(f);

Thanks.

Pengchu

@pzhanggithub
Author

The file was extracted from a report repository and converted into a text file. The same file trained successfully with word2vec.c (by Google).

@dirkgr
Contributor

dirkgr commented Jul 15, 2015

Can you try running this with the changes in #23?

@Hronom
Contributor

Hronom commented Jul 29, 2015

Tested with the fix from #23 and it looks like it works for me. Thank you!
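
For readers who hit the same error at Common.readToList(f): that call holds every line of the file in memory at once, while Guava's Lists.transform only creates a lazy view on top of it. Independent of whatever #23 changed (this thread does not say), one possible way around it is to tokenize lines lazily while iterating. The sketch below uses only the JDK. LazySentences is a hypothetical helper, not part of Word2VecJava, and whether the trainer's train(...) accepts an arbitrary Iterable<List<String>> instead of a List is an assumption to check against the version in use.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical helper: yields one whitespace-split line at a time. */
class LazySentences implements Iterable<List<String>> {
    private final Path path;

    LazySentences(Path path) {
        this.path = path;
    }

    @Override
    public Iterator<List<String>> iterator() {
        // Each call re-opens the file, so multiple passes are possible.
        final BufferedReader reader;
        try {
            reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new Iterator<List<String>>() {
            // The next line to hand out, or null once the file is exhausted.
            private String pending = readLine();

            private String readLine() {
                try {
                    String line = reader.readLine();
                    if (line == null) {
                        reader.close(); // end of file: release the handle
                    }
                    return line;
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            @Override
            public boolean hasNext() {
                return pending != null;
            }

            @Override
            public List<String> next() {
                if (pending == null) {
                    throw new NoSuchElementException();
                }
                // Same tokenization as the example code: split on single spaces.
                List<String> tokens = Arrays.asList(pending.split(" "));
                pending = readLine();
                return tokens;
            }
        };
    }
}
```

Usage would be along the lines of `new LazySentences(f.toPath())` in place of `partitioned`. This only helps if the file contains line breaks: a corpus stored as one huge line (as text8 is) would still be read as a single string and would need to be chunked some other way.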
