Accuracy rate seems to be 20% lower than the original C version #40

Open
hankcs opened this issue Jul 21, 2016 · 0 comments
hankcs commented Jul 21, 2016

Hello, dear Medallia staff.
Thank you for your nice Java code. It is beautiful and neat, but it does not seem to be accurate.

I computed the accuracy rate, and it is about 20 percentage points lower than that of the original C version.
I trained both on text8 with the same parameters (apart from the thread count: 20 for Java, 8 for C), which are:

Java

File f = new File("text8");
if (!f.exists())
    throw new IllegalStateException("Please download and unzip the text8 example from http://mattmahoney.net/dc/text8.zip");
List<String> read = Common.readToList(f);
List<List<String>> partitioned = Lists.transform(read, new Function<String, List<String>>() {
    @Override
    public List<String> apply(String input) {
        return Arrays.asList(input.split(" "));
    }
});

Word2VecModel model = Word2VecModel.trainer()
        .setMinVocabFrequency(5)
        .useNumThreads(20)
        .setWindowSize(8)
        .type(NeuralNetworkType.CBOW)
        .setLayerSize(200)
        .useNegativeSamples(25)
        .setDownSamplingRate(1e-4)
        .setNumIterations(15)
        .setListener(new TrainingProgressListener() {
            @Override public void update(Stage stage, double progress) {
                System.out.println(String.format("%s is %.2f%% complete", Format.formatEnum(stage), progress * 100));
            }
        })
        .train(partitioned);

try (final OutputStream os = Files.newOutputStream(Paths.get("vectors.bin"))) {
    model.toBinFile(os);
}

C

./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 8 -binary 1 -iter 15

I used the same evaluation program and test file for both models:

./compute-accuracy vectors.bin 30000 < questions-words.txt
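
For readers unfamiliar with the test, here is a minimal sketch of the analogy check as I understand compute-accuracy to perform it: each question line is four words `a b c d`, and the answer counts as correct when the word closest (by cosine) to `b - a + c` is `d`, with the three question words excluded from the candidates. The class name, method names, and the tiny 3-dimensional toy vectors are all hypothetical and are not taken from either implementation; the real tool reads the vectors from vectors.bin and, with the argument 30000, only considers the 30000 most frequent words.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AnalogySketch {
    // Scale a vector to unit length so that dot products equal cosine similarity.
    static double[] normalize(double[] v) {
        double norm = 0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
        return out;
    }

    static double dot(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += x[i] * y[i];
        return s;
    }

    // Returns the word whose unit vector is closest to b - a + c,
    // skipping the three question words themselves.
    static String answer(Map<String, double[]> vectors, String a, String b, String c) {
        double[] va = normalize(vectors.get(a));
        double[] vb = normalize(vectors.get(b));
        double[] vc = normalize(vectors.get(c));
        double[] target = new double[va.length];
        for (int i = 0; i < target.length; i++) target[i] = vb[i] - va[i] + vc[i];

        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : vectors.entrySet()) {
            String w = e.getKey();
            if (w.equals(a) || w.equals(b) || w.equals(c)) continue;
            double score = dot(target, normalize(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = w;
            }
        }
        return best;
    }

    // Hypothetical toy embedding: dimensions are (is-capital, country 1, country 2).
    static Map<String, double[]> toyVectors() {
        Map<String, double[]> m = new LinkedHashMap<>();
        m.put("athens", new double[] {1, 1, 0});
        m.put("greece", new double[] {0, 1, 0});
        m.put("oslo",   new double[] {1, 0, 1});
        m.put("norway", new double[] {0, 0, 1});
        m.put("paris",  new double[] {1, 0, 0});
        return m;
    }

    public static void main(String[] args) {
        // Question "athens greece oslo norway": the expected answer is "norway".
        System.out.println(answer(toyVectors(), "athens", "greece", "oslo"));
    }
}
```

Because both vectors.bin files are scored by the same procedure, a gap in the results should come from the vectors themselves rather than from the evaluation.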

Your Java implementation:

capital-common-countries:
ACCURACY TOP1: 58.30 %  (295 / 506)
Total accuracy: 58.30 %   Semantic accuracy: 58.30 %   Syntactic accuracy: nan % 
capital-world:
ACCURACY TOP1: 36.78 %  (534 / 1452)
Total accuracy: 42.34 %   Semantic accuracy: 42.34 %   Syntactic accuracy: nan % 
currency:
ACCURACY TOP1: 12.69 %  (34 / 268)
Total accuracy: 38.77 %   Semantic accuracy: 38.77 %   Syntactic accuracy: nan % 
city-in-state:
ACCURACY TOP1: 25.21 %  (396 / 1571)
Total accuracy: 33.16 %   Semantic accuracy: 33.16 %   Syntactic accuracy: nan % 
family:
ACCURACY TOP1: 55.23 %  (169 / 306)
Total accuracy: 34.80 %   Semantic accuracy: 34.80 %   Syntactic accuracy: nan % 
gram1-adjective-to-adverb:
ACCURACY TOP1: 8.07 %  (61 / 756)
Total accuracy: 30.64 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 8.07 % 
gram2-opposite:
ACCURACY TOP1: 9.48 %  (29 / 306)
Total accuracy: 29.39 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 8.47 % 
gram3-comparative:
ACCURACY TOP1: 38.25 %  (482 / 1260)
Total accuracy: 31.13 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 24.63 % 
gram4-superlative:
ACCURACY TOP1: 23.91 %  (121 / 506)
Total accuracy: 30.60 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 24.50 % 
gram5-present-participle:
ACCURACY TOP1: 22.08 %  (219 / 992)
Total accuracy: 29.53 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 23.87 % 
gram6-nationality-adjective:
ACCURACY TOP1: 63.17 %  (866 / 1371)
Total accuracy: 34.50 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 34.25 % 
gram7-past-tense:
ACCURACY TOP1: 26.35 %  (351 / 1332)
Total accuracy: 33.47 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 32.64 % 
gram8-plural:
ACCURACY TOP1: 44.25 %  (439 / 992)
Total accuracy: 34.39 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 34.17 % 
gram9-plural-verbs:
ACCURACY TOP1: 18.15 %  (118 / 650)
Total accuracy: 33.53 %   Semantic accuracy: 34.80 %   Syntactic accuracy: 32.90 % 
Questions seen / total: 12268 19544   62.77 % 

Original C implementation:

capital-common-countries:
ACCURACY TOP1: 82.81 %  (419 / 506)
Total accuracy: 82.81 %   Semantic accuracy: 82.81 %   Syntactic accuracy: nan % 
capital-world:
ACCURACY TOP1: 62.26 %  (904 / 1452)
Total accuracy: 67.57 %   Semantic accuracy: 67.57 %   Syntactic accuracy: nan % 
currency:
ACCURACY TOP1: 23.13 %  (62 / 268)
Total accuracy: 62.22 %   Semantic accuracy: 62.22 %   Syntactic accuracy: nan % 
city-in-state:
ACCURACY TOP1: 44.68 %  (702 / 1571)
Total accuracy: 54.96 %   Semantic accuracy: 54.96 %   Syntactic accuracy: nan % 
family:
ACCURACY TOP1: 75.82 %  (232 / 306)
Total accuracy: 56.52 %   Semantic accuracy: 56.52 %   Syntactic accuracy: nan % 
gram1-adjective-to-adverb:
ACCURACY TOP1: 17.20 %  (130 / 756)
Total accuracy: 50.40 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 17.20 % 
gram2-opposite:
ACCURACY TOP1: 21.90 %  (67 / 306)
Total accuracy: 48.71 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 18.55 % 
gram3-comparative:
ACCURACY TOP1: 64.60 %  (814 / 1260)
Total accuracy: 51.83 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 43.54 % 
gram4-superlative:
ACCURACY TOP1: 39.72 %  (201 / 506)
Total accuracy: 50.95 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 42.86 % 
gram5-present-participle:
ACCURACY TOP1: 39.52 %  (392 / 992)
Total accuracy: 49.51 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 41.99 % 
gram6-nationality-adjective:
ACCURACY TOP1: 87.24 %  (1196 / 1371)
Total accuracy: 55.08 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 53.94 % 
gram7-past-tense:
ACCURACY TOP1: 38.21 %  (509 / 1332)
Total accuracy: 52.96 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 50.73 % 
gram8-plural:
ACCURACY TOP1: 67.54 %  (670 / 992)
Total accuracy: 54.21 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 52.95 % 
gram9-plural-verbs:
ACCURACY TOP1: 37.38 %  (243 / 650)
Total accuracy: 53.32 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 51.71 % 
Questions seen / total: 12268 19544   62.77 %
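
As a sanity check on the summary figures, the overall accuracy of each run can be recomputed from the per-category counts in the two reports above. The sketch below is only arithmetic over those printed numbers; the class and method names are hypothetical.

```java
import java.util.Locale;

public class AccuracyTotals {
    // Correct answers per category, copied from the reports above in report order.
    static final int[] JAVA_CORRECT = {295, 534, 34, 396, 169, 61, 29, 482, 121, 219, 866, 351, 439, 118};
    static final int[] C_CORRECT    = {419, 904, 62, 702, 232, 130, 67, 814, 201, 392, 1196, 509, 670, 243};
    // Questions seen per category (identical in both runs).
    static final int[] SEEN         = {506, 1452, 268, 1571, 306, 756, 306, 1260, 506, 992, 1371, 1332, 992, 650};

    static int sum(int[] values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    static double percent(int correct, int total) {
        return 100.0 * correct / total;
    }

    public static void main(String[] args) {
        int seen = sum(SEEN);
        double javaAcc = percent(sum(JAVA_CORRECT), seen);
        double cAcc = percent(sum(C_CORRECT), seen);
        System.out.println(String.format(Locale.ROOT,
                "Java: %.2f %%  C: %.2f %%  gap: %.2f points", javaAcc, cAcc, cAcc - javaAcc));
    }
}
```

The totals match the last "Total accuracy" line of each report (33.53 % for Java and 53.32 % for C over 12268 questions), which is a gap of just under 20 percentage points.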

Can you give me any suggestions or ideas about this? I am ready to help you if needed.

Thank you.
