This repository has been archived by the owner on Mar 29, 2022. It is now read-only.

high-moctane/nextword-data


Nextword-data

🎉 NEW PREDICTION ENGINE MOCWORD IS AVAILABLE 🎉

Mocword is a more advanced engine than Nextword.

  • Smaller data files
    • 1.63GB (Nextword) -> 655MB (Mocword)
  • Uses the latest Google Ngram dataset
    • 2012 data (Nextword) -> 2020 data (Mocword)
  • More appropriate predictions
  • Less noisy vocabulary

A dataset for nextword.

Install

  1. (Recommended) Star this repository (`・ω・´)★

  2. Visit the releases page.

  3. Download the zip or tar.gz archive.

    You can choose either the large or the small dataset.

             Zip size    Total size
    Small    152.2 MB    493.1 MB
    Large    483.3 MB    1.63 GB
  4. Decompress the downloaded data.

  5. Set the $NEXTWORD_DATA_PATH environment variable.

    Example:

    export NEXTWORD_DATA_PATH=/path/to/nextword-data
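
A program that consumes the dataset might verify the variable before reading any files. The following is only a minimal sketch of such a check; `resolve_data_dir` is a hypothetical helper, not part of nextword itself.

```python
import os
from pathlib import Path


def resolve_data_dir(env=os.environ):
    """Return the directory named by NEXTWORD_DATA_PATH, or raise if unusable.

    Hypothetical helper for illustration; nextword's own handling may differ.
    """
    raw = env.get("NEXTWORD_DATA_PATH")
    if not raw:
        raise RuntimeError("NEXTWORD_DATA_PATH is not set")
    path = Path(raw).expanduser()
    if not path.is_dir():
        raise RuntimeError(f"NEXTWORD_DATA_PATH is not a directory: {path}")
    return path
```

Passing the environment as an argument keeps the check easy to exercise with a plain dictionary instead of the real process environment.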

Uninstall

  1. Remove the $NEXTWORD_DATA_PATH environment variable.

  2. Remove the nextword-data directory.

Format

Each line consists of an (n-1)-gram, a tab, space-separated candidates, and a newline.

Candidates are sorted by likelihood, most likely first.

Example

You can find the line

empty milk	bottles carton bottle cartons cans

at line 59349 in file 3gram-e.txt.

This line says that "bottles" is the most likely word after "empty milk", and "carton" is the next most likely.
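
The format can be read with a few lines of Python. This is a sketch under the assumption that candidates are separated by single spaces, as in the example line; `parse_line` and `next_words` are hypothetical names, not functions provided by nextword.

```python
def parse_line(line):
    """Split one data line into (context words, candidate words)."""
    # The tab separates the (n-1)-gram from its candidates.
    context, _, candidates = line.rstrip("\n").partition("\t")
    return tuple(context.split()), candidates.split()


def next_words(lines, context, limit=3):
    """Return up to `limit` candidates for `context` from an iterable of lines."""
    wanted = tuple(context.split())
    for line in lines:
        words, candidates = parse_line(line)
        if words == wanted:
            # Candidates keep the order in which the file lists them.
            return candidates[:limit]
    return []


sample = ["empty milk\tbottles carton bottle cartons cans\n"]
print(next_words(sample, "empty milk", limit=2))  # ['bottles', 'carton']
```

The linear scan is only for illustration; a real consumer of files this large would likely use a faster lookup.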

Recipe

  1. Fetch data.

    $ mkdir fetch
    $ nwgen-fetch fetch
    
  2. Run the following xonsh script.

    dstdir = "dstdir"
    mkdir -p @(dstdir)/format
    mkdir -p @(dstdir)/concat
    
    ls fetch | grep 1gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 10000 @(dstdir)/format/fname fetch/fname
    
    ls fetch | grep 2gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 2000 @(dstdir)/format/fname fetch/fname
    
    ls fetch | grep 3gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 400 @(dstdir)/format/fname fetch/fname
    
    ls fetch | grep 4gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 300 @(dstdir)/format/fname fetch/fname
    
    ls fetch | grep 5gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 200 @(dstdir)/format/fname fetch/fname
    
    nwgen-concat @(dstdir)/concat/1gram.txt.gz @(dstdir)/format/1gram*
    
    for n in [2,3,4,5]:
        for c in [chr(i) for i in range(97, 97+26)]:
            nwgen-concat @(dstdir)/concat/@(n)gram-@(c).txt.gz @(dstdir)/format/@(n)gram-@(c)*
    
    cp -R @(dstdir)/concat @(dstdir)/data
    
    gunzip @(dstdir)/data/*
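
As a sanity check on the result, the file names that the script's loops are written to produce can be enumerated in Python. This sketch assumes that every letter from a to z yields an output file and that gunzip leaves `.txt` names in the data directory; `expected_data_files` is a hypothetical helper.

```python
import string


def expected_data_files():
    """List the file names the recipe's loops are written to produce."""
    names = ["1gram.txt"]
    for n in range(2, 6):  # 2-grams through 5-grams
        for c in string.ascii_lowercase:  # one file per first letter, a to z
            names.append(f"{n}gram-{c}.txt")
    return names


files = expected_data_files()
print(len(files))  # 105
```

Comparing this list with the actual contents of the data directory would show whether any concat step failed to write its file.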

Notice

Nextword-data is based on Google Books Ngram Viewer English Version 20120701, which is distributed under a Creative Commons Attribution 3.0 Unported license. See NOTICE.txt.

License

Nextword-data is distributed under a Creative Commons Attribution 4.0 International license. See LICENSE.txt.