Nextword-data

🎉 NEW PREDICTION ENGINE MOCWORD IS AVAILABLE 🎉

Mocword is a more advanced engine than Nextword.

- Smaller data files
  - 1.63GB (Nextword) -> 655MB (Mocword)
- Uses the latest Google Ngram dataset
  - 2012 data (Nextword) -> 2020 data (Mocword)
- More appropriate predictions
- Less noisy vocabulary

A dataset for nextword.

Install

1. (Recommended) Star this repository (｀･ω･´)★
2. Visit the releases page.
3. Download the zip or tar.gz.

   You can choose the larger or the smaller one.

   |       | Zip size | Total size |
   | ----- | -------- | ---------- |
   | Small | 152.2 MB | 493.1 MB   |
   | Large | 483.3 MB | 1.63 GB    |

4. Decompress the downloaded data.
5. Set the $NEXTWORD_DATA_PATH environment variable.

Example:
```
export NEXTWORD_DATA_PATH=/path/to/nextword-data
```
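
Before starting Nextword, it can help to confirm that the variable is set and points at the decompressed directory. The following is a minimal Python sketch of such a check. The helper name `resolve_data_path` is hypothetical and is not part of Nextword.

```python
import os
from pathlib import Path


def resolve_data_path(environ=os.environ):
    """Return the directory named by NEXTWORD_DATA_PATH, or raise if unusable.

    Hypothetical helper for illustration; not part of Nextword itself.
    """
    value = environ.get("NEXTWORD_DATA_PATH")
    if not value:
        raise RuntimeError("NEXTWORD_DATA_PATH is not set")
    path = Path(value).expanduser()
    if not path.is_dir():
        raise RuntimeError(f"{path} is not a directory")
    return path
```

Calling `resolve_data_path()` with no arguments reads the current process environment. Passing a dictionary instead makes the check easy to exercise without touching real settings.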

Uninstall

1. Remove the $NEXTWORD_DATA_PATH environment variable.
2. Remove the nextword-data directory.

Format

(n-1)gram tab candidates newline

Candidates are sorted by how often they appear, most frequent first.
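
As a rough illustration of this layout, one line can be split into its context words and candidate list as sketched below in Python. The function name `parse_line` is hypothetical, and the sketch assumes candidates are separated by spaces, as in the sample line in the Example section.

```python
def parse_line(line):
    """Split one data line into (context words, candidates in file order).

    Assumes the layout "(n-1)gram<TAB>candidates<NEWLINE>" with
    space-separated candidates. Hypothetical helper, not part of Nextword.
    """
    context, _, candidates = line.rstrip("\n").partition("\t")
    return context.split(" "), candidates.split()


context, candidates = parse_line("empty milk\tbottles carton bottle cartons cans\n")
# context holds the two words before the tab; candidates keeps the file's order.
```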

Example

You can find the line

empty milk	bottles carton bottle cartons cans

at line 59349 in file 3gram-e.txt.

This line says that "bottles" is the most likely word after "empty milk", and "carton" is the next most likely.
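
A lookup along these lines might be sketched in Python as follows. The file-naming rule is an assumption inferred from this example and from the Recipe section: an (n-1)-word context is taken to live in `<n>gram-<letter>.txt`, where `<letter>` is the first letter of the first context word. The names `data_file_for` and `lookup` are hypothetical, and the linear scan is chosen for simplicity, not because Nextword works that way.

```python
from pathlib import Path


def data_file_for(context):
    """Guess which data file holds the given context words.

    Assumption: "<n>gram-<letter>.txt", with n = len(context) + 1 and
    <letter> the lowercased first letter of the first context word.
    """
    if not 1 <= len(context) <= 4:
        raise ValueError("context must have 1 to 4 words")
    return f"{len(context) + 1}gram-{context[0][0].lower()}.txt"


def lookup(data_dir, context):
    """Return the candidates for the context in file order, or [] if none."""
    path = Path(data_dir) / data_file_for(context)
    if not path.is_file():
        return []
    prefix = " ".join(context) + "\t"
    with open(path, encoding="utf-8") as f:
        for line in f:  # simple linear scan over the file
            if line.startswith(prefix):
                return line[len(prefix):].split()
    return []
```

With the data installed, `lookup(os.environ["NEXTWORD_DATA_PATH"], ["empty", "milk"])` would be expected to return the candidates from the line shown above, provided the naming assumption holds.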

Recipe

Fetch data.
```
$ mkdir fetch
$ nwgen-fetch fetch
```

Run the following xonsh script.

```
dstdir = "dstdir"
mkdir -p @(dstdir)/format
mkdir -p @(dstdir)/concat

# Format the fetched files, one pass per n-gram order, each with its own -min-count.
ls fetch | grep 1gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 10000 @(dstdir)/format/fname fetch/fname
ls fetch | grep 2gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 2000 @(dstdir)/format/fname fetch/fname
ls fetch | grep 3gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 400 @(dstdir)/format/fname fetch/fname
ls fetch | grep 4gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 300 @(dstdir)/format/fname fetch/fname
ls fetch | grep 5gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 200 @(dstdir)/format/fname fetch/fname

# Concatenate the formatted files: one file for 1-grams, one file per letter for 2- to 5-grams.
nwgen-concat @(dstdir)/concat/1gram.txt.gz @(dstdir)/format/1gram*
for n in [2,3,4,5]:
    for c in [chr(i) for i in range(97, 97+26)]:
        nwgen-concat @(dstdir)/concat/@(n)gram-@(c).txt.gz @(dstdir)/format/@(n)gram-@(c)*

# Copy the concatenated files and decompress the copy.
cp -R @(dstdir)/concat @(dstdir)/data
gunzip @(dstdir)/data/*
```
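
To check the result, the file names that the script above should leave in `dstdir/data` can be enumerated with a short Python sketch. It assumes that every `nwgen-concat` call writes exactly the output file it is given and that `gunzip` strips the `.gz` suffix. The function name `expected_data_files` is hypothetical.

```python
import string


def expected_data_files(orders=(2, 3, 4, 5)):
    """List the file names expected in dstdir/data after the recipe runs."""
    names = ["1gram.txt"]  # 1-grams are concatenated into a single file
    for n in orders:
        for letter in string.ascii_lowercase:  # one file per letter a..z
            names.append(f"{n}gram-{letter}.txt")
    return names


files = expected_data_files()
print(len(files), files[0], files[-1])
```

Comparing this list against the actual directory contents is a quick way to spot a letter or n-gram order that failed to build.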

Notice

Nextword-data is based on Google Books Ngram Viewer English Version 20120701, which is distributed under the Creative Commons Attribution 3.0 Unported license. See NOTICE.txt.

License

Nextword-data is distributed under the Creative Commons Attribution 4.0 International license. See LICENSE.txt.
