GitHub - PedroBarcha/context-spelling-correction: Given a text, wrap it into phrases and send them to Yandex's search engine. If it yields a "did you mean:", substitute the original phrase for the suggestion. The software was originally developed for correcting OCR output.

THE PROJECT

Currently, the program wraps single or several text files into 8 words blocks, query them to yandex and if it suggests a spell correction (like google's "did you mean"), then the original phrase is subtituted by the given suggestion. Otherwise, the phrase reamins the same in the file.

DEPENDECIES

All you need is python 2 and its standard libraries.

SET UP

Get an API Key at https://tech.yandex.com/xml/ ;
In order to grant access to 10.000 quries/day, confirm your cel number at https://passport.yandex.com/profile ;
Run git clone https://github.com/PedroBarcha/Context-Spelling-Correction.git to clone the repo;
From the terminal, navigate to the repo directory and run python config.py , in order to set you API key and username.

USAGE

From the terminal, run inside the repo's directory: python correction.py PATH_TO_THE_FILE . If you wish to correct several files, stored in the same directory, use instead: python correction.py PATH_TO_THE_DIR/* .

IMPORTANT

Before running the program, always make sure that you have your correct IP set at https://xml.yandex.com/settings/

OUTPUT

For every file specified in the input, a file with the same name and extension ".corrected" will be generated.
A single file named "yandex_suggestions.txt" is produced, containing all of Yandex's suggestions for you file(s) and also runtime statistics at the end of it.

NOTES

According to the API's doc., you need to convert special chars into escape sequences. However, the result of Yandex's spell check field ( in XML) is ASCII (bug?), not UTF8. So, we don't make the conversion of the queries before sending them simply because it doesn't matter for us. What this means is: any UTF8 char in your text will be rounded to ASCII if a suggestion is made by Yandex. (e.g: “ and ” turn to ". Also, — will become -). Unfortunately nothing can be done about it, since it is an API issue.
Book pages often contain hyphen (-) linking parts of a single word, when it reaches the end of a line. In cases like this, Yandex usually suggests the whole word binded together(without the hyphen) as a spell correction. Eg: "unfor- tunate" becomes "unfortunate". If you don't want that, you can easily reprogram it to check if the query sent to Yandex, without the hyphen, provides the same result as Yandex's suggestion for it. If so, you can mantain the former query instead of the suggestion.

TODO

Make a new algorithm for wrapper.py. Currently the program wrap the text into N words phrases and query them to Yandex. However, this is problematic as it doesn't really get the context of the phrases. The idea of the new algorithm is to wrap the text according to the dots contained in the text.
After the above is done, make a setup.py and publish the program on PyPi

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
lib		lib
temp		temp
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
config.py		config.py
correction.py		correction.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

THE PROJECT

DEPENDECIES

SET UP

USAGE

IMPORTANT

OUTPUT

NOTES

TODO

About

Releases

Packages

Languages

License

PedroBarcha/context-spelling-correction

Folders and files

Latest commit

History

Repository files navigation

THE PROJECT

DEPENDECIES

SET UP

USAGE

IMPORTANT

OUTPUT

NOTES

TODO

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages