This project has moved to FlashText.
Synonym Extractor is a Python library that is loosely based on the Aho-Corasick algorithm.
The idea is to extract words that we care about from a given sentence in one pass.
Basically, say I have a vocabulary of 10K words and I want to get all the words from that set present in a sentence. A simple regex match would have to loop over all 10K words for every sentence, which takes a lot of time.
Hence we use a simpler yet much faster algorithm that finds all the matches in a single pass over the sentence.
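For intuition, here is a rough sketch of the naive baseline being compared against (the vocabulary and sentence are made up for illustration): one regex scan per keyword, so the work grows with the vocabulary size rather than the sentence length.

```python
import re

# Naive baseline: one regex search per vocabulary word (illustrative).
# With a 10K-word vocabulary this runs 10K scans per sentence,
# which is the per-keyword loop a single-pass extractor avoids.
vocabulary = ['new york', 'san francisco']  # imagine 10K entries here
sentence = 'I love san francisco and new york.'

found = [word for word in vocabulary
         if re.search(r'\b' + re.escape(word) + r'\b', sentence)]
# found == ['new york', 'san francisco']
```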
```
pip install synonym-extractor
```
```python
# import module
from synonym.extractor import SynonymExtractor

# Create an object of SynonymExtractor
synonym_extractor = SynonymExtractor()

# add synonyms
synonym_names = ['NY', 'new-york', 'SF']
clean_names = ['new york', 'new york', 'san francisco']

for synonym_name, clean_name in zip(synonym_names, clean_names):
    synonym_extractor.add_to_synonym(synonym_name, clean_name)

synonyms_found = synonym_extractor.get_synonyms_from_sentence('I love SF and NY. new-york is the best.')

synonyms_found
>> ['san francisco', 'new york', 'new york']
```
synonym-extractor is based on the Aho-Corasick algorithm.
Documentation can be found at Read the Docs.
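To make the single-pass idea concrete, here is a minimal sketch of an Aho-Corasick-style matcher built on a character trie. This is a hypothetical illustration (`TrieMatcher` is not the library's actual implementation): synonyms go into the trie, and the sentence is walked once, emitting the clean name whenever a word boundary closes the longest match.

```python
# Hypothetical sketch of single-pass matching with a character trie.
# Not the library's internals; just the underlying idea.
class TrieMatcher:
    def __init__(self):
        self.root = {}

    def add(self, synonym, clean_name):
        node = self.root
        for ch in synonym:
            node = node.setdefault(ch, {})
        node['_end'] = clean_name  # sentinel marks a complete synonym

    def extract(self, sentence):
        found = []
        i, n = 0, len(sentence)
        while i < n:
            # only start matching at a word boundary
            if i > 0 and sentence[i - 1].isalnum():
                i += 1
                continue
            node, j, match = self.root, i, None
            while j < n and sentence[j] in node:
                node = node[sentence[j]]
                j += 1
                # accept only if the match also ends at a word boundary
                if '_end' in node and (j == n or not sentence[j].isalnum()):
                    match = (node['_end'], j)
            if match:
                found.append(match[0])
                i = match[1]
            else:
                i += 1
        return found

matcher = TrieMatcher()
matcher.add('NY', 'new york')
matcher.add('SF', 'san francisco')
print(matcher.extract('I love SF and NY.'))  # ['san francisco', 'new york']
```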
Say you have a corpus where similar words appear frequently.
- eg: Last weekend I was in NY.
- I am traveling to new york next weekend.
If you train a word2vec model on this corpus, or do any other sort of NLP, it will treat NY and new york as two different words.
Instead, if you create a synonym dictionary like:
- eg: NY=>new york
- new york=>new york
Then you can extract NY and new york as the same text.
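Using the API shown above, that dictionary is just two `add_to_synonym` calls, and both corpus sentences should then normalize to the same token (expected outputs inferred from the earlier example):

```python
from synonym.extractor import SynonymExtractor

extractor = SynonymExtractor()
extractor.add_to_synonym('NY', 'new york')
extractor.add_to_synonym('new york', 'new york')

# Both corpus sentences now yield the same canonical form.
extractor.get_synonyms_from_sentence('Last weekend I was in NY.')
# >> ['new york']
extractor.get_synonyms_from_sentence('I am traveling to new york next weekend.')
# >> ['new york']
```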
Doing the same with regex takes a lot of time:
| Docs count | # Synonyms | Regex | synonym-extractor |
|---|---|---|---|
| 1.5 million | 2K | 16 hours | NA |
| 2.5 million | 10K | 15 days | 15 mins |
The idea for this library came from the following Stack Overflow question.
The project is licensed under the MIT license.