This project has moved to FlashText.
Synonym Extractor is a Python library that is loosely based on the Aho-Corasick algorithm.
The idea is to extract words that we care about from a given sentence in one pass.
Basically, say I have a vocabulary of 10K words and I want to get all the words from that set present in a sentence. A simple regex match would have to loop over all 10K words for every sentence, which takes a lot of time.
Hence we use a simpler yet much faster algorithm that finds all the matches in a single pass over the sentence.
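For intuition, here is a rough sketch of the naive baseline being compared against (the vocabulary and sentence are made up for illustration): one regex scan per keyword, so the work grows with the vocabulary size rather than the sentence length.

```python
import re

# Naive baseline: one regex search per vocabulary word (illustrative).
# With a 10K-word vocabulary this runs 10K scans per sentence,
# which is the per-keyword loop a single-pass extractor avoids.
vocabulary = ['new york', 'san francisco']  # imagine 10K entries here
sentence = 'I love san francisco and new york.'

found = [word for word in vocabulary
         if re.search(r'\b' + re.escape(word) + r'\b', sentence)]
# found == ['new york', 'san francisco']
```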
```
pip install synonym-extractor
```
```python
# import module
from synonym.extractor import SynonymExtractor

# Create an object of SynonymExtractor
synonym_extractor = SynonymExtractor()

# add synonyms
synonym_names = ['NY', 'new-york', 'SF']
clean_names = ['new york', 'new york', 'san francisco']

for synonym_name, clean_name in zip(synonym_names, clean_names):
    synonym_extractor.add_to_synonym(synonym_name, clean_name)

synonyms_found = synonym_extractor.get_synonyms_from_sentence('I love SF and NY. new-york is the best.')

synonyms_found
>> ['san francisco', 'new york', 'new york']
```
synonym-extractor is based on the Aho-Corasick algorithm.
Documentation can be found at Read the Docs.
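To make the single-pass idea concrete, here is a minimal sketch of an Aho-Corasick-style matcher built on a character trie. This is a hypothetical illustration (`TrieMatcher` is not the library's actual implementation): synonyms go into the trie, and the sentence is walked once, emitting the clean name whenever a word boundary closes the longest match.

```python
# Hypothetical sketch of single-pass matching with a character trie.
# Not the library's internals; just the underlying idea.
class TrieMatcher:
    def __init__(self):
        self.root = {}

    def add(self, synonym, clean_name):
        node = self.root
        for ch in synonym:
            node = node.setdefault(ch, {})
        node['_end'] = clean_name  # sentinel marks a complete synonym

    def extract(self, sentence):
        found = []
        i, n = 0, len(sentence)
        while i < n:
            # only start matching at a word boundary
            if i > 0 and sentence[i - 1].isalnum():
                i += 1
                continue
            node, j, match = self.root, i, None
            while j < n and sentence[j] in node:
                node = node[sentence[j]]
                j += 1
                # accept only if the match also ends at a word boundary
                if '_end' in node and (j == n or not sentence[j].isalnum()):
                    match = (node['_end'], j)
            if match:
                found.append(match[0])
                i = match[1]
            else:
                i += 1
        return found

matcher = TrieMatcher()
matcher.add('NY', 'new york')
matcher.add('SF', 'san francisco')
print(matcher.extract('I love SF and NY.'))  # ['san francisco', 'new york']
```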
Say you have a corpus where similar words appear frequently.
- eg: Last weekend I was in NY.
- I am traveling to new york next weekend.
If you train a word2vec model on this corpus, or do any other sort of NLP, it will treat NY and new york as two different words.
Instead, if you create a synonym dictionary like:
- eg: NY=>new york
- new york=>new york
Then you can extract NY and new york as the same text.
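Using the API shown above, that dictionary is just two `add_to_synonym` calls, and both corpus sentences should then normalize to the same token (expected outputs inferred from the earlier example):

```python
from synonym.extractor import SynonymExtractor

extractor = SynonymExtractor()
extractor.add_to_synonym('NY', 'new york')
extractor.add_to_synonym('new york', 'new york')

# Both corpus sentences now yield the same canonical form.
extractor.get_synonyms_from_sentence('Last weekend I was in NY.')
# >> ['new york']
extractor.get_synonyms_from_sentence('I am traveling to new york next weekend.')
# >> ['new york']
```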
Doing the same with regex takes a lot of time:
| Docs count | # Synonyms | Regex | synonym-extractor |
|---|---|---|---|
| 1.5 million | 2K | 16 hours | NA |
| 2.5 million | 10K | 15 days | 15 mins |
The idea for this library came from the following Stack Overflow question.
The project is licensed under the MIT license.