Stop words
Welcome to the stop-words wiki page. This page explains how the Greek stop-words list is produced.
In computing, stop words are words which are filtered out before or after processing of natural language data (text).[1] Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.
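To make the idea of filtering concrete, here is a minimal sketch of stop-word removal on a tokenized Greek sentence. The helper name `remove_stop_words` and the five-word `SAMPLE_STOP_WORDS` set are illustrative assumptions, not part of this project or its final list.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in stop_words (case-insensitive match)."""
    return [t for t in tokens if t.lower() not in stop_words]


# Hypothetical, tiny stop-word set used only for this illustration.
SAMPLE_STOP_WORDS = {"και", "το", "η", "ο", "να"}

tokens = "Η γάτα και ο σκύλος".split()
filtered = remove_stop_words(tokens, SAMPLE_STOP_WORDS)
print(filtered)
```

A tool that supports phrase search would skip this step, since dropping "και" or "ο" changes the phrase being matched.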
First, get a dump of the Greek Wikipedia:
wget https://dumps.wikimedia.org/elwiki/latest/elwiki-latest-pages-articles.xml.bz2
Second, use the following code to count word frequencies in the Wikipedia dump and save the 300 most frequent words to a file, following this format.
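from collections import defaultdict

from gensim.corpora import WikiCorpus

# Map each token to its number of occurrences across the whole dump.
words = defaultdict(int)

# Passing dictionary={} skips building a gensim Dictionary, which is not needed here.
# Note: the lemmatize argument is accepted only by gensim versions before 4.0;
# drop it when running on a newer release.
wiki = WikiCorpus("elwiki-latest-pages-articles.xml.bz2", lemmatize=False, dictionary={})

# get_texts() yields one list of tokens per article.
for tokens in wiki.get_texts():
    for token in tokens:
        words[token] += 1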
The full script can be found here.
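The snippet above stops once the counts are collected. As a rough sketch of the remaining step, the following selects the most frequent words from a `words`-style mapping and writes them to a file. The helper names `top_words` and `save_words`, and the one-word-per-line output layout, are assumptions made for illustration; the full script may do this differently.

```python
from collections import Counter


def top_words(counts, n=300):
    """Return the n most frequent words, most frequent first."""
    return [word for word, _ in Counter(counts).most_common(n)]


def save_words(word_list, path):
    """Write one word per line, UTF-8 encoded (assumed output layout)."""
    with open(path, "w", encoding="utf-8") as handle:
        for word in word_list:
            handle.write(word + "\n")


# Toy counts standing in for the `words` mapping built from the dump.
toy_counts = {"και": 50, "το": 40, "γάτα": 2}
top_two = top_words(toy_counts, n=2)
print(top_two)
```

With the real counts, `save_words(top_words(words), "some_output_file.txt")` would write the 300 candidates; the file name here is a placeholder.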
Note: A file with frequencies of Greek words can be found here. The first column contains the occurrences of the word, the second column the number of documents in which the word occurred and the third column the word itself.
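For readers who want to load that frequency file, the sketch below parses one line in the three-column layout described above. It assumes the columns are separated by whitespace, which the description does not state; `parse_frequency_line` is a hypothetical helper name and the sample lines are made up.

```python
def parse_frequency_line(line):
    """Split 'occurrences doc_count word' into (word, occurrences, doc_count).

    Assumes whitespace-separated columns; adjust the split if the real
    file uses a different delimiter.
    """
    occurrences, doc_count, word = line.split(maxsplit=2)
    return word.strip(), int(occurrences), int(doc_count)


# Made-up sample lines, not taken from the real file.
sample_lines = ["120 35 και", "80 30 το"]
parsed = [parse_frequency_line(line) for line in sample_lines]
print(parsed)
```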
The list extracted from Wikipedia is not sufficient on its own, because it lacks many personal forms, which can be good stop-word additions for some applications.
For that reason, we found it useful to add some words from the Open Subtitles word-frequency list. The list can be found here.
The most frequent words from the Wikipedia dump list and the Open Subtitles list were concatenated, and the result was checked manually to ensure the quality of the stop-words list.
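The concatenation step could look roughly like the sketch below. It joins the two lists in order and, as an added assumption not stated above, lowercases entries and drops duplicates so that each candidate appears once before the manual review. The name `merge_word_lists` and the sample words are illustrative only.

```python
def merge_word_lists(*word_lists):
    """Concatenate word lists in order, keeping the first occurrence of each word."""
    seen = set()
    merged = []
    for word_list in word_lists:
        for word in word_list:
            word = word.strip().lower()
            if word and word not in seen:
                seen.add(word)
                merged.append(word)
    return merged


# Made-up samples standing in for the two top-frequency lists.
wiki_top = ["και", "το", "της"]
subtitles_top = ["το", "εγώ", "εσύ"]
candidates = merge_word_lists(wiki_top, subtitles_top)
print(candidates)
```

The merged output is only a candidate list; the manual check described above still decides which words stay.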
The final stop-words list can be found here.