Add your useful code snippets and recipes here. You can also post a short question -- please only ask questions that can be fully answered in a sentence or two. No open-ended questions or discussions here.

###Q1: How many times does a feature with id 123 appear in a corpus? A: Sum its value over all documents:

total_sum = sum(dict(doc).get(123, 0) for doc in corpus)
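
A minimal, self-contained sketch of the same idea, using a made-up toy corpus in gensim's bag-of-words format (each document is a list of (feature_id, value) 2-tuples); the feature id 123 and the values are just for illustration:

toy_corpus = [
    [(0, 1.0), (123, 2.0)],   # feature 123 appears twice in document 1
    [(1, 3.0)],               # feature 123 is absent from document 2
    [(123, 1.0), (5, 4.0)],   # feature 123 appears once in document 3
]

# dict(doc) maps feature ids to values; .get(123, 0) returns 0 when the feature is absent
total_sum = sum(dict(doc).get(123, 0) for doc in toy_corpus)
print(total_sum)  # 3.0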


###Q2: How do you calculate the vector length of a term? A: (note that "vector length" only makes sense for non-zero vectors):

  1. If the input vector vec is in gensim sparse format (a list of 2-tuples): length = math.sqrt(sum(val**2 for _, val in vec)), or use length = gensim.matutils.veclen(vec).
  2. If the input vector is a numpy array: length = gensim.matutils.blas_nrm2(vec)
  3. If the input vector is in a scipy.sparse format: length = numpy.sqrt(numpy.sum(vec.tocsr().data**2))

Also note that if you want the length just to normalize a vector to unit length, you might as well call gensim.matutils.unitvec(vec), which accepts any of these three formats as input.
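
A minimal sketch of the three options and of unitvec, using a small made-up vector:

import math
import numpy
import scipy.sparse
import gensim

vec = [(0, 3.0), (2, 4.0)]  # made-up vector in gensim sparse format (list of 2-tuples)

# 1. gensim sparse format
length = math.sqrt(sum(val ** 2 for _, val in vec))  # 5.0
length = gensim.matutils.veclen(vec)                 # same result

# 2. numpy array
dense = numpy.array([3.0, 0.0, 4.0])
length = gensim.matutils.blas_nrm2(dense)

# 3. scipy.sparse vector
sparse = scipy.sparse.csr_matrix(dense)
length = numpy.sqrt(numpy.sum(sparse.tocsr().data ** 2))

# to normalize to unit length directly, any of the three formats can be passed in
unit = gensim.matutils.unitvec(vec)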


###Q3: How do you calculate the matrix V in LSI space? A: Given a model lsi = LsiModel(X, ...), with the truncated singular value decomposition of your corpus X being X = U*S*V^T, applying lsi[X] computes U^-1*X, which equals V*S (basic linear algebra). So if you want V as a 2d numpy array, divide lsi[X] by S:

V = gensim.matutils.corpus2dense(lsi[X], len(lsi.projection.s)).T / lsi.projection.s

###Q4: How do you output the U, S, V^T matrices of LSI? A: After creating the LSI model lsi = models.LsiModel(corpus, ...), the U and S matrices are in lsi.projection.u and lsi.projection.s. The V (or V^T) matrix is not stored explicitly, because it may not fit in memory (its shape is num_docs * num_topics). If you need V, you can compute it with an extra pass over corpus, using gensim's streaming lsi[corpus] API (see Q3 above).
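
A short sketch putting Q3 and Q4 together; here corpus and id2word are placeholders for your own bag-of-words corpus and dictionary, and num_topics=200 is an arbitrary choice:

from gensim import models, matutils

lsi = models.LsiModel(corpus, id2word=id2word, num_topics=200)

U = lsi.projection.u   # num_terms x num_topics numpy array (left singular vectors)
S = lsi.projection.s   # vector of num_topics singular values

# V is not stored; recompute it with one extra streamed pass over the corpus (see Q3)
V = matutils.corpus2dense(lsi[corpus], len(S)).T / S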

###Q5: I am getting out of memory errors with LSI. How much memory do I need? A: The final model is stored as a matrix of num_terms x num_topics numbers. With 8 bytes per number (double precision), that's 8 * num_terms * num_topics, i.e. for 100k terms in dictionary and 500 topics, the model will be 8*100,000*500 = 400MB.

That's just the output -- during the actual computation of this model, temporary copies are needed, so in practice, you'll need about 3x that amount. For the 100k dictionary and 500 topics example, you'll actually need ~1.2GB to create the LSI model.

When out of memory, you'll have to either reduce the dictionary size or the number of topics. The memory footprint is not affected by the number of training documents, though.
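
The estimate above as a back-of-the-envelope calculation (the 8 bytes per number and the 3x training overhead are taken straight from this answer):

num_terms = 100000    # dictionary size
num_topics = 500
bytes_per_number = 8  # double precision

model_size = bytes_per_number * num_terms * num_topics  # 400,000,000 bytes = 400 MB
peak_while_training = 3 * model_size                    # ~1.2 GB

print(model_size / 1e6, "MB final model")
print(peak_while_training / 1e9, "GB needed during training")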

###Q6: I have many text files under a directory, each file is a single document. How do I create a corpus from that? A: See http://radimrehurek.com/gensim/tut1.html#corpus-streaming-one-document-at-a-time . If you're having trouble going through the files, have a look at the following snippet (it accepts all .txt files, even in nested subdirectories):

import os
import gensim

def iter_documents(top_directory):
    """Iterate over all documents, yielding a document (=list of utf8 tokens) at a time."""
    for root, dirs, files in os.walk(top_directory):
        for fname in filter(lambda fname: fname.endswith('.txt'), files):  # process only .txt files
            document = open(os.path.join(root, fname)).read()  # read the entire document, as one big string
            yield list(gensim.utils.tokenize(document, lower=True))  # or whatever tokenization suits you

class MyCorpus(object):
    """Stream bag-of-words vectors built from all .txt files under top_dir."""
    def __init__(self, top_dir):
        self.top_dir = top_dir
        self.dictionary = gensim.corpora.Dictionary(iter_documents(top_dir))
        self.dictionary.filter_extremes(no_below=1, keep_n=30000)  # check API docs for pruning params

    def __iter__(self):
        for tokens in iter_documents(self.top_dir):
            yield self.dictionary.doc2bow(tokens)

corpus = MyCorpus('/tmp/test')  # builds the dictionary on instantiation
for vector in corpus:  # convert each document to a bag-of-words vector
    print(vector)
    ...

###Q7: I have many text files under a directory, each file is a single document. How do I create a word2vec model from that?

A: (by Christian Ledermann)

This code makes the simplifying assumption that sentence-ending punctuation should be excluded from the text and that ., !, ? and : always end a sentence.

import os
import re
import gensim

class DirOfPlainTextCorpus(object):
    """Iterate over sentences of all plaintext files in a directory."""
    SPLIT_SENTENCES = re.compile(r"[.!?:]\s+")  # split sentences on these characters

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fn in os.listdir(self.dirname):
            text = open(os.path.join(self.dirname, fn)).read()
            for sentence in self.SPLIT_SENTENCES.split(text):
                yield gensim.utils.simple_preprocess(sentence, deacc=True)

model = gensim.models.Word2Vec(DirOfPlainTextCorpus('/path/to/dir'), size=200, min_count=5, workers=2)

###Q8: How can I filter a saved corpus and its corresponding dictionary?

A: (by Yaser Martinez)

The function dictionary.filter_extremes changes the original IDs, so we need to re-read and (optionally) rewrite the old corpus using a transformation:

import copy
from gensim import corpora
from gensim.models import VocabTransform

# filter the dictionary
old_dict = corpora.Dictionary.load('old.dict')
new_dict = copy.deepcopy(old_dict)
new_dict.filter_extremes(keep_n=100000)
new_dict.save('filtered.dict')

# now transform the corpus
corpus = corpora.MmCorpus('corpus.mm')
old2new = {old_dict.token2id[token]: new_id for new_id, token in new_dict.iteritems()}
vt = VocabTransform(old2new)
corpora.MmCorpus.serialize('filtered_corpus.mm', vt[corpus], id2word=new_dict)