# Recipes & FAQ
Add your useful code snippets and recipes here. You can also post a short question -- please only ask questions that can be fully answered in a sentence or two. No open-ended questions or discussions here.
### Q1: How many times does a feature with id 123 appear in a corpus?

A: `total_sum = sum(dict(doc).get(123, 0) for doc in corpus)`
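A quick toy check of the one-liner (the corpus below is made-up data, just for illustration):

```python
corpus = [[(5, 1), (123, 2)], [(123, 3)], [(7, 4)]]  # three bag-of-words documents
total_sum = sum(dict(doc).get(123, 0) for doc in corpus)
print(total_sum)  # 5: feature 123 appears 2 + 3 times
```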
### Q2: How do you calculate the vector length of a term?

A: (Note that "vector length" only makes sense for non-zero vectors.)

- If the input vector `vec` is in gensim sparse format (a list of 2-tuples): `length = math.sqrt(sum(val**2 for _, val in vec))`, or use `length = gensim.matutils.veclen(vec)`.
- If the input vector is a numpy array: `length = gensim.matutils.blas_nrm2(vec)`.
- If the input vector is in a `scipy.sparse` format: `length = numpy.sqrt(numpy.sum(vec.tocsr().data**2))`.

Also note that if you want the length just to normalize a vector to unit length, you might as well call `gensim.matutils.unitvec(vec)`, which accepts any of these three formats as input.
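For example, a minimal sanity check of all three formats on the same vector (assuming gensim, numpy and scipy are installed):

```python
import math
import numpy
import scipy.sparse
import gensim.matutils

vec = [(0, 3.0), (2, 4.0)]                       # gensim sparse format
print(math.sqrt(sum(val**2 for _, val in vec)))  # 5.0
print(gensim.matutils.veclen(vec))               # 5.0

dense = numpy.array([3.0, 0.0, 4.0])             # numpy array
print(gensim.matutils.blas_nrm2(dense))          # 5.0

sparse = scipy.sparse.csr_matrix(dense)          # scipy.sparse format
print(numpy.sqrt(numpy.sum(sparse.tocsr().data**2)))  # 5.0

print(gensim.matutils.unitvec(vec))              # normalized: [(0, 0.6), (2, 0.8)]
```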
### Q3: How do you calculate the matrix V in LSI space?

A: Given a model `lsi = LsiModel(X, ...)`, with the truncated singular value decomposition of your corpus `X` being `X = U * S * V^T`, doing `lsi[X]` computes `U^-1 * X`, which equals `V * S` (basic linear algebra). So if you want `V`, divide `lsi[X]` by `S`:

`V = gensim.matutils.corpus2dense(lsi[X], len(lsi.projection.s)).T / lsi.projection.s`

to get `V` as a 2d numpy array.
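A minimal end-to-end sketch (the toy corpus is illustrative only):

```python
import gensim
from gensim.models import LsiModel

# three bag-of-words documents over a 3-term vocabulary (made-up numbers)
X = [[(0, 1.0), (1, 2.0)], [(1, 1.0), (2, 3.0)], [(0, 2.0), (2, 1.0)]]
lsi = LsiModel(X, num_topics=2)

# corpus2dense returns a num_topics x num_docs matrix; transpose, then divide by S
V = gensim.matutils.corpus2dense(lsi[X], len(lsi.projection.s)).T / lsi.projection.s
print(V.shape)  # (3, 2) = num_docs x num_topics
```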
### Q4: How do you output the U, S, V^T matrices of LSI?

A: After creating the LSI model `lsi = models.LsiModel(corpus, ...)`, the U and S matrices are in `lsi.projection.u` and `lsi.projection.s`. The V (or V^T) matrix is not stored explicitly, because it may not fit in memory (its shape is `num_docs * num_topics`). If you need V, you can compute it with an extra pass over `corpus`, using gensim's streaming `lsi[corpus]` API (see Q3 above and the sketch below).
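A sketch of that extra pass, streaming one row of V at a time so the full matrix never has to fit in memory (assumes the `lsi` model and `corpus` from above):

```python
for doc in lsi[corpus]:
    # doc is one document in LSI space: a list of (topic_id, weight) tuples,
    # i.e. one row of V*S; dividing by the singular values yields a row of V
    v_row = [(topic_id, weight / lsi.projection.s[topic_id]) for topic_id, weight in doc]
```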
### Q5: I am getting out-of-memory errors with LSI. How much memory do I need?

A: The final model is stored as a matrix of `num_terms x num_topics` numbers. With 8 bytes per number (double precision), that's `8 * num_terms * num_topics` bytes, i.e. for 100k terms in the dictionary and 500 topics, the model will be `8 * 100,000 * 500 = 400MB`.
That's just the output -- during the actual computation of this model, temporary copies are needed, so in practice, you'll need about 3x that amount. For the 100k dictionary and 500 topics example, you'll actually need ~1.2GB to create the LSI model.
When out of memory, you'll have to either reduce the dictionary size or the number of topics. The memory footprint is not affected by the number of training documents, though.
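The same back-of-the-envelope arithmetic as a helper function (not a gensim API, just the formula above):

```python
def lsi_memory_gb(num_terms, num_topics, overhead=3.0):
    """Estimate peak memory (in GB) to train an LSI model: 8 bytes per
    double, times ~3x for temporary copies during the computation."""
    return 8.0 * num_terms * num_topics * overhead / 1e9

print(lsi_memory_gb(100000, 500))  # ~1.2 GB
```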
### Q6: I have many text files under a directory, each file is a single document. How do I create a corpus from that?

A: See http://radimrehurek.com/gensim/tut1.html#corpus-streaming-one-document-at-a-time . If you're having trouble going through the files, have a look at the following snippet (it accepts all `.txt` files, even in nested subdirectories):
```python
import os
import gensim

def iter_documents(top_directory):
    """Iterate over all documents, yielding a document (=list of utf8 tokens) at a time."""
    for root, dirs, files in os.walk(top_directory):
        for fname in filter(lambda fname: fname.endswith('.txt'), files):
            with open(os.path.join(root, fname)) as fin:
                document = fin.read()  # read the entire document, as one big string
            yield gensim.utils.tokenize(document, lower=True)  # or whatever tokenization suits you

class MyCorpus(object):
    def __init__(self, top_dir):
        self.top_dir = top_dir
        self.dictionary = gensim.corpora.Dictionary(iter_documents(top_dir))
        self.dictionary.filter_extremes(no_below=1, keep_n=30000)  # check API docs for pruning params

    def __iter__(self):
        for tokens in iter_documents(self.top_dir):
            yield self.dictionary.doc2bow(tokens)

corpus = MyCorpus('/tmp/test')  # builds the dictionary on instantiation
for vector in corpus:  # convert each document to a bag-of-words vector
    print(vector)
```
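If you plan to iterate over the corpus more than once, it may be worth serializing it to disk so later passes skip the tokenization (the output path below is illustrative):

```python
gensim.corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)  # store in MatrixMarket format
```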
### Q7: I have many text files under a directory, each file is a single document. How do I create a word2vec model from that?

A: (by Christian Ledermann)

This code makes the simplifying assumption that sentence-ending punctuation should be excluded from the text and that `.` and `:` always end a sentence.
```python
import os
import re
import gensim

class DirOfPlainTextCorpus(object):
    """Iterate over sentences of all plaintext files in a directory."""
    SPLIT_SENTENCES = re.compile(r"[.!?:]\s+")  # split sentences on these characters

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fn in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fn)) as fin:
                text = fin.read()
            for sentence in self.SPLIT_SENTENCES.split(text):
                yield gensim.utils.simple_preprocess(sentence, deacc=True)

model = gensim.models.Word2Vec(DirOfPlainTextCorpus('/path/to/dir'), size=200, min_count=5, workers=2)
```
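Once trained, a quick sanity check (the query word is hypothetical and must appear at least `min_count` times in your files; in newer gensim versions the lookup lives under `model.wv`):

```python
print(model.most_similar('king', topn=5))  # five nearest words by cosine similarity
```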
### Q8: How can I filter a saved corpus and its corresponding dictionary?

A: (by Yaser Martinez)

The function `dictionary.filter_extremes` changes the original IDs, so we need to reread and (optionally) rewrite the old corpus using a transformation:
```python
import copy
from gensim import corpora
from gensim.models import VocabTransform

# filter the dictionary
old_dict = corpora.Dictionary.load('old.dict')
new_dict = copy.deepcopy(old_dict)
new_dict.filter_extremes(keep_n=100000)
new_dict.save('filtered.dict')

# now transform the corpus
corpus = corpora.MmCorpus('corpus.mm')
old2new = {old_dict.token2id[token]: new_id for new_id, token in new_dict.iteritems()}
vt = VocabTransform(old2new)
corpora.MmCorpus.serialize('filtered_corpus.mm', vt[corpus], id2word=new_dict)
```
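A quick way to verify the result, assuming the files above were written successfully:

```python
filtered = corpora.MmCorpus('filtered_corpus.mm')
print(len(new_dict), filtered.num_terms)  # the two vocabulary sizes should match
```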